A local Model Context Protocol (MCP) server for comparing Markdown, plain text, DOCX, PDF, PPTX, XLSX, and EPUB documents — with structured JSON hunks, unified diffs, Markdown reports, comprehensive text statistics, replacement preview/apply, and a converter dispatcher that runs Microsoft MarkItDown (default) or IBM Docling / pandoc under the hood.
A DiffChecker-style document comparison toolkit for Claude Desktop, Claude Code, Cursor, and any other MCP-aware AI assistant.
There are MCP servers that convert documents to Markdown. There are MCP servers that do raw text diff. But nothing wires the two together with structured JSON output, per-side word/character/sentence statistics, replacement extraction (the "may → shall × 3" view), and a preview-first find/replace primitive. This is that.
It runs locally, has no remote API dependency by default, and follows the same DiffChecker-style conventions agents and humans already expect.
| Tool | What it does |
|---|---|
| `compare_text` | Diff two strings → JSON hunks, unified diff, Markdown report, per-side stats, replacements roll-up. |
| `compare_markdown` | Same, with Markdown-aware normalization (frontmatter, smart quotes, etc.). |
| `compare_files` | Diff two local files (`.md`/`.txt`/`.docx`/`.pdf`/`.pptx`/`.xlsx`/`.epub`). Non-text formats are converted to Markdown first via the chosen converter. |
| `normalize_to_markdown` | Convert a single file to Markdown (optionally written to disk). |
| `text_stats` | Counts for a single string: characters, words, sentences, paragraphs, lines, reading & speaking time. |
| `count_file` | Same as `text_stats`, but against a file path (converts DOCX/PDF first). |
| `find_replace_text` | Preview-first find-and-replace on a string. Defaults to `dry_run=true`. Supports regex and a list of operations. Returns line + column of every match. |
| `replace_in_file` | Preview-first find-and-replace on a file. For DOCX/PDF, content is converted to Markdown first; result must be written to `output_path` (cannot write back to DOCX/PDF). For direct-text files, `overwrite_in_place=True` is allowed. |
| `summarize_diff` | Diff plus a short natural-language summary of what changed; useful for agent-driven document review. |
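
The preview-first behaviour of `find_replace_text` can be pictured with a small standard-library sketch. This is not the server's implementation: the function name `preview_find_replace`, the `Match` record, and the return shape are hypothetical. Only the `dry_run` default and the line + column reporting come from the table above.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Match:
    line: int    # 1-based line number of the match
    column: int  # 1-based column within that line
    text: str    # the text that matched


def preview_find_replace(
    text: str, find: str, replace: str, *, regex: bool = False, dry_run: bool = True
) -> Tuple[List[Match], Optional[str]]:
    """Locate every match; only build the rewritten text when dry_run is False."""
    pattern = re.compile(find if regex else re.escape(find))
    matches = []
    for m in pattern.finditer(text):
        line = text.count("\n", 0, m.start()) + 1
        line_start = text.rfind("\n", 0, m.start()) + 1
        matches.append(Match(line, m.start() - line_start + 1, m.group(0)))
    if dry_run:
        return matches, None
    # Literal mode: use a function so backslashes in `replace` are not read as escapes.
    repl = replace if regex else (lambda _m: replace)
    return matches, pattern.sub(repl, text)


matches, new_text = preview_find_replace("The party may act.\nIt may not.", "may", "shall")
print([(m.line, m.column) for m in matches])  # [(1, 11), (2, 4)]
print(new_text)                               # None, because dry_run defaults to True
```

Returning the match list even on a dry run lets an agent show the user exactly what would change before asking for a second call with `dry_run=False`.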
| Converter | Best for | Notes |
|---|---|---|
| `markitdown` (default) | DOCX, PPTX, HTML, in-process speed | Microsoft's MarkItDown: fast, great on Office formats, weak on complex PDFs (heading hierarchy & tables suffer per independent benchmarks). |
| `docling` (opt-in) | PDFs, scientific papers, multi-column layouts, tables | IBM's Docling: much higher fidelity on PDF (`pip install docling`). |
| `pandoc` (opt-in) | Round-trip-safe DOCX ↔ Markdown | Requires pandoc on `PATH` (`brew install pandoc`) plus `pip install pypandoc`. |

Pick per call with `converter="docling"` etc., or set the default with `MARKITDOWN_DIFF_DEFAULT_CONVERTER`.
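
A minimal sketch of that precedence, assuming a per-call argument wins over the environment variable and that `markitdown` is the final fallback. `resolve_converter` and `KNOWN_CONVERTERS` are hypothetical names used for illustration, not the server's API, and the validation step is an assumption too.

```python
import os
from typing import Mapping, Optional

# The three converters listed in the table above.
KNOWN_CONVERTERS = ("markitdown", "docling", "pandoc")


def resolve_converter(
    converter: Optional[str] = None, env: Optional[Mapping[str, str]] = None
) -> str:
    """Per-call choice first, then the environment default, then markitdown."""
    env = os.environ if env is None else env
    choice = converter or env.get("MARKITDOWN_DIFF_DEFAULT_CONVERTER") or "markitdown"
    choice = choice.strip().lower()
    if choice not in KNOWN_CONVERTERS:
        raise ValueError(f"unknown converter: {choice!r}")
    return choice


print(resolve_converter(env={}))  # markitdown
print(resolve_converter("docling", env={"MARKITDOWN_DIFF_DEFAULT_CONVERTER": "pandoc"}))  # docling
```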
Diff tools return:
```json
{
  "summary": {
    "added_lines": 2, "removed_lines": 2, "modified_hunks": 1,
    "unchanged_lines": 1, "similarity": 0.333,
    "before_words": 9, "after_words": 11, "words_delta": 2,
    "before_characters": 38, "after_characters": 47, "characters_delta": 9,
    "before_sentences": 2, "after_sentences": 2,
    "before_paragraphs": 1, "after_paragraphs": 1,
    "replacements": 1
  },
  "inputs": {
    "before": { "source": "...", "format": "md", "converted": false, "stats": { ... } },
    "after": { "source": "...", "format": "docx", "converted": true, "converter": "markitdown", "stats": { ... } }
  },
  "replacements": [
    { "removed": "cat", "added": "dog", "count": 1 }
  ],
  "diffs": [
    {
      "type": "modified",
      "before_start": 1, "before_end": 1, "after_start": 1, "after_end": 1,
      "before": "The cat sat on the mat.",
      "after": "The dog sat on the mat.",
      "tokens": [
        { "type": "unchanged", "text": "The " },
        { "type": "removed", "text": "cat" },
        { "type": "added", "text": "dog" },
        { "type": "unchanged", "text": " sat on the mat." }
      ]
    }
  ],
  "unified_diff": "@@ ... @@\n-The cat ...\n+The dog ...",
  "markdown_report": "## Diff Summary\n..."
}
```

The schema is jsondiffpatch-inspired so downstream tools can interop without translation.
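
Because the result is plain JSON, a client can post-process it in a few lines of Python. The sketch below is illustrative only: `render_inline` and `replacement_lines` are hypothetical helpers, the sample dictionary is trimmed to the fields they read, and the `[-old-]` / `{+new+}` markers are borrowed from word-diff notation rather than anything the server emits.

```python
from typing import Any, Dict, List


def render_inline(hunk: Dict[str, Any]) -> str:
    """Flatten a hunk's tokens into one line with word-diff style markers."""
    parts = []
    for token in hunk["tokens"]:
        if token["type"] == "removed":
            parts.append("[-" + token["text"] + "-]")
        elif token["type"] == "added":
            parts.append("{+" + token["text"] + "+}")
        else:  # "unchanged"
            parts.append(token["text"])
    return "".join(parts)


def replacement_lines(result: Dict[str, Any]) -> List[str]:
    """Format the replacements roll-up as 'old → new × count' strings."""
    return [f'{r["removed"]} → {r["added"]} × {r["count"]}' for r in result["replacements"]]


# Trimmed copy of the example result above.
result = {
    "replacements": [{"removed": "cat", "added": "dog", "count": 1}],
    "diffs": [{
        "type": "modified",
        "tokens": [
            {"type": "unchanged", "text": "The "},
            {"type": "removed", "text": "cat"},
            {"type": "added", "text": "dog"},
            {"type": "unchanged", "text": " sat on the mat."},
        ],
    }],
}

print(render_inline(result["diffs"][0]))  # The [-cat-]{+dog+} sat on the mat.
print(replacement_lines(result))          # ['cat → dog × 1']
```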
```bash
# Base install (MarkItDown converter only)
pip install markitdown-diff-mcp
# With Docling for high-fidelity PDF
pip install 'markitdown-diff-mcp[docling]'
# With pandoc support
pip install 'markitdown-diff-mcp[pandoc]'
# Everything
pip install 'markitdown-diff-mcp[all]'
```

```bash
markitdown-diff-mcp # stdio MCP server (for Claude Desktop, etc.)
```

Or run the single-file server directly:

```bash
uv run --script server.py # uses the inline script header
```

- Local files only; absolute paths required.
- `MARKITDOWN_DIFF_ALLOWED_ROOTS` is opt-in, not on by default. Leave it unset and the server reads any path your user account can read (which is the same access surface as your shell, your editor, and every other local MCP). Set it (colon-separated dirs) only if you want to bound the blast radius, which is useful when running the server for someone else, sharing it across agents, or sandboxing per project.
- File size cap default 25 MB (`MARKITDOWN_DIFF_MAX_FILE_BYTES`).
- Output character cap default 5 MB (`MARKITDOWN_DIFF_MAX_OUTPUT_CHARS`).
- Find/replace defaults to `dry_run=true`. Writes require explicit opt-in (`output_path` or `overwrite_in_place=True`).
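
The path rules above can be sketched with `pathlib` alone. This is a hedged illustration, not the server's code: `parse_allowed_roots` and `check_path` are hypothetical names, POSIX-style paths are assumed, and "25 MB" is taken to mean 25 × 1024 × 1024 bytes, which the source does not specify.

```python
from pathlib import Path
from typing import List, Optional

# Assumption: the 25 MB default is interpreted as 25 MiB.
DEFAULT_MAX_FILE_BYTES = 25 * 1024 * 1024


def parse_allowed_roots(value: Optional[str]) -> List[Path]:
    """Split a colon-separated list of directories; unset or empty means unrestricted."""
    if not value:
        return []
    return [Path(p).resolve() for p in value.split(":") if p]


def check_path(path: str, allowed_roots: List[Path],
               max_bytes: int = DEFAULT_MAX_FILE_BYTES) -> Path:
    """Reject relative paths, paths outside the allowed roots, and oversized files."""
    p = Path(path)
    if not p.is_absolute():
        raise ValueError("absolute path required")
    resolved = p.resolve()  # follows symlinks so a link cannot escape a root
    if allowed_roots and not any(
        resolved == root or root in resolved.parents for root in allowed_roots
    ):
        raise PermissionError(f"{resolved} is outside the allowed roots")
    if resolved.is_file() and resolved.stat().st_size > max_bytes:
        raise ValueError(f"{resolved} exceeds the {max_bytes}-byte cap")
    return resolved


roots = parse_allowed_roots("/srv/docs:/tmp/work")
print(check_path("/srv/docs/contract.md", roots).name)  # contract.md
```

Resolving the path before comparing it against the roots is the design choice that matters here: checking the raw string would let `..` segments or symlinks step outside the sandbox.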

- v0.2: folder/directory compare, moved-block detection, HTML side-by-side report, Markdown AST-aware diffs (`markdown-it-py` + GFM tables).
- v0.3: optional integration with `docx-mcp`/`safe-docx` for true Word-track-changes export.
- Possible later: spreadsheet cell-by-cell diff via openpyxl, image visual diff (out of scope for v1; complex).

- Microsoft MarkItDown — conversion engine
- IBM Docling — high-fidelity PDF/layout parser
- DiffChecker — UX conventions for stats, replacements, and preview-first behavior
- Model Context Protocol — Anthropic + community spec
MIT — see LICENSE.

```json
{
  "mcpServers": {
    "markitdown-diff": {
      "command": "uvx",
      "args": ["--from", "git+https://github.com/justinritchie/markitdown-diff-mcp", "markitdown-diff-mcp"],
      "env": {
        "MARKITDOWN_DIFF_DEFAULT_CONVERTER": "markitdown"
      }
    }
  }
}
```

`MARKITDOWN_DIFF_ALLOWED_ROOTS` is opt-in: leave it unset for full user-level read access, or add it to `env` as `"/dir1:/dir2"` to sandbox reads.