difflib C accelerator #44
Open
blhsing wants to merge 3 commits into
Move the pure-Python implementation of difflib to Lib/_pydifflib.py and turn Lib/difflib.py into a thin shim that re-exports its public API. This mirrors the layout used by decimal/_pydecimal, datetime/_pydatetime, and pickle/_pickle, where the public module dispatches to a faster C implementation when available and the pure-Python module is preserved as a self-contained reference for alternative Python implementations. No public behaviour change. ``Match`` is constructed with ``module='difflib'`` so its qualified name matches the public module. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Introduce Modules/_difflibmodule.c, a heap-type C extension that implements __init__, set_seqs/set_seq1/set_seq2, find_longest_match, get_matching_blocks, get_opcodes, and ratio for SequenceMatcher. The inner DP loop and the full Ratcliff-Obershelp recursion run on int32 label arrays with zero Python C-API calls in the hot path; codepoint-keyed lookup tables short-circuit per-element dict probes for str and bytes inputs. Output is bit-identical to the pure-Python implementation including tie-breaks. Lib/difflib.py grows a small subclass that inherits the slow-path methods (quick_ratio, real_quick_ratio, get_grouped_opcodes) from the pure-Python class; this is a no-op when the accelerator is not built. Build wiring: configure.ac registers the module via PY_STDLIB_MOD_SIMPLE and Modules/Setup.stdlib.in references _difflibmodule.c. configure must be regenerated with autoreconf before this lands. Typical workloads run 5-25x faster than pure Python; the bytes path up to ~70x. See Lib/test/test_difflib.py for cross-implementation tests. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When the _difflib C accelerator is built, programmatically generate a parallel ``*_PurePython`` TestCase for each existing test class so the same suite covers both implementations. Pure-Python coverage is obtained by patching ``difflib.SequenceMatcher`` to ``_pydifflib.SequenceMatcher`` in setUp / restoring it in tearDown; internal helpers like ``unified_diff`` and ``ndiff`` resolve ``SequenceMatcher`` on ``difflib`` at call time, so patching the module attribute covers the whole pipeline. This mirrors the dual-implementation test pattern used by test_decimal (C* / Py* class pairs) without requiring every existing test method to be parameterised. ``test_html_diff`` also gets a single-line fix: it depended on ``HtmlDiff._default_prefix`` starting at 0, which only held because it ran first. Resetting the counter at the top of the test makes it order-independent. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
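A minimal sketch of the class-generation approach this commit describes, under stated assumptions. `ReferenceMatcher` is a hypothetical stand-in for `_pydifflib.SequenceMatcher` (which does not exist in stock CPython), and `make_pure_python_variant` is an illustrative name, not the helper used in the PR's `test_difflib.py`:

```python
import difflib
import unittest


class ReferenceMatcher(difflib.SequenceMatcher):
    """Hypothetical stand-in for _pydifflib.SequenceMatcher."""
    instances = 0  # counts constructions so the patch can be observed

    def __init__(self, *args, **kwargs):
        type(self).instances += 1
        super().__init__(*args, **kwargs)


def make_pure_python_variant(case_cls, reference_cls=ReferenceMatcher):
    """Return a subclass of case_cls whose tests run with the matcher patched."""

    def setUp(self):
        # Patch the module attribute; helpers such as unified_diff look up
        # SequenceMatcher on the difflib module when they are called.
        self._saved_matcher = difflib.SequenceMatcher
        difflib.SequenceMatcher = reference_cls
        super(variant, self).setUp()

    def tearDown(self):
        try:
            super(variant, self).tearDown()
        finally:
            difflib.SequenceMatcher = self._saved_matcher

    variant = type(case_cls.__name__ + "_PurePython", (case_cls,),
                   {"setUp": setUp, "tearDown": tearDown})
    return variant


class UnifiedDiffTests(unittest.TestCase):
    def test_changed_line_is_reported(self):
        diff = list(difflib.unified_diff(["a\n", "b\n"], ["a\n", "c\n"]))
        self.assertIn("-b\n", diff)
        self.assertIn("+c\n", diff)


# The existing test methods are reused unchanged by the generated class.
UnifiedDiffTests_PurePython = make_pure_python_variant(UnifiedDiffTests)
```

Because the variant is an ordinary subclass, no test method needs to be parameterised; the trade-off is that the patch is process-global for the duration of each test.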
A C accelerator for `difflib.SequenceMatcher`

This is a summary of an exploratory project to add a C accelerator for `difflib.SequenceMatcher`, modeled after the `decimal` / `_pydecimal` / `_decimal` pattern. Output is bit-identical to the pure-Python implementation, including tie-breaks; the win comes from constant-factor optimizations (the algorithm is still Ratcliff-Obershelp).

Context
The thread at https://discuss.python.org/t/algorithmic-complexity-of-difflib/105844 covers the asymptotic angle — a suffix-automaton implementation for pathological inputs. This work is the orthogonal half: same algorithm, faster constant factor. The two should compose — pathological inputs flip to SAM, typical workloads stay on the C-accelerated Ratcliff-Obershelp.
Preserving the pure-Python implementation
The pure-Python `SequenceMatcher` is preserved verbatim as a reference/fallback. The layout mirrors `decimal` / `datetime` / `pickle`:

- `Lib/_pydifflib.py`: the existing pure-Python implementation (no behavior change), with `__name__ = "difflib"` so reprs and the `Match` namedtuple's qualified name look right.
- `Lib/difflib.py`: a thin shim:
  - `from _pydifflib import *` brings in everything (`Differ`, `HtmlDiff`, `unified_diff`, `ndiff`, `get_close_matches`, …)
  - […] `_calculate_ratio`, `_mdiff`, `_format_range_unified`, `_format_range_context`) that existing tests reference
  - `try: from _difflib import SequenceMatcher as _CSequenceMatcher` + subclasses it to inherit `quick_ratio`, `real_quick_ratio`, `get_grouped_opcodes` from the pure-Python class. When `_difflib` is unavailable, the shim is a no-op and `SequenceMatcher` stays the pure-Python class.

Alternative Python implementations can either use `_pydifflib` directly or ship their own equivalent shim.

Optimization phases
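Looking back at the shim layout from "Preserving the pure-Python implementation", one plausible arrangement is sketched here. It is an illustration under assumptions, not the PR's actual `Lib/difflib.py`: stock `difflib` stands in for `_pydifflib`, and borrowing the three slow-path methods as class attributes is a guess at how the inheritance could be wired.

```python
import difflib as _reference  # stand-in: on the PR branch this would be _pydifflib

try:
    # Present only on a build that includes the accelerator.
    from _difflib import SequenceMatcher as _CSequenceMatcher
except ImportError:
    _CSequenceMatcher = None

if _CSequenceMatcher is None:
    # No accelerator: the public name is simply the pure-Python class.
    SequenceMatcher = _reference.SequenceMatcher
else:
    class SequenceMatcher(_CSequenceMatcher):
        # Assumption: the C type exposes the attributes these pure-Python
        # methods read, so they can be reused without modification.
        quick_ratio = _reference.SequenceMatcher.quick_ratio
        real_quick_ratio = _reference.SequenceMatcher.real_quick_ratio
        get_grouped_opcodes = _reference.SequenceMatcher.get_grouped_opcodes
```

On an interpreter without `_difflib` the `ImportError` branch is taken, so callers see the same class they would have seen before the split.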
Each phase preserves the algorithm; each row shows what changed and why.
| Phase | What changed | Why |
|---|---|---|
| 1 | `find_longest_match`; `j2len` replaced with paired int arrays + generation counter (no per-row clear) | […] `j` dict probe in the inner DP loop. |
| 2 | `chain_b`; type-specialized iteration of `b` for `str`/`list`/`tuple`/`bytes` | `chain_b` became >50% of remaining cost after phase 1; ~25k `setdefault` + `append` in Python. |
| 3 | `int32` label arrays (`a_lbl`, `a_dp`, `b_lbl`, `junk_mask`) | […] `ratio()` each cross the Python/C boundary; extension passes did `PyObject_RichCompareBool` per element. |
| 4 | `cp_full[]` / `cp_dp[]` arrays for `str` `b`; `str` `a` fast-path reads codepoint and indexes the array | […] `str` inputs we paid `PyUnicode_FromOrdinal` + dict-probe per element on both `a` and `b`. |
| 5 | […] `alloca`/`memset`), skip max-cp scan for UCS1 strings | […] `PyLong_FromLong` per element under autojunk; pure-Python's per-int allocator costs dominated for the bytes workload. |

Phase 1 is reproduced by monkey-patching `_pydifflib.SequenceMatcher.find_longest_match` with an early C port that still talks to the Python `b2j` dict; phase 2 also swaps in a C `__chain_b`; phase 3 disables the codepoint-keyed lookup tables on top of the current code; phase 4 disables only the phase-5-specific changes (bytes fast-path and the UCS1 max-codepoint shortcut). All five measurements run on the same release-mode in-tree CPython 3.16-dev build, with pyperf (20 processes × 3 values per workload, default warmup, system tuned via `pyperf system tune`).

Wall time (pyperf mean ± stddev) for `SequenceMatcher(...).ratio()` + `.get_opcodes()`:

Speedup over pure-Python (means only, derived from the row above):
Reading the table: each phase adds an optimization that monotonically helps the workloads it targets. Phase 3 (label-array DP + recursion-in-C) is the single largest jump on int/line workloads; phase 4 (codepoint-keyed tables for `str`) drives the doubling on `str` workloads; phase 5 contributes a measurable but smaller bytes win, and on the other workloads it's at the noise floor (within ±0.5 σ of phase 4). The pyperf standard deviations are <3% in every cell, so the small phase-4-vs-phase-5 swings on `str_2k`, `str_8k`, and `lines_difflib` are well inside measurement noise, not a regression.

Benchmark hardware and software
All numbers in this PR description were collected on the same machine and software configuration:
- […] `sudo pyperf system tune` (Turbo Boost disabled, `performance` CPU governor, ASLR off, IRQ affinity pinned).
- […] `main` at commit `5f4fbc10f68`, built `--enable-shared` release mode (no `--with-pydebug`), `-O3`. Python 3.12 numbers use the system Python 3.12.3 in a dedicated venv on the same host.

Benchmark workloads
Six workloads, all comparing two sequences that are mostly identical with a small percentage of perturbations — the realistic case difflib is designed for:
- […] `b2j` keys are 1-char strings.
- […] `_pydifflib.py` as a list of ~2,100 lines vs the same list with every 50th line replaced by a comment. Line-level diff over Python source.
- […] `b2j` keys are Python ints.
- […] `bytes` of 8,000 random bytes vs the same with about 5% mutated. Tests the bytes fast-path; pure-Python pays a `PyLong_FromLong` per element when building `b2j`.

The benchmark measures end-to-end `SequenceMatcher(None, a, b)` followed by `s.ratio()` and `s.get_opcodes()`, with `min()` over 15 repetitions to dampen noise (5 for `bytes_8k` because pure Python is slow). All six workloads have `autojunk` enabled (the default).

Ours vs the original pure-Python on CPython 3.16 dev
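For orientation, the workload shape and timing loop described under "Benchmark workloads" can be approximated as follows. This is a hedged stand-in, not the author's bench script: `make_bytes_workload`, `run_once`, `best_of`, the seed, and the mutation scheme are all assumptions chosen to match the prose ("8,000 random bytes vs the same with about 5% mutated", `min()` over repetitions).

```python
import random
import time
from difflib import SequenceMatcher


def make_bytes_workload(n=8000, mutation_rate=0.05, seed=0):
    """Return (a, b): random bytes and a copy with roughly mutation_rate changed."""
    rng = random.Random(seed)
    a = bytes(rng.randrange(256) for _ in range(n))
    b = bytearray(a)
    for i in range(n):
        if rng.random() < mutation_rate:
            b[i] = rng.randrange(256)  # may occasionally redraw the same value
    return a, bytes(b)


def run_once(a, b):
    """The measured unit of work: construct, then ratio() and get_opcodes()."""
    s = SequenceMatcher(None, a, b)  # autojunk stays at its default (True)
    return s.ratio(), s.get_opcodes()


def best_of(a, b, repeats=15):
    """Minimum wall time over `repeats` runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run_once(a, b)
        best = min(best, time.perf_counter() - start)
    return best
```

Taking the minimum rather than the mean is the usual way to suppress one-sided scheduling noise in a simple loop like this; the pyperf numbers quoted in this description report mean ± stddev instead.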
Measured on the in-tree 3.16 dev build (release-mode, no `--with-pydebug`) where the patch lives, using pyperf (20 processes × 3 values per workload). Pure Python is `_pydifflib.SequenceMatcher` (the same code as the existing `difflib.SequenceMatcher`); the accelerated column is the C-backed `difflib.SequenceMatcher` after the shim picks up `_difflib`:

Bench script (pyperf, ours vs pure-Python on the in-tree 3.16 dev build)

The script below is the same one used for the 3.12 4-way comparison; it picks the implementation via the `DIFFLIB_IMPL` env var, so the same script measures any combination of `py` (stdlib `difflib`), `pyours` (the pure-Python `_pydifflib`), `ours` (the in-tree `_difflib` C accelerator), `cdifflib`, or `cydifflib`. For the 3.16-dev table above I ran it twice:

Apples-to-apples comparison vs PyPI alternatives (CPython 3.12)
Two existing C accelerators on PyPI: `cdifflib` 1.2.9 (hand-written C) and `cydifflib` 1.2.0 (Cython-generated). The comparison is run on CPython 3.12 (not the in-tree 3.16-dev branch where the patch lives) because `cydifflib`'s wheel build is CMake-based and currently fails to configure against 3.16-dev, so it cannot be tested on that runtime. To remove "different Python version" as a confound, I rebuilt the new accelerator against the system Python 3.12 headers and installed all four implementations (pure-Python, `cdifflib`, `cydifflib`, ours) into a single Python 3.12 venv so every column below shares the same interpreter, GC, and pure-Python baseline. All measurements use pyperf (20 processes × 3 values per workload, system tuned via `pyperf system tune`).

Wall time (pyperf mean ± stddev), all on Python 3.12:
Speedup over pure-Python (means only), all measured on Python 3.12:
Bench script (pyperf, 4-way comparison on CPython 3.12)
Same `bench_pyperf.py` as above; in the 3.12 venv where all four implementations are importable, set `DIFFLIB_IMPL` to one of `py`, `cdifflib`, `cydifflib`, `ours` and run once per impl:

Observations:

- `cydifflib` regresses on char-level diff: `str_8k_similar` runs at 0.12× of pure Python (≈8× slower). Reproducible across pyperf processes (stddev = 0 ms), not a measurement artifact. Looks like per-element Cython wrapping overhead is bad when elements are single codepoints.
- `cdifflib` is consistent at ~1.6–2.2×: a modest constant-factor win that targets `find_longest_match` and `get_matching_blocks` only, leaving `chain_b` in Python.

What lets ours go further
`cdifflib` and `cydifflib` both keep the existing data shape: `b2j` is a `dict[elt, list[int]]`, and the inner loop does `PyDict_GetItem` on Python-object keys per outer `i`. The biggest single change in this work is restructuring the state itself:

- Each element of `b` is assigned a small integer label at `chain_b` time.
- `int32_t` arrays (`a_lbl`, `b_lbl`, `a_dp`, `junk_mask`) replace the per-position object lookups.
- The `get_matching_blocks` recursion (queue + sort + collapse) runs in C, with the DP loop and extension passes operating purely on those int32 arrays, with no Python C-API calls in the hot path.
- For `str` and `bytes`, codepoint-keyed lookup tables (`cp_full[256]` for UCS1 / `cp_full[max+1]` for wider strings; `cp_full[256]` for bytes) replace the dict-probe entirely. This is what makes the bytes case 13× rather than ~3×.

Neither `cdifflib` nor `cydifflib` does the integer-label transformation; that explains most of the remaining gap.

Integration details
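To make the integer-label restructuring from "What lets ours go further" concrete, here is a pure-Python sketch of the same idea. It is an illustration under assumptions, not a port of the PR's C code: `label_sequences` and `longest_match` are hypothetical names, junk handling is omitted, and the two stamped row buffers are one reading of "paired int arrays + generation counter".

```python
from array import array
from difflib import SequenceMatcher


def label_sequences(a, b):
    """Give each distinct element of b a small int label; a gets -1 if absent from b."""
    labels = {}
    b_lbl = array("i", (labels.setdefault(x, len(labels)) for x in b))
    a_lbl = array("i", (labels.get(x, -1) for x in a))
    # positions[label] holds the ascending indices j where b[j] carries that label.
    positions = [[] for _ in labels]
    for j, lbl in enumerate(b_lbl):
        positions[lbl].append(j)
    return a_lbl, b_lbl, positions


def longest_match(a_lbl, positions, n_b):
    """Longest common block over the whole of a and b, working on labels only."""
    # Slot j + 1 stores the run length ending at b[j]. Each slot carries the
    # generation (row number) that wrote it, so rows never need clearing.
    prev_len = array("i", [0]) * (n_b + 1)
    prev_gen = array("i", [-1]) * (n_b + 1)
    cur_len = array("i", [0]) * (n_b + 1)
    cur_gen = array("i", [-1]) * (n_b + 1)
    best_i = best_j = best_size = 0
    for gen, lbl in enumerate(a_lbl, start=1):
        i = gen - 1
        if lbl >= 0:
            for j in positions[lbl]:
                # A slot is valid only if the previous row wrote it.
                k = (prev_len[j] if prev_gen[j] == gen - 1 else 0) + 1
                cur_len[j + 1] = k
                cur_gen[j + 1] = gen
                if k > best_size:  # strict '>' keeps the earliest match on ties
                    best_i, best_j, best_size = i - k + 1, j - k + 1, k
        prev_len, cur_len = cur_len, prev_len
        prev_gen, cur_gen = cur_gen, prev_gen
    return best_i, best_j, best_size


a, b = " abcd", "abcd abcd"
a_lbl, b_lbl, positions = label_sequences(a, b)
print(longest_match(a_lbl, positions, len(b)))
print(tuple(SequenceMatcher(None, a, b, autojunk=False).find_longest_match(0, len(a), 0, len(b))))
```

Once labels are assigned, the inner loop touches only integers and flat arrays, which is the property that lets a C version avoid per-element object comparisons and dict probes.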
- `Modules/_difflibmodule.c` (heap type, per-interpreter GIL supported, `PyType_GetModuleByDef` for subclass-safe state lookup), wired via `PY_STDLIB_MOD_SIMPLE([_difflib])` in `configure.ac` and a one-line entry in `Modules/Setup.stdlib.in`.
- […] `__init__`, `set_seqs`, `set_seq1`, `set_seq2`, `find_longest_match`, `get_matching_blocks`, `get_opcodes`, `ratio`) use `/*[clinic input]` blocks; argument parsing is generated into `Modules/clinic/_difflibmodule.c.h`.
- […] `impl-detail` block added to `Doc/library/difflib.rst` mentions the C accelerator, the `_pydifflib` reference, bit-identical output, and the typical 5–25× speedup range.
- `Lib/test/test_difflib.py` programmatically generates a `*_PurePython` variant of each existing TestCase that monkey-patches `difflib.SequenceMatcher` to `_pydifflib.SequenceMatcher` for the duration of the test (similar in spirit to `test_decimal`'s `C*`/`Py*` class pairs, but non-invasive: no test method changes). With the accelerator built, `test_difflib` runs 100 tests (61 C-path + 39 Py-path); without it, the 60 original tests run as before.

Comparison vs GNU `diff(1)` on line workloads

GNU `diff` is a different program (and a different algorithm: Myers diff, O((n+m)D) where D is the edit distance, not Ratcliff-Obershelp); the comparison only makes sense on line-level inputs because GNU `diff` doesn't natively handle character, integer, or byte sequences. The GNU `diff` column includes its full real-world cost: process spawn + reading both files from disk + writing the unified diff to /dev/null. The Python columns measure the in-process work `difflib.unified_diff(a_lines, b_lines, lineterm='')` does (the same work `Tools/build/stable_abi.py`, doctest's failure output, and `unittest.assertMultiLineEqual` do under the hood). All numbers are pyperf mean ± stddev:

Reading the table:
- […] `diff` end-to-end. GNU `diff`'s subprocess + file-IO floor (~1–2 ms) is the same order of magnitude as the actual diff work at this size, so the in-process Python+C path comes out ahead.
- […] `diff` pulls ahead, ~1.8× faster. That's Myers diff's asymptotic edge over Ratcliff-Obershelp showing up at scale. Pure-Python `difflib` is 20× slower than GNU diff here; ours closes most of that gap but not all of it.

Why the crossover, briefly: GNU `diff` uses Myers diff. For "mostly identical" inputs, edit distance D is small, so its O((n+m)D) behaves like O(n+m), linear in n. Ratcliff-Obershelp is super-linear in the expected case. Holding the perturbation rate constant, scaling n by 15× (2k → 32k lines) grows ours' time by ~22× and GNU diff's algorithm time by ~9×. Combined with GNU diff's fixed ~1 ms subprocess overhead, the crossover lands around n ≈ 10–15k lines.

This is well above the sizes CPython callers actually hit (doctest failures, `assertMultiLineEqual`, the 2,833-line stable-ABI manifest), so for real workloads we're in the "GNU diff loses to us because of IO floor" zone. The discussion thread linked at the top covers the asymptotic angle (suffix-automaton work for `autojunk=False`); that's the orthogonal path that would close the remaining gap on >10k-line inputs.

Bench script (pyperf, GNU diff vs `difflib.unified_diff`)
The script selects the implementation via `DIFFLIB_IMPL` (`gnudiff`, `pyours`, or `ours`) so all three columns above share the same input generation and pyperf invocation pattern.

Where the wall-clock savings actually show up
The most compelling wins are in workloads that run difflib in a loop or over a large input batch — typical of third-party tools that do similarity scoring, repo-wide diffing, or clustering. The numbers below are wall-clock time for a single run on this machine (one process; pyperf isn't necessary since the variance is small relative to the deltas):
- […] `unified_diff` (2,003 .py files, full `Lib/` recursive)
- […] `ratio()` calls)

Real third-party tools and use cases that map onto each row:

- […] `simian`, Python ports of `jscpd`), plagiarism checkers, duplicate-test detectors in large pytest suites, fuzzy record linkage (`recordlinkage`, simpler `dedupe`-style flows), document near-duplicate detection in scrapers.
- […] `unified_diff`: pre-commit hooks emitting per-file diffs, bulk-refactor preview tools (`libcst` codemods, custom AST rewriters), snapshot test runners (`pytest-snapshot`, `syrupy`) comparing 1000+ snapshots, golden-master CI suites, `2to3`-family migration tools.
- […] `drain3`, journald analysis), customer-support ticket triage, error monitoring grouping similar stack traces, FAQ Q&A matching, near-duplicate-payload caching at API edges.

The 9.7× on short-text workloads is the largest because short strings hit the codepoint-keyed UCS1 fast path (no `PyDict_GetItem` per character) and benefit fully from the integer-label DP.

Bench script (wall-clock real-world workloads)
Run twice: once with `DIFFLIB_IMPL=pyours` (forces the pure-Python `SequenceMatcher`) and once with `DIFFLIB_IMPL=ours` (the C accelerator). The pure-Python path is enforced by restoring `_pydifflib.SequenceMatcher` to the original Python class after the shim has rebound it.

Where the accelerator does not help much
Worth flagging honestly:
- `get_close_matches` over very large vocabularies (e.g., 500k PyPI package names for a "did you mean?" suggestion): ~1.0×. The loop is dominated by `quick_ratio` / `real_quick_ratio` filters that this work doesn't accelerate; they still run in pure Python. Tools like `pip`'s typo suggestions, IPython tab completion, and similar `get_close_matches`-driven UX paths see essentially no benefit. Porting the two filter methods to C would be a small follow-up patch.
- `black --diff` and similar formatters: ~1.0×. The unified_diff portion is dominated by AST parsing and re-printing, not by diff computation. Formatters that emit diffs as a side product won't speed up materially.
- `unittest` / `pytest` test-suite wall time on green runs: ~1.0×. `assertEqual` only calls into `difflib.ndiff` on failure. On the difflib-using subset of the CPython test suite (`test_difflib`, `test_doctest`, `test_genericalias`, etc.) the per-test speedup is 1.0–3× on `test_difflib` itself and within noise everywhere else, because test bodies do many things besides diffing. The wins are per-failure, not per-test-run.
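The first bullet can be made visible with a small instrumented re-implementation of the `get_close_matches` loop. This is a sketch that mirrors the filter cascade in the stdlib function; `close_matches_with_counts` is a hypothetical helper written for illustration, and the example vocabulary comes from the `difflib` documentation.

```python
from difflib import SequenceMatcher, get_close_matches
from heapq import nlargest


def close_matches_with_counts(word, possibilities, n=3, cutoff=0.6):
    """Like get_close_matches, but also report how many candidates reach each stage."""
    counts = {"real_quick_ratio": 0, "quick_ratio": 0, "ratio": 0}
    scored = []
    s = SequenceMatcher()
    s.set_seq2(word)
    for x in possibilities:
        s.set_seq1(x)
        counts["real_quick_ratio"] += 1      # cheapest upper bound, always evaluated
        if s.real_quick_ratio() < cutoff:
            continue
        counts["quick_ratio"] += 1           # tighter upper bound
        if s.quick_ratio() < cutoff:
            continue
        counts["ratio"] += 1                 # the only stage that does a full match
        r = s.ratio()
        if r >= cutoff:
            scored.append((r, x))
    return [x for _, x in nlargest(n, scored)], counts


words = ["ape", "apple", "peach", "puppy"]
matches, counts = close_matches_with_counts("appel", words)
print(matches, counts)
```

Every candidate pays for `real_quick_ratio()`, and only the survivors reach `ratio()`, so on a large vocabulary where most candidates are rejected early the accelerated method is rarely the bottleneck.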