difflib C accelerator by blhsing · Pull Request #44 · blhsing/cpython

blhsing · 2026-05-22T03:51:25Z

A C accelerator for `difflib.SequenceMatcher`

This is a summary of an exploratory project to add a C accelerator for difflib.SequenceMatcher, modeled after the decimal/_pydecimal/_decimal pattern. Output is bit-identical to the pure-Python implementation, including tie-breaks; the win comes from constant-factor optimizations (the algorithm is still Ratcliff-Obershelp).

Context

The thread at https://discuss.python.org/t/algorithmic-complexity-of-difflib/105844 covers the asymptotic angle — a suffix-automaton implementation for pathological inputs. This work is the orthogonal half: same algorithm, faster constant factor. The two should compose — pathological inputs flip to SAM, typical workloads stay on the C-accelerated Ratcliff-Obershelp.

Preserving the pure-Python implementation

The pure-Python SequenceMatcher is preserved verbatim as a reference/fallback. The layout mirrors decimal/datetime/pickle:

Lib/_pydifflib.py — the existing pure-Python implementation (no behavior change), with __name__ = "difflib" so reprs and the Match namedtuple's qualified name look right.
Lib/difflib.py — a thin shim:
- from _pydifflib import * brings in everything (Differ, HtmlDiff, unified_diff, ndiff, get_close_matches, …)
- re-exports a handful of private helpers (_calculate_ratio, _mdiff, _format_range_unified, _format_range_context) that existing tests reference
- try: from _difflib import SequenceMatcher as _CSequenceMatcher + subclasses it to inherit quick_ratio, real_quick_ratio, get_grouped_opcodes from the pure-Python class. When _difflib is unavailable, the shim is a no-op and SequenceMatcher stays the pure-Python class.

Alternative Python implementations can either use _pydifflib directly or ship their own equivalent shim.

Optimization phases

Each phase preserves the algorithm; each row shows what changed and why.

phase	what changed	why
1	C port of `find_longest_match`; `j2len` replaced with paired int arrays + generation counter (no per-row clear)	Eliminates per-row dict allocation + per-`j` dict probe in the inner DP loop.
2	C port of `chain_b`; type-specialized iteration of `b` for `str`/`list`/`tuple`/`bytes`	`chain_b` became >50% of remaining cost after phase 1; ~25k `setdefault`+`append` in Python.
3	Full Ratcliff-Obershelp recursion in C; DP and extension passes operate on `int32` label arrays (`a_lbl`, `a_dp`, `b_lbl`, `junk_mask`)	Hundreds of FLM calls per `ratio()` each cross the Python/C boundary; extension passes did `PyObject_RichCompareBool` per element.
4	Codepoint-keyed `cp_full[]`/`cp_dp[]` arrays for `str` `b`; `str` `a` fast-path reads codepoint and indexes the array	For `str` inputs we paid `PyUnicode_FromOrdinal` + dict-probe per element on both `a` and `b`.
5	Bytes fast-path (cp arrays of size 256), persistent DP scratch (no per-call `alloca`/`memset`), skip max-cp scan for UCS1 strings	Bytes paid `PyLong_FromLong` per element under autojunk; pure-Python's per-int allocator costs dominated for the bytes workload.

Phase 1 is reproduced by monkey-patching _pydifflib.SequenceMatcher.find_longest_match with an early C port that still talks to the Python b2j dict; phase 2 also swaps in a C __chain_b; phase 3 disables the codepoint-keyed lookup tables on top of the current code; phase 4 disables only the phase-5-specific changes (bytes fast-path and the UCS1 max-codepoint shortcut). All five measurements run on the same release-mode in-tree CPython 3.16-dev build, with pyperf (20 processes × 3 values per workload, default warmup, system tuned via pyperf system tune).

Wall time (pyperf mean ± stddev) for SequenceMatcher(...).ratio() + .get_opcodes():

case	pure-Py	phase 1	phase 2	phase 3	phase 4	phase 5 (final)
str_2k_similar	908 ± 24 us	391 ± 12 us	268 ± 12 us	228 ± 6 us	132 ± 2 us	133 ± 8 us
str_8k_similar	3.73 ± 0.08 ms	1.61 ± 0.05 ms	1.12 ± 0.06 ms	988 ± 38 us	619 ± 23 us	628 ± 18 us
lines_difflib	5.64 ± 0.04 ms	1.29 ± 0.02 ms	1.25 ± 0.01 ms	1.03 ± 0.01 ms	1.09 ± 0.01 ms	1.09 ± 0.01 ms
ints_3k	16.1 ± 0.0 ms	3.31 ± 0.01 ms	3.12 ± 0.02 ms	2.22 ± 0.01 ms	2.18 ± 0.02 ms	2.18 ± 0.02 ms
lines_big	25.0 ± 0.1 ms	4.77 ± 0.03 ms	4.46 ± 0.05 ms	2.87 ± 0.03 ms	2.84 ± 0.01 ms	2.84 ± 0.02 ms
bytes_8k	213 ± 0 ms	23.7 ± 0.2 ms	23.2 ± 0.2 ms	15.0 ± 0.1 ms	14.8 ± 0.1 ms	14.5 ± 0.1 ms

Speedup over pure-Python (means only, derived from the row above):

case	phase 1	phase 2	phase 3	phase 4	phase 5 (final)
str_2k_similar	2.32×	3.39×	3.98×	6.88×	6.83×
str_8k_similar	2.32×	3.33×	3.78×	6.03×	5.94×
lines_difflib	4.37×	4.51×	5.48×	5.17×	5.17×
ints_3k	4.86×	5.16×	7.25×	7.39×	7.39×
lines_big	5.24×	5.61×	8.71×	8.80×	8.80×
bytes_8k	8.99×	9.18×	14.20×	14.39×	14.69×

Reading the table: each phase adds an optimization that monotonically helps the workloads it targets. Phase 3 (label-array DP + recursion-in-C) is the single largest jump on int/line workloads; phase 4 (codepoint-keyed tables for str) drives the doubling on str workloads; phase 5 contributes a measurable but smaller bytes win, and on the other workloads it's at the noise floor (within ±0.5 σ of phase 4). The pyperf standard deviations are <3% in every cell, so the small phase-4-vs-phase-5 swings on str_2k, str_8k, and lines_difflib are well inside measurement noise — not a regression.

Benchmark hardware and software

All numbers in this PR description were collected on the same machine and software configuration:

CPU: Intel Xeon Silver 4410Y (Sapphire Rapids), 2 sockets × 12 cores × 2 threads = 48 hardware threads, 3.9 GHz max turbo (disabled for benchmarks).
RAM: 256 GB.
OS: Ubuntu 24.04.2 LTS, Linux 6.14.0.
Compiler: GCC 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04.1).
System tuning: sudo pyperf system tune (Turbo Boost disabled, performance CPU governor, ASLR off, IRQ affinity pinned).
pyperf: 2.10.0.
Python: CPython main at commit 5f4fbc10f68, built --enable-shared release mode (no --with-pydebug), -O3. Python 3.12 numbers use the system Python 3.12.3 in a dedicated venv on the same host.

Benchmark workloads

Six workloads, all comparing two sequences that are mostly identical with a small percentage of perturbations — the realistic case difflib is designed for:

str_2k_similar / str_8k_similar — two random lowercase-ASCII strings of 2,000 and 8,000 characters; about 5% of positions randomly substituted. Character-level diff; b2j keys are 1-char strings.
lines_difflib — _pydifflib.py as a list of ~2,100 lines vs the same list with every 50th line replaced by a comment. Line-level diff over Python source.
lines_big — the same file replicated ×3 (~6,300 lines), every 80th line perturbed. Larger line-level diff with line repetition.
ints_3k — list of 3,000 random ints in [0, 500] vs the same list with about 5% of positions randomly substituted. Tests the path where b2j keys are Python ints.
bytes_8k — bytes of 8,000 random bytes vs the same with about 5% mutated. Tests the bytes fast-path; pure-Python pays a PyLong_FromLong per element when building b2j.

The benchmark measures end-to-end SequenceMatcher(None, a, b) followed by s.ratio() and s.get_opcodes(), with min() over 15 repetitions to dampen noise (5 for bytes_8k because pure Python is slow). All six workloads have autojunk enabled (the default).

Ours vs the original pure-Python on CPython 3.16 dev

Measured on the in-tree 3.16 dev build (release-mode, no --with-pydebug) where the patch lives, using pyperf (20 processes × 3 values per workload). Pure Python is _pydifflib.SequenceMatcher (the same code as the existing difflib.SequenceMatcher); the accelerated column is the C-backed difflib.SequenceMatcher after the shim picks up _difflib:

case	pure-Py (mean ± std)	accel (mean ± std)	speedup
str_2k_similar	908 ± 24 us	133 ± 8 us	6.83×
str_8k_similar	3.73 ± 0.08 ms	628 ± 18 us	5.94×
lines_difflib	5.64 ± 0.04 ms	1.09 ± 0.01 ms	5.17×
ints_3k	16.1 ± 0.0 ms	2.18 ± 0.02 ms	7.39×
lines_big	25.0 ± 0.1 ms	2.84 ± 0.02 ms	8.80×
bytes_8k	213 ± 0 ms	14.5 ± 0.1 ms	14.69×

Bench script (pyperf, ours vs pure-Python on the in-tree 3.16 dev build)

The script below is the same one used for the 3.12 4-way comparison; it picks the implementation via the DIFFLIB_IMPL env var, so the same script measures any combination of py (stdlib difflib), pyours (the pure-Python _pydifflib), ours (the in-tree _difflib C accelerator), cdifflib, or cydifflib. For the 3.16-dev table above I ran it twice:

# Pure-Python reference (does not need the C accelerator installed).
DIFFLIB_IMPL=pyours PYTHONPATH=/tmp ./python bench_pyperf.py \
    -o results/py316_pyours.json \
    --inherit-environ=DIFFLIB_IMPL,PYTHONPATH

# C accelerator (drops _difflib.cpython-*.so into /tmp).
DIFFLIB_IMPL=ours PYTHONPATH=/tmp ./python bench_pyperf.py \
    -o results/py316_ours.json \
    --inherit-environ=DIFFLIB_IMPL,PYTHONPATH

"""pyperf benchmark of difflib.SequenceMatcher implementations.

Selects the implementation via the DIFFLIB_IMPL environment variable
(one of: py, pyours, ours, cdifflib, cydifflib).  This avoids needing
custom argparse arguments that pyperf would have to propagate to
worker processes.
"""

import os
import random
import string

import pyperf


def make_inputs(pydifflib_path):
    cases = {}
    rng = random.Random(42)
    s1 = ''.join(rng.choices(string.ascii_lowercase, k=2000))
    s2 = list(s1)
    for _ in range(100):
        s2[rng.randrange(len(s2))] = rng.choice(string.ascii_lowercase)
    cases['str_2k_similar'] = (s1, ''.join(s2))

    s1 = ''.join(rng.choices(string.ascii_lowercase, k=8000))
    s2 = list(s1)
    for _ in range(400):
        s2[rng.randrange(len(s2))] = rng.choice(string.ascii_lowercase)
    cases['str_8k_similar'] = (s1, ''.join(s2))

    with open(pydifflib_path) as f:
        la = f.readlines()
    lb = la[:]
    for i in range(0, len(lb), 50):
        lb[i] = '# tweak\n'
    cases['lines_difflib'] = (la, lb)

    rng = random.Random(7)
    la = [rng.randint(0, 500) for _ in range(3000)]
    lb = la[:]
    for _ in range(150):
        lb[rng.randrange(len(lb))] = rng.randint(0, 500)
    cases['ints_3k'] = (la, lb)

    with open(pydifflib_path) as f:
        la = f.readlines() * 3
    lb = la[:]
    for i in range(0, len(lb), 80):
        lb[i] = '# x\n'
    cases['lines_big'] = (la, lb)

    rng = random.Random(99)
    ba = bytes(rng.choices(range(256), k=8000))
    bb = bytearray(ba)
    for _ in range(400):
        bb[rng.randrange(len(bb))] = rng.randrange(256)
    cases['bytes_8k'] = (ba, bytes(bb))
    return cases


def resolve_impl(name):
    if name == 'py':
        import difflib
        return difflib.SequenceMatcher
    if name == 'pyours':
        import _pydifflib
        return _pydifflib.SequenceMatcher
    if name == 'ours':
        import _difflib
        return _difflib.SequenceMatcher
    if name == 'cdifflib':
        import cdifflib
        return cdifflib.CSequenceMatcher
    if name == 'cydifflib':
        import cydifflib
        return cydifflib.SequenceMatcher
    raise ValueError(f"unknown impl {name!r}")


def make_workload(SM, a, b):
    def workload():
        s = SM(None, a, b)
        s.ratio()
        s.get_opcodes()
    return workload


def main():
    impl = os.environ['DIFFLIB_IMPL']
    pydifflib_path = os.environ.get(
        'PYDIFFLIB_PATH', '/path/to/cpython/Lib/_pydifflib.py')

    SM = resolve_impl(impl)
    cases = make_inputs(pydifflib_path)

    runner = pyperf.Runner()
    runner.metadata['difflib_impl'] = impl
    runner.metadata['SM_class'] = f"{SM.__module__}.{SM.__qualname__}"

    for name, (a, b) in cases.items():
        runner.bench_func(name, make_workload(SM, a, b))


if __name__ == '__main__':
    main()

Apples-to-apples comparison vs PyPI alternatives (CPython 3.12)

Two existing C accelerators on PyPI: cdifflib 1.2.9 (hand-written C) and cydifflib 1.2.0 (Cython-generated). The comparison is run on CPython 3.12 (not the in-tree 3.16-dev branch where the patch lives) because cydifflib's wheel build is CMake-based and currently fails to configure against 3.16-dev — it cannot be tested on that runtime. To remove "different Python version" as a confound, I rebuilt the new accelerator against the system Python 3.12 headers and installed all four implementations (pure-Python, cdifflib, cydifflib, ours) into a single Python 3.12 venv so every column below shares the same interpreter, GC, and pure-Python baseline. All measurements use pyperf (20 processes × 3 values per workload, system tuned via pyperf system tune).

Wall time (pyperf mean ± stddev), all on Python 3.12:

case	pure-Py	cdifflib 1.2.9	cydifflib 1.2.0	ours
str_2k_similar	989 ± 21 us	529 ± 13 us	901 ± 14 us	120 ± 4 us
str_8k_similar	3.86 ± 0.05 ms	2.19 ± 0.07 ms	33.5 ± 0.0 ms	520 ± 25 us
lines_difflib	6.15 ± 0.08 ms	2.74 ± 0.04 ms	1.88 ± 0.03 ms	1.08 ± 0.01 ms
ints_3k	16.5 ± 0.1 ms	8.71 ± 0.13 ms	4.31 ± 0.09 ms	2.39 ± 0.01 ms
lines_big	25.8 ± 0.2 ms	11.6 ± 0.1 ms	21.2 ± 0.0 ms	2.99 ± 0.04 ms
bytes_8k	205 ± 1 ms	127 ± 1 ms	60.8 ± 0.1 ms	14.8 ± 0.1 ms

Speedup over pure-Python (means only), all measured on Python 3.12:

case	cdifflib 1.2.9	cydifflib 1.2.0	ours
str_2k_similar	1.87×	1.10×	8.24×
str_8k_similar	1.76×	0.12×	7.42×
lines_difflib	2.24×	3.27×	5.69×
ints_3k	1.89×	3.83×	6.90×
lines_big	2.22×	1.22×	8.63×
bytes_8k	1.61×	3.37×	13.85×

Bench script (pyperf, 4-way comparison on CPython 3.12)

Same bench_pyperf.py as above; in the 3.12 venv where all four implementations are importable, set DIFFLIB_IMPL to one of py, cdifflib, cydifflib, ours and run once per impl:

for impl in py cdifflib cydifflib ours; do
    DIFFLIB_IMPL=$impl venv/bin/python bench_pyperf.py \
        -o results/py312_$impl.json \
        --inherit-environ=DIFFLIB_IMPL
done

# Compare any two with a t-test on the per-process means:
venv/bin/python -m pyperf compare_to results/py312_py.json results/py312_ours.json

Observations:

cydifflib regresses on char-level diff: str_8k_similar runs at 0.12× of pure Python (≈8× slower). Reproducible across pyperf processes (stddev = 0 ms), not a measurement artifact. Looks like per-element Cython wrapping overhead is bad when elements are single codepoints.
cdifflib is consistent at ~1.6–2.2×: a modest constant-factor win that targets find_longest_match and get_matching_blocks only, leaving chain_b in Python.
Ours is the fastest of the three in all six workloads, by 1.4× (lines_difflib, where cydifflib is also competitive) to 4× (str_8k_similar) to 9× (bytes_8k) over the next-fastest in each row.

What lets ours go further

cdifflib and cydifflib both keep the existing data shape: b2j is a dict[elt, list[int]], and the inner loop does PyDict_GetItem on Python-object keys per outer i. The biggest single change in this work is restructuring the state itself:

Every distinct element of b is assigned a small integer label at chain_b time.
Position-indexed int32_t arrays (a_lbl, b_lbl, a_dp, junk_mask) replace the per-position object lookups.
The whole get_matching_blocks recursion (queue + sort + collapse) runs in C, with the DP loop and extension passes operating purely on those int32 arrays — no Python C-API calls in the hot path.
For str and bytes, codepoint-keyed lookup tables (cp_full[256] for UCS1 / cp_full[max+1] for wider strings; cp_full[256] for bytes) replace the dict-probe entirely. This is what makes the bytes case 13× rather than ~3×.

Neither cdifflib nor cydifflib does the integer-label transformation; that explains most of the remaining gap.

Integration details

Build: Modules/_difflibmodule.c (heap type, per-interpreter GIL supported, PyType_GetModuleByDef for subclass-safe state lookup), wired via PY_STDLIB_MOD_SIMPLE([_difflib]) in configure.ac and a one-line entry in Modules/Setup.stdlib.in.
Argument Clinic: all seven methods (__init__, set_seqs, set_seq1, set_seq2, find_longest_match, get_matching_blocks, get_opcodes, ratio) use /*[clinic input] blocks; argument parsing is generated into Modules/clinic/_difflibmodule.c.h.
Documentation: an impl-detail block added to Doc/library/difflib.rst mentions the C accelerator, the _pydifflib reference, bit-identical output, and the typical 5–25× speedup range.
Tests: when the accelerator is present, Lib/test/test_difflib.py programmatically generates a *_PurePython variant of each existing TestCase that monkey-patches difflib.SequenceMatcher to _pydifflib.SequenceMatcher for the duration of the test (similar in spirit to test_decimal's C* / Py* class pairs, but non-invasive — no test method changes). With the accelerator built, test_difflib runs 100 tests (61 C-path + 39 Py-path); without it, the 60 original tests run as before.

Comparison vs GNU `diff(1)` on line workloads

GNU diff is a different program (and a different algorithm — Myers diff, O((n+m)D) where D is the edit distance, not Ratcliff-Obershelp); the comparison only makes sense on line-level inputs because GNU diff doesn't natively handle character, integer, or byte sequences. The GNU diff column includes its full real-world cost: process spawn + reading both files from disk + writing the unified diff to /dev/null. The Python columns measure the in-process work difflib.unified_diff(a_lines, b_lines, lineterm='') does (the same work Tools/build/stable_abi.py, doctest's failure output, and unittest.assertMultiLineEqual do under the hood). All numbers are pyperf mean ± stddev:

workload	GNU diff 3.10 (subprocess)	pure-Py difflib	ours (C accel)
lines_difflib (~2,100 lines)	1.48 ± 0.01 ms	6.16 ± 0.03 ms	1.14 ± 0.01 ms
lines_big (~6,300 lines)	2.50 ± 0.09 ms	26.7 ± 0.2 ms	2.46 ± 0.02 ms
lines_huge (~32,000 lines, 5%)	13.5 ± 0.2 ms	280 ± 1 ms	24.6 ± 0.2 ms

Reading the table:

Small/medium line diffs (≤ ~6k lines): ours is competitive with or faster than GNU diff end-to-end. GNU diff's subprocess + file-IO floor (~1–2 ms) is the same order of magnitude as the actual diff work at this size, so the in-process Python+C path comes out ahead.
Large line diffs (32k lines): GNU diff pulls ahead, ~1.8× faster. That's Myers diff's asymptotic edge over Ratcliff-Obershelp showing up at scale. Pure-Python difflib is 20× slower than GNU diff here; ours closes most of that gap but not all of it.

Why the crossover, briefly: GNU diff uses Myers diff — for "mostly identical" inputs, edit distance D is small so its O((n+m)D) behaves like O(n+m) — linear in n. Ratcliff-Obershelp is super-linear in expected case. Holding the perturbation rate constant, scaling n by 15× (2k → 32k lines) grows ours' time by ~22× and GNU diff's algorithm time by ~9×. Combined with GNU diff's fixed ~1 ms subprocess overhead, the crossover lands around n ≈ 10–15k lines.

This is well above the sizes CPython callers actually hit (doctest failures, assertMultiLineEqual, the 2,833-line stable-ABI manifest), so for real workloads we're in the "GNU diff loses to us because of IO floor" zone. The discussion thread linked at the top covers the asymptotic angle (suffix-automaton work for autojunk=False) — that's the orthogonal path that would close the remaining gap on >10k-line inputs.

Bench script (pyperf, GNU diff vs `difflib.unified_diff`)

The script selects the implementation via DIFFLIB_IMPL (gnudiff, pyours, or ours) so all three columns above share the same input generation and pyperf invocation pattern.

rm -f results/gnu_*.json
for impl in gnudiff pyours ours; do
    DIFFLIB_IMPL=$impl PYTHONPATH=/tmp ./python bench_gnudiff.py \
        -o results/gnu_$impl.json \
        --inherit-environ=DIFFLIB_IMPL,PYTHONPATH
done

# Compare any two via t-test on the per-process means:
./python -m pyperf compare_to results/gnu_pyours.json results/gnu_ours.json

"""Compare GNU diff(1) vs difflib.unified_diff on identical line inputs.

GNU diff uses Myers diff, not Ratcliff-Obershelp -- the outputs are
structurally different, but the wall-time comparison is informative for
the line-level workloads (the only case GNU diff handles natively).
Char-level / bytes / int sequences aren't included because GNU diff
doesn't have a meaningful native equivalent.
"""

import os
import random
import subprocess
import tempfile

import pyperf


def make_inputs(pydifflib_path):
    cases = {}
    with open(pydifflib_path) as f:
        la = f.readlines()
    lb = la[:]
    for i in range(0, len(lb), 50):
        lb[i] = '# tweak\n'
    cases['lines_difflib'] = (la, lb)

    with open(pydifflib_path) as f:
        la = f.readlines() * 3
    lb = la[:]
    for i in range(0, len(lb), 80):
        lb[i] = '# x\n'
    cases['lines_big'] = (la, lb)

    # ~32k lines, ~5% perturbation
    with open(pydifflib_path) as f:
        la = f.readlines() * 15
    lb = la[:]
    rng = random.Random(123)
    for _ in range(len(lb) // 20):
        i = rng.randrange(len(lb))
        lb[i] = f"# changed line {i}\n"
    cases['lines_huge'] = (la, lb)
    return cases


def make_temp_files(a_lines, b_lines):
    fa = tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.a')
    fa.writelines(a_lines); fa.close()
    fb = tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.b')
    fb.writelines(b_lines); fb.close()
    return fa.name, fb.name


def main():
    impl = os.environ['DIFFLIB_IMPL']  # gnudiff | pyours | ours
    pydifflib_path = os.environ.get(
        'PYDIFFLIB_PATH', '/path/to/cpython/Lib/_pydifflib.py')
    cases = make_inputs(pydifflib_path)

    runner = pyperf.Runner()
    runner.metadata['impl'] = impl

    if impl == 'gnudiff':
        # Write inputs once; bench the subprocess invocation.  This
        # deliberately includes process spawn + file IO + writing the
        # diff to /dev/null -- the real cost of using GNU diff.
        for name, (a, b) in cases.items():
            path_a, path_b = make_temp_files(a, b)
            def workload(_a=path_a, _b=path_b):
                # GNU diff exits 1 when files differ; that's success.
                subprocess.run(
                    ['/usr/bin/diff', '-u', _a, _b],
                    stdout=subprocess.DEVNULL,
                    stderr=subprocess.DEVNULL,
                    check=False,
                )
            runner.bench_func(name, workload)
    else:
        if impl == 'pyours':
            import _pydifflib as mod
        elif impl == 'ours':
            import difflib as mod  # picks up _difflib via the shim
        else:
            raise SystemExit(f"unknown impl {impl!r}")
        unified_diff = mod.unified_diff
        for name, (a, b) in cases.items():
            def workload(_a=a, _b=b, _ud=unified_diff):
                # Materialize the generator to match the work GNU diff
                # does (it writes the whole diff out).
                list(_ud(_a, _b, lineterm=''))
            runner.bench_func(name, workload)


if __name__ == '__main__':
    main()

Where the wall-clock savings actually show up

The most compelling wins are in workloads that run difflib in a loop or over a large input batch — typical of third-party tools that do similarity scoring, repo-wide diffing, or clustering. The numbers below are wall-clock time for a single run on this machine (one process; pyperf isn't necessary since the variance is small relative to the deltas):

workload	pure-Python	C accel	saved	speedup
pairwise similarity scoring (151 source files, 11,325 pairs)	6.96 s	1.85 s	5.1 s	3.8×
repo-wide `unified_diff` (2,003 .py files, full `Lib/` recursive)	4.01 s	0.90 s	3.1 s	4.5×
short-text dedup / similarity (20,000 strings, 40k `ratio()` calls)	13.68 s	1.41 s	12.3 s	9.7×
end-to-end on the three workloads above	25.0 s	4.5 s	20.5 s	5.5×

Real third-party tools and use cases that map onto each row:

Pairwise similarity scoring — code-clone detectors (simian, Python ports of jscpd), plagiarism checkers, duplicate-test detectors in large pytest suites, fuzzy record linkage (recordlinkage, simpler dedupe-style flows), document near-duplicate detection in scrapers.
Repo-wide unified_diff — pre-commit hooks emitting per-file diffs, bulk-refactor preview tools (libcst codemods, custom AST rewriters), snapshot test runners (pytest-snapshot, syrupy) comparing 1000+ snapshots, golden-master CI suites, 2to3-family migration tools.
Short-text dedup / similarity scoring — chatbot response near-dedup (Rasa, custom NLU), log-line clustering (drain3, journald analysis), customer-support ticket triage, error monitoring grouping similar stack traces, FAQ Q&A matching, near-duplicate-payload caching at API edges.

The 9.7× on short-text workloads is the largest because short strings hit the codepoint-keyed UCS1 fast path (no PyDict_GetItem per character) and benefit fully from the integer-label DP.

Bench script (wall-clock real-world workloads)

Run twice — once with DIFFLIB_IMPL=pyours (forces the pure-Python SequenceMatcher) and once with DIFFLIB_IMPL=ours (the C accelerator). The pure-Python path is enforced by restoring _pydifflib.SequenceMatcher to the original Python class after the shim has rebound it.

DIFFLIB_IMPL=pyours PYTHONPATH=/tmp ./python bench_realworld.py
DIFFLIB_IMPL=ours   PYTHONPATH=/tmp ./python bench_realworld.py

"""Wall-clock demo: large-input / loop-heavy difflib workloads."""

import os, time, pathlib, random


def workload_pairwise_similarity():
    """N documents, compute SequenceMatcher.ratio() over all N*(N-1)/2 pairs.

    Real-world map: clone detection across functions / duplicate text
    detection / plagiarism scoring / chatbot reply similarity.
    """
    libdir = pathlib.Path('/path/to/cpython/Lib')
    files = sorted(libdir.glob('*.py'))[:200]
    docs = []
    for f in files:
        try:
            docs.append(f.read_text().splitlines()[:600])
        except Exception:
            pass
    print(f'  {len(docs)} docs; '
          f'{len(docs)*(len(docs)-1)//2} pairwise comparisons')

    import difflib
    SM = difflib.SequenceMatcher
    t0 = time.perf_counter()
    pairs = 0
    total = 0.0
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            s = SM(None, docs[i], docs[j])
            total += s.ratio()
            pairs += 1
    elapsed = time.perf_counter() - t0
    print(f'  {pairs} ratios, avg={total/pairs:.3f}, time={elapsed:.2f}s')
    return elapsed


def workload_repo_diff():
    """Walk a large repo, diff every file against a perturbed version.

    Real-world map: pre-commit checks across whole repos, bulk migration
    tools emitting unified diffs per file, batch refactor previews,
    snapshot test runners comparing many baselines to current outputs.
    """
    libdir = pathlib.Path('/path/to/cpython/Lib')
    files = sorted(libdir.glob('**/*.py'))
    print(f'  {len(files)} files')

    import difflib
    t0 = time.perf_counter()
    total_bytes = 0
    for f in files:
        try:
            a = f.read_text().splitlines()
        except Exception:
            continue
        b = a[:]
        for i in range(0, len(b), 50):
            b[i] = '# CHANGED'
        lines = list(difflib.unified_diff(a, b, str(f), str(f), lineterm=''))
        total_bytes += sum(len(l) for l in lines)
    elapsed = time.perf_counter() - t0
    print(f'  total diff bytes: {total_bytes/1024:.0f} KB, time={elapsed:.2f}s')
    return elapsed


def workload_dedup_chat():
    """Pairwise similarity scoring on many short strings.

    Real-world map: chatbot response dedup, customer-support ticket
    clustering, short-text fuzzy matching, log-line near-duplicate
    detection, error-message clustering in monitoring tools.
    """
    rng = random.Random(0)
    import string
    docs = [''.join(rng.choices(string.ascii_lowercase + ' ',
                                k=rng.randint(40, 200)))
            for _ in range(20000)]
    print(f'  {len(docs)} short strings (avg ~120 chars)')
    import difflib
    SM = difflib.SequenceMatcher
    t0 = time.perf_counter()
    n = 0
    for d in docs[2:]:
        SM(None, docs[0], d).ratio()
        SM(None, docs[1], d).ratio()
        n += 2
    elapsed = time.perf_counter() - t0
    print(f'  {n} ratios, time={elapsed:.2f}s')
    return elapsed


def main():
    impl = os.environ.get('DIFFLIB_IMPL', 'ours')

    import _pydifflib
    pure_py_sm = _pydifflib.SequenceMatcher
    import difflib
    if hasattr(difflib, '_PySequenceMatcher'):
        pure_py_sm = difflib._PySequenceMatcher
    if impl == 'pyours':
        # Restore the pure-Python class everywhere so helpers in
        # _pydifflib (unified_diff, ndiff, Differ, HtmlDiff,
        # get_close_matches) see it via module-globals lookup.
        _pydifflib.SequenceMatcher = pure_py_sm
        difflib.SequenceMatcher = pure_py_sm
    elif impl != 'ours':
        raise SystemExit(f"unknown impl {impl!r}")
    print(f'impl={impl}  SequenceMatcher={difflib.SequenceMatcher!r}')
    print()

    for name, fn in [
        ('pairwise similarity (N=200 docs)',     workload_pairwise_similarity),
        ('repo-wide diff (entire Lib/)',         workload_repo_diff),
        ('dedup short-text scoring (20k docs)',  workload_dedup_chat),
    ]:
        print(f'== {name} ==')
        fn()
        print()


if __name__ == '__main__':
    main()

Where the accelerator does not help much

Worth flagging honestly:

get_close_matches over very large vocabularies (e.g., 500k PyPI package names for a "did you mean?" suggestion): ~1.0×. The loop is dominated by quick_ratio / real_quick_ratio filters that this work doesn't accelerate; they still run in pure Python. Tools like pip's typo suggestions, IPython tab completion, and similar get_close_matches-driven UX paths see essentially no benefit. Porting the two filter methods to C would be a small follow-up patch.
black --diff and similar formatters: ~1.0×. The unified_diff portion is dominated by AST parsing and re-printing, not by diff computation. Formatters that emit diffs as a side product won't speed up materially.
unittest/pytest test-suite wall time on green runs: ~1.0×. assertEqual only calls into difflib.ndiff on failure. On the difflib-using subset of the cpython test suite (test_difflib, test_doctest, test_genericalias, etc.) the per-test speedup is 1.0–3× on test_difflib itself and within noise everywhere else, because test bodies do many things besides diffing. The wins are per-failure, not per-test-run.

Move the pure-Python implementation of difflib to Lib/_pydifflib.py and turn Lib/difflib.py into a thin shim that re-exports its public API. This mirrors the layout used by decimal/_pydecimal, datetime/_pydatetime, and pickle/_pickle, where the public module dispatches to a faster C implementation when available and the pure-Python module is preserved as a self-contained reference for alternative Python implementations. No public behaviour change. ``Match`` is constructed with ``module='difflib'`` so its qualified name matches the public module. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Introduce Modules/_difflibmodule.c, a heap-type C extension that implements __init__, set_seqs/set_seq1/set_seq2, find_longest_match, get_matching_blocks, get_opcodes, and ratio for SequenceMatcher. The inner DP loop and the full Ratcliff-Obershelp recursion run on int32 label arrays with zero Python C-API calls in the hot path; codepoint- keyed lookup tables short-circuit per-element dict probes for str and bytes inputs. Output is bit-identical to the pure-Python implementation including tie-breaks. Lib/difflib.py grows a small subclass that inherits the slow-path methods (quick_ratio, real_quick_ratio, get_grouped_opcodes) from the pure-Python class; this is a no-op when the accelerator is not built. Build wiring: configure.ac registers the module via PY_STDLIB_MOD_SIMPLE and Modules/Setup.stdlib.in references _difflibmodule.c. configure must be regenerated with autoreconf before this lands. Typical workloads run 5-25x faster than pure Python; the bytes path up to ~70x. See Lib/test/test_difflib.py for cross-implementation tests. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

When the _difflib C accelerator is built, programmatically generate a parallel ``*_PurePython`` TestCase for each existing test class so the same suite covers both implementations. Pure-Python coverage is obtained by patching ``difflib.SequenceMatcher`` to ``_pydifflib.SequenceMatcher`` in setUp / restoring it in tearDown; internal helpers like ``unified_diff`` and ``ndiff`` resolve ``SequenceMatcher`` on ``difflib`` at call time, so patching the module attribute covers the whole pipeline. This mirrors the dual-implementation test pattern used by test_decimal (C* / Py* class pairs) without requiring every existing test method to be parameterised. ``test_html_diff`` also gets a single-line fix: it depended on ``HtmlDiff._default_prefix`` starting at 0, which only held because it ran first. Resetting the counter at the top of the test makes it order-independent. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

blhsing and others added 3 commits May 21, 2026 16:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

difflib C accelerator#44

difflib C accelerator#44
blhsing wants to merge 3 commits into
masterfrom
difflib-c-accelerator

blhsing commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

blhsing commented May 22, 2026

A C accelerator for difflib.SequenceMatcher

Context

Preserving the pure-Python implementation

Optimization phases

Benchmark hardware and software

Benchmark workloads

Ours vs the original pure-Python on CPython 3.16 dev

Apples-to-apples comparison vs PyPI alternatives (CPython 3.12)

What lets ours go further

Integration details

Comparison vs GNU diff(1) on line workloads

Where the wall-clock savings actually show up

Where the accelerator does not help much

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

A C accelerator for `difflib.SequenceMatcher`

Comparison vs GNU `diff(1)` on line workloads