Skip to content

WIP XML.jl v0.4: Rewrite of internals, streaming tokenizer, XPath support, and bug fixes #54

Open
joshday wants to merge 18 commits intoJuliaComputing:mainfrom
joshday:main
Open

WIP XML.jl v0.4: Rewrite of internals, streaming tokenizer, XPath support, and bug fixes #54
joshday wants to merge 18 commits intoJuliaComputing:mainfrom
joshday:main

Conversation

@joshday
Copy link
Copy Markdown
Contributor

@joshday joshday commented Mar 6, 2026

Summary of Changes

I revived an old rewrite I had halfway finished with the help of Claude Code. It produced some good results!

  • Major rewrite of XML.jl's internals that addresses many open issues
  • Self-contained src/XMLTokenizer.jl module for speedy tokenization
  • Node{T} now parameterized by the string storage type, enabling quick reads via SubString or StringViews.jl
  • StringViews extension — XML.mmap("file.xml", LazyNode) for memory-mapped parsing of very large files
  • XPath support — xpath(node, path) with a practical subset of XPath 1.0
  • Greatly expanded test suite — 243 libxml2 test cases, pugixml and libexpat compatibility tests, W3C conformance tests

Downstream

@TimG1964 you are likely the most impacted with these changes. The Downstream.yml action does indicate a failure in XLSX.jl tests related to Raw no longer existing. I'd appreciate your review here! I'm happy to submit a PR for a fix in XLSX.jl so that its ready to go before this gets merged.

Addressed Issues

Benchmarks: See benchmarks/compare.jl

Here (SS) refers to using SubString{String} as storage type.

julia --project=. benchmarks/compare.jl
============================================================
  XML.jl Benchmark Comparison
  Current (dev) vs v0.3.8
============================================================

Running dev benchmarks... done
Setting up v0.3.8 worktree... done
Running v0.3.8 benchmarks... done

------------------------------------------------------------

  Parse (small)
          v0.3.8      0.114 ms
             dev     0.0335 ms  (70.6% faster)

  Parse (small, SS)
          v0.3.8           n/a
             dev     0.0285 ms

  Parse (medium)
          v0.3.8   634.7153 ms
             dev   161.0888 ms  (74.6% faster)

  Parse (medium, SS)
          v0.3.8           n/a
             dev   151.3025 ms

  Write (small)
          v0.3.8     0.0227 ms
             dev     0.0176 ms  (22.4% faster)

  Write (medium)
          v0.3.8   118.1504 ms
             dev     77.619 ms  (34.3% faster)

  Read file (medium)
          v0.3.8   645.5785 ms
             dev   170.8398 ms  (73.5% faster)

  Collect tags (small)
          v0.3.8     0.0005 ms
             dev     0.0006 ms  (10.3% slower)

  Collect tags (medium)
          v0.3.8    21.0988 ms
             dev    11.1532 ms  (47.1% faster)

============================================================

@TimG1964
Copy link
Copy Markdown
Contributor

TimG1964 commented Mar 8, 2026

Hey @joshday . I've only had a very superficial look so far but it looks great. Thanks!

In terms of impact on XLSX.jl, I think it looks significant. It isn't just Raw. Since @nhz2 first suggested using Raw, I've known it was internal and therefore subject to change. On first inspection, I think the rework involved should be manageable.

More of a challenge will be the removal of prev and next, which are currently exported functions. I rely on these for fundamental elements of XLSX.jl like the sheetrow and tablerow iterators, and for reading and writing the XML files from/to the zip archive .xlsx file.

These obviously aren't insuperable, but will likely need a bit of time while I get to grips with xpath and tokenizer. Optimistic me thinks the new functionality will simplify the code of XLSX.jl, but I usually find things are considerably harder than I first imagine! I'll feedback more when I've had a bit more of a go at getting XLSX.jl working.

Thanks,

Tim

@TimG1964
Copy link
Copy Markdown
Contributor

TimG1964 commented Apr 2, 2026

I'm happy to submit a PR for a fix in XLSX.jl so that its ready to go before this gets merged.

Hi @joshday, I've been a bit distracted recently by transferring XLSX.jl to JuliaData and subsequently making a v0.11 release, but my attention will be back on this again after the Easter break. I have to say I'd welcome any PR you could make on XLSX.jl to help facilitate this upgrade.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

3 participants