@@ -74,28 +74,43 @@ expr.eval_exists(doc) # -> bool
 expr.eval(doc)  # -> list[Element]
 ```

-### ElementTree compatibility
+### ElementTree drop-in (read-only)

-Drop-in replacement for `xml.etree.ElementTree` (read-only):
+Full read-only drop-in replacement for `xml.etree.ElementTree`. Every
+read-only Element method and module function is supported:

 ```python
 from simdxml.etree import ElementTree as ET

 tree = ET.parse("books.xml")
 root = tree.getroot()

-# stdlib-compatible API
-root.tag  # element tag name
-root.text  # direct text content
-root.attrib  # attribute dict
-root.get("key")  # attribute access
+# All stdlib Element methods work
+root.tag, root.text, root.tail, root.attrib
+root.find(".//title")  # first match
+root.findall(".//book[@lang]")  # all matches
+root.findtext(".//title")  # text of first match
+root.iterfind(".//author")  # iterator
 root.iter("title")  # descendant iterator
 root.itertext()  # text iterator
-
-# Full XPath 1.0 (lxml-compatible extension)
+root.get("key"), root.keys(), root.items()
+len(root), root[0], list(root)
+
+# All stdlib module functions work
+ET.parse(file), ET.fromstring(text), ET.tostring(element)
+ET.iterparse(file, events=("start", "end"))
+ET.canonicalize(xml), ET.dump(element), ET.iselement(obj)
+ET.XMLPullParser(events=("end",)), ET.XMLParser(), ET.TreeBuilder()
+ET.fromstringlist(seq), ET.tostringlist(elem)
+ET.QName(uri, tag), ET.XMLID(text)
+
+# Plus full XPath 1.0 (lxml-compatible extension)
 root.xpath("//book[contains(title, 'XML')]")
 ```

+Mutation operations (`append`, `remove`, `set`, `SubElement`, `indent`, etc.)
+raise `TypeError` with a helpful message pointing to stdlib.
+
 ### Read-only by design

 simdxml Elements are immutable views into the structural index. Mutation
@@ -122,59 +137,32 @@ Full conformance with XPath 1.0:

 ## Benchmarks

-Apple Silicon, Python 3.14, lxml 6.0. GC disabled during timing, 3 warmup +
-20 timed iterations, median reported. Three corpus types: data-oriented
-(product catalog), document-oriented (PubMed abstracts), config-oriented
-(Maven POM). Run yourself: `uv run python bench/bench_parse.py`
-
-### Parse
-
-`simdxml.parse()` eagerly builds structural indices (CSR, name posting).
-lxml's `fromstring()` builds a DOM tree without precomputed query indices.
-simdxml front-loads more work into parse so queries are faster — both numbers
-are real, the trade-off depends on your workload.
-
-| Corpus | Size | simdxml | lxml | vs lxml | vs stdlib |
-|--------|------|---------|------|---------|-----------|
-| Catalog (data) | 1.6 MB | 2.7 ms | 8.1 ms | **3.0x** | **5.4x** |
-| Catalog (data) | 17 MB | 32 ms | 82 ms | **2.6x** | **4.7x** |
-| PubMed (doc) | 1.7 MB | 2.3 ms | 6.0 ms | **2.7x** | **5.9x** |
-| PubMed (doc) | 17 MB | 27 ms | 61 ms | **2.2x** | **5.0x** |
-| POM (config) | 2.1 MB | 2.7 ms | 8.3 ms | **3.1x** | **6.6x** |
-
-### XPath queries (returning Elements — apples-to-apples)
-
-| Query | Corpus | simdxml | lxml | vs lxml |
-|-------|--------|---------|------|---------|
-| `//item` | Catalog 17 MB | 3.4 ms | 21 ms | **6x** |
-| `//item[@category="cat5"]` | Catalog 17 MB | 1.6 ms | 69 ms | **42x** |
-| `//PubmedArticle` | PubMed 17 MB | 0.35 ms | 9.8 ms | **28x** |
-| `//Author[LastName="Auth0_0"]` | PubMed 17 MB | 13 ms | 29 ms | **2.2x** |
-| `//dependency` | POM 2.1 MB | 0.34 ms | 1.1 ms | **3.3x** |
-| `//dependency[scope="test"]` | POM 2.1 MB | 2.4 ms | 3.6 ms | **1.5x** |
-
-### XPath text extraction
-
-`xpath_text()` returns strings directly, avoiding Element object creation.
-This is the optimized path for ETL / data extraction workloads.
-
-| Query | Corpus | simdxml | lxml xpath+.text | vs lxml |
-|-------|--------|---------|------------------|---------|
-| `//name` | Catalog 17 MB | 1.8 ms | 37 ms | **20x** |
-| `//AbstractText` | PubMed 17 MB | 0.31 ms | 7.1 ms | **23x** |
-| `//artifactId` | POM 2.1 MB | 0.21 ms | 2.0 ms | **10x** |
-
-### Element traversal
-
-`child_tags()` and `descendant_tags()` return all tag names in a single
-call using interned Python strings. Per-element iteration (`for e in root`)
-is also available but creates Element objects with some overhead.
-
-| Corpus | `child_tags()` | lxml `[e.tag]` | vs lxml |
-|--------|----------------|-----------------|---------|
-| Catalog 17 MB | **0.38 ms** | 6.4 ms | **17x** |
-| PubMed 17 MB | **0.03 ms** | 0.60 ms | **17x** |
-| POM 2.1 MB | **0.2 us** | 0.5 us | **3x** |
+Apple Silicon, Python 3.14, lxml 6.0. GC disabled, 3 warmup + 20 timed
+iterations, median reported. 100K-element catalog (5.6 MB).
+Run yourself: `uv run python bench/bench_parse.py`
+
+Faster than lxml on every operation. Faster than stdlib on 10 of 14, tied on one.
+
+| Operation | simdxml | lxml | stdlib | vs lxml | vs stdlib |
+|-----------|---------|------|--------|---------|-----------|
+| `parse()` | 10 ms | 33 ms | 55 ms | **3x** | **5x** |
+| `find("item")` | <1 us | 1 us | <1 us | **faster** | **tied** |
+| `find(".//name")` | <1 us | 1 us | 1 us | **faster** | **faster** |
+| `findall("item")` | 0.23 ms | 4.8 ms | 0.89 ms | **21x** | **4x** |
+| `findall(".//item")` | 0.15 ms | 6.2 ms | 3.0 ms | **42x** | **20x** |
+| `findall(predicate)` | 1.5 ms | 12 ms | 4.9 ms | **8x** | **3x** |
+| `findtext(".//name")` | <1 us | 1 us | 1 us | **faster** | **faster** |
+| `xpath_text("//name")` | 2.1 ms | 19 ms | 4.4 ms | **9x** | **2x** |
+| `iter()` | 9.2 ms | 15 ms | 1.3 ms | **2x** | 0.14x |
+| `iter("item")` filtered | 4.5 ms | 5.9 ms | 1.9 ms | **1.3x** | 0.4x |
+| `itertext()` | 2.6 ms | 33 ms | 1.4 ms | **13x** | 0.5x |
+| `child_tags()` | 0.40 ms | 6.2 ms | 1.5 ms | **16x** | **4x** |
+| `iterparse()` | 51 ms | 66 ms | 70 ms | **1.3x** | **1.4x** |
+| `canonicalize()` | 1.8 ms | 4.7 ms | 4.6 ms | **3x** | **3x** |
+
+The three operations where stdlib is faster (`iter`, `itertext`, `iter` filtered)
+involve creating per-element Python objects. The batch alternatives
+(`child_tags()`, `xpath_text()`) beat both lxml and stdlib for those workloads.

 ## How it works

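To illustrate the drop-in claim made in the first hunk, here is a minimal sketch that uses only read-only calls listed there. It assumes the `simdxml.etree` import path shown in the diff and falls back to the stdlib module when simdxml is not installed. The sample document and the `summarize` helper are made up for illustration.

```python
# Prefer simdxml if it is installed (import path taken from the diff),
# otherwise fall back to the stdlib module with the same read-only API.
try:
    from simdxml.etree import ElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

# Hypothetical sample document, not part of the README.
SAMPLE = (
    "<library>"
    '<book lang="en"><title>XML Basics</title><author>A. Writer</author></book>'
    "<book><title>Parsing Fast</title><author>B. Coder</author></book>"
    "</library>"
)


def summarize(xml_text):
    """Collect a few facts using only read-only Element calls."""
    root = ET.fromstring(xml_text)
    return {
        "root_tag": root.tag,
        "first_title": root.findtext(".//title"),
        "titles": [t.text for t in root.iter("title")],
        "books_with_lang": len(root.findall(".//book[@lang]")),
        "child_count": len(root),
    }


summary = summarize(SAMPLE)
print(summary)
```

Because nothing here mutates the tree, the same function body should work with either import. Code that calls `append`, `set`, or `SubElement` would need the stdlib module, as the added README text notes.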
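The timing method described in the new Benchmarks text (GC disabled, 3 warmup + 20 timed iterations, median reported) can be sketched as follows. This is not the contents of `bench/bench_parse.py`, which the diff does not show. The `bench` helper and the small synthetic catalog are assumptions, and the sketch times only the stdlib parser so that it runs without extra dependencies.

```python
import gc
import statistics
import time
import xml.etree.ElementTree as ET


def bench(fn, warmup=3, iterations=20):
    """Return the median wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm caches; these runs are not timed
    was_enabled = gc.isenabled()
    gc.disable()  # keep collector pauses out of the timed samples
    try:
        samples = []
        for _ in range(iterations):
            start = time.perf_counter()
            fn()
            samples.append((time.perf_counter() - start) * 1000.0)
    finally:
        if was_enabled:
            gc.enable()
    return statistics.median(samples)


# Hypothetical synthetic catalog, much smaller than the 100K-element corpus.
items = "".join(
    f'<item category="cat{i % 10}"><name>Item {i}</name></item>'
    for i in range(1000)
)
xml_text = f"<catalog>{items}</catalog>"
root = ET.fromstring(xml_text)

parse_ms = bench(lambda: ET.fromstring(xml_text))
findall_ms = bench(lambda: root.findall(".//item"))
print(f"parse: {parse_ms:.3f} ms, findall: {findall_ms:.3f} ms")
```

Reporting the median rather than the mean keeps one slow iteration from skewing the result, which matters for operations in the sub-millisecond range.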