Experiments with Concrete Syntax Tree (CST) representations using srcML. These tools convert Python source code to various tree representations and back, with perfect fidelity.
srcML parses source code into an XML-based CST that preserves all syntax including whitespace, comments, and formatting. This enables lossless roundtripping: code → tree → code.
We provide three serialization formats:
| Format | To Tree | From Tree |
|---|---|---|
| XML | dir_to_xml.py | xml_to_stdout.py |
| JSON | dir_to_json.py | json_to_stdout.py |
| Turtle (RDF) | dir_to_turtle.py | turtle_to_stdout.py, turtle_to_files.py |
pip install pylibsrcml rdflib

Also requires srcML 1.1.0+ installed on the system.
The native srcML format. Suitable for XPath/XSLT processing.
python3 dir_to_xml.py <directory> [output.xml]
python3 xml_to_stdout.py <archive.xml>

python3 dir_to_xml.py example2 example2.xml
python3 xml_to_stdout.py example2.xml

Output:
=== main.py ===
"""Main application that uses the utils module."""
...
=== utils.py ===
"""Utility functions for data processing."""
...
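
Because this format is plain srcML, it can be inspected with standard XML tooling. The sketch below is illustrative only: the `SAMPLE` fragment is hand-written to imitate the general shape of srcML output (real archives contain many more elements), and the helper names `source_text` and `function_names` are hypothetical. It shows the two properties described above: concatenating every text node reproduces the source, and an XPath-style query finds function names.

```python
import xml.etree.ElementTree as ET

# The srcML namespace for source-code elements.
SRC_NS = "http://www.srcML.org/srcML/src"

# Hand-written fragment for illustration; not actual srcML output.
SAMPLE = (
    f'<unit xmlns="{SRC_NS}" language="Python" filename="main.py">'
    '<function>def <name>hello</name>():\n    pass\n</function>'
    '</unit>'
)


def source_text(unit):
    """Rebuild the source by joining every text node in document order."""
    return "".join(unit.itertext())


def function_names(unit):
    """Return the text of each <name> that is a direct child of a <function>."""
    ns = {"src": SRC_NS}
    return [name.text for name in unit.findall(".//src:function/src:name", ns)]


unit = ET.fromstring(SAMPLE)
print(source_text(unit))
print(function_names(unit))
```

Since all whitespace lives in text nodes, nothing beyond `itertext()` is needed to recover the original file from a unit.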
A JSON tree representation. Suitable for JavaScript/Python processing.
python3 dir_to_json.py <directory> [output.json]
python3 json_to_stdout.py <archive.json>

python3 dir_to_json.py example2 example2.json
python3 json_to_stdout.py example2.json

Output:
[
{"path": "main.py", "source": "..."},
{"path": "utils.py", "source": "..."}
]

Tree node structure:

{
"tag": "function",
"text": "def ",
"attrs": {"type": "string"},
"children": [
{"tag": "name", "text": "hello"},
{"tag": "#text", "text": "\n"}
]
}

An RDF representation using the Turtle syntax. Suitable for graph databases and SPARQL queries.
python3 dir_to_turtle.py <directory> [output.ttl]

To stdout as JSON array:

python3 turtle_to_stdout.py <archive.ttl>

To files in a directory:

python3 turtle_to_files.py <archive.ttl> <output_dir>

python3 dir_to_turtle.py example2 example2.ttl
python3 turtle_to_files.py example2.ttl output/

Uses the namespace http://trustgraph.ai/cst# with URI-based node identifiers (no blank nodes):
@prefix cst: <http://trustgraph.ai/cst#> .
@prefix node: <http://trustgraph.ai/cst/node/> .
node:main.py a cst:File ;
cst:path "main.py" ;
cst:root node:main.py/n1 .
node:main.py/n1 a cst:unit ;
cst:attr_language "Python" ;
cst:children ( node:main.py/n2 node:main.py/n3 ) .
node:main.py/n2 a cst:function ;
cst:text "def " ;
cst:children ( ... ) .

Child ordering is preserved using rdf:List.
This example demonstrates using the RDF representation for code transformation: a script replaces the English strings in a program with French translations.
example3/ contains a simple animal guessing game with English strings:
# game.py
print("Think of an animal...")
if ask("Does it have four legs?"):
...
print("I win!")

- Convert to Turtle

python3 dir_to_turtle.py example3 example3.ttl

- Translate strings

python3 translate_to_french.py example3.ttl example3_french.ttl

- Convert back to source files

python3 turtle_to_files.py example3_french.ttl example3_french/

The output in example3_french/ contains valid Python with French strings:
# game.py
print("Pensez à un animal...")
if ask("A-t-il quatre pattes ?"):
...
print("J'ai gagné !")

# animals.py
ANIMALS = {
"chien": {"four_legs": True, ...},
"chat": {"four_legs": True, ...},
...
}

The translate_to_french.py script:
- Loads the Turtle graph with rdflib
- Finds all cst:text triples containing translatable strings
- Replaces English strings with French translations
- Writes out a new Turtle file
The same approach works for any CST transformation:
- Renaming variables/functions
- Adding instrumentation
- Code analysis
- Refactoring
For the Turtle/RDF format, string literals are normalized on ingestion using ast.literal_eval() and reconstructed on emission. This makes transformations cleaner since you work with actual string values rather than source syntax.
Ingestion: "hello\n" → hello + newline (actual value)
Emission: Reconstructs valid Python string literals with double quotes
Special handling:
- F-strings (f"...") pass through unchanged (not normalized)
- Docstrings with format="docstring" emit as triple-quoted """..."""
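
A minimal sketch of this normalization, under stated assumptions: the helper names `string_prefix`, `ingest`, and `emit` are hypothetical, only `str` literals are handled (not bytes), and `json.dumps` is used here as a convenient way to produce a double-quoted literal. The actual tools may reconstruct literals differently.

```python
import ast
import json


def string_prefix(literal):
    """Return the lowercase prefix letters before the opening quote, e.g. 'f' or 'rb'."""
    for index, char in enumerate(literal):
        if char in "'\"":
            return literal[:index].lower()
    return ""


def ingest(literal):
    """Turn source syntax into the actual string value; f-strings pass through."""
    if "f" in string_prefix(literal):
        return literal
    return ast.literal_eval(literal)


def emit(value, docstring=False):
    """Rebuild a Python literal: triple-quoted for docstrings, else double-quoted."""
    if docstring:
        # Escape backslashes and quotes so the body cannot end the literal early.
        body = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"""{body}"""'
    return json.dumps(value, ensure_ascii=False)


value = ingest('"hello\\n"')   # the five letters of hello followed by a real newline
literal = emit(value)          # source syntax again, with the newline escaped
```

Working on `value` rather than `literal` means a transformation never has to reason about quote styles or escape sequences.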
srcML only marks function/class docstrings with format="docstring". Module-level docstrings are not marked, so they emit as regular double-quoted strings:
# Original
"""Module docstring."""
# After roundtrip
"Module docstring."

The code remains valid Python - only the quote style differs. This is a srcML limitation, not a bug in the roundtrip tools.

| Directory | Description |
|---|---|
| example1/ | Single file sample |
| example2/ | Multi-file with imports |
| example3/ | Animal guessing game (English) |