trustgraph-ai/experimental-code-translation

CST Roundtrip Experiments

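Experiments with Concrete Syntax Tree (CST) representations using srcML. These tools convert Python source code to several tree representations and back again, preserving whitespace, comments, and formatting. The Turtle format normalizes string literals, so quote style can change on roundtrip (see String Normalization).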

Overview

srcML parses source code into an XML-based CST that preserves all syntax including whitespace, comments, and formatting. This enables lossless roundtripping: code → tree → code.

We provide three serialization formats:

Format        To Tree            From Tree
XML           dir_to_xml.py      xml_to_stdout.py
JSON          dir_to_json.py     json_to_stdout.py
Turtle (RDF)  dir_to_turtle.py   turtle_to_stdout.py, turtle_to_files.py

Dependencies

pip install pylibsrcml rdflib

Also requires srcML 1.1.0+ installed on the system.


XML Roundtrip

The native srcML format. Suitable for XPath/XSLT processing.

Convert directory to XML archive

python3 dir_to_xml.py <directory> [output.xml]

Convert XML back to source (stdout with headers)

python3 xml_to_stdout.py <archive.xml>

Example

python3 dir_to_xml.py example2 example2.xml
python3 xml_to_stdout.py example2.xml

Output:

=== main.py ===
"""Main application that uses the utils module."""
...

=== utils.py ===
"""Utility functions for data processing."""
...

JSON Roundtrip

A JSON tree representation. Suitable for JavaScript/Python processing.

Convert directory to JSON

python3 dir_to_json.py <directory> [output.json]

Convert JSON back to source (JSON array output)

python3 json_to_stdout.py <archive.json>

Example

python3 dir_to_json.py example2 example2.json
python3 json_to_stdout.py example2.json

Output:

[
  {"path": "main.py", "source": "..."},
  {"path": "utils.py", "source": "..."}
]
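
Because json_to_stdout.py emits a JSON array of path/source objects, its output can be consumed by a short script that writes the files back to disk. The following is a minimal sketch and not part of the repository: the function name write_sources and the path-safety check are assumptions made for illustration.

```python
import json
from pathlib import Path


def write_sources(json_text, output_dir):
    """Write each {"path", "source"} entry under output_dir; return the paths written."""
    root = Path(output_dir).resolve()
    written = []
    for entry in json.loads(json_text):
        target = (root / entry["path"]).resolve()
        # Refuse entries that would land outside the output directory (e.g. "../x.py").
        if root not in target.parents:
            raise ValueError(f"unsafe path: {entry['path']}")
        target.parent.mkdir(parents=True, exist_ok=True)
        # newline="" keeps line endings exactly as they appear in the archive.
        with open(target, "w", encoding="utf-8", newline="") as handle:
            handle.write(entry["source"])
        written.append(target)
    return written
```

One way to use it would be to read the array from standard input and pass it, together with a target directory, to write_sources.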

JSON Structure

{
  "tag": "function",
  "text": "def ",
  "attrs": {"type": "string"},
  "children": [
    {"tag": "name", "text": "hello"},
    {"tag": "#text", "text": "\n"}
  ]
}
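
To illustrate how such a tree can be turned back into source, here is a minimal sketch. It assumes that a node's source is its own "text" followed by the source of its children in order, with "#text" nodes carrying the text that sits between elements. That reading is inferred from the structure above and may differ from what json_to_stdout.py actually does; node_to_source is a hypothetical name.

```python
def node_to_source(node):
    """Concatenate a node's own text and the source of its children, in order."""
    parts = [node.get("text") or ""]
    for child in node.get("children", []):
        parts.append(node_to_source(child))
    return "".join(parts)


example = {
    "tag": "function",
    "text": "def ",
    "attrs": {"type": "string"},
    "children": [
        {"tag": "name", "text": "hello"},
        {"tag": "#text", "text": "\n"},
    ],
}
print(repr(node_to_source(example)))  # 'def hello\n'
```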

Turtle (RDF) Roundtrip

An RDF representation using the Turtle syntax. Suitable for graph databases and SPARQL queries.

Convert directory to Turtle

python3 dir_to_turtle.py <directory> [output.ttl]

Convert Turtle back to source

To stdout as JSON array:

python3 turtle_to_stdout.py <archive.ttl>

To files in a directory:

python3 turtle_to_files.py <archive.ttl> <output_dir>

Example

python3 dir_to_turtle.py example2 example2.ttl
python3 turtle_to_files.py example2.ttl output/

RDF Structure

Uses the namespace http://trustgraph.ai/cst# with URI-based node identifiers (no blank nodes):

@prefix cst: <http://trustgraph.ai/cst#> .
@prefix node: <http://trustgraph.ai/cst/node/> .

node:main.py a cst:File ;
    cst:path "main.py" ;
    cst:root node:main.py/n1 .

node:main.py/n1 a cst:unit ;
    cst:attr_language "Python" ;
    cst:children ( node:main.py/n2 node:main.py/n3 ) .

node:main.py/n2 a cst:function ;
    cst:text "def " ;
    cst:children ( ... ) .

Child ordering is preserved using rdf:List.


Code Transformation Demo: English to French

This demo uses the RDF representation for code transformation: a script replaces the English string literals in a program with French ones, using a set of English -> French translation strings.

The Example

example3/ contains a simple animal guessing game with English strings:

# game.py
print("Think of an animal...")
if ask("Does it have four legs?"):
    ...
print("I win!")

Translation Pipeline

  1. Convert to Turtle

    python3 dir_to_turtle.py example3 example3.ttl

  2. Translate strings

    python3 translate_to_french.py example3.ttl example3_french.ttl

  3. Convert back to source files

    python3 turtle_to_files.py example3_french.ttl example3_french/

Result

The output in example3_french/ contains valid Python with French strings:

# game.py
print("Pensez à un animal...")
if ask("A-t-il quatre pattes ?"):
    ...
print("J'ai gagné !")

# animals.py
ANIMALS = {
    "chien": {"four_legs": True, ...},
    "chat": {"four_legs": True, ...},
    ...
}

How It Works

The translate_to_french.py script:

  1. Loads the Turtle graph with rdflib
  2. Finds all cst:text triples containing translatable strings
  3. Replaces English strings with French translations
  4. Writes out a new Turtle file

The same approach works for any CST transformation:

  • Renaming variables/functions
  • Adding instrumentation
  • Code analysis
  • Refactoring

String Normalization (Turtle format)

For the Turtle/RDF format, string literals are normalized on ingestion using ast.literal_eval() and reconstructed on emission. This makes transformations cleaner since you work with actual string values rather than source syntax.

Ingestion: "hello\n" → hello + newline (actual value)

Emission: Reconstructs valid Python string literals with double quotes

Special handling:

  • F-strings (f"...") pass through unchanged (not normalized)
  • Docstrings with format="docstring" emit as triple-quoted """..."""
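
The normalize-then-reconstruct idea can be sketched with the standard library alone. This is an assumption-laden illustration, not the tools' code: normalize_literal and emit_literal are hypothetical names, only a handful of escapes are handled on emission, and triple-quoted docstring emission is left out.

```python
import ast

# Characters that must be escaped inside a double-quoted literal (a minimal set).
_ESCAPES = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def normalize_literal(source):
    """Return the string value of a literal, or the source itself for f-strings."""
    prefix = source[: len(source) - len(source.lstrip("rRbBuUfF"))]
    if "f" in prefix.lower():
        return source  # f-strings pass through unchanged
    return ast.literal_eval(source)


def emit_literal(value):
    """Rebuild a double-quoted Python literal for a string value."""
    return '"' + "".join(_ESCAPES.get(ch, ch) for ch in value) + '"'


value = normalize_literal(r"'hello\n'")
print(repr(value))          # 'hello\n'
print(emit_literal(value))  # "hello\n"
```

Note that the single-quoted input comes back double-quoted, which matches the emission rule described above.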

Known Limitation

srcML only marks function/class docstrings with format="docstring". Module-level docstrings are not marked, so they emit as regular double-quoted strings:

# Original
"""Module docstring."""

# After roundtrip
"Module docstring."

The code remains valid Python - only the quote style differs. This is a srcML limitation, not a bug in the roundtrip tools.


Examples

Directory   Description
example1/   Single file sample
example2/   Multi-file with imports
example3/   Animal guessing game (English)

About

Hacking around with some code translation
