Implement end-to-end string language detection and parser reporting #11

Open
AntoineBastide47 wants to merge 16 commits into main from string-language-detection

@AntoineBastide47 AntoineBastide47 commented Apr 12, 2026

Summary

This PR adds end-to-end string language detection to the LLVM pipeline.

The new classifier inspects extracted string literals and assigns them to higher-level categories such as URL, Email, FilePath, FormatString, SQL, JSON, YAML, HTML, XML, CSS, Regex, Shell, HexData, BinaryData, PseudoBinaryData, PlainText, and a few code-like fallback buckets. The goal is to make string-bucket output more useful for analysis, debugging, and downstream reporting.

The Pseudo prefix means that the string contains a match for the category together with surrounding plain text or random junk. For example, this is a PseudoUrl:
"example domain: https://example.com"
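As a rough illustration of the Url versus PseudoUrl distinction, here is a minimal sketch. It is not the code in URLClassifier.cpp: the names UrlKind, isUrlChar, and classifyUrl are hypothetical, and the rule used (a match covering the whole string is a Url, a match covering only part of it is a PseudoUrl) is an assumption drawn from the example above.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical result type; the PR's real enum lives in LanguageClassifier.hpp.
enum class UrlKind { NotUrl, Url, PseudoUrl };

// Hypothetical helper: characters accepted inside a URL token in this sketch.
inline bool isUrlChar(char c) {
    return std::isgraph(static_cast<unsigned char>(c)) != 0 &&
           c != '"' && c != '<' && c != '>';
}

// Sketch: locate the earliest "http://" or "https://", extend the match over
// URL characters, then check whether the match spans the whole string.
inline UrlKind classifyUrl(const std::string& s) {
    std::size_t start = s.find("https://");
    std::size_t schemeLen = 8;
    const std::size_t http = s.find("http://");
    if (http != std::string::npos &&
        (start == std::string::npos || http < start)) {
        start = http;
        schemeLen = 7;
    }
    if (start == std::string::npos) return UrlKind::NotUrl;

    std::size_t end = start + schemeLen;
    while (end < s.size() && isUrlChar(s[end])) ++end;
    assert(end <= s.size());

    // A bare scheme with nothing after it is not treated as a URL.
    if (end == start + schemeLen) return UrlKind::NotUrl;

    // Whole-string match: Url. Match surrounded by other text: PseudoUrl.
    return (start == 0 && end == s.size()) ? UrlKind::Url : UrlKind::PseudoUrl;
}
```

Under these assumptions, classifyUrl("https://example.com") yields UrlKind::Url, while the example string above yields UrlKind::PseudoUrl because of the leading "example domain: " text.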

Beyond the classifier itself, this PR also wires the result into the parser/stats flow, adds debug output support, introduces a training/generation script for the model-backed part of the system, and adds a broad regression suite around both direct classification and parser integration.

Extracted samples can be saved to disk with the --debug-languages option, which creates a result folder containing the parsed repository's results.

Recommendation: run make train build before testing.

What changed

  • Added the public string-language classification API and runtime implementation.
  • Integrated classification into parser processing and reporting.
  • Split detection logic into dedicated classifier modules under LLVM/src/classifiers/ to keep the heuristics maintainable.
  • Added training/generation tooling and documentation for the classifier model.
  • Added regression tests for direct string classification, parser fixtures, and debug output.

Important files

  • LLVM/src/LanguageClassifier.cpp
    Main classifier entry point. Contains the overall decision flow, classifier ordering, and model-backed scoring logic.
  • LLVM/include/LanguageClassifier.hpp
    Public API for classification results and the language/category enum.
  • LLVM/src/Parser.cpp
    Wires classification into the parser pipeline and reporting/debug output flow.
  • LLVM/src/classifiers/Internal.hpp
    Shared helper logic and detector interfaces used across the classifier modules.
  • LLVM/src/classifiers/EmailClassifier.cpp
    Handles email, pseudo-email, and related edge cases.
  • LLVM/src/classifiers/URLClassifier.cpp
    Handles URL and pseudo-URL detection.
  • LLVM/src/classifiers/PlainTextClassifier.cpp
    Contains the plain-text fallback heuristics, which are important for avoiding false positives.
  • LLVM/src/classifiers/YAMLClassifier.cpp
    Representative structured-text detector; similar logic for other structured formats lives alongside it in LLVM/src/classifiers/.
  • LLVM/tests/language_detection_test.cpp
    Main regression suite for direct string classification behavior.
  • LLVM/tests/parser_fixture_test.cpp
    Verifies parser-level integration and bucket behavior on fixture inputs.
  • LLVM/tests/debug_language_output_test.cpp
    Verifies the debug output generated from classified string buckets.
  • LLVM/train/generate_model.py
    Training/generation script for the model-backed part of the classifier pipeline.
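The file list above describes an ordered chain of detectors with a plain-text fallback. The following is a minimal sketch of that shape only, assuming a first-match-wins rule; it does not reproduce LanguageClassifier.cpp, omits the model-backed scoring, and every name in it (Category as written here, Detector, looksLikeEmail, looksLikeJson, looksLikeHex, classify) is hypothetical. The heuristics are deliberately crude placeholders.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Category names come from the PR summary; the real enum has more entries.
enum class Category { Email, JSON, HexData, PlainText };

// Hypothetical detector shape: a category plus a predicate on the string.
struct Detector {
    Category category;
    std::function<bool(const std::string&)> matches;
};

// Placeholder heuristic: exactly one '@', no spaces, and a dot after the '@'.
inline bool looksLikeEmail(const std::string& s) {
    const std::size_t at = s.find('@');
    if (at == std::string::npos || at == 0) return false;
    if (s.find('@', at + 1) != std::string::npos) return false;
    if (s.find(' ') != std::string::npos) return false;
    const std::size_t dot = s.find('.', at + 1);
    return dot != std::string::npos && dot > at + 1 && dot + 1 < s.size();
}

// Placeholder heuristic: wrapped in {} or []. A real detector would parse.
inline bool looksLikeJson(const std::string& s) {
    return s.size() >= 2 && ((s.front() == '{' && s.back() == '}') ||
                             (s.front() == '[' && s.back() == ']'));
}

// Placeholder heuristic: an even number of hexadecimal digits.
inline bool looksLikeHex(const std::string& s) {
    if (s.size() < 2 || s.size() % 2 != 0) return false;
    for (unsigned char c : s) {
        if (!std::isxdigit(c)) return false;
    }
    return true;
}

// Detectors run in a fixed order; the first match wins, and anything that
// no detector claims falls through to PlainText.
inline Category classify(const std::string& s) {
    static const std::vector<Detector> detectors = {
        {Category::Email, looksLikeEmail},
        {Category::JSON, looksLikeJson},
        {Category::HexData, looksLikeHex},
    };
    assert(!detectors.empty());
    for (const Detector& d : detectors) {
        if (d.matches(s)) return d.category;
    }
    return Category::PlainText;
}
```

With a first-match-wins chain, the order of the detector table decides how ambiguous strings are bucketed, which is one reason to keep a conservative plain-text fallback at the end rather than forcing every string into a structured category.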

Training

  • make train (requires Python 3.8+)

Testing

  • make test

@AntoineBastide47 AntoineBastide47 changed the title String language detection Implement end-to-end string language detection and parser reporting Apr 17, 2026
@AntoineBastide47 AntoineBastide47 marked this pull request as ready for review April 17, 2026 16:23
@phamelink
Contributor

Hey, I'm gonna start reviewing this. You said in an email a while back that since it's a lot of lines you're gonna do several PRs. Is this still the case?

@AntoineBastide47
Contributor Author

> Hey, I'm gonna start reviewing this. You said in an email a while back that since it's a lot of lines you're gonna do several PRs. Is this still the case?

Hi, we discussed this on Friday with Bryan and concluded it would be quicker to do it in one go. Good luck soldier 🫡
(sorry for the high volume, I just got lost in the code)
