Implement end-to-end string language detection and parser reporting by AntoineBastide47 · Pull Request #11 · dedis/matchertext

AntoineBastide47 · 2026-04-12T15:08:08Z

Summary

This PR adds end-to-end string language detection to the LLVM pipeline.

The new classifier inspects extracted string literals and assigns them to higher-level categories such as URL, Email, FilePath, FormatString, SQL, JSON, YAML, HTML, XML, CSS, Regex, Shell, HexData, BinaryData, PseudoBinaryData, PlainText, and a few code-like fallback buckets. The goal is to make string-bucket output more useful for analysis, debugging, and downstream reporting.

The Pseudo prefix means that the string contains the detected language together with plain text or random junk. For example, this is a PseudoUrl:
"example domain: https://example.com"

Beyond the classifier itself, this PR also wires the result into the parser/stats flow, adds debug output support, introduces a training/generation script for the model-backed part of the system, and adds a broad regression suite around both direct classification and parser integration.

Extracted samples can be saved to disk using the --debug-languages option. This creates a result folder containing the parsed repository's results.

Recommendation: run make train build before testing.
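
To illustrate the Pseudo categories from the summary, here is a minimal sketch of how a whole-string URL could be told apart from a string that merely contains one. The names UrlKind, classifyUrl, and startsWithScheme are hypothetical and chosen for this example only; the PR's actual logic lives in LLVM/src/classifiers/URLClassifier.cpp and is likely more involved.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical result type for this sketch; not the enum from LanguageClassifier.hpp.
enum class UrlKind { NotUrl, Url, PseudoUrl };

// True if `s` begins with one of the two schemes this sketch recognises
// and has at least one character after the scheme.
inline bool startsWithScheme(std::string_view s) {
    return (s.size() > 7 && s.substr(0, 7) == "http://") ||
           (s.size() > 8 && s.substr(0, 8) == "https://");
}

// Url: the whole string is a URL (starts with a scheme, no whitespace).
// PseudoUrl: a URL scheme appears somewhere, mixed with other text.
// NotUrl: no scheme found at all.
inline UrlKind classifyUrl(std::string_view s) {
    const bool hasWhitespace =
        s.find_first_of(" \t\r\n") != std::string_view::npos;
    if (startsWithScheme(s) && !hasWhitespace) {
        return UrlKind::Url;
    }
    if (s.find("http://") != std::string_view::npos ||
        s.find("https://") != std::string_view::npos) {
        return UrlKind::PseudoUrl;
    }
    return UrlKind::NotUrl;
}

// Mirrors the PseudoUrl example given in the summary.
inline void checkPseudoUrlExample() {
    assert(classifyUrl("example domain: https://example.com") ==
           UrlKind::PseudoUrl);
}
```

Under these assumptions, "https://example.com" on its own is a Url, while the same URL preceded by free text falls into the Pseudo bucket.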

What changed

Added the public string-language classification API and runtime implementation.
Integrated classification into parser processing and reporting.
Split detection logic into dedicated classifier modules under LLVM/src/classifiers/ to keep the heuristics maintainable.
Added training/generation tooling and documentation for the classifier model.
Added regression tests for direct string classification, parser fixtures, and debug output.

Important files

LLVM/src/LanguageClassifier.cpp
Main classifier entry point. Contains the overall decision flow, classifier ordering, and model-backed scoring logic.
LLVM/include/LanguageClassifier.hpp
Public API for classification results and the language/category enum.
LLVM/src/Parser.cpp
Wires classification into the parser pipeline and reporting/debug output flow.
LLVM/src/classifiers/Internal.hpp
Shared helper logic and detector interfaces used across the classifier modules.
LLVM/src/classifiers/EmailClassifier.cpp
Handles email, pseudo-email, and related edge cases.
LLVM/src/classifiers/URLClassifier.cpp
Handles URL and pseudo-URL detection.
LLVM/src/classifiers/PlainTextClassifier.cpp
Contains the plain-text fallback heuristics, which are important for avoiding false positives.
LLVM/src/classifiers/YAMLClassifier.cpp
Representative structured-text detector; similar logic for other structured formats lives alongside it in LLVM/src/classifiers/.
LLVM/tests/language_detection_test.cpp
Main regression suite for direct string classification behavior.
LLVM/tests/parser_fixture_test.cpp
Verifies parser-level integration and bucket behavior on fixture inputs.
LLVM/tests/debug_language_output_test.cpp
Verifies the debug output generated from classified string buckets.
LLVM/train/generate_model.py
Training/generation script for the model-backed part of the classifier pipeline.
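
The file list above mentions a decision flow with classifier ordering and a plain-text fallback. A rough sketch of that general pattern, not the PR's actual code, could look like the following. Category, Rule, classify, and the looksLike* detectors are hypothetical names with deliberately simplistic heuristics, and the model-backed scoring step is left out entirely.

```cpp
#include <array>
#include <cassert>
#include <string_view>

// Hypothetical subset of categories; the real enum is declared in
// LLVM/include/LanguageClassifier.hpp and contains many more entries.
enum class Category { Url, Email, Json, PlainText };

// A detector returns true when it claims the string for its category.
using Detector = bool (*)(std::string_view);

inline bool looksLikeUrl(std::string_view s) {
    // rfind(prefix, 0) == 0 is a C++17 way to test "starts with".
    return s.rfind("http://", 0) == 0 || s.rfind("https://", 0) == 0;
}

inline bool looksLikeEmail(std::string_view s) {
    const auto npos = std::string_view::npos;
    const auto at = s.find('@');
    if (at == npos || at == 0 || s.find(' ') != npos) return false;
    const auto dot = s.find('.', at + 1);
    // Require a non-empty domain label before the dot and text after it.
    return dot != npos && dot > at + 1 && dot + 1 < s.size();
}

inline bool looksLikeJson(std::string_view s) {
    if (s.size() < 2) return false;
    return (s.front() == '{' && s.back() == '}') ||
           (s.front() == '[' && s.back() == ']');
}

struct Rule {
    Category category;
    Detector detect;
};

// Tries detectors in a fixed order; the first match wins, and anything
// unclaimed falls through to PlainText.
inline Category classify(std::string_view s) {
    static const std::array<Rule, 3> rules{{
        {Category::Url, &looksLikeUrl},
        {Category::Email, &looksLikeEmail},
        {Category::Json, &looksLikeJson},
    }};
    for (const Rule& rule : rules) {
        assert(rule.detect != nullptr);
        if (rule.detect(s)) return rule.category;
    }
    return Category::PlainText;
}
```

Keeping the order in a single table makes it easy to see which detector takes precedence when several could match, and the final fallback is what keeps unrecognised strings from being forced into a structured category.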

Training

make train (requires Python 3.8+)

Testing

make test

phamelink · 2026-04-22T15:58:55Z

Hey, I'm gonna start reviewing this. You said in an email a while back that since it's a lot of lines you're gonna do several PRs. Is this still the case?

AntoineBastide47 · 2026-04-22T17:18:00Z

> Hey, I'm gonna start reviewing this. You said in an email a while back that since it's a lot of lines you're gonna do several PRs. Is this still the case?

Hi, we discussed this on Friday with Bryan and concluded it would be quicker to do it in one go. Good luck soldier 🫡
(sorry for the high volume I just got lost in the code)

AntoineBastide47 added 15 commits March 24, 2026 19:10

feat: add basic model for detection

de52ef3

feat: add real tests

abe29d8

feat: add more docs + more tolerant python detection

d39ca27

feat: better language fine tunning

12155ea

Delete LanguageModel.generated.hpp

d9b21c1

feat: ignore generated file

0456a12

feat: add email, pseudo url and pseudo email detection

6c8b83b

refactor: split out classifiers into files

2a84d56

feat: fix tests + better timing logs

b370264

feat: even more fine tunning

98acb57

Merge branch 'main' into string-language-detection

3898b87

Delete generate_model.cpython-314.pyc

4b3a068

Update .gitignore

332f125

Merge branch 'string-language-detection' of https://github.com/dedis/…

6838c3c

…matchertext into string-language-detection

refactor: split up Internal.hpp

02bfd50

AntoineBastide47 force-pushed the string-language-detection branch from 1c081a7 to 02bfd50 Compare April 12, 2026 16:10

AntoineBastide47 closed this Apr 12, 2026

AntoineBastide47 reopened this Apr 17, 2026

AntoineBastide47 self-assigned this Apr 17, 2026

AntoineBastide47 requested review from bford and phamelink April 17, 2026 16:06

Update Metrics.md

c9c7e43

AntoineBastide47 changed the title ~~String language detection~~ Implement end-to-end string language detection and parser reporting Apr 17, 2026

AntoineBastide47 marked this pull request as ready for review April 17, 2026 16:23

AntoineBastide47 wants to merge 16 commits into main from string-language-detection
