Implement end-to-end string language detection and parser reporting#11
Open
AntoineBastide47 wants to merge 16 commits into main
Contributor
Hey, I'm gonna start reviewing this. You said in an email a while back that since it's a lot of lines you're gonna do several PRs. Is this still the case?
Contributor
Author
Hi, we discussed this on Friday with Bryan and concluded it would be quicker to do it in one go. Good luck, soldier 🫡
Summary
This PR adds end-to-end string language detection to the LLVM pipeline.
The new classifier inspects extracted string literals and assigns them to higher-level categories such as URL, Email, FilePath, FormatString, SQL, JSON, YAML, HTML, XML, CSS, Regex, Shell, HexData, BinaryData, PseudoBinaryData, PlainText, and a few code-like fallback buckets. The goal is to make string-bucket output more useful for analysis, debugging, and downstream reporting.
The "Pseudo" prefix means the string contains the recognized pattern together with surrounding plain text or unrelated characters, rather than consisting of the pattern alone. For example, this is a PseudoUrl:
"example domain: https://example.com"
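The exact/pseudo distinction can be sketched roughly as follows. This is a minimal Python illustration, not the classifier in this PR: the `URL_RE` pattern and the `classify_url` function are hypothetical, and the real detector may use entirely different rules.

```python
import re
from typing import Optional

# Hypothetical, deliberately loose http(s) matcher used only for illustration.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def classify_url(text: str) -> Optional[str]:
    """Return "URL", "PseudoUrl", or None for a string literal."""
    stripped = text.strip()
    match = URL_RE.search(stripped)
    if match is None:
        return None  # no URL-like substring at all
    if match.span() == (0, len(stripped)):
        return "URL"  # the whole literal is the URL
    return "PseudoUrl"  # a URL embedded in other text


print(classify_url("https://example.com"))
print(classify_url("example domain: https://example.com"))
```

Under these assumptions, a literal counts as exact only when the match spans the entire trimmed string; anything with extra text around the match falls into the pseudo category.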
Beyond the classifier itself, this PR also wires the result into the parser/stats flow, adds debug output support, introduces a training/generation script for the model-backed part of the system, and adds a broad regression suite around both direct classification and parser integration.
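As a rough picture of what feeding classification results into a stats flow can look like, here is a small Python sketch. It is an assumption-laden illustration: `bucket_counts` and `toy_classify` are hypothetical names, and the toy rules stand in for the real classifier, whose API is not shown in this description.

```python
import json
from collections import Counter
from typing import Callable, Iterable


def toy_classify(text: str) -> str:
    """Hypothetical stand-in classifier with three buckets."""
    if text.startswith(("http://", "https://")):
        return "URL"
    try:
        parsed = json.loads(text)
        if isinstance(parsed, (dict, list)):
            return "JSON"
    except ValueError:
        pass  # not valid JSON, fall through to the default bucket
    return "PlainText"


def bucket_counts(literals: Iterable[str],
                  classify: Callable[[str], str]) -> Counter:
    """Count how many extracted literals land in each category."""
    counts: Counter = Counter()
    for literal in literals:
        counts[classify(literal)] += 1
    return counts


samples = ["https://example.com", '{"a": 1}', "hello", "[1, 2]"]
print(bucket_counts(samples, toy_classify))
```

Passing the classifier in as a parameter keeps the aggregation step independent of how categories are decided, which is one plausible way to keep parser-level tests separate from classifier tests.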
Extracted samples can be saved to disk using the `--debug-languages` option; this creates a `result` folder with the parsed repository's results.
Recommendation: run `make train build` before testing.
What changed
Important files
Main classifier entry point. Contains the overall decision flow, classifier ordering, and model-backed scoring logic.
Public API for classification results and the language/category enum.
Wires classification into the parser pipeline and reporting/debug output flow.
Shared helper logic and detector interfaces used across the classifier modules.
Handles email, pseudo-email, and related edge cases.
Handles URL and pseudo-URL detection.
Contains the plain-text fallback heuristics, which are important for avoiding false positives.
Representative structured-text detector; similar logic for other structured formats lives alongside it in LLVM/src/classifiers/.
Main regression suite for direct string classification behavior.
Verifies parser-level integration and bucket behavior on fixture inputs.
Verifies the debug output generated from classified string buckets.
Training/generation script for the model-backed part of the classifier pipeline.
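The file list above mentions classifier ordering and a plain-text fallback. One common way to structure that kind of decision flow is an ordered detector chain, sketched below in Python. Everything here is hypothetical (the `DETECTORS` table, the regexes, the length threshold, and the ordering itself), and the sketch omits the model-backed scoring entirely.

```python
import re
from typing import Callable, List, Tuple

Detector = Callable[[str], bool]

# Hypothetical patterns, far looser than a production detector would be.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
HEX_RE = re.compile(r"(?:[0-9A-Fa-f]{2})+")

# Detectors run in order; the first one that accepts the string wins.
DETECTORS: List[Tuple[str, Detector]] = [
    ("URL", lambda s: URL_RE.fullmatch(s) is not None),
    ("Email", lambda s: EMAIL_RE.fullmatch(s) is not None),
    # Assumed minimum length so short words like "cafe" are not flagged.
    ("HexData", lambda s: len(s) >= 8 and HEX_RE.fullmatch(s) is not None),
]


def classify(text: str) -> str:
    """Return the first matching category, else the PlainText fallback."""
    for name, detect in DETECTORS:
        if detect(text):
            return name
    return "PlainText"


for sample in ["https://example.com", "user@example.com", "deadbeef", "cafe"]:
    print(sample, "->", classify(sample))
```

Placing stricter detectors first and ending with an unconditional fallback is what keeps ambiguous strings out of the more specific buckets in this sketch.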
Training
Testing