Skip to content

Conversation

@asgerf
Copy link
Contributor

@asgerf asgerf commented Dec 1, 2025

Improves the detection of minified files, by classifying files with an average line length over 200 as minified.

Minifiers typically compile a file to one long line, but sometimes there's a copyright comment in front of it, so we need to account for some normal-looking lines. Sometimes multiple such files are concatenated in a bundle.

Note that we already skip certain files based on filenames, such as *.min.js, but previously we did not detect minification based on contents.

@github-actions github-actions bot added the JS label Dec 1, 2025
@asgerf asgerf force-pushed the js/detect-minified-files branch from 25eaf27 to 38eef8a Compare December 3, 2025 08:23
@asgerf asgerf marked this pull request as ready for review December 5, 2025 09:42
@asgerf asgerf requested review from a team as code owners December 5, 2025 09:42
Copilot AI review requested due to automatic review settings December 5, 2025 09:42
Copilot finished reviewing on behalf of asgerf December 5, 2025 09:44
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the detection of minified JavaScript files by classifying files with an average line length greater than 200 characters as minified. Previously, only filename-based detection (e.g., *.min.js) was used. The new content-based detection can be disabled by setting the environment variable CODEQL_EXTRACTOR_JAVASCRIPT_ALLOW_MINIFIED_FILES=true.

  • Adds a new isMinified() method that calculates average line length by counting line breaks (handling both \n and \r\n)
  • Introduces a skipReason field to track why files are skipped during extraction
  • Makes the minified file filtering configurable via environment variable

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
javascript/extractor/src/com/semmle/js/extractor/ScriptExtractor.java Implements the core minified file detection logic based on average line length
javascript/extractor/src/com/semmle/js/extractor/ParseResultInfo.java Adds support for tracking skip reasons when files are not extracted
javascript/extractor/src/com/semmle/js/extractor/FileExtractor.java Updates extraction flow to handle and log skipped files
javascript/extractor/src/com/semmle/js/extractor/ExtractorConfig.java Adds configuration option to allow/disallow minified files
javascript/extractor/src/com/semmle/js/extractor/EnvironmentVariables.java Adds environment variable parsing for minified file handling
javascript/extractor/src/com/semmle/js/extractor/AutoBuild.java Applies the allowMinified configuration to both filename-based and content-based filtering
javascript/ql/src/change-notes/2025-12-05-skip-minified-files.md Documents the new behavior in release notes
javascript/ql/test/query-tests/filters/ClassifyFiles/ClassifyFiles.expected Updates test expectations to reflect that json-like.js is now detected as minified
javascript/ql/test/query-tests/Security/CWE-400/ReDoS/regexplib/dates.js Adds padding lines to prevent this test file from being classified as minified

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

asgerf and others added 2 commits December 5, 2025 11:14
….java

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
tausbn
tausbn previously approved these changes Dec 5, 2025
Copy link
Contributor

@tausbn tausbn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor suggestion, otherwise LGTM! 👍

@miketsprague
Copy link

miketsprague commented Dec 5, 2025

I'm wondering if there's some classes of cases we're going to miss by doing that (i.e. I would imagine URLs could result in >200 length lines, which are potentially prone to injection?), and if: 1) we'd be able to find this out when rolling it out; or 2) have a slightly different heuristic to avoid this?

I don't have a strong opinion here but something like:

const apiEndpoint = "https://api.example.com/v2/users/profile/settings/notifications/preferences/email?includeArchived=true&format=json&apiKey=abc123def456";
let blah = someOtherNormalJSStuff();
if(blah) {
  blah.obviouslyNotMinifiedJS();
}

Would be considered minified. No clue as to what the data looks like in terms of distribution of line lengths

@tausbn
Copy link
Contributor

tausbn commented Dec 5, 2025

I'm wondering if there's some classes of cases we're going to miss by doing that (i.e. I would imagine URLs could result in >200 length lines, which are potentially prone to injection?), and if: 1) we'd be able to find this out when rolling it out; or 2) have a slightly different heuristic to avoid this?

I don't have a strong opinion here but something like:

const apiEndpoint = "https://api.example.com/v2/users/profile/settings/notifications/preferences/email?includeArchived=true&format=json&apiKey=abc123def456";
let blah = someOtherNormalJSStuff();
if(blah) {
  blah.obviouslyNotMinifiedJS();
}

Would be considered minified. No clue as to what the data looks like in terms of distribution of line lengths

I don't think that's quite right. Your code snippet is ~240 characters and 5 line breaks, for an average line length of 48 -- well below the limit.

To get above an average of 200, you either need a few really long lines, or a lot of >200 character lines.

@miketsprague
Copy link

miketsprague commented Dec 5, 2025

Ya you're right thanks @tausbn -- my brain forgot what an average is and read it just as "max" 🙃 . A 200 average length one for the above example would basically a helper js file that just had URL definitions hah

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants