Skip to content

Fix: CodeQL ReDoS vulnerability in ULMFiT replace_url#1400

Open
chanitnan0jr wants to merge 1 commit intoPyThaiNLP:devfrom
chanitnan0jr:fix-ulmfit-redos
Open

Fix: CodeQL ReDoS vulnerability in ULMFiT replace_url#1400
chanitnan0jr wants to merge 1 commit intoPyThaiNLP:devfrom
chanitnan0jr:fix-ulmfit-redos

Conversation

@chanitnan0jr
Copy link
Copy Markdown

Description:

This PR addresses a critical Security Vulnerability (Regular Expression Denial of Service - ReDoS) flagged by CodeQL in pythainlp/ulmfit/preprocess.py.

The Issue:

The URL_PATTERN contained a nested greedy quantifier (?:[^\s()<>{}\[\]]+)+. When processing an invalid URL suffix containing repetitive non-space/non-bracket characters (e.g., !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!), the regex engine suffers from catastrophic exponential backtracking (O(2N)), leading to CPU exhaustion and potential DoS.

The Fix:

Removed the inner greedy quantifier +, simplifying the pattern to (?:[^\s()<>{}\[\]])+. This forces the regex engine to evaluate the string in linear time (O(N)), enabling it to fail-fast without altering the intended URL matching logic.

@chanitnan0jr chanitnan0jr force-pushed the fix-ulmfit-redos branch 2 times, most recently from 977a848 to ac08838 Compare April 4, 2026 23:13
@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud bot commented Apr 4, 2026

Quality Gate Failed Quality Gate failed

Failed conditions
D Maintainability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

Catch issues before they fail your Quality Gate with our IDE extension SonarQube for IDE

@chanitnan0jr
Copy link
Copy Markdown
Author

@sonarqubecloud The complexity issue flagged here is due to the exhaustive list of TLDs (.com|.net|.org|...) inherent to the original ULMFiT URL_PATTERN (ported from fastai).
Since this PR only aims to fix the ReDoS vulnerability by removing a nested quantifier (without altering the core matching logic), refactoring this entire pattern to satisfy the complexity rule would change the expected upstream behavior.

@bact bact added the security label Apr 5, 2026
@coveralls
Copy link
Copy Markdown

Coverage Status

coverage: 66.633%. remained the same
when pulling ac08838 on chanitnan0jr:fix-ulmfit-redos
into 3152a36 on PyThaiNLP:dev.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants