Fix: CodeQL ReDoS vulnerability in ULMFiT replace_url #1400
Open
chanitnan0jr wants to merge 1 commit into PyThaiNLP:dev
… ReDoS exponential backtracking
977a848 to
ac08838
Compare
Author
@sonarqubecloud The complexity issue flagged here is due to the exhaustive list of TLDs (.com|.net|.org|...) inherent to the original ULMFiT URL_PATTERN (ported from fastai).
wannaphong
approved these changes
Apr 12, 2026

Description:
This PR addresses a critical security vulnerability, a Regular Expression Denial of Service (ReDoS), flagged by CodeQL in pythainlp/ulmfit/preprocess.py.
The Issue:
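The URL_PATTERN contained a nested greedy quantifier, (?:[^\s()<>{}\[\]]+)+, where both the inner character class and the enclosing group are repeated with +. When the input has an invalid URL suffix made of repeated non-space, non-bracket characters (e.g., !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!), the regex engine can split that run between the inner and outer repetition in exponentially many ways and tries each split before giving up. This is catastrophic backtracking, O(2^N) in the length N of the run, and it leads to CPU exhaustion and a potential DoS.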
The Fix:
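Removed the inner greedy quantifier +, simplifying the pattern to (?:[^\s()<>{}\[\]])+. The group still accepts exactly the same strings (one or more characters from the class), but there is now only one way to consume a given run, so the engine evaluates this group in linear time (O(N)) and fails fast on non-matching input without altering the intended URL matching logic.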
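
The difference between the two groups can be reproduced in isolation. The following is a minimal sketch, not the code from this PR: NESTED and FLAT are hypothetical, simplified stand-ins that contain only the character-class group discussed above, anchored with ^ and $ so that a trailing "(" forces a failed match. The real URL_PATTERN in pythainlp/ulmfit/preprocess.py is longer, and the helper names are assumptions made for illustration.

```python
import re
import time

# Hypothetical, simplified stand-ins for the group discussed above; the real
# URL_PATTERN in pythainlp/ulmfit/preprocess.py is longer than this.
NESTED = re.compile(r"^(?:[^\s()<>{}\[\]]+)+$")  # before: + inside and outside the group
FLAT = re.compile(r"^(?:[^\s()<>{}\[\]])+$")     # after: outer + only


def time_match(pattern, text):
    """Return (matched, seconds) for a single attempt to match `text`."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start


def same_verdict(text):
    """True when both patterns agree on whether `text` matches."""
    return (NESTED.match(text) is None) == (FLAT.match(text) is None)


if __name__ == "__main__":
    # A run of "!" followed by "(" cannot match, so the engine has to backtrack.
    # n is kept small on purpose: with a backtracking engine the nested form
    # is expected to slow down sharply as n grows.
    for n in (10, 14, 18):
        text = "!" * n + "("
        _, nested_s = time_match(NESTED, text)
        _, flat_s = time_match(FLAT, text)
        print(f"n={n:2d}  nested={nested_s:.5f}s  flat={flat_s:.5f}s")
```

Running the script prints timings for a few short inputs; the absolute numbers depend on the machine and Python version, so none are quoted here. same_verdict can be used to spot-check that both forms accept and reject the same strings, which is the property the fix relies on.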