Fix: Replace surrogates in Unicode code points #164

Open
Synrom wants to merge 2 commits into scrapy:master from Synrom:fix/replace_surrogate_in_unicode_code_points

@Synrom Synrom commented May 7, 2026

This PR fixes a bug discovered by oss-fuzz. You can find the reproducer here.

The crash happens because surrogate code points in Unicode escape sequences are not handled: they are passed through unchanged.
The resulting string, which still contains the surrogate, then fails when it reaches ascii_lower:

def ascii_lower(string: str) -> str:
    """Lower-case, but only in the ASCII range."""
    return string.encode("utf8").lower().decode("utf8")

UnicodeEncodeError: 'utf-8' codec can't encode character '\udddd' in position 3: surrogates not allowed

This PR fixes the bug by replacing surrogates with U+FFFD REPLACEMENT CHARACTER (�) in the _replace_unicode function, which matches the behavior the spec requires.
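For readers who want to see the idea in isolation, here is a minimal sketch of the replacement logic. It is not the code from this PR: the regex `_UNICODE_ESCAPE`, the helper names `replace_unicode_sketch` and `unescape_sketch`, and the out-of-range guard are assumptions made for illustration. Only `ascii_lower` is copied from the description above.

```python
import re
import sys

REPLACEMENT_CHARACTER = 0xFFFD

# Hypothetical escape pattern (an assumption, not the project's regex):
# a backslash, 1 to 6 hex digits, then one optional whitespace terminator.
_UNICODE_ESCAPE = re.compile(r"\\([0-9a-fA-F]{1,6})(?:\r\n|[ \n\r\t\f])?")


def replace_unicode_sketch(match: re.Match) -> str:
    """Turn one matched escape into a character, never a lone surrogate."""
    codepoint = int(match.group(1), 16)
    # Surrogates (U+D800 to U+DFFF) cannot be encoded as UTF-8.
    is_surrogate = 0xD800 <= codepoint <= 0xDFFF
    # Assumed extra guard: chr() raises ValueError above sys.maxunicode.
    if is_surrogate or codepoint > sys.maxunicode:
        codepoint = REPLACEMENT_CHARACTER
    return chr(codepoint)


def unescape_sketch(text: str) -> str:
    """Replace every hex escape in text using the function above."""
    return _UNICODE_ESCAPE.sub(replace_unicode_sketch, text)


def ascii_lower(string: str) -> str:
    """Lower-case, but only in the ASCII range."""
    return string.encode("utf8").lower().decode("utf8")


# A lone surrogate in the string reproduces the reported error.
try:
    ascii_lower("abc\udddd")
except UnicodeEncodeError as exc:
    print("crash:", exc.reason)

# With the replacement in place, the same escape is safe to lower-case.
print(ascii_lower(unescape_sketch("ABC\\dddd")) == "abc\ufffd")
```

The check sits in the replacement function so that no string leaving the unescape step can contain a surrogate, which means later callers such as `ascii_lower` need no special handling.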

It also adds a test case that reproduces the crash. With this PR applied, the oss-fuzz reproducer no longer crashes.

codecov Bot commented May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.83%. Comparing base (743c6e5) to head (0109175).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #164      +/-   ##
==========================================
+ Coverage   96.61%   96.83%   +0.22%     
==========================================
  Files           3        3              
  Lines         885      885              
  Branches      136      136              
==========================================
+ Hits          855      857       +2     
+ Misses         14       13       -1     
+ Partials       16       15       -1     

Member

wRAR commented May 7, 2026

Please run pre-commit.

Author

Synrom commented May 7, 2026

Sorry, I ran pre-commit now :)

