(code is entirely generated by AI.) Preserve non-UTF-8 query values when cleaning fields #520

Open
dongfengweixiao wants to merge 1 commit into ClearURLs:master from dongfengweixiao:fix/preserve-non-utf8-query-values

@dongfengweixiao commented Jun 15, 2026

Query parameter values that use a non-UTF-8 encoding (e.g. GBK on 1688.com, Big5, Shift-JIS) were being corrupted whenever ClearURLs rewrote a URL. Searching "usb转串口" on 1688 turned
keywords=usb%D7%AA%B4%AE%BF%DA into
keywords=usb%D7%AA%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD.

Root cause: removeFieldsFormURL round-tripped every field through URLSearchParams, which decodes percent-encoded bytes as UTF-8. Bytes that are invalid UTF-8 (the high bytes of GBK-encoded text) became the U+FFFD replacement character, which then re-encodes as %EF%BF%BD.
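
The round trip described above can be reproduced in isolation. The following is a minimal sketch that uses only the standard URLSearchParams API and the query string from this report; it does not call any ClearURLs code, and the variable names are illustrative.

```javascript
// Raw query exactly as 1688.com sends it: "usb" followed by GBK bytes.
const original = "keywords=usb%D7%AA%B4%AE%BF%DA";

// Parsing decodes the percent-encoded bytes as UTF-8;
// serializing encodes the resulting string as UTF-8 again.
const roundTripped = new URLSearchParams(original).toString();

console.log(roundTripped);
// keywords=usb%D7%AA%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD

console.log(roundTripped === original); // false
```

The first pair, %D7%AA, survives only because those two bytes happen to form a valid UTF-8 sequence. Each of the remaining four bytes is invalid on its own and becomes U+FFFD.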

Fix: filter the raw query string byte-for-byte via the new removeFieldsFromQuery(), which decodes only the parameter keys for rule matching and never decodes or re-encodes the values. The values of fields that are kept therefore survive with their exact original byte sequence, so GBK/Big5/Shift-JIS values are no longer damaged.

  • Add percentDecodeBytes() and removeFieldsFromQuery() in core_js/tools.js
  • Switch removeFieldsFormURL to operate on the raw query string
  • Remove the now-unused urlSearchParamsToString()
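
The idea behind the raw-string filter can be sketched as follows. This is not the code from this PR: filterRawQuery and decodeKey are hypothetical names, the rules are assumed to be an array of RegExp objects, and keys are decoded with decodeURIComponent plus a fallback rather than with the PR's percentDecodeBytes().

```javascript
// Hypothetical helper: decode a key only so that rules can be matched against it.
function decodeKey(rawKey) {
    try {
        // "+" means a space in form-encoded query strings.
        return decodeURIComponent(rawKey.replace(/\+/g, " "));
    } catch (e) {
        // The key itself is not valid UTF-8: match against the raw text instead.
        return rawKey;
    }
}

// Hypothetical filter: drop pairs whose key matches a rule, keep the rest untouched.
function filterRawQuery(rawQuery, rules) {
    return rawQuery
        .split("&")
        .filter((pair) => {
            const eq = pair.indexOf("=");
            const key = decodeKey(eq === -1 ? pair : pair.slice(0, eq));
            // The value part of the pair is never decoded or re-encoded.
            return !rules.some((rule) => rule.test(key));
        })
        .join("&");
}

const cleaned = filterRawQuery(
    "keywords=usb%D7%AA%B4%AE%BF%DA&utm_source=feed",
    [/^utm_source$/i]
);
console.log(cleaned); // keywords=usb%D7%AA%B4%AE%BF%DA
```

Because kept pairs are copied through as substrings of the original query, their bytes cannot change regardless of the encoding the site used. The rules here deliberately omit the g flag, since a global RegExp keeps state between test() calls.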

Fragments, raw rules, redirections, logging, and change detection are unchanged.

#398

@dongfengweixiao force-pushed the fix/preserve-non-utf8-query-values branch 4 times, most recently from fe1267b to 1d34062 June 15, 2026 10:44
Co-Authored-By: Claude <noreply@anthropic.com>
@dongfengweixiao force-pushed the fix/preserve-non-utf8-query-values branch from 1d34062 to 156813f June 15, 2026 10:52
@dongfengweixiao changed the title from "Preserve non-UTF-8 query values when cleaning fields" to "(code is entirely generated by AI.) Preserve non-UTF-8 query values when cleaning fields" Jun 15, 2026
@dongfengweixiao

@wxy If you have time, could you take a look at this bug and check whether the AI-generated code actually fixes it?
