fix: replacing specific unicode characters with all unicode letters#30

Merged
chrisbottin merged 3 commits into
chrisbottin:masterfrom
HackZers7:fix/expanded-unicode
May 21, 2026
Conversation

@HackZers7

What Changed

The closing-tag parsing regex was updated from:

/^<\/[\w-:.\u00C0-\u00FF]+\s*>/

to:

/^<\/[\p{L}\w\-:.]+\s*>/u

The previous range \u00C0-\u00FF covers only the Latin-1 Supplement block (À through ÿ), so closing tags whose names use other scripts, such as Cyrillic or CJK, were not matched.
The new pattern uses \p{L} in Unicode mode (the u flag), which matches letters from any script and so extends closing-tag parsing to international tag names.
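
The difference can be checked directly against the two patterns. The snippet below is a minimal sketch, not code from the repository: the patterns are copied from the description above, while the sample tags and variable names are illustrative assumptions. It also shows why the hyphen is now written as \-: in Unicode mode an unescaped "-" directly after \w is rejected as an invalid character class.

```typescript
// Sketch only: the two patterns come from the PR description;
// the sample tags and all variable names are illustrative assumptions.
// Literal forms: /^<\/[\w-:.\u00C0-\u00FF]+\s*>/ and /^<\/[\p{L}\w\-:.]+\s*>/u
const oldClosingTag = new RegExp("^<\\/[\\w-:.\\u00C0-\\u00FF]+\\s*>");
const newClosingTag = new RegExp("^<\\/[\\p{L}\\w\\-:.]+\\s*>", "u");

// One ASCII name, one Latin-1 name, one Cyrillic name, one CJK name.
const samples: string[] = ["</note>", "</título>", "</имя>", "</名前>"];

const results = samples.map((tag) => ({
  tag,
  old: oldClosingTag.test(tag),
  updated: newClosingTag.test(tag),
}));

for (const r of results) {
  console.log(r.tag, "old:", r.old, "new:", r.updated);
}

// In Unicode mode, "\w-:" is a syntax error rather than three literal
// alternatives, so the hyphen must be escaped as in the new pattern.
let unescapedHyphenThrows = false;
try {
  new RegExp("[\\w-:]", "u");
} catch (e) {
  unescapedHyphenThrows = e instanceof SyntaxError;
}
console.log("unescaped hyphen rejected in u mode:", unescapedHyphenThrows);
```

With these samples, the old pattern accepts only the ASCII and Latin-1 names, while the new one accepts all four.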

@chrisbottin
Owner

Thanks @HackZers7 for your contribution.

Can you also add a test similar to https://github.com/chrisbottin/xml-parser/blob/master/test/index.ts#L383 to cover the support of additional unicode letters?

@HackZers7
Author

Added a test to check the parsing of various languages. An error that occurred for Oriental languages has also been fixed.

The test data is synthetic, generated using GPT.

@chrisbottin chrisbottin self-requested a review May 21, 2026 20:53
@chrisbottin chrisbottin merged commit 2a118b2 into chrisbottin:master May 21, 2026
1 check passed
@chrisbottin
Owner

Thanks @HackZers7 for your contribution, the PR is merged and a new version 4.1.6 has been published.
