The SFM importer's character-validity check allowed only the BMP XML ranges ([0x20-0xD7FF], [0xE000-0xFFFD]) and omitted the supplementary planes ([0x10000-0x10FFFF]). .NET strings encode those code points as UTF-16 surrogate pairs, so each surrogate code unit was flagged invalid and removed, corrupting any imported data containing characters above U+FFFF (e.g. Wancho, Gothic, Adlam, emoji). Recognize a valid high+low surrogate pair as a single supplementary character and preserve it; lone/unpaired surrogates are still removed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
The SFM importer's character-validity check allowed only the BMP XML ranges ([0x20-0xD7FF], [0xE000-0xFFFD]) and omitted the supplementary planes ([0x10000-0x10FFFF]). .NET strings encode those code points as UTF-16 surrogate pairs, so each surrogate code unit was flagged invalid and removed, corrupting any imported data containing characters above U+FFFF (e.g. Wancho, Gothic, Adlam, emoji).
This produced errors like "SFM 'de' contains character value 0xD800, which is invalid and has been removed" and silently dropped the affected characters from the import.
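
The failure mode described above can be reproduced in a few lines. This is only an illustrative sketch in Python, not the importer's C# code: `utf16_units`, `is_valid_bmp_only`, and `filter_per_unit` are hypothetical names, and the string is unpacked into UTF-16 code units by hand to mimic how a .NET string stores it.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text, mirroring how a .NET string stores it."""
    data = text.encode("utf-16-le", "surrogatepass")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def is_valid_bmp_only(unit):
    # Only the two BMP ranges quoted in the summary; no supplementary planes.
    return 0x20 <= unit <= 0xD7FF or 0xE000 <= unit <= 0xFFFD

def filter_per_unit(text):
    """Hypothetical stand-in for the old check: judge every code unit alone."""
    kept, removed = [], []
    for unit in utf16_units(text):
        (kept if is_valid_bmp_only(unit) else removed).append(unit)
    return "".join(map(chr, kept)), removed

wancho = "\U0001E2C0"  # first code point of the Wancho block (supplementary plane)
text, removed = filter_per_unit("a" + wancho + "b")
print(repr(text), [hex(u) for u in removed])  # 'ab' ['0xd838', '0xdec0']
```

Both halves of the pair fall in the gap between 0xD7FF and 0xE000, so a check that looks at one code unit at a time can never accept a supplementary-plane character.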
Fix
In Sfm2Xml.Converter.ProcessSFMandData, recognize a valid high+low surrogate pair as a single supplementary-plane character and preserve it. Lone/unpaired surrogates are still removed (they remain invalid XML).
Testing
ConverterPreservesSupplementaryPlaneCharacters imports an SFM entry with supplementary-plane (Wancho) characters and asserts they survive the conversion. All Sfm2XmlTests pass.
Conversion of .db + .map files. Before: 14 invalid-character errors; after: 0 errors, with all supplementary code points (Gothic, Adlam, emoji) preserved intact and the belly/navel synonym cross-references correct.
The conversion-to-XML stage (where the bug lived and where the reported errors originated) is fully validated. End-to-end GUI import was not automated in this change.
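
The behavior these tests exercise (valid pairs kept, lone surrogates dropped) can be sketched as follows. This is a Python illustration under stated assumptions, not the code in Sfm2Xml.Converter.ProcessSFMandData: all function names are hypothetical, and the BMP ranges are the two quoted in the summary.

```python
import struct

def to_units(text):
    """UTF-16 code units of text, as a .NET string would hold them."""
    data = text.encode("utf-16-le", "surrogatepass")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def from_units(units):
    """Rebuild a string from well-formed UTF-16 code units."""
    return struct.pack("<%dH" % len(units), *units).decode("utf-16-le")

def is_high_surrogate(u):
    return 0xD800 <= u <= 0xDBFF

def is_low_surrogate(u):
    return 0xDC00 <= u <= 0xDFFF

def is_valid_bmp(u):
    # The two BMP ranges quoted in the summary.
    return 0x20 <= u <= 0xD7FF or 0xE000 <= u <= 0xFFFD

def filter_pair_aware(units):
    """Keep valid BMP units and high+low pairs; drop everything else."""
    kept, removed = [], []
    i = 0
    while i < len(units):
        u = units[i]
        if (is_high_surrogate(u) and i + 1 < len(units)
                and is_low_surrogate(units[i + 1])):
            # A well-formed pair encodes one code point in 0x10000-0x10FFFF.
            kept += [u, units[i + 1]]
            i += 2
        elif is_valid_bmp(u):
            kept.append(u)
            i += 1
        else:
            # Lone or misordered surrogates (and other invalid units) are dropped.
            removed.append(u)
            i += 1
    return kept, removed

kept, removed = filter_pair_aware(to_units("a\U0001E2C0b"))
assert from_units(kept) == "a\U0001E2C0b" and removed == []
```

A unit-level loop with one unit of lookahead is enough here, because a well-formed pair always decodes into the supplementary range that XML permits.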
Notes
LT-22020 (Adlam block import) shares this exact root cause and is also resolved by this fix, but is intentionally not referenced here — it is not in the current sprint and will be tested/closed separately once a build is available.
🤖 Generated with Claude Code