LT-20644: Fix SFM import stripping upper Unicode plane chars#960

Open
mark-sil wants to merge 1 commit into main from LT-20644

mark-sil commented Jun 18, 2026

Contributor

Summary

The SFM importer's character-validity check allowed only the BMP XML ranges ([0x20-0xD7FF], [0xE000-0xFFFD]) and omitted the supplementary planes ([0x10000-0x10FFFF]). .NET strings encode those code points as UTF-16 surrogate pairs, so each surrogate code unit was flagged invalid and removed — corrupting any imported data containing characters above U+FFFF (e.g. Wancho, Gothic, Adlam, emoji).

This produced errors like "SFM 'de' contains character value 0xD800, which is invalid and has been removed" and silently dropped the affected characters from the import.
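
To make the failure mode concrete, here is a minimal sketch of how one supplementary-plane character looks to a per-`char` check. `IsValidBmpOnly` is a hypothetical stand-in for the old validity test, not the actual Sfm2Xml code, and it covers only the two ranges quoted above.

```csharp
using System;

static class SurrogateDemo
{
    // Hypothetical stand-in for a BMP-only validity check (assumption, not the real Sfm2Xml method).
    public static bool IsValidBmpOnly(char c) =>
        (c >= 0x20 && c <= 0xD7FF) || (c >= 0xE000 && c <= 0xFFFD);

    static void Main()
    {
        // U+10330 GOTHIC LETTER AHSA lies above U+FFFF, so .NET stores it as two UTF-16 code units.
        string s = char.ConvertFromUtf32(0x10330);

        Console.WriteLine(s.Length);                              // 2
        Console.WriteLine($"0x{(int)s[0]:X4} 0x{(int)s[1]:X4}");  // 0xD800 0xDF30
        Console.WriteLine(IsValidBmpOnly(s[0]));                  // False: high surrogate rejected
        Console.WriteLine(IsValidBmpOnly(s[1]));                  // False: low surrogate rejected
    }
}
```

Both halves fall in the 0xD800-0xDFFF gap between the two allowed ranges, which is consistent with the 0xD800 value in the reported error message.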

Fix

In Sfm2Xml.Converter.ProcessSFMandData, recognize a valid high+low surrogate pair as a single supplementary-plane character and preserve it. Lone/unpaired surrogates are still removed (they remain invalid XML).
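
The approach can be sketched roughly as follows. This is an illustration only, not the code in this PR: `SurrogateAwareFilter`, `RemoveInvalidChars`, and `IsValidBmp` are hypothetical names, and the sketch ignores the error reporting that the real converter performs.

```csharp
using System;
using System.Text;

static class SurrogateAwareFilter
{
    // Hypothetical helper using the two BMP ranges from the summary (assumption, not the real check).
    static bool IsValidBmp(char c) =>
        (c >= 0x20 && c <= 0xD7FF) || (c >= 0xE000 && c <= 0xFFFD);

    // Keeps valid BMP characters and well-formed surrogate pairs; drops everything else.
    public static string RemoveInvalidChars(string input)
    {
        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (char.IsHighSurrogate(c) && i + 1 < input.Length && char.IsLowSurrogate(input[i + 1]))
            {
                // A high+low pair encodes one code point in U+10000-U+10FFFF: keep both units.
                sb.Append(c).Append(input[i + 1]);
                i++; // skip the low surrogate already consumed
            }
            else if (IsValidBmp(c))
            {
                sb.Append(c);
            }
            // Otherwise: a lone surrogate or another invalid character, which is dropped.
        }
        return sb.ToString();
    }

    static void Main()
    {
        string gothic = char.ConvertFromUtf32(0x10330);   // supplementary-plane character
        string input = "a" + gothic + "\uD800" + "b";     // includes one unpaired high surrogate
        string output = RemoveInvalidChars(input);
        Console.WriteLine(output == "a" + gothic + "b");  // True
    }
}
```

Checking for the pair before the single-`char` test matters: a surrogate never passes the BMP ranges on its own, so the pair has to be recognized first.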

Testing

  • New unit test ConverterPreservesSupplementaryPlaneCharacters imports an SFM entry with supplementary-plane (Wancho) characters and asserts they survive the conversion. All Sfm2XmlTests pass.
  • Real-data verification: ran the production converter against the ticket's actual .db + .map files. Before: 14 invalid-character errors; after: 0 errors, with all supplementary code points (Gothic, Adlam, emoji) preserved intact and the belly/navel synonym cross-references correct.

The conversion-to-XML stage, where the bug lived and where the reported errors originated, is covered by both the new unit test and the real-data run above. End-to-end GUI import was not automated in this change.

Notes

LT-22020 (Adlam block import) shares this exact root cause and is also resolved by this fix, but is intentionally not referenced here — it is not in the current sprint and will be tested/closed separately once a build is available.

🤖 Generated with Claude Code



The SFM importer's character-validity check allowed only the BMP XML
ranges ([0x20-0xD7FF], [0xE000-0xFFFD]) and omitted the supplementary
planes ([0x10000-0x10FFFF]). .NET strings encode those code points as
UTF-16 surrogate pairs, so each surrogate code unit was flagged invalid
and removed, corrupting any imported data containing characters above
U+FFFF (e.g. Wancho, Gothic, Adlam, emoji).

Recognize a valid high+low surrogate pair as a single supplementary
character and preserve it; lone/unpaired surrogates are still removed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions


NUnit Tests

    1 files  ±0      1 suites  ±0   10m 31s ⏱️ -1s
4 252 tests +1  4 179 ✅ +1  73 💤 ±0  0 ❌ ±0 
4 261 runs  +1  4 188 ✅ +1  73 💤 ±0  0 ❌ ±0 

Results for commit 3be2a94. ± Comparison against base commit fd245f0.
