LT-20644: Fix SFM import stripping upper Unicode plane chars by mark-sil · Pull Request #960 · sillsdev/FieldWorks

mark-sil · 2026-06-18T16:17:53Z

Summary

The SFM importer's character-validity check allowed only the BMP XML ranges ([0x20-0xD7FF], [0xE000-0xFFFD]) and omitted the supplementary planes ([0x10000-0x10FFFF]). .NET strings encode those code points as UTF-16 surrogate pairs, so each surrogate code unit was flagged invalid and removed — corrupting any imported data containing characters above U+FFFF (e.g. Wancho, Gothic, Adlam, emoji).

This produced errors like "SFM 'de' contains character value 0xD800, which is invalid and has been removed" and silently dropped the affected characters from the import.
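
To make the failure mode concrete, here is a minimal sketch of how a per-code-unit check rejects both halves of a surrogate pair. The class `SurrogateDemo` and its methods are hypothetical names for illustration, not the converter's code. The range check only mirrors the two BMP ranges quoted above.

```csharp
using System;
using System.Linq;

// Hypothetical illustration; not the actual Sfm2Xml code.
static class SurrogateDemo
{
    // Mirrors the BMP-only ranges from the description:
    // [0x20-0xD7FF] and [0xE000-0xFFFD]. Surrogates (0xD800-0xDFFF) fall in the gap.
    public static bool IsValidBmpOnly(char c) =>
        (c >= 0x20 && c <= 0xD7FF) || (c >= 0xE000 && c <= 0xFFFD);

    // Returns one entry per UTF-16 code unit of the given code point,
    // showing its value and whether the BMP-only check accepts it.
    public static string[] DescribeCodeUnits(int codePoint) =>
        char.ConvertFromUtf32(codePoint)
            .Select(c => $"0x{(int)c:X4} valid={IsValidBmpOnly(c)}")
            .ToArray();
}
```

For U+10330 (GOTHIC LETTER AHSA), `DescribeCodeUnits(0x10330)` yields two entries, `0xD800 valid=False` and `0xDF30 valid=False`. The first matches the 0xD800 value in the error message quoted above.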

Fix

In Sfm2Xml.Converter.ProcessSFMandData, recognize a valid high+low surrogate pair as a single supplementary-plane character and preserve it. Lone/unpaired surrogates are still removed (they remain invalid XML).
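
The approach described above can be sketched roughly as follows. This is an illustrative filter under stated assumptions, not the code in `ProcessSFMandData`: the class `XmlCharFilter` and the method `RemoveInvalidXmlChars` are hypothetical names, and the BMP check only reproduces the two ranges quoted in the summary.

```csharp
using System.Text;

// Hypothetical sketch of the pairing logic; the real converter also reports
// an error for each removed character, which is omitted here.
static class XmlCharFilter
{
    // BMP ranges as quoted in the PR summary (assumption: any other
    // whitespace handling happens elsewhere in the converter).
    static bool IsValidBmp(char c) =>
        (c >= 0x20 && c <= 0xD7FF) || (c >= 0xE000 && c <= 0xFFFD);

    public static string RemoveInvalidXmlChars(string input)
    {
        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            // A high surrogate immediately followed by a low surrogate encodes
            // one code point in [0x10000-0x10FFFF]; keep both units together.
            if (char.IsHighSurrogate(c) && i + 1 < input.Length
                && char.IsLowSurrogate(input[i + 1]))
            {
                sb.Append(c).Append(input[i + 1]);
                i++; // skip the low surrogate already consumed
            }
            else if (IsValidBmp(c))
            {
                sb.Append(c);
            }
            // Otherwise drop the unit: lone surrogates and other invalid values.
        }
        return sb.ToString();
    }
}
```

With this sketch, a string containing a Wancho letter such as U+1E2C0 passes through unchanged, while an unpaired `\uD800` is dropped.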

Testing

New unit test ConverterPreservesSupplementaryPlaneCharacters imports an SFM entry with supplementary-plane (Wancho) characters and asserts they survive the conversion. All Sfm2XmlTests pass.
Real-data verification: ran the production converter against the ticket's actual .db + .map files. Before: 14 invalid-character errors; after: 0 errors, with all supplementary code points (Gothic, Adlam, emoji) preserved intact and the belly/navel synonym cross-references correct.

The conversion-to-XML stage (where the bug lived and where the reported errors originated) is fully validated. End-to-end GUI import was not automated in this change.

Notes

LT-22020 (Adlam block import) shares this exact root cause and is also resolved by this fix, but is intentionally not referenced here — it is not in the current sprint and will be tested/closed separately once a build is available.

🤖 Generated with Claude Code


The SFM importer's character-validity check allowed only the BMP XML ranges ([0x20-0xD7FF], [0xE000-0xFFFD]) and omitted the supplementary planes ([0x10000-0x10FFFF]). .NET strings encode those code points as UTF-16 surrogate pairs, so each surrogate code unit was flagged invalid and removed, corrupting any imported data containing characters above U+FFFF (e.g. Wancho, Gothic, Adlam, emoji).

Recognize a valid high+low surrogate pair as a single supplementary character and preserve it; lone/unpaired surrogates are still removed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions · 2026-06-18T16:31:08Z

NUnit Tests

1 file ±0, 1 suite ±0, duration 10m 31s ⏱️ (-1s)
4 252 tests (+1): 4 179 ✅ passed (+1), 73 💤 skipped (±0), 0 ❌ failed (±0)
4 261 runs (+1): 4 188 ✅ passed (+1), 73 💤 skipped (±0), 0 ❌ failed (±0)

Results for commit 3be2a94. ± Comparison against base commit fd245f0.

mark-sil wants to merge 1 commit into main from LT-20644

mark-sil commented Jun 18, 2026 • edited by jasonleenaylor
