Several TEI XML files in the corpus contain invisible Unicode format characters (e.g. zero-width spaces). These characters are not visible in most editors, but VS Code and similar tools display them as control characters.
Example from a TEI note:
<note place="foot" type="n1" targetEnd="#w8">
<p xml:space="preserve">
vgl. Cicero De Div. I,102
</p>
</note>
There is an invisible character before vgl. (likely U+200C ZERO WIDTH NON-JOINER or U+200B ZERO WIDTH SPACE).
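To confirm which character is actually present, the code points can be listed explicitly. The following is a minimal Python sketch using only the standard library; `find_invisible` is a hypothetical helper name and not part of the existing scripts, and the sample string assumes the character is U+200C.

```python
import unicodedata

def find_invisible(text):
    """Return (offset, code point, Unicode name) for each format character (category Cf)."""
    hits = []
    for offset, ch in enumerate(text):
        if unicodedata.category(ch) == "Cf":
            hits.append((offset, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")))
    return hits

# Assumed sample: the note text with a zero-width non-joiner in front of "vgl."
sample = "\u200cvgl. Cicero De Div. I,102"
print(find_invisible(sample))
```

Running this over the text content of a suspicious note would show whether the character is U+200C, U+200B, or something else.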
Impact
These characters cause several issues in the processing pipeline:
- Empty footnotes appear in the rendered output because the transformation pipeline sometimes interprets the invisible character as content.
- Similar effects may occur in other contexts where whitespace or text nodes are evaluated.
- They are related to these issues for corpora: ...
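The empty-footnote effect is consistent with how common whitespace checks treat these characters: they are not whitespace, so a note that contains only a zero-width character does not count as empty. The pipeline's own check is not shown in this issue; the Python sketch below only illustrates the general behaviour with the standard library.

```python
# A text node that contains only a zero-width space (hypothetical example).
note_text = "\u200b"

# str.strip() removes whitespace only. U+200B is a format character
# (category Cf), so it survives stripping and the node still looks non-empty.
looks_empty = note_text.strip() == ""
print(looks_empty)  # False
```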
Proposed solution
Add a corpus sanitization step to the scripts pipeline that removes Unicode format characters that should not occur in TEI text nodes.
Characters to remove include, for example:
U+200B ZERO WIDTH SPACE
U+200C ZERO WIDTH NON-JOINER
U+200D ZERO WIDTH JOINER
U+FEFF ZERO WIDTH NO-BREAK SPACE (BOM)
U+00AD SOFT HYPHEN
A regex that covers the most common cases:
[\u200B-\u200D\uFEFF\u00AD]
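A sanitization step built on this regex could look roughly like the following Python sketch. The function names `sanitize_text` and `sanitize_file` are hypothetical, and the sketch assumes the corpus files are UTF-8 encoded; it is not the existing pipeline code.

```python
import re

# Same character class as the regex above.
INVISIBLE = re.compile("[\u200B-\u200D\uFEFF\u00AD]")

def sanitize_text(text):
    """Remove the listed zero-width characters and soft hyphens from text."""
    return INVISIBLE.sub("", text)

def sanitize_file(path):
    """Clean one file in place and return the number of removed characters.

    The file is rewritten only if something changed. newline="" keeps the
    original line endings untouched.
    """
    with open(path, encoding="utf-8", newline="") as handle:
        original = handle.read()
    cleaned = sanitize_text(original)
    if cleaned != original:
        with open(path, "w", encoding="utf-8", newline="") as handle:
            handle.write(cleaned)
    return len(original) - len(cleaned)

print(repr(sanitize_text("\u200cvgl. Cicero De Div. I,102")))
```

Returning the count of removed characters makes it easy to log which files were affected. Whether U+200C, U+200D and U+00AD are always safe to drop depends on the languages and scripts in the corpus, so the character class may need adjusting.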