Remove invisible Unicode control characters in Latin files #30

@tschomacker

Description

Several TEI XML files in the corpus contain invisible Unicode control characters (e.g. zero-width spaces). These characters are not visible in normal editors but appear as control characters in VS Code and similar tools.

Example from a TEI note:

<note place="foot" type="n1" targetEnd="#w8">
  <p xml:space="preserve">
    ‌vgl. Cicero De Div. I,102
  </p>
</note>

There is an invisible character before vgl. (likely U+200C ZERO WIDTH NON-JOINER or U+200B ZERO WIDTH SPACE).
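To confirm which character is actually present, the text can be inspected by Unicode category. The following is a minimal sketch, assuming the affected string has already been read into Python; the function name `describe_format_chars` and the sample string are illustrative and not part of the existing scripts.

```python
import unicodedata

def describe_format_chars(text):
    """List (index, code point, name) for every Unicode format character (category Cf)."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]

# Hypothetical sample mirroring the note above, with U+200C before "vgl."
sample = "\u200cvgl. Cicero De Div. I,102"
print(describe_format_chars(sample))
# [(0, 'U+200C', 'ZERO WIDTH NON-JOINER')]
```

Zero-width spaces, joiners, the BOM character, and the soft hyphen all fall into category `Cf`, so the same check reports each of them.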

Impact

These characters cause several issues in the processing pipeline:

  • Empty footnotes appear in the rendered output because the transformation pipeline sometimes interprets the invisible character as content.
  • Similar effects may occur in other contexts where whitespace or text nodes are evaluated.
  • They are related to these issues for corpora: ...
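One plausible mechanism for the empty footnotes is that ordinary whitespace trimming does not remove these characters, so a note containing only a zero-width character still looks non-empty to an emptiness check. The sketch below illustrates this in Python; it assumes nothing about how the actual transformation pipeline tests for content, and the `<note>` snippet and the helper `is_effectively_empty` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Same character set as the regex proposed in this issue
INVISIBLE = re.compile(r"[\u200B-\u200D\uFEFF\u00AD]")

def is_effectively_empty(element):
    """True if the element holds only whitespace and invisible characters."""
    text = "".join(element.itertext())
    return INVISIBLE.sub("", text).strip() == ""

# Hypothetical footnote whose only content is U+200C
note = ET.fromstring('<note place="foot"><p>\u200c</p></note>')

# str.strip() leaves U+200C in place, so a naive check sees content
naive_has_content = "".join(note.itertext()).strip() != ""
print(naive_has_content)           # True
print(is_effectively_empty(note))  # True
```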

Proposed solution

Add a corpus sanitization step to the scripts pipeline that removes Unicode format/control characters that should not occur in TEI text nodes.

Characters to remove include, for example:

  • U+200B ZERO WIDTH SPACE
  • U+200C ZERO WIDTH NON-JOINER
  • U+200D ZERO WIDTH JOINER
  • U+FEFF ZERO WIDTH NO-BREAK SPACE (BOM)
  • U+00AD SOFT HYPHEN

A regex that covers the most common cases:

[\u200B-\u200D\uFEFF\u00AD]
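A minimal sketch of how the sanitization step could apply this regex, assuming the corpus files are UTF-8 encoded and stored as `*.xml` under a single directory. The function names and the directory layout are assumptions for illustration, not the existing pipeline. The sketch strips the characters from the whole file (including attributes and comments), not only from text nodes, and works on raw bytes so that line endings stay unchanged.

```python
import re
from pathlib import Path

# Regex from this issue: ZWSP, ZWNJ, ZWJ, BOM / zero width no-break space, soft hyphen
INVISIBLE = re.compile(r"[\u200B-\u200D\uFEFF\u00AD]")

def strip_invisible(text):
    """Return text with the listed invisible characters removed."""
    return INVISIBLE.sub("", text)

def sanitize_file(path):
    """Clean one file in place and return how many characters were removed."""
    original = path.read_bytes().decode("utf-8")
    cleaned, removed = INVISIBLE.subn("", original)
    if removed:
        path.write_bytes(cleaned.encode("utf-8"))
    return removed

def sanitize_corpus(root):
    """Clean every *.xml file below root; return {path: removed_count} for changed files."""
    report = {}
    for path in sorted(Path(root).rglob("*.xml")):
        removed = sanitize_file(path)
        if removed:
            report[path] = removed
    return report
```

Calling `sanitize_corpus("corpus")` (with `corpus` as a placeholder directory name) would rewrite only the files that contain at least one of the characters and report the counts per file. Note that U+200C and U+200D carry meaning in some scripts, such as Persian or Indic text, so this blanket removal is only appropriate for files known to contain Latin-script text.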
