Remove invisible Unicode control characters in Latin files #30

Several TEI XML files in the corpus contain **invisible Unicode control characters** (e.g. zero-width spaces). These characters are invisible in most editors, but VS Code and similar tools highlight them as control characters.

Example from a TEI note:

```xml
<note place="foot" type="n1" targetEnd="#w8">
  <p xml:space="preserve">
    ‌vgl. Cicero De Div. I,102
  </p>
</note>
```

<img width="141" height="63" alt="Image" src="https://github.com/user-attachments/assets/c7107988-254e-4dec-b43e-4996f735128a" />

There is an invisible character before `vgl.` (likely `U+200C ZERO WIDTH NON-JOINER` or `U+200B ZERO WIDTH SPACE`).
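To pin down exactly which character is present, a small check like the following could help. This is only a sketch: the function name `find_format_chars` is made up for illustration, and it assumes the offending characters all fall in the Unicode general category `Cf` (format), which is true for the two candidates named above.

```python
import unicodedata


def find_format_chars(text):
    """Return (offset, code point, name) for every Unicode format (Cf) character in text."""
    hits = []
    for offset, ch in enumerate(text):
        if unicodedata.category(ch) == "Cf":
            code_point = f"U+{ord(ch):04X}"
            name = unicodedata.name(ch, "UNKNOWN")
            hits.append((offset, code_point, name))
    return hits
```

Running it over the text content of the note above would report the offset and the exact code point, which settles the `U+200C` vs. `U+200B` question for that file.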

#### Impact

These characters cause several issues in the processing pipeline:

* **Empty footnotes appear in the rendered output** because the transformation pipeline sometimes interprets the invisible character as content.
* Similar effects may occur in other contexts where whitespace or text nodes are evaluated.
* They are related to these issues for corpora: ...

#### Proposed solution

Add a **corpus sanitization step** to the scripts pipeline that removes Unicode format/control characters that should not occur in TEI text nodes.

Characters to remove include, for example:

* `U+200B` ZERO WIDTH SPACE
* `U+200C` ZERO WIDTH NON-JOINER
* `U+200D` ZERO WIDTH JOINER
* `U+FEFF` ZERO WIDTH NO-BREAK SPACE (BOM)
* `U+00AD` SOFT HYPHEN

A regex that covers the most common cases:

```
[\u200B-\u200D\uFEFF\u00AD]
```
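A minimal sketch of what such a step could look like in Python, using the regex above. The names `strip_invisible` and `sanitize_file` are hypothetical and not part of the existing scripts; the sketch also assumes the files are UTF-8 encoded and that removing a leading BOM is acceptable.

```python
import re
from pathlib import Path

# Same character class as the regex proposed above.
INVISIBLE = re.compile(r"[\u200B-\u200D\uFEFF\u00AD]")


def strip_invisible(text):
    """Return text with all characters matched by INVISIBLE removed."""
    return INVISIBLE.sub("", text)


def sanitize_file(path):
    """Strip invisible characters from one file in place; return how many were removed."""
    path = Path(path)
    original = path.read_text(encoding="utf-8")
    cleaned, count = INVISIBLE.subn("", original)
    if count:
        # Only rewrite files that actually changed, to keep diffs small.
        path.write_text(cleaned, encoding="utf-8")
    return count
```

Because the replacement works on the raw file text rather than a parsed tree, it leaves the XML markup untouched. It also strips matching characters inside attribute values and comments, which may or may not be desired.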


