Computational structural analysis of the Voynich manuscript (Beinecke MS 408) revealing systematic grammar, case morphology, and consistent clause structure across all five manuscript sections.
The profile points to a Dravidian-speaking Siddha medical practitioner who designed a personal script to encode their pharmacopeia. The positional rules are too systematic for a naturally evolved writing system — this script was engineered, not inherited. Early folios show rougher, less consistent glyph forms; later sections are fluid and assured. The scribe was learning their own writing system as they wrote.
No second copy exists. No parallel text. No Rosetta Stone. This was a private notebook — never meant for anyone else to read.
The Voynich manuscript's statistics simultaneously mimic several different systems, and each metric points in a different direction:
- High Index of Coincidence (0.075) made it look like a simple substitution cipher of Latin or Italian — but the bigram coverage was far too low (7%) for any European alphabet.
- Low bigram coverage suggested a large syllabary (~140 symbols) — but the IC was 5x too high for a syllabary, which would spread frequency across all glyphs.
- Narrow word-length distribution (CV = 0.386) matched Latin syllable statistics perfectly — but the positional dominance (0.839) was higher than any natural language alphabet.
- Extreme positional constraints (some characters 99%+ word-initial or word-final) looked like a constructed system — but the paradigm fill rate (15%) and suffix Zipf exponent (0.893) were exactly in natural language range.
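The four statistics quoted in the list above can be approximated with a few lines of Python. This is a minimal sketch, not the project's own code. It assumes each word is a string with one character per glyph, so multi-character EVA glyphs would need tokenizing first. The definitions of `bigram_coverage` and `positional_dominance` are plausible guesses, since the exact formulas are not stated here, and all function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def index_of_coincidence(words):
    """Probability that two glyphs drawn without replacement are identical."""
    counts = Counter(ch for w in words for ch in w)
    n = sum(counts.values())
    if n < 2:
        return 0.0
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def bigram_coverage(words, alphabet_size=None):
    """Distinct within-word bigrams seen, divided by all possible bigrams.

    ASSUMPTION: this is one reasonable reading of "bigram coverage".
    """
    alphabet = {ch for w in words for ch in w}
    k = alphabet_size or len(alphabet)
    seen = {(a, b) for w in words for a, b in zip(w, w[1:])}
    return len(seen) / (k * k) if k else 0.0


def word_length_cv(words):
    """Coefficient of variation (population std / mean) of word lengths."""
    lengths = [len(w) for w in words]
    return pstdev(lengths) / mean(lengths)


def positional_dominance(words):
    """Token-weighted mean of each glyph's largest positional share.

    ASSUMPTION: positions are bucketed as initial / medial / final; a
    one-glyph word counts as initial.
    """
    slots = {}
    for w in words:
        for i, ch in enumerate(w):
            if i == 0:
                slot = "initial"
            elif i == len(w) - 1:
                slot = "final"
            else:
                slot = "medial"
            slots.setdefault(ch, Counter())[slot] += 1
    total = sum(sum(c.values()) for c in slots.values())
    return sum(max(c.values()) for c in slots.values()) / total if total else 0.0
```

For intuition, a text drawn uniformly from k symbols has an IC near 1/k, which is why a high IC implies a small effective alphabet and a syllabary of ~140 evenly used symbols would score far lower.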
The resolution: it's an abugida — a small set of consonant bases (~14) combine with vowel modifiers to produce ~50 surface glyphs. This gives you a small effective alphabet (high IC) with many surface forms (low bigram fill) and strong positional rules (onset/nucleus/coda occupy distinct character slots).
The language itself is agglutinative with SOV word order, which produces the regular word lengths and clause-final verb patterns. The two noun classifiers (h- organic, k- material) act as scribal semantic markers, not grammatical gender — 64% of nouns carry neither.
Telugu positional abugida encoding is the closest statistical match across
all metrics (composite distance 0.066).
Telugu nearly matches the Voynich on all four key metrics. English and Latin
syllabary encodings fail on IC.
Dravidian languages match 94% of the Voynich's typological features — more
than any other language family tested.
No single language + simple syllabary reproduces all four metrics. The IC
gap (bottom-right) is what rules out a pure syllabary and points to an
abugida with a small base alphabet.
The EVA transcription encodes a consistent agglutinative grammar:
- 6 case suffixes with distinct verb selectional preferences (-an: 38% before verb 1H; -am: 33% before verb 1cH; -ae: 34% before verb 1K)
- SOV word order — 76.5% of clause-final words end in suffix -9
- Two noun classifiers (h- and k- prefixes) — NOT grammatical gender: 64% of nouns are unmarked, 54 roots appear in both classes, verbs don't agree
- Participle chaining for sequential procedures (-c89 = "having done")
- Definite article 4o- (proclitic, 97% character binding)
- Clause-final demonstratives (sam, san, sae) with case agreement
This grammar is internally consistent at 80% parseability across all 5 sections (biological, herbal A, herbal B, astronomical, recipe/stars) and 29,000 words.
Case suffix distribution across manuscript sections — each case has distinct
frequency profiles matching its grammatical function.
76.5% of clause-final words carry the finite verb suffix -9, confirming SOV order.
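The clause-final figure can be checked with a short count. The sketch below assumes the text has already been split into clauses, each a list of EVA word strings. How clause boundaries are found is not described here, so that segmentation step is left to the caller, and the function name is hypothetical.

```python
def clause_final_suffix_rate(clauses, suffix="9"):
    """Share of non-empty clauses whose last word ends in `suffix`."""
    finals = [clause[-1] for clause in clauses if clause]
    if not finals:
        return 0.0
    return sum(word.endswith(suffix) for word in finals) / len(finals)
```

Running the same function per manuscript section would show whether the rate holds across sections or is driven by only one of them.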
Noun root frequency varies systematically by section, consistent with a
medical handbook covering different domains.
h/k noun classifiers are scribal semantic markers, not grammatical gender —
64% of nouns are unmarked.
To verify: run translate_voynich.py on the standard EVA transcription. The
grammar rules are encoded in the script. The parseability percentage is reproducible.
Building on the grammar, further analysis suggests:
- Script type: positional syllabary/abugida with ~50 distinct glyphs (EVA collapses to ~30, destroying phonetic information)
- Language family: closest statistical match is Dravidian (Telugu positional abugida encoding, composite distance 0.066 across 6 metrics)
- Content: Siddha medical handbook — pharmacopeia, anatomy, medical astrology, and compounding procedures
- 23 preliminary glyph-to-syllable mappings from 9 plant name readings
- Most common content word may read as "amma" (body/being), a Proto-Dravidian root
Plant name cross-references yield consistent syllable mappings across 9 plants.
Five mutual-exclusion character groups: characters within a group never appear
adjacent because they compete for the same structural slot. This is the signature of a featural script.
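One way to look for mutual-exclusion candidates is to list pairs of reasonably frequent characters that are never adjacent in either order. This is a minimal sketch under assumptions: one character per glyph, a hypothetical `min_count` threshold to ignore rare glyphs, and no claim that it matches the procedure used to find the five groups.

```python
from collections import Counter
from itertools import combinations


def never_adjacent_pairs(words, min_count=5):
    """Pairs of distinct frequent characters never seen side by side."""
    freq = Counter(ch for w in words for ch in w)
    adjacent = set()
    for w in words:
        for a, b in zip(w, w[1:]):
            adjacent.add(frozenset((a, b)))  # order-insensitive
    common = sorted(ch for ch, n in freq.items() if n >= min_count)
    return [
        (a, b)
        for a, b in combinations(common, 2)
        if frozenset((a, b)) not in adjacent
    ]
```

A mutual-exclusion group then corresponds to a set of characters in which every pair appears in this list. Absence of a pair is only meaningful when both characters are frequent enough that chance alone would have produced some adjacencies.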
These proposals need independent verification, particularly by a Dravidian linguist working from original manuscript glyphs rather than EVA transcription.
The structural translation resolves 81% of words to English glosses.
The remaining 19% are left as bracketed placeholders [...] and fall into
three categories:
- Uppercase EVA variants — visually distinct glyphs for which we lack phonetic values
- Special characters — unusual glyphs outside the standard EVA alphabet
- Rare roots — insufficient distributional data to constrain meaning
The 81% that is translated is backed by distributional evidence. The 19% that remains bracketed is honestly unknown.
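The resolved share can be recomputed from translated text by counting bracketed placeholders. The sketch below assumes one whitespace-delimited token per source word and that every unresolved word appears as a token wrapped in square brackets, such as `[...]`. The actual layout of VOYNICH_TRANSLATION.txt may differ, and the function name is hypothetical.

```python
import re

# ASSUMPTION: an unresolved word is a single token fully wrapped in [ ].
PLACEHOLDER = re.compile(r"^\[.*\]$")


def resolved_fraction(lines):
    """Fraction of tokens that are not bracketed placeholders."""
    total = resolved = 0
    for line in lines:
        for token in line.split():
            total += 1
            if not PLACEHOLDER.match(token):
                resolved += 1
    return resolved / total if total else 0.0
```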
Core:
- VOYNICH_TRANSLATION.txt: complete structural translation (4634 lines)
- voynich_lexicon.txt: lexicon with syntactic frames (9217 entries)
- voynich_syllabary.txt: 23 glyph-to-syllable mappings with evidence
- voynich_semantic_map.txt: root meanings with confidence levels
- voynich_clause_structure.txt: case system, verb forms, clause templates
- translate_voynich.py: translation engine (reproducible)
Evidence:
- METHODS.md: complete methodology, confidence levels, known limitations
- voynich_glyph_inventory.txt: visual variant catalog from hi-res images
- voynich_plant_identifications.txt: 20 plants with Dravidian names
- voynich_unified_findings.txt: 80 consolidated findings
- voynich_*_report.txt: individual analysis reports
Zero-knowledge semantic topology extraction via dual grammar induction. Runs two complementary compression algorithms (Sequitur for structure, MR-RePair for frequency) on raw bytes, overlays the rulesets into a 2x2 matrix, and discovers relational structure from the residuals. The core principle — let structure emerge from the data without assumptions, then classify what the algorithms agree on vs. disagree on — is the same approach applied here to the Voynich manuscript's morphological patterns.
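To give a feel for the frequency-driven half of that pipeline, here is a minimal sketch of plain Re-Pair. It is not MR-RePair, not Sequitur, and not that project's implementation. It repeatedly replaces the most frequent adjacent pair with a fresh rule symbol until no pair occurs twice. The names `repair` and `expand` are hypothetical.

```python
from collections import Counter


def repair(data):
    """Plain Re-Pair over any sequence; returns (compressed_seq, rules)."""
    seq = list(data)
    rules = {}
    next_id = 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # counts may include overlaps, e.g. "aaa"
            break
        symbol = ("R", next_id)  # tuple, so it cannot clash with input items
        next_id += 1
        rules[symbol] = pair
        out, i = [], 0
        while i < len(seq):  # left-to-right, non-overlapping replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Recursively expand a rule symbol back into input items."""
    if symbol in rules:
        left, right = rules[symbol]
        return expand(left, rules) + expand(right, rules)
    return [symbol]
```

Expanding every symbol of the compressed sequence reproduces the input exactly, so the rule set is a lossless description of the repeated structure. The rules themselves are the candidate units whose agreement or disagreement with a second grammar can then be compared.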
Biomechanical analysis of writing systems — modeling glyphs as physical stroke paths with two-axis motor cost, curvature, pen lifts, and transition angles. One useful idea from that work: characters in a writing system exist in a constrained energy landscape where positional patterns and transition costs encode structural information about the script. That perspective informed parts of the Voynich script analysis, particularly the mutual-exclusion character groups and positional dominance patterns.
Exploratory. Not peer-reviewed. The grammar is internally consistent and reproducible. The phonetic decryption and language identification are preliminary and need independent verification.