A New Word Structure

A few years ago I published a couple of posts regarding the low level and high level structure of Voynich words. While I think I got a number of things right, I have had to revise the low level structure since that post, and now it’s time to revise the word structure completely. I believe there is a much simpler way to express how Voynich words are structured than before.

Before I start I want to mention the work of Jorge Stolfi, specifically his Grammar for Voynichese Words and Prefix-Midfix-Suffix Decomposition of Voynichese Words. Although my work is not directly built upon his, the reader will notice some similarity. I want to acknowledge that I’ve found his work an inspiration from my very earliest days of reading about the manuscript.

The main difference between Stolfi’s work and my own is the unit of analysis. Stolfi analysed the structure of words according to categories of glyphs. I have based my analysis on “syllables” – regular divisions of words according to some basic rules. I will first need to explain a little bit about syllables and how the process of division works.

What is a Syllable?

Although most readers will know what a syllable is, it’s useful to explain a couple of terms that I will use. You can think of a syllable as a string or bundle of sounds uttered close together: a vowel (or vowel-like sound) and optional consonants pronounced before or after it. When speaking about the components of a syllable the vowel is called the nucleus, the consonants before it the onset, and the consonants after it the coda.

Here’s a simple diagram from Wikipedia (where the sigma sign stands for ‘syllable’):

The components can be grouped in different ways. The nucleus and coda together are called the rhyme, while the onset and nucleus can be called the body. For our purposes the model of body + coda will be the most relevant.

Finding Syllables

The basis for dividing words into syllables is the identification of [o] and [y] as key glyphs. They occur very commonly in most words and mostly in the same positions as each other. We can consider them as being the nucleuses of our syllables, whether or not we wish to consider them as vowels.

(While it is not necessary to believe in the equivalence of [y] and [a] or in [y] deletion to follow this word structure, I have written it from the perspective of those two hypotheses being true. Every mention of [y] should be understood to represent [y], [a], and those locations where [y] is missing, such as after [e] sequences or benches [ch, sh].)

We divide up a word by first finding all the occurrences of [o] and [y]. There are as many syllables as occurrences of these glyphs. Each syllable consists of the [o] or [y], all the glyphs before it until another [o] or [y] or the start of the word is reached, and, if it is the final [o] or [y] in the word, all the glyphs after it.

Let us take an example: [qokedar] (8 tokens). First we restore the missing [y] after [e] and change [a] to [y], giving: [qokeydyr]. Then we apply the syllable division process outlined above: there are three occurrences of [o] and [y] in the word, giving the syllables [qo] + [key] + [dyr].
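The division process can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the procedure used for the hand analysis in this post: the function names `normalise` and `syllabify` are made up for the example, words are plain EVA strings, and the rule for restoring a missing [y] after a bench (a bench followed by none of [e], [o], [y]) is my own guess at how that hypothesis might be applied mechanically.

```python
import re


def normalise(word: str) -> str:
    """Apply the two [y] hypotheses to an EVA word (illustrative rules only)."""
    # Treat [a] as equivalent to [y].
    word = word.replace("a", "y")
    # Restore [y] after a run of [e] that is not already followed by [o] or [y].
    word = re.sub(r"(e+)(?![eoy])", r"\1y", word)
    # Assumption: a bench [ch, sh] followed by none of [e], [o], [y] has lost a [y].
    word = re.sub(r"([cs]h)(?![eoy])", r"\1y", word)
    return word


def syllabify(word: str) -> list[str]:
    """Split a word into syllables, one per occurrence of [o] or [y]."""
    word = normalise(word)
    # Each syllable is a nucleus plus every glyph before it back to the previous nucleus.
    syllables = re.findall(r"[^oy]*[oy]", word)
    if not syllables:
        return []  # no [o] or [y]: the word cannot be split
    # Glyphs left over after the final nucleus are attached to the final syllable.
    consumed = sum(len(s) for s in syllables)
    syllables[-1] += word[consumed:]
    return syllables


print(syllabify("qokedar"))  # ['qo', 'key', 'dyr']
print(syllabify("choldy"))   # ['cho', 'ldy']
```

Because leftover glyphs are only ever attached to the final syllable, the sketch behaves like the process described here: glyphs between two nuclei always go to the following syllable.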

It should be noted that [qo] and [key] are syllables consisting of bodies only, and only [dyr] has a coda. Indeed, the process of syllable-finding makes it impossible for there to be a coda except at the end of words, as glyphs are always assigned to a following syllable where possible.

This problem can arise with other languages, and knowing whether a consonant belongs to the coda of one syllable or the onset of the other is sometimes a difficult judgement. Phonotactic rules can be helpful to resolve the issue. If we think of the word carpet we know it has two syllables, yet we also know that syllables don’t start /rp/ so the second syllable might be /et/ or /pet/, but definitely not /rpet/.

Though we don’t know the phonotactic rules for the Voynich text we can use a similar judgement to check if the syllables we find are valid. A word such as [choldy] (10 tokens) looks more like [chol] + [dy] than [cho] + [ldy], though the syllable-finding process would give the latter. However, we can find a few words starting [ld], showing that syllables can start with this cluster (indeed, [ldy] is the most common with 25 tokens).

Only in a handful of cases do we feel the need to say that a non-final syllable has a coda, and most of those concern [i], [n], [m] not being at or near the end of a word. Thus we can consider the placement of codas at the end of words not to be an issue. Final syllables with codas will be considered along with other syllables as bodies (onset and nucleus) separate from their codas. Codas will be addressed toward the end of this post.

Permissible Bodies

For the purposes of this structural analysis I took all words with four or more tokens (1109 types) and divided them into syllables by hand. The number of syllables in each word type ranged from 1 to 4, though a few words could not be split into syllables. The table below gives the number of different word types by syllable length:

Number of Syllables    Word Types    % of Types
0 syllables            29            2.6%
1 syllable             326           29.4%
2 syllables            614           55.4%
3 syllables            136           12.3%
4 syllables            4             0.4%

We can see straight away not only that the majority of words can be split into syllables with this process, but also that the number of syllables falls into a narrow range of 1 – 3 syllables.
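As a quick arithmetic check, the percentages can be recomputed from the type counts in the table. The snippet below is just a sketch; the variable names are illustrative and the counts are copied from the table above.

```python
# Word types by number of syllables, copied from the table above.
counts = {0: 29, 1: 326, 2: 614, 3: 136, 4: 4}

total = sum(counts.values())  # 1109 word types in all
shares = {n: 100 * c / total for n, c in counts.items()}

for n, share in shares.items():
    print(f"{n} syllables: {counts[n]:>4} types, {share:.1f}%")

# Share of types that fall in the 1 - 3 syllable range.
one_to_three = 100 * sum(counts[n] for n in (1, 2, 3)) / total
print(f"1 to 3 syllables: {one_to_three:.1f}%")  # 97.0%
```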

Moreover, the division of Voynich words into syllables gives us a structurally coherent set of syllable bodies. This is partly due to the way the process works, such as inserting missing [y], but the structure of syllables is relatively simple. The number is also relatively small. Only 135 different syllable bodies were found in the words I analysed, and I would not expect an analysis of all words to expand this significantly.

Below is a list of the twenty most common bodies and the number of word types they occur in:

Syllable Body Word Types
dy 265
o 254
qo 144
y 138
ky 68
cho 56
ty 53
chy 51
ry 39
chey 37
ly 35
to 28
key 27
ko 25
shey 24
keo 23
sho 23
cheo 21
keey 21
lky 19

Body Rank Order

Up to now most of the things in this post have been previously discussed. What follows is a new theory which describes the structure of Voynich words. It is based on syllable bodies being ordered within words in predefined ways.

We saw above that 97% of the word types analysed have 1 – 3 syllables. Therefore we can say that there are up to three “slots” into which syllable bodies fit within words. Some syllable bodies, such as [qo, o, dy], are highly positional. We can guess beforehand which slot these bodies are likely to occur in for any given word.

In fact, every syllable body has a value which determines its position within a word. The slots must be filled in a particular order for the word to be valid. I call this theory “Body Rank Order” and it has a simple set of rules:

  1. Each syllable body has a Rank of 1, 2, or 3.
  2. Each word has a number of slots equal to its number of syllables.
  3. From left to right the Rank of each syllable body must increase (or stay constant).

The ranks for all syllable bodies which occur in five or more word types are as below:

Rank 1 Rank 2 Rank 3
cheeo, cheey, cheo, chey, cho, chy
qo, o, y
sheey, sheo, shey, sho, shy

ckhey, ckhy, cthey, ctho, cthy
do, lo, ro, sy

kchey, kcho, kchy, keeey, keeo, keey, keo, key, ko, kshey, kshy, ky
lchey, lchy, lkeey, lky, lshey
pchey, pcho, pchy, po, py
tcheo, tchey, tcho, tchy, teeo, teey, teo, tey, to, ty

Not all combinations of bodies from ranks 1, 2, and 3 occur, and this table should not be taken as a guide for simply constructing words. It may also be that one or two of these syllable bodies belong better in a different rank, depending on future research ([so] is the most likely to need moving). Yet we can clearly see that the ranks are quite coherent: all syllable bodies starting with a bench [ch, sh] are in Rank 1, and all syllables with a gallows are in Rank 2.

The Body Rank Order fits Voynich words well, with no issues for the 350 most common words. The most common word which breaks the rules is [qokechy] with 13 tokens, having a rank order of 121. The next is [dalor] with 8 tokens, having a rank order of 32.
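The three rules and the two counter-examples above can be expressed as a small check. This is a minimal sketch: the names `RANKS`, `rank_pattern` and `is_ordered` are invented for the illustration, and the rank table is deliberately partial, holding only the bodies whose ranks can be read off the worked examples in this post ([qokechy] as 121 and [dalor] as 32).

```python
# Partial rank table, limited to bodies whose rank follows from the examples
# in the text: [qokechy] = 1-2-1 and [dalor] = 3-2.
RANKS = {"qo": 1, "chy": 1, "key": 2, "lo": 2, "dy": 3}


def rank_pattern(bodies: list[str]) -> list[int]:
    """Look up the rank of each syllable body, left to right."""
    return [RANKS[body] for body in bodies]


def is_ordered(bodies: list[str]) -> bool:
    """Rule 3: ranks must increase or stay constant from left to right."""
    ranks = rank_pattern(bodies)
    return all(a <= b for a, b in zip(ranks, ranks[1:]))


# Bodies are given after [y] restoration and with any final coda removed.
print(rank_pattern(["qo", "key", "chy"]), is_ordered(["qo", "key", "chy"]))  # [1, 2, 1] False
print(rank_pattern(["dy", "lo"]), is_ordered(["dy", "lo"]))                  # [3, 2] False
print(rank_pattern(["qo", "key", "dy"]), is_ordered(["qo", "key", "dy"]))    # [1, 2, 3] True
```

The last line is [qokedar] from the earlier example, which comes out as a well-formed 123 word under these ranks.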

Of course, every one-syllable word fits the theory perfectly, but the statistics for words with two or more syllables are still good:

Syllables    Types    Disordered    % Disordered
2            614      6             1%
3            136      5             4%
4            4        4             100%

All four syllable words are “disordered”, but they’re so rare that it is hardly an issue for the theory. It seems that four syllable words are unusual in multiple ways.

For two and three syllable words, the problems are caused by just a few errors:

  1. [chy] and [y] coming at the end of a word rather than the start.
  2. [lo] being out of order.
  3. A coda coming in the middle of a word.

This last error hasn’t really been mentioned but can be summarised. We can assign the rank of 4 to a coda as it almost always comes at the end of a word. Where it occurs in the middle of a word an error is thus found. The two word types in question are [daiidy] (6 tokens) and [dairal] (4 tokens).

Conclusion and Next Steps

Taking into account words which couldn’t be syllabified (2.6%), words with more than three syllables (0.4%) and words with rank errors (1%), we can see that the Body Rank Order theory accounts for 96% of all word types with four or more tokens.
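The 96% figure can be reproduced from the counts given in the two tables earlier in the post. The snippet is only an arithmetic sketch with illustrative variable names.

```python
total_types = 1109      # word types with four or more tokens

unsyllabified = 29      # types with no [o] or [y] to split on
four_syllable = 4       # types with more than three syllables
rank_errors = 6 + 5     # disordered two and three syllable types

unexplained = unsyllabified + four_syllable + rank_errors
accounted = 100 * (total_types - unexplained) / total_types

print(f"rank errors: {100 * rank_errors / total_types:.1f}%")  # 1.0%
print(f"accounted for: {accounted:.1f}%")                      # 96.0%
```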

The theory is thus highly successful yet still very simple. It builds on previous research concerning [y] while presenting a new field for further research. It reveals a word structure which demands further investigation.

The theory could be further refined, as mentioned above with regard to [so]. This body is different as it mostly occurs at the start of lines, where different rules seem to apply, and a few others are likely the same.

It might be worth also calculating the rank orders for all word types, no matter how many tokens, to discover how well it describes unusual words. It is likely that unusual words are unusual for a reason, as with four syllable words.

Lastly, I stated above that the rank order within words should increase from left to right, or stay constant. I found that the most common words always showed an increase from left to right. The most common word which didn’t show an increase, but stayed constant, was [daly] (27 tokens).

The percentage of words at each syllable count which didn’t show an increase was always relatively small: 8% for two syllable words, and 5% for three syllable words. Adding the disordered words from the table above (1% and 4% respectively), this means that in both cases 91% of words showed a strictly increasing rank order, rather than an error or a constant rank. It may be that with further refining of the model the share of constant-rank words can be brought down, thus simplifying and strengthening the theory.


One thought on “A New Word Structure”

  1. I find your explanations of how the Voynich Manuscript’s vords are formed very clear and easy to understand. I suspect your algorithm could be applied to any language, and its syllable profile compared statistically to yours for Voynichese. Interpreted correctly, this could be a valuable tool for falsifying claims for the VMS being written in known languages. If the language tends to build syllables and words in ways that Voynichese patterns don’t allow for, then it’s not a match. Though I’m sure this tool’s negative predictive value would be far better than its positive predictive value, it would be interesting to see what languages spoken in the general vicinity of the VMS’s origin had phonological profiles — i.e. built syllables and words — the most similarly to the way Voynichese apparently does.

