Bench-initial line position

It has long been observed that the bench glyphs, [ch] and [sh], are less frequent at the start of the first word in a line. Certainly Currier had made this observation back in the 1970s. A related but less well-known feature is that words beginning with a bench glyph are sometimes more frequent in the second position of a line.

A couple of tables will help to illustrate the feature. The table below shows the number of words beginning [ch] in each position from the left by each hand/scribe.

Position from left    1     2     3     4     5
Hand 1               91   439   350   296   223
Hand 2               24   188   170   190   174
Hand 3               30   255   229   224   194

The table below shows the number of words beginning [sh] in each position from the left by each hand/scribe.

Position from left    1     2     3     4     5
Hand 1               96   213   142   142   103
Hand 2               36   222   175   131   114
Hand 3               47   217   110    95    90

The figures given are counts for three scribes/hands as identified by Lisa Fagin Davis (2020). The total number of words written by each hand differs, so the counts are not directly comparable. The interest is in how the numbers change from left to right.

Were words distributed randomly (or just evenly, according to a principle which was “ignorant” of the line) then we would expect the numbers of these words to be broadly equal across positions. The counts would decrease from left to right, simply because not all lines contain two, three, four, or five words.

Yet we see clearly two patterns as mentioned above:

  • In nearly all cases words beginning [ch] or [sh] in the leftmost position are significantly lower than subsequent positions. The only exception is for words beginning [sh] written by hand 1, where the fifth position is nearly as low as the first.
  • In some cases the number of bench-initial words in the second position is noticeably higher than subsequent positions. This is true for hand 1 in both cases, but also hands 2 (though weak) and 3 for [sh].
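Counts of this kind can be produced with a few lines of code. The following is only a minimal sketch: it assumes each line of text is a string of space-separated words in a plain transliteration, and the function name and the toy lines are invented for illustration rather than taken from the manuscript.

```python
from collections import Counter

def count_by_position(lines, prefix, max_pos=5):
    """Count words starting with `prefix` at each position from the left."""
    counts = Counter()
    for line in lines:
        # Only the first `max_pos` words of each line are considered.
        for pos, word in enumerate(line.split()[:max_pos], start=1):
            if word.startswith(prefix):
                counts[pos] += 1
    return [counts[p] for p in range(1, max_pos + 1)]

# Toy lines, invented for illustration only.
sample_lines = [
    "ychedy chol shedy daiin",
    "daiin chedy chol",
    "shol chey qokeedy chor",
]

ch_counts = count_by_position(sample_lines, "ch")
sh_counts = count_by_position(sample_lines, "sh")
```

Running the same function separately over the lines attributed to each hand would give one row of a table like those above.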

How can these features be explained? Is there some way we can even out the occurrence of bench initial words?

Several years ago we looked at linestart words, which showed that the initial glyphs for words at the start of lines were quite different. Words beginning with [ych], [ysh], [dch], [dsh] were strongly tied to the start of lines. The possibility that words at the start of lines may have additional initial glyphs was explored in another post on the subject.

The hypothesis is thus that by removing [y] or [d] from certain words at the start of lines, we’ll end up with a more even distribution of words beginning with [ch] or [sh].

Below is a table showing the distribution of words beginning [ych]/[ysh] according to position in the line and by individual hand/scribe.

Position from left    1          2         3         4         5
Hand 1                75 / 14    4 / 0     0 / 0     1 / 1     1 / 0
Hand 2                23 / 21    3 / 0     0 / 0     3 / 1     3 / 2
Hand 3                68 / 34    2 / 1     1 / 0     2 / 0     0 / 0

And again, the same table as above but this time for [dch]/[dsh].

Position from left    1          2         3         4         5
Hand 1                61 / 27    7 / 3     12 / 4    8 / 3     20 / 0
Hand 2                35 / 40    3 / 0     0 / 0     1 / 4     0 / 2
Hand 3                36 / 29    3 / 1     1 / 0     2 / 1     2 / 0

The tables show that these words are strongly line initial, though [dch] in hand 1 appears to occur throughout the text in lower numbers.

If we take the hypothesis that all words beginning [ych], [ysh], [dch], [dsh] should have the initial [y] or [d] removed to yield a word beginning [ch] or [sh], does that even up the distribution of bench initial words? The table below shows the changes to the distribution of bench initial words, with the original count before the slash and the new count after it.

First for [ch]:

Position from left    1           2            3            4            5
Hand 1                91 / 227    439 / 450    350 / 362    296 / 305    223 / 244
Hand 2                24 / 82     188 / 194    170 / 170    190 / 194    174 / 177
Hand 3                30 / 134    255 / 260    229 / 231    224 / 228    194 / 196

And again for [sh]:

Position from left    1           2            3            4            5
Hand 1                96 / 137    213 / 216    142 / 146    142 / 146    103 / 103
Hand 2                36 / 97     222 / 222    175 / 175    131 / 136    114 / 118
Hand 3                47 / 110    217 / 219    110 / 110    95 / 96      90 / 90

The results show some improvement in the distribution of bench initial words, which is not a surprise given that the changes were designed with that in mind. However, the range of improvement differs across hands and between [ch] and [sh].
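The new counts are simply the original bench-initial counts plus the [y]- and [d]-prefixed counts for the same position. The sketch below redoes that addition with the figures from the tables in this post. The helper names `merged` and `pos1_ratio` are my own, and the ratio (position 1 against the mean of positions 3 to 5) is just one rough way to judge how even a row is, not a measure used in the original analysis.

```python
# Counts for positions 1-5 by hand, copied from the tables in this post.
ch = {1: [91, 439, 350, 296, 223], 2: [24, 188, 170, 190, 174], 3: [30, 255, 229, 224, 194]}
sh = {1: [96, 213, 142, 142, 103], 2: [36, 222, 175, 131, 114], 3: [47, 217, 110, 95, 90]}
ych = {1: [75, 4, 0, 1, 1], 2: [23, 3, 0, 3, 3], 3: [68, 2, 1, 2, 0]}
ysh = {1: [14, 0, 0, 1, 0], 2: [21, 0, 0, 1, 2], 3: [34, 1, 0, 0, 0]}
dch = {1: [61, 7, 12, 8, 20], 2: [35, 3, 0, 1, 0], 3: [36, 3, 1, 2, 2]}
dsh = {1: [27, 3, 4, 3, 0], 2: [40, 0, 0, 4, 2], 3: [29, 1, 0, 1, 0]}

def merged(base, *extras):
    """Add the prefixed counts to the base counts, position by position."""
    return {
        hand: [sum(vals) for vals in zip(base[hand], *(e[hand] for e in extras))]
        for hand in base
    }

def pos1_ratio(counts):
    """Position 1 count relative to the mean of positions 3 to 5 (hypothetical measure)."""
    return counts[0] / (sum(counts[2:5]) / 3)

new_ch = merged(ch, ych, dch)
new_sh = merged(sh, ysh, dsh)

for hand in (1, 2, 3):
    print(hand, new_ch[hand], round(pos1_ratio(new_ch[hand]), 2),
          new_sh[hand], round(pos1_ratio(new_sh[hand]), 2))
```

On these figures the ratio for [sh] in hand 1 comes out close to 1, while for [ch] in hand 2 it stays below 0.5.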

For [ch], none of the improved distributions match well. That of hand 1 is the most improved, and barely reaches the count in position 5. The others are still too far off. It may be that changes to Grove words, as discussed previously, could improve these scores further, but they may still fall short.

For [sh] the story is quite different. For both hand 1 and hand 3 the new distributions seem almost perfect, with position 1 matching positions 3, 4, and 5. (Of course, position 2 is still too high, which is another issue.) For hand 2 the new distribution isn’t nearly as good and must be considered almost as poor as those for words beginning [ch].

The number of Grove words for each section is around 45 [pch], 8 [fch], 10 [psh], and 1 [fsh]. The number is difficult to calculate without looking at every example, because some may turn out to genuinely start with a gallows. Removing that gallows to obtain a word starting [ch] or [sh] would be a wrong move. Even so, they still wouldn’t create a good match for bench initial distribution.

The question of what causes the spike in bench initial words in position 2 must be left unanswered. There are several avenues to pursue, such as the preceding word and the glyphs which follow the initial bench in position 2 words.

(This post partly stems from a conversation some time ago with Marco Ponzi about the spike in bench initial words in position 2 of a line.)

Glyph Distribution Diversity

Each glyph in the Voynich script has a different distribution. Some occur in a particular position, such as the start, middle, or end of words, or adjacent to specific glyphs, such as [q] before [o]. Some glyphs may appear adjacent to many others, some only a few. We can think of glyphs as having a more or less diverse distribution based on how many glyphs they occur next to.

The distribution of glyphs in relation to other glyphs also differs according to the direction of that relationship. The glyphs which come before [ch] are not the same as those which come after it. There may be both a different set of glyphs and a different number of glyphs which come before and after.

The chart below shows the number of different glyphs which come before and after the 22 most common glyphs in the text. The number of glyphs for each is counted up to at least 95% of its distribution, so rare adjacent glyphs are not included. Also, I have taken [ch, sh, ckh, cth, cfh, cph] each to be single glyphs, and also counted a word break (or space) as a glyph to indicate the position at the start or end of a word.
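The counting described above can be sketched roughly as follows. This is one reading of the method, not the exact procedure behind the chart: it assumes words are given as plain strings, splits them with a simple regular expression that treats the benches and bench gallows as single glyphs, writes the word break as ".", and takes "95% of the distribution" to mean the smallest number of most frequent neighbours needed to cover 95% of occurrences.

```python
import re
from collections import Counter, defaultdict

# Longest alternatives first so that [ckh] is not split into [c], [k], [h].
GLYPH = re.compile(r"ckh|cth|cfh|cph|ch|sh|.")

def glyphs(word):
    """Split a word into glyphs, keeping benches and bench gallows whole."""
    return GLYPH.findall(word)

def diversity(words, coverage=0.95):
    """Return {glyph: (n_before, n_after)} counted up to `coverage`."""
    before = defaultdict(Counter)
    after = defaultdict(Counter)
    for word in words:
        seq = ["."] + glyphs(word) + ["."]  # "." marks the word break
        for left, right in zip(seq, seq[1:]):
            after[left][right] += 1
            before[right][left] += 1

    def needed(counter):
        # Smallest number of most frequent neighbours reaching the coverage.
        total = sum(counter.values())
        running = 0
        for n, (_, count) in enumerate(counter.most_common(), start=1):
            running += count
            if running >= coverage * total:
                return n
        return 0

    return {g: (needed(before[g]), needed(after[g])) for g in before if g != "."}

# Toy word list, for illustration only.
result = diversity(["qo", "qo", "ol"])
```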

Eighteen of the 22 glyphs have either the same or nearly the same level of diversity before and after. The highest is [o] which can take any of 10 glyphs before and 11 glyphs after. The lowest are [n] and [q] which both take one before and one after.

Four of the 22 glyphs in the above chart should be looked at in more detail.

[y] and [a]

The glyphs [y] and [a] both show a distribution with significantly lower diversity of glyphs after them than before. Although they show significant overlap in the glyphs which come before, they have no overlap in the glyphs which come after.

These two glyphs are so far different from others that they raise the question of why. I’ve previously discussed the relationship between [y] and [a], and the possibility that [y] is sometimes deleted.


[m]

Although the difference in the diversity of glyphs before and after [m] is small in number, it is relatively big. The glyph [m] has three different glyphs which can come before, but only one after (in fact, not a glyph, but a space).

The strict word-final position of [m] is probably mostly to blame, though [q] and [n], which are strongly tied to the start and end of words respectively, have the same number of glyphs appearing on both sides. It could be that, as suggested by a number of researchers, [m] is a word-final variant of another glyph.


[l]

The most interesting result from measuring the diversity of adjacent glyphs is that [l] is totally different from every other glyph. While only three glyphs come before [l], there are eight glyphs which can come after. It is the second most diverse glyph with regard to what glyphs can follow it.

This result is reminiscent of those from first-last combinations where [l] didn’t fit well into the proposed division between ‘strong’ and ‘weak’ glyphs. There is clearly something about the way [l] interacts with other glyphs which makes it unlike others.

The reason for this difference is unknown and deserves further investigation. I’ll likely follow up this post with a more in-depth look at [l], including breaking down its distribution into Currier A and B.

New Article: Glyph combinations across word breaks in the Voynich manuscript

I’m happy to announce that Marco Ponzi and I have recently had a paper published in Cryptologia titled “Glyph combinations across word breaks in the Voynich manuscript”.

The abstract:

The text of the Voynich manuscript exhibits relationships between neighboring words that have not formally been explored. The last and first glyphs of adjacent words show some dependency, and certain glyph combinations are more or less likely to occur. The patterns of preferences for glyph combinations demonstrate the existence of higher-level glyph groups. The behavior of the glyph combinations may arise due to changes in a glyph caused by its neighbor.

A preprint of the paper is available to download: Glyph Combinations across Word Breaks in the Voynich Manuscript – Preprint

Body Rank Order: Refinement to initial [y]

In my recent post about the new word structure I’ve called Body Rank Order I said that the rule for rank order wasn’t settled and may need further refinement. The rank order should either increase or stay constant from left to right, but I felt that “stay constant” was wrong and not elegant enough as a model. I noted that only a minority of words needed the “stay constant” part of the rule to be correct, and some of those words were unusual in other ways.

A good portion of the “stay constant” words are unusual because they occur mostly in the linestart position. Words starting [ych] and [ysh] are the two types which I wish to discuss in this post as they show the relevant patterns most strongly.

In the Body Rank Order model words beginning [ych] and [ysh] would be split into syllables in this way: [y] would be the first syllable body in the word with the next syllable starting [ch] or [sh]. Thus [ycheey] is composed of the syllable bodies [y] and [cheey] while [yshey] is composed of the syllable bodies [y] and [shey]. Some words, such as [ycheol], have a coda on the second syllable or, as with [yshedy], have three syllables.

The key point is that both [y] and any syllable starting with a bench [ch, sh] are in rank 1. Thus all words beginning [ych] or [ysh] have a body rank order which stays constant. There are 16 such words with four or more tokens in the Voynich text, and they total 182 tokens, or 35% of all “stay constant” exceptions to the model.

We know that both [y] and syllables starting with a bench [ch, sh] must be in rank 1 as they often occur before rank 2 syllables, such as those starting [k, t]. Neither can be placed into a different rank without causing greater exceptions.

The linestart phenomenon, where the first glyphs of words found at the linestart differ statistically from those in the rest of the text, adds to the problem. About 80% of words beginning [ych] and 90% of those starting [ysh] occur at the start of lines. Thus they are neither normal in their structure nor their distribution.

The solution comes from resolving both problems at once. Were [y] removed then the words would be both structurally normal and statistically normal in their distribution. Indeed, according to Transformation Theory this is the likely scenario: all words beginning [ych, ysh] originally began [ch, sh], and some unknown process added [y] to the start of some words beginning [ch, sh] which were at the start of a line.

It must be borne in mind that not all words beginning [ych, ysh] occur at the start of lines, and that words beginning [ch, sh] are still found at the linestart. The point is simply that the environment for adding [y] was more prevalent in the linestart position, though it could be absent there and could be found elsewhere.

The key conclusion, however, is that the text of the Voynich manuscript must be “normalized” by removing the [y] from the start of words beginning [ych, ysh].
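As a minimal sketch of what such a normalization might look like on a plain transliteration (the function name and the regular expression are illustrative assumptions, not an established tool):

```python
import re

# Matches an initial [y] that is directly followed by a bench, [ch] or [sh].
Y_BEFORE_BENCH = re.compile(r"^y(?=ch|sh)")

def normalize(word):
    """Remove the initial [y] from words beginning [ych] or [ysh]."""
    return Y_BEFORE_BENCH.sub("", word)

print(normalize("ycheey"))  # cheey
print(normalize("yshey"))   # shey
print(normalize("ykeedy"))  # ykeedy (unchanged: no bench after [y])
```

The lookahead leaves every other word untouched, including words which begin with [y] followed by something other than a bench.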

(It may be that other “normalization” is needed elsewhere, and that the difference between the “transformed” text and the “normal” text is hindering decipherment efforts.)

Matching [o]

The New Word Structure I published last week relies heavily on older research, specifically the idea that [y] and [o] are equivalent, and that [y] can be deleted in some environments. Indeed, since formulating these ideas most of my text research has relied on them, either openly or implicitly. Much of what I believe about the text would become invalid, or at least seriously undermined, were these ideas to be proven wrong.

The worry I have is that the separate ideas have been pieced together and extended over time (for example, [y] deletion after [ch, sh] as well as [e]) despite being used for a single goal. The goal is to show that [y] and its various forms and expressions are a match for [o], and therefore that [o] and [y] are in the same “class” of glyph.

Although we expect all glyphs to have different frequencies in detail, certain glyphs, such as [ch, sh] or [k,t], share the same distributions. That is, even though they occur more or less in some positions, they appear to be valid in the same positions. My expectation is that [y] is valid in the same positions as [o] once all its forms have been included, but I don’t believe I have ever properly tested this.

Mapping and Matching

To test if the distribution of [y] matches that of [o] I first made a table of possible trigrams containing [o] and their token counts. Each trigram consisted of [o] with one of the common 22 glyphs or space before and after it. This gave 529 combinations, ranging from the unlikely [noq] (0 tokens) to the very common [.ok] (2481 tokens). (The full stop/period here denotes a space or word break.)
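A table of this shape can be built as sketched below. The 22 glyphs plus the word break give 23 × 23 = 529 slots. The list of neighbouring glyphs is passed in as a parameter because the exact set of 22 is not spelled out here, and both the glyph-splitting expression and the toy word list are assumptions for illustration.

```python
import re
from collections import Counter
from itertools import product

# Benches and bench gallows are kept as single glyphs.
GLYPH = re.compile(r"ckh|cth|cfh|cph|ch|sh|.")

def o_trigram_table(words, neighbours):
    """Token counts for every (before, after) pair around [o]; "." is a word break."""
    slots = list(neighbours) + ["."]
    # Start every combination at zero so that unattested trigrams are visible.
    table = Counter({pair: 0 for pair in product(slots, repeat=2)})
    for word in words:
        seq = ["."] + GLYPH.findall(word) + ["."]
        for before, mid, after in zip(seq, seq[1:], seq[2:]):
            if mid == "o" and (before, after) in table:
                table[(before, after)] += 1
    return table

# Toy example with a reduced glyph set.
toy = o_trigram_table(["qokol", "chol", "okol"], ["q", "k", "ch", "l", "d", "y"])
print(toy[("k", "l")])   # 2
print(toy[(".", "k")])   # 1
```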

The reason for the trigram is that various expressions of [y] depend on the glyphs which come before or after. So [a] depends on the following glyph while a null expression/deletion relies on the preceding glyph.

The trigram table shows the actual environments in which [o] occurs and where it does not. The expressions and forms of [y] can then be mapped to the table, slowly building up the coverage and showing us if it doesn’t match any key parts of the distribution of [o].

After [q] ~ 21% of occurrences

We can start with the approximately 21% of [o] occurrences which definitely can’t be matched by [y]. Nothing can replace [o] after [q] and it’s universally acknowledged that this particular bigram (and all the trigrams which contain it) is its own thing. Some would go as far as to say that [qo] is a digraph.

I feel there’s evidence that [q] is specifically added to words starting [o]. This would make this particular distribution more about how [q] works rather than [o]. It is enough to say that nobody expects [o] to be replaced by another glyph in this environment so we don’t have to worry about it.

Before [r, l, i, n, m] ~ 34%

The glyphs [r, l, i, n, m] are important as they have been identified as the main glyphs which [a] occurs before. Thus for all the occurrences of [o] in this environment we would expect them to be replaced not by [y] but [a].

If we look at some of the most common trigrams for [o] before these five glyphs we can see that [a] validly replaces [o] in all cases.

With [o] Tokens With [a] Tokens
[chor] 527 [char] 159
[chol] 804 [chal] 124
[eor] 465 [ear] 171
[eol] 1009 [eal] 140
[kor] 171 [kar] 572
[kol] 289 [kal] 631
[tor] 157 [tar] 378
[tol] 334 [tal] 391
[dor] 160 [dar] 814
[dol] 223 [dal] 756
[lor] 167 [lar] 87
[lol] 166 [lal] 90
[doi] 45 [dai] 2170
[chon] 4 [chan] 14
[chom] 29 [cham] 40
[eom] 35 [eam] 29
[lom] 25 [lam] 43

We can clearly see how the difference in frequency changes across the trigrams. The trigrams with [a] are sometimes more, and sometimes less, common than those with [o]. But in all cases they are no less valid, which is an excellent sign.

(I invite the reader to stop at this point and try to replace [o] with any other glyph and get results half as good.)

At the end of a word ~ 5%

At the end of a word [o] should always be replaced by [y]. The glyph [a] barely occurs in this position due to the lack of a conditioning environment.

The most common trigrams at the end of a word are shown below with their matches containing [y].

With [o] Tokens With [y] Tokens
[cho.] 183 [chy.] 959
[sho.] 184 [shy.] 270
[eo.] 436 [ey.] 3967
[do.] 37 [dy.] 6718
[ro.] 38 [ry.] 283
[lo.] 49 [ly.] 512

The glyph [y] is clearly much more common in the final position than [o]. We might have expected the large token count for [dy.] given its prevalence as a word ending. Yet [ey.] is clearly much more common also. It’s not a problem for our purposes, however, as we’ve shown that [y] can validly replace [o] in all these positions. Indeed, the greater issue is that some of the token counts for [o] are so low that they barely feel valid.

At the start of words before glyphs which don’t cause [a] ~ 25%

The start of a word is the other place where we would expect to see [y] replace [o]. We’ve already looked at [o] before [r, l, i, n, m] and we take that as including occurrences at the start of words. But [o] is also common before a few other glyphs in this position, as seen in the table below.

With [o] Tokens With [y] Tokens
[.oa] 79 [.ya] 15
[.och] 73 [.ych] 241
[.oe] 132 [.ye] 9
[.ok] 2481 [.yk] 631
[.ot] 2427 [.yt] 529
[.od] 245 [.yd] 63
[.os] 61 [.ys] 12

We’ve clearly run into a problem here. The trigrams [.ya, .ye, .ys] have far too low token counts to be valid. We might be able to forgive [.ya], as the two glyphs are technically two forms of [y], but the other trigrams are still wrong.

In neither case does there appear to be a solution. The trigrams [.ae, .as] are not only outside the expected rules for [a], but they’re also less common still. The trigrams together account for about 1% of [o]’s distribution: a small amount but still a gap.

In the middle of words after [e] but not before [r, l, i, n, m] ~ 6%

I think that this is the first really difficult part of my ideas for some to understand. We know that [y] is relatively uncommon in the middle of words, and yet [a] only occurs before some glyphs. What is the form of [y] used before other glyphs?

The answer I came up with is that [y] is deleted, or simply not expressed, when it should follow [e]. So if we imagine [okeody] to be made from the words [okeo] and [dy], then [okedy] is made from the words [okey] and [dy]: neither [okeydy] nor [okeady] exists.

The four most common trigrams demonstrate the match well, with “null” simply meaning that [y] is not expressed.

With [o] Tokens With null Tokens
[eok] 116 [ek] 491
[eot] 54 [et] 214
[eod] 926 [ed] 5004
[eos] 235 [es] 428

Again, there are a few rarer trigrams with [e] which don’t work so well. The two most common are [eoy, eoa], which I wouldn’t expect to be common anyway due to the double [y].

In the middle of words after [h] but not before [r, l, i, n, m] ~ 6%

This environment is an extension of the one immediately above. There are many places where trigrams such as [chok] and [chod] occur and have no equivalent with [y] or [a]. So I extended the argument used for [e] to cover benches as well.

Interestingly, this extension predated the discovery that benches help determine the length of [e] sequences. The outcome of that discovery is the hypothesis that benches could contain a “captured” [e], meaning that the environment following a bench is the same as following [e]. The use of [h] is to highlight the fact that the same is possibly true of bench gallows.

Below are some common examples.

With [o] Tokens With null Tokens
[chok] 244 [chk] 160
[chot] 179 [cht] 84
[shok] 64 [shk] 54
[shot] 35 [sht] 19
[chod] 361 [chd] 815
[chos] 83 [chs] 108
[shod] 144 [shd] 181
[shos] 19 [shs] 23
[ckhod] 16 [ckhd] 36
[cthod] 49 [cthd] 30

Despite this environment being an extension of the argument for [o] after [e], it works perfectly well. The same issue arises with trigrams with [y] and [a], which is once again expected.
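The environments covered so far can be gathered into a single lookup. The function below is only a sketch of my reading of the order in which the rules apply: [q] first, then the glyphs which cause [a], then word edges, then deletion after [e] or a glyph ending in [h]. The labels it returns are made up for illustration.

```python
A_TRIGGERS = {"r", "l", "i", "n", "m"}

def expected_form(before, after):
    """Which form of [y] is expected to match [o] between `before` and `after`.

    Glyphs are strings such as "ch" or "ckh"; "." stands for a word break.
    """
    if before == "q":
        return "o-only"   # nothing is expected to replace [o] after [q]
    if after in A_TRIGGERS:
        return "a"        # [o] is matched by [a] before [r, l, i, n, m]
    if after == "." or before == ".":
        return "y"        # [y] itself at the end or start of a word
    if before == "e" or before.endswith("h"):
        return "null"     # [y] not expressed after [e], benches, bench gallows
    return "residue"      # everything not covered by the rules above

print(expected_form("ch", "r"))   # a       ([chor] ~ [char])
print(expected_form("e", "."))    # y       ([eo.] ~ [ey.])
print(expected_form("e", "d"))    # null    ([eod] ~ [ed])
print(expected_form("ch", "k"))   # null    ([chok] ~ [chk])
print(expected_form("t", "k"))    # residue ([tok])
```

Anything labelled "residue" falls outside the environments described up to this point.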

Everything else ~ 4%

This is the nub of the problem outlined at the start: my attempt to match [y] with [o] has not been systematic. I’ve been unsure of how well [y], [a], and null cover the whole of [o]. This 4% represents everything I’ve overlooked and it needs to be addressed.

To start with, about 0.5% of “everything else” is composed of really rare trigrams. The trigram [doa] has 8 occurrences. It might not be at all normal for [o] and I shan’t try to explain how [y] covers it.

However, some trigrams are common enough that they deserve addressing. Below are the nine most common, all with twenty tokens or more. I’ve added columns for their matches with [y], [a], and null, just so we can see if any of the existing explanations work.

With [o] Tokens With [y] Tokens With [a] Tokens Null Tokens
[tok] 23 [tyk] 4 [tak] 4 [tk] 1
[kod] 88 [kyd] 18 [kad] 8 [kd] 11
[tod] 112 [tyd] 23 [tad] 4 [td] 20
[dod] 24 [dyd] 22 [dad] 9 [dd] 23
[sod] 32 [syd] 4 [sad] 1 [sd] 32
[sos] 20 [sys] 0 [sas] 4 [ss] 6
[rod] 68 [ryd] 4 [rad] 4 [rd] 43
[lod] 51 [lyd] 4 [lad] 1 [ld] 452
[los] 21 [lys] 3 [las] 4 [ls] 162

Well, this is a bit of a mixed bag! It’s clear that there is no single pattern. That’s okay, as this is basically a residue category for [o]. There’s nothing particular which brought all these trigrams together other than the fact that they weren’t already mapped to another form of [y].

We’ll have to take these trigrams as groups according to what looks like the best answer.

[tok] and [sos]: There doesn’t seem to be a great answer for these two. Neither is very common itself, but all the possible matches are poor. While [tok] isn’t a huge problem as it’s quite unusual in itself (two gallows in a single word), the trigram [sos] seems more normal and should have a similar solution to [sod] or [los].

[kod], [tod], and [dod]: These three seem as though two or even three of the possibilities might exist. Why is that? Could it be that the writer didn’t know what to do with [y] in these environments?

[sod], [rod], [lod], and [los]: All these clearly prefer the “null” version. This is a very interesting result as the deletion of [y] was built upon the idea that [e] (or a “captured” [e]) might somehow stand in for the missing [y].

Yet there is something more worth noting. Three of the trigrams with [o], [sod], [rod], and [lod], are more common than expected in words at the start or end of line. Similarly, [sd], [rd], [ld], and [ls] are also more common than expected in words in these positions.

This could be a sign that, at least for these four trigrams, the match between the [o] and null versions is the right one.


I’m still not convinced that I have my idea about [y] matching [o] completely nailed down. There seems more that needs to be said, or something which I’m missing. Yet the majority of the distribution of [o] does seem to be covered within existing arguments.

Even if in a few places, such as with [.oe] and [.os], it does seem to fail, and the residue is quite confusing, there’s still much more good about the idea than bad. Maybe there is something which will tie all the pieces, including the gaps, into a whole. There needs to be an underlying reason why [y] becomes [a] in some places or is simply not expressed in others.

I definitely haven’t found that reason yet, but I’m willing to stand by my hypothesis while I look for it.

A New Word Structure

A few years ago I published a couple of posts regarding the low level and high level structure of Voynich words. While I think I got a number of things right I have had to revise the low level structure since that post, and now it’s time to revise the word structure completely. I believe there is a much simpler way to express how Voynich words are structured than before.

Before I start I want to mention the work of Jorge Stolfi, specifically his Grammar for Voynichese Words and Prefix-Midfix-Suffix Decomposition of Voynichese Words. Although my work is not directly built upon his the reader will notice some similarity. I want to acknowledge that I’ve found his work an inspiration from the very earliest day of reading about the manuscript.

The main difference between Stolfi’s work and my own is the unit of analysis. Stolfi analysed the structure of words according to categories of glyphs. I have based my analysis on “syllables” – regular divisions of words according to some basic rules. I will first need to explain a little bit about syllables and how my process of division works.

What is a Syllable?

Although most readers will know what a syllable is, it’s useful to explain a couple of terms that I will use. You can think of a syllable as a string or bundle of sounds uttered close together: a vowel (or vowel-like sound) and optional consonants pronounced before or after it. When speaking about the components of a syllable the vowel is called the nucleus, the consonants before it the onset, and the consonants after it the coda.

Here’s a simple diagram from Wikipedia (where the sigma sign stands for ‘syllable’):

The components can be grouped in different ways. So the nucleus and coda together are called the rhyme, while the onset and nucleus can be called the body. For our purposes the model of body + coda will be the most relevant.

Finding Syllables

The basis for dividing words into syllables is the identification of [o] and [y] as key glyphs. They occur very commonly in most words and mostly in the same positions as each other. We can consider them as being the nucleuses of our syllables, whether or not we wish to consider them as vowels.

(While it is not necessary to believe in the equivalence of [y] and [a] or in [y] deletion to follow this word structure, I have written it from the perspective of those two hypotheses being true. For every mention of [y] it must be understood to represent [y], [a], and those locations where [y] is missing such as after [e] sequences or benches [ch, sh].)

We divide up a word by first finding all the occurrences of [o] and [y]. There are as many syllables as occurrences of these glyphs. Each syllable consists of the [o] or [y], all the glyphs before it until another [o] or [y] or the start of the word is reached, and all the glyphs after it if it is the final [o] or [y] in the word.

Let us take an example: [qokedar] (8 tokens). First we restore the missing [y] after [e] and change [a] to [y], giving: [qokeydyr]. Then we apply the syllable division process outlined above: there are three occurrences of [o] and [y] in the word, giving the syllables [qo] + [key] + [dyr].

It should be noted that [qo] and [key] are syllables consisting of bodies only, and only [dyr] has a coda. Indeed, the process of syllable-finding makes it impossible for there to be a coda except at the end of words, as glyphs are always assigned to a following syllable where possible.
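The division process described above can be sketched in a few lines. This is a simplified illustration rather than the procedure I followed by hand: the function names are invented, and the restoration step only handles a missing [y] after an [e] sequence and the change of [a] to [y], leaving out the restoration after benches.

```python
import re

def restore(word):
    """Restore [y] after an [e] sequence and rewrite [a] as [y]."""
    # Insert [y] after a run of [e] that is not already followed by [o], [y] or [a].
    word = re.sub(r"(e+)(?![eoya])", r"\1y", word)
    return word.replace("a", "y")

def syllabify(word):
    """Split a word into syllables, one per occurrence of [o] or [y]."""
    restored = restore(word)
    # Each piece runs up to and including the next [o] or [y].
    pieces = re.findall(r"[^oy]*[oy]", restored)
    if not pieces:
        return []  # no [o] or [y]: the word cannot be syllabified
    # Glyphs left over after the final [o] or [y] join the last syllable as its coda.
    tail = restored[sum(len(p) for p in pieces):]
    pieces[-1] += tail
    return pieces

print(restore("qokedar"))    # qokeydyr
print(syllabify("qokedar"))  # ['qo', 'key', 'dyr']
print(syllabify("choldy"))   # ['cho', 'ldy']
```

Because every piece ends at an [o] or [y], a coda can only ever appear on the final syllable, which is the behaviour described in the paragraph above.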

This problem can arise with other languages, and knowing whether a consonant belongs to the coda of one syllable or the onset of the other is sometimes a difficult judgement. Phonotactic rules can be helpful to resolve the issue. If we think of the word carpet we know it has two syllables, yet we also know that syllables don’t start /rp/ so the second syllable might be /et/ or /pet/, but definitely not /rpet/.

Though we don’t know the phonotactic rules for the Voynich text we can use a similar judgement to check if the syllables we find are valid. A word such as [choldy] (10 tokens) looks more like [chol] + [dy] than [cho] + [ldy], though the syllable-finding process would give the latter. However, we can find a few words starting [ld], showing that syllables can start with this cluster (indeed, [ldy] is the most common with 25 tokens).

Only in a handful of cases do we feel the need to say that a non-final syllable has a coda, and most of those concern [i], [n], [m] not being at or near the end of a word. Thus we can consider the placement of codas at the end of words not to be an issue. Final syllables with codas will be considered along with other syllables as bodies (onset and nucleus) separate from their codas. Codas will be addressed toward the end of this post.

Permissible Bodies

For the purposes of this structural analysis I took all words with four or more tokens (1109 types) and divided them into syllables by hand. The number of syllables in each word type ranged from 1 to 4, though a few words could not be split into syllables. The table below gives the number of different word types by syllable length:

Number of Syllables Word Types %age
0 syllables 29 2.6%
1 syllable 326 29.4%
2 syllables 614 55.4%
3 syllables 136 12.3%
4 syllables 4 0.4%

We can see straight away not only that the majority of words can be split into syllables with this process, but also that the number of syllables falls into a narrow range of 1 – 3.

Moreover, the division of Voynich words into syllables gives us a structurally coherent set of syllable bodies. This is partly due to the way the process works, such as inserting missing [y], but the structure of syllables is relatively simple. The number is also relatively small. Only 135 different syllable bodies were found in the words I analysed, and I would not expect an analysis of all words to expand this significantly.

Below is a list of the twenty most common bodies and the number of word types they occur in:

Syllable Body Word Types
dy 265
o 254
qo 144
y 138
ky 68
cho 56
ty 53
chy 51
ry 39
chey 37
ly 35
to 28
key 27
ko 25
shey 24
keo 23
sho 23
cheo 21
keey 21
lky 19

Body Rank Order

Up to now most of the things in this post have been previously discussed. What follows is a new theory which describes the structure of Voynich words. It is based on syllable bodies being ordered within words in predefined ways.

We saw above that 97% of words in the Voynich text have 1 – 3 syllables. Therefore we can say that there are up to three “slots” into which syllable bodies fit within words. Some syllable bodies, such as [qo, o, dy], are highly positional. We can guess beforehand which slot these bodies are likely to occur in for any given word.

In fact, every syllable body has a value which determines its position within a word. The slots must be filled in a particular order for the word to be valid. I call this theory “Body Rank Order” and it has a simple set of rules:

  1. Each syllable body has a Rank of 1, 2, or 3.
  2. Each word has a number of slots equal to its number of syllables.
  3. From left to right the Rank of each syllable body must increase (or stay constant).

The ranks for all syllable bodies which occur in five or more word types are as below:

Rank 1 Rank 2 Rank 3
cheeo, cheey, cheo, chey, cho, chy

qo, o, y

sheey, sheo, shey, sho, shy


ckhey, ckhy, cthey, ctho, cthy

do, lo, ro, sy


kchey, kcho, kchy, keeey, keeo, keey, keo, key, ko, kshey, kshy, ky

lchey, lchy, lkeey, lky, lshey

pchey, pcho, pchy, po, py

tcheo, tchey, tcho, tchy, teeo, teey, teo, tey, to, ty






Not all combinations of bodies from ranks 1, 2, and 3 occur and this table should not be taken as a guide to simply construct words. It may also be that one or two of these syllable bodies might belong better in a different rank dependent on future research ([so] is the most likely to need moving). Yet we can clearly see that the ranks are quite coherent: all syllable bodies starting with a bench [ch, sh] are in Rank 1, and all syllables with a gallows are in Rank 2.

The Body Rank Order fits Voynich words well, with no issues for the 350 most common words. The most common word which breaks the rules is [qokechy] with 13 tokens, having a rank order of 121. The next is [dalor] with 8 tokens, having a rank order of 32.

Of course, every one syllable word fits the theory perfectly, but the statistics for words with two or more syllables are still good:

Syllables Types Disordered %age
2 614 6 1%
3 136 5 4%
4 4 4 100%
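The percentages in the table can be recomputed from the two count columns. The snippet below is a small sketch using only the figures in the table; the variable and function names are my own.

```python
# (word types, disordered word types) per syllable count, from the table above.
TABLE = {2: (614, 6), 3: (136, 5), 4: (4, 4)}


def disordered_percent(types, disordered):
    """Share of word types whose rank order breaks the rules, in percent."""
    return 100 * disordered / types


for syllables, (types, disordered) in TABLE.items():
    print(syllables, f"{disordered_percent(types, disordered):.1f}%")

# Two and three syllable words taken together.
combined_types = TABLE[2][0] + TABLE[3][0]
combined_disordered = TABLE[2][1] + TABLE[3][1]
print(f"{disordered_percent(combined_types, combined_disordered):.1f}%")
```

Before rounding, the shares are about 1.0% and 3.7%, and two and three syllable words together come to 11 disordered types out of 750, or about 1.5%.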

All four syllable words are “disordered”, but they’re so rare that it is hardly an issue for the theory. It seems that four syllable words are unusual in multiple ways.

For two and three syllable words, the problems are caused by just a few errors:

  1. [chy] and [y] coming at the end of a word rather than the start.
  2. [lo] being out of order.
  3. A coda coming in the middle of a word.

This last error hasn’t really been mentioned but can be summarised. We can assign the rank of 4 to a coda as it almost always comes at the end of a word. Where it occurs in the middle of a word an error is thus found. The two word types in question are [daiidy] (6 tokens) and [dairal] (4 tokens).

Conclusion and Next Steps

Taking into account words which couldn’t be syllabified (2.6%), words with more than three syllables (0.4%) and words with rank errors (1%) we can see that the Body Rank Order theory accounts for 96% of all word types with more than four tokens.
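The 96% figure is simple subtraction, sketched below. It assumes the three categories do not overlap, which the text implies but does not state outright.

```python
# Percentages of word types (with more than four tokens) quoted in the text.
not_syllabified = 2.6
over_three_syllables = 0.4
rank_errors = 1.0

# Assumption: no word type is counted in more than one category.
accounted_for = 100 - (not_syllabified + over_three_syllables + rank_errors)
print(f"{accounted_for:.0f}%")
```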

The theory is thus highly successful yet still very simple. It builds on previous research concerning [y] while presenting a new field for further research. It reveals a word structure which demands further investigation.

The theory could be further refined, as mentioned above with regard to [so]. This body is different as it mostly occurs at the start of lines, where different rules seem to apply, and a few others are likely the same.

It might be worth also calculating the rank orders for all word types, no matter how many tokens, to discover how well it describes unusual words. It is likely that unusual words are unusual for a reason, as with four syllable words.

Lastly, I stated above that the rank order within words should increase from left to right, or stay constant. I found that the most common words always showed an increase from left to right. The most common word which didn’t show an increase, but stayed constant, was [daly] (27 tokens).

The share of words whose rank stayed constant rather than increasing was always relatively small: 8% for two syllable words, and 5% for three syllable words. Together with the disordered words (1% and 4%), this means that in both cases 91% of words showed a strictly increasing rank order, rather than an error or a constant rank. It may be that with further refining of the model the remaining 9% can be brought down, thus simplifying and strengthening the theory.
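The 91% figure can be checked with a short calculation. This sketch assumes that the 8% and 5% count only words whose rank stays constant, and that the disordered words (1% and 4%, from the table earlier in this post) are counted separately.

```python
# syllables -> (constant-rank %, disordered %), both as quoted in this post.
SHARES = {2: (8, 1), 3: (5, 4)}


def strictly_increasing_percent(constant, disordered):
    """Share of words left once constant and disordered words are removed."""
    return 100 - constant - disordered


for syllables, (constant, disordered) in SHARES.items():
    print(syllables, strictly_increasing_percent(constant, disordered))
```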

Words Starting [e]

The glyph [e] doesn’t usually occur at the start (or end) of a word. Depending on the transcription you use the number of tokens beginning [e] might be 150 or fewer. Given that other glyphs contain a stroke which looks identical to [e] it seems there is a strong possibility that the marginal occurrence of [e] at the start of words is due to a transcription error.

In a list of words beginning with [e] I found that in exactly half the cases the glyph was followed by another [e], meaning that the word begins [ee]. It would be easy to misread [ee] for [ch] were the “crossbar” of the [ch] to be very faint. In most, though not all, cases switching the initial [ee] for [ch] would result in a valid word.
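The substitution test described above can be sketched roughly as follows. The word lists here are placeholders for illustration only, not counts from any transcription, and the function names are hypothetical. A real check would load the full vocabulary from whichever transcription you use.

```python
def bench_reading(word):
    """Return the word with an initial [ee] replaced by [ch], or None."""
    if word.startswith("ee"):
        return "ch" + word[2:]
    return None


def check_readings(e_initial_words, vocabulary):
    """Map each [ee]-initial word to its [ch] reading and whether it is attested."""
    results = {}
    for word in e_initial_words:
        candidate = bench_reading(word)
        if candidate is not None:
            results[word] = (candidate, candidate in vocabulary)
    return results


# Placeholder data, NOT taken from a transcription.
vocabulary = {"chol", "chey", "chor"}
e_initial_words = ["eeol", "eey", "ety"]
print(check_readings(e_initial_words, vocabulary))
```

Words with a single initial [e], such as [ety], are skipped, since the misreading in question only applies to [ee]. A match in the vocabulary only flags a candidate; the scans still have to be inspected to confirm the reading.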

I examined a number of words starting [ee] using the high quality scans provided by the Beinecke Library. These scans were not available when most of the transcriptions in use today were made. In almost all cases there was evidence that [ch] was the correct reading.

Thus I would conclude that most words starting [ee] are in fact errors, and that the existence of [e] in the initial position is even rarer than believed. Of the words starting with only one [e], virtually all are hapax legomena, meaning that they exist in a single token. This suggests that they could be writing errors.

However, the word [ety] occurs 7 times in the text according to my list. A visual inspection shows that 6 of these are correct readings: the [e] is not another glyph and the word is clearly separate from other words. Lacking contrary evidence I would consider [ety] a valid word and worthy of further analysis.

Yet we see that [ety] is not alone. Another 16 words start [et], and the total number of tokens starting [e] plus any gallows is over forty. Given that we have dismissed most words starting [ee] as misreadings, this [e] + gallows combination accounts for most of the words beginning with [e]. (Though a visual inspection shows that some of these are misreadings too.)

It is still a very small number, and [e] really must be considered as a glyph which doesn’t usually occur at the start of a word. But this small class of words might itself be valid, if rare. And I wonder if we are looking at a pattern or structure which isn’t strictly a rare part of the Voynich language, but rather belongs to another “language” which occurs infrequently as a “foreign” aspect, like borrowed words?

We can see other examples throughout the text involving all glyphs. Usages which seem different from the norm, common enough not to be writing mistakes but uncommon enough that they don’t feel normal.

Similarities between [e] and [i]

Although I am not a subscriber to Cham’s Curve-Line System, in which he proposes that the strokes which look like [e] and [i] are fundamental to the Voynich script and text, there is an underlying nature of those strokes which is unexplained.

Consider some basic points, true for both the [e] and [i] glyphs:

  1. They can freely occur in sequences of two or more. Other glyphs are rarely doubled and never tripled.
  2. They rarely occur at either the start or end of words.
  3. Many other glyphs seem to be composed of [e] or [i] with another stroke added.
  4. They rarely occur before glyphs containing each other. So [e] is unlikely to come before a glyph containing [i] and [i] is unlikely to come before a glyph containing [e].
  5. They appear in the same word less than might be expected.

Yet, of course, they appear in totally different “slots” in words. Apart from a handful of places where they both occur (mostly after [o]) and a number of places where neither occurs, they’re almost in complementary distribution. I would be interested in discovering if any common words contrast [e] and [i], or whether they’re quite predictable on context alone.

A Note on Cyrillic

I tend to stay away from seeking specific languages or scripts to fit the Voynich text, preferring any solution to emerge from our understanding of the text rather than being imposed upon it. But I’ve recently noticed a few interesting points about the Cyrillic script that I wanted to share.

Naturally I don’t claim to be an expert on Cyrillic, and am happy to be corrected where wrong. I also do not want to suggest that the Voynich script is based on the Cyrillic script, but rather that there may be interesting parallels we can learn from. It may be that other researchers have already raised the same points.

The Relationship between Grove Words and Line Start Patterns

I’ve written before about Grove Words and what they might be, and also about the curious patterns that occur at the start of lines. There’s an obvious, but unanswered question of how these two phenomena—which both affect the first glyph in a line—might interact.

I want to present here, very briefly, a partial answer to this question.

The string [oa], when at the start of a word, is quite strongly associated with the start of a line. A count gives 78 occurrences of words starting [oa], of which 32 are at the start of a line.

This is similar to words starting [sa], which occur 509 times in all, 190 times at the start of a line. We know that words starting [a] are very uncommon at the start of lines, and some kind of transformation may be causing [a] to become [sa]. It could be that, in certain (unknown) situations, [a] becomes [oa] instead of [sa].

Words starting [Goa] (where [G] is any gallows) occur 22 times. Of those, 15 occur at the start of paragraphs. These should be considered part of the Grove Word phenomenon. It should also be noted that words starting [Gs] are very rare.
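To make the comparison between these counts easier, the shares can be worked out directly. This is only a convenience calculation on the figures quoted above; the labels and the `share` helper are my own, and without a baseline rate for words in general the percentages are merely indicative.

```python
# (occurrences in the given position, total occurrences) as quoted in the text.
COUNTS = {
    "[oa] at line start": (32, 78),
    "[sa] at line start": (190, 509),
    "[Goa] at paragraph start": (15, 22),
}


def share(part, total):
    """Percentage of occurrences found in the given position, to one decimal."""
    return round(100 * part / total, 1)


for label, (part, total) in COUNTS.items():
    print(label, share(part, total))
```

The line start shares for [oa] and [sa] come out close to each other, at roughly 41% and 37%, while about 68% of [Goa] words fall at the start of a paragraph.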

From these observations we can draw a few tentative conclusions: 1) that Grove Words and linestart patterns are distinct; 2) that they can both apply to a single word; and 3) that the line start patterns occur more ‘interior’ to a word and Grove Words are more ‘exterior’.

(It might be that words starting [Gy] show the same thing: 11 of 14 occurrences of words starting [py] are Grove Words, and words starting [y] are associated with the line start.)

(Also, the transcription for the first word on f29v is wrong: it is [koaiin] not [kooiin]. I’ve seen theories using the reading [kooiin], so it pays to check the transcriptions for yourself.)