The Relationship between Grove Words and Line Start Patterns

I’ve written before about Grove Words and what they might be, and also about the curious patterns that occur at the start of lines. There’s an obvious but unanswered question of how these two phenomena, which both affect the first glyph in a line, might interact.

I want to present here, very briefly, a partial answer to this question.

The string [oa], when at the start of a word, is quite strongly associated with the start of a line. There are 78 occurrences of words starting [oa], of which 32 are at the start of a line.

This is similar to words starting [sa], which occur 509 times in all, 190 of them at the start of a line. We know that words starting [a] are very uncommon at the start of lines, and some kind of transformation may be causing [a] to become [sa]. It could be that, in certain (unknown) situations, [a] becomes [oa] instead of [sa].
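To make the comparison concrete, the two sets of counts can be turned into proportions. The sketch below is only arithmetic on the figures quoted above; the helper name `line_start_share` is an illustrative choice, not part of any existing tool.

```python
def line_start_share(at_line_start: int, total: int) -> float:
    """Return the fraction of tokens that occur at the start of a line."""
    return at_line_start / total


# Counts quoted in the text above.
oa_share = line_start_share(32, 78)    # words starting [oa]
sa_share = line_start_share(190, 509)  # words starting [sa]

print(f"[oa]: {oa_share:.1%} of tokens are line-initial")
print(f"[sa]: {sa_share:.1%} of tokens are line-initial")
```

Both shares come out at roughly two in five, which is what makes the two word groups look alike.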

Words starting [Goa] (where [G] is any gallows) occur 22 times. Of those, 15 occur at the start of paragraphs. These should be considered part of the Grove Words phenomenon. It should also be noted that words starting [Gs] are very rare.

From these observations we can draw a few tentative conclusions: 1) that Grove Words and line start patterns are distinct; 2) that they can both apply to a single word; and 3) that the line start patterns occur more ‘interior’ to a word and Grove Words are more ‘exterior’.

(It might be that words starting [Gy] show the same thing: 11 of 14 occurrences of words starting [py] are Grove Words, and words starting [y] are associated with the line start.)

(Also, the transcription for the first word on f29v is wrong: it is [koaiin] not [kooiin].)


Length of [e] Sequences

The character [e] is one of only two characters in the Voynich script which is regularly found repeated. Both [ee] and [eee] occur enough times to be valid sequences. As all three typically occur in the same position within words, they can be generically thought of as ‘[e] sequences’ filling a specific structural slot.

Although we can guess from their similar appearance and position that [e] sequences are a valid ‘family’ within the wider script, why one length is chosen over another is unknown. It may be that they encode different sounds or meanings. It may also be that they are conditioned by their environment. They have also sometimes been linked to benches [ch, sh], which have a related appearance and are often neighbouring [e] sequences in Voynich words.

I don’t expect in this post to answer these questions, but I want to set out some statistics and thoughts which seem pertinent.

Jorge Stolfi, in his page outlining his Grammar for Voynichese Words, provides some interesting tables for [e] sequences. In words which don’t contain a gallows the following counts for the different lengths are found:

[e]: 68
[ee]: 185
[eee]: 90

We can clearly see that [ee] is the most common, followed by [eee] at about half as common, then [e], which is only a little over a third as common as [ee]. However, these numbers don’t include those [e] sequences following a bench character, which have quite different counts (let [B] stand for any bench, [ch, sh]):

[Be]: 3851
[Bee]: 917
[Beee]: 24

The counts are generally much higher, but the ratios are totally different. The most common is [e], [ee] is only a quarter as common, and [eee] less than 1% as common.

The difference in the ratios is also found in [e] sequences which follow gallows (let [G] stand for any non–bench gallows, [t, k, p, f]):

[Ge]: 2160
[Gee]: 2339
[Geee]: 189

[GBe]: 1102
[GBee]: 101
[GBeee]: 2

Although neither of these two sets of token counts matches those above for the environment without gallows, their general direction is the same. Those [e] sequences without a bench before them have high [ee] and significant [eee]. Those [e] sequences following a bench have low [ee] and insignificant [eee].
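The contrast between the four environments is easier to see as shares than as raw counts. The following is a small sketch that only re-expresses the counts quoted above; the environment labels and the function name `shares` are illustrative.

```python
# Token counts quoted in the text, keyed by environment.
# "none" = no gallows and no bench, "B" = after a bench,
# "G" = after a gallows, "GB" = after a gallows plus a bench.
E_COUNTS = {
    "none": {"e": 68, "ee": 185, "eee": 90},
    "B": {"e": 3851, "ee": 917, "eee": 24},
    "G": {"e": 2160, "ee": 2339, "eee": 189},
    "GB": {"e": 1102, "ee": 101, "eee": 2},
}


def shares(counts: dict) -> dict:
    """Convert raw counts into each length's share of the environment total."""
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()}


for env, counts in E_COUNTS.items():
    row = ", ".join(f"[{k}] {v:.1%}" for k, v in shares(counts).items())
    print(f"{env:>4}: {row}")
```

On these figures single [e] takes more than four fifths of the tokens in both bench environments, while [ee] is the largest share in both environments without a bench.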

I performed some follow–up tests, breaking the figures down with specific gallows characters and with following characters such as [y, o, d]. In each case the same pattern emerged: all other things being equal, [e] sequences are shorter after [ch, sh].

The same is observed for bench gallows [cth, ckh], suggesting that they create the same environment as [ch, sh]. This chimes with earlier observations on bench gallows and their possible relationship with [ch, sh], namely that they aren’t followed by [ch, sh].

Lastly, I want to mention an old observation by Currier, that [p, f] are never followed immediately by [e] sequences. They are, however, followed by [ch, sh], which can then be followed by [e] sequences. The [e] sequences in these cases adhere to the same pattern as for gallows as a whole: [ee] is much less common than [e].

I don’t know how to explain all the above observations. There is likely much more to add before we have the full picture. For example, most words won’t begin or end with [e], but what can we say about the few which do? What observations can we make when a word contains two separate [e] sequences?

Bench characters are clearly an important environment for [e] sequences. Why they seem to govern the length of [e] sequences is unknown, but worth taking up as an idea for future investigations.

What can we say about [q]?

I want to quickly sum up my last three posts to give my opinion on [q]. We already know that it doesn’t often occur in labels, but does so in the main text. We also know that it occurs almost always at the start of a word and before [o].

I think what I have seen from the last few posts is that words beginning with [q] have two particular relationships. The first is that a word starting [q] must have a valid counterpart word starting [o], and it is not enough for the plain word without either [q] or [o] to exist. Conversely, even if the [o] form exists, that does not mean that the [q] form will.

This suggests that, whatever the meanings of [q] and [o] are, they are separate characters. A plain word such as [kaiin] has [o] added to it to form [okaiin], and then [q] added to form [qokaiin]. On this view [qo] is not a valid prefix in its own right.

Second, the character which comes immediately after [q] and [o] conditions the relative frequencies of these two prefixes. So a word starting [ch] will have low levels of both, whereas a word starting [t] will have relatively high levels of both. The words [chy] and [tchy] have very different patterns of the two prefixes.

If we believe that characters stand for sounds then it is hard to see how [q] and [o] can be grammatical. Words should not belong to grammatical categories according to their initial sounds (though there are exceptions). The alternative is that the process of adding [q] and [o] is phonological.

It occurs to me that the first characters of plain words most likely to take [o] are [k, t, l, r]. These are the same characters which are most consistent in acting as ‘strong’ characters at the start of words. They also show positive attraction to appearing after words ending [o]. We have already seen this from another angle with regard to [o] forms after words ending with ‘strong’ characters.

However, the problem is that the further addition of [q] makes little sense. It only seems to occur frequently on those plain words which begin [k, t]. As the sequence [y.o] is actually quite acceptable in the language, it cannot be added just to prevent a weak–weak combination. (It turns out that only strong–strong combinations are discouraged.) Though the lack of [q] words in labels suggests that the preceding word provides an environment for its occurrence.

The value of [q] must also be connected with the identity of the gallows characters, yet words starting [lk] and [cth] have relatively low [q] form frequency ([ckh] seems to be higher).

It seems as though [q] is being used to create a phonological sequence which is preferred for some unknown reason.

Types of [qo] Triplets

My last two posts were about the relative frequencies of words prefixed with [o] and [qo]. Although there is some interesting information in those posts I don’t feel that it is well presented. After having examined and sorted the statistics again I want to propose a neater way of thinking about these words.

The core idea is that [o] is a prefix which can be added to the start of a word and that [q] is another prefix which can be added to words starting [o]. Thus every word is a member of a potential ‘triplet’ with a plain form, an [o] form, and a [qo] form. For example, [dol] would be a plain form with [odol] as its [o] form and [qodol] as its [qo] form. It is the relative frequencies of these three forms which interest us in this post.

Not all triplets have the same relative frequencies. Some have [qo] forms which are common, others have [qo] forms which are rare. The same goes for plain and [o] forms. Thus there are six possible combinations, depending on which form is higher (or lower) than the others:

Plain > [o] > [qo]
Plain > [qo] > [o]
[o] > Plain > [qo]
[o] > [qo] > Plain
[qo] > Plain > [o]
[qo] > [o] > Plain

I sorted all triplets with at least 30 tokens between all three forms (about 140 triplets) into these six groups. It should be noted that one group, [qo] > plain > [o], did not exist. Simply put, it is not possible to have a high frequency of [qo] forms without a relatively high frequency of [o] forms. Only one triplet had a [qo] form more than three times as common as the [o] form.
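The six-way grouping can be sketched as a simple ranking of the three counts. This is a minimal illustration rather than the procedure actually used for the sorting: the function name `ordering` and the tie-break rule are assumptions, as the text does not say how ties were handled. The example counts are those given for [dol], [odol] and [qodol] later in this post.

```python
from itertools import permutations

FORMS = ("Plain", "[o]", "[qo]")

# The six possible strict orderings of the three forms.
ALL_ORDERINGS = [" > ".join(p) for p in permutations(FORMS)]


def ordering(plain: int, o: int, qo: int) -> str:
    """Rank the three forms of a triplet from most to least frequent.

    Ties keep the order Plain, [o], [qo]; that tie-break is an assumption
    of this sketch.
    """
    counts = dict(zip(FORMS, (plain, o, qo)))
    ranked = sorted(FORMS, key=lambda form: -counts[form])  # stable sort
    return " > ".join(ranked)


print(ordering(117, 2, 1))  # counts for [dol], [odol], [qodol]
```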

Although these groups were a good starting point for classification of the triplets, they were too blunt in their discrimination between relative levels. For example, the triplet [raiin] had an [o] form with one more token than the plain form, while [lkedy] had a plain form with two more tokens than the [o] form. In reality, both of these had plain and [o] forms which were more or less equal.

Thus I further sorted these groups into four types which could be described fairly easily to show the relative frequencies. I will present the four types below.

Type I: 55 triplets. They all have high counts for the plain forms but low counts for both [o] and [qo] forms, which are mostly less than 10% of the plain form. The plain forms start with [a, o, y], [ch, sh], [d, s], and [cth] (but not [ckh]).

Type II: 42 triplets. The [o] form is the highest frequency form (or, for a few, nearly the highest). The plain and [qo] forms are less frequent, but with [qo] higher than plain. About half the plain forms begin with [t]; of the rest, thirteen begin with [k], three with [p], and a couple with [l] and [r].

Type III: 17 triplets. The [qo] form is the highest frequency form, followed strictly by [o] then by the plain form usually much lower. Fifteen of the plain forms start with [k], with one each for [e] and [t].

Type IV: 20 triplets. Both the plain and [o] forms have at least ten tokens, but the [qo] form is always less than 40% as frequent as either, and usually much lower. Fourteen of the plain forms start with [l], the rest with [r] except [m].

There were also six triplets which didn’t fit easily into any of these types. Three began with [ckh] in the plain form and one was a Grove word.
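As a rough illustration, the four type descriptions can be written as threshold tests. This is a simplified sketch, not the sorting actually performed: the descriptions use words like "mostly" and "usually", so the hard cut-offs below, the order of the checks, and the function name `triplet_type` are all assumptions.

```python
def triplet_type(plain: int, o: int, qo: int) -> str:
    """Assign a triplet to Type I-IV using simplified versions of the descriptions.

    Triplets near a boundary may well have been sorted differently by hand.
    """
    if o < 0.1 * plain and qo < 0.1 * plain:
        return "I"    # plain dominates; both prefixed forms are rare
    if qo > o > plain:
        return "III"  # [qo] highest, then [o], then plain
    if o >= max(plain, qo) and qo > plain:
        return "II"   # [o] highest, with [qo] above plain
    if plain >= 10 and o >= 10 and qo < 0.4 * min(plain, o):
        return "IV"   # plain and [o] both present, [qo] weak
    return "unclassified"


# [dol] 117, [odol] 2, [qodol] 1, as quoted later in this post.
print(triplet_type(117, 2, 1))
```

A sketch like this cannot reproduce the "nearly the highest" cases of Type II or the six triplets that fit no type, which is why it falls back to "unclassified".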


I hope it is clear that the first character of the plain form is a key indicator of how common the [o] and [qo] forms are, both in total and relatively. Only Type II overlapped significantly, with Types III and IV. Otherwise, it is possible to broadly predict, for any given word, how often it will be prefixed with [o] and [qo].

For example, think of the plain form [dol] which I mentioned earlier. We can see that the plain form begins with [d] and so it is part of a Type I triplet. Thus the [o] and [qo] forms should occur much less than the plain form. So [dol] has 117 tokens, whereas [odol] has 2 and [qodol] only 1.

[qo], [o], and the Root Word

In my last post I spoke about the relationship between [q] and [o], given that one character is almost invariably followed by the other. In this post I want to talk about the root word, the one which begins after [qo] and [o] and how it affects the frequency of those prefixes.

For the sake of this post I used the same word triplets which I used in the last: three words which are the same except that one form is prefixed with [o], one is prefixed with [qo], and one has no prefix at all (a null prefix). So, for example, [key, okey, qokey] would be one triplet. Each triplet has at least one valid form (which I take to mean having five or more tokens), though some forms of a triplet may not occur at all.
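The construction of triplets described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `build_triplets` is hypothetical, the input is assumed to be a `Counter` of word tokens from some transcription, and any word not starting with "o" or "qo" is treated as a root. The counts in the usage example are invented toy values, not figures from the manuscript.

```python
from collections import Counter

VALID_THRESHOLD = 5  # "five or more tokens", as stated in the text


def build_triplets(tokens: Counter) -> dict:
    """Map each root word to its (null, [o], [qo]) token counts.

    A word is treated as a root if it does not itself start with "o" or
    "qo"; that simplification is an assumption of this sketch.
    """
    roots = set()
    for word in tokens:
        if word.startswith("qo"):
            roots.add(word[2:])
        elif word.startswith("o"):
            roots.add(word[1:])
        else:
            roots.add(word)
    roots.discard("")  # the bare words "o" and "qo" leave no root

    triplets = {}
    for root in roots:
        counts = (tokens[root], tokens["o" + root], tokens["qo" + root])
        # Keep the triplet only if at least one form reaches the threshold.
        if max(counts) >= VALID_THRESHOLD:
            triplets[root] = counts
    return triplets


# Toy counts, invented for illustration only.
toy = Counter({"key": 7, "okey": 6, "qokey": 9, "dal": 2, "odal": 1})
print(build_triplets(toy))
```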

I’ll present a narrative description of the results, sorted by the first character of the root word. Some characters had too few valid words to be included.

[ch, sh]

The bench characters both showed very low frequencies of [o] forms compared to null forms, with none greater than about 3%. The highest ratio was [chy] with 155 tokens and its [o] form [ochy] with 5 tokens. The highest count was [chedy] with 501 tokens and its counterpart [ochedy] with 8 tokens. The [q] forms were even rarer.

Even though some individual [qo] and [o] forms rose above the validity threshold, most did not, and it is possible to say that the overall pattern seems to be invalid. That is, words beginning [ch, sh] do not have [o] and [qo] forms.

[a, y]

The picture for [a, y] is much like the above for [ch, sh]. However, the frequency ratio peaked at about 6% [o] and [qo] forms. Some of the counterparts, such as [oaiin] and [qoaiin], with 26 and 23 tokens respectively, are clearly valid words, but still relatively rare compared to the null form [aiin] with 470 tokens.
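The percentages in this and the previous section are simply the prefixed form's count divided by the null form's count. A short sketch using only counts quoted in the text (the function name `relative_frequency` is illustrative):

```python
def relative_frequency(prefixed: int, null: int) -> float:
    """Prefixed-form tokens as a fraction of null-form tokens."""
    return prefixed / null


# (prefixed form, its count, null form, its count), all quoted in the text.
examples = [
    ("ochy", 5, "chy", 155),
    ("ochedy", 8, "chedy", 501),
    ("oaiin", 26, "aiin", 470),
    ("qoaiin", 23, "aiin", 470),
]

for form, n, null_form, n_null in examples:
    print(f"[{form}] / [{null_form}] = {relative_frequency(n, n_null):.1%}")
```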

These words can be considered to marginally take [qo] and [o] prefixes.


[r]

Most of these words had moderate levels of [o] forms, ranging from 30% to 130% relative frequency. The main exception was [r], for which the [o] form, [or], was obviously extremely common. However, no [r] word had a valid [qo] form except [qor].

It seems as though [o] is a valid prefix for these words, but [qo] is not.


[d]

The relative frequencies for [o] and [qo] forms went up to about 17%, with the [o] forms being always a little more common than the [qo] forms. The high frequencies of the null forms mean that many of the [o] and [qo] forms have moderate to high counts, with [odaiin] having 61 tokens.


[l]

The [o] forms had strong relative frequencies, ranging from 32% to over 400% for [oly] ([ol] was even higher but, as with [or], is really an exception). All valid null forms beginning [l] produced valid [o] forms.

However, the rates for [qo] forms were much lower, with all being under 50% of null frequency and more than half under 10%. Only seven [l] words (of 35) produced valid [qo] forms. Ten produced no [qo] counterparts at all.

It is clear that [o] forms are valid for [l], but [qo] forms are mostly not.

[ckh, cth]

There were too few examples of these words to get good results. However, it seems as though they mostly take [o] and [qo] prefixes in small but valid amounts. The exceptions were [ckhol, cthol], which only had one [o] counterpart each.


[k]

All but four (of 37) null forms had [o] forms at parity or higher, with the highest being [okal] at 600% of [kal]. Similarly, all but eight had [qo] forms at parity or above. In nearly half the cases the [qo] form was more frequent than the [o] form, and several others were nearly at parity.

The [o] and [qo] forms for [k] words have strong frequencies, with the exceptional phenomenon of [qo] often being stronger.


[t]

All but four (of 28) null forms had [o] forms at parity or higher, with the highest being [otam] with 47 tokens compared to 5 tokens for [tam]. Nine null forms had [qo] forms below parity, with the rest higher. For both [o] and [qo] forms the lowest relative frequencies were still healthy at 75% and 33% respectively.

However, unlike [k], the [qo] forms were typically less common than the [o] counterparts. Only three [qo] forms were at or above parity with [o]. Many were less than half as common, with three dropping below the validity threshold.

It seems that while both [o] and [qo] are valid prefixes, [o] is much more common.


[p]

Most words had a moderate to strong frequency of [o] forms, though with a lower level of [qo] forms.

Overall, much like [t].


A number of different patterns were found regarding the influence of the root word’s initial character. I think they can be broadly put into three groups.

The characters [a, y, ch, sh] typically did not show a high frequency of [o] or [qo] forms. It is arguable whether these prefixes are really productive in the same way as for other words. It should be noted that, according to the strong–weak split, all these are weak characters, as is [o].

The characters [r, l] showed a good number of [o] forms, but with much lower, or non–existent, [qo] forms.

The characters [k, t, p] all showed a strong number of [o] forms and also high levels of [qo] forms. However, [k] had even stronger [qo] forms while [t, p] had somewhat lower numbers for [qo].

The characters [ckh, cth] were too few to typify, and [d] is difficult due to the high number of null forms.


I’m not sure what all this means. That’s hardly what you want to hear, but it’s the truth. The low frequencies of [o] and [qo] forms for root words beginning with weak characters could show that these prefixes have some connection to last–first combinations. This would certainly be my preferred reading.

Yet how do we explain the good presence of [o] but not [qo] for [l, r]? I note that, when we discussed last–first combinations and Transformation Theory, the words beginning [l] were those most likely to show a strong preference against the sequence [y.o]. It could be that [l] words avoid [y.o] by removing the [o], whereas gallows characters avoid the sequence by adding [q] to give [y.qo], but this is nothing more than a suggestion to be investigated.

The difference between [t] and [k], with one having a lower [qo] than [o] and the other a higher, reminds me of something we discussed a while ago. We talked about the distribution of [k] and [t], and noted that the strings [qoke, qoka] were much more common than [qote, qota]. Though it’s hard to know whether we’re seeing the same thing from two different angles, or if one is causing the other.

I feel as though this is another part of an emerging puzzle, and it’s not clear how the pieces should be fitted together.

The Relationship between [q] and [o]

The character [q] is the most stereotyped character in the Voynich script. It almost always occurs 1) at the beginning of a word, and 2) before the character [o]. It also mostly occurs after a word ending [y]. Thus its immediate context can be guessed in most cases.

Some have proposed, due to the character’s position at the beginning of a word, and its lack of occurrence in isolated labels, that it stands for a whole word (a morphogram), maybe a grammatical word like the ampersand: &. I think this is highly doubtful as we would expect it to occur before all kinds of characters and not only [o].

The morphogram proposal can be salvaged by suggesting that the two characters [qo] are an inseparable digraph, or some such. This would be much more reasonable as [qo] occurs before all kinds of characters. It still has a small weakness, in that [q] does occasionally occur before a few other characters, such as [e]. This, however, is not a fatal objection.

I wish here to present some statistics to show that [qo] may well be separable and thus the morphogram proposal has a poor foundation.

No [q] without [o]

We know that words beginning [q] must nearly always have [o] as the next character. However, the removal of the [q] will almost always result in a valid word. There are few exceptions to this and most are marginal.

I took every word beginning with [q] which had five or more tokens (I will treat five as the threshold for validity in this post), and counted the number of tokens for the same words without the initial [q]. So, for example [qokey] had the counterpart word [okey]. The list contained 123 word pairs.

Here’s what I found:

  • every [q] word with more than 15 tokens (48 of 123) had a valid counterpart beginning [o];
  • all but two of the [q] words with more than 10 tokens (73 of 123) had a valid [o] counterpart;
  • only 15 valid [q] words lacked a valid [o] counterpart, and 13 of those words had fewer than ten tokens;
  • only 4 had no instances of an [o] counterpart: one had nine tokens, the other three just five tokens.
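The pairing behind these figures can be sketched as below. It is a minimal illustration, assuming the word tokens are already available as a `Counter`; the function name `q_word_pairs` is hypothetical. The counts in the example are ones quoted in this post.

```python
from collections import Counter

VALID = 5  # validity threshold used in this post


def q_word_pairs(tokens: Counter) -> list:
    """List (q_word, q_count, counterpart, counterpart_count) for valid [q] words."""
    pairs = []
    for word, count in tokens.items():
        if word.startswith("q") and count >= VALID:
            counterpart = word[1:]  # the same word without its initial [q]
            # A Counter returns 0 for a word that never occurs.
            pairs.append((word, count, counterpart, tokens[counterpart]))
    return pairs


sample = Counter({"qokeedy": 305, "okeedy": 105, "qor": 23, "or": 366, "qokl": 9})
for row in q_word_pairs(sample):
    print(row)
```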

Here is the list of [q] words without valid counterparts, and the count for each:

With o     Count   With qo     Count
okl        0       qokl        9
otched     0       qotched     5
oeol       0       qoeol       5
ool        0       qool        5
ockhol     1       qockhol     7
okeechy    1       qokeechy    6
oked       2       qoked       7
okeedar    2       qokeedar    6
okeed      3       qokeed      15
okshedy    3       qokshedy    11
oor        3       qoor        8
okod       3       qokod       7
oeeey      4       qoeeey      7
opol       4       qopol       6
otshy      4       qotshy      5
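The table can be checked against the summary figures given earlier. The sketch below copies the table into a dictionary and recounts; the variable names are illustrative.

```python
VALID = 5  # validity threshold used in this post

# word with [qo]: (count of the [o] word, count of the [qo] word)
EXCEPTIONS = {
    "qokl": (0, 9), "qotched": (0, 5), "qoeol": (0, 5), "qool": (0, 5),
    "qockhol": (1, 7), "qokeechy": (1, 6), "qoked": (2, 7),
    "qokeedar": (2, 6), "qokeed": (3, 15), "qokshedy": (3, 11),
    "qoor": (3, 8), "qokod": (3, 7), "qoeeey": (4, 7), "qopol": (4, 6),
    "qotshy": (4, 5),
}

# [q] words whose [o] counterpart never occurs at all.
no_counterpart = [w for w, (o, _) in EXCEPTIONS.items() if o == 0]
# [q] words with fewer than ten tokens.
under_ten = [w for w, (_, q) in EXCEPTIONS.items() if q < 10]
# Every [o] counterpart in the table should fall below the threshold.
all_invalid = all(o < VALID for o, _ in EXCEPTIONS.values())

print(len(EXCEPTIONS), len(under_ten), len(no_counterpart), all_invalid)
```

The recount gives 15 exceptions, 13 of them with fewer than ten tokens and 4 with no [o] counterpart at all, matching the list above.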

The most interesting words on this list are [qoor] and [qool]. The word [oor] is rare and [ool] simply doesn’t exist, yet [or] and [ol] are very common. The string [oo] is not common itself, so we’re looking at a marginal set of words. Indeed, the counts for most of the 15 above are low and really only two or three stand out as exceptions.

Ratios of [q] to [o]

I further wish to show that the number of [q] words is limited by the number of its [o] counterparts.

Here are a few key stats, based on the same list of word pairs as before:

  • only four valid [q] words exist without any [o] counterparts (so their ratio cannot be calculated);
  • 72 of the 123 valid [q] words are less common than their [o] counterparts;
  • of the remaining 47 valid [q] words, only six occur at least three times as often as their [o] counterparts, and all six are words we have already listed above as exceptions without valid [o] counterparts.

So, in sum, all the words not given earlier as exceptions are either less common than their [o] counterparts or no more than three times as common. It should be noted that there is no absolute floor provided by [o] words, and that some words which occur commonly, such as [or] with 366 tokens, might have relatively low numbers of [q] counterparts, [qor] with only 23 tokens. But the opposite is not and cannot be true. The most extreme case is [qokeedy] with 305 tokens, whereas [okeedy] has 105 tokens.
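The ratio used here is the count of the [q] word divided by the count of its [o] counterpart, and it is undefined when the counterpart never occurs. A small sketch using the counts quoted above (the function name `q_to_o_ratio` is illustrative):

```python
from typing import Optional


def q_to_o_ratio(q_count: int, o_count: int) -> Optional[float]:
    """Return tokens of the [q] word per token of its [o] counterpart.

    Returns None when the [o] counterpart never occurs, because the
    ratio cannot be calculated in that case.
    """
    if o_count == 0:
        return None
    return q_count / o_count


print(q_to_o_ratio(305, 105))  # [qokeedy] against [okeedy]
print(q_to_o_ratio(23, 366))   # [qor] against [or]
print(q_to_o_ratio(9, 0))      # [qokl] against [okl]
```

The first ratio is a little under 3 and the second is well under 0.1, which shows how wide the range is below the ceiling while the ceiling itself holds.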

The Null Hypothesis

The proposal that [qo] is the prefix and not [q] alone suggests that words beginning [qo] should be linked with words with no (or null) prefix. So, for example, a word such as [qokey] is formed from [qo] plus [key] instead of [q] plus [okey].

Let’s look at the statistics for [q] words and their null counterparts:

  • every valid [q] word with 19 or more tokens (43 of 123) has a valid null counterpart;
  • all but five valid [q] words with 10 or more tokens (73 of 123) had valid null counterparts;
  • 29 valid [q] words lacked a valid null counterpart, and 24 of those words had fewer than ten tokens;
  • five had no instances of a null counterpart: two had eight tokens and three had six tokens.

I won’t bother providing a table, as you can see that the stats for the null counterparts are somewhat worse: the number of valid [q] words lacking a valid counterpart nearly doubles, from 15 words lacking [o] counterparts to 29 lacking valid null counterparts.

The ratio stats are even worse, with a much wider spread:

  • five valid [q] words have no instances of null counterparts (so their ratio cannot be calculated);
  • only 40 of the 123 valid [q] words are less common than their null counterparts;
  • of the remaining 78 valid [q] words, 26 are three or more times as common as their null counterparts, and 12 of these have valid null counterparts.

In short, the frequency of the null counterparts bears little relation to the frequency of the [q] words. The word [qokal], with 191 tokens, is more than eight times as common as [kal], with only 23 tokens. Yet both are clearly valid words with relatively high counts.
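Putting the two sets of figures side by side makes the difference plain. The sketch below only restates numbers already given in this post as shares of the 123 valid [q] words; the labels and variable names are illustrative.

```python
TOTAL = 123  # valid [q] words in the list

# label: (figure for [o] counterparts, figure for null counterparts)
SUMMARY = {
    "lacks a valid counterpart": (15, 29),
    "counterpart never occurs": (4, 5),
    "[q] word less common than counterpart": (72, 40),
    "[q] word at least 3x as common": (6, 26),
}

for label, (with_o, with_null) in SUMMARY.items():
    print(f"{label:<40} [o]: {with_o / TOTAL:6.1%}   null: {with_null / TOTAL:6.1%}")

# The [qokal] example: 191 tokens against 23 for [kal].
qokal_ratio = 191 / 23
print(f"[qokal] / [kal] = {qokal_ratio:.1f}")
```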


The idea that [qo] is a prefix is the only way to preserve the hypothesis that [q] is grammatical in nature. However, the competing hypothesis, that [q] is a prefix which adds onto words already beginning with [o], performs better with a simple set of statistics: [o] counterparts to [q] words show more validity as words than null counterparts and the frequency ratio is tighter.

None of the numbers in this post are proof, but they at least give us caution that existing ideas aren’t well supported.

We ought to be more agnostic about the nature of [q] and seek other hypotheses. The possibility of a sound value for [q] should be explored as a potential better fit. There are a few sounds which could work, and there are explanations for the non–occurrence of the character in labels.

I think that a radical new solution to the problem of [q] should be sought.

A Hidden Phrase?

I want to give an example of the kind of outcome that I hope my research, particularly the Transformation Theory, will eventually bring. It isn’t always easy to explain my goals in the abstract so I think an example will be better.

Please note that the following is only an example. It is plausible and in line with my research, but is not a proposed reading.

Look at the image of text below, taken from the second line of f66r.

f66r line 2

The text is easy to read and even the few ill–formed characters are not ambiguous. I would transcribe the line in EVA as: [qokeedar okal okedy qokeedy qokal okedy]. The First Study Group and Takahashi agree with this transcription (though, curiously, Currier misses out [qokeedy] altogether).

The important thing to note first of all is that only words three and six of this excerpt match: [okedy]. Were we looking for strict repetitions nothing would be found here. A looser matching might say that [qokeedar] and [qokeedy] are similar, as are [okal] and [qokal].

But by ‘similar’ what do we mean? Usually that words differ by one or two characters. Yet without understanding why or how words differ we could propose many near matches that are, in truth, unrelated. Is [okal] related to [okar] or [otal] or [okol] or [ykal]? Or all or none?

I think I can begin to shed some light on how similar words are actually related, and by doing so reveal hidden patterns otherwise obscured.

From [y] to [a]

In my very first post on this blog I explained that [y] and [a] have some kind of equivalence. They have complementary distribution and, when taken together, match the distribution of [o] fairly well. We should consider them, for the purposes of future research, to be variants of the same character (although there may be some difference in their values discoverable later).

When we apply this knowledge of [y] and [a] to the first and fourth words of the excerpt, we can see that [qokeedy] and [qokeedar] differ not by two characters but only one. Because [y] becomes [a] before [r], [qokeedar] is actually [qokeedy–r].

Even though I don’t yet know what a final [r] might mean, we can propose that some process has added [r] to the end of [qokeedy] to make [qokeedar]. (There’s other evidence, such as word structure and trigram distribution, also showing the two words are likely related.)
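The proposed step from [qokeedy] to [qokeedar] can be written as a one-line rule. This sketch encodes only the single alternation used in the example, namely that a final [y] is written [a] when [r] follows; the function name `add_final_r` is hypothetical and the rule is not claimed to hold across the whole text.

```python
def add_final_r(word: str) -> str:
    """Append [r] to a word, writing a final [y] as [a] before it.

    Only the [y] to [a] alternation from the example is modelled here.
    """
    if word.endswith("y"):
        return word[:-1] + "ar"
    return word + "r"


print(add_final_r("qokeedy"))
```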

From Nothing to [q]

We can now move on to the second and fifth words of the excerpt: [okal] and [qokal]. The kind of knowledge we’re gaining through research into Last–First Combinations suggests that the first character of a word can be dependent on the last character of the preceding word. So the presence or absence of an initial [q] in these two words could result from the change of the preceding word from [qokeedar] to [qokeedy].

Is this possibility borne out by the evidence we have? I should first state that I haven’t yet done any specific research on initial [q]. However, the statistics we have at least suggest it is likely.

A word ending [r] has a clear preference for and against certain characters at the start of the following word. It is a strong character and thus prefers a weak one after it, such as [y, a, o, ch, sh], and has a bias against [q]. The next word begins [o], which meets the expectation.

Yet [qokeedy] ends [y], a weak character. It has a preference for a strong character, such as [q] (though, we should note, possibly not a bias against [o]). Thus [qokal] is an acceptable word to follow, even though we don’t understand why the form [qokal] is preferred rather than [kal], which also starts with a strong character.
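The strong and weak preferences can be checked across the whole excerpt. The sketch below is illustrative only: the two sets contain just the characters named as strong or weak in these posts, so they are certainly incomplete, and the function name `classify` is hypothetical.

```python
# Characters described as strong or weak in these posts (incomplete lists).
STRONG = ("k", "t", "l", "r", "q")
WEAK = ("ch", "sh", "y", "a", "o")


def classify(word: str, position: str) -> str:
    """Return 'strong', 'weak' or 'unknown' for a word's first or last glyph."""
    matches = word.startswith if position == "first" else word.endswith
    if matches(WEAK):  # checked first so that [ch, sh] count as units
        return "weak"
    if matches(STRONG):
        return "strong"
    return "unknown"


line = "qokeedar okal okedy qokeedy qokal okedy".split()
transitions = [
    (prev, classify(prev, "last"), nxt, classify(nxt, "first"))
    for prev, nxt in zip(line, line[1:])
]
for prev, last, nxt, first in transitions:
    print(f"{prev} ({last}) -> {nxt} ({first})")
```

On this excerpt every word boundary pairs a strong character with a weak one, in one order or the other.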

The Result

If we consider the above to be true, or at least plausible, we then have an interesting analysis of our excerpt. The six words are actually two three–word phrases which are very closely related.

The second phrase is [qokeedy qokal okedy], which is how it appears as written in the text. The first phrase is different only by a single original character [r]. By adding that onto the end of the first word we end up with [qokeedar okal okedy] due to the variation of [y] to [a] and the transformation of [q] to nothing.
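The rolling back of the first phrase can be sketched in code. This is only an illustration of the example above, not a proposed reading or a general rule: it assumes that a final [ar] stands for [y] plus [r], and that the word after such a word has lost an initial [q] before [o]. The function name `roll_back` is hypothetical.

```python
def roll_back(phrase: str) -> tuple:
    """Undo the two example changes and return (phrase, removed characters).

    Applied blindly to a whole text this would certainly over-generate;
    it models the f66r example only.
    """
    words = phrase.split()
    removed = []
    for i, word in enumerate(words):
        if word.endswith("ar"):
            words[i] = word[:-2] + "y"  # [a] back to [y], final [r] set aside
            removed.append("r")
            # Restore the [q] that the preceding [r] is assumed to have blocked.
            if i + 1 < len(words) and words[i + 1].startswith("o"):
                words[i + 1] = "q" + words[i + 1]
    return " ".join(words), removed


first, extra = roll_back("qokeedar okal okedy")
second, _ = roll_back("qokeedy qokal okedy")
print(first, extra)
print(first == second)
```

Once the single [r] is set aside, the two phrases come out identical.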

Three differences are reduced to one and the text becomes more regular and with more obvious patterns. Repeated multiple times on every page, this would lead to a much different text.

I hope this example clarifies my thinking about where textual analysis needs to go in the future. It’s about identifying as many of these changes as possible, understanding the links between characters and their contexts, and fixing rules which let us roll back the transformed text to reveal the original text below.