Length of [e] Sequences

The character [e] is one of only two characters in the Voynich script which are regularly found repeated. Both [ee] and [eee] occur often enough to be valid sequences. As all three typically occur in the same position within words, they can be generically thought of as ‘[e] sequences’ filling a specific structural slot.

Although we can guess from their similar appearance and position that [e] sequences are a valid ‘family’ within the wider script, why one length is chosen over another is unknown. It may be that they encode different sounds or meanings. It may also be that they are conditioned by their environment. They have also sometimes been linked to benches [ch, sh], which have a related appearance and are often neighbouring [e] sequences in Voynich words.

I don’t expect in this post to answer these questions, but I want to set out some statistics and thoughts which seem pertinent.

Jorge Stolfi, in his page outlining his Grammar for Voynichese Words, provides some interesting tables for [e] sequences. In words which don’t contain a gallows the following counts for the different lengths are found:

[e]: 68
[ee]: 185
[eee]: 90

We can clearly see that [ee] is the most common, with [eee] about half as common and [e] only a little over a third as common as [ee]. However, these numbers don’t include those [e] sequences following a bench character, which have quite different counts (let [B] stand for any bench, [ch, sh]):

[Be]: 3851
[Bee]: 917
[Beee]: 24

The counts are generally much higher, but the ratios are totally different. The most common is [e], [ee] is only a quarter as common, and [eee] less than 1% as common.

The difference in the ratios is also found in [e] sequences which follow gallows (let [G] stand for any non–bench gallows, [t, k, p, f]):

[Ge]: 2160
[Gee]: 2339
[Geee]: 189

[GBe]: 1102
[GBee]: 101
[GBeee]: 2

Although neither of these two sets of token counts matches those above for environments without gallows, their general direction is the same. Those [e] sequences without a bench before them have high [ee] and significant [eee]. Those [e] sequences following a bench have low [ee] and insignificant [eee].
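The comparison is easier to see when each length is expressed as a share of its own environment. The following is a minimal sketch using only the counts quoted above; the dictionary layout and the function name `shares` are assumptions made for the illustration.

```python
# Token counts quoted above from Stolfi's tables, keyed by environment.
# "B" = any bench [ch, sh]; "G" = any non-bench gallows [t, k, p, f].
COUNTS = {
    "no gallows, no bench": {"e": 68, "ee": 185, "eee": 90},
    "after B": {"e": 3851, "ee": 917, "eee": 24},
    "after G": {"e": 2160, "ee": 2339, "eee": 189},
    "after GB": {"e": 1102, "ee": 101, "eee": 2},
}


def shares(counts):
    """Return each length's share of the environment's total, as a percentage."""
    total = sum(counts.values())
    return {seq: round(100 * n / total, 1) for seq, n in counts.items()}


for env, counts in COUNTS.items():
    print(f"{env:22s}", shares(counts))
```

In both environments with a bench, [e] takes over 80% of the tokens, while without a bench [ee] is the largest single share.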

I performed some follow–up tests, breaking the figures down with specific gallows characters and with following characters such as [y, o, d]. In each case the same pattern emerged: all other things being equal, [e] sequences are shorter after [ch, sh].

The same is observed for bench gallows [cth, ckh], suggesting that they create the same environment as [ch, sh]. This chimes with earlier observations on bench gallows and their possible relationship with [ch, sh], namely that they aren’t followed by [ch, sh].

Lastly, I want to mention an old observation by Currier, that [p, f] are never followed immediately by [e] sequences. They are, however, followed by [ch, sh], which can then be followed by [e] sequences. The [e] sequences in these cases adhere to the same pattern as for gallows as a whole: [ee] is much less common than [e].

I don’t know how to explain all the above observations. There is likely much more to add before we have the full picture. For example, most words won’t begin or end with [e], but what can we say about the few which do? What observations can we make when a word contains two separate [e] sequences?

Bench characters are clearly an important environment for [e] sequences. Why they seem to govern the length of [e] sequences is unknown, but worth taking up as an idea for future investigations.


What can we say about [q]?

I want to quickly sum up my last three posts to give my opinion on [q]. We already know that it doesn’t often occur in labels, but does so in the main text. We also know that it occurs almost always at the start of a word and before [o].

I think what I have seen from the last few posts is that words beginning with [q] have two particular relationships. The first is that a word starting [q] must have a valid counterpart word starting [o], and it is not enough for the plain word without either [q] or [o] to exist. Conversely, even if the [o] form exists, that does not mean that the [q] form will.

This suggests that, whatever the meanings of [q] and [o] are, they are separate characters. A plain word such as [kaiin] has [o] added to it to form [okaiin], and then [q] added to form [qokaiin]. On this view, [qo] is not a single valid prefix.

Second, the character which comes immediately after [q] and [o] conditions the relative frequencies of these two prefixes. So a word starting [ch] will have low levels of both, whereas a word starting [t] will have relatively high levels of both. The words [chy] and [tchy] have very different patterns of the two prefixes.

If we believe that characters stand for sounds then it is hard to see how [q] and [o] can be grammatical. Words should not belong to grammatical categories according to their initial sounds (though there are exceptions). The alternative is that the process of adding [q] and [o] is phonological.

It occurs to me that the first characters of plain words most likely to take [o] are [k, t, l, r]. These are the same characters which are most consistent in acting as ‘strong’ characters at the start of words. They also show positive attraction to appearing after words ending [o]. We have already seen this from another angle with regard to [o] forms after words ending with ‘strong’ characters.

However, the problem is that the further addition of [q] makes little sense. It only seems to occur frequently on those plain words which begin [k, t]. As the sequence [y.o] is actually quite acceptable in the language, it cannot be added just to prevent a weak–weak combination. (It turns out that only strong–strong combinations are discouraged.) Though the lack of [q] words in labels suggests that the preceding word provides an environment for its occurrence.

The value of [q] must be also connected with the identity of the gallows characters, yet words starting [lk] and [cth] have relatively low [q] form frequency ([ckh] seems to be higher).

It seems as though [q] is being used to create a phonological sequence which is preferred for some unknown reason.

Types of [qo] Triplets

My last two posts were about the relative frequencies of words prefixed with [o] and [qo]. Although there is some interesting information in those posts I don’t feel that it is well presented. After having examined and sorted the statistics again I want to propose a neater way of thinking about these words.

The core idea is that [o] is a prefix which can be added to the start of a word and that [q] is another prefix which can be added to words starting [o]. Thus every word is a member of a potential ‘triplet’ with a plain form, an [o] form, and a [qo] form. For example, [dol] would be a plain with [odol] as its [o] form and [qodol] as its [qo] form. It is the relative frequencies of these three forms which interests us in this post.

Not all triplets have the same relative frequencies. Some have [qo] forms which are common, others have [qo] forms which are rare. The same goes for plain and [o] forms. Thus there are six possible combinations, depending on which form is higher (or lower) than the others:

Plain > [o] > [qo]
Plain > [qo] > [o]
[o] > Plain > [qo]
[o] > [qo] > Plain
[qo] > Plain > [o]
[qo] > [o] > Plain

I sorted all triplets with at least 30 tokens between all three forms (about 140 triplets) into these six groups. It should be noted that one group, [qo] > plain > [o], did not exist. Simply put, it is not possible to have a high frequency of [qo] forms without a relatively high frequency of [o] forms. Only one triplet had a [qo] form more than three times as common as the [o] form.
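As a rough illustration of this sorting step, the six orderings can be produced by ranking the three counts of a triplet. This is only a sketch: the function name `ordering` is hypothetical, ties are resolved in the order plain, [o], [qo], and the counts used are those for [dol], [odol], and [qodol] given elsewhere in this post.

```python
def ordering(plain, o, qo):
    """Describe the three forms of a triplet from most to least frequent."""
    forms = {"Plain": plain, "[o]": o, "[qo]": qo}
    # sorted() is stable, so equal counts keep the order Plain, [o], [qo].
    ranked = sorted(forms, key=forms.get, reverse=True)
    return " > ".join(ranked)


# [dol] = 117 tokens, [odol] = 2, [qodol] = 1
print(ordering(117, 2, 1))  # prints: Plain > [o] > [qo]
```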

Although these groups were a good starting point for classification of the triplets, they were too blunt in their discrimination between relative levels. For example, the triplet [raiin] had an [o] form with one more token than the plain form, while [lkedy] had a plain form with two more tokens than the [o] form. In reality, both of these had plain and [o] forms which were more or less equal.

Thus I further sorted these groups into four types which could be described fairly easily to show the relative frequencies. I will present the four types below.

Type I: 55 triplets. They all have high counts for the plain forms but low counts for both [o] and [qo] forms, which are mostly less than 10% of the plain form. The plain forms start with [a, o, y], [ch, sh], [d, s], and [cth] (but not [ckh]).

Type II: 42 triplets. The [o] form is the highest frequency form (or, for a few, nearly the highest). The plain and [qo] forms are less frequent, but with [qo] higher than plain. About half of the plain forms begin with [t]; of the rest, thirteen begin with [k], three with [p], and a couple with [l] and [r].

Type III: 17 triplets. The [qo] form is the highest frequency form, followed strictly by [o] then by the plain form usually much lower. Fifteen of the plain forms start with [k], with one each for [e] and [t].

Type IV: 20 triplets. Both the plain and [o] forms have at least ten tokens, but the [qo] form is always less than 40% as frequent as either, and usually much lower. Fourteen of the plain forms start with [l], and the rest with [r], except for [m].

There were also six triplets which didn’t fit easily into any of these types. Three began with [ckh] in the plain form and one was a Grove word.


I hope it is clear that the first character of the plain form is a key indicator of how common the [o] and [qo] forms are, both in total and relative to each other. Only Type II overlapped significantly, with Types III and IV. Otherwise, it is possible to broadly predict, for any given word, how often it will be prefixed with [o] and [qo].

For example, think of the plain form [dol] which I mentioned earlier. We can see that the plain form begins with [d] and so it is part of a Type I triplet. Thus the [o] and [qo] forms should occur much less than the plain form. So [dol] has 117 tokens, whereas [odol] has 2 and [qodol] only 1.
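The prediction step can be sketched as a simple lookup from the first glyph of the plain form to the types whose descriptions mention that glyph. This is an illustration under assumptions, not a definitive classifier: the names `TYPE_INITIALS`, `first_glyph`, and `candidate_types` are hypothetical, words are assumed to be non-empty EVA strings, and only [ch, sh, cth, ckh] are treated as multi-character glyphs.

```python
# First glyphs of plain forms reported for each type in the descriptions above.
TYPE_INITIALS = {
    "I": {"a", "o", "y", "ch", "sh", "d", "s", "cth"},
    "II": {"t", "p", "k", "l", "r"},
    "III": {"k", "e", "t"},
    "IV": {"l", "r", "m"},
}

# Multi-character glyphs, checked first so that "cth" is not read as "c".
MULTI = ("cth", "ckh", "ch", "sh")


def first_glyph(word):
    """Return the first glyph of a non-empty EVA word."""
    for glyph in MULTI:
        if word.startswith(glyph):
            return glyph
    return word[0]


def candidate_types(word):
    """List the types whose reported initial glyphs include this word's."""
    glyph = first_glyph(word)
    return [t for t, initials in TYPE_INITIALS.items() if glyph in initials]


print(candidate_types("dol"))    # prints: ['I']
print(candidate_types("kaiin"))  # prints: ['II', 'III']
```

Words beginning [k, t, l, r] return two candidates, which reflects the overlap of Type II with Types III and IV, and [ckh] returns none because its triplets did not fit any type easily.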

[qo], [o], and the Root Word

In my last post I spoke about the relationship between [q] and [o], given that one character is almost invariably followed by the other. In this post I want to talk about the root word, the one which begins after [qo] and [o] and how it affects the frequency of those prefixes.

For the sake of this post I used the same word triplets which I used in the last: three words which are the same except that one form is prefixed with [o], one is prefixed with [qo], and one has no prefix at all (a null prefix). So, for example, [key, okey, qokey] would be one triplet. Each triplet has at least one valid form (which I take to mean having five or more tokens), though some forms of a triplet may not occur at all.
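A minimal sketch of how such triplets and their relative frequencies could be derived from a word-frequency table follows. The function names and the `FREQ` table are assumptions for the example; the counts in it are the few quoted in this post, and any form not quoted is simply absent and so counts as zero, which is an artefact of the example rather than a claim about the text.

```python
VALID = 5  # validity threshold used in this post: five or more tokens

# Token counts quoted in this post; unquoted forms are missing from the table.
FREQ = {
    "aiin": 470, "oaiin": 26, "qoaiin": 23,
    "chedy": 501, "ochedy": 8,
    "chy": 155, "ochy": 5,
}


def triplet(root, freq):
    """Return (null, [o], [qo]) token counts for a root word."""
    return freq.get(root, 0), freq.get("o" + root, 0), freq.get("qo" + root, 0)


def has_valid_form(root, freq):
    """A triplet is kept if at least one of its forms reaches the threshold."""
    return any(n >= VALID for n in triplet(root, freq))


def relative(root, freq):
    """Return the [o] and [qo] counts as rounded percentages of the null count."""
    null, o, qo = triplet(root, freq)
    if null == 0:
        return None  # no null form, so no ratio can be calculated
    return round(100 * o / null), round(100 * qo / null)


print(triplet("aiin", FREQ), relative("aiin", FREQ))
```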

I’ll present a narrative description of the results, sorted by the first character of the root word. Some characters had too few valid words to be included.

[ch, sh]

The bench characters both showed very low frequencies of [o] forms compared to null forms, with none greater than 3%. The highest was [chy] with 155 tokens and its [o] form [ochy] with 5 tokens. The highest count was [chedy] with 501 tokens and its counterpart [ochedy] with 8 tokens. The [q] forms were even rarer.

Even though some individual [qo] and [o] forms rose above the validity threshold, most did not, and it is possible to say that the overall pattern seems to be invalid. That is, words beginning [ch, sh] do not have [o] and [qo] forms.

[a, y]

The picture for [a, y] is much like the above for [ch, sh]. However, the frequency ratio peaked at about 6% [o] and [qo] forms. Some of the counterparts, such as [oaiin] and [qoaiin], with 26 and 23 tokens respectively, are clearly valid words, but still relatively rare compared to the null form [aiin] with 470 tokens.

These words can be considered to marginally take [qo] and [o] prefixes.


[r]

Most of these words had moderate levels of [o] forms, ranging from 30% to 130% relative frequency. The main exception was [r], for which the [o] form, [or], was obviously extremely common. However, no [r] word had a valid [qo] form except [qor].

It seems as though [o] is a valid prefix for these words, but [qo] is not.


[d]

The relative frequencies for [o] and [qo] forms went up to about 17%, with the [o] forms being always a little more common than the [qo] forms. The high frequencies of the null forms mean that many of the [o] and [qo] forms have moderate to high counts, with [odaiin] having 61 tokens.


[l]

The [o] forms had strong relative frequencies, ranging from 32% to over 400% for [oly] ([ol] was even higher but, as with [or], is really an exception). All valid null forms beginning [l] produced valid [o] forms.

However, the rates for [qo] forms were much lower, with all being under 50% of null frequency and more than half under 10%. Only seven [l] words (of 35) produced valid [qo] forms. Ten produced no [qo] counterparts at all.

It is clear that [o] forms are valid for [l], but [qo] forms are mostly not.

[ckh, cth]

There were too few examples of these words to get good results. However, it seems as though they mostly take [o] and [qo] prefixes in small but valid amounts. The exceptions were [ckhol, cthol], which only had one [o] counterpart each.


[k]

All but four (of 37) null forms had [o] forms at parity or higher, with the highest being [okal] at 600% of [kal]. Similarly, all but eight had [qo] forms at parity or above. In nearly half the cases the [qo] form was more frequent than the [o] form, and several others were nearly at parity.

The [o] and [qo] forms for [k] words have strong frequencies, with the exceptional phenomenon of [qo] often being stronger.


[t]

All but four (of 28) null forms had [o] forms at parity or higher, with the highest being [otam] with 47 tokens compared to 5 tokens for [tam]. Nine null forms had [qo] forms below parity, with the rest higher. For both [o] and [qo] forms the lowest relative frequencies were still healthy at 75% and 33% respectively.

However, unlike [k], the [qo] forms were typically less common than the [o] counterparts. Only three [qo] forms were at or above parity with [o]. Many were less than half as common, with three dropping below the validity threshold.

It seems that while both [o] and [qo] are valid prefixes, [o] is much more common.


[p]

Most words had a moderate to strong frequency of [o] forms, though with a lower level of [qo] forms.

Overall, much like [t].


A number of different patterns were found regarding the influence of the root word’s initial character. I think they can be broadly put into three groups.

The characters [a, y, ch, sh] typically did not show a high frequency of [o] or [qo] forms. It is arguable whether these prefixes are really productive in the same way as for other words. It should be noted that, according to the strong–weak split, all these are weak characters, as is [o].

The characters [r, l] showed a good number of [o] forms, but with much lower, or non–existent, [qo] forms.

The characters [k, t, p] all showed a strong number of [o] forms and also high levels of [qo] forms. However, [k] had even stronger [qo] forms while [t, p] had somewhat lower numbers for [qo].

The characters [ckh, cth] were too few to typify, and [d] is difficult due to the high number of null forms.


I’m not sure. That’s hardly what you want to hear, but it’s the truth. The low frequencies of [o] and [qo] forms for root words beginning with weak characters could show that these prefixes have some connection to last–first combinations. This would certainly be my preferred reading.

Yet quite how to explain the good presence of [o] but not [qo] for [l, r]? I note that, when we discussed last–first combinations and Transformation Theory, the words beginning [l] were those most likely to show a strong preference against the sequence [y.o]. It could be that [l] words avoid [y.o] by removing the [o], whereas gallows characters avoid the sequence by adding [q] to give [y.qo], but this is nothing more than a suggestion to be investigated.

The difference between [t] and [k], with one having a lower [qo] than [o] and the other a higher, reminds me of something we discussed a while ago. We talked about the distribution of [k] and [t], and noted that the strings [qoke, qoka] were much more common than [qote, qota]. Though it’s hard to know whether we’re seeing the same thing from two different angles, or if one is causing the other.

I feel as though this is another part of an emerging puzzle, and it’s not clear how the pieces should be fitted together.

The Relationship between [q] and [o]

The character [q] is the most stereotyped character in the Voynich script. It almost always occurs 1) at the beginning of a word, and 2) before the character [o]. It also mostly occurs after a word ending [y]. Thus its immediate context can be guessed in most cases.

Some have proposed, due to the character’s position at the beginning of a word, and its lack of occurrence in isolated labels, that it stands for a whole word (a morphogram), maybe a grammatical word like the ampersand: &. I think this is highly doubtful as we would expect it to occur before all kinds of characters and not only [o].

The morphogram proposal can be salvaged by suggesting that the two characters [qo] are an inseparable digraph, or some such. This would be much more reasonable as [qo] occurs before all kinds of characters. Though it still has a small weakness in that [q] rarely occurs before a few other characters, such as [e]. This, however, is not a fatal objection.

I wish here to present some statistics to show that [qo] may well be separable and thus the morphogram proposal has a poor foundation.

No [q] without [o]

We know that words beginning [q] must nearly always have [o] as the next character. However, the removal of the [q] will almost always result in a valid word. There are few exceptions to this and most are marginal.

I took every word beginning with [q] which had five or more tokens (I will treat five as the threshold for validity in this post), and counted the number of tokens for the same words without the initial [q]. So, for example [qokey] had the counterpart word [okey]. The list contained 123 word pairs.

Here’s what I found:

  • every [q] word with more than 15 tokens (48 of 123) had a valid counterpart beginning [o];
  • all but two of the [q] words with more than 10 tokens (73 of 123) had a valid [o] counterpart;
  • only 15 valid [q] words lacked a valid [o] counterpart, and 13 of those words had fewer than ten tokens;
  • only 4 lacked any instance of an [o] counterpart: one had nine tokens, the other three just five tokens.

Here is the list of [q] words without valid counterparts, and the count for each:

With [o]   Count   With [qo]   Count
okl        0       qokl        9
otched     0       qotched     5
oeol       0       qoeol       5
ool        0       qool        5
ockhol     1       qockhol     7
okeechy    1       qokeechy    6
oked       2       qoked       7
okeedar    2       qokeedar    6
okeed      3       qokeed      15
okshedy    3       qokshedy    11
oor        3       qoor        8
okod       3       qokod       7
oeeey      4       qoeeey      7
opol       4       qopol       6
otshy      4       qotshy      5

The most interesting words on this list are [qoor] and [qool]. The word [oor] is rare and [ool] simply doesn’t exist, yet [or] and [ol] are very common. The string [oo] is not common itself, so we’re looking at a marginal set of words. Indeed, the counts for most of the 15 above are low and really only two or three stand out as exceptions.

Ratios of [q] to [o]

I further wish to show that the number of [q] words is limited by the number of its [o] counterparts.

Here are a few key stats, based on the same list of word pairs as before:

  • only four valid [q] words exist without any [o] counterparts (so their ratio cannot be calculated);
  • 72 of the 123 valid [q] words are less common than their [o] counterparts;
  • of the remaining 47 valid [q] words, only six occur three times more than their [o] counterparts and all are words we have already listed above as exceptions without valid [o] counterparts.

So, in sum, all the words not given earlier as exceptions are either less common than their [o] counterparts or no more than three times as common. It should be noted that there is no absolute floor provided by [o] words, and that some words which occur commonly, such as [or] with 366 tokens, might have relatively low numbers of [q] counterparts, [qor] with only 23 tokens. But the opposite is not and cannot be true. The most extreme case is [qokeedy] with 305 tokens, whereas [okeedy] has 105 tokens.
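The figures for the exceptions can be checked directly against the table of [q] words without valid counterparts. The sketch below copies that table into a dictionary (the name `EXCEPTIONS` is an assumption) and reads ‘three times more’ as ‘at least three times as many tokens’, which is also an assumption.

```python
# (tokens of the [o] counterpart, tokens of the [q] word), copied from the
# table of [q] words without valid [o] counterparts.
EXCEPTIONS = {
    "qokl": (0, 9), "qotched": (0, 5), "qoeol": (0, 5), "qool": (0, 5),
    "qockhol": (1, 7), "qokeechy": (1, 6), "qoked": (2, 7), "qokeedar": (2, 6),
    "qokeed": (3, 15), "qokshedy": (3, 11), "qoor": (3, 8), "qokod": (3, 7),
    "qoeeey": (4, 7), "qopol": (4, 6), "qotshy": (4, 5),
}

# [q] words whose [o] counterpart never occurs, so no ratio can be calculated.
no_counterpart = [w for w, (o, q) in EXCEPTIONS.items() if o == 0]
# [q] words with fewer than ten tokens.
under_ten = [w for w, (o, q) in EXCEPTIONS.items() if q < 10]
# [q] words at least three times as common as an existing [o] counterpart.
three_times = [w for w, (o, q) in EXCEPTIONS.items() if o > 0 and q >= 3 * o]

print(len(no_counterpart), len(under_ten), len(three_times))  # prints: 4 13 6
```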

The Null Hypothesis

The proposal that [qo] is the prefix and not [q] alone suggests that words beginning [qo] should be linked with words with no (or null) prefix. So, for example, a word such as [qokey] is formed from [qo] plus [key] instead of [q] plus [okey].

Let’s look at the statistics for [q] words and their null counterparts:

  • every valid [q] word with 19 or more tokens (43 of 123) has a valid null counterpart;
  • all but five valid [q] words with 10 or more tokens (73 of 123) had valid null counterparts;
  • 29 valid [q] words lacked a valid null counterpart, and 24 of those words had fewer than ten tokens;
  • five lacked any instance of a null counterpart: two had eight tokens and three had six tokens.

I won’t bother providing a table, as you can see that the stats for the null counterparts are somewhat worse: the number of valid [q] words lacking a valid counterpart nearly doubles, from 15 words lacking valid [o] counterparts to 29 lacking valid null counterparts.

The ratio stats are even worse, with a much wider spread:

  • five valid [q] words have no instances of null counterparts (so their ratio cannot be calculated);
  • only 40 of the 123 valid [q] words are less common than their null counterparts;
  • of the remaining 78 valid [q] words, 26 are three or more times as common as their null counterparts, and 12 of these have valid null counterparts.

In short, the frequency of the null counterparts bears little relation to the frequency of the [q] words. The word [qokal], with 191 tokens, is more than eight times as common as [kal], with only 23 tokens. Yet both are clearly valid words with relatively high counts.
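The two hypotheses differ only in which counterpart a [q] word is compared with: dropping [q] gives the [o] counterpart, while dropping [qo] gives the null counterpart. A minimal sketch, with hypothetical function names and only the token counts quoted in this post:

```python
# Token counts quoted in this post.
FREQ = {
    "qokal": 191, "kal": 23,
    "qokeedy": 305, "okeedy": 105,
    "qor": 23, "or": 366,
}


def counterparts(q_word):
    """Return (the [o] counterpart, the null counterpart) of a word starting [qo]."""
    if not q_word.startswith("qo"):
        raise ValueError("expected a word beginning [qo]")
    return q_word[1:], q_word[2:]


def ratio(q_word, counterpart, freq):
    """How many times as common the [q] word is as the given counterpart."""
    n = freq.get(counterpart, 0)
    return freq[q_word] / n if n else None  # None: no ratio can be calculated


o_form, null_form = counterparts("qokal")
print(o_form, null_form)                           # prints: okal kal
print(round(ratio("qokal", "kal", FREQ), 1))       # prints: 8.3
print(round(ratio("qokeedy", "okeedy", FREQ), 1))  # prints: 2.9
```

The most extreme ratio against an [o] counterpart among the non-exceptions, [qokeedy] to [okeedy], stays just under three, whereas the ratio of [qokal] to its null counterpart is over eight.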


The idea that [qo] is a prefix is the only way to preserve the hypothesis that [q] is grammatical in nature. However, the competing hypothesis, that [q] is a prefix which adds onto words already beginning with [o], performs better with a simple set of statistics: [o] counterparts to [q] words show more validity as words than null counterparts and the frequency ratio is tighter.

None of the numbers in this post are proof, but they at least give us caution that existing ideas aren’t well supported.

We ought to be more agnostic about the nature of [q] and seek other hypotheses. The possibility of a sound value for [q] should be explored as a potential better fit. There are a few sounds which could work, and there are explanations for the non–occurrence of the character in labels.

I think that a radical new solution to the problem of [q] should be sought.

A Hidden Phrase?

I want to give an example of the kind of outcome that I hope my research, particularly the Transformation Theory, will eventually bring. It isn’t always easy to explain my goals in the abstract so I think an example will be better.

Please note that the following is only an example. It is plausible and in line with my research, but is not a proposed reading.

Look at the image of text below, taken from the second line of f66r.

f66r line 2

The text is easy to read and even the few ill–formed characters are not ambiguous. I would transcribe the line in EVA as: [qokeedar okal okedy qokeedy qokal okedy]. The First Study Group and Takahashi agree with this transcription (though, curiously, Currier misses out [qokeedy] altogether).

The important thing to note first of all is that only words three and six of this excerpt match: [okedy]. Were we looking for strict repetitions nothing would be found here. A looser matching might say that [qokeedar] and [qokeedy] are similar, as are [okal] and [qokal].

But by ‘similar’ what do we mean? Usually that words differ by one or two characters. Yet without understanding why or how words differ we could propose many near matches that are, in truth, unrelated. Is [okal] related to [okar] or [otal] or [okol] or [ykal]? Or all or none?

I think I can begin to shed some light on how similar words are actually related, and by doing so reveal hidden patterns otherwise obscured.

From [y] to [a]

In my very first post on this blog I explained that [y] and [a] have some kind of equivalence. They have complementary distribution and, when taken together, match the distribution of [o] fairly well. We should consider them, for the purposes of future research, to be variants of the same character (although there may be some difference in their values discoverable later).

When we apply this knowledge of [y] and [a] to the first and fourth words of the excerpt, we can see that [qokeedy] and [qokeedar] differ not by two characters but only one. Because [y] becomes [a] before [r], [qokeedar] is actually [qokeedy–r].

Even though I don’t yet know what a final [r] might mean, we can propose that some process has added [r] to the end of [qokeedy] to make [qokeedar]. (There’s other evidence, such as word structure and trigram distribution, also showing the two words are likely related).

From Nothing to [q]

We can now move on to the second and fifth words of the excerpt: [okal] and [qokal]. The kind of knowledge we’re gaining through research into Last–First Combinations suggests that the first character of a word can be dependent on the last character of the preceding word. So the difference between the two words, an initial [q], could result from the change of the preceding word from [qokeedar] to [qokeedy].

Is this possibility borne out by the evidence we have? I should first state that I haven’t yet done any specific research on initial [q]. However, the statistics we have at least suggest it is likely.

A word ending [r] has a clear preference for and against certain characters at the start of the following word. It is a strong character and thus prefers a weak one after it, such as [y, a, o, ch, sh], and has a bias against [q]. The next word begins [o], which meets the expectation.

Yet [qokeedy] ends [y], a weak character. It has a preference for a strong character, such as [q] (though, we should note, possibly not a bias against [o]). Thus [qokal] is an acceptable word to follow, even though we don’t understand why the form [qokal] is preferred rather than [kal], which also starts with a strong character.

The Result

If we consider the above to be true, or at least plausible, we then have an interesting analysis of our excerpt. The six words are actually two three–word phrases which are very closely related.

The second phrase is [qokeedy qokal okedy], which is how it appears as written in the text. The first phrase is different only by a single original character [r]. By adding that onto the end of the first word we end up with [qokeedar okal okedy] due to the variation of [y] to [a] and the transformation of [q] to nothing.
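To make the mechanics of this example concrete, the two changes can be written as toy rules. This is only a sketch of the illustration above, not a proposed reading or the actual rules of the script: the function names are hypothetical, the change of [y] to [a] is applied only before an added [r], and the loss of [q] is assumed to happen after any word ending [r].

```python
def add_final_r(word):
    """Add [r] to a word; a final [y] becomes [a] before it (toy rule)."""
    if word.endswith("y"):
        return word[:-1] + "ar"
    return word + "r"


def adjust_initial(prev, word):
    """Drop an initial [q] after a word ending [r] (toy rule)."""
    if prev.endswith("r") and word.startswith("qo"):
        return word[1:]
    return word


def transform(phrase):
    """Add [r] to the first word, then let the change ripple to later words."""
    words = phrase.split()
    out = [add_final_r(words[0])]
    for word in words[1:]:
        out.append(adjust_initial(out[-1], word))
    return " ".join(out)


print(transform("qokeedy qokal okedy"))  # prints: qokeedar okal okedy
```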

Three differences are reduced to one and the text becomes more regular and with more obvious patterns. Repeated multiple times on every page, this would lead to a much different text.

I hope this example clarifies my thinking about where textual analysis needs to go in the future. It’s about identifying as many of these changes as possible, understanding the links between characters and their contexts, and fixing rules which let us roll back the transformed text to reveal the original text below.

Last–First Combinations and Transformation Theory

In May I proposed the Transformation Theory. The theory stated that words are transformed from their original shape through the influence of neighbouring characters and other aspects of the text. I presented a small piece of research regarding words beginning with [o], which provided very basic support for the theory. I now wish to present a more substantial look at the same words.

The statistics used in this post were provided by Marco Ponzi, whom I must thank. He also provided a great deal of discussion, and some suggestions, which find their way into this post. However, I do not wish to suggest he agrees with this post, in whole or in part (unless he otherwise states).

Note: a transcription with a ‘.’ denotes a space between words. So a transcription such as [n.o] means a sequence where a word ends [n] and the following word begins [o].

The following research is built on the idea of last–first combinations. Characters found at the end and beginning of words can be broadly classified into two groups, strong [k, t, r, n, s] and weak [y, o, a, ch, sh], according to the characters they prefer to neighbour. A word ending with a strong character prefers to be followed by a word beginning with a weak character, and vice versa. Strong–strong and weak–weak combinations are typically less common than strong–weak and weak–strong.

When last–first combinations show that certain sequences, such as [n.k] and [y.t], are common or uncommon, this is based on an aggregate of thousands of different words sharing only the last or first letters. In order for last–first combinations to become part of the Transformation Theory we need to show that the combinations are true with regard to individual words. We can do this by altering the shape of a word and measuring the changes in the last–first sequences it is part of.

For the research we took pairs of words which differed only by the presence or absence of an initial [o]. This was chosen because it is a very common variation and many word pairs exist with and without initial [o], and because [o] is a weak character usually followed by typical strong characters such as [k, t, r] (and [l], which acts strong at the beginning of a word). Thus, we should find numerous word pairs which begin [ot, ok, ol, or] in one form and [t, k, l, r] in the other. The theory states that by removing the initial [o] and altering the first character from weak to strong, but keeping the rest of the word the same, we should see a significant difference in the last character of the preceding words.
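The measurement itself can be sketched as follows: for a given word, take every occurrence and tally the last character of the word before it. This is an illustrative assumption about the procedure, not the method actually used to produce the statistics in this post; the function name is hypothetical, the token list is invented purely to exercise the code, and line and paragraph breaks are ignored.

```python
from collections import Counter


def preceding_final_shares(tokens, target):
    """Percentage of target's occurrences preceded by a word ending in each character."""
    finals = Counter(
        prev[-1] for prev, word in zip(tokens, tokens[1:]) if word == target
    )
    total = sum(finals.values())
    if total == 0:
        return {}  # the target never occurs after another word
    return {ch: round(100 * n / total) for ch, n in finals.items()}


# Invented sequence of words, NOT taken from the manuscript.
TOKENS = "daiin otar chedy tar okaiin otar shedy tar dain otar".split()

print(preceding_final_shares(TOKENS, "otar"))  # prints: {'n': 100}
print(preceding_final_shares(TOKENS, "tar"))   # prints: {'y': 100}
```

Running the same function on both members of a word pair and comparing the results character by character gives the kind of percentages discussed in this post.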

Twenty–one pairs which met the criteria were chosen on the basis of frequency. All pairs had at least 77 tokens in all, and each word of a pair had at least 31 tokens. The twenty–one pairs were: [or, r], [okaiin, kaiin], [okeedy, keedy], [okar, kar], [otol, tol], [okain, kain], [okedy, kedy], [okeey, keey], [otar, tar], [otedy, tedy], [otaiin, taiin], [olkeedy, lkeedy], [olkeey, lkeey], [oraiin, raiin], [olchedy, lchedy], [okol, kol], [opchedy, pchedy], [olkain, lkain], [otchedy, tchedy], [olor, lor], and [olkaiin, lkaiin]. It should be noted that several features, such as different word endings, the presence of [e] or [i] within the words, and overall length were represented multiple times in different combinations. This allowed us to check if the last character of the preceding word had any influence beyond the variable initial [o].


The most extreme change was observed following words ending with [n]. In all twenty–one word pairs the word beginning [o] was more common after [n] than that without the initial [o]. In many pairs there were no examples of the latter, despite words ending [n] occurring commonly before the [o] version of the pair. For example, [otar] has 41 tokens (30% of all its occurrences) in the sequence [n.otar], yet there are zero for [tar]. Likewise, [okain] has 43 tokens (30%) of [n.okain] and zero for [kain]. A full 13 of the 21 pairs showed this situation.

Here are the percentages after [n] for each word pair:

Without [o]  % after [n]  With [o]  % after [n]
kain         0            okain     30
tar          0            otar      30
kaiin        0            okaiin    29
taiin        0            otaiin    27
lkeedy       0            olkeedy   26
lkain        0            olkain    25
keedy        0            okeedy    23
raiin        0            oraiin    21
tedy         0            otedy     20
pchedy       0            opchedy   20
keey         0            okeey     19
tol          0            otol      12
kedy         0            okedy     11
lchedy       1            olchedy   5
lkeey        3            olkeey    25
kol          3            okol      12
kar          4            okar      27
lkaiin       4            olkaiin   23
lor          5            olor      19
tchedy       6            otchedy   21
r            7            or        16
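For readers who want to reproduce figures of this kind, here is a minimal sketch of how such a percentage might be computed. It assumes the text is available as a list of lines, each a list of word tokens, that only the preceding word on the same line counts, and that line-initial tokens stay in the denominator (the figures above are percentages of all occurrences). The function name `pct_after_ending` is made up for the illustration.

```python
def pct_after_ending(lines, target, ending):
    """Percentage of all tokens of `target` whose preceding word
    (on the same line) ends with `ending`."""
    total = 0
    hits = 0
    for line in lines:
        for i, word in enumerate(line):
            if word != target:
                continue
            total += 1
            # Line-initial tokens have no preceding word and never count as hits.
            if i > 0 and line[i - 1].endswith(ending):
                hits += 1
    return 100.0 * hits / total if total else 0.0
```

Calling `pct_after_ending(lines, "otar", "n")` and `pct_after_ending(lines, "tar", "n")` would give the two values for one row of the table. Different choices about line breaks or uncertain spaces would shift the numbers somewhat.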

Words following [r] showed the same bias toward the version with initial [o], but without such extreme statistics. The sequence [r.o] was always the more common, sometimes more than twice as common but also sometimes the difference was not significant.

Here are the percentages after [r] for each word pair:

Without [o]  % after [r]  With [o]  % after [r]
lkaiin       2            olkaiin   39
r            6            or        28
kedy         2            okedy     19
lchedy       1            olchedy   18
kain         4            okain     20
taiin        0            otaiin    15
keedy        2            okeedy    17
pchedy       3            opchedy   18
lkeedy       0            olkeedy   14
raiin        5            oraiin    18
tol          4            otol      17
lkeey        5            olkeey    18
tedy         0            otedy     12
tchedy       0            otchedy   12
lkain        11           olkain    19
kol          11           okol      18
kar          10           okar      17
tar          14           otar      20
kaiin        8            okaiin    14
lor          7            olor      10
keey         7            okeey     9

Where the word pairs followed words ending [o] with any frequency (which was only in about 8 of 21 pairs) there was a clear preference for the form without initial [o]. Some of these could be misreadings. For example, 28% of [r] occurred after words ending [o], but the sequence [o.r] might simply be [or] with a misplaced space. However, it is telling that only 3% of [or] occurred in the sequence [o.or].

The result hardest to understand is that for words ending [y], the most common word ending in the text as a whole. With [y] and [o] both being weak characters we would expect a bias against the sequence [y.o]. However, even though many pairs did show some avoidance of this sequence, the differences were small and some pairs showed the opposite preference.

The only set of words to consistently show a strong preference one way or the other after [y] were those where the characters following [o] were either [lk] or [lch]. Of these five pairs, four had a 2:1 preference for avoiding [y.o].

At the suggestion of Marco we broke the words ending [y] down into two groups: those ending [dy] (a very common ending) and those ending [y] but not [dy]. This breakdown proved to be interesting. Two things were observed, though only one of them directly relates to Transformation Theory.

Firstly, the word pairs ending [dy] themselves were much more likely (across both versions) to come after other words ending [dy]. I won’t comment more on this, as it really deserves its own place.

Secondly, many pairs showed a preference for the form with initial [o] after words ending [dy], even where they showed no such preference after words ending [y] but not [dy]. An example will make this clear. The pair [kaiin] and [okaiin] occur about equally often after all words ending [y] (32% and 29% respectively). However, nine out of ten times [kaiin] comes after a word ending [y] but not [dy], whereas for [okaiin] the split is 54% for not [dy] and 46% for [dy].
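The [kaiin] and [okaiin] example is easier to see once both splits are expressed as shares of all tokens of each word. The short sketch below does that arithmetic. It assumes that 'nine out of ten' means 90% of the after-[y] cases and that the quoted percentages are exact, which they are not, so the output is indicative only. The function name `split_after_y` is invented for the illustration.

```python
def split_after_y(pct_after_y, share_not_dy):
    """Split a word's 'after [y]' percentage into the part after plain [y]
    (not [dy]) and the part after [dy], both as % of all its tokens."""
    after_plain_y = pct_after_y * share_not_dy
    after_dy = pct_after_y - after_plain_y
    return round(after_plain_y, 1), round(after_dy, 1)

# Figures quoted in the text: [kaiin] 32% after [y], about 90% of that not [dy];
# [okaiin] 29% after [y], 54% of that not [dy].
kaiin_plain, kaiin_dy = split_after_y(32, 0.90)
okaiin_plain, okaiin_dy = split_after_y(29, 0.54)
```

On these rounded inputs, roughly 3% of all [kaiin] tokens follow a word ending [dy], against roughly 13% of all [okaiin] tokens, while after plain [y] the balance tips the other way (about 29% against about 16%).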


It would seem that last–first combinations have some reality at the level of individual words. The sequences [n.o] and [r.o] are preferred much like the aggregate statistics would suggest, and sequences such as [n.k] and [r.t] are often avoided. So strong–strong combinations are clearly not preferred.

The results for weak–weak combinations are mixed. The phrase [o.o] is not preferred where it exists. The [y.o] sequence has a tendency to be avoided only where the preceding word doesn’t end [dy] or the following word begins [olk, olch]. Where the preceding word does end [dy], then the sequence [dy.o] may be just as common, if not more so. It seems as though [dy] is, for some reason, ignored by the rules of last–first combinations.

Are the word pairs real?

It might be objected that the pairs of words used in this piece of research aren’t real pairs. That’s certainly a reasonable objection, and until we can read the text we cannot, sadly, prove it positively one way or the other. We can only definitely say that they are orthographic pairs, being spelt the same way other than the initial [o] character.

However, we can show that many pairs have similar distributions within the text. This is what we would expect were they true pairs, appearing in the same topics but in varying forms. Using Pearson’s correlation coefficient, where a value of +1 is total positive correlation, -1 total negative correlation, and 0 no correlation, 11 of the 21 pairs score .50 or better. (The worst performing pair is [olkeedy, lkeedy] with a value of .05. However, they are found almost exclusively in quires 13 and 20, just not in the same folios.)
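For anyone wishing to repeat the distribution check, Pearson’s coefficient is straightforward to compute by hand. The sketch below assumes each word of a pair has been reduced to a list of counts over the same sequence of text units (per folio, say). The choice of unit is an assumption for the illustration, as is the function name.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length count lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)
```

The size of the unit matters: two words that share quires but never share a folio would score low on per-folio counts and high on per-quire counts.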


I’m wary of putting too much weight on a single set of figures for proving a theory. But it certainly looks reasonable to suggest that the last–first combinations seen in aggregate do operate at the level of individual words. There’s nothing obviously against the last–first combinations, and even the worst performing, the weak–weak sequence [y.o], seems to be complicated by the treatment of the word ending [dy].

Naturally, further experiments, taking the same question from different angles, would add to the debate. However, the stark preference for the [n.o] sequence shows that, at least in some places, there’s little more that can be said. It would seem that, if a variant with initial [o] exists, then it will occur after words ending [n].

Taking this one phrase, [n.o], which is the strongest individual last–first combination we have seen, what is the consequence for Transformation Theory? The theory would state that one or both of the words in such a phrase have been altered under the influence of their neighbour. The variant beginning [o] occurs after a word ending [n] specifically because the [n] causes the [o] to be added (or the [o] causes the [n], but there are reasons this is unlikely). The context of one word thus transforms the other.

Were we to take this individual transformation as definitely true then we could use the knowledge to ‘undo’ the transformation and restore the original, underlying, text. For example, in the phrase [aiin otedy] the second word might have originally been spelt [tedy], only taking the initial [o] due to the influence of [aiin]. The original text might have been [aiin tedy].
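As a thought experiment, the ‘undo’ step for this one transformation could look like the sketch below. It takes the [n.o] rule as given, which is exactly the assumption under discussion, and strips an initial [o] only when the preceding word ends [n] and the bare form is itself an attested word. The function name `undo_n_o` and the `vocabulary` set are hypothetical.

```python
def undo_n_o(words, vocabulary):
    """Strip an initial [o] from any word that follows a word ending [n],
    provided the form without [o] is attested in `vocabulary`."""
    restored = []
    for i, word in enumerate(words):
        previous = words[i - 1] if i > 0 else ""
        bare = word[1:]
        if previous.endswith("n") and word.startswith("o") and bare in vocabulary:
            restored.append(bare)
        else:
            restored.append(word)
    return restored
```

With `vocabulary = {"tedy"}`, the phrase `["aiin", "otedy"]` becomes `["aiin", "tedy"]`, while `["chol", "otedy"]` is left alone. Whether every [o] after [n] is really a product of the transformation is, of course, not something the sketch can decide.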

The undoing of word transformations would, if repeated throughout the manuscript, have the likely effect of making it more regular: a smaller vocabulary, with more repeated phrases, and potentially more obvious grammar. The research goal for Transformation Theory is to gather further evidence for transformations and refine our understanding of them. Last–first combinations are one part of this goal.

I realize that this post doesn’t include the full statistics, only my impressionistic assessment of them, so I’m going to ask Marco’s permission to post them as a separate file.