Proof of Linefirst Transformation

Some time ago I wrote about linefirst words. The first letter of words which appear at the beginning of lines is statistically different from those of words which occur elsewhere in the line. Specifically, the characters [p, t, s, d, y] are much more common.

While we can put down the commonness of [p, t] to Grove Words—in whole or in part—the over–occurrence of [s, d, y] still needs to be explained. My belief is that when a word begins a line one of these three characters is often added to the beginning (or switched in the case of [y]).

I call this phenomenon Linefirst Transformation, and I do not understand its cause. But I believe that I now have some proof beyond statistics.

The Semi–Circle String

The fairly strict word structure of Voynich words invites the researcher to try and typify it into a set of simpler rules. Tiltman, Roe, and Stolfi have all attempted this with differing results. I too have tried it, looking at low level and high level word structure.

However, I think my attempts can be improved by adding easier ways to think about common sequences of characters. The one I want to talk about below I will call the ‘Semi–Circle String’ (SCS) for want of a catchier name.

Tiltman’s 1951 Report to Friedman

Brigadier John Tiltman was introduced to the Voynich Manuscript by William Friedman in 1950. Friedman gave a number of photostats to Tiltman and asked him to study them. The following year Tiltman reported his findings to Friedman. He also reprised them for a 1967 talk and paper.

For the purposes of history, and to make his findings better known, I have transcribed the overview of the report which Tiltman published in 1967. It consists of fifteen points (a to o), each of which is more or less a separate observation on the properties of the text. Tiltman used his own transcription alphabet which I have modernized to EVA, but otherwise the text is as little changed as possible. I have also included any plates where they are mentioned in the text, either by copying them out or by linking to online images of the same page.

It must be borne in mind that Tiltman saw [k] and [f], [t] and [p], and [l] and [m] as variants.

(a) Following are some notes on the common behaviour of some of the commonly occurring symbols. I would like to say that there is no statement of opinion below to which I cannot myself find plenty of contradiction. I am convinced that it is useless (as it is certainly discouraging) to take account at this stage of rare combinations of symbols. It is not even in every case possible to say which is a single symbol and what is not. For example, I am not completely satisfied that the commonly occurring [a] has not to be resolved into [ei] or possibly [oi]. I have found no punctuation at all.

(b) [ckh] and [cth] appear to be infixes of [k] and [t] within [ch]. The variant symbol represented by [m] appears most commonly at the end of a line, rarely elsewhere.

(c) Paragraphs nearly always begin with [k] or [t], most commonly in the second variant forms [f, p], which also occur frequently in words in the top lines of paragraphs where there is some extra space.

(d) [y] occurs quite frequently as the initial symbol of a line followed immediately by a combination of symbols which seem to be happy without it in any part of a line away from the beginning. Otherwise it occurs chiefly before spaces very frequently preceded immediately by [d]. Hence my belief that these two have some separative or conjunctive function. (I have to admit, however, that [y] also seems sometimes to take the place of [o] before [k] or [t] (though rarely, if ever, after [q]); this is particularly noticeable in some of the captions to illustrations in the astronomical section of the manuscript—these most commonly begin [ok] or [ot] and it is here that we occasionally see [yk] or [yt].)

(e) I have tried, for convenience of handling, to divide words into what I call “roots” and “suffixes.” This arrangement is shown:

Preliminary division of very common words in “roots” and “suffixes”

Roots: [ok, ot, qok, qot, ch, sh, d, s, lk]

Suffixes: either (i) [e, ee, eee] followed by [y] or [dy]

or (ii) one or more of the following:

[an, ain, aiin, aiiin]

[ar, air, aiir, aiiir]

[al, ail, aiil, aiiil (am, aim, aiim, aiiim)]

[or, ol (om)]

Regarding the second type of suffix, some of the combinations are so rare that I have been uncertain whether to take any account of them at all. Some are very common indeed. It seems to me that each of these combinations beginning [a] has its own characteristic frequency which it maintains in general throughout the manuscript and independent of context (except in cases where two or more [a] groups are together in series, as referred to later). These [a] groups, e.g., [ar] or [aiin], frequently occur attached directly to “roots,” particularly [ok], [ot], [d], and [s]. [okaiin], [qokaiin], and [daiin] rank high among the commonest words in the manuscript.

(f) There are however many examples of 2, 3, 4 or even 5 [a] groups strung together on end without spaces between them. When this occurs, there appears to be some selective preference. For example, [ar] is very frequently doubled, i.e., [ar ar], whereas [aiin], which is generally significantly commoner, is rarely found doubled. Perhaps the commonest succession of three of these groups is [ar ar al]. [al] very frequently follows [ar], but [ar] hardly ever follows [al].

(g) [o], which has a very common and very definite function in “roots,” seems to occur frequently in “suffixes” in rather similar usage to [a], but nearly always as [or] and [ol]. [or aiin] is very common.

(h) The behaviour of the [a] (and [o]) groups has suggested to me that they may in fact constitute some form of spelling. It might be, for instance, that the manuscript is intended to demonstrate some very primitive universal language and that the author was driven to spell out the ends of words in order to express the accidence of an inflected language. If all the possible [a] and [o] combinations can occur, then there are 24 possibilities. They may, however, be modified or qualified in some way by the prefixed symbols [k], [t], [ok], [ot], [ch], [sh], [d], [s], etc., and I have not so far found it possible to draw a line anywhere. This, coupled with ignorance of the basic language, if any, makes it difficult to make any sort of attempt at solution, even assuming that there is spelling.

(i) [l], usually preceded by [a] or [o], is very commonly followed by [k], much less commonly by [t], with or without a space between. In this connection, I have become more and more inclined to believe that a space, though not intended to deceive, must not necessarily be regarded as a mark of division between two words or concepts.

(j) Speaking generally, each symbol behaves as if it had its own place in an “order of precedence” within words; some symbols such as [o] and [y] seem to be able to occupy two functionally different places.

(k) Some of the commoner words, e.g., [okeey], [okeedy], [qokeedy], [odaiin], [okar], [odal], [daiin], [chedy], occur twice running, occasionally three times.

(l) I am unable to avoid the conclusion that the occurrence of the symbols [e] up to 3 times in one form of “suffix” and the symbols [i] up to 3 times in the other must have some systematic significance.

(m) Peter Long has suggested to me that the [a] groups might represent Roman Numerals. Thus [aiin] might be IIJ, and [ar ar al] XXV, but this, if true, would only present one with a set of numbered categories which doesn’t solve the problems. In any case, though it accounts for the properties of the commoner combinations, it produces many impossible ones.

(n) The next three plates show pages where the symbols occur singly, apparently in series, and not in their normal functions. Plate 18. The column of 6 or 7 symbols [k] (or [t]), [o], [s], [y], [e], ?. In Plate 19 the succession of symbols in the circles must surely have some significance. One circle has the same series of 17 symbols repeated 4 times. Plate 20 also has an interesting column of symbols. In all three there are symbols which rarely, if ever, occur elsewhere.

(o) My analysis, I believe, shows that the text cannot be the result of substituting single symbols for letters in the natural order. Languages simply do not behave in this way. If the single words attached to stars in the astronomical drawings, for instance, are really, as they appear to be, captions expressing the names or qualities of those stars, there can hardly be any form of transposition system involved. And yet I am not aware of any long repetitions of more than 2 or 3 words in succession, as might be expected for instance in the text under the botanical drawings.

LAAFU and the Messy Transition from Spoken to Written

The existence of the line as a functional unit (LAAFU) in the Voynich Manuscript stands as a stumbling block for linguistic attempts to explain the text. For how can the container of a text alter its properties? I believe the answer may spring from the nature of a language as it takes its first baby steps into literacy.

In his research Prescott Currier showed quite clearly that certain characteristics of the text took place at the beginning and end of a line in the manuscript. The characters [m, g] occur more often word finally in the last words of a line, while [d, s] occur more often word initially in the first words of a line.

Though we might expect certain differences at the beginning and ends of paragraphs or sentences, due to possible patterns in sound or meaning at these points, we should not expect such a thing to occur on lines. For a line is an artifact of the text as written down, and it is not a linguistic fact in itself. Characteristics based on the line recur according to the width of the parchment, and should that width shrink or grow, so too must the interval of the characteristics. The same goes where a drawing causes the line length to be shorter.

One characteristic of LAAFU is the appearance of word–ending [m], which occurs often (though not exclusively) at the end of lines. On the side f17v there are five words ending in [m], four of which occur at the end of a line. However, the plant drawing extends the whole height of one side causing the lines to be significantly shorter than on many other pages. An even better example is f40v, where once again four of five word–ending [m] are at the end of a line. Two of them occur in a paragraph which is significantly shortened by the presence of a drawing, while the other two are in a paragraph which runs the full width of the page. On f54v there are five (out of eleven) occurrences at all different widths due to a raked end of line.
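Counts like these can be checked mechanically against any transcription. What follows is only a minimal sketch: the function name and the three-line sample page are made up for illustration, and a real check would read the lines of f17v, f40v, or f54v from an EVA transcription file.

```python
def final_m_positions(lines):
    """Count words ending in [m] at the end of a line versus anywhere else."""
    at_line_end = 0
    elsewhere = 0
    for line in lines:
        words = line.split()
        for index, word in enumerate(words):
            if not word.endswith("m"):
                continue
            if index == len(words) - 1:
                at_line_end += 1
            else:
                elsewhere += 1
    return at_line_end, elsewhere


# Hypothetical page in EVA, invented for illustration only.
sample_page = [
    "daiin chol okam",
    "qokaiin dam shol otam",
    "chor daiin",
]
print(final_m_positions(sample_page))  # (2, 1)
```

Because the count is taken per line, it does not matter how long each line is, which is the point of the examples above.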

No matter where the end of line occurs, the LAAFU effects occur with it. It is not preplanned. The process which creates the effects must lie within the text, but somehow be dynamic according to the container of the text. It is almost as though the line break itself causes the change, and this is what I wish to argue. But how can this be? Let us look at the line and the effect separately to see if some answer can be found.

The Line

The first texts were spoken. Songs and poems, epic stories, ritual words, and so on. We can think of them as texts, rather than just speech, as they were composed and repeatable. They differed from utterances which were made for the occasion and could be chosen by the speaker. But they were similar in being both spoken and heard, made from speech sounds and interpreted as such.

Spoken texts were often structured with phrases, pauses, stops, stress, and any number of other features common to speech and which could be heard as distinct features. Yet this prosody does not consist of speech sounds as we commonly think of them, and is often inadequately captured by the bare 26 letters (or however many) of the alphabet. Today we use a range of punctuation to convey some of these characteristics of speech, such as full stops, commas, semicolons, and question marks, while being painfully aware—notoriously in the use of irony—that our script is not up to the job.

For those written texts strongly governed by stress or number of syllables, we also use the layout of the text on the page to help us. The lines of a poem, or of a play, are displayed in their meter. Each new line on the page is a new line of poetry. It is so common in European languages that this is how poetry ‘looks’: you don’t even need to understand a language to identify poetry. Here, at the very least, the line is a functional unit.

Yet punctuation and specific layout of a text are developments which came later—hundreds or even thousands of years later—than writing itself. If we look at the lines of Beowulf, the epic of Old English poetry, we can see lines (and a mid line break known as a caesura) which are important to understanding the rhythm of the spoken poem:

Hwæt! wē Gār-Dena     in gēar-dagum
þēod-cyninga     þrym gefrūnon,
hū ðā æþelingas     ellen fremedon.
Oft Scyld Scēfing      sceaþena þrēatum,

And Beowulf certainly was spoken, or performed in some manner, in its original form. The written version we have is just a copy, a deposit of the words for safekeeping, not the actual poem itself. The expectation was that anybody reading the text would already have heard the poem, understand how it was performed, and know the text was not the whole story. Indeed, the original written text doesn’t look like a poem at all:

There are no poetic lines, no breaks, no gaps for pauses (there is a little punctuation, however). Just a series of words to which the reader would bring their own knowledge of the poem to make into the whole work. Romans wrote their literature in a similar but even more extreme form: scriptio continua, where even the gaps between words were left out. Again, the reader and writing together make up the text with the reader supplying whatever the writing lacked. And far from being primitive or simply a stepping stone to modern writing, it is actually both later than the use of word division using gaps and still the norm for some languages today.

The relationship between the written and the spoken is often not straightforward. Modern assumptions about what written texts should be like have developed over many years and are not the only ones possible. How a language is written on the page is the outcome of many choices made by many people, and not always made wittingly. The meaning of a line in the Voynich Manuscript cannot be taken for granted.

The Effect

Prosody is not the only feature of speech often poorly represented in writing. When one sound effects a change in a nearby sound we call this sandhi. It can happen in lots of places and in lots of ways, and can sometimes be seen in writing. For example, in English the word a becomes an before a vowel: a bird but an apple. The English language wants us to insert an /n/ to make two words easier to pronounce together, rather than have two vowels in a row.

You might protest that sandhi can’t be exactly unrepresented when there is a great big example at the heart of English! But in truth that’s about all there is in written English. A similar sandhi effect happens in some dialects where two vowels—at the end of one word and the beginning of the following—have an /r/ inserted between them. Known as intrusive–r, it makes a phrase such as “law and order” sound like “lawr and order”. It has become quite common, yet is not represented in standard writing. If ever written it is considered an error, the fault of the writer for not knowing that the /r/ isn’t meant to be there, even though it is spoken.

The interesting thing about sandhi effects is that they don’t happen if the person speaking leaves a gap or pause in their speech. The pause, however slight, which we leave at the end of a sentence is enough to stop the effects from taking place. The phrase, “I studied law in England where I live” would typically have an /r/ inserted after law (and after where, the final /r/ of this word not usually being pronounced in such dialects). Yet, if we break up the phrase in two, “I studied law. In England where I live…” there would be no /r/ sound inserted after law.

You’ve likely already guessed where this is going, so I will get right to the point. The language of the Voynich Manuscript may have had several kinds of sandhi working on the sounds of the spoken language, and the transition to writing may have made the presence of pauses ambiguous due to the unfamiliar nature of page layout. I believe these may account for LAAFU effects in one of two ways:

  1. That the main body of the text (that is, away from line ends) is usually subject to sound changes being well represented; the line break causing an actual pause in the prosody and thus the first and last words of sentences are unaffected.
  2. That sandhi effects are present in the spoken language but unrepresented in the main body of the text; the line break causes an uncertainty in the writer about prosody who then consciously inserts the sound changes at line end and line beginning.

I think the first scenario is unlikely. It requires the text to be shot through with sandhi effects, when we know it is actually pretty regular, and the writer to consider the line break a real pause.

The second scenario is more realistic. The main body of the text is regular and any sandhi is provided by the reader. The line break is not a real pause but a source of worry for the writer. Here he makes the sandhi explicit, consciously altering the end of the last words and the beginning of first words, so that the reader will get it right.

And so, this is my hypothesis for the line as a functional unit. It is neither pretty nor wholly obvious. But it provides a linguistic solution to a knotty problem.

Afterword: I am not happy with the final state of this article. It does not present all my thoughts in the way I would like. I have found it hard to reach and explain what I have in mind. But this is the nearest I have gotten in several months of working to write this down. I may change the article in the future to make things clearer, but the heart of the proposal will not change.


Last year I laid out my understanding of the low level and high level structure of Voynich words. As I consider the Voynich manuscript to be linguistic, I am happy to believe that the two structures relate to syllables. Specifically, the low level structure shows how a syllable itself is to be constructed, and the high level structure shows how syllables come together in words.

Now, after some delay, I have taken this ideal syllable and word structure and sought to apply it to actual words in the Voynich manuscript. For the purposes of the following a word type is a word with a specific spelling, such as [chor] or [opaiin], and a word token is an individual occurrence of a word type, so [chor] has 218 tokens and [opaiin] has 13 tokens.

I took the text of the whole Voynich manuscript and filtered out all those words with fewer than five tokens or with uncertain readings. The filter of at least five tokens was chosen to provide 1) a wordlist short enough to sort by hand, and 2) a reasonable likelihood that the words are valid and not the result of writing or reading mistakes. My wordlist thus held 913 word types totalling 26,372 tokens, roughly two thirds of the total tokens in the manuscript.
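For anyone wanting to repeat the filtering, a rough sketch follows. It is not the script I used: the function name is hypothetical, the sample words are invented, and I have assumed that uncertain readings are flagged with `?` or `*` in the transcription, which will depend on the transcription file chosen.

```python
from collections import Counter


def build_wordlist(words, min_tokens=5, uncertain_marks="?*"):
    """Map each word type to its token count, keeping only well-attested types."""
    # Assumption: a word containing any of these marks has an uncertain reading.
    certain = (w for w in words if not any(mark in w for mark in uncertain_marks))
    counts = Counter(certain)
    return {word: n for word, n in counts.items() if n >= min_tokens}


# Invented sample: two types pass, one is too rare, one is uncertain.
sample = ["chor"] * 6 + ["opaiin"] * 5 + ["dal"] * 2 + ["ch?r"] * 7
wordlist = build_wordlist(sample)
print(len(wordlist), sum(wordlist.values()))  # 2 11
```

The first number printed is the count of word types and the second the count of word tokens.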

I split every word type on the list to show the syllables it contains, and then sorted them into lists by number of syllables. Syllables were discovered using a fairly simple process: [a, y, o] are vowels and every instance of those indicates a syllable; [e] sequences are vowels if not immediately followed by [a, y, o]; and [ch, sh] count as vowels if not immediately followed by an [e] sequence or [a, y, o]; then, working from left to right, every character is part of the syllable belonging to the next vowel on the right, except at the end of words where there are no more rightward vowels, where characters belong to the syllable of the last vowel to the left.
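The process can be written down as a short routine. This is a sketch of the rules as stated rather than the hand sorting I actually did. The names are hypothetical, the gallows with benches ([ckh, cth, cfh, cph]) are assumed to be single characters, and a word with no vowel is returned as an empty list so that it counts as zero syllables.

```python
import re

# Multi-letter EVA characters must be matched before single letters;
# runs of [e] are kept together because the rule treats them as one unit.
GLYPH = re.compile(r"ckh|cth|cfh|cph|ch|sh|e+|.")
PLAIN_VOWELS = {"a", "y", "o"}


def is_vowel(glyph, following):
    """Apply the three vowel rules; `following` is the next glyph or None."""
    if glyph in PLAIN_VOWELS:
        return True
    next_is_plain = following in PLAIN_VOWELS
    next_is_e_run = following is not None and following.startswith("e")
    if glyph.startswith("e"):
        return not next_is_plain
    if glyph in ("ch", "sh"):
        return not (next_is_plain or next_is_e_run)
    return False


def syllabify(word):
    """Split a word into syllables, each ending at its vowel."""
    glyphs = GLYPH.findall(word)
    syllables, pending = [], ""
    for index, glyph in enumerate(glyphs):
        following = glyphs[index + 1] if index + 1 < len(glyphs) else None
        pending += glyph
        if is_vowel(glyph, following):
            syllables.append(pending)
            pending = ""
    # Characters after the last vowel join the syllable to their left.
    if pending and syllables:
        syllables[-1] += pending
    return syllables


for word in ("chor", "opaiin", "chedy", "s"):
    print(word, syllabify(word))
```

Under these assumptions [chor] comes out as one syllable, [opaiin] as [o] plus [paiin], [chedy] as [che] plus [dy], and [s] as no syllables at all.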

The whole of the wordlist was thus broken down into five smaller lists for words of 0 to 4 syllables. The statistics for each list are as follows:

0 syllables: 22 types, 634 tokens

1 syllable: 280 types, 11640 tokens

2 syllables: 500 types, 11504 tokens

3 syllables: 110 types, 2589 tokens

4 syllables: 1 type, 5 tokens

The list for two syllable words held the greatest number of word types, but one and two syllable words had roughly the same number of word tokens. Thus the number of tokens per type is highest for one syllable words, with tokens per type for two and three syllable words about joint lowest (four syllable words are technically lower, but with only one example).

The number of one syllable words is likely limited by the total number of possible syllables in the Voynich language. Although some more possible syllables appear in two and three syllable words which do not appear alone, there is a finite ceiling to how many one syllable words there can be, and this is relatively low due to the rigid syllable structure.

The most interesting aspects occur at either end of the distribution in those words of no syllables and four syllables. The possibility of words without vowels should not be shocking, but it does prompt us to give some explanation. It could be that other vowel characters exist, or that not all words are fully written, or that characters are not always used for a sound. However, the small percentage of tokens which have no syllables suggests that it is not a great problem for my syllabification.

Yet the almost complete lack of words longer than three syllables is rather unexpected. It is often repeated that the Voynich text lacks the short words common to many languages, but the truth is that it lacks long words. Over 85% of both word types and word tokens are one or two syllable words.
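The arithmetic behind these figures is easy to check from the five lists given earlier. The sketch below simply reworks the published counts; the variable names are my own.

```python
# (word types, word tokens) for each syllable count, as listed above.
lists = {
    0: (22, 634),
    1: (280, 11640),
    2: (500, 11504),
    3: (110, 2589),
    4: (1, 5),
}

total_types = sum(types for types, _ in lists.values())
total_tokens = sum(tokens for _, tokens in lists.values())
print(total_types, total_tokens)  # 913 26372

# Average number of tokens for each word type, per list.
for syllables, (types, tokens) in lists.items():
    print(syllables, round(tokens / types, 1))

# Share of the wordlist made up of one and two syllable words.
short_types = lists[1][0] + lists[2][0]
short_tokens = lists[1][1] + lists[2][1]
type_share = short_types / total_types
token_share = short_tokens / total_tokens
print(round(type_share * 100, 1), round(token_share * 100, 1))
```

One syllable words average about 41.6 tokens per type against roughly 23 for two and three syllable words, and one and two syllable words together make up about 85.4% of types and 87.8% of tokens.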

It is noteworthy that most of the multi–syllable words follow the breakdown rules which I put forward in my article on high level structure. One syllable of a word can be anything (the ‘Free’ syllable), but the other one or two must select from a much narrower pool. Moreover, the number of possible choices narrows further depending on whether the additional syllable is to be put before or after the Free syllable (and there may be only one before and one after).

I believe the results of the syllabification were fairly successful, and that my method is at least as sound as any other. The outcome is a fairly regular set of syllables put together to form words in a fairly regular way. If further examination of the results gives us more insight into the structure of Voynich words then we can be sure that there is some basis for regarding the syllabification as at least partly right. Each of the four wordlists from none to three syllables will be examined in more detail in future posts.

An Objection to Timm (2014)

In 2014, Torsten Timm published a paper, How the Voynich Manuscript was created, in which he claims to have discovered a generation method for the text of the manuscript. The work is an interesting look at several aspects of the text and any serious researcher should take the time to read it. It is at least on a par with Gordon Rugg’s work, in that it mounts a respectable challenge to both cryptological and linguistic camps of decipherment.

Timm makes several points in his paper, but a simple summary of his conclusion is possible: the writer of the manuscript composed the text by taking words already written down (often only one or two lines above the one he was writing) and by making small modifications created the next word. Although the practice had some further complexities, and variety over the course of the manuscript, the bulk of the text was made in this way.

Timm further claims that there is a fairly simple ruleset for the modifications, given on page 16. For the aim of my criticism of Timm’s work in this post, rules I and II are the most important, and so are worth quoting:

I) Copy an already written glyph group and replace one or more glyphs with similarly shaped glyphs. An example is 8an 8aie 8aiy (“dain dail dair”) in line <f45v.P.4>.
II) Copy a glyph group and add one or more glyphs. An example is ohc89 4ohc89 4ohcc89 (“okedy qokedy qokeedy”) in line <f31r.P.10>.

I contend that these rules do not work without supplemental information on how Voynich words should (and do) look, undermining the claim that Timm has found a text generation method. Let us look at these two rules critically, in reverse order.

Rule II

For this rule the writer is supposed to have copied a glyph group (a word) and added characters. It would seem from the example given, and the broadness of the phrasing, that any character can be added anywhere in the word, at the beginning, or at the end. Yet few possible alterations would result in a valid word.

Take [okedy] from the example in the rule, which has 118 tokens, and add any gallows character anywhere in the word. How many such possibilities even occur once, never mind more than a handful of times? I admit that I haven’t gone through all the possibilities, but the answer is none or almost none. Indeed, add [d, s, l, r, m, n, i] anywhere in the word and you still won’t find many valid words (I think [olkedy], with 27 tokens, may be the best).

There are, in fact, only a handful of single characters which you can add to [okedy], and then only in specific places, to make a valid word. Add [q] or [ch] to the beginning, another [e] next to the existing one, [o] between [e] and [d], [ch, sh] between [k] and [e], and of course [l] before [k]. Maybe you can find more, but I doubt there are many more. It is significant that only [ch] can be added in more than one place.

An objection to this observation may be that Rule II lets you add more than one character at a time. But again, you must add the right two characters, in the right place, to make a valid word. You cannot transform [okedy] into [kokedyk] or [oykerdy]. It seems as though there is another, deeper rule, governing this one, which dictates which characters may be added and where.
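To give a feel for how narrow the valid choices are, the sketch below generates every word which Rule II could make from [okedy] by adding a single character, and compares that with the handful of additions I listed above. The alphabet is a working assumption of common EVA characters, not Timm’s, and the set of valid results is simply my own list rather than a count taken from the manuscript.

```python
import re

# Treat benched gallows and [ch, sh] as single characters when splitting a word.
GLYPH = re.compile(r"ckh|cth|cfh|cph|ch|sh|.")

# Assumed working alphabet of common characters (hypothetical choice).
ALPHABET = ["q", "o", "y", "a", "e", "i", "d", "s", "l", "r", "m", "n",
            "ch", "sh", "k", "t", "f", "p"]

# The single additions to [okedy] named in the text as giving valid words.
REPORTED_VALID = {"qokedy", "chokedy", "okeedy", "okeody",
                  "okchedy", "okshedy", "olkedy"}


def single_insertions(word, alphabet=ALPHABET):
    """Return every word made by adding one character at any position."""
    glyphs = GLYPH.findall(word)
    results = set()
    for position in range(len(glyphs) + 1):
        for glyph in alphabet:
            results.add("".join(glyphs[:position] + [glyph] + glyphs[position:]))
    return results


candidates = single_insertions("okedy")
valid = candidates & REPORTED_VALID
print(len(valid), "of", len(candidates), "candidates are on the valid list")
```

Running the same function over its own output gives the two character additions, and the same comparison can then be made against a full wordlist.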

Rule I

The problems raised for Rule II arise in Rule I also. You simply cannot swap out any character and replace it with another and expect a valid new word. So [okedy] will not become [okeny] or [oksdy]. That much is obvious and there is no need to belabour the point that some unspoken rule must be operating to prevent the writer from making invalid words.

But at least here in Rule I the unspoken rule is spoken. Timm says that you can swap a character for any “similarly shaped” one. To be fair to Timm he does set out what this means on page 5, but then only as an observation about how words are related. He notes that words often differ by only one character and that those characters are graphically similar (so that as [k] and [t] look alike, so [okedy] and [otedy] are both valid). This is a fairly common observation, and Timm is not wrong to make it. But the essence of this observation is left unresolved.

Timm notes in his discussion (page 36) that the script has a “design”, and that characters are individually designed from simpler strokes. But what is this design and why? It is not an idle question, as it bears hugely on the question of how the text was generated. If the writer could not simply make any alteration to any word, then it must be that each word has a small number of possible antecedents. Why would the writer limit themselves in such a way?

Moreover, there is a total set of permissible words (whether realized in the manuscript or not) which is a subset of all the possible words. And these permissible words, by virtue of the rules given on page 4 concerning “similarly shaped” characters, have a specific and definable structure. I have talked about the high level and low level word structure of Voynich words, and it is clear to most researchers that such a thing is very real. But again, what does that structure mean? And why are words structured like this?

Timm’s work does not really address the generative aspects of word structure, only the relative aspects. He points out in a novel way how words are related to one another in text, but not really how they are generated. For although he claims that most words are simple alterations of existing ones, they are alterations around a structure which existed in the writer’s mind. Simply saying that such a structure exists is not enough, unless we are to believe that it is arbitrary. And although he considers a variety of ways in which characters might have some meaning, he ultimately comes to the conclusion of meaninglessness:

In the end, the most plausible hypothesis for the Voynich manuscript is that the text generation method described in this paper was used to generate a meaningless pseudo text.

Yet the existence of an abstract set of principles which structures most of the words within the Voynich text should be the starting point of our textual analysis. It is so obvious, so rigid, and so studiable. To write it off as arbitrary just to create a meaningless text feels like throwing away some of our best evidence. Timm’s theory leaves unexplained a big piece of the puzzle, which is both intellectually and logically unsatisfying, and likely to be proven wrong.

Gallows and Featuralism

I have thought for some time that the Voynich script has aspects of featuralism. That is, the shape of a character is not wholly arbitrary but in some way related to the sound it represents. Similarly shaped parts of different characters may thus represent similar parts of sound.

A very basic example of featuralism is found in Latvian, where all the vowels come in long and short pairs. Short vowels are written with plain characters (a, e, i, u) and the corresponding long vowels with the same characters plus a small bar (a macron) over the top (ā, ē, ī, ū). The bar, or its absence, thus represents the length of a vowel as a feature rather than a whole sound. Hangul, the script used to write Korean, has even more extensive featuralism. Several characters are made by doubling or combining others in ways which consistently represent the same relationships between the starting characters and the result.

I do not think that the Voynich script is thoroughly featural, in that every stroke represents a part of a sound and the characters are simply bundles of strokes representing the bundles of features which make up a sound. I do not know if any featural script goes so far. But I do think that some aspects of featuralism may be present in the script, and here wish to look at one group in particular: the gallows characters.

The gallows characters are a group of eight symbols [k, t, f, p, ckh, cth, cfh, cph] which have a few shared aspects. Namely: 1) they all have a long straight left “leg” which extends the full height of the line; 2) a stroke links this leg to a loop at the right hand side, and 3) they all occur in roughly the same word environments. There are also a few seldom seen characters which look very similar, but only these eight occur in any number. Hence the focus on these eight characters alone.

However much they share, the gallows characters are split by differences. Three features neatly divide the eight characters into halves: 1) the presence of a loop to the left; 2) the presence of a bent right leg (as opposed to straight); and 3) the presence of a “bench” (like a [ch] character). Each feature is shared by exactly half of the eight gallows characters, and each unique combination of features is represented by a character.

The diagram below shows the characters mapped onto the three features:

Gallows Features

The diagram is labelled X, Y, Z, to show the three features, with a + or – for presence or absence: X for left loop, Y for bent leg, and Z for bench. Movement along any axis implies the switch of that feature from present to absent (or vice versa), while the presence or absence of the other two features stays the same. Each of the gallows can be referred to by its XYZ coordinates, so that [k] is -,-,-, [p] is +,+,-, [cfh] is -,+,+, and so on.
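The coordinates can be held in a small table, with movement along an axis as a lookup. This is only a sketch: the names are hypothetical, and only the coordinates of [k], [p], and [cfh] are spelled out above, the other five being filled in from the same three definitions.

```python
# (X, Y, Z) = (left loop, bent leg, bench); True stands for +, False for -.
GALLOWS = {
    "k":   (False, False, False),
    "t":   (True,  False, False),
    "f":   (False, True,  False),
    "p":   (True,  True,  False),
    "ckh": (False, False, True),
    "cth": (True,  False, True),
    "cfh": (False, True,  True),
    "cph": (True,  True,  True),
}
BY_FEATURES = {features: gallows for gallows, features in GALLOWS.items()}


def flip(gallows, axis):
    """Move along one axis: switch that feature and keep the other two."""
    features = list(GALLOWS[gallows])
    index = "XYZ".index(axis)
    features[index] = not features[index]
    return BY_FEATURES[tuple(features)]


print(flip("k", "X"))    # t
print(flip("p", "Z"))    # cph
print(flip("cfh", "Y"))  # ckh
```

Since every combination of the three features is taken by exactly one character, `flip` always lands on another gallows.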

We can also begin to think of the feature groups themselves in the abstract, and consider how they affect the characters in the text. This is what we will look at now.


The gallows characters differ in frequency quite substantially. The most common [k] occurs nearly 10,000 times, while the least common [cfh] only about 80. All the features seem to have some impact on how often a character occurs. Below is a list of the eight characters by frequency (approximately):

[k] 9940      [ckh] 910

[t] 5930      [cth] 945

[f] 425       [cfh] 75

[p] 1400      [cph] 220

Feature Z, the presence of a bench, has the most consistent impact. All characters which have Feature Z are less common than their equivalents without it. Characters without Feature Z are about five to ten times more common. Feature Y, the presence of a bent leg, has a similar impact but with a greater spread. Characters without Feature Y range from being four to 24 times more common.

Feature X, the presence of a left loop, is inconsistent. The presence of Feature X is more common than its absence when combined with the presence of Feature Y, the bent leg. But when Feature Y is absent it is unclear what difference Feature X makes to frequency.
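The ratios behind these statements can be worked out directly from the approximate counts in the list. In the sketch below each pair is a character without the feature followed by its partner with the feature. The pairings are my reading of the three features and should be taken as an assumption.

```python
# Approximate frequencies from the list above.
FREQ = {"k": 9940, "t": 5930, "f": 425, "p": 1400,
        "ckh": 910, "cth": 945, "cfh": 75, "cph": 220}

# (character without the feature, character with the feature).
PAIRS = {
    "X": [("k", "t"), ("f", "p"), ("ckh", "cth"), ("cfh", "cph")],
    "Y": [("k", "f"), ("t", "p"), ("ckh", "cfh"), ("cth", "cph")],
    "Z": [("k", "ckh"), ("t", "cth"), ("f", "cfh"), ("p", "cph")],
}


def ratios(feature):
    """How many times more common the character without the feature is."""
    return {f"{without}/{with_}": round(FREQ[without] / FREQ[with_], 1)
            for without, with_ in PAIRS[feature]}


for feature in "ZYX":
    print(feature, ratios(feature))
```

For Feature Z the ratios run from about 5.7 to 10.9, and for Feature Y from about 4.2 to 23.4. For Feature X they fall on both sides of 1, which is the inconsistency described above.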

Page Distribution

The distribution of gallows characters also has some interesting variations by features. The presence of Feature Y is much more likely on the first line of a paragraph, and Feature X more likely as the first character in a paragraph. These two observations have been extensively discussed.

Word Environment

The environment of a gallows character is somewhat influenced by the presence or absence of features. We noted above that all the eight gallows characters have similar word environments (that is, where they appear within words), but there are some sharp differences.

The presence of Feature Y, the bent leg, along with the absence of Feature Z, the bench, means that the character almost never appears before [e], but more commonly before [ch, sh]. The feature also makes it more likely that the gallows will appear at the beginning of a word.

The presence of Feature Z means that the character almost never appears before [ch, sh], but much more likely that the gallows will appear after those two characters. The feature also has an overall tendency to be more common at the beginning of words, though this is not wholly consistent. The absence of Feature Z has a higher instance of [o, y] immediately before.

The presence or absence of Feature X, the left loop, seems to make no great difference to word environment. The biggest differences come in the added presence of Feature Z, where X causes the gallows to be more common word initially, but still with significant occurrences in the same environments. There seems to be a certain level of replaceability with Feature X, where the gallows with and without this feature can appear in the same place quite happily.


A lot more work could be done to look at the differences between these different features, but I think this three feature breakdown of gallows characters will prove to be both interesting and useful. It unifies observed differences into a single system, hopefully modelling how the inventor of the Voynich script might have seen it.

Without more work it is hard to draw any firm conclusions about what each feature may mean. My hunch is that Feature X is the most significant of the three features, but for now the argument is more impressionistic than reasoned.