First–Last Combinations

The purpose of this post is to look at whether the beginnings and ends of adjacent words show any patterns in frequency. For this post I will call these first–last combinations.

Let’s have an example: when a word ends [y], is the next word more or less likely to begin with [q] or [o]? If we have a word such as [sheey], is the next word more likely to be [qokeedy] or [okeedy]?

We should expect the number of occurrences to be related to the number of times [y] ends a word and [q] or [o] begins a word, assuming that the two are not linked. Any substantially lower or higher frequencies would suggest an underlying process which needs to be investigated.

A few months ago Marco Ponzi very kindly created a batch of statistics on which this post is based. His statistics counted 1) the number of times a given character ends a word, 2) the number of times a given character begins a word, 3) the number of times we should expect a specific first–last combination, 4) the number of times that first–last combination actually occurs, and 5) the ratio of observed to expected.

This last number is important, and it should be stressed it is not the frequency but the deviation from expected frequency. So 1 = expected frequency, 2 = twice expected frequency, and 0.5 = half expected frequency.
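To make the five numbers concrete, here is a minimal sketch of how such a ratio could be computed. It is not Marco's actual code, which I have not reproduced here. It assumes that words are separated by spaces, that pairs are only counted within a line, that "expected" means the count predicted if word endings and word beginnings were independent, and that only the single first and last character of each word is used. The function name and the toy input are made up for illustration.

```python
from collections import Counter


def first_last_ratios(lines):
    """Return {(last_char, first_char): observed / expected} for adjacent words."""
    last_counts = Counter()   # how often each character ends the left-hand word
    first_counts = Counter()  # how often each character begins the right-hand word
    observed = Counter()      # how often each first-last combination occurs

    for line in lines:
        words = line.split()
        # Pair every word with the word that follows it on the same line.
        for left, right in zip(words, words[1:]):
            last_counts[left[-1]] += 1
            first_counts[right[0]] += 1
            observed[(left[-1], right[0])] += 1

    total_pairs = sum(observed.values())
    ratios = {}
    for (last, first), seen in observed.items():
        # Expected count if the ending and the beginning were unlinked.
        expected = last_counts[last] * first_counts[first] / total_pairs
        ratios[(last, first)] = seen / expected
    return ratios


# Toy input, not Voynich text: three adjacent pairs in one line.
toy_lines = ["ab ba ab ba"]
toy_ratios = first_last_ratios(toy_lines)
# (b, b) is seen 2 times against an expected 2 * 2 / 3, a ratio of 1.5.
```

A ratio of 1.5 here reads the same way as the table further down: the combination turns up one and a half times as often as independence would predict.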

The results show that some combinations are more common, others less common, than should be expected. For example, in Marco’s statistics [r] was the last character in a word 4181 times, and [d] was the first character 2334 times. We should expect an [r d] combination—so a phrase like [or daiin]—about 400 times. But such a combination occurs only 205 times.
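The arithmetic behind the [r d] example can be checked in a few lines. The sketch only uses numbers quoted in this post: the observed count of 205, the rounded expectation of about 400, and the ratio 0.5136 given in the table for a word ending [r] followed by a word beginning [d]. The variable names are mine.

```python
observed = 205           # times [r d] actually occurs
rough_expected = 400     # rounded expectation quoted in the text
table_ratio = 0.5136     # table cell: first character d, last character r

# Ratio using the rounded expectation.
rough_ratio = observed / rough_expected

# Expectation implied by the more precise ratio in the table.
implied_expected = observed / table_ratio

# How many occurrences are 'missing' relative to that expectation.
shortfall = implied_expected - observed

print(f"rough ratio:      {rough_ratio:.4f}")
print(f"implied expected: {implied_expected:.1f}")
print(f"shortfall:        {shortfall:.1f}")
```

The implied expectation comes out at about 399, so the rounded figure of 400 and the table value agree, and the shortfall is a little under 200.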

The ‘missing’ 200 occurrences are something which will need to be explained. But before we attempt to do that let’s see what other patterns there are. Below is a table in which each row is the first character of the following word and each column is the last character of the preceding word. Each square is a first–last combination, given as the ratio of observed to expected frequency (a decimal, where expected = 1).

I’ve coloured each square, with yellow, orange, and red as less common than expected and shades of green as more common. Purple is a broad band in the middle where the frequency is about what was expected.

First/Last	y	o	n	r	s	l	d
d	1.1645	1.5816	0.8271	0.5136	0.2822	1.3901	0.4543
k	1.3298	2.2183	0.1725	0.3153	0.2143	1.7832	0.2491
t	1.7464	1.3459	0.2889	0.2779	0.289	1.0369	0.112
r	1.5375	4.4127	0.146	0.2482	0.1793	0.8222	0.6947
l	1.5871	3.0925	0.1944	0.2566	0.4022	0.8925	1.5952
c	0.7529	0.805	1.3639	1.1217	0.9212	1.1129	0.7526
s	0.7814	0.6528	1.1895	1.1856	0.848	1.2135	0.8451
y	0.7962	0.8072	1.3505	1.2406	1.5655	0.6891	1.5775
o	0.8669	0.4835	1.2872	1.203	1.1843	0.8744	0.99
a	0.1794	0.7535	0.828	2.6476	4.386	0.5634	1.457
q	1.682	0.6726	0.6201	0.4356	0.3172	0.66	1.7517

Strong and Weak Groups

The most striking pattern which emerges is the presence of two main groups, which I’ve named strong and weak (though the names have no meaning). The strong group contains [k, t, r, n, s] and the weak group contains [y, o, a, ch, sh]*.

For characters in these groups the frequency of first–last combinations follows a very simple rule. Any combination where the two characters are from the same group is likely to be less frequent than expected, and any combination from different groups is likely to be more frequent than expected.

More specifically, strong first characters are highly correlated with weak end characters and against strong end characters. Weak first characters are moderately correlated against weak endings, and range from ambivalent to highly correlated with strong end characters.

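The division is not perfect, as not all characters fall into these two groups, and not all first–last combinations obey the rules. The combination [n a] is maybe the worst of these: as a pairing across groups it should be more common than expected, yet at 0.83 of expected it falls short almost as much as the same-group combination [o a] at 0.75.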
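The rule can be tested mechanically against the table. The sketch below is only one way of doing it and rests on assumptions of mine: word-initial [c, s] are counted as weak because they mostly stand for [ch, sh], word-final [s] is counted as strong, and [d, l, q] are left out because they do not fall cleanly into either group. The table values are copied from above; the names are made up.

```python
TABLE = """\
First/Last y o n r s l d
d 1.1645 1.5816 0.8271 0.5136 0.2822 1.3901 0.4543
k 1.3298 2.2183 0.1725 0.3153 0.2143 1.7832 0.2491
t 1.7464 1.3459 0.2889 0.2779 0.289 1.0369 0.112
r 1.5375 4.4127 0.146 0.2482 0.1793 0.8222 0.6947
l 1.5871 3.0925 0.1944 0.2566 0.4022 0.8925 1.5952
c 0.7529 0.805 1.3639 1.1217 0.9212 1.1129 0.7526
s 0.7814 0.6528 1.1895 1.1856 0.848 1.2135 0.8451
y 0.7962 0.8072 1.3505 1.2406 1.5655 0.6891 1.5775
o 0.8669 0.4835 1.2872 1.203 1.1843 0.8744 0.99
a 0.1794 0.7535 0.828 2.6476 4.386 0.5634 1.457
q 1.682 0.6726 0.6201 0.4356 0.3172 0.66 1.7517
"""


def parse_table(text):
    """Return {(last_char, first_char): ratio} from the table text."""
    header, *rows = text.strip().splitlines()
    last_chars = header.split()[1:]  # columns: last character of preceding word
    ratios = {}
    for row in rows:
        first, *values = row.split()  # rows: first character of following word
        for last, value in zip(last_chars, values):
            ratios[(last, first)] = float(value)
    return ratios


RATIOS = parse_table(TABLE)

# Group membership by position (my reading of the post, see the note above).
FIRST_GROUP = {"k": "strong", "t": "strong", "r": "strong",
               "y": "weak", "o": "weak", "a": "weak", "c": "weak", "s": "weak"}
LAST_GROUP = {"n": "strong", "r": "strong", "s": "strong",
              "y": "weak", "o": "weak"}

checked = 0
exceptions = []
for (last, first), ratio in RATIOS.items():
    if last not in LAST_GROUP or first not in FIRST_GROUP:
        continue  # skip the unclassified characters d, l, q
    checked += 1
    same_group = LAST_GROUP[last] == FIRST_GROUP[first]
    # Same group should be rarer than expected, different groups commoner.
    obeys = ratio < 1 if same_group else ratio > 1
    if not obeys:
        exceptions.append((last, first, ratio))

print(f"{checked - len(exceptions)} of {checked} combinations obey the rule")
for last, first, ratio in exceptions:
    print(f"exception: [{last} {first}] at {ratio}")
```

Under these assumptions only three of the forty classified combinations break the rule, and [n a] is the lowest of the three.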

Exceptions and Outliers

The character [q] is most like the strong group, except that while [y] is very commonly the end of the preceding word, [o] has a much lower frequency. It is noteworthy that, as [q] is almost always at the beginning of a word and before [o], something like two thirds of its occurrences are in the string [y qo].

While [d] is definitely a strong character at the beginning of a word, the situation at the end of a word is mixed. It is still most similar to a strong character, with words ending [d] being infrequent before words beginning [k, t, r, d] and moderately frequent before most weak characters, though not [ch]. However, it is common before words beginning [l, q]. The numbers involved are low, but the pattern is unexplained.

The character [l] is the most wayward. Like [d] it acts as a strong character at the beginning of a word. Yet at the end of a word it is wholly unclassifiable. Most beginning characters appear moderately after a word ending [l], but [y, a, q] appear significantly less and [d, k] significantly more. It simply does not fit into the paradigm.
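The statement about [l] can be checked by ranking the l column of the table. This is a small sketch using only the values from that column; the dictionary name is mine.

```python
# Ratios for each first character after a word ending [l] (the l column).
AFTER_L = {
    "d": 1.3901, "k": 1.7832, "t": 1.0369, "r": 0.8222, "l": 0.8925,
    "c": 1.1129, "s": 1.2135, "y": 0.6891, "o": 0.8744, "a": 0.5634,
    "q": 0.66,
}

# Sort first characters from least to most frequent relative to expectation.
ranked = sorted(AFTER_L, key=AFTER_L.get)

lowest_three = ranked[:3]
highest_two = ranked[-2:]

print("least common after [l]:", lowest_three)
print("most common after [l]: ", highest_two)
```

The three lowest are [a, q, y] and the two highest are [d, k], which cuts across both the strong and the weak group.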

How Does the Process Work?

While looking at how some characters occur more often with others in first–last combinations, one question has consciously been dodged. How do some combinations become more common and others less common?

We are once more faced with a twofold explanation, the same as when we examined Grove words and the position of [m], and with all line patterns generally. Do words move around or are they altered? Is a combination like [sheey qokeedy] made by bringing words together or by shaping an existing combination to make them fit?

The answer is impossible to know for sure until we understand the underlying language. But the idea of altering a word fits with what we already know or suspect about how line patterns work. It is also more readily believable as a linguistic process, for such a thing does occur in a number of languages.

Yet this leads us to a second question: which character conditions which in first–last combinations? If a word begins [k], does that cause the preceding word to add a [y] to the end? Or does a word ending [r] cause the next word to adopt an [o] at the beginning?

The answer may be a bit of both. A first–last combination such as [r k] would be avoided quite heavily, but there are words ending [ry] and words beginning [ok], either of which would solve the problem.
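The table values show why either change would work. In this small sketch the three ratios are taken from the table, and the two "repairs" are simply the ones named in the paragraph above: adding [y] to the end of the first word, or adding [o] to the start of the second.

```python
# (last character, first character) -> observed / expected, from the table.
RATIO = {
    ("r", "k"): 0.3153,  # the avoided combination
    ("y", "k"): 1.3298,  # after the first word gains a final [y]
    ("r", "o"): 1.203,   # after the second word gains an initial [o]
}

avoided = RATIO[("r", "k")]
repairs = {
    "add [y] to first word": RATIO[("y", "k")],
    "add [o] to second word": RATIO[("r", "o")],
}

for name, ratio in repairs.items():
    # Both repairs move the pair from well below expected to above expected.
    print(f"{name}: {avoided} -> {ratio}")
```

Both routes turn a combination seen at about a third of its expected rate into one seen more often than expected, so the numbers alone cannot say which word is doing the adapting.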

Yet [n] is almost always word final, and the only common character it can be regularly substituted for is [r]. As both [n] and [r] are in the strong group their substitution would make no difference. It is more likely, in such a case, that weak group characters are added to the beginning of following words.

What Does it Mean?

The strong and weak groups fall into line with some things we already knew about the Voynich script, though much is new.

I have already proposed that [y, a, o] make up a group of vowels, and so this is likely to be the core driver behind the weak group. The membership of [ch, sh] is a new observation, and the relationship between those characters and vowels is unexplored.

One key link between these weak group characters is in the first syllable of multisyllable words. If the second syllable of a word contains [t, k], then the preceding syllable normally only contains characters from this weak group (and [e], which is not analyzed here), with the sole exception of the syllable [qo].

The strong group is likely to be consonants, which is something we might have already guessed. But here we have a potential phonological process to explore and explain with reference to that fact.

Though [l] is most like characters in the strong group, there is clearly something odd about its value. It is really the most intriguing of all characters in the Voynich script, which is not obvious at first glance.

*The statistics don’t differentiate [ch, sh] but rather [c, s]. The majority will be [ch, sh], but there may be some interference from words beginning [s, ckh, cth, cph, cfh]. It will be interesting to learn if the differentiated statistics bear out all that I’ve said in this post.

6 thoughts on “First–Last Combinations”

nickpelling says:

on January 10, 2017 at 11:10 am

An interesting post: but how consistent are the results between A and B?


- EMSmith says:
  
  on January 10, 2017 at 5:20 pm
  
  That would need more statistics to test, as I don’t have the breakdown between A and B in this case.
  
  
david says:

on February 7, 2017 at 1:33 pm

Emma, I do not know if you are aware, but a lot of things you are working on, are possible using the extensive list I’ve made of f.e. all ‘letter dependencies’ in Excel.
It’s also comprehensible that you want to do it yourself with your own data, that’s fine. Just wanted to mention it, cause I know how much time it takes to generate good data.


- EMSmith says:
  
  on February 7, 2017 at 7:01 pm
  
  Thank you!
  
  
D.N. O'Donovan says:

on March 21, 2017 at 12:08 pm

Emma,
I ask this because you’re a linguist, not a Voynich-related linguist. I wonder if we have enough evidence of the old Italic group to identify one of them. Just as example, Etruscan. (Etruscan rustic art sometimes has the strange small-mouth-with twisted lips that you see on a couple of the ‘sun’ images in the Vms).

Not a theory, or an hypothesis, just a question whether – if it were something like that which had been copied by would-be fifteenth-century antiquarians – we’d ever have a hope of understanding it.


nickpelling says:

on May 6, 2019 at 5:45 pm

Hmmm… I would have thought that the -o r- result of 4.4127 is surely an indication that the space inside the “or” pair is miscopied or mistranscribed. Similarly for the -o l- pair (3.0925) and even the -o k- pair (2.2183).

The -s a- result of 4.386 is less easy to account for, though. I also have a suspicion that this may be stronger in B than in A.

The absences in the grid are also intriguing (e.g. -d t- and -n r-). Incidentally, r appears word-initial 400-odd times, most often in Q13 and Q20, though there are a fair few places where it is a mistranscribed s- , *sigh*.
