The first glyphs of words which occur at the start of lines are statistically different from the first glyphs of words in the text as a whole. Some glyphs are more likely to occur at the beginning of lines, others less. The goal of this article is to look more closely at the difference and attempt to describe what is happening.
The difference between words at the start of lines and the whole of the text was noticed at least as early as Prescott Currier in the 1970s. The statistics that he was able to collate were partial, owing to the samples he had available to him, but we are now able to see the problem in full. Nevertheless, I have chosen to concentrate on the text of the so-called Stars section of the manuscript (also known as Recipes, taking up most of Quire 20). This is because it presents a long and almost continuous piece of text, with neither breaks for diagrams nor substantial shifts in column width. The width of the text column naturally determines what portion of words occur at the beginning of a line (few for wide columns, many for narrow ones), so the width needs to stay steady across our sample. The text is written in the Currier B language.
I took a transcription of the text, removed any words with unsure readings, and divided the words into two groups depending on where they occurred on a line: one for the first word of a line and one for the rest. Due to the presence of Grove words, which affect the first word of a paragraph, I further split the words which occur at the beginning of a line into two groups: one for the first word of paragraphs and one for the rest.
The numbers of words in the different groups were as follows: paragraph initial: 286; linestart but not paragraph initial: 686; all non-linestart: 8672. The paragraph initial words will not be further considered in this article.
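The grouping described above could be sketched roughly as follows. This is only an illustration: the input layout (paragraphs as lists of lines, lines as lists of word strings), the `?` marker for an unsure reading, and the function name `group_words` are all assumptions made for the example, not features of any particular transcription file.

```python
from typing import Dict, List

UNSURE_MARK = "?"  # assumed marker for an unsure reading (hypothetical convention)


def group_words(paragraphs: List[List[List[str]]]) -> Dict[str, List[str]]:
    """Split words into paragraph-initial, linestart, and non-linestart groups."""
    groups: Dict[str, List[str]] = {
        "paragraph_initial": [],
        "linestart": [],
        "non_linestart": [],
    }
    for paragraph in paragraphs:
        for line_index, line in enumerate(paragraph):
            for word_index, word in enumerate(line):
                if UNSURE_MARK in word:
                    # Drop unsure words; the positions of the remaining
                    # words are still judged on the original line.
                    continue
                if word_index > 0:
                    groups["non_linestart"].append(word)
                elif line_index == 0:
                    groups["paragraph_initial"].append(word)
                else:
                    groups["linestart"].append(word)
    return groups
```

One design choice is worth stating: when the first word of a line is unsure, the sketch drops it and does not promote the second word to linestart position. Whether that matches the counts given above is not something the sketch can confirm.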
The difference between first glyphs of words in these two groups is stark. Over 75% of linestart words begin with one of four glyphs: [y] 23.2%, [d] 21.6%, [o] 16.6%, and [s] 15.3%; whereas a similar percentage of non-linestart words begin with: [o] 24.3%, [q] 18.5%, [ch] 18.0%, [a] 8.9%, and [sh] 8.1%. Only [o] occurs high on both lists, and even then at a significantly different percentage.
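Percentages of this kind can be produced by counting the first glyph of every word in a group. The sketch below assumes words are plain strings in which the sequences `ch` and `sh` each stand for a single glyph, as the bracketed [ch] and [sh] above suggest; the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

# Assumed: these two-letter sequences each represent one glyph.
DIGRAPHS = ("ch", "sh")


def first_glyph(word: str) -> str:
    """Return the first glyph of a word, treating ch and sh as single glyphs."""
    for digraph in DIGRAPHS:
        if word.startswith(digraph):
            return digraph
    return word[:1]


def first_glyph_shares(words: Iterable[str]) -> Dict[str, float]:
    """Percentage of words beginning with each glyph, most common first."""
    counts = Counter(first_glyph(w) for w in words if w)
    total = sum(counts.values())
    return {glyph: 100.0 * n / total for glyph, n in counts.most_common()}
```

Running `first_glyph_shares` once on the linestart group and once on the non-linestart group would give two tables that can be set side by side.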
On the assumption that non-linestart words are ‘normal’, some process is altering those words which occur at the start of lines. The first step to understanding that process is describing the differences in more detail and suggesting how linestart words can be ‘changed back’. If by a series of alterations we arrive at a point where the two statistics match then we may have captured the rudiments of the process, albeit in reverse.
There may be some linguistic (or other) principle underlying the process of linestart transformations. Even if we cannot identify that principle, we can at least become more properly aware of its presence, and that awareness will help us relate the different glyphs involved.
It is no overstatement to say that the problem of ‘undoing’ linestart transformations is complex. Many different glyphs are involved on both sides, and even a quick look shows that there is no one-to-one match between glyphs.
The glyph [o] needs to gain nearly 8 percentage points to go from 16.6% linestart to 24.1% non-linestart. Yet the three other most common initial glyphs for linestart words all need to be reduced by much greater amounts: nearly 18 percentage points for [d], over 14 percentage points for [s], and a whopping 21.4 percentage points for [y]. Simply adding [o] to the front of any one of these groups would create too many words beginning with [o].
Moreover, adding a glyph does not seem to be a solution. Adding [o] to words beginning with [d], [s], or [y] would not result in the right kinds of words. Almost no words begin [oy], few begin [os], and only some [od]. In most words beginning [o], the next glyph is [k], [t], or [l]. Similar arguments can be made for many of the target glyphs.
So where to begin? And what to do with the words we wish to transform?
The best place to begin is to look in a little more detail at individual glyphs and see which may give us an opening. For linestart words, [d], [s], and [y] are almost the only initial glyphs whose occurrence we wish to lessen. (The glyph [t] is a little too common initially, but not to the same extreme.) Conversely, one glyph whose occurrence we wish to increase is [a]: it begins 8.9% of non-linestart words but only 0.1% of linestart words. At the non-linestart rate, we should expect about 60 of the 686 linestart tokens to begin with [a].
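The figure of about 60 follows from numbers already given: if linestart words began with [a] as often as non-linestart words do, 8.9% of the 686 linestart words would do so. A small check of that arithmetic, using only the figures quoted above (the constant and function names are made up for the example):

```python
LINESTART_TOKENS = 686         # linestart but not paragraph-initial words
NON_LINESTART_SHARE_A = 8.9    # percent of non-linestart words beginning [a]
LINESTART_SHARE_A = 0.1        # percent of linestart words beginning [a]


def expected_tokens(share_percent: float, total: int) -> float:
    """Number of tokens implied by a percentage share of a group."""
    return share_percent / 100.0 * total


# Tokens we would see if linestart words behaved like non-linestart words.
expected = expected_tokens(NON_LINESTART_SHARE_A, LINESTART_TOKENS)
# Tokens implied by the share actually reported for linestart words.
observed = expected_tokens(LINESTART_SHARE_A, LINESTART_TOKENS)

print(f"expected about {expected:.0f}, observed about {observed:.0f}")
```

The shortfall between the two values is the number of [a]-initial words that any proposed transformation has to account for.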
The glyph [a] is important because it simply cannot occur before certain glyphs, including [d], [s], and [y]. Therefore we know immediately that [a] is not to be added to the start of these words. Furthermore, when we look at the second glyph of linestart words we find more glyphs which [a] cannot come before: most words beginning [d] are [da], [dch], [do], and [dsh]; for [s] they are [sa] and [so]; for [y] they are [yk], [ych], [yt], and [ysh]. The glyph [a] will not go before any of these second glyphs either, so we cannot replace the existing first glyph with [a]. Neither addition nor replacement can transform these words into ones which begin with [a].
Thus we are left with no option but to consider that words beginning with [a] must be made by removing the first glyphs of linestart words. But which ones? It cannot be words beginning [ya], as there are only 5 such tokens among linestart words: far too few. As for words beginning [da], though there are 82 such tokens among linestart words (far more than expected), they are not wholly statistically abnormal: about 80% of non-linestart words starting [d] also have [da] as their first two glyphs. Removing [d] from some occurrences of [da] might work.
However, there are 68 linestart tokens beginning [sa], where the expected number might be only four or five. That surplus is also roughly the number of words beginning with [a] that we are missing: 68 tokens (plus the one existing) would make [a] the first glyph of 10% of linestart words, compared with 8.9% of non-linestart words, which is near enough.
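Comparisons like those made for [ya], [da], and [sa] can be sketched as a single function: count the linestart tokens with a given prefix, then work out how many there would be if linestart words carried that prefix at the non-linestart rate. The function name and the plain-string word lists are assumptions for illustration.

```python
from typing import List, Tuple


def prefix_comparison(
    prefix: str, linestart: List[str], non_linestart: List[str]
) -> Tuple[int, float]:
    """Return (observed, expected) linestart token counts for a prefix.

    'expected' scales the non-linestart rate of the prefix to the size
    of the linestart group.
    """
    observed = sum(1 for w in linestart if w.startswith(prefix))
    if not non_linestart:
        return observed, 0.0
    rate = sum(1 for w in non_linestart if w.startswith(prefix)) / len(non_linestart)
    return observed, rate * len(linestart)
```

Because this matches raw characters, it suits two-glyph prefixes such as `sa` or `da`. A bare `s` would also match words beginning [sh], so single-glyph counts would need a glyph-aware comparison instead.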
We might here have the first part of the transformation process: when a word starting [a] comes at the beginning of a line it is prefixed with [s]. That is the kind of simple and broad rule we are seeking. Whether it is true or not remains to be proven.
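To test the idea, the proposed rule can be run in reverse: strip the leading [s] from every linestart word beginning [sa] and see how the share of [a]-initial words changes. The following is a minimal sketch under that assumption; the function names are hypothetical, and it does not establish that the rule is correct.

```python
from typing import List


def undo_s_prefix(linestart_words: List[str]) -> List[str]:
    """Strip a leading [s] from linestart words beginning [sa]."""
    return [w[1:] if w.startswith("sa") else w for w in linestart_words]


def share_beginning(words: List[str], glyph: str) -> float:
    """Percentage of words whose text begins with the given glyph."""
    if not words:
        return 0.0
    return 100.0 * sum(1 for w in words if w.startswith(glyph)) / len(words)
```

With the counts reported above, reversing the rule would turn 68 [sa] tokens into [a] tokens, giving (68 + 1) / 686, or about 10%. Note that the sketch strips every [sa] word, including the four or five that would be expected at the start of lines anyway, so it slightly overcorrects.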
And there, tantalizingly, I will leave the first part of this article.