The following post is something of a hybrid. I want to make a few new points but also reiterate some points I have made in the past. I am sorry if it reads a bit disjointed, but I promise something interesting lurks within.
It is well known that [i] and [e] occur in many words and often in sequences: that is, more than one in a row. This is a characteristic the two share. None of the other characters in the script regularly occurs in this way: [e] and [i] are found in repeating sequences over four thousand times each while [oo], the next highest, is found fewer than a hundred times.
One curious aspect of [e] and [i], however, is that they occur less often in words together than might otherwise be expected. Let us have some statistics to show what I mean.
I took the one thousand most common words, which account for over 26,000 tokens, or nearly 70% of the text. They include all words with five or more tokens, so we can be assured that they are not reading or writing errors. I marked them according to the presence of [e] and [i] (and also the number of syllables).
I found that [e] occurred in 39% of all words while [i] occurred in 12% of all words. Were the occurrence of [e] and [i] independent, we would expect that in a thousand words about 45 to 50 would contain both. However, only 8 words contained both [e] and [i]: [sheaiin], [chedaiin], [chedain], [chekaiin], [cheodaiin], [shedaiin], [sheodaiin], [oteodaiin].
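The independence estimate is simple arithmetic, and a minimal Python sketch makes it explicit. The function name `expected_joint_count` is my own, purely for illustration; the percentages and the observed count are the ones quoted above.

```python
def expected_joint_count(p_e, p_i, n_words):
    """Expected number of words containing both glyphs, assuming independence."""
    return p_e * p_i * n_words

# Figures quoted in the text: 39% of words contain [e], 12% contain [i].
expected = expected_joint_count(0.39, 0.12, 1000)
observed = 8  # words in the sample that actually contain both

print(f"expected about {expected:.1f} words, observed {observed}")
print(f"observed/expected ratio: {observed / expected:.2f}")
```

With these inputs the expectation comes to roughly 47 words, inside the 45 to 50 range, so the observed count is less than a fifth of what independence would predict.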
It should be immediately apparent that in seven of these cases the [e] immediately follows a [ch, sh] as part of a syllable string which commonly occurs at the beginning of words. In my discussion of high level word structure I mentioned that these syllables were constrained in what they could contain, compared with the syllables which followed them and could contain anything. I also mentioned that they became even more constrained in three syllable words.
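The count of seven can be checked mechanically. This is only a sketch: it assumes the words are handled as plain transliterated strings exactly as written above, so that an initial [ch] or [sh] followed by [e] is simply the prefix "che" or "she". The variable names are mine.

```python
# The eight common words that contain both [e] and [i], as listed above.
both = [
    "sheaiin", "chedaiin", "chedain", "chekaiin",
    "cheodaiin", "shedaiin", "sheodaiin", "oteodaiin",
]

# Sanity check: every listed word really contains both glyphs.
assert all("e" in w and "i" in w for w in both)

# Words in which [e] immediately follows an initial [ch] or [sh].
initial_che_she = [w for w in both if w.startswith(("che", "she"))]
exceptions = [w for w in both if w not in initial_che_she]

print(len(initial_che_she), "of", len(both), "begin with [che] or [she]")
print("exceptions:", exceptions)
```

The single exception is [oteodaiin], which comes up again below.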
It turns out that the number of syllables appears to be relevant to the occurrence of [e] and [i].
In the thousand most common words the distribution is: one syllable 32%, two syllables 54%, three syllables 12%, with the balance made up of words which are unclassifiable. The distribution of words with [e] is broadly similar, being a little higher in three syllable words and a little lower in one syllable words. But the distribution of [i] is very different: [oteodaiin], given above as one of the eight words which contain both [e] and [i], is the only three syllable word in the sample which contains [i].
So here we have an interesting conjunction: 1) words containing both [e] and [i] are less common than they should be, 2) most of those which do occur begin with [ch, sh], which is much less common in three syllable words, and 3) [i] itself almost never occurs in three syllable words. (In case you are wondering if the non-occurrence of [i] in three syllable words might explain why there are so few words with both [e] and [i], we should still expect about 25–30 two syllable words with both. There are only 6.)
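The two syllable figure can be roughly reproduced in the same way. A caveat: the rates of [e] and [i] among two syllable words specifically are not stated here, so this sketch assumes the overall rates (39% and 12%) as stand-ins. The function name `expected_both` is hypothetical.

```python
def expected_both(n_words, p_e, p_i):
    """Expected count of words with both glyphs if the two were independent."""
    return n_words * p_e * p_i

# 54% of the thousand-word sample is two syllable words.
n_two_syllable = round(0.54 * 1000)

# ASSUMPTION: overall rates are used in place of the unstated
# two syllable rates for [e] and [i].
estimate = expected_both(n_two_syllable, 0.39, 0.12)
observed = 6  # two syllable words that actually contain both

print(f"expected about {estimate:.1f}, observed {observed}")
```

Under that assumption the estimate lands at about 25, the bottom of the 25–30 range. Because [i] is all but absent from three syllable words, its rate among shorter words must be somewhat above 12%, which would push the estimate upward rather than down.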
The answer is that the appearances of [e] and [i] are somehow linked and are not independent events. This is not to say that they are variants but that they are often mutually exclusive.
As the number of syllables also works to exclude [i] we might wonder if this is related. Because [i] mostly occurs at the end of a word, and the number of syllables can affect certain linguistic characteristics of a word (such as prosody), [i] could be a marker for those characteristics. Were [e] the marker for a similar or related prosodic process, or a sound thus affected, the occurrence of both in a word would then naturally be less likely.
One last thing to mention is this: of the one thousand most common words 12% contain [i]; of one thousand words with a single token the same figure is over 19%. That is, common words are less likely to contain [i] than rare words. Conversely, [e] occurs less in rare words than in common words. Words with both [e] and [i] are still considerably less common than would be expected otherwise.