The following hypothesis occurred to me while I was
investigating a cipher theory proposed by Rich Santa
Coloma. (This is not a new idea amongst Voynich researchers, but it was new to
me!)
The VMs "words" are codes for plaintext character groups, probably
trigraphs, digraphs and single characters.
How does one use this system?
1) Take each word in the plaintext
2) Break it up into a sequence of one or more trigraphs, digraphs and single
characters by referring to a code table
3) Write the code for each, separated by a space, and terminate the last
tri/di-graph/character code with a VMs "9".
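The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the code table holds just the three entries from the "quidem" example given in this post, and the greedy longest-match split is my own simplification (the hypothesis leaves the choice of split to the scribe). The names CODE_TABLE, split_word and encode_word are hypothetical.

```python
# Hypothetical code table: only these three entries appear in this post;
# a real table would need codes for many more character groups.
CODE_TABLE = {"qui": "h1cok", "de": "2oe", "m": "1c"}

def split_word(word, table, max_len=3):
    """Split a plaintext word into trigraphs, digraphs and single characters.

    Assumption: greedy longest match. A scribe could choose other splits.
    """
    groups = []
    i = 0
    while i < len(word):
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in table:
                groups.append(chunk)
                i += size
                break
        else:
            raise ValueError(f"no code for {word[i:]!r}")
    return groups

def encode_word(word, table):
    """Write the code for each group and end the last one with a "9"."""
    codes = [table[group] for group in split_word(word, table)]
    codes[-1] += "9"  # the terminal "9" marks the end of the plaintext word
    return " ".join(codes)

encoded = encode_word("quidem", CODE_TABLE)
```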
The labels are probably treated differently: there may well be a separate set of
codes just for the labels.
As an example, take the following "sentence" of 33 "words" from the Herbal
folios:
h1cok 2oe 1c9 4ohom 2oy
4ok1coe 1oyoy 2o82c9 4okd9
4okcc9 8am 4okC9
Kay o1c9 1oe 1oe 4ok1c9
8am 1okd9 8ae s19
k1c9 8am 8C9
ko8 8an 4okds 3o h1cc9 sam 1oh1oe
1oy Hos
Breaking the VMs "words" at each terminal "9", this is
deciphered to be a sentence of 13 words:
h1cok 2oe 1c
4ohom 2oy 4ok1coe 1oyoy 2o82c
4okd
4okcc
8am 4okC
Kay o1c
1oe 1oe 4ok1c
8am 1okd
8ae s1
k1c
8am 8C
ko8 8an 4okds 3o h1cc
sam 1oh1oe 1oy Hos
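The grouping step can be checked mechanically. The sketch below (the function name group_words is my own) closes a word whenever a token ends in "9" and strips that terminal mark; any tokens left over at the end of the sentence form a final word.

```python
def group_words(tokens, end_mark="9"):
    """Group VMs tokens into words, closing a word at each terminal mark."""
    words, current = [], []
    for token in tokens:
        if token.endswith(end_mark):
            current.append(token[:-len(end_mark)])  # drop the terminal mark
            words.append(current)
            current = []
        else:
            current.append(token)
    if current:  # trailing tokens with no terminal mark
        words.append(current)
    return words

SENTENCE = (
    "h1cok 2oe 1c9 4ohom 2oy 4ok1coe 1oyoy 2o82c9 4okd9 "
    "4okcc9 8am 4okC9 Kay o1c9 1oe 1oe 4ok1c9 "
    "8am 1okd9 8ae s19 k1c9 8am 8C9 "
    "ko8 8an 4okds 3o h1cc9 sam 1oh1oe 1oy Hos"
)
words = group_words(SENTENCE.split())
```

Run on the sentence above, this gives 13 words from the 33 tokens.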
Each of these words is built of one or more codes. E.g. the first word in the
list above is "h1cok 2oe 1c" and may be deciphered as
h1cok = "qui"
2oe = "de"
1c = "m"
to make the Latin word "quidem".
An interesting feature of this cipher/code is that you
may have several choices of how to split each plaintext
word into tri/di/mono-graphs, but without ambiguity
for the decipherer. This may be an explanation for the
different frequency distributions between the VMs folios and Currier hands: they
were written by different scribes who tended to split the
plaintext words differently.
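To illustrate the point about multiple splits, here is a small sketch that lists every way of cutting a word into units that have a code. The set UNITS is hypothetical (only "qui", "de" and "m" come from the example above); whichever split the scribe picks, joining the units back together yields the same plaintext.

```python
def all_splits(word, units, max_len=3):
    """Return every split of word into units of at most max_len characters."""
    if not word:
        return [[]]
    results = []
    for size in range(1, min(max_len, len(word)) + 1):
        head = word[:size]
        if head in units:
            for rest in all_splits(word[size:], units, max_len):
                results.append([head] + rest)
    return results

# Hypothetical set of character groups that have a code in the table.
UNITS = {"qui", "qu", "de", "em", "i", "d", "m"}
splits = all_splits("quidem", UNITS)
# Every split reads back as the same word, so the decipherer is never in doubt.
unambiguous = all("".join(split) == "quidem" for split in splits)
```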
We first take a substantial body of text from the VMs, e.g. the Recipes folios, and feed it through an application that extracts all the VMs words and groups them according to the procedure described above, using one or more arbitrary characters as word-ending marks (typically VMs "9"). Each sentence so derived is then analysed: each of its tokens is broken down into n-grams, and the n-gram frequencies are tallied.
At the end of the processing, the n-grams are sorted into frequency order: the most frequent n-grams appear first in the list.
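A minimal sketch of this tallying stage follows. It assumes that an "n-gram" means any contiguous run of 1 to max_n characters within a token; that definition, the default max_n of 3 and the function names are assumptions made for illustration, not a description of the actual application.

```python
from collections import Counter

def ngrams(token, max_n=3):
    """Yield every contiguous substring of length 1..max_n (an assumption)."""
    for n in range(1, max_n + 1):
        for i in range(len(token) - n + 1):
            yield token[i:i + n]

def tally(words, max_n=3):
    """Count n-grams over all tokens of all words."""
    counts = Counter()
    for word in words:
        for token in word:
            counts.update(ngrams(token, max_n))
    return counts

# Two words taken from the example sentence earlier in this post.
sample = [["8am", "4okC"], ["8am", "1okd"]]
counts = tally(sample)
# most_common() sorts by descending frequency: most frequent n-grams first.
ranked = [gram for gram, _ in counts.most_common()]
```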
At this point the application moves to its second stage. It ingests a large list of Latin phrases, generated by Knox (thanks, Knox!), and processes each word in each unique phrase for n-gram content, thereby extracting the n-gram frequencies for Latin. The phrases are placed in a sorted list, shortest first. The n-grams are sorted by frequency, most frequent first.
Here are the Latin phrase sizes used:
A total of 53834 different phrases of size >= 2
Size Count
2 4405
3 28152
4 8524
5 3866
6 2227
7 1507
8 1085
9 813
10 633
11 513
12 424
13 356
14 300
15 252
16 209
17 177
18 150
19 130
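A breakdown like the one above could be produced along the following lines. This is a sketch under assumptions: I take "size" to mean the number of words in a phrase, treat phrases as duplicates when they differ only in case or spacing, and use a handful of made-up sample phrases rather than Knox's list.

```python
from collections import Counter

def prepare_phrases(phrases, min_size=2):
    """Deduplicate phrases, sort them shortest first and count them by size."""
    # Assumption: case and repeated spaces do not distinguish phrases.
    unique = {" ".join(p.lower().split()) for p in phrases}
    kept = [p for p in unique if len(p.split()) >= min_size]
    kept.sort(key=lambda p: (len(p.split()), p))  # shortest first
    sizes = Counter(len(p.split()) for p in kept)
    return kept, sizes

# Made-up sample input, for illustration only.
raw = ["et in", "Et  in", "in nomine patris", "amen", "ad hoc"]
phrases, sizes = prepare_phrases(raw)
```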
The third stage of the application is to generate a set of Genetic Algorithm chromosomes. Each chromosome takes the Top N n-grams from the Voynich n-gram list and pairs them with a random selection of the n-grams from the Latin list.
For example, for a Chromosome of length 15 (in fact the GA uses much longer lengths, typically 200) the following table might be used:
V: am ay ae 1c8 4ohC oe 1c 4oham 8am 4ohan oham okam oy 1c7 e
L: ed gi n de et ae p s du tu nd d tio rum te
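Building such a table could look like the sketch below. It assumes the Latin n-grams are drawn without repetition (as in the example table, where all 15 are distinct) and represents a chromosome as a plain dictionary; both choices, and the function name, are mine for illustration.

```python
import random

# The two rows of the example table above.
VOYNICH_TOP = "am ay ae 1c8 4ohC oe 1c 4oham 8am 4ohan oham okam oy 1c7 e".split()
LATIN_NGRAMS = "ed gi n de et ae p s du tu nd d tio rum te".split()

def random_chromosome(voynich_ranked, latin_ngrams, top_n, rng):
    """Pair the top_n Voynich n-grams with randomly chosen Latin n-grams.

    Assumption: sampling without replacement, so latin_ngrams must hold
    at least top_n entries.
    """
    keys = voynich_ranked[:top_n]
    values = rng.sample(latin_ngrams, len(keys))
    return dict(zip(keys, values))

chromosome = random_chromosome(VOYNICH_TOP, LATIN_NGRAMS, 15, random.Random(1))
```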
The chromosomes are "scored" by having them translate/decipher a training set of sentences from the input VMs folios. To calculate the score of each chromosome for each sentence, the sentence word tokens are converted to Latin n-grams using the chromosome's table. Then the tokens are joined together to form the plaintext words. The plaintext words are looked up in the Latin dictionary: the chromosome's score is increased for valid words, and decreased for invalid words. Once all the words in the sentence have been deciphered in this way, it is compared with each of the Latin phrases: if a Latin phrase appears in the sentence, the score of the chromosome is increased substantially.
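The scoring of one sentence can be sketched as follows. The weights (+1 for a valid word, -1 for an invalid one, +10 per phrase found) are placeholders I have assumed, as is the use of "?" for tokens missing from the table; only the h1cok/2oe/1c entries come from this post, and the "8am" mapping, dictionary and phrase list are made up for the demonstration.

```python
def decipher(sentence, table):
    """Turn each word (a list of VMs tokens) into a plaintext string."""
    return ["".join(table.get(token, "?") for token in word) for word in sentence]

def score_sentence(sentence, table, dictionary, phrases,
                   hit=1.0, miss=-1.0, phrase_bonus=10.0):
    """Reward dictionary words, penalise the rest, add a bonus per phrase."""
    words = decipher(sentence, table)
    score = sum(hit if word in dictionary else miss for word in words)
    text = " " + " ".join(words) + " "  # pad so phrases match whole words only
    score += sum(phrase_bonus for phrase in phrases if f" {phrase} " in text)
    return score

# Demonstration data: "8am" -> "et" is a made-up mapping.
table = {"h1cok": "qui", "2oe": "de", "1c": "m", "8am": "et"}
sentence = [["h1cok", "2oe", "1c"], ["8am"], ["k1c"]]
score = score_sentence(sentence, table, {"quidem", "et"}, ["quidem et"])
```

Here two words are valid, one is not, and one phrase is found, so the score is 1 + 1 - 1 + 10 = 11.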
The best chromosome found by a Monte Carlo method (basically generating random chromosomes, and retaining the best scoring chromosome) is placed at the top of a list, and then the remaining chromosomes needed for the Genetic Algorithm are generated.
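A sketch of this seeding step is given below. The chromosome factory and scoring function are passed in as parameters; the toy versions used in the demonstration (three keys, a made-up "correct" table) exist only to make the sketch runnable and say nothing about the real mapping.

```python
import random

def monte_carlo_best(make_chromosome, score, trials, rng):
    """Generate random chromosomes and keep the best-scoring one."""
    best, best_score = None, float("-inf")
    for _ in range(trials):
        candidate = make_chromosome(rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

def seed_population(make_chromosome, score, trials, size, rng):
    """Put the Monte Carlo winner first, then fill up with random chromosomes."""
    best, _ = monte_carlo_best(make_chromosome, score, trials, rng)
    return [best] + [make_chromosome(rng) for _ in range(size - 1)]

# Toy demonstration with a made-up target table.
KEYS = ["am", "ay", "ae"]
LATIN = ["ed", "gi", "n"]
TARGET = dict(zip(KEYS, LATIN))

def make(rng):
    return dict(zip(KEYS, rng.sample(LATIN, len(KEYS))))

def toy_score(chromosome):
    return sum(chromosome[key] == TARGET[key] for key in KEYS)

population = seed_population(make, toy_score, trials=200, size=10,
                             rng=random.Random(0))
```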
The GA phase now begins: the chromosomes are genetically altered, mated and selected to optimise the best chromosome's score on the training sentences. This phase is compute-intensive.
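One epoch of such a GA might look like the sketch below. The details are assumptions on my part, not the application's actual operators: single-point crossover, mutation that replaces one randomly chosen entry, selection from the better half of the population, and keeping the best chromosome unchanged. The toy scoring function is made up for the demonstration.

```python
import random

def mutate(chromosome, latin_ngrams, rng):
    """Replace the Latin n-gram of one randomly chosen entry."""
    child = dict(chromosome)
    key = rng.choice(list(child))
    child[key] = rng.choice(latin_ngrams)
    return child

def mate(a, b, rng):
    """Single-point crossover; both parents must share the same keys."""
    keys = list(a)
    cut = rng.randrange(1, len(keys))
    return {k: (a[k] if i < cut else b[k]) for i, k in enumerate(keys)}

def next_epoch(population, score, latin_ngrams, rng, mutation_rate=0.2):
    """Produce the next generation, keeping the best chromosome unchanged."""
    ranked = sorted(population, key=score, reverse=True)
    survivors = ranked[:max(2, len(ranked) // 2)]
    children = [ranked[0]]  # the best chromosome always survives
    while len(children) < len(population):
        a, b = rng.sample(survivors, 2)
        child = mate(a, b, rng)
        if rng.random() < mutation_rate:
            child = mutate(child, latin_ngrams, rng)
        children.append(child)
    return children

# Toy demonstration: the score counts entries mapped to "ed" (made up).
KEYS = ["am", "ay", "ae", "oe", "oy"]
LATIN = ["ed", "gi", "n"]
rng = random.Random(0)
population = [{k: rng.choice(LATIN) for k in KEYS} for _ in range(10)]

def toy_score(chromosome):
    return sum(value == "ed" for value in chromosome.values())

best_before = max(map(toy_score, population))
new_population = next_epoch(population, toy_score, LATIN, rng)
```

Because the best chromosome is carried over, the best score can never drop from one epoch to the next.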
Periodically, the GA will report on its progress:
Epoch 311 Cost/Ave 62.845588235294116/61.22993872549012 same 1 Mutated 21.608040201005025% New 1 MS 15
62.845588235294116 GAPhrases$Chromosome@41ec5a Good=128 / 408 = 31.37255% 40 phrases in 25 sentences
S: am ay ae 1c8 4ohC oe 1c 4oham 8am 4ohan oham okam oy 1c7 e
R: ed gi n de et ae p s du tu nd d tio rum te
Sentence 189
S: 2o ok1c - 1coe hc1 - 1Kc - ohan ae e hC - 4ohan 1cH - 1c7ay ap e2c - 2c7ae ohcay e hc8 - 1coehC - ehc - ohC - 4ohC - 4ohc - 4ohan ap -
T: endve la' binteua tunti nis te' pi et' in'* tunis
In this report, the GA has been running for 311 "epochs" (each epoch is a new generation of chromosomes). The cost (score) of the best chromosome is 62.8, whereas the average score of all the chromosomes in the population is 61.2. In this epoch, there has been no change to the best chromosome since the last epoch ("same 1"), about 21.6% of the chromosomes have been mutated, and a fresh chromosome ("New 1") has been inserted (to ensure diversity: this is not usually done in GAs, but I find it produces more reliable training). "MS 15" means that the maximum number of no-change epochs seen so far has been 15; the larger this number is, the more stagnant the chromosome pool is, and the nearer to a solution we are.
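For anyone wanting to chart the progress of a run, the first line of the report can be pulled apart with a regular expression. The pattern below is written against the single sample line shown in this post, so treat the layout as an assumption; the field names are my own.

```python
import re

REPORT = re.compile(
    r"Epoch (?P<epoch>\d+) Cost/Ave (?P<best>-?[\d.]+)/(?P<average>-?[\d.]+) "
    r"same (?P<same>\d+) Mutated (?P<mutated>[\d.]+)% "
    r"New (?P<new>\d+) MS (?P<ms>\d+)"
)

line = ("Epoch 311 Cost/Ave 62.845588235294116/61.22993872549012 "
        "same 1 Mutated 21.608040201005025% New 1 MS 15")

match = REPORT.match(line)
# Convert every captured field to a number for plotting or logging.
fields = {name: float(value) for name, value in match.groupdict().items()}
```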
The following line shows in detail how the best chromosome has scored: its table produces 128 valid Latin words from a total of 408 translations, i.e. about 31%. In the 25 sentences being used in training, 40 common Latin phrases have been found.
The next two lines show the first 15 n-grams in the mapping that the chromosome is using.
Then the status report shows how the chromosome fared on translating a sentence picked at random from the VMs folios. Since the GA is being trained only on the first few sentences, the remainder are essentially "unseen", and so a valid, sensible translation in a non-trained sentence is significant.
The sentence picked is number 189 (the training set is the first 25 sentences in this run, so number 189 is well outside that). The VMs source sentence is shown with hyphens "-" separating the tokens that make up words. E.g. "2o ok1c" is the first word. Beneath is the Latin translation. A Latin word followed by a single quote means that that word appears in the Latin dictionary, and is thus valid. A star appearing after a set of valid Latin words indicates that the Latin phrase made up by the words is common, or at least appears in Knox's list.
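The quote and star markers can be reproduced with a short routine like the one below. The dictionary and phrase sets used here are simply read off the markers in the sample report (they are not the real dictionary or Knox's list), and the way the star is attached to the last word of a matching run is my reading of the output, so treat it as an assumption.

```python
def annotate(words, dictionary, phrases):
    """Mark dictionary words with ' and the end of a listed phrase with *."""
    valid = [word in dictionary for word in words]
    marked = [word + "'" if ok else word for word, ok in zip(words, valid)]
    for start in range(len(words)):
        for end in range(start + 2, len(words) + 1):
            if not all(valid[start:end]):
                break  # a phrase must consist of valid words only
            if " ".join(words[start:end]) in phrases:
                marked[end - 1] = words[end - 1] + "'*"
    return " ".join(marked)

# Translation words from the sample report, without their markers.
words = "endve la binteua tunti nis te pi et in tunis".split()
dictionary = {"la", "te", "et", "in"}  # words quoted in the report
phrases = {"et in"}                    # the starred phrase in the report
rendered = annotate(words, dictionary, phrases)
```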
To be completed (simulation runs currently in progress)