I grouped all the folios from f1v to f20v inclusive and labeled the group "Herbal folios"; folios f103r to f116r inclusive are labeled "Recipe folios". I ran each group through a program that extracts all the prefixes, suffixes and stems, validates each, and orders them by frequency. (The method used was described in an earlier email to the list.)
My first question was: are the word frequencies and prefix/stem/suffix (PSS) frequencies similar between the Herbal and Recipe collections? Here are the results. I'll show only the suffix frequencies, because they are the most interesting.
Herbal: 1331 different words, top 10 words: "8am 1oe 1oy K9 89 19 s 8ay 2oe oy"
Recipe: 1443 different words, top 10 words: "am ay 1c89 oe 4ohC9 8am oy 4oham 1c9 2c89"
Top 10 Herbal Suffixes (Frequency)
9 0.105580695
89 0.065862246
y 0.06435395
e 0.058320764
am 0.04524887
m 0.03167421
s 0.027652087
19 0.025641026
8 0.023629965
oy 0.02212167
Top 10 Recipe Suffixes
9 0.11764706
e 0.05882353
89 0.05020284
y 0.04817444
am 0.036511157
8 0.029411765
ay 0.028904665
ae 0.024340771
oy 0.023326572
oe 0.021805273
Note: the two sets are similar (7 of 10 in common), with suffix "9" being approximately a factor of two more common than the next most common suffix. I'm not sure what conclusions can be drawn, if any, from this.
For fun, I applied the same analysis to a similar number of words from Augustinus Latin. Here are the results, together with the VMs data:
(Augustinus: 1257 different words, top 10 words: "et te in non me mihi est domine ut enim")
Top 10 Latin Suffixes
m 0.118421055
s 0.10526316
e 0.047368422
que 0.039473683
i 0.034210525
o 0.028947368
t 0.028947368
us 0.02631579
rum 0.021052632
a 0.021052632
So, Latin does not have the same frequency pattern at all. So, is there a language which does have a similar pattern?
I looked at French from 1367, Spanish from 1527, German from 1553, and old English (Courtier):
Top 10 French Suffixes
s 0.1199
t 0.0736
z 0.0708
e 0.0654
nt 0.0463
es 0.0436
l 0.0245
r 0.0218
re 0.0191
er 0.0191
tre 0.0163
Top 10 Spanish Suffixes
s 0.1874
n 0.0519
o 0.0464
a 0.0445
r 0.0297
do 0.0297
es 0.0241
l 0.0223
e 0.0223
va 0.0204
to 0.0148
Top 10 German Suffixes
en 0.1171
t 0.1171
s 0.1122
n 0.0537
er 0.0390
ten 0.0341
d 0.0341
e 0.0293
m 0.0293
ts 0.0244
r 0.0244
Top 10 English Suffixes
e 0.1404
n 0.0449
s 0.0421
t 0.0393
re 0.0337
y 0.0281
ne 0.0253
l 0.0253
r 0.0253
ll 0.0253
ed 0.0225
The Spanish suffix "s" is three times more frequent than the next suffix: not a good match to the VMs. Similarly for the English "e". The German suffix pattern is completely different to the VMs. The French pattern looks similar to the VMs. Let's look at the French Stems, and compare with the VMs:
Top 10 Herbal Stems
o 0.15171504
9 0.058377307
8 0.045184698
k 0.04287599
1o 0.040567283
oe 0.036609497
o8 0.028364116
oy 0.026385223
y 0.02176781
2 0.02176781
Top 10 French Stems
a 0.0704
d 0.0544
es 0.0528
en 0.0448
le 0.0432
se 0.032
ent 0.0304
de 0.0272
ce 0.0272
ne 0.0256
A poor match.
Conclusion: the "9" suffix in the VMs appears too frequently for it to come from Latin, German, English or Spanish. Although French has a similarly frequent suffix "s", the stem frequencies of French don't match the VMs.
Hypothesis: the "9" suffix in the VMs is not a word suffix, but punctuation or some other annotation. Perhaps a key mark for deciphering purposes.
Next step: re-analyse the PSS frequencies in the VMs after removing suffix "9" from words where it appears.
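The "7 of 10" overlap and the "factor of two" remark about suffix "9" can be checked directly from the Herbal and Recipe tables. The following is only a minimal sketch: the function names `shared_suffixes` and `lead_ratio` are hypothetical helpers, not part of the original analysis program, and the numbers are copied from the tables in this message.

```python
# Top-10 suffix tables copied from the message (suffix, normalised frequency).
HERBAL = [("9", 0.105580695), ("89", 0.065862246), ("y", 0.06435395),
          ("e", 0.058320764), ("am", 0.04524887), ("m", 0.03167421),
          ("s", 0.027652087), ("19", 0.025641026), ("8", 0.023629965),
          ("oy", 0.02212167)]
RECIPE = [("9", 0.11764706), ("e", 0.05882353), ("89", 0.05020284),
          ("y", 0.04817444), ("am", 0.036511157), ("8", 0.029411765),
          ("ay", 0.028904665), ("ae", 0.024340771), ("oy", 0.023326572),
          ("oe", 0.021805273)]


def shared_suffixes(table_a, table_b):
    """Return the suffixes that appear in both top-N tables, sorted."""
    return sorted({s for s, _ in table_a} & {s for s, _ in table_b})


def lead_ratio(table):
    """Ratio of the most frequent entry to the second most frequent one."""
    ranked = sorted(table, key=lambda item: item[1], reverse=True)
    return ranked[0][1] / ranked[1][1]


if __name__ == "__main__":
    print(shared_suffixes(HERBAL, RECIPE))
    print(lead_ratio(HERBAL), lead_ratio(RECIPE))
```

On these figures the ratio is close to 2 for the Recipe folios and somewhat lower for the Herbal folios, so "approximately a factor of two" is a rough description rather than an exact one.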
Astrological: folios 66v to 73v inclusive
Biological: folios 75r to 85r inclusive
I agree that one might expect the results to be different for
the astro section since we are presumably looking at labels and
names. So it would be like comparing a dictionary result to
a prose result. (I will make a version of the analysis code that
just looks at unique words in the text.)
Anyway, here are the results:
Herbal: 1331 different words, top 10 words: "8am 1oe 1oy K9 89 19 s 8ay 2oe oy"
Recipe: 1443 different words, top 10 words: "am ay 1c89 oe 4ohC9 8am oy 4oham 1c9 2c89"
Astrological: 1771 different words, top 10 words: "ay am ae 8am s 8ay 8ae 89 okcos ohC9"
Biological: 2135 different words, top 10 words: "oe 4ohan 1c89 2c89 4ohc89 4oe 4ohae 1c9 4oham"
Top 10 Herbal Suffixes (Frequency)
9 0.105580695
89 0.065862246
y 0.06435395
e 0.058320764
am 0.04524887
m 0.03167421
s 0.027652087
19 0.025641026
8 0.023629965
oy 0.02212167
Top 10 Recipe Suffixes
9 0.11764706
e 0.05882353
89 0.05020284
y 0.04817444
am 0.036511157
8 0.029411765
ay 0.028904665
ae 0.024340771
oy 0.023326572
oe 0.021805273
Top 10 Astrological Suffixes
9 0.120173536
89 0.055531453
am 0.046420824
ay 0.04381779
s 0.04295011
ae 0.04251627
e 0.040347073
79 0.026898047
y 0.022993492
oe 0.022125814
Top 10 Biological Suffixes
9 0.11961975
89 0.049643517
e 0.038288884
oe 0.031687353
y 0.030102983
c89 0.029838923
ae 0.0293108
c9 0.0293108
oy 0.02719831
ay 0.02508582
The suffix frequency results for the different folio groups look
reassuringly similar to me: the differences are what you would
see if you compared two modestly sized texts in, say, English.
Indeed, one can tentatively conclude that the language is the
same in all four of the VMs sections.
On the other hand, the top 10 word lists are quite different. Curious.
Regarding word stems: the definition of a word stem for this study is
"any group of characters that spells a valid word by itself, and is
also found following one or more other characters (a prefix) and/or followed
by one or more other characters (a suffix)."
So, single VMs characters can be stems. After all, it may be that
a single VMs character equates to multiple plaintext characters, so
we have to have the flexibility to assign single characters as stems.
To clarify, take for example the VMs word "8am". The candidate stems are
"8am", "8a", "am", "8", "a" and "m". Those candidates that appear as single words
in the VMs dictionary are classed as valid stems (in this case, I believe
all six are valid stems).
Once we have a list of all the valid stems in the text, we can count how
often each appears, and then order that list. This is what is done to
obtain the lists above.
Because this method is fully general, we avoid any assumptions about how
many characters a single VMs character maps to.
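The "8am" example can be sketched in a few lines. This is an illustration under assumptions, not the original program: `candidate_stems` and `valid_stems` are hypothetical names, and the validity test is simplified to "the candidate appears as a word in the vocabulary", as described above.

```python
def candidate_stems(word):
    """List every contiguous character group of the word, longest first."""
    candidates = []
    for length in range(len(word), 0, -1):
        for start in range(len(word) - length + 1):
            piece = word[start:start + length]
            if piece not in candidates:
                candidates.append(piece)
    return candidates


def valid_stems(word, vocabulary):
    """Keep only the candidates that also occur as words in their own right."""
    return [piece for piece in candidate_stems(word) if piece in vocabulary]


if __name__ == "__main__":
    # The tiny vocabulary here is made up purely for the demonstration.
    print(candidate_stems("8am"))
    print(valid_stems("8am", {"8am", "am", "8"}))
```

Counting how often each valid stem turns up across all words, and sorting by that count, would then give an ordered list of the kind shown in the tables.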
I changed the algorithm so that it only accumulated prefix/stem/suffixes
for unique words in the VMs (as opposed to accumulating them for all
words). I think this is more sensible, otherwise a very popular word
ended up skewing the statistics.
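To illustrate why counting over unique words matters, here is a minimal sketch. It assumes a simplified rule (a suffix is whatever follows a leading part that is itself a word in the vocabulary); the real validation step may differ, and `suffix_frequencies` is a hypothetical name.

```python
from collections import Counter


def suffix_frequencies(words, unique_only=True):
    """Normalised suffix frequencies over unique words or over all tokens."""
    vocabulary = set(words)
    source = vocabulary if unique_only else words
    counts = Counter()
    for word in source:
        for cut in range(1, len(word)):
            stem, suffix = word[:cut], word[cut:]
            # Assumed rule: accept the split only if the stem is a known word.
            if stem in vocabulary:
                counts[suffix] += 1
    total = sum(counts.values())
    return {suffix: count / total for suffix, count in counts.items()}


if __name__ == "__main__":
    # Made-up token stream in which "8am" is a very popular word.
    tokens = ["8", "o", "8am", "8am", "8am", "oe"]
    print(suffix_frequencies(tokens, unique_only=True))
    print(suffix_frequencies(tokens, unique_only=False))
```

In the toy input the popular word "8am" pushes the suffix "am" to three quarters of the total when every token is counted, but only to one half when each distinct word is counted once.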
After doing this, the results for suffixes look similar between
Latin and VMs (Recipes) - using 3800 words:
Top 20 Latin Suffixes (from a Latin dictionary)
s 0.08350305
o 0.042769857
t 0.03971487
m 0.034623217
is 0.029531568
e 0.02749491
us 0.026476579
a 0.022403259
es 0.020366598
rum 0.01934827
um 0.018329939
tum 0.017311608
mus 0.017311608
to 0.017311608
i 0.01629328
tus 0.01629328
tis 0.015274949
c 0.014256619
em 0.013238289
am 0.013238289
Top 20 Herbal Suffixes
9 0.094210714
89 0.045487236
e 0.040273283
ay 0.036857247
y 0.036857247
am 0.03613808
ae 0.029126214
an 0.028047465
oe 0.024631428
79 0.023552679
oy 0.023013305
8 0.023013305
o 0.020316433
ap 0.019417476
c89 0.018878102
c9 0.017979145
s 0.017799353
m 0.015462064
o89 0.014383315
19 0.01366415
This suggests the following (partial) cipher:
VMs Latin
=== =====
9 s
8 i
7 u
e m
a r
o a
y um
m is
1 t
4 qu
c e
g f
k c
2 d
s p
h n
3 h
Top 20 VMs words translated
am -> ris
ay -> rum
ae -> rm
1c89 -> teis
4ohC9 -> quan?s
1c9 -> tes
oe -> am
4oham -> quanris
8am -> iris
4ohan -> quanr?
oham -> anris
okam -> acris
oy -> aum
an -> r?
ohan -> anr?
e -> m
2c89 -> deis
1c79 -> teus
ohC9 -> an?s
okay -> acrum
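Applying the partial table is a simple glyph-by-glyph substitution. The sketch below copies the mapping from the table above; the name `translate` and the choice of "?" for glyphs that have no entry (such as "n" and "C") are assumptions made for illustration.

```python
# Partial VMs -> Latin table, copied from the message.
CIPHER = {
    "9": "s", "8": "i", "7": "u", "e": "m", "a": "r", "o": "a",
    "y": "um", "m": "is", "1": "t", "4": "qu", "c": "e", "g": "f",
    "k": "c", "2": "d", "s": "p", "h": "n", "3": "h",
}


def translate(word, table=CIPHER, unknown="?"):
    """Substitute each VMs glyph; glyphs missing from the table become '?'."""
    return "".join(table.get(glyph, unknown) for glyph in word)


if __name__ == "__main__":
    for vms_word in ["am", "ay", "1c89", "4ohC9", "8am", "okay"]:
        print(vms_word, "->", translate(vms_word))
```

Because the table is case-sensitive, "c" and "C" are treated as different glyphs, which is why "4ohC9" still contains an unknown.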
In this analysis, the software looks in the text for all n-grams that appear at least twice as a prefix or as a suffix, or at least once as a stem, and calculates their (normalised) frequencies. I'm not sure what to make of the results!
For N=3, looking at the Herbal folios f1v-f20v inclusive, 1331 different words.
Confirmed valid prefix/stem/suffix counts 99 252 111
Prefix/Stem/Suffix frequency, normalised
Prefix              Stem                Suffix
4ok 0.1010101       o89 0.05952381      o89 0.09009009
4oh 0.07070707      1oe 0.055555556     8am 0.09009009
1oe 0.060606062     4ok 0.055555556     1c9 0.054054055
1oh 0.04040404      8am 0.04761905      1oy 0.054054055
ok1 0.04040404      4oh 0.04761905      1oe 0.045045044
8oe 0.030303031     1oy 0.03968254      coe 0.036036037
1oy 0.030303031     1c9 0.031746034     cc9 0.027027028
1co 0.030303031     1co 0.023809524     e89 0.027027028
1ok 0.030303031     8oe 0.023809524     ham 0.027027028
4oj 0.030303031     coe 0.01984127      2c9 0.027027028
For N=3, processing the same number of different words from Thomas Hardy (English)
Confirmed valid prefix/stem/suffix counts 87 160 67
Prefix/Stem/Suffix frequency, normalised
Prefix              Stem                Suffix
com 0.04597701      ely 0.025           ing 0.07462686
par 0.022988506     ted 0.025           led 0.04477612
rea 0.022988506     led 0.025           sed 0.04477612
mot 0.022988506     sed 0.025           ely 0.04477612
pla 0.022988506     ght 0.025           ted 0.029850746
see 0.022988506     ing 0.01875         ter 0.029850746
pas 0.022988506     ked 0.01875         son 0.029850746
wai 0.022988506     per 0.01875         ned 0.029850746
can 0.022988506     com 0.01875         ner 0.029850746
smi 0.022988506     par 0.01875         mon 0.029850746
For N=3, same number of words from Augustinus (Latin)
Confirmed valid prefix/stem/suffix counts 102 197 83
Prefix/Stem/Suffix frequency, normalised
Prefix              Stem                Suffix
qua 0.039215688     ere 0.05076142      ere 0.04819277
fac 0.029411765     qua 0.035532996     iat 0.04819277
qui 0.029411765     fac 0.02538071      que 0.036144577
dic 0.029411765     ita 0.02538071      ius 0.036144577
pot 0.029411765     ius 0.02538071      ita 0.036144577
ter 0.019607844     que 0.020304568     rum 0.024096385
ali 0.019607844     dic 0.020304568     ent 0.024096385
aud 0.019607844     ini 0.020304568     ram 0.024096385
par 0.019607844     ans 0.015228426     unt 0.024096385
cor 0.019607844     ent 0.015228426     ris 0.024096385
For N=4 Voynich (statistics become poorer as N increases, of course)
Confirmed valid prefix/stem/suffix counts 6 14 6
Prefix/Stem/Suffix frequency, normalised
Prefix              Stem                Suffix
4oko 0.16666667     o8ae 0.14285715     co89 0.16666667
okam 0.16666667     okam 0.14285715     e8am 0.16666667
oh2o 0.16666667     4ok1 0.071428575    o8an 0.16666667
4okc 0.16666667     4oh1 0.071428575    e2oe 0.16666667
k2co 0.16666667     co89 0.071428575    9koy 0.16666667
4ohC 0.16666667     4oko 0.071428575    oKoy 0.16666667
4ok1 0.0            e8am 0.071428575    1o89 0.0
4oh1 0.0            oh2o 0.071428575    oe89 0.0
ok1c 0.0            o8an 0.071428575    o8ae 0.0
ohoe 0.0            4okc 0.071428575    ho89 0.0
For N=4 English
Confirmed valid prefix/stem/suffix counts 36 66 26
Prefix/Stem/Suffix frequency, normalised
Prefix              Stem                Suffix
pres 0.055555556    ined 0.045454547    sing 0.115384616
dist 0.055555556    ring 0.045454547    ined 0.115384616
weak 0.055555556    test 0.045454547    ally 0.07692308
occa 0.055555556    ment 0.030303031    ring 0.03846154
outl 0.027777778    pres 0.030303031    ence 0.03846154
prob 0.027777778    sing 0.030303031    nded 0.03846154
ment 0.027777778    weak 0.030303031    ding 0.03846154
cons 0.027777778    prob 0.030303031    ning 0.03846154
atte 0.027777778    hern 0.030303031    ness 0.03846154
stan 0.027777778    sion 0.030303031    wing 0.03846154
For N=4 Latin
Confirmed valid prefix/stem/suffix counts 63 126 57
Prefix/Stem/Suffix frequency, normalised
Prefix              Stem                Suffix
faci 0.06349207     bant 0.03968254     ntes 0.0877193
pecc 0.04761905     ntes 0.03968254     quam 0.05263158
invo 0.031746034    faci 0.031746034    endo 0.05263158
cred 0.031746034    pecc 0.031746034    ebam 0.03508772
infa 0.031746034    endo 0.023809524    erem 0.03508772
puer 0.031746034    ndis 0.023809524    iens 0.03508772
habe 0.031746034    quam 0.023809524    ones 0.03508772
form 0.031746034    quid 0.023809524    bant 0.01754386
pare 0.031746034    rati 0.023809524    abam 0.01754386
nesc 0.031746034    ibus 0.015873017    ndis 0.01754386
For N=5 Voynich (no data satisfies selection)
For N=5 English
Confirmed valid prefix/stem/suffix counts 15 29 13
Prefix/Stem/Suffix frequency, normalised
Prefix              Stem                Suffix
consi 0.13333334    ation 0.06896552    ation 0.15384616
ornam 0.13333334    consi 0.06896552    sting 0.15384616
appea 0.06666667    ornam 0.06896552    dered 0.07692308
dimen 0.06666667    sting 0.06896552    ality 0.07692308
occup 0.06666667    still 0.06896552    ingly 0.07692308
stand 0.06666667    dered 0.03448276    ental 0.07692308
conce 0.06666667    ingly 0.03448276    rning 0.07692308
sugge 0.06666667    dimen 0.03448276    ented 0.07692308
diffe 0.06666667    occup 0.03448276    rence 0.07692308
speci 0.06666667    ality 0.03448276    sions 0.07692308
For N=5 Latin
Confirmed valid prefix/stem/suffix counts 21 44 23
Prefix/Stem/Suffix frequency, normalised
Prefix              Stem                Suffix
volun 0.0952381     entes 0.06818182    entes 0.13043478
pecca 0.0952381     batur 0.045454547   batur 0.08695652
lauda 0.0952381     tibus 0.045454547   antur 0.08695652
quaer 0.0952381     invoc 0.045454547   tibus 0.08695652
metue 0.0952381     pecca 0.045454547   bamus 0.08695652
invoc 0.04761905    lauda 0.045454547   torum 0.08695652
infan 0.04761905    quaer 0.045454547   tatis 0.04347826
inven 0.04761905    volun 0.045454547   itate 0.04347826
nesci 0.04761905    metue 0.045454547   antes 0.04347826
paren 0.04761905    bamus 0.045454547   bilis 0.04347826
Here are the N=3 counts/frequency for the 1331 unique words in f1v-f20v of the Herbal:
Confirmed valid prefix/stem/suffix counts 99 252 111
Prefix/Stem/Suffix count and frequency, normalised
Prefix                Stem                  Suffix
4ok 10 0.1010101      o89 15 0.05952381     o89 10 0.09009009
4oh 7 0.07070707      1oe 14 0.055555556    8am 10 0.09009009
1oe 6 0.060606062     4ok 14 0.055555556    1c9 6 0.054054055
1oh 4 0.04040404      8am 12 0.04761905     1oy 6 0.054054055
ok1 4 0.04040404      4oh 12 0.04761905     1oe 5 0.045045044
8oe 3 0.030303031     1oy 10 0.03968254     coe 4 0.036036037
1oy 3 0.030303031     1c9 8 0.031746034     cc9 3 0.027027028
1co 3 0.030303031     1co 6 0.023809524     e89 3 0.027027028
1ok 3 0.030303031     8oe 6 0.023809524     ham 3 0.027027028
4oj 3 0.030303031     coe 5 0.01984127      2c9 3 0.027027028
(e.g. the sequence "4ok" appears 10 times at the start of a longer word (prefix))
N=3 for 1331 unique words in the Astrological Section
Confirmed valid prefix/stem/suffix counts 154 346 153
Prefix/Stem/Suffix count and frequency, normalised
Prefix                Stem                  Suffix
okc 11 0.071428575    o89 16 0.046242774    o89 13 0.08496732
ohc 8 0.051948052     okc 11 0.031791907    cos 6 0.039215688
4oh 7 0.045454547     8ae 11 0.031791907    8am 6 0.039215688
9hc 7 0.045454547     1co 10 0.028901733    8ae 6 0.039215688
oko 6 0.038961038     oko 10 0.028901733    cc9 4 0.026143791
oka 6 0.038961038     oho 9 0.02601156      coe 4 0.026143791
oho 5 0.032467533     ohc 8 0.023121387     o79 4 0.026143791
1ok 5 0.032467533     oka 8 0.023121387     oh9 4 0.026143791
oh1 5 0.032467533     4oh 8 0.023121387     c79 4 0.026143791
1co 4 0.025974026     9hc 7 0.020231213     c89 3 0.019607844
N=3 for 1331 unique words in the Biological Section
Confirmed valid prefix/stem/suffix counts 124 275 124
Prefix/Stem/Suffix count and frequency, normalised
Prefix                Stem                  Suffix
4oh 13 0.10483871     c89 26 0.094545454    c89 17 0.13709678
4ok 10 0.08064516     4oh 20 0.07272727     c79 13 0.10483871
4oe 8 0.06451613      c79 13 0.047272727    1c9 9 0.07258064
oeh 6 0.048387095     4ok 12 0.043636363    C89 7 0.05645161
oe1 5 0.04032258      1c9 11 0.04           2c9 7 0.05645161
ohc 4 0.032258064     2c9 9 0.03272727      189 4 0.032258064
soe 4 0.032258064     4oe 8 0.02909091      eoy 3 0.024193548
oe2 3 0.024193548     oeh 7 0.025454545     cc9 3 0.024193548
91c 3 0.024193548     8ae 7 0.025454545     hC9 3 0.024193548
8ay 3 0.024193548     8ay 7 0.025454545     ae9 3 0.024193548
N=3 for 1331 unique words in the Recipes Section
Confirmed valid prefix/stem/suffix counts 135 303 143
Prefix/Stem/Suffix count and frequency, normalised
Prefix                Stem                  Suffix
4oh 17 0.12592593     4oh 18 0.05940594     c89 13 0.09090909
4ok 14 0.1037037      4ok 17 0.05610561     o89 13 0.09090909
ohc 9 0.06666667      o89 16 0.052805282    189 8 0.055944055
okc 8 0.05925926      c89 15 0.04950495     c79 7 0.04895105
oeh 7 0.05185185      oeh 10 0.0330033      8am 7 0.04895105
1co 5 0.037037037     1co 10 0.0330033      8ay 6 0.04195804
g1c 4 0.02962963      ohc 9 0.02970297      coe 5 0.034965035
4oj 4 0.02962963      c79 9 0.02970297      8ae 5 0.034965035
ohC 4 0.02962963      8ae 9 0.02970297      1c9 4 0.027972028
1oe 3 0.022222223     189 9 0.02970297      cc9 4 0.027972028
I can generate the tables for N=4 and N=5 if they are of interest.
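For readers who want to reproduce tables of this shape, here is a minimal sketch of an N-gram prefix/stem/suffix count. It rests on several assumptions that the messages do not spell out: a prefix is the first N characters of a longer word, a suffix is the last N characters of a longer word, a stem is an N-gram that is itself a word and occurs inside a longer word, and each frequency is the count divided by the column total. The name `ngram_pss` and the threshold parameters are hypothetical.

```python
from collections import Counter


def ngram_pss(words, n, min_affix=2, min_stem=1):
    """Count N-gram prefixes, stems and suffixes over the unique words."""
    vocabulary = set(words)
    prefixes, stems, suffixes = Counter(), Counter(), Counter()
    for word in vocabulary:
        if len(word) <= n:
            continue  # only longer words can carry an N-gram affix
        prefixes[word[:n]] += 1
        suffixes[word[-n:]] += 1
        for start in range(len(word) - n + 1):
            gram = word[start:start + n]
            if gram in vocabulary:  # assumed stem rule: a word in its own right
                stems[gram] += 1

    def normalise(counter, minimum):
        kept = {g: c for g, c in counter.items() if c >= minimum}
        total = sum(kept.values())
        return {g: (c, c / total) for g, c in kept.items()}

    return (normalise(prefixes, min_affix),
            normalise(stems, min_stem),
            normalise(suffixes, min_affix))


if __name__ == "__main__":
    # Made-up word list for the demonstration only.
    demo = ["4ok", "4okam", "4okay", "4oham", "okam", "8am", "o8am"]
    for column in ngram_pss(demo, 3):
        print(column)
```

The published frequencies are consistent with division by the column total (for example 10/99 for "4ok" as a Herbal prefix), but the exact validation rules of the original software are not known here, so real counts may differ.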
Notice how words tend to start with "4", "o" and "1" and tend to end with "9", "m" and "e". This sort of feature has me excited about Philip Neal's anagram encryption idea, explained here:
http://voynichcentral.com/users/philipneal/language.html
which is summarised thus (quoting from that page):
"1. Divide a plaintext into lines
2. Sort the words of each line into alphabetical order
3. Sort the letters of each word into alphabetical order
1. one thing led to another thing last night
2. another last led night one to thing thing
3. aehnort alst del ghint eno ot ghint ghint"
Right now I am repurposing my Genetic Algorithm to attack some lines of the VMs assuming such an encryption. I am being killed by the permutations (which grow as the factorial of the word length).
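The two sorting steps, and the size of the search space they create, can be sketched as follows. The function names are hypothetical, and this is only an illustration of the quoted scheme, not the Genetic Algorithm itself. One caveat: a strict alphabetical sort places "thing" before "to", so the word order produced here differs slightly from the quoted example, although the sorted letter groups are the same.

```python
from collections import Counter
from math import factorial


def neal_encrypt_line(line):
    """Sort the words of a line, then sort the letters inside each word."""
    words = sorted(line.split())
    return " ".join("".join(sorted(word)) for word in words)


def distinct_orderings(word):
    """Number of distinct letter orderings: n! divided by repeat factorials."""
    total = factorial(len(word))
    for repeats in Counter(word).values():
        total //= factorial(repeats)
    return total


if __name__ == "__main__":
    plain = "one thing led to another thing last night"
    print(neal_encrypt_line(plain))
    # prints "aehnort alst del ghint eno ghint ghint ot"
    print(distinct_orderings("aehnort"))  # 7 distinct letters -> 5040
```

Even before the word order is considered, a seven-letter group with no repeated letters already has 5040 candidate readings, which is the factorial growth complained about above.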