Prefix Stem and Suffix Analysis of the Voynich Manuscript

Using the "Herbal" and "Recipes" folios

I grouped all the folios from f1v to f20v inclusive, and labeled the 
group as "Herbal folios", and folios f103r to f116r inclusive 
labeled as "Recipe folios". 

I ran each group through a program that extracts all the prefixes, 
suffixes and stems, validates each, and orders them in frequency. (The 
method used was described in an earlier email to the list.) 

My first question was: are the word frequencies and prefix/stem/suffix 
(PSS) frequencies similar between the Herbal and Recipe collections? 

Here are the results. I'll show only the suffix frequencies, because 
they are the most interesting. 

Herbal: 1331 different words, top 10 words: "8am 1oe 1oy K9 89 19 s 8ay 2oe oy" 
Recipe: 1443 different words, top 10 words: "am ay 1c89 oe 4ohC9 8am oy 4oham 1c9 2c89" 

Top 10 Herbal Suffixes  (Frequency) 

9       0.105580695 
89      0.065862246 
y       0.06435395 
e       0.058320764 
am      0.04524887 
m       0.03167421 
s       0.027652087 
19      0.025641026 
8       0.023629965 
oy      0.02212167 

Top 10 Recipe Suffixes 

9       0.11764706 
e       0.05882353 
89      0.05020284 
y       0.04817444 
am      0.036511157 
8       0.029411765 
ay      0.028904665 
ae      0.024340771 
oy      0.023326572 
oe      0.021805273 

Note: Similar sets (7 of 10), with suffix "9" being approximately a factor two more 
common than the next most common suffix. 

I'm not sure what conclusions can be drawn, if any, from this. For fun, 
I applied the same analysis to a similar number of words from Augustinus 
Latin. Here are the results, together with the VMs data: 

(Augustinus: 1257 different words, top 10 words: "et te in non me mihi est domine ut enim") 

Top 10 Latin Suffixes 

m       0.118421055 
s       0.10526316 
e       0.047368422 
que     0.039473683 
i       0.034210525 
o       0.028947368 
t       0.028947368 
us      0.02631579 
rum     0.021052632 
a       0.021052632 

So, Latin does not have the same frequency pattern at all. 

So, is there a language which does have a similar patterm? I looked at French 
from 1367, Spanish from 1527, German from 1553, and old English (Courtier): 

Top 10 French Suffixes 

s    0.1199 
t    0.0736 
z    0.0708 
e    0.0654 
nt    0.0463 
es    0.0436 
l    0.0245 
r    0.0218 
re    0.0191 
er    0.0191 
tre    0.0163 

Top 10 Spanish Suffixes 

s    0.1874 
n    0.0519 
o    0.0464 
a    0.0445 
r    0.0297 
do    0.0297 
es    0.0241 
l    0.0223 
e    0.0223 
va    0.0204 
to    0.0148 

Top 10 German Suffixes 

en    0.1171 
t    0.1171 
s    0.1122 
n    0.0537 
er    0.0390 
ten    0.0341 
d    0.0341 
e    0.0293 
m    0.0293 
ts    0.0244 
r    0.0244 

Top 10 English Suffixes 

e    0.1404 
n    0.0449 
s    0.0421 
t    0.0393 
re    0.0337 
y    0.0281 
ne    0.0253 
l    0.0253 
r    0.0253 
ll    0.0253 
ed    0.0225 

The Spanish suffix "s" is three times more frequent than the next suffix: not 
a good match to the VMs. Similarly for the English "e". 

The German suffix pattern is completely different to the VMs. 

The French pattern looks similar to the VMs. 

Let's look at the French Stems, and compare with the VMs: 

Top 10 Herbal Stems 

o       0.15171504 
9       0.058377307 
8       0.045184698 
k       0.04287599 
1o      0.040567283 
oe      0.036609497 
o8      0.028364116 
oy      0.026385223 
y       0.02176781 
2       0.02176781 

Top 10 French Stems 

a       0.0704 
d       0.0544 
es      0.0528 
en      0.0448 
le      0.0432 
se      0.032 
ent     0.0304 
de      0.0272 
ce      0.0272 
ne      0.0256 

A poor match. 

Conclusion: the "9" suffix in the VMs appears too frequently for it to come from 
Latin, German, English or Spanish. Although French has a similarly frequent suffix 
"s", the stem frequencies of French don't match the VMs. 

Hypothesis: the "9" suffix in the VMs is not a word suffix, but punctuation or 
some other annotation. Perhaps a key mark for deciphering purposes. 

Next step: re-analyse the PSS frequencies in the VMs after removing suffix "9" 
from words where it appears.

Using the Biological and Astrological Folios

Astrological: folios 66v to 73v inclusive

Biological: folios 75r to 85r inclusive

I agree that one might expect the results to be different for 
the astro section since we are presumably looking at labels and 
names. So it would be like comparing a dictionary result to 
a prose result. (I will make a version of the analysis code that 
just looks at unique words in the text.) 

Anyway, here are the results: 

Herbal: 1331 different words,       top 10 words: "8am 1oe 1oy K9 89 19 s 8ay 2oe oy" 
Recipe: 1443 different words,       top 10 words: "am ay 1c89 oe 4ohC9 8am oy 4oham 1c9 2c89" 
Astrological: 1771 different words, top 10 words: "ay am ae 8am s 8ay 8ae 89 okcos ohC9" 
Biological: 2135 different words,   top 10 words: "oe 4ohan 1c89 2c89 4ohc89 4oe 4ohae 1c9 4oham" 

Top 10 Herbal Suffixes  (Frequency) 

9       0.105580695 
89      0.065862246 
y       0.06435395 
e       0.058320764 
am      0.04524887 
m       0.03167421 
s       0.027652087 
19      0.025641026 
8       0.023629965 
oy      0.02212167 

Top 10 Recipe Suffixes 

9       0.11764706 
e       0.05882353 
89      0.05020284 
y       0.04817444 
am      0.036511157 
8       0.029411765 
ay      0.028904665 
ae      0.024340771 
oy      0.023326572 
oe      0.021805273 

Top 10 Astrological Suffixes 

9       0.120173536 
89      0.055531453 
am      0.046420824 
ay      0.04381779 
s       0.04295011 
ae      0.04251627 
e       0.040347073 
79      0.026898047 
y       0.022993492 
oe      0.022125814 

Top 10 Biological Suffixes 

9       0.11961975 
89      0.049643517 
e       0.038288884 
oe      0.031687353 
y       0.030102983 
c89     0.029838923 
ae      0.0293108 
c9      0.0293108 
oy      0.02719831 
ay      0.02508582 


The suffix frequency results for the different folio groups look 
reassuringly similar to me: the differences are what you would 
see if you compared two modestly sized tests in, say, English. 
Indeed, one can tentatively conclude that the language is the 
same in all four of the VMs sections. 

On the other hand, the top 10 word lists are quite different. Curious. 



Regarding word stems: the definition of a word stem for this study is 
"any group of characters that spells a valid word by itself, and is 
also found following one or more other characters (a prefix) and/or followed 
by one or more other characters (a suffix)." 

So, single VMs characters can be stems. After all, it may be that 
a single VMs character equates to multiple plaintext characters, so 
we have to have the flexibility to assign single characters as stems. 

To clarify, take for example the VMs word "8am". The candidate stems are 
"8am", "8a", "am", "8", "a" and "m". Those candidates that appear as single words 
in the VMS dictionary are classed as valid stems (in this case, I believe 
all six are valid stems). 

Once we have a list of all the valid stems in the text, we can count how 
often each appears, and then order that list. This is what is done to 
obtain the lists above. 

Because this method is fully general, we avoid any assumptions about how 
many characters a single VMs character maps to.

Refinement

I changed the algorithm so that it only accumulated prefix/stem/suffixes
for unique words in the VMs (as opposed to accumulating them for all
words). I think this is more sensible, otherwise a very popular word
ended up skewing the statistics.

After doing this, the results for suffixes look similar between
Latin and VMs (Recipes) - using 3800 words:

Top 20 Latin Suffixes (from a Latin dictionary)

s 0.08350305
o 0.042769857
t 0.03971487
m 0.034623217
is 0.029531568
e 0.02749491
us 0.026476579
a 0.022403259
es 0.020366598
rum 0.01934827
um 0.018329939
tum 0.017311608
mus 0.017311608
to 0.017311608
i 0.01629328
tus 0.01629328
tis 0.015274949
c 0.014256619
em 0.013238289
am 0.013238289

Top 20 Herbal Suffixes

9 0.094210714
89 0.045487236
e 0.040273283
ay 0.036857247
y 0.036857247
am 0.03613808
ae 0.029126214
an 0.028047465
oe 0.024631428
79 0.023552679
oy 0.023013305
8 0.023013305
o 0.020316433
ap 0.019417476
c89 0.018878102
c9 0.017979145
s 0.017799353
m 0.015462064
o89 0.014383315
19 0.01366415

This suggests the following (partial) cipher :

VMs Latin
=== =====
9 s
8 i
7 u
e m
a r
o a
y um
m is

1 t 
4 qu
c e
g f
k c
2 d
s p
h n
3 h


Top 20 VMs words translated

am -> ris
ay -> rum
ae -> rm
1c89 -> teis
4ohC9 -> quan?s
1c9 -> tes
oe -> am
4oham -> quanris
8am -> iris
4ohan -> quanr?
oham -> anris
okam -> acris
oy -> aum
an -> r?
ohan -> anr?
e -> m
2c89 -> dkis
1c79 -> tkus
ohC9 -> an?s
okay -> acrum

Looking for longer repeating character sequences

In this analysis, the software looks in the text for all nGrams that 
appear at least twice as a) a prefix, or b) as a suffix or at least 
once as a stem, and calculates their (normalised) frequencies. 

I'm not sure what to make of the results! 


For N=3, looking at the Herbal folios f1v-f20v inclusive, 1331 different words. 

Confirmed valid prefix/stem/suffix counts 99 252 111 
Prefix/Stem/Suffix frequency, normalised 
4ok     0.1010101               o89     0.05952381              o89     0.09009009 
4oh     0.07070707              1oe     0.055555556             8am     0.09009009 
1oe     0.060606062             4ok     0.055555556             1c9     0.054054055 
1oh     0.04040404              8am     0.04761905              1oy     0.054054055 
ok1     0.04040404              4oh     0.04761905              1oe     0.045045044 
8oe     0.030303031             1oy     0.03968254              coe     0.036036037 
1oy     0.030303031             1c9     0.031746034             cc9     0.027027028 
1co     0.030303031             1co     0.023809524             e89     0.027027028 
1ok     0.030303031             8oe     0.023809524             ham     0.027027028 
4oj     0.030303031             coe     0.01984127              2c9     0.027027028 

For N=3, processing the same number of different words from Thomas Hardy (English) 

Confirmed valid prefix/stem/suffix counts 87 160 67 
Prefix/Stem/Suffix frequency, normalised 
com     0.04597701              ely     0.025           ing     0.07462686 
par     0.022988506             ted     0.025           led     0.04477612 
rea     0.022988506             led     0.025           sed     0.04477612 
mot     0.022988506             sed     0.025           ely     0.04477612 
pla     0.022988506             ght     0.025           ted     0.029850746 
see     0.022988506             ing     0.01875         ter     0.029850746 
pas     0.022988506             ked     0.01875         son     0.029850746 
wai     0.022988506             per     0.01875         ned     0.029850746 
can     0.022988506             com     0.01875         ner     0.029850746 
smi     0.022988506             par     0.01875         mon     0.029850746 

For N=3, same number of words from Augustinus (Latin) 

Confirmed valid prefix/stem/suffix counts 102 197 83 
Prefix/Stem/Suffix frequency, normalised 
qua     0.039215688             ere     0.05076142              ere     0.04819277 
fac     0.029411765             qua     0.035532996             iat     0.04819277 
qui     0.029411765             fac     0.02538071              que     0.036144577 
dic     0.029411765             ita     0.02538071              ius     0.036144577 
pot     0.029411765             ius     0.02538071              ita     0.036144577 
ter     0.019607844             que     0.020304568             rum     0.024096385 
ali     0.019607844             dic     0.020304568             ent     0.024096385 
aud     0.019607844             ini     0.020304568             ram     0.024096385 
par     0.019607844             ans     0.015228426             unt     0.024096385 
cor     0.019607844             ent     0.015228426             ris     0.024096385 


For N=4 Voynich (statistics become poorer as N increases, of course) 

Confirmed valid prefix/stem/suffix counts 6 14 6 
Prefix/Stem/Suffix frequency, normalised 
4oko    0.16666667              o8ae    0.14285715              co89    0.16666667 
okam    0.16666667              okam    0.14285715              e8am    0.16666667 
oh2o    0.16666667              4ok1    0.071428575             o8an    0.16666667 
4okc    0.16666667              4oh1    0.071428575             e2oe    0.16666667 
k2co    0.16666667              co89    0.071428575             9koy    0.16666667 
4ohC    0.16666667              4oko    0.071428575             oKoy    0.16666667 
4ok1    0.0                     e8am    0.071428575             1o89    0.0 
4oh1    0.0                     oh2o    0.071428575             oe89    0.0 
ok1c    0.0                     o8an    0.071428575             o8ae    0.0 
ohoe    0.0                     4okc    0.071428575             ho89    0.0 

For N=4 English 

Confirmed valid prefix/stem/suffix counts 36 66 26 
Prefix/Stem/Suffix frequency, normalised 
pres    0.055555556             ined    0.045454547             sing    0.115384616 
dist    0.055555556             ring    0.045454547             ined    0.115384616 
weak    0.055555556             test    0.045454547             ally    0.07692308 
occa    0.055555556             ment    0.030303031             ring    0.03846154 
outl    0.027777778             pres    0.030303031             ence    0.03846154 
prob    0.027777778             sing    0.030303031             nded    0.03846154 
ment    0.027777778             weak    0.030303031             ding    0.03846154 
cons    0.027777778             prob    0.030303031             ning    0.03846154 
atte    0.027777778             hern    0.030303031             ness    0.03846154 
stan    0.027777778             sion    0.030303031             wing    0.03846154 

For N=4 Latin 

Confirmed valid prefix/stem/suffix counts 63 126 57 
Prefix/Stem/Suffix frequency, normalised 
faci    0.06349207              bant    0.03968254              ntes    0.0877193 
pecc    0.04761905              ntes    0.03968254              quam    0.05263158 
invo    0.031746034             faci    0.031746034             endo    0.05263158 
cred    0.031746034             pecc    0.031746034             ebam    0.03508772 
infa    0.031746034             endo    0.023809524             erem    0.03508772 
puer    0.031746034             ndis    0.023809524             iens    0.03508772 
habe    0.031746034             quam    0.023809524             ones    0.03508772 
form    0.031746034             quid    0.023809524             bant    0.01754386 
pare    0.031746034             rati    0.023809524             abam    0.01754386 
nesc    0.031746034             ibus    0.015873017             ndis    0.01754386 

For N=5 Voynich (no data satisfies selection) 

For N=5 English 

Confirmed valid prefix/stem/suffix counts 15 29 13 
Prefix/Stem/Suffix frequency, normalised 
consi   0.13333334              ation   0.06896552              ation   0.15384616 
ornam   0.13333334              consi   0.06896552              sting   0.15384616 
appea   0.06666667              ornam   0.06896552              dered   0.07692308 
dimen   0.06666667              sting   0.06896552              ality   0.07692308 
occup   0.06666667              still   0.06896552              ingly   0.07692308 
stand   0.06666667              dered   0.03448276              ental   0.07692308 
conce   0.06666667              ingly   0.03448276              rning   0.07692308 
sugge   0.06666667              dimen   0.03448276              ented   0.07692308 
diffe   0.06666667              occup   0.03448276              rence   0.07692308 
speci   0.06666667              ality   0.03448276              sions   0.07692308 

For N=5 Latin 

Confirmed valid prefix/stem/suffix counts 21 44 23 
Prefix/Stem/Suffix frequency, normalised 
volun   0.0952381               entes   0.06818182              entes   0.13043478 
pecca   0.0952381               batur   0.045454547             batur   0.08695652 
lauda   0.0952381               tibus   0.045454547             antur   0.08695652 
quaer   0.0952381               invoc   0.045454547             tibus   0.08695652 
metue   0.0952381               pecca   0.045454547             bamus   0.08695652 
invoc   0.04761905              lauda   0.045454547             torum   0.08695652 
infan   0.04761905              quaer   0.045454547             tatis   0.04347826 
inven   0.04761905              volun   0.045454547             itate   0.04347826 
nesci   0.04761905              metue   0.045454547             antes   0.04347826 
paren   0.04761905              bamus   0.045454547             bilis   0.04347826

Here are the N=3 counts/frequency for the 1331 unique words in f1v-f20v of the Herbal: 

Confirmed valid prefix/stem/suffix counts 99 252 111 
Prefix/Stem/Suffix frequency, normalised 
4ok     10      0.1010101               o89     15      0.05952381              o89     10      0.09009009 
4oh     7       0.07070707              1oe     14      0.055555556             8am     10      0.09009009 
1oe     6       0.060606062             4ok     14      0.055555556             1c9     6       0.054054055 
1oh     4       0.04040404              8am     12      0.04761905              1oy     6       0.054054055 
ok1     4       0.04040404              4oh     12      0.04761905              1oe     5       0.045045044 
8oe     3       0.030303031             1oy     10      0.03968254              coe     4       0.036036037 
1oy     3       0.030303031             1c9     8       0.031746034             cc9     3       0.027027028 
1co     3       0.030303031             1co     6       0.023809524             e89     3       0.027027028 
1ok     3       0.030303031             8oe     6       0.023809524             ham     3       0.027027028 
4oj     3       0.030303031             coe     5       0.01984127              2c9     3       0.027027028 


(e.g. the sequence "4ok" appears 10 times at the start of a longer word (prefix)) 

N=3 for 1331 unique words in the Astrological Section 

Confirmed valid prefix/stem/suffix counts 154 346 153 
Prefix/Stem/Suffix frequency, normalised 
okc     11      0.071428575             o89     16      0.046242774             o89     13      0.08496732 
ohc     8       0.051948052             okc     11      0.031791907             cos     6       0.039215688 
4oh     7       0.045454547             8ae     11      0.031791907             8am     6       0.039215688 
9hc     7       0.045454547             1co     10      0.028901733             8ae     6       0.039215688 
oko     6       0.038961038             oko     10      0.028901733             cc9     4       0.026143791 
oka     6       0.038961038             oho     9       0.02601156              coe     4       0.026143791 
oho     5       0.032467533             ohc     8       0.023121387             o79     4       0.026143791 
1ok     5       0.032467533             oka     8       0.023121387             oh9     4       0.026143791 
oh1     5       0.032467533             4oh     8       0.023121387             c79     4       0.026143791 
1co     4       0.025974026             9hc     7       0.020231213             c89     3       0.019607844 


N=3 for 1331 unique words in the Biological Section 

Confirmed valid prefix/stem/suffix counts 124 275 124 
Prefix/Stem/Suffix frequency, normalised 
4oh     13      0.10483871              c89     26      0.094545454             c89     17      0.13709678 
4ok     10      0.08064516              4oh     20      0.07272727              c79     13      0.10483871 
4oe     8       0.06451613              c79     13      0.047272727             1c9     9       0.07258064 
oeh     6       0.048387095             4ok     12      0.043636363             C89     7       0.05645161 
oe1     5       0.04032258              1c9     11      0.04                    2c9     7       0.05645161 
ohc     4       0.032258064             2c9     9       0.03272727              189     4       0.032258064 
soe     4       0.032258064             4oe     8       0.02909091              eoy     3       0.024193548 
oe2     3       0.024193548             oeh     7       0.025454545             cc9     3       0.024193548 
91c     3       0.024193548             8ae     7       0.025454545             hC9     3       0.024193548 
8ay     3       0.024193548             8ay     7       0.025454545             ae9     3       0.024193548 


N=3 for 1331 unique words in the Recipes Section 

Confirmed valid prefix/stem/suffix counts 135 303 143 
Prefix/Stem/Suffix frequency, normalised 
4oh     17      0.12592593              4oh     18      0.05940594              c89     13      0.09090909 
4ok     14      0.1037037               4ok     17      0.05610561              o89     13      0.09090909 
ohc     9       0.06666667              o89     16      0.052805282             189     8       0.055944055 
okc     8       0.05925926              c89     15      0.04950495              c79     7       0.04895105 
oeh     7       0.05185185              oeh     10      0.0330033               8am     7       0.04895105 
1co     5       0.037037037             1co     10      0.0330033               8ay     6       0.04195804 
g1c     4       0.02962963              ohc     9       0.02970297              coe     5       0.034965035 
4oj     4       0.02962963              c79     9       0.02970297              8ae     5       0.034965035 
ohC     4       0.02962963              8ae     9       0.02970297              1c9     4       0.027972028 
1oe     3       0.022222223             189     9       0.02970297              cc9     4       0.027972028 

I can generate the tables for N=4 and N=5 if they are of interest.

Philip Neal's Anagram Encryption

Notice how words tend to start with "4", "o" and "1" and tend to end 
with "9", "m" and "e". 

This sort of feature has me excited about Philip Neal's anagram encryption idea 
explained here: http://voynichcentral.com/users/philipneal/language.html 
which is summarised thus (quoting from that page): 

  "1. Divide a plaintext into lines 
   2. Sort the words of each line into alphabetical order 
   3. Sort the letters of each word into alphabetical order 

   1. one thing led to another thing last night 
   2. another last led night one to thing thing 
   3. aehnort alst del ghint eno ot ghint ghint" 

Right now I am repurposing my Genetic Algorithm to attach some lines of 
the VMs assuming such an encryption - I am killed by the permutations 
(which go as factorial the length of the word).