The Voynich Manuscript

This is a collection of various computational methods I have used in an attempt to get some understanding of the cipher used in the Voynich manuscript.

Comparisons

Here

Strong's Cipher

Supposing that Strong's conjecture about the structure of the VMs cipher is correct, we use a computational method to try to infer the alphabets and their sequence with the use of a large Latin diictionary.

Details are here.

Mapping the Voynich Manuscript to Music

Using a Java Media Framework MIDI application that processes the Voynich text, aach character is mapped to a note on the MIDI scale in a two octave range: Note = 48 + i mod 24

Each word is played as one or more chords of these notes, using the Grand Piano instrument.

If a gallows character ("h" or "k") appears in the word, then the notes preceding the gallows character are played as a chord, followed shortly by the notes after the gallows character, but shifted by either 1 octave ("h") or two octaves ("k"). If the character "9" appears as the last in the word, then the delay before playing the next word is reduced.

Using this method, here is what the common Voynich word "8am" sounds like: 8am

The word "okoe": okoe

The word "4ohc89": 4ohc89

And the complete Folio f1r (a long pause is taken after each paragraph on the folio): Folio f1r

And Folio f3r

 

Folio Similarities

More details here.

A Phonetic Attack

Comparing the phonetic content of various texts with VMs "words", using "Soundex" and "Double Metaphone"

For details, go here.

Phrase Analysis

This study is based on the idea that the Voynich "Words" are in fact codes for parts of plaintext words. The Voynich symbol "9" is used to delimit the words. A Genetic Algorithm is used to explore the cipher, using a large body of common Latin phrases (as opposed to single words).

For details, go here.

Vowel/Consonant Group Encryption Scheme

For details, go here.

Transitions between Glyphs

This study looks at the probabilities associated with glyph sequences in the VM. E.g. which is the most likely glyph to find at the beginning of a word in the herbal section?

For details, go here.

Anagram Analysis of Star Labels etc.

This investigation was inspired by Robert Teague's deciphering of e.g. f68r3.

For details, go here.

Analysis of the Prefixes, Stems and Suffixes in the Voynich Manuscript

Extract:

For N=3, looking at the Herbal folios f1v-f20v inclusive, 1331 different words. 

Confirmed valid prefix/stem/suffix counts 99 252 111 
Prefix/Stem/Suffix frequency, normalised 
4ok     0.1010101               o89     0.05952381              o89     0.09009009 
4oh     0.07070707              1oe     0.055555556             8am     0.09009009 
1oe     0.060606062             4ok     0.055555556             1c9     0.054054055 
1oh     0.04040404              8am     0.04761905              1oy     0.054054055 
ok1     0.04040404              4oh     0.04761905              1oe     0.045045044 
8oe     0.030303031             1oy     0.03968254              coe     0.036036037 
1oy     0.030303031             1c9     0.031746034             cc9     0.027027028 
1co     0.030303031             1co     0.023809524             e89     0.027027028 
1ok     0.030303031             8oe     0.023809524             ham     0.027027028 
4oj     0.030303031             coe     0.01984127              2c9     0.027027028 

For more results, go here.

Useful Links

This is by no means an exhaustive list, but these are some of the Voynich sites/pages I have found useful, interesting and/or amusing.

Rene Zandbergen's site

http://www.voynich.nu/

Voynich Central

http://www.voynichcentral.com/

Elmar Vogt's blog

http://voynichthoughts.wordpress.com/

H.R. Santa Coloma's site

http://www.santa-coloma.net/voynich_drebbel/voynich.html

Philip Neal's webpages

http://voynichcentral.com/users/philipneal/

Jan Hurych's VM Letter Frequency Analysis

http://hurontaria.baf.cz/CVM/b2.htm

Voynich Transcription

http://voynichcentral.com/transcriptions/Voynich-101/index.html
 
Edith Sherwood's site
 
http://www.edithsherwood.com/voynich_botanical_plants/
 
This is a list of Latin herb names compiled from various source

https://julianbunn.org/Voynich/LatinHerbs.txt
 
Don Latham's site (with lists of Star names, constellations etc.)
 
http://www.sixmilesystems.com/voynich
 
Francois Almaleh's site
 
http://www.almaleh.com/ms-voynich/index.html
 
 

Attempts to crack the cipher with Genetic Algorithms

Some Assumptions about the VM text

My assumption in the work with Genetic Algorithms (GAs) is that the Voynich is a terse, compressed, abbreviated n-Gram conversion of the plaintext language. The scribe would have consulted a table of n-Grams while converting the Latin (or whatever) to the VM text, perhaps even writing the enciphered words out on a separate work sheet before copying them into the VM.

For example, assume the following extract from a set of possible n-Gram conversion rules (or "mappings") used by the scribe:

Source n-Gram Voynich n-Gram
pot oc
he c
is 9
h f
s 8
y a

then, to convert the English word "hypothesis" to Voynichese, we do as follows:

h => f

y => a

pot => oc

he => c

s => 8

is => 9

so that "hypothesis" => "faocc89"

The inverse mapping can be used to decode the Voynich word (perhaps not unambiguously ...).

Clearly, there are a very large number of possible n-Gram mappings!

Candidate Decipherings of f27v

Go here to look at folio 27v, and see the various Genetic Algorithm attempts at deciphering the text.

Voynich Herbs

Edith Sherwood has a web site where she details compelling possible identifications for the plants depicted in the "herbal" pages of the VM.

Dana Scott's page also has plausible identifications for the plants.

As has often been pointed out, if we look at the first Voynich "word" that appears on each page of the herbal part of the VM, we find that those words are unique, or appear elsewhere very rarely. It thus seems reasonable that the words may be the names of the plants depicted.

The GA was set up to find a set of n-Gram mappings that would convert a list of 111 Voynich first herbal words into Latin/English or Spanish. For this, dictionaries of Latin, English and Spanish herb/plant names were used.

The GA sought a mapping that would convert all the Voynich words for herbs/plants into as many valid plaintext (Spanish, English, Latin) words as possible. The best result was for a mixed English/Latin dictionary (see table): 31 of the 111 Voynich words were converted, about 30% success rate.

(One should never expect 100% success, due to missing names in the dictionary, transcription errors, missing n-Grams, incomplete n-Grams etc..)

The results are shown below in tabular form, together with Dana Scott's and Edith Sherwood's identification. The first column shows the folio in the VM, the second shows the first Voynich word on that folio. For the GA identification columns (3 and 4) the Voynich mapped word is shown, in quotation marks if not found in the associated dictionary, and in bold if found in the dictionary.

Note that, probably unsurprisingly, nowhere do the IDs from the GA in Spanish, English/Latin and Scott/Sherwood, agree! NOT YET, anyway :-)

(What amuses me about about this mapping technique is that it tends to produce words that sound plausible in the target language. E.g. for f4r the Latin/English word "paptise" sounds like a valid word.)

Folio Voynich 1st Word Candidate GA ID, Spanish Candidate GA ID, Latin/Engish Dana Scott ID, English Dana Scott ID, Latin Sherwood ID, Latin Sherwood ID, English
f1r fa19s costa "greica"        
f1v h1s9 rabo geum Deadly Nightshade Atropa belladonna Hyoscyamus niger Solanum nigrum Solanum dulcamara Atropa belladonna Deadly Nightshade
f2r h98an9 "jzba" "ariapha" Cornflower Centaurea cyanus Centaurea diffusa Diffuse Knapweed
f2v hoom "meic" "padi" Water Lily Nymphaea candida Nymphoides Nymphoides
f3r k2cos chinita (Impatiens) arnica     Celosia argentea Feathery amaranth
f3v hoam menta (mint) paris     Helleborus foetidus Dungwort
f4r ho8ae19 "mezirn" "paptise"     Saxifraga cespitosa Alpine Saxifrage
f4v j1oom pastora (Poinsettia) "oigle"     Campanula rapunculus Rampion
f5r h2o89 "piyn" "hicse"     Arnica montana Wolfs Bane
f5v hA1coy malanga (Malanga) cirsium Tennis Racket Plant Agrimonia eupatoria Malva sylvestris Mallow
f6r foay "oote" "erk"     Acanthus mollis Bear Breeches
f6v hoay9say1Chay "meotendoteisedh" "pakpikrtsst"     Eryngium maritimum Sea Holly
f7r f1o8am "saynta" acris     Trientalis europea Starflower
f7v joe29 "rden" anise     Myrica gale Bog Myrtle
f8r g2oe "dno" "miv"     Pisum sativum Green Pea
f8v Ko8 "anop" "amot"     Symphytum officinale Comfrey
f9r k98eo "uardna" "cernur"     Ricinus communis Casteroil
f9v fo1oy "oveh" "erut" Heartsease, Wild Pansy Viola tricolor Violaceae Viola
f10r g1oK9 "pohon" "apryse"     Cichorium pumilum Chicory Endive
f10v gam tora (Tora Tree) gale     Linnaea borealis Twinflower
f11r k2oe chino (Chinese Hat Plant) "arv"     Rosmarinus officinalis Rosemary
f11v goe81o89 "albaveaca" "maadud"     Curcuma longa Turmeric
f13r koy3oy "lenga" "mdoium"     Banana Banana
f13v hoaiy "memh" "paft"     Lonicera periclymenum Honeysuckles Woodbines
f14r g1o8am "poynta" "apcris"     Scorzonera Black Salsify Vipers Grass
f14v g891om "uomic" "gesdi"     Stachys monnieri Wood Betony Heal-all Sel-heal Woundwort
f15r k2oy "chiga" "arium"     Sonchus oleraceus Sow Thistles
f15v gayoy "t8h" "gabt"     Paris quadrifolia Herb Paris
f16r go1co89 "alblanyn" "marscse"     Cannabis Cannabis
f16v g1yAm "potoora" "aptule"     Chrysanthemum Chrysanthemum
f17r f2o89 "hayn" "ulcse"     Catananche caerulea Cupids Dart
f17v g1o8oe "poyno" "apcv"     Dioscorea Yams
f18r g8yaz89 "ullngn" "gmeagse"     Aster alpinus Aster
f18v koe8 la (?) mad     Telfairia Fluted pumpkin
f19r g1oy "poga" apium     Polemonium coeruleum Greek Valerian
f19v go1am "albbora" mantle     Draba nivalis Nailwort
f20r h81o89 "caveaca" woud     Astragalus hypoglottis Milk vetch
f20v faIsay "crrote" greek     Cynara cardunculus Cardoon
f21r g1oy "poga" apium     Anagallis arvensis Pimpernel
f21v koe829 "laol" "madpe"     Dictamnus albus Burning bush False Dittany White Dittany Gas Plant
f22r goe "albv" "maus"     Verbena officinalis Common Vervain Holy Herb
f22v g9samoy "..dah" "hnshot"     Tulip Tulip
f23r g9818op ".fhilo" "hsthlo"     Pulsatilla vulgaris Pasque flower
f23v go8azoe "albzucv" "mapacus"     Borago officinalis Borage Star Flower
f24r goyoy9 "alb.." "maby"     Cucumis sativus Cucumber
f24v k1o8ay coyote (wild) rock     Ficus religiosa Sacred Fig Bo Tree
f25r f1oe89 "sanoaca" "avd"       Wild Thyme
f25v goCam "albcuora" "malile"     Isatis tinctoria Woad
f26r g%coh9 "spnij" lunaria     Prunella vulgaris Self heal
f26v g1c8ay pochote (Pochote) "apgok"     Lens culinaris Lentil
f27r hsoy manga (Mango) "veium"     Spinacia oleracea Spinach
f27v fo1ou oveja (?) eruca French Marigold Tagetes patula Dianthus superbus Dianthus
f28r g1o8ay "poyote" "apck"     Aristolochia Smearwort Birthwort Pipevine
f28v h2oe pino (Pine) "hiv" Dahlia Dahlia imperialis Rhododendrons Rhododendrons
f29r gosam "alb.ora" "mansle"     Lactuva sativa longifolia Romaine Cos Lettuce
f29v hoom "meic" "padi"     Nigella sativa Roman coriander
f30r oh1cs9 "elanbo" "inrsum"     Prunella vulgaris Healall
f30v Ks1an rubia (Madder) montana     Cuscuta europaea Dodder
f31r hcc8c9 lichi (Lychee) "rgoio"     Erigeron acris Fleabane
f31v go8az "albzon" "mapnn" Fernleaf yarrow Achillea filipendulina Valerian Valerian
f32r f1am santa (?) "aris"     Veronica triphyllos Speedwell
f32v h1co8am "ranizora" "genple"     Campanula rotundifolia Harebell
f33r k28ay "chizh" "arpt"     Silene vulgaris Bladder Campion
f33v kayay "qllh" "opmet" Masterwort Astrantia major Tanacetum parthenium Feverfew
f34r g1cocj19 "ponianos" "apnbie"     Anemone hortensis  
f34v hs189 "mansn" "vewse"     Lunaria annua Honesty Money Plant
f35r Koo anona (Custard Apple) amur     Cichorium intybus Radicchio
f35v gay1oy "trtga" galium     Ribes nigrum Blackcurrant
f36r j1af8aN "pa.nzti" "onupfl"     Delphinium staphisagria Delphinium
f36v g1ayos9 "pooteesn" "apksise"     Lamium amplexicaule Henbit
f37r koGoe "luiv" malus     Mentha longifolia Mint
f37v h2o89 "piyn" "hicse"   fedtschenkoi englerii Emilia fosbergii Tassel flower
f38r koeoy "lilh" "mmut"        
f38v oh1oj "eveet" inula     Euphorbia myrsinites Myrtle Spurge
f39r kc7o128 "goguadp" "gienmpot"        
f39v g7aiy "inmh" "naft"        
f40r g1c9 "poi" apio     Erodium malacoides Storks bill
f40v j1c7an "pagmo" "oospo"   Epiphyllum oxypetalum Crocus vernus Crocus
f41r j2c9hc8aecc9 "roilizrii" "ediorpcuio"     Origanum vulgare Wild Marjoram
f41v hcSo8ae "lirbzv" "riupus"     Coriandrum sativum Coriander Cilantro
f42r 2o "ah" st        
f42v k1o˛ cola (?) rosa     Aquilegia vulgaris Columbine Culverwort
f43r kayo8am "q.zora" "opbple"     Stellaria media Chickweed
f43v g8saiy9 "u.lbn" "gnsicse"     Elytrigia repens Couch grass
f44r k2o8g9 "chiy." arch     Mandragora officinarum Mandrake
f44v k2o china (Impatiens) "arur"     Apium graveolens Celery
f45r g9h98ae ".jzv" "hariapus"     Atriplex hortensis Orach Saltbush
f45v hosay9 "me.." pansy     Lavandula angustifolia Lavender
f46r g1coJ9 "ponitr" "apnta"     Leucanthemum vulgare Oxeye Daisy
f46v jo79e3c7 "rimvig" "andretos"   Tanacetum parthenium, Chrysanthemum parthenium Inula conyza Ploughmans Spikenard Great Fleabane
f47r g1aiy "pomh" "apft" Lady's Mantle, Lion's Foot Alchemilla vulgaris Rosaceae Sempervivum tectorum Houseleek
f47v g2cok "dnier" minor   Arnica montana Pulmonaria officinalis Lungwort
f48r g28am "dzora" "miple"     Adonis Vernalis False Hellebore
f48v g1co819 "ponifn" "apnsse"     Ruta graveolens Rue Herb of Grace
f49r gA2oe "ceahv" costus     Nymphaea caerulea Blue Nile Lotus
f49v g he wort        
f50r g2coy "dnih" mint     Astrantia major Masterwort
f50v k19 con (?) rose   Telopea speciosissima Gentiana frigida Stiff Gentain
f51r k2oe819 "chinofn" "arvsse"     Cakile maritima Searocket
f51v go2o89 albahaca (Basil) "mastd"     Salva officinalis Sage
f52r k8oh1F9 "queacn" "toinnise"     Anemone coronaria Poppy Anemone
f52v g1oy "poga" apium     Polystichum setiferum Fern
f53r hA8ap "mazlo" "ciplo"     Achillea Ptarmica Sneezewort
f53v k2oy3c9 "chigamin" "ariumocse"     Hieracium aurantiacum Hawkweed
f54r go8am "albzora" maple     Cirsium oleraceum Cabbage thistle
f54v g1co8ay "ponizh" "apnpt" Bittersweet Nightshade Solanum dulcamara Perovskia atriplicifolia Russian Sage
f55r go8am "albzora" maple     Fumaria officinalis Fumitory
f55v h1C8189 "raecsn" "geriwse" Forest lily Veltheima bracteata Broccoli Broccoli
f56r ok1ae "tebv" "trntus"     Drosera  Sundews
f56v h1cok "ranier" "genor"     Cycas revoluta Sago Palm
f57r joccoHc9 "riopei" "anomiaio"     Sherardia arvemsis Blue Field Madder
f65r           Alchemilla vulgaris Ladies Mantle
f65v           Centaurea cyanus Cornflower
f66v           Satureja montana Winter Savory
f87r           Satureja hortensis Summer Savory
f87v         Senecio Primula vulgaris Primrose
f87v         Kleinia Pedicularis flammea Lousewort Wood Bettony
f89v           Actaea spicata Baneberry
f90r           Conyza bonariensis Fleabane
f90v           Eruca vesicaria Arugula Rocket
f93r           Cynara cardunculus Artichoke
f93v           Lupinus Lupin
f94r         Botrychium lunaria Botrychium lunaria Moonwort Moonfern
f94v           Agrostemma Githago Corncockle Red Campion
f94v           Glycyrrhiza glabra Liquorice
f94v           Plantago lanceolata Ribwort Plantain Kemps
f95r         Berberis Sambucus nigra Elderberry
f95v           Althaea Rosea Hollyhock
f96r           Angelica archangelica Garden Angelica
f96v           Tamus communis Black Bryony

 

Character Analysis

See the section "Bibliography" at the end for the provenance of the various texts used.

Here is the ranking of the most popular words in each text (full tables are available in the Excel spreadsheet)

The 1/2/3/4 n-Grams are calculated as follows. Suppose the word "sesame" appears in the text. The following counts are made:

Then, each count is weighted by the length of the group, yielding s=2, e=2, a=1, m=1, se=2, es=2, sa=2, am=2, me=2, ses=3, etc. The counts are added to running sums for each distinct group found, and normalised at the end of processing.

Comparison of most popular "words" Comparison of most popular characters Comparison of most popular 1/2/3/4 character

combinations

Voynich Latin German English French Herbs
8am et vnnd the de et
1oe in ain and et in
1oy non jn of que non
s te das to la si
19 est die in le qu�
oy me darnach that a Et
am sed es a en Hoc
8ay mihi so it les per
89 ut den not ou tibi
K9 enim z� I soit H�c
2oe cum mit he & quam
8an quod th� be par tamen
1c9 ad woll is Roy cum
ay deus nim for ne dum
2o qui machen as lour forte
oe a s� so si sub
oham quam jst have ceo vires
oh9 quia ainem with pour genus
4ok19 aut las� other nostre est
8ae de darein this dit tum
9 quae wenig they ditz Si
Koe quo der his nul quoque
4oh19 tu nit And est quod
2oy quid oder all soient ut
7am si man them des se
Koy illa von then terre huius
1H9 atque daran you come etiam
ok9 per wie but Item satis
Voynich Latin German English French Herbs
o e n e e i
9 i e t s e
1 a a a n a
a t i o t t
8 u s n r r
c s r h i u
h n d s o s
e m t i a o
y r l r u n
k o h d l m
m c c l d c
2 d m u c l
4 l � m p p
s p o y m d
7 b g c g v
, q f f f b
K v v w q g
p g b g v f
C f w p z q
n h z b h �
H x j v y h
g E k k b x
A I p I E P
3 S � A x H
j Q � T & S
z y u C R E
( C A L I A
f A N x S I
Voynich Latin German English French Herbs
o e e e e i
9 i n t s e
1 a a n n a
a t r a r r
8 u s r i t
c s i s t u
h r t o o s
e n l i a n
k m d h u o
y o h l l m
1o c c d c c
oe d en u es l
oh l ch c nt p
4 p er th en d
4o er m m d er
m b in y re is
ok is f er m re
8a ti g g p ti
89 nt � p is v
am en b he g b
1c re o w er ri
oy in nd es te en
, q ai re f us
2 qu w f on g
o8 um de in le it
co am v en ent or
7 at da the ou nt
ay us ar de v f

The full distributions are plotted in graphical form. Here is the single character distribution. The leftmost points correspond to the top row of the middle table above.

Here is the word length distribution. (Note the striking similarity between mediaeval German and Voynich.)

Here is the frequency of the appearance of words. Note that the Latin Herbs (a specialized text) has a similar shape to the Voynich Herbal.

Here is the 1/2/3/4 letter group frequency plot.

See also: Comparison of "Voyn_101" and "PagesH" Transcriptions

Genetic Algorithm

Armed with the character analysis above, the following cipher presents itself as a possibility: in the plaintext, convert each group of 1, 2, 3 or 4 characters into a Voynich group of 1,2,3 or 4 characters. We call this a "mapping". For example, referring to the tables above, when creating Voynich from Latin, an obvious cipher just based on the frequencies of the groups would be:

e => o, i => 9, ...

er => 4o, is => ok, ti => 8a, ...

ent => 9k, ant => A, ...

This can be encoded into an algorithm thus which maps strings in "repl" to strings in "seek"

 

	String seek[] = {"4ok1",
			 "4oh","8am","1oe","4ok","ok1","o89","1oy","oh1","o8a","oha","ohc","c89","1co","k1o","1c9",
			 "c79","h1o","1o8","oko","oho","coe","8ae","co8","k19","h19","8ay","ham","hcc","koe","oka",
			 "hco",
			 "1o", "oe", "oh", "4o", "ok", "8a", "89", "am", "1c", "oy", "o8", "co", "ay",
			 "k1", "h1", "19", "hc", "c9", "ha", "ae", "79", "2o", "cc", "ko", "ho", "c8", "9h",
			 "9k", "c7", "2c", "ka", "kc", "1a", "an", "h9", "o,", "e8", "k9", "ap", "8o", "e,",
			 ",1", "7a", "81", 
			 "o",  "9",  "1",  "a",  "8",  "c",  "h",  "e",  "k",  "y",  "4",  "m",  ",",
			 "2",  "7",  "s",  "K",  "C",  "p",  "g",  "n",  "H",  "j",  "A"};

	
	//Latin
	String repl[] = {"un",
			 "ri", "on", "f",  "es", "g",  "em", "de", "se", "co", "ne", "ur", "si", "ic", "ui", "me",
			 "ere","eb", "la", "ma", "le", "id", "bu", "nti","no", "cu", "eba","qui","ie", "al", "ul",
			 "ns",
			 "c",  "d",  "l",  "er", "is", "ti", "nt", "en", "re", "in", "um", "am", "us",
			 "te", "it", "v",  "tu", "ta", "ra", "di", "an", "ni", "li", "et", "ba", "ae", "mi",
			 "ent","st", "h",  "nd", "ci", "pe", "im", "ua", "io", "tur","il", "ve", "iu", "as",
			 "vi", "ita","ca",
			 "e",  "i",  "a",  "t",  "u",  "s",  "r",  "n",  "m",  "o",  "p",  "b", "q",
			 "qu", "at", "or", "ia", "ar", "ce", "ib", "ec", "ab", "ru", "ant"};

Such an algorithm is used inside a Chromosome of the Genetic Algorithm. The Chromosome decodes Voynich into Latin by  matching character groups in the Voynich word against each of the strings in the "seek" list in turn. If a match occurs, then the  Voynich group is translated into the Latin group in the "repl" list at the same position. Thus "4ok1" in Voynich is translated into "un" in Latin.

Once the Voynich word has been translated into Latin, the Latin word is looked up in a Latin dictionary. If the word is found, then the "cost" of the Chromosome is increased ... if the word is not found, then the cost is decreased. After all words in the Voynich text have been converted to Latin, and the aggregate cost of the Chromosome evaluated, it can be judged whether the mapping "seek" to "repl" is a good one or not.

Generating the Chromosome Population

We generate a large number of Chromosomes, each of which has a different "seek" to "repl" mapping. We do this by simply shuffling the order of the "repl" strings in each Chromosome.

Thus, one Chromosome may map "4ok1" to "s" and another may map it to "qui".

This population of Chromosomes is then evaluated: each Chromosome converts the Voynich words to Latin, and each then gets a cost. The higher the cost, the better. The highest possible cost would be a Chromosome that had a seek-repl mapping that produced a valid Latin word for each Voynich word.

Training the Chromosomes

The Chromosomes are ordered in decreasing cost, and then the best of them (i.e. at the top of the list) are "mated" together to produce offspring Chromosomes. The mating process essentially involves taking sequences of the "repl" strings from both parents and combining them to form a new "repl" string.

Some of the offspring Chromosomes are then "mutated". This involves replacing one of the "repl" strings with some randomly selected letters from the Latin character set.

The process repeats (ordering the Chromosomes, mating the best ones, mutating the offspring) until a predefined cost value is reached, or the population of Chromosomes refuses to improve itself.

In the end, the best, trained Chromosome will contain the optimal arrangement of "seek" to "repl" mappings for conversion of Voynich to Latin.

The same procedure can be used for a Voynich to English, to German, French or any other language, provided that a dictionary and substantial texts are available to process.

First Results - Voynich to Latin

This is a limited attack on the first five "sentences" of f1r, using 200 chromosomes and a Latin dictionary of around 15,000 words. The best chromosome scores 9.4 after 500 training epochs (cf a score of 20 for a one-to-one translation of Latin into Latin).

Here are the deciphered sentences:

1) Voynich: fa19s 9hae ay Akam 2oe !oy9 �scs 9 hoy 2oe89 soy9 Hay oy9 hacy 1kam 2ay Ais Kay Kay 8aN s9aIy 2ch9 oy 9ham +o8 Koay9 Kcs 8ayam s9 8om okcc9 okcoy yoeok9 ?Aay 8am oham oy ohaN saz9 1cay Kam Jay Fam 98ayai29

Latin: ?ereieas vias is asasita meas ?ereis ?astuas is quinti mensis asereis vis ereis sttunti viasita viis as?as alis alis qui? asisere? nti quere ere viis ?ita alamisis altuas ereis asis quiita quantis querenti ntiviquis ?asis qui amita ere am? asere?is viis alis ?is ?is isereere?viis

2) Voynich: � o8ay ? !oe Go9 o98ay !?s Foam #o8ay9 9+c9 2o89 oh1o9 ok1oe 1oK9 os19 8an 1oy hos 8am 2os Foe 2o89 8an oy kco8(

Latin: ? quinti ? ?vi ?amis amisere ??as ?amis ?quintiis is?ere meam famis quas qualis amasie quid ere quias qui meas ?vi meam quid ere a ntita?

3) Voynich: � 98an Gcsam oes Gc9 9kan 2o29 Jo8ae cs oh2o h2o9 okazn okcoe ohaN 2o8an sv19 8am 2o9 Hc9 ho8am G9 Jo8aIes L9 2o oe7an 7 8an om 1 oe o8am 1o8an 189 ohan 7aN K9 ho8 8am 2Hc9 Hoy 1ay 3c9 hoe1oe 1oe hoy 1ae 3o 1oe 2o8aN h+9 h1e 8oy 1o8am 2o hccap 91o k1c9 1chan +cog2oe 89898 K9 8ai�9 (ko 3oe 3c ho82c9 Jcae9 8ay an 8an H98s 81ay +Kam ohaZ1c9 �19 �koe Koes 8aoEko 2oh 1oy 1c9 8an Hc9 okoe 8aM

Latin: ? isquid ?tuasis vias ?ere quis meviis ?quias tuas eme asmeis no?d quere am? mequid as?ie qui meis vere quiqui ?is ?qui?asas ?isme vitad ere quid amita ere quiita alis viam amd ta? alis quis qui vivere vere quinti ?ere quiasere ere quinti quias ?am ere mequi? as?is cas quinti alis me quere isqu vere vistd ?ereasmeas amams alis qui??is ?at ?vi ?tu quisquis ?tuasis ere is quid vissas quis ?alis am?qui ?ie ?as alvias quiam?at meas ere qui quid vere amas qui?

4) Voynich: Go 2am !oh1C9 1oe k2o8ccs9 2c9 g98C9 19 yo 8ay (7an 1oe 8an Kae 8ay +cay ham 8ay 5c9 Kay 1o?o ham 2oam ohoe 8am ?ay Koe 8am Koe8ay 91C9 ohC9 oh9 8am oh1c9 hoham o?1oe hA719 8ae 81co 2o89 ho?19 K9 oh1c9 hcc9 hcc9 8ae 1koy 2o? 1oe 1H1ok9 1okc9 81am

Latin: ?am viis ?fnsis ere atmesantasis quis asissnsis ie ntiam ere ?tad ere quid alas ere ?tuis qui ere ?ere alis qu?am qui meis quas qui ?is alvi qui alruis isvinsis ensis eis qui fere quiqui am?ere asasereie qui quere meam qui?ie alis fere quis quis qui viatnti me? ere vivquat quantis quis

5) Voynich: h1s9 1o8am oe oek1c9 1ay Fax ap 9kcc9 1ay oy o19 81o eho89 oho8ay 1o89 8o H9 HoH9 59 8h3C9 K9 hok1o89 8ae 8oe 1ohco 8az 8ap so1c9 1o ho89

Latin: casis alis vi vivere quinti ?ere? ere quantis quinti ere amie quam asquiam quere alis qui vis vamvis ?is sas?nsis alis quiquam qui quias quere qui? quin asamqui qu quiam


Here is a "translation" from Latin to English (chromsome cost ~5)

Latin: Magnus es, domine, et laudabilis valde: magna virtus tua, et sapientiae tuae non est numerus. et laudare te vult homo, aliqua portio creaturae tuae, et homo circumferens mortalitem suam, circumferens testimonium peccati sui et testimonium, quia superbis resi stis: et tamen laudare te vult homo, aliqua portio creaturae tuae.tu excitas, ut laudare te delectet, quia fecisti nos ad te et in quietum est cor nostrum, donec requiescat in te. da mihi, domine, scire et intellegere, utrum sit prius invocare te an laudare te,
 et scire te prius sit an invocare te. sed quis te invocat nesciens te? aliud enim pro alio potest invocare nesciens. an potius in
vocaris, ut sciaris? quomodo autem invocabunt, in quem non crediderunt? aut quomodo credent sine praedicante? et laudabunt dominum
 qui requirunt eum. quaerentes enim inveniunt eum et invenientes laudabunt eum. quaeram te, domine, invocans te, et invocem te cre
dens in te: praedicatus enim es nobis. invocat te, domine, fides mea, quam dedisti mihi, quam inspirasti mihi per humanitatem fili
i tui, per ministerium praedicatoris tui.

English: ?seouething be? seneenbe? her ngehaseatkese srabe? riouethse hathethese these? her sesestoudeeth theeth teeth bend ethingsing? her ngehaseras and sound rnesne? rasesese stssrane ingyengendeth theeth? her rnesne eltheelsoutide ssstheethands seets? eltheelsoutide andtheethiis haingyera seeth her andtheethiis? these sehahatheise yeyethese? her thetheeth ngehaseras and sound rnesne? rasesese stssrane ingyengendeth theeth?the s?elthese? hand ngehaseras and bevingandnd? these oungesera tese sese and her herthehering bend hathe tethedes? seinge yethebeyend her and? sese enrse? seneenbe? seelye her herandsevouti? handdes yend staing hersneyeye and the ngehaseras and? her seelye and staing yend the hersneyeye and? these these and hersneyend beseeltese and? raise teeth sttheneraou stneandthe hersneyeye beseeltese? the stneraing hersneyease? hand seelrase? senesnesene sehaands hersneyeouv? her seeth teeth ingyesebedev? sehand senesnesene ingyebev yebe stndsseyevs? her ngehaseatsend seneenething the yethedev sing? seethyevbe teeth herndmasend sing her herndmavbe ngehaseatsend sing? seethnds and? seneenbe? hersneyede and? her hersnenges and ingyebede her and? stndsseyethese teeth be teise? hersneyend and? seneenbe? ouvebe these? seeth besethese enrse? seeth hersestsendthese enrse hathe ringthengeands oukesese these? hathe enmathesis stndsseyendssse these?

I'm still digesting these, but some thoughts:

1) The size of the Dictionary used for training is important
2) Opinion: the translation Latin->English looks less plausible than the Voynich->Latin
3) After training, the solution found is just one solution - not necessarily the optimum solution (there may be many local minima).
4) Running the GA several times with the same input parameters yields different translations
5) The cost function only uses a dictionary lookup, but should perhaps penalise chromosomes that have multiple Voynich n-grams that map to the same Latin n-gram. E.g. if all Voynich n-grams mapped to "qui" then the chromosome would score very highly, but produce repetitious nonsense.

Results with a Refined GA and with Landini's "Challenge" Text

Here are some thought-provoking results from looking at Landini's challenge text, as suggested by Knox, the VM text, and comparisons with English, Latin, German, French and Spanish. These use a new form of the Genetic Algorithm, described below.

Summary

It looks to me like that Landini either generated his text from a transcription of the VM itself, or his algorithm for generating that text is a good emulation of the encoding process used in the VM. In other words the Landini "language" is a good candidate as a plaintext language for the VM, as opposed to the European languages tested.

Results

Here is a table which shows the GA's efficiency at converting/translating between Voynich, Landini, and the other languages.



(In the table, the best possible score is 1.0 - see below for an explanation)

Asking the GA to translate English to English, or Latin to Latin, etc. results in a high efficiency score, as expected. Note that the Landini to Landini  efficiency is 0.97 - almost perfect.

The GA performs moderately at converting between the languages and the Landini text. But what is most striking (to me) is the good efficiency for converting Voynich to Landini (0.74) and Landini to Voynich (0.89)

Some Notes on the table

To look at this I revised my GA code so that it was more flexible, and I jettisoned the use of separate dictionaries. Here is how the GA now functions. It can convert/translate between any language text samples.

1) Two text files are read in: the "source" text, and the "target" text. This could be, for example, a source file containing Landini's text, and a target file containing Spanish text, if we want to convert from Landini to Spanish.

2) The text in each file is processed separately, producing two word lists, and two sets of n-Gram frequency tables.

3) The chromosomes are generated with random mappings between the source n-Grams and the target n-Grams

4) The GA evolves the chromosomes by trying to maximise their cost. The difference now is that when a target word is generated from the source text using the mappings, it is looked up in the target word list created in 2) above, rather than in a separate dictionary.

5) After training, the best chromosome can have a maximum cost value of 1.0, which would correspond to a perfect conversion between the source text and the target
text (i.e. every word produced from the source text is found in the target text dictionary)

6) So we can feed the GA with two identical texts, and after training the score of the best chromosome should be 1.0, and indeed it approaches that (it doesn't quite
get there because only the top 100 n-Grams are translated, and so some characters in the source text cannot be translated).

7) The word and n-Gram frequency lists are made from the entirety of each text, but (for this exploratory study) the training takes place on only the first 50 "words" in the source text, and uses only the first 100 n-Grams for mapping.  Thus if the 50 words of Voynich chosen contain several rare characters, then for those the mapping will fail because those rare characters do not appear in the n-Gram list, and this will result in a lower score.

8) In all cases the "X->X" score in the table (i.e. the diagonal)  represents the best score possible for that language, and is a normalisation for the other numbers in the table. I should really revise the table and divide out the off-diagonal scores by the diagonal normalisations.

9) An improvement would be to configure the n-Gram list to be, say, 200 long, and use more source (Voynich) words for the training. The downside of this is mainly execution speed.

10) These runs were with n-Grams up to 3: it would be better to go to 4 at least.

11) I think Landini gets good scores because the character set he uses is very small. Knox comments " A factor must be that the Landini Challenge has built-in frequency matches to any transcription of the VMs. Also, there is no meaningful correspondence in the letter sequence of one word to another in Landini. The difficulty fits what I said the VMs may be."


Letter and nGram Frequencies

Here are the letter frequencies for the whole of the VM, in the Voyn_101 encoding

Rank Letter Counts Frequency
1 o 25055 1.58E-01
2 9 17441 1.10E-01
3 a 14499 9.12E-02
4 c 13751 8.65E-02
5 1 11059 6.96E-02
6 e 10658 6.70E-02
7 8 10349 6.51E-02
8 h 9926 6.24E-02
9 y 6718 4.23E-02
10 k 6120 3.85E-02
11 4 5411 3.40E-02
12 m 4112 2.59E-02
13 2 3362 2.11E-02
14 C 2844 1.79E-02
15 7 2718 1.71E-02
16 s 2705 1.70E-02
17 n 1766 1.11E-02
18 p 989 6.22E-03
19 K 910 5.72E-03
20 g 879 5.53E-03
21 H 866 5.45E-03
22 A 769 4.84E-03
23 j 564 3.55E-03
24 3 519 3.26E-03
25 z 511 3.21E-03
26 5 412 2.59E-03
27 ( 402 2.53E-03
28 d 336 2.11E-03
29 i 264 1.66E-03
30 f 233 1.47E-03
31 u 177 1.11E-03
32 M 174 1.09E-03
33 * 164 1.03E-03
34 Z 154 9.69E-04
35 % 140 8.80E-04
36 N 132 8.30E-04
37 J 122 7.67E-04
38 6 118 7.42E-04
39 I 98 6.16E-04
40 x 98 6.16E-04
41 ? 96 6.04E-04
42 + 95 5.97E-04
43 W 92 5.79E-04
44 G 89 5.60E-04
45 Y 55 3.46E-04
46 E 54 3.40E-04
47 ! 50 3.14E-04
48 l 40 2.52E-04
49 t 40 2.52E-04
50 P 38 2.39E-04
51 S 38 2.39E-04
52 & 36 2.26E-04
53 F 35 2.20E-04
54 # 34 2.14E-04
55 L 32 2.01E-04
56 U 32 2.01E-04
57 b 30 1.89E-04
58 � 30 1.89E-04
59 Q 27 1.70E-04
60 T 27 1.70E-04
61 | 25 1.57E-04
62 $ 24 1.51E-04
63 w 24 1.51E-04
64 V 23 1.45E-04
65 \ 17 1.07E-04
66 X 14 8.80E-05
67 r 12 7.55E-05
68 R 11 6.92E-05
69 q 11 6.92E-05
70 � 11 6.92E-05
71 D 10 6.29E-05
72 v 10 6.29E-05
73 � 10 6.29E-05
74 � 10 6.29E-05
75 � 10 6.29E-05
76 � 10 6.29E-05
77 � 9 5.66E-05
78 ^ 8 5.03E-05
79 � 8 5.03E-05
80 � 8 5.03E-05
81 � 8 5.03E-05
82 � 8 5.03E-05
83 � 7 4.40E-05
84 � 7 4.40E-05
85 � 7 4.40E-05
86 B 6 3.77E-05
87 � 6 3.77E-05
88 � 6 3.77E-05
89 � 6 3.77E-05
90 � 6 3.77E-05
91 � 6 3.77E-05
92 � 6 3.77E-05
93 � 5 3.14E-05
94 � 5 3.14E-05
95 � 5 3.14E-05
96 � 5 3.14E-05
97 � 5 3.14E-05
98 � 5 3.14E-05
99 � 5 3.14E-05
100 � 4 2.52E-05
101 � 4 2.52E-05
102 � 4 2.52E-05
103 � 4 2.52E-05
104 � 4 2.52E-05
105 � 4 2.52E-05
106 � 4 2.52E-05
107 � 4 2.52E-05
108 � 3 1.89E-05
109 � 3 1.89E-05
110 � 3 1.89E-05
111 � 3 1.89E-05
112 � 3 1.89E-05
113 � 3 1.89E-05
114 � 3 1.89E-05
115 � 3 1.89E-05
116 � 3 1.89E-05
117 0 2 1.26E-05
118 _ 2 1.26E-05
119 � 2 1.26E-05
120 � 2 1.26E-05
121 � 2 1.26E-05
122 � 2 1.26E-05
123 � 2 1.26E-05
124 � 2 1.26E-05
125 � 2 1.26E-05
126 � 2 1.26E-05
127 � 2 1.26E-05
128 � 2 1.26E-05
129 � 2 1.26E-05
130 � 2 1.26E-05
131 � 2 1.26E-05
132 � 2 1.26E-05
133 � 1 6.29E-06
134 � 1 6.29E-06
135 � 1 6.29E-06
136 � 1 6.29E-06
137 � 1 6.29E-06
138 � 1 6.29E-06
139 � 1 6.29E-06
140 � 1 6.29E-06
141 � 1 6.29E-06
142 � 1 6.29E-06
143 � 1 6.29E-06
144 � 1 6.29E-06
145 � 1 6.29E-06
146 � 1 6.29E-06
147 � 1 6.29E-06
148 � 1 6.29E-06
149 � 1 6.29E-06
150 � 1 6.29E-06
151 � 1 6.29E-06
152 � 1 6.29E-06
153 � 1 6.29E-06
154 � 1 6.29E-06
155 � 1 6.29E-06
156 � 1 6.29E-06
157 � 1 6.29E-06
158 � 1 6.29E-06
159 � 1 6.29E-06
160 � 1 6.29E-06
161 � 1 6.29E-06
162 � 1 6.29E-06
163 � 1 6.29E-06
164 � 1 6.29E-06
165 � 1 6.29E-06
166 � 1 6.29E-06
167 � 1 6.29E-06
168 � 1 6.29E-06
169 � 1 6.29E-06
170 � 1 6.29E-06
171 � 1 6.29E-06
172 � 1 6.29E-06

Here are the word length frequency counts

Word Length Count Frequency
1 1096 0.028637124
2 4069 0.10631794
3 8412 0.21979515
4 9354 0.24440844
5 8425 0.22013482
6 4489 0.11729202
7 1721 0.0449676
8 471 0.012306647
9 155 0.004049958
10 52 0.001358696
11 13 3.40E-04
12 10 2.61E-04
13 3 7.84E-05
14 2 5.23E-05

The most frequent words

Rank Word Count Frequency
1 8am 704 0.018394649
2 oe 569 0.014867266
3 am 511 0.013351798
4 ay 391 0.010216346
5 oy 373 0.009746028
6 1oe 354 0.009249582
7 1c9 333 0.008700878
8 1c89 332 0.008674749
9 s 295 0.007707985
10 4ohan 277 0.007237667
11 8ay 273 0.007133152
12 ae 265 0.006924122
13 4oham 248 0.006479933
14 4ohC9 236 0.006166388
15 1oy 206 0.005382525
16 2c89 202 0.00527801
17 oham 198 0.005173495
18 89 192 0.005016722
19 8ae 187 0.004886079
20 2c9 183 0.004781564
21 4ohae 182 0.004755435
22 8an 174 0.004546405
23 4ohc89 171 0.004468018
24 1c79 166 0.004337375
25 9 160 0.004180602
26 1coe 157 0.004102216
27 19 150 0.003919315
28 y 150 0.003919315
29 ohae 149 0.003893186
30 okam 149 0.003893186
31 4oe 144 0.003762542
32 okay 142 0.003710284
33 4ohay 141 0.003684155
34 4ohC89 141 0.003684155
35 2oe 135 0.003527383
36 1H9 132 0.003448997
37 4oh9 132 0.003448997
38 okae 131 0.003422868
39 ohan 130 0.003396739
40 ohay 124 0.003239967
41 sam 120 0.003135452
42 an 119 0.003109323
43 ohC9 114 0.002978679
44 ok9 113 0.00295255
45 e 112 0.002926421
46 7am 105 0.00274352
47 189 101 0.002639005
48 1C9 101 0.002639005
49 oh9 100 0.002612876
50 4ohc9 99 0.002586748

Here is the 1/2/3/4 nGram frequency for the whole VM, showing only the top 100

Rank nGram Counts Frequency
1 o 96575 0.058597673
2 9 64626 0.039212354
3 c 58862 0.035715003
4 a 48135 0.029206306
5 h 40911 0.024823084
6 1 40709 0.024700519
7 8 39703 0.02409012
8 e 37842 0.022960944
9 oh 25419 0.015423186
10 k 24626 0.014942028
11 4 23615 0.014328594
12 4o 23113 0.014024003
13 y 21208 0.012868128
14 89 20822 0.012633919
15 oe 19757 0.011987722
16 1c 18455 0.011197723
17 ok 16526 0.010027286
18 4oh 13820 0.008385398
19 co 13222 0.008022557
20 c8 12984 0.007878148
21 hc 12020 0.007293234
22 ae 11891 0.007214962
23 m 11833 0.00717977
24 8a 11636 0.007060238
25 2 11192 0.006790838
26 am 11136 0.006756859
27 ha 11119 0.006746545
28 ay 11044 0.006701037
29 C 11027 0.006690722
30 7 10911 0.006620339
31 c89 10128 0.006145247
32 c9 9422 0.005716876
33 1o 9387 0.005695639
34 o8 9021 0.005473566
35 cc 8475 0.005142276
36 s 8212 0.004982698
37 oy 7839 0.004756378
38 ohc 7823 0.004746669
39 79 7477 0.004536731
40 oha 7256 0.004402638
41 kc 6670 0.004047077
42 2c 5996 0.003638122
43 n 5830 0.0035374
44 hC 5729 0.003476118
45 ka 5717 0.003468837
46 an 5695 0.003455488
47 c7 5508 0.003342024
48 4ok 5242 0.003180627
49 eh 5043 0.003059882
50 okc 4819 0.002923968
51 1c8 4775 0.002897271
52 c79 4448 0.00269886
53 h1 4431 0.002688546
54 1co 4396 0.002667309
55 k1 4357 0.002643646
56 o89 4247 0.002576902
57 4ohc 4161 0.002524721
58 oka 4152 0.00251926
59 C9 4077 0.002473753
60 4oha 4053 0.002459191
61 g 4011 0.002433707
62 hcc 3712 0.002252287
63 coe 3707 0.002249253
64 ohC 3697 0.002243185
65 co8 3575 0.002169161
66 1c89 3573 0.002167947
67 8am 3520 0.002135789
68 19 3447 0.002091496
69 p 3320 0.002014437
70 e1 3300 0.002002302
71 9h 3283 0.001991987
72 o8a 3178 0.001928278
73 1c9 3036 0.001842118
74 ko 3030 0.001838477
75 ho 2950 0.001789937
76 oeh 2895 0.001756565
77 hco 2784 0.001689215
78 1oe 2679 0.001625505
79 C8 2670 0.001620044
80 ham 2660 0.001613977
81 hae 2647 0.001606089
82 ok1 2625 0.00159274
83 j 2624 0.001592134
84 9k 2619 0.0015891
85 A 2601 0.001578178
86 ya 2577 0.001563616
87 ap 2567 0.001557548
88 H 2552 0.001548447
89 4ohC 2503 0.001518716
90 oh1 2496 0.001514469
91 7a 2490 0.001510828
92 8ae 2487 0.001509008
93 8ay 2471 0.0014993
94 han 2399 0.001455613
95 K 2394 0.001452579
96 h9 2367 0.001436197
97 cc8 2359 0.001431343
98 2o 2357 0.001430129
99 18 2329 0.00141314
100 eo 2225 0.001350037

 

Bibliography

Jan Hurych's VM Letter Frequency Analysis

http://hurontaria.baf.cz/CVM/b2.htm

Voynich Transcription

http://voynichcentral.com/transcriptions/Voynich-101/index.html

Book of Courtier (1561)

http://www.uoregon.edu/~rbear/courtier/courtier.html

Augustinus Confessiones

http://ccat.sas.upenn.edu/jod/latinconf/

1367 French

http://www.ucc.ie/celt/published/F300001-001/index.html

Latin garden description

http://www.uni-duisburg.de/Institute/CollCart/hortus/strabo/HB-ceref.htm

Sabina Welserin Cookbook (German, c 1553)

http://www.uni-giessen.de/gloning/tx/sawe.htm

Latin scribal abbreviations

http://www.ualberta.ca/~sreimer/ms-course/course/abbrevtn.htm

Mediaeval english herb names

http://www.gardenhistoryinfo.com/medieval/medievalherbs.html

Martin Luther Bible

http://quod.lib.umich.edu/cgi/l/luther/luther-idx?type=DIV2&byte=2160