Yandex neural networks will decipher handwritten archival documents


Yandex has taught neural networks to decipher archival records written in pre-revolutionary Russian orthography. The technology can be tried right now in the Archive Search service, which gives everyone access to more than 2.5 million pages of historical documents together with their text transcriptions. The new algorithm is built on an optical character recognition system. It accounts for the peculiarities of handwriting, recognizes letters that are no longer in use, and understands the particular structure of archival documents.


The company's specialists trained the neural network on a dataset of hundreds of thousands of handwritten lines from real texts of the 18th and 19th centuries, plus tens of millions of generated examples. Experts labeled and transcribed the training materials and also checked the quality of recognition. Yandex technology turns manuscripts that an untrained person would struggle to read into printed text almost instantly. As a result, you can quickly find documents in the service's database that mention a surname, a locality, or any other word.
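Once handwriting has been turned into text, finding a surname becomes an ordinary text search problem. The sketch below is only an illustration of that idea and not a description of how Archive Search is built: it assumes a simple in-memory inverted index and assumes that obsolete letters are mapped to their modern equivalents before indexing, so that a query typed in modern spelling matches a pre-revolutionary transcript. The function names (`normalize`, `build_index`, `search`) and the sample pages are hypothetical.

```python
import re
from collections import defaultdict

# Standard correspondences between pre-1918 letters and modern Russian ones.
# Applying them at index time is an assumption made for this sketch.
OLD_TO_NEW = str.maketrans({"ѣ": "е", "і": "и", "ѳ": "ф", "ѵ": "и"})


def normalize(word: str) -> str:
    """Lowercase a word, modernize obsolete letters, drop a word-final hard sign."""
    word = word.lower().translate(OLD_TO_NEW)
    return word[:-1] if word.endswith("ъ") else word


def tokenize(text: str) -> list[str]:
    """Split text into normalized word tokens."""
    return [normalize(w) for w in re.findall(r"\w+", text)]


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map every normalized token to the set of page IDs that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for page_id, text in pages.items():
        for token in tokenize(text):
            index[token].add(page_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the IDs of pages that contain every word of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical transcripts, invented for the example.
pages = {
    "p1": "Крестьянинъ Иванъ Петровъ",
    "p2": "Дѣвица Марія Иванова",
}
index = build_index(pages)
print(search(index, "Петров"))
print(search(index, "девица мария"))
```

With this normalization, the modern-spelling query "Петров" finds the page containing "Петровъ", and "девица мария" finds the page containing "Дѣвица Марія". A production search engine would add much more (morphology, fuzzy matching for recognition errors, ranking), none of which is shown here.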

"A professional may need up to half an hour to decipher one page of handwritten archival text, while our service does it in a few seconds," says Elena Bubnova, head of Yandex Search. "In the future, the technology can be used to solve other tasks in Yandex products."

Archive Search will make the work of historians, sociologists, demographers and genealogists more efficient, and it will help people who are looking for information about their families. The first collection presented in the service came from the Glavarchiv of Moscow, and the developers trained the neural network on its materials. The database has since been expanded with documents from the archives of the Orenburg and Novgorod regions. Over time, the number of repositories and available scanned files will grow.

You can search materials from the 18th to the early 20th century, which are the ones most in demand among users. These are parish registers (metric books), confession lists, and revision lists, which record the results of population censuses. Documents can be found through the catalog or the search bar, and there are filters by year, archive, fonds and inventory. Next to the scan of each page, the service displays a line-by-line transcript produced by Yandex neural networks. When you hover the cursor over a fragment of the transcript, the corresponding area is immediately highlighted on the digital copy.
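The link between a transcript line and its place on the scan can be pictured as each recognized line carrying the coordinates of the region it was read from. The following is a minimal sketch under that assumption. The `TranscriptLine` structure, its pixel-based bounding box, and both helper functions are hypothetical and are not taken from the service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TranscriptLine:
    """One recognized line plus its bounding box on the scan, in pixels (assumed layout)."""
    text: str
    x: int
    y: int
    width: int
    height: int


def highlight_region(lines: list[TranscriptLine], index: int) -> tuple[int, int, int, int]:
    """Return the (x, y, width, height) box to highlight for the hovered transcript line."""
    line = lines[index]
    return (line.x, line.y, line.width, line.height)


def line_at_point(lines: list[TranscriptLine], px: int, py: int) -> Optional[int]:
    """Reverse lookup: index of the line whose box contains the point, or None."""
    for i, line in enumerate(lines):
        inside_x = line.x <= px < line.x + line.width
        inside_y = line.y <= py < line.y + line.height
        if inside_x and inside_y:
            return i
    return None


# Hypothetical page with two recognized lines.
page_lines = [
    TranscriptLine("Иванъ Петровъ", 40, 100, 600, 38),
    TranscriptLine("Марія Иванова", 40, 145, 600, 38),
]
print(highlight_region(page_lines, 1))
print(line_at_point(page_lines, 50, 110))
```

Storing the box alongside the text keeps the lookup trivial in both directions: from a hovered transcript line to a region on the scan, and from a point on the scan back to its line.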