MsBERT: A New Model for the Reconstruction of Lacunae in Hebrew Manuscripts

Avi Shmidman, Ometz Shmidman, Hillel Gershuni, Moshe Koppel

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

Hebrew manuscripts provide thousands of textual transmissions of post-Biblical Hebrew texts. In many cases, the text in the manuscripts is not fully decipherable, whether due to deterioration, perforation, burns, or otherwise. Existing BERT models for Hebrew struggle to fill these gaps, due to the many orthographical deviations found in Hebrew manuscripts. We have pretrained a new dedicated BERT model, dubbed MsBERT (short for: Manuscript BERT), designed from the ground up to handle Hebrew manuscript text. MsBERT substantially outperforms all existing Hebrew BERT models regarding the prediction of missing words in fragmentary Hebrew manuscript transcriptions in multiple genres, as well as regarding the task of differentiating between quoted passages and exegetical elaborations. We provide MsBERT for free download and unrestricted use, and we also provide an interactive and user-friendly website to allow manuscript scholars to leverage the power of MsBERT in their scholarly work of reconstructing fragmentary Hebrew manuscripts.

Original language: English
Title of host publication: ML4AL 2024 - 1st Workshop on Machine Learning for Ancient Languages, Proceedings of the Workshop
Editors: John Pavlopoulos, Thea Sommerschield, Yannis Assael, Shai Gordin, Kyunghyun Cho, Marco Passarotti, Rachele Sprugnoli, Yudong Liu, Bin Li, Adam Anderson
Publisher: Association for Computational Linguistics (ACL)
Pages: 13-18
Number of pages: 6
ISBN (Electronic): 9798891761445
State: Published - 2024
Event: 1st Workshop on Machine Learning for Ancient Languages, ML4AL 2024 - Hybrid, Bangkok, Thailand
Duration: 15 Aug 2024 → …

Publication series

Name: ML4AL 2024 - 1st Workshop on Machine Learning for Ancient Languages, Proceedings of the Workshop

Conference

Conference: 1st Workshop on Machine Learning for Ancient Languages, ML4AL 2024
Country/Territory: Thailand
City: Hybrid, Bangkok
Period: 15/08/24 → …

Bibliographical note

Publisher Copyright:
© 2024 Association for Computational Linguistics.
