Studying the history of the Arabic language: language technology and a large-scale historical corpus

Yonatan Belinkov, Alexander Magidow, Alberto Barrón-Cedeño, Avi Shmidman, Maxim Romanov

Research output: Contribution to journal › Article › peer-review

12 Scopus citations

Abstract

Arabic is a widely-spoken language with a long and rich history, but existing corpora and language technology focus mostly on modern Arabic and its varieties. Therefore, studying the history of the language has so far been mostly limited to manual analyses on a small scale. In this work, we present a large-scale historical corpus of the written Arabic language, spanning 1400 years. We describe our efforts to clean and process this corpus using Arabic NLP tools, including the identification of reused text. We study the history of the Arabic language using a novel automatic periodization algorithm, as well as other techniques. Our findings confirm the established division of written Arabic into Modern Standard and Classical Arabic, and confirm other established periodizations, while suggesting that written Arabic may be divisible into still further periods of development.

Original language: English
Pages (from-to): 771-805
Number of pages: 35
Journal: Language Resources and Evaluation
Volume: 53
Issue number: 4
State: Published - 1 Dec 2019

Bibliographical note

Publisher Copyright:
© 2019, Springer Nature B.V.

Funding

This research was partly supported by the HBKU Qatar Computing Research Institute (QCRI), as part of a collaboration with the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). Y.B. was also supported by the Harvard Mind, Brain, Behavior Initiative. This research was also partly supported by the Israel Science Foundation (Grant No. 977/16), and by DICTA: The Israel Center For Text Analysis.

Funders                                                        Funder number
CSAIL
DICTA
HBKU Qatar Computing Research Institute
Israel Center For Text Analysis
MIT Computer Science and Artificial Intelligence Laboratory
QCRI
Israel Science Foundation                                      977/16

    Keywords

    • Arabic
    • Corpus
    • Historical linguistics
    • Periodization
    • Text reuse
