TY - JOUR
T1 - Automatically identifying citations in Hebrew-Aramaic documents
AU - Hacohen-Kerner, Yaakov
AU - Schweitzer, Nadav
AU - Mughaz, Dror
PY - 2011
Y1 - 2011/3
N2 - Citations in documents contain important information about the sources that authors cite and their importance and impact. Therefore, automatic identification of citations from documents is an important task. Citations included in rabbinic literature are more difficult to identify and to extract than citations in scientific papers written in English for various reasons. The aim of this novel research is to automatically identify undated citations included in a unique data set: rabbinic documents written in Hebrew-Aramaic. We formulate four feature sets: orthographic, quantitative, stopword-based, and n-gram-based. Experiments were performed on all combinations of these feature sets using six common machine learning methods and Infogain. A combination of all four feature sets using logistic regression achieves an accuracy of 91.98%, which is an improvement of 16.53% compared to a baseline result.
AB - Citations in documents contain important information about the sources that authors cite and their importance and impact. Therefore, automatic identification of citations from documents is an important task. Citations included in rabbinic literature are more difficult to identify and to extract than citations in scientific papers written in English for various reasons. The aim of this novel research is to automatically identify undated citations included in a unique data set: rabbinic documents written in Hebrew-Aramaic. We formulate four feature sets: orthographic, quantitative, stopword-based, and n-gram-based. Experiments were performed on all combinations of these feature sets using six common machine learning methods and Infogain. A combination of all four feature sets using logistic regression achieves an accuracy of 91.98%, which is an improvement of 16.53% compared to a baseline result.
KW - Hebrew-Aramaic documents
KW - citation identification
KW - knowledge discovery
KW - machine learning methods
KW - undated documents
UR - http://www.scopus.com/inward/record.url?scp=79956101521&partnerID=8YFLogxK
U2 - 10.1080/01969722.2011.567893
DO - 10.1080/01969722.2011.567893
M3 - Article
AN - SCOPUS:79956101521
SN - 0196-9722
VL - 42
SP - 180
EP - 197
JO - Cybernetics and Systems
JF - Cybernetics and Systems
IS - 3
ER -