TY - JOUR
T1 - Authorship attribution of responsa using clustering
AU - Hacohen-Kerner, Yaakov
AU - Margaliot, Orr
PY - 2014/8/18
Y1 - 2014/8/18
N2 - Authorship attribution of text documents is a "hot" domain in research; however, almost all of its applications use supervised machine learning (ML) methods. In this research, we explore authorship attribution as a clustering problem, that is, we attempt to complete the task of authorship attribution using unsupervised machine learning methods. The application domain is responsa, which are answers written by well-known Jewish rabbis in response to various Jewish religious questions. We have built a corpus of 6,079 responsa, composed by five authors who lived mainly in the 20th century and containing almost 10 M words. The clustering tasks that have been performed were according to two or three or four or five authors. Clustering has been performed using three kinds of word lists: most frequent words (FW) including function words (stopwords), most frequent filtered words (FFW) excluding function words, and words with the highest variance values (HVW); and two unsupervised machine learning methods: K-means and Expectation Maximization (EM). The best clustering tasks according to two or three or four authors achieved results above 98%, and the improvement rates were above 40% in comparison to the "majority" (baseline) results. The EM method has been found to be superior to K-means for the discussed tasks. FW has been found as the best word list, far superior to FFW. FW, in contrast to FFW, includes function words, which are usually regarded as words that have little lexical meaning. This might imply that normalized frequencies of function words can serve as good indicators for authorship attribution using unsupervised ML methods. This finding supports previous findings about the usefulness of function words for other tasks, such as authorship attribution, using supervised ML methods, and genre and sentiment classification.
AB - Authorship attribution of text documents is a "hot" domain in research; however, almost all of its applications use supervised machine learning (ML) methods. In this research, we explore authorship attribution as a clustering problem, that is, we attempt to complete the task of authorship attribution using unsupervised machine learning methods. The application domain is responsa, which are answers written by well-known Jewish rabbis in response to various Jewish religious questions. We have built a corpus of 6,079 responsa, composed by five authors who lived mainly in the 20th century and containing almost 10 M words. The clustering tasks that have been performed were according to two or three or four or five authors. Clustering has been performed using three kinds of word lists: most frequent words (FW) including function words (stopwords), most frequent filtered words (FFW) excluding function words, and words with the highest variance values (HVW); and two unsupervised machine learning methods: K-means and Expectation Maximization (EM). The best clustering tasks according to two or three or four authors achieved results above 98%, and the improvement rates were above 40% in comparison to the "majority" (baseline) results. The EM method has been found to be superior to K-means for the discussed tasks. FW has been found as the best word list, far superior to FFW. FW, in contrast to FFW, includes function words, which are usually regarded as words that have little lexical meaning. This might imply that normalized frequencies of function words can serve as good indicators for authorship attribution using unsupervised ML methods. This finding supports previous findings about the usefulness of function words for other tasks, such as authorship attribution, using supervised ML methods, and genre and sentiment classification.
KW - Hebrew
KW - authorship attribution
KW - responsa
KW - text clustering
KW - unsupervised machine learning methods
KW - word lists
UR - http://www.scopus.com/inward/record.url?scp=84906562020&partnerID=8YFLogxK
U2 - 10.1080/01969722.2014.945311
DO - 10.1080/01969722.2014.945311
M3 - ???researchoutput.researchoutputtypes.contributiontojournal.article???
AN - SCOPUS:84906562020
SN - 0196-9722
VL - 45
SP - 530
EP - 545
JO - Cybernetics and Systems
JF - Cybernetics and Systems
IS - 6
ER -