TY - GEN
T1 - Experiments with filtered detection of similar academic papers
AU - HaCohen-Kerner, Yaakov
AU - Tayeb, Aharon
PY - 2012
Y1 - 2012
AB - In this research, we investigate the efficient detection of similar academic papers. Given a specific paper and a corpus of academic papers, most of the papers in the corpus are first filtered out using a fast filter method. Then, 47 methods (baseline methods and combinations of them) are applied to detect similar papers; 34 of these are variants of new methods. These 34 methods are divided into three new method sets: rare-word methods, combinations of at least two methods, and methods that compare portions of the papers. Some of the 34 heuristic methods achieve better results than previous heuristic methods when measured against the results of the "Full Fingerprint" (FF) method, an expensive method that served as the expert. Moreover, the new methods run much faster than the FF method. The most interesting finding is a method called CWA(1), which computes the frequency of rare words that appear only once in both compared papers. This method was found to be an efficient measure for checking whether two papers are similar.
KW - Corpus
KW - Detection
KW - Filtering
KW - Fingerprinting
KW - Heuristic methods
KW - Similar academic papers
UR - http://www.scopus.com/inward/record.url?scp=84866647279&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-33185-5_1
DO - 10.1007/978-3-642-33185-5_1
M3 - Conference contribution
AN - SCOPUS:84866647279
SN - 9783642331848
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 1
EP - 13
BT - Artificial Intelligence
T2 - 15th International Conference on Artificial Intelligence: Methodology, Systems, and Applications, AIMSA 2012
Y2 - 12 September 2012 through 15 September 2012
ER -