TY - JOUR
T1 - Automatic alphabet recognition
AU - Geffet, Maayan
AU - Wiseman, Yair
AU - Feitelson, Dror
PY - 2005
Y1 - 2005
N2 - The last step of the information retrieval process is to display the retrieved documents to the user, and difficulties can arise at that point. English texts are usually written in the ASCII standard. Many other languages, by contrast, have several character sets and no single standard. This plurality of standards causes problems, especially in a web environment, where one may download a document encoded in an unknown standard. This paper proposes a purely automatic way of identifying the standard used by the document's writer, based on the statistical distribution of letters in the language. We developed a vector-space-based method that creates frequency vectors for each letter of the language and then matches a new document's vectors to the precomputed templates. The algorithm was applied to various types of corpora in Hebrew, Russian and English, and provides an efficient solution to the stated problem in most cases.
AB - The last step of the information retrieval process is to display the retrieved documents to the user, and difficulties can arise at that point. English texts are usually written in the ASCII standard. Many other languages, by contrast, have several character sets and no single standard. This plurality of standards causes problems, especially in a web environment, where one may download a document encoded in an unknown standard. This paper proposes a purely automatic way of identifying the standard used by the document's writer, based on the statistical distribution of letters in the language. We developed a vector-space-based method that creates frequency vectors for each letter of the language and then matches a new document's vectors to the precomputed templates. The algorithm was applied to various types of corpora in Hebrew, Russian and English, and provides an efficient solution to the stated problem in most cases.
KW - Character set
KW - Letter mapping
KW - Natural language alphabet
UR - http://www.scopus.com/inward/record.url?scp=22044449465&partnerID=8YFLogxK
U2 - 10.1023/b:inrt.0000048495.64628.ea
DO - 10.1023/b:inrt.0000048495.64628.ea
M3 - Article
SN - 1386-4564
VL - 8
SP - 25
EP - 40
JO - Information Retrieval
JF - Information Retrieval
IS - 1
ER -