In this paper, we present a comparative study of news document classification using various supervised machine learning methods and different combinations of key-phrases (word N-grams extracted from the text) and visual features (extracted from a representative image of each document). The application domain is news documents written in English that belong to four categories: Health, Lifestyle-Leisure, Nature-Environment and Politics. Using the N-gram textual feature set alone yielded an accuracy of 81.0%, which is much better than the accuracy (58.4%) obtained using the visual feature set alone. A comparison of three classification methods, together with a feature selection method and parameter tuning, improved the accuracy to 86.7%, achieved by the Random Forests method.
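As a rough illustration of the word N-gram key-phrases mentioned in the abstract, the following minimal sketch counts word N-grams in a text and maps a document onto a fixed vocabulary of N-grams. The function names `word_ngrams` and `vectorize`, the simple alphabetic tokenization, and the N-gram length range are assumptions made for illustration only; they are not the software tool used by the authors.

```python
import re
from collections import Counter


def word_ngrams(text, n_min=1, n_max=3):
    """Count word N-grams of length n_min..n_max in a text (illustrative only)."""
    # Assumed tokenization: lowercase, alphabetic runs only.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens over the token sequence.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def vectorize(text, vocabulary, n_min=1, n_max=3):
    """Turn a text into a list of N-gram counts, one entry per vocabulary item."""
    counts = word_ngrams(text, n_min, n_max)
    # A Counter returns 0 for N-grams that do not occur in the text.
    return [counts[gram] for gram in vocabulary]


if __name__ == "__main__":
    vocab = ["climate", "climate change", "health"]  # hypothetical vocabulary
    print(vectorize("Climate change and climate policy", vocab, 1, 2))
```

Vectors of this kind could then be passed to any supervised classifier, such as a Random Forests implementation, possibly after a feature selection step.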
|Title of host publication||Semantic Keyword-Based Search on Structured Data Sources: First COST Action IC1302 – International KEYSTONE Conference, IKC 2015, Revised Selected Papers|
|Editors||Yannis Velegrakis, Jorge Cardoso, Alexandre Miguel Pinto, Francesco Guerra, Geert-Jan Houben|
|Number of pages||12|
|State||Published - 2015|
|Event||1st COST Action IC1302 International KEYSTONE Conference on Semantic Keyword-Based Search on Structured Data Sources, IKC 2015 - Coimbra, Portugal|
Duration: 8 Sep 2015 → 9 Sep 2015
|Name||Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)|
|Conference||1st COST Action IC1302 International KEYSTONE Conference on Semantic Keyword-Based Search on Structured Data Sources, IKC 2015|
|Period||8/09/15 → 9/09/15|
Bibliographical note
Funding Information:
This work was supported by the MULTISENSOR project, partially funded by the European Commission under contract number FP7-610411. The authors would also like to thank Avi Rosenfeld, Maor Tzidkani and Daniel Nissim Cohen from the Jerusalem College of Technology, Lev Academic Center, for providing the software tool used to generate the textual features in this research. The authors would also like to acknowledge the networking support of the COST Action IC1302: semantic KEYword-based Search on sTructured data sOurcEs (KEYSTONE) and the COST Action IC1307: The European Network on Integrating Vision and Language (iV&L Net).
© Springer International Publishing Switzerland 2015.
Keywords
- Document classification
- Feature selection
- N-gram features
- Supervised learning
- Visual features