Abstract
Many language identification (LID) systems are based on language models using techniques that consider the fluctuation of speech over time. Considering these fluctuations necessitates longer recording intervals to obtain reasonable accuracy. Our research extracts features from short recording intervals to enable successful classification of spoken language. The feature extraction process is based on frames of 20 ms, whereas most previous LID systems reported results based on much longer frames (3 s or longer). We defined and implemented 200 features divided into four feature sets: cepstrum features, RASTA features, spectrum features, and waveform features. We applied eight machine learning (ML) methods to the features extracted from a corpus containing speech files in 10 languages from the Oregon Graduate Institute (OGI) telephone speech database, and compared their performance in an extensive experimental evaluation. The best optimized classification results were achieved by random forest (RF): from 76.29% on 10 languages to 89.18% on 2 languages. These results are better than or comparable to the state-of-the-art results for the OGI database. A second set of experiments addressed gender classification on 2 to 10 languages. The accuracy and F-measure values of the RF method in all of these language experiments were greater than or equal to 90.05%.
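The frame-based extraction step described in the abstract can be illustrated with a minimal sketch. Everything beyond the 20 ms frame length is an assumption made for illustration: the abstract does not define the 200 features, so the two waveform-style features used here (mean energy and zero-crossing rate), the 8 kHz sampling rate, the non-overlapping framing, and the function names `split_into_frames` and `frame_features` are hypothetical choices, not the authors' implementation. The classification step (for example, a random forest trained on the per-frame feature vectors) is omitted.

```python
import math


def split_into_frames(samples, sample_rate, frame_ms=20):
    """Split a sample sequence into non-overlapping frames of frame_ms milliseconds.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len < 2:
        raise ValueError("frame is too short for the given sample rate")
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def frame_features(frame):
    """Return two illustrative waveform features for one frame.

    These are stand-ins chosen for the sketch, not the paper's feature set.
    """
    # Mean energy: average of the squared sample values.
    energy = sum(x * x for x in frame) / len(frame)
    # Zero-crossing rate: fraction of adjacent sample pairs that change sign.
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    zcr = crossings / (len(frame) - 1)
    return [energy, zcr]


# Synthetic input: one second of a 440 Hz tone sampled at 8 kHz
# (assumed rate, typical for telephone speech).
sample_rate = 8000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

frames = split_into_frames(signal, sample_rate)
features = [frame_features(f) for f in frames]
```

Each entry of `features` is one short feature vector per 20 ms frame. In a setup like the one the abstract describes, such per-frame vectors would be the rows passed to a classifier.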
Original language | English |
---|---|
Pages (from-to) | 510-535 |
Number of pages | 26 |
Journal | Cybernetics and Systems |
Volume | 48 |
Issue number | 6-7 |
State | Published - 16 Nov 2017 |
Externally published | Yes |
Bibliographical note
Publisher Copyright: © 2017 Taylor & Francis Group, LLC.
Keywords
- Classification
- Feature extraction
- Gender classification
- Language identification
- Machine learning
- Random forest
- Speech