Pedro J. Moreno, Chris Joerg, Jean Manuel Van Thong, Oren Glickman

Research output: Contribution to conference › Paper › peer-review

101 Scopus citations


In this paper we address the problem of efficiently aligning very long audio files (often more than one hour) to their corresponding textual transcripts. We present an efficient recursive technique that works well even on noisy speech signals. The key idea of the algorithm is to turn the forced alignment problem into a recursive speech recognition problem with a gradually restricting dictionary and language model. The algorithm is tolerant of acoustic noise and of errors or gaps in the text transcript or audio tracks. We report experimental results on a 3-hour audio file containing TV and radio broadcasts. We show accurate alignments on speech under a variety of real acoustic conditions, such as speech over music and speech over telephone lines. We also report results when the same audio stream has been corrupted with additive white noise or compressed using a popular web encoding format such as RealAudio. This algorithm has been used in our internal multimedia indexing project, where it has processed more than 200 hours of audio from varied sources, such as WGBH NOVA documentaries and NPR web audio files. The system aligns speech media content in about one to five times real time, depending on the acoustic conditions of the audio signal.
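The recursive idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how aligned regions are detected, so the sketch assumes that runs of consecutive words on which the recognizer output and the transcript agree serve as "anchors", and that each unaligned gap between anchors is re-recognized with a vocabulary restricted to the words of that gap. The names `align`, `Recognizer`, `toy_recognizer`, `TRUTH`, and `TRANSCRIPT`, and the parameters `min_anchor` and `max_depth`, are all hypothetical. A real system would also rebuild the language model for each gap, which this sketch omits.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Tuple

TimedWord = Tuple[str, float, float]  # (word, start_sec, end_sec)
# Hypothetical recognizer interface: (audio_start, audio_end, vocabulary) -> timed words.
Recognizer = Callable[[float, float, List[str]], List[TimedWord]]


def align(transcript: List[str], t_start: float, t_end: float,
          recognize: Recognizer, min_anchor: int = 3,
          max_depth: int = 5) -> List[Optional[float]]:
    """Return a start time (or None if unaligned) for each transcript word."""
    times: List[Optional[float]] = [None] * len(transcript)
    if not transcript or max_depth == 0 or t_end <= t_start:
        return times

    # Restrict the dictionary to the words of this transcript span only.
    vocab = sorted(set(transcript))
    hyp = recognize(t_start, t_end, vocab)
    hyp_words = [w for w, _, _ in hyp]

    # Assumed anchor rule: runs of consecutive words where the hypothesis
    # and the transcript agree, at least `need` words long.
    need = min(min_anchor, len(transcript))
    blocks = SequenceMatcher(a=transcript, b=hyp_words,
                             autojunk=False).get_matching_blocks()
    anchors = [b for b in blocks if b.size >= need]

    def recurse(lo: int, hi: int, a_start: float, a_end: float) -> None:
        # Re-run recognition on an unaligned gap with a smaller vocabulary.
        times[lo:hi] = align(transcript[lo:hi], a_start, a_end,
                             recognize, min_anchor, max_depth - 1)

    text_pos, audio_pos = 0, t_start
    for b in anchors:
        if b.a > text_pos:  # unaligned text before this anchor
            recurse(text_pos, b.a, audio_pos, hyp[b.b][1])
        for k in range(b.size):  # anchored words take the recognizer's times
            times[b.a + k] = hyp[b.b + k][1]
        text_pos, audio_pos = b.a + b.size, hyp[b.b + b.size - 1][2]
    if anchors and text_pos < len(transcript):  # unaligned tail
        recurse(text_pos, len(transcript), audio_pos, t_end)
    return times


# Toy data: ten words, one second each. Purely illustrative.
TRANSCRIPT = "the quick brown fox jumps over the lazy dog today".split()
TRUTH = [(w, float(i), float(i + 1)) for i, w in enumerate(TRANSCRIPT)]


def toy_recognizer(t0: float, t1: float, vocab: List[str]) -> List[TimedWord]:
    """Stand-in recognizer: garbles words 4 and 5 when the vocabulary is large."""
    out = []
    for i, (w, s, e) in enumerate(TRUTH):
        if s >= t0 and e <= t1:
            confused = len(vocab) > 4 and i in (4, 5)
            out.append(("<noise>" if confused else w, s, e))
    return out


if __name__ == "__main__":
    print(align(TRANSCRIPT, 0.0, 10.0, toy_recognizer))
```

In the toy run, the first pass anchors the words on either side of the two garbled words, and the second pass recovers them because the gap is recognized with a two-word vocabulary. Passing `max_depth=1` disables the second pass and leaves those two words unaligned.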

Original language: English
State: Published - 1998
Externally published: Yes
Event: 5th International Conference on Spoken Language Processing, ICSLP 1998 - Sydney, Australia
Duration: 30 Nov 1998 → 4 Dec 1998


Conference: 5th International Conference on Spoken Language Processing, ICSLP 1998

Bibliographical note

Funding Information:
We would like to thank all the members of the Speech and Multimedia Indexing groups at the Compaq Cambridge Research Lab, in particular Brian Eberman and Mike Sokolov, for their help in integrating this system into the MediaVista multimedia indexing engine. We also thank Dave Goddeau for his help with the language modeling aspects of the algorithm, and Dave Kovalcin from the Compaq Unix Group for his valuable feedback and remarks while testing the algorithm.

Figure 5: Histogram of the time difference between the ground truth and the alignments produced by our algorithm for the RealAudio encoded 1997 hub4 evaluation data.

Publisher Copyright:
© 1998. 5th International Conference on Spoken Language Processing, ICSLP 1998. All rights reserved.


Title: A Recursive Algorithm for the Forced Alignment of Very Long Audio Segments
