Audio-visual group recognition using diffusion maps

Yosi Keller, Ronald R. Coifman, Stéphane Lafon, Steven W. Zucker

Research output: Contribution to journal › Article › peer-review

43 Scopus citations

Abstract

Data fusion is a natural and common approach to recovering the state of physical systems, but the dissimilar appearance of different sensors remains a fundamental obstacle. We propose a unified embedding scheme for multisensory data, based on the spectral diffusion framework, that addresses this issue. Our scheme is purely data-driven and assumes no a priori statistical or deterministic models of the data sources. To extract the underlying structure, we first embed each input channel separately; the resulting structures are then combined in diffusion coordinates. In particular, because different sensors sample similar phenomena with different sampling densities, we apply the density-invariant Laplace-Beltrami embedding. Differing sampling densities are a fundamental issue in multisensor acquisition and processing that prior approaches overlooked. We extend previous work on group recognition and suggest a novel approach to the selection of diffusion coordinates. To verify our approach, we demonstrate performance improvements in audio-visual speech recognition.
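The pipeline described in the abstract (embed each channel separately with a density-invariant normalization, then combine the results in diffusion coordinates) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a Gaussian kernel, the alpha = 1 normalization commonly used in the diffusion-maps literature to approximate the Laplace-Beltrami operator, and fusion by simple concatenation of per-channel coordinates. The function names, the kernel scale `eps`, and the choice of the leading nontrivial coordinates are all hypothetical; the paper's own coordinate-selection approach is not reproduced here.

```python
import numpy as np


def diffusion_coordinates(X, eps, n_coords=2, t=1):
    """Density-invariant (alpha = 1) diffusion coordinates of the rows of X.

    Hypothetical helper: eps is the Gaussian kernel scale, t the diffusion time.
    """
    # Pairwise squared Euclidean distances between samples.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)

    # Gaussian affinity kernel.
    K = np.exp(-d2 / eps)

    # alpha = 1 normalization: divide out the kernel density estimate q
    # so the embedding does not depend on how densely the data were sampled.
    q = K.sum(axis=1)
    K_tilde = K / np.outer(q, q)

    # Symmetric conjugate of the Markov matrix D^{-1} K_tilde, so that a
    # symmetric eigensolver can be used.
    d = K_tilde.sum(axis=1)
    S = K_tilde / np.sqrt(np.outer(d, d))

    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]  # sort eigenvalues in descending order
    vals, vecs = vals[order], vecs[:, order]

    # Right eigenvectors of the Markov matrix.
    psi = vecs / np.sqrt(d)[:, None]

    # Skip the trivial constant eigenvector (eigenvalue 1).
    return psi[:, 1:n_coords + 1] * vals[1:n_coords + 1] ** t


def fuse_channels(channels, eps_list, n_coords=2):
    """Embed each channel separately, then concatenate the coordinates.

    Concatenation is an assumption made for this sketch only.
    """
    embedded = [
        diffusion_coordinates(X, eps, n_coords)
        for X, eps in zip(channels, eps_list)
    ]
    return np.hstack(embedded)
```

As a usage illustration, two synthetic channels that observe the same latent variable through different maps (for example, one array of audio features and one of visual features with one row per time frame) could be passed as `fuse_channels([audio, video], [eps_audio, eps_video])`. The result has one row per sample and `n_coords` columns per channel. Eigenvector signs are arbitrary, so coordinates are defined only up to sign.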

Original language: English
Article number: 5210209
Pages (from-to): 403-413
Number of pages: 11
Journal: IEEE Transactions on Signal Processing
Volume: 58
Issue number: 1
DOI: 10.1109/TSP.2009.2030861
State: Published - Jan 2010

Bibliographical note

Funding Information:
Manuscript received September 17, 2008; accepted July 17, 2009. First published August 21, 2009; current version published December 16, 2009. The associate editor coordinating review of this manuscript and approving it for publication was Prof. P. K. Varshney. This work was supported by AFOSR, ARO, and NGA. Y. Keller is with the School of Engineering, Bar Ilan University, Israel. R. R. Coifman is with the Department of Mathematics, Yale University, New Haven, CT 06520 USA. S. Lafon is with Google Inc., Mountain View, CA 94043 USA. S. W. Zucker is with the Department of Computer Science, Yale University, New Haven, CT 06520 USA. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TSP.2009.2030861

Funding

Funders

• Air Force Office of Scientific Research
• Army Research Office

Keywords

• Dimensionality reduction
• Laplacian eigenmaps
• Multisensor
• Sensor fusion
• Speech recognition
