TY - JOUR
T1 - Sound retrieval and ranking using sparse auditory representations
AU - Lyon, Richard F.
AU - Rehn, Martin
AU - Bengio, Samy
AU - Walters, Thomas C.
AU - Chechik, Gal
PY - 2010/9/1
Y1 - 2010/9/1
N2 - To create systems that understand the sounds that humans are exposed to in everyday life, we need to represent sounds with features that can discriminate among many different sound classes. Here, we use a sound-ranking framework to quantitatively evaluate such representations in a large-scale task. We have adapted a machine-vision method, the passive-aggressive model for image retrieval (PAMIR), which efficiently learns a linear mapping from a very large sparse feature space to a large query-term space. Using this approach, we compare different auditory front ends and different ways of extracting sparse features from high-dimensional auditory images. We tested auditory models that use an adaptive pole-zero filter cascade (PZFC) auditory filter bank and sparse-code feature extraction from stabilized auditory images with multiple vector quantizers. In addition to auditory image models, we compare a family of more conventional mel-frequency cepstral coefficient (MFCC) front ends. The experimental results show a significant advantage for the auditory models over vector-quantized MFCCs. When thousands of sound files with a query vocabulary of thousands of words were ranked, the best precision at top-1 was 73% and the average precision was 35%, reflecting an 18% improvement over the best competing MFCC front end.
AB - To create systems that understand the sounds that humans are exposed to in everyday life, we need to represent sounds with features that can discriminate among many different sound classes. Here, we use a sound-ranking framework to quantitatively evaluate such representations in a large-scale task. We have adapted a machine-vision method, the passive-aggressive model for image retrieval (PAMIR), which efficiently learns a linear mapping from a very large sparse feature space to a large query-term space. Using this approach, we compare different auditory front ends and different ways of extracting sparse features from high-dimensional auditory images. We tested auditory models that use an adaptive pole-zero filter cascade (PZFC) auditory filter bank and sparse-code feature extraction from stabilized auditory images with multiple vector quantizers. In addition to auditory image models, we compare a family of more conventional mel-frequency cepstral coefficient (MFCC) front ends. The experimental results show a significant advantage for the auditory models over vector-quantized MFCCs. When thousands of sound files with a query vocabulary of thousands of words were ranked, the best precision at top-1 was 73% and the average precision was 35%, reflecting an 18% improvement over the best competing MFCC front end.
UR - http://www.scopus.com/inward/record.url?scp=78149304826&partnerID=8YFLogxK
U2 - 10.1162/neco_a_00011
DO - 10.1162/neco_a_00011
M3 - Letter
C2 - 20569181
AN - SCOPUS:78149304826
SN - 0899-7667
VL - 22
SP - 2390
EP - 2416
JO - Neural Computation
JF - Neural Computation
IS - 9
ER -