Real-time Audio Classification on Streaming Audio Features

This repository contains several Bayesian models for audio event classification, developed for live applause detection. The application was a component of the TerraSwarm RoboCafe demo, which was presented at the DARPA Wait, What? forum in September 2015.

Overview

For documentation purposes, the repository contains multiple complete GMTK models trained on labeled audio data. The latest and most mature model is under alpha0.

GMTK and Bayesian Models

We use GMTK for both training and online classification. Each model folder contains a decoding subfolder with the model parameters produced by GMTK training.

The models used here are hidden Markov models (HMMs) with Gaussian mixture emissions. During training, a binary (applause vs. non-applause) HMM was assumed and the mixture components of each state were learned. For classification, we expand this into an HMM with 2*M states by replicating each binary state M times, which temporally biases the decoder toward sustained features of a class. This mitigates false positives (sounds classified as applause that are actually caused by intermittent clap-like ambient sounds such as clinks or bangs). The trained models contain on the order of 10-20 mixture components per class.
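For concreteness, the sketch below shows one way such a replicated-state transition structure can be laid out: each class gets a left-to-right chain of M tied states, and only the last replica of a chain may switch to the other class. This is an illustrative assumption, not the repository's GMTK structure files, and the self-loop probability is a hypothetical value.

```java
/**
 * Minimal sketch: transition matrix of a 2*M-state HMM in which each of the
 * two classes (applause / non-applause) is replicated M times, so a class
 * must be occupied for at least M frames before the decoder can switch.
 */
public final class ReplicatedStateHmm {

    /** selfLoop is a hypothetical stay probability shared by all replicas. */
    static double[][] buildTransitions(int m, double selfLoop) {
        int n = 2 * m;                      // states 0..m-1: class 0, m..2m-1: class 1
        double[][] a = new double[n][n];
        for (int c = 0; c < 2; c++) {
            int base = c * m;
            for (int i = 0; i < m; i++) {
                int s = base + i;
                a[s][s] = selfLoop;                      // stay in the current replica
                if (i < m - 1) {
                    a[s][s + 1] = 1.0 - selfLoop;        // advance within the class chain
                } else {
                    a[s][(1 - c) * m] = 1.0 - selfLoop;  // only the last replica may switch class
                }
            }
        }
        return a;
    }

    public static void main(String[] args) {
        double[][] a = buildTransitions(10, 0.95);       // M = 10, illustrative values
        System.out.println("number of states: " + a.length);
    }
}
```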

Implementation Details

Classifying audio events often requires learning both temporal and frequency-domain signatures from the raw data. For the specific task of applause detection, we experimented with Auditory Filterbank Temporal Envelope (AFTE) features.

We used a raw audio signal collected by a microphone accessor at a sampling rate of 48 kHz. For feature extraction, we used 1/3 s (16,000-sample) windows advanced by 3200 samples per frame (i.e., features were extracted at a rate of 15 fps). We used a filterbank of 8 channels and optimized the center frequencies of the (bandpass) filters to be [377, 524, 707, 934, 1217, 1568, 2000, 2800] Hz. The output stream of each filter was passed through a low-pass filter to extract its envelope, and a 64-point FFT of each envelope yields the final features.
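A minimal, self-contained sketch of this pipeline is given below. It is not the Ptolemy/Java feature extractor referenced later; the filter design (simple biquad band-pass filters with an assumed Q of 2, a one-pole envelope follower with an assumed 30 Hz cutoff) and the decimation of each window's envelope to 64 samples before the FFT are illustrative assumptions.

```java
/**
 * Sketch of an AFTE-style feature extractor: 8 band-pass channels, envelope
 * following, and a 64-point DFT of each envelope per analysis window.
 */
public final class AfteFeatureSketch {
    static final double FS = 48000.0;
    static final int WINDOW = 16000;   // 1/3 s at 48 kHz
    static final int HOP = 3200;       // 15 frames per second
    static final double[] CENTERS = {377, 524, 707, 934, 1217, 1568, 2000, 2800};

    /** Direct-form biquad band-pass filter (audio-cookbook style coefficients). */
    static double[] bandpass(double[] x, double f0, double q) {
        double w = 2 * Math.PI * f0 / FS, alpha = Math.sin(w) / (2 * q);
        double a0 = 1 + alpha;
        double b0 = alpha / a0, b2 = -alpha / a0;
        double a1 = -2 * Math.cos(w) / a0, a2 = (1 - alpha) / a0;
        double[] y = new double[x.length];
        double x1 = 0, x2 = 0, y1 = 0, y2 = 0;
        for (int n = 0; n < x.length; n++) {
            y[n] = b0 * x[n] + b2 * x2 - a1 * y1 - a2 * y2;
            x2 = x1; x1 = x[n]; y2 = y1; y1 = y[n];
        }
        return y;
    }

    /** Full-wave rectification followed by a one-pole low-pass envelope follower. */
    static double[] envelope(double[] x, double cutoffHz) {
        double a = Math.exp(-2 * Math.PI * cutoffHz / FS);
        double[] e = new double[x.length];
        double prev = 0;
        for (int n = 0; n < x.length; n++) {
            prev = (1 - a) * Math.abs(x[n]) + a * prev;
            e[n] = prev;
        }
        return e;
    }

    /** Magnitudes of a naive 64-point DFT of the decimated envelope. */
    static double[] dft64(double[] env) {
        int n = 64, step = env.length / n;
        double[] mag = new double[n];
        for (int k = 0; k < n; k++) {
            double re = 0, im = 0;
            for (int i = 0; i < n; i++) {
                double v = env[i * step], ang = -2 * Math.PI * k * i / n;
                re += v * Math.cos(ang);
                im += v * Math.sin(ang);
            }
            mag[k] = Math.hypot(re, im);
        }
        return mag;
    }

    /** One 8 x 64 feature block per analysis window. */
    static double[][] features(double[] window) {
        double[][] f = new double[CENTERS.length][];
        for (int c = 0; c < CENTERS.length; c++) {
            f[c] = dft64(envelope(bandpass(window, CENTERS[c], 2.0), 30.0));
        }
        return f;
    }

    public static void main(String[] args) {
        double[] window = new double[WINDOW];        // placeholder input (silence)
        double[][] f = features(window);
        System.out.println(f.length + " channels x " + f[0].length + " FFT bins per frame");
    }
}
```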

Because the objective of the features is to separate the envelopes of applause from those of non-applause sounds, we ranked the features by their discriminatory power using a combination of t-test, entropy, and KL-divergence criteria, and selected the 10-20 features most influential in separating positive and negative samples. Note that more complex feature selection/transformation/dimensionality-reduction methods could be applied (e.g., PCA), but simple selection worked well in our case. Also note that the feature dimension dramatically affects how reliably the HMM training phase succeeds.
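As an illustration of this kind of ranking, the sketch below scores each feature with a Welch t-statistic and keeps the top K; the entropy and KL-divergence criteria and the exact thresholds used in the repository are not reproduced.

```java
import java.util.Comparator;
import java.util.stream.IntStream;

/** Sketch: rank features of a (samples x features) matrix by a per-feature t-statistic. */
public final class FeatureRankingSketch {

    static double mean(double[] v) {
        double s = 0;
        for (double x : v) s += x;
        return s / v.length;
    }

    static double variance(double[] v, double m) {
        double s = 0;
        for (double x : v) s += (x - m) * (x - m);
        return s / Math.max(1, v.length - 1);
    }

    /** Column j of a (samples x features) matrix. */
    static double[] column(double[][] x, int j) {
        double[] c = new double[x.length];
        for (int i = 0; i < x.length; i++) c[i] = x[i][j];
        return c;
    }

    /** Indices of the topK features with the largest Welch t-statistic magnitude. */
    static int[] rankByTStat(double[][] pos, double[][] neg, int topK) {
        int d = pos[0].length;
        double[] score = new double[d];
        for (int j = 0; j < d; j++) {
            double[] p = column(pos, j), q = column(neg, j);
            double mp = mean(p), mq = mean(q);
            double se = Math.sqrt(variance(p, mp) / p.length + variance(q, mq) / q.length);
            score[j] = Math.abs(mp - mq) / (se + 1e-12);   // guard against zero variance
        }
        return IntStream.range(0, d).boxed()
                .sorted(Comparator.comparingDouble((Integer j) -> -score[j]))
                .limit(topK)
                .mapToInt(Integer::intValue)
                .toArray();
    }
}
```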

Real-time Performance

Note that at the given frame rate, per-frame feature extraction must complete in about 60 ms for real-time classification. We perform all filtering operations of the AFTE filters, as well as the envelope extraction using overlap-add FFT, with a thread pool in which each channel of the filterbank is handled by a separate thread. The Java implementation of the feature extractor is available here as part of the Ptolemy tree.
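A minimal sketch of that per-channel thread-pool layout is shown below; processChannel() is a hypothetical stand-in for the band-pass filtering, envelope extraction, and FFT steps, and the pool is kept alive so it can be reused across frames.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Sketch: run one task per filterbank channel on a fixed-size thread pool. */
public final class ChannelPoolSketch {
    static final int CHANNELS = 8;
    static final ExecutorService POOL = Executors.newFixedThreadPool(CHANNELS);

    /** Placeholder for one channel's band-pass + envelope + 64-point FFT. */
    static double[] processChannel(double[] window, int channel) {
        return new double[64];   // hypothetical per-channel feature slice
    }

    /** Runs all channels concurrently and blocks until the whole frame is done. */
    static double[][] extractFrame(double[] window) throws Exception {
        List<Callable<double[]>> tasks = new ArrayList<>();
        for (int c = 0; c < CHANNELS; c++) {
            final int channel = c;
            tasks.add(() -> processChannel(window, channel));
        }
        List<Future<double[]>> done = POOL.invokeAll(tasks);
        double[][] features = new double[CHANNELS][];
        for (int c = 0; c < CHANNELS; c++) features[c] = done.get(c).get();
        return features;
    }
}
```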

Execution

decoding/applause.sh can be run as a WebSocket app via websocketd. We extract feature vectors using a CapeCode swarmlet and stream them to gmtkOnline (using WebSockets), which returns the decoded stream to the swarmlet. After the decoded stream is passed through a smoothing filter, the results are ready to be consumed by other applications.
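For illustration only, the sketch below shows a bare-bones WebSocket client standing in for the feature-streaming side. It assumes the applause.sh service is reachable at a hypothetical ws://localhost:8080 endpoint and that features are sent as one text line per frame; the actual demo used a CapeCode swarmlet rather than this client.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.concurrent.CompletionStage;

/** Sketch: stream one feature frame to the decoder service and print replies. */
public final class FeatureStreamClientSketch {
    public static void main(String[] args) throws Exception {
        WebSocket ws = HttpClient.newHttpClient().newWebSocketBuilder()
                .buildAsync(URI.create("ws://localhost:8080"), new WebSocket.Listener() {
                    @Override
                    public CompletionStage<?> onText(WebSocket s, CharSequence data, boolean last) {
                        System.out.println("decoded: " + data);  // decoded class stream from the service
                        s.request(1);                            // ask for the next message
                        return null;
                    }
                }).join();

        // Hypothetical feature frame: space-separated values, one line per frame.
        ws.sendText("0.12 0.30 0.07 0.55", true).join();
        Thread.sleep(1000);                                      // wait briefly for a reply
        ws.sendClose(WebSocket.NORMAL_CLOSURE, "done").join();
    }
}
```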