
speech_emotion_detection

Public repository · 8 stars · 0 forks · 1 issue

Commits

List of commits on branch master:

  • 17f14773155d889f088bc26accda9165846c0007: Added classification report (aaitikgupta committed 5 years ago, unverified)
  • eeb8131822261d072d414bc96586f9b3b31713c6: Updated underlying model (aaitikgupta committed 5 years ago, unverified)
  • ff45e7d577ce059db7fb49fb5919f792991f54ee: Update predict_emotion function (aaitikgupta committed 5 years ago, unverified)
  • 59fa2f9b76acf8371e301275a09c4265561bfc58: Added classification report (aaitikgupta committed 5 years ago, unverified)
  • 41a43e87ae90933dd06f2844aa9355b9c0442632: Updated model.h5 (aaitikgupta committed 5 years ago, unverified)
  • 4f5f1d116a90dc5e7b8c42d7239a5cebba895ffc: Added more self-recordings (aaitikgupta committed 5 years ago, unverified)

README

The README file for this repository.

Speech Emotion Detection built using Conv1D+LSTM layers

This is a user-driven speech emotion detection system.

Tech Stack:

  • TensorFlow
  • libROSA
  • PyAudio

Steps to reproduce:

Note: To manage the environment, I highly recommend using conda.

git clone https://github.com/aitikgupta/speech_emotion_detection.git
cd speech_emotion_detection
conda env create -f environment.yml
conda activate {environment name}  # e.g. conda activate kaggle
python main.py

There are three things that can be done:

  1. Train the model again (this takes time; on a GeForce GTX 1650 GPU, training took around 1 hour)
  2. Randomly select sample voices and compare the model's predicted labels against the actual ones
  3. Test the model on your own voice sample (a recording sketch is shown below)
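
For option 3, a short clip has to be captured from the microphone. Below is a minimal sketch of how such a clip could be recorded with PyAudio (part of the tech stack) and saved as a WAV file; the sample rate, clip length and filename are assumptions for illustration, not the repository's actual settings.

# Hypothetical sketch: record a short clip with PyAudio and save it as a WAV file.
# RATE, SECONDS and FILENAME are assumptions, not the repository's settings.
import wave
import pyaudio

RATE = 22050                 # assumed sample rate (Hz)
SECONDS = 4                  # assumed clip length
CHUNK = 1024
FILENAME = "my_voice.wav"    # hypothetical output path

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                input=True, frames_per_buffer=CHUNK)
frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * SECONDS))]
stream.stop_stream()
stream.close()

sample_width = p.get_sample_size(pyaudio.paInt16)
p.terminate()

with wave.open(FILENAME, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(sample_width)
    wf.setframerate(RATE)
    wf.writeframes(b"".join(frames))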

Classes:

  • "male_calm"
  • "male_happy"
  • "male_sad"
  • "male_dislike"
  • "male_fearful"
  • "male_surprised"
  • "female_calm"
  • "female_happy"
  • "female_sad"
  • "female_dislike"
  • "female_fearful"
  • "female_surprised"

Inspiration:

There is a wide variety of convolutional models on the internet for speech emotion detection.

However, for time-dependent data, convolutions alone don't account for correlations across time intervals.

Combining RNNs with convolutions therefore makes the model much more robust at understanding the intent of the user.

Pipeline:

  1. The input audio is first converted into a spectrogram, using the Fourier transform to move from the time domain to the frequency domain.
  2. The amplitudes are shifted to a log scale, and the frequencies can also be mapped onto the Mel scale (see the feature-extraction sketch below).
  3. For every fold of the training/validation split (using Stratified K-Fold), random processing techniques such as shifting, pitch tuning, and adding white noise are applied to the voices to make the model more robust (see the fold-split sketch below). [This was key to making it usable in real time.]
  4. The generated features are then fed into Conv1D layers, and ultimately LSTMs.
  5. The last Dense layer contains 12 units with a Softmax activation (see the model sketch below).
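
The feature-extraction sketch below illustrates steps 1-3 with libROSA and NumPy: load a clip, optionally apply random shifting, pitch tuning and white noise, and compute a log-Mel spectrogram. The function names and parameter values (sample rate, number of Mel bands, noise level, pitch range) are assumptions and may differ from the repository's actual preprocessing.

# Hypothetical sketch of steps 1-3: random augmentation plus a log-Mel
# spectrogram. Function names and parameter values are assumptions.
import numpy as np
import librosa

def augment(y, sr):
    y = np.roll(y, np.random.randint(-sr // 2, sr // 2))   # random time shift (up to 0.5 s)
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=np.random.uniform(-2, 2))  # random pitch tuning
    y = y + 0.005 * np.random.randn(len(y))                 # additive white noise
    return y

def log_mel_features(path, sr=22050, n_mels=128, train=True):
    y, sr = librosa.load(path, sr=sr)
    if train:
        y = augment(y, sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)  # Fourier transform + Mel filter bank
    return librosa.power_to_db(mel, ref=np.max)                      # shift amplitudes to a log (dB) scale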
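
For the fold split in step 3, the sketch below shows a Stratified K-Fold loop that keeps the 12 classes balanced across folds; the fold count, shapes and placeholder data are assumptions for illustration.

# Hypothetical sketch of the Stratified K-Fold split in step 3.
# The fold count and the placeholder data are assumptions.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.randn(120, 128, 128)         # placeholder feature tensors
y = np.random.randint(0, 12, size=120)     # placeholder integer class labels

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    X_train, y_train = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]
    # The random shifting / pitch tuning / white noise from the README would be
    # applied before feature extraction (commonly to the training fold only).
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val samples")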
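
Steps 4 and 5 describe Conv1D layers feeding into LSTMs and a 12-way Softmax. The Keras sketch below is a minimal version of such an architecture; the filter counts, kernel sizes, LSTM units, dropout and input shape are assumptions, not the actual hyperparameters stored in model.h5.

# Hypothetical Conv1D + LSTM sketch for steps 4-5. Layer sizes and the
# input shape are assumptions, not the hyperparameters of model.h5.
import tensorflow as tf

N_TIMESTEPS, N_FEATURES, N_CLASSES = 128, 128, 12   # assumed feature shape

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(N_TIMESTEPS, N_FEATURES)),
    tf.keras.layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.Conv1D(128, kernel_size=5, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()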

Limitations:

  • "Emotions" are hard to annotate, even for humans. There is no "Perfect" dataset for such problem.
  • Here's a great article as to why convolutions and spectrograms are not "ideal" for audio processing.