Mel spectrogram vs. MFCCs as input to a speech detection NN

I’m working on a small project where I want to run a lightweight NN on a Raspberry Pi to detect certain keywords (“alexa”, “siri”, “ok google” style).

I was wondering if anyone could help me understand what MFCCs are and why you would use them instead of a melspectrogram as input into a speech detection model.

TensorFlow 1.4 has methods for computing both.

Wikipedia says:

MFCCs are commonly derived as follows:

  1. Take the Fourier transform of (a windowed excerpt of) a signal.
  2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
  3. Take the logs of the powers at each of the mel frequencies.
  4. Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
  5. The MFCCs are the amplitudes of the resulting spectrum.
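
To make the five steps concrete, here is a minimal sketch for a single audio frame, written with only the Python standard library so every step is visible. It is not how TensorFlow or any other library implements this. The function names are made up for illustration, and the Hann window, the 2595/700 mel formula, the 20 filters, the 13 kept coefficients, and the unnormalized DCT-II are all assumptions (common choices, but not the only ones). Note that the output of steps 1-3 is one column of a log-mel spectrogram, so MFCCs are just a DCT applied on top of that.

```python
import cmath
import math


def hz_to_mel(hz):
    # One common mel-scale formula (assumed here; other variants exist).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Step 1: Hann-window the frame, take a naive DFT, keep the powers."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    powers = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed))
        powers.append(abs(s) ** 2)
    return powers


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Step 2: triangular, overlapping filters evenly spaced on the mel scale."""
    top_mel = hz_to_mel(sample_rate / 2.0)
    # num_filters + 2 points: left edge, the filter centers, right edge.
    edges_hz = [mel_to_hz(top_mel * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_hz])
    return bank


def log_mel_energies(frame, sample_rate, num_filters=20):
    """Steps 1-3: the result is one column of a log-mel spectrogram."""
    powers = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    # The small constant avoids log(0) for silent bands.
    return [math.log(sum(w * p for w, p in zip(row, powers)) + 1e-10)
            for row in bank]


def dct2(values):
    """Step 4: unnormalized DCT-II, treating the log-mel vector as a signal."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n)]


def mfcc(frame, sample_rate, num_filters=20, num_coeffs=13):
    """Step 5: the MFCCs are the first few DCT amplitudes."""
    return dct2(log_mel_energies(frame, sample_rate, num_filters))[:num_coeffs]


# Usage: one 256-sample frame of a 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
frame = [math.sin(2.0 * math.pi * 440.0 * i / sample_rate) for i in range(256)]
log_mel = log_mel_energies(frame, sample_rate)   # 20 values per frame
coeffs = mfcc(frame, sample_rate)                # 13 values per frame
```

In this sketch the zeroth coefficient is simply the sum of the log-mel energies (overall loudness), and each higher coefficient measures how much the log-mel curve resembles a progressively faster cosine ripple across the mel bands. Keeping only the first few coefficients therefore keeps the smooth overall shape of the spectrum and discards fine detail. In practice you would use NumPy or the TensorFlow ops rather than a naive DFT like this one.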

I guess the parts that are confusing to me are steps 4 & 5. I haven’t taken a signals class and I don’t have a good intuition for what MFCCs represent.

I know this post is super old, but did you ever find an answer to this? Wondering the same :)