I’m working on a small project where I want to run a lightweight neural network on a Raspberry Pi to detect certain keywords (“alexa”, “siri”, “ok google” style).
I was wondering if anyone could help me understand what MFCCs are and why you would use them instead of a mel spectrogram as input to a speech detection model.
TensorFlow 1.4 has methods for computing both: https://www.tensorflow.org/api_guides/python/contrib.signal#Computing_spectrograms
MFCCs are commonly derived as follows:
1. Take the Fourier transform of (a windowed excerpt of) a signal.
2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
3. Take the logs of the powers at each of the mel frequencies.
4. Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
5. The MFCCs are the amplitudes of the resulting spectrum.
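The steps above can be sketched in plain Python for a single audio frame. This is only a minimal illustration using the standard library, not how TensorFlow computes them. The function names, the Hann window, the HTK-style mel formula (2595 and 700 constants), the 20 filters, and the 13 retained coefficients are all assumptions chosen for the example, and the frame length is assumed to be even.

```python
import cmath
import math


def hz_to_mel(hz):
    # Assumed HTK-style mel formula; other definitions of the mel scale exist.
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Step 1: Hann-window the frame and take the squared DFT magnitude."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):  # keep only the non-negative frequencies
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def mel_filterbank(num_filters, num_bins, sample_rate):
    """Step 2: triangular, overlapping filters evenly spaced on the mel scale."""
    top_mel = hz_to_mel(sample_rate / 2.0)
    # num_filters + 2 edge points: each triangle spans three neighbouring points.
    edges_hz = [mel_to_hz(top_mel * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    bin_hz = [b * (sample_rate / 2.0) / (num_bins - 1) for b in range(num_bins)]
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def dct_ii(values):
    """Step 4: unnormalized DCT-II, treating the log mel powers as a signal."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i, v in enumerate(values))
            for k in range(n)]


def mfcc(frame, sample_rate, num_filters=20, num_coeffs=13):
    power = power_spectrum(frame)                                # step 1
    bank = mel_filterbank(num_filters, len(power), sample_rate)  # step 2
    mel_energies = [sum(w * p for w, p in zip(row, power)) for row in bank]
    # Step 3: the small offset avoids log(0) for empty or silent bands.
    log_mel = [math.log(e + 1e-10) for e in mel_energies]
    # Steps 4 and 5: the DCT outputs are the MFCCs; keep the first few.
    return dct_ii(log_mel)[:num_coeffs]


# Example: one 16 ms frame of a 1 kHz tone sampled at 16 kHz.
frame = [math.sin(2.0 * math.pi * 1000.0 * i / 16000.0) for i in range(256)]
coeffs = mfcc(frame, sample_rate=16000)
```

In this sketch, `log_mel` is one column of a log mel spectrogram, and the MFCCs are the DCT of that column truncated to the first `num_coeffs` values.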
I guess the parts that are confusing to me are steps 4 & 5. I haven’t taken a signals class and I don’t have a good intuition for what MFCCs represent.