Emotion Detection from Voice Clips

Hey guys!
I trained a neural net to detect emotions from voice clips without relying on the content of the speech. It identifies the emotion with an accuracy of 83%. I converted the sound clips into spectrograms and then applied the same methods we use for image classification problems. The notebooks for it can be found here. I also wrote a blog post about it, which can be found here.
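For anyone curious about the spectrogram step, here is a minimal sketch of how a clip can be turned into a 2D image-like array. It is not the exact code from the notebooks: the function name `log_spectrogram`, the `frame_len` and `hop` values, and the synthetic 440 Hz tone standing in for a real voice clip are all illustrative assumptions, and it uses only NumPy.

```python
import numpy as np

def log_spectrogram(signal, frame_len=512, hop=128):
    """Return a log-magnitude spectrogram of shape (freq_bins, frames).

    frame_len and hop are assumed values chosen for illustration.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the clip into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # Magnitude of the FFT of each frame (short-time Fourier transform).
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Convert to decibels; the small constant avoids log(0).
    return 20 * np.log10(magnitude + 1e-10).T

# Synthetic one-second tone used in place of a real voice clip (assumption).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clip = np.sin(2 * np.pi * 440 * t)

spec = log_spectrogram(clip)
print(spec.shape)  # (frequency bins, time frames)
```

The resulting array can be saved or fed to an image classifier like any other single-channel image, which is the general idea behind treating this as an image classification problem.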
Any suggestions for improvements are welcome!