Facial expression recognition with a really small dataset

manu · November 4, 2017, 10:54am

Hi,

I’m trying to solve a facial expression recognition problem with the dataset provided at

http://www.kasrl.org/jaffe.html

In order to simplify the problem, I considered only two classes, neutral and “the rest” (the rest actually consists of six different classes: sad, angry…).

I think the main problem is that the number of samples is really small (213 overall). Is this even feasible with CNNs? I tried reusing VGG16 (replicating the grayscale channel into RGB) with no luck, but I guess this problem is quite different from the one VGG16 was meant to solve (my images are grayscale, and contain just faces).
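Replicating the grayscale channel into RGB can be sketched as follows. This is a minimal NumPy sketch, not necessarily what was done here: the function name `gray_to_rgb` and the channels-last layout are assumptions for illustration, and the input is dummy data rather than the actual dataset.

```python
import numpy as np

def gray_to_rgb(batch):
    """Copy a (N, H, W) grayscale batch into three identical channels -> (N, H, W, 3)."""
    batch = np.asarray(batch)
    if batch.ndim != 3:
        raise ValueError("expected a batch shaped (N, H, W)")
    # Add a trailing channel axis, then repeat it three times.
    return np.repeat(batch[..., np.newaxis], 3, axis=-1)

# Dummy data standing in for two 224x224 grayscale faces.
gray = np.random.rand(2, 224, 224).astype("float32")
rgb = gray_to_rgb(gray)
print(rgb.shape)  # (2, 224, 224, 3)
```

The three channels carry identical values, so a network pretrained on color images receives no color information, only a tensor of the shape it expects.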

I also tried an ad-hoc model along the lines of VGG16

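from keras.models import Sequential
from keras.layers import BatchNormalization, Conv2D, MaxPooling2D, Flatten, Dense

# Assumed: the input shape (1, 224, 224) implies channels-first data.
data_format = 'channels_first'

model = Sequential()
# Note: BatchNormalization defaults to axis=-1, which for channels-first input
# is the width axis; axis=1 would normalize per channel instead.
model.add(BatchNormalization(input_shape=(1, 224, 224)))

# Four conv + max-pool blocks, doubling the number of filters each time.
model.add(Conv2D(8, (3, 3), activation='relu', padding='same', data_format=data_format))
model.add(MaxPooling2D((3, 3), strides=(3, 3), data_format=data_format))

model.add(Conv2D(16, (3, 3), activation='relu', padding='same', data_format=data_format))
model.add(MaxPooling2D((2, 2), strides=(3, 3), data_format=data_format))

model.add(Conv2D(32, (3, 3), activation='relu', padding='same', data_format=data_format))
model.add(MaxPooling2D((2, 2), strides=(3, 3), data_format=data_format))

model.add(Conv2D(64, (3, 3), activation='relu', padding='same', data_format=data_format))
model.add(MaxPooling2D((2, 2), strides=(3, 3), data_format=data_format))

# Two-way softmax: neutral vs. the rest.
model.add(Flatten())
model.add(Dense(2, activation='softmax'))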

The problem is that, during training, the loss goes up and down most of the time (although a decreasing trend can be observed). After 100 epochs, one run gives

loss: 0.1450 - acc: 0.9444 - val_loss: 0.1556 - val_acc: 0.9750

but another run may give different (worse) results. It seems the performance is very dependent on the initialization. What should I make of this (the training loss going up and down during training, and different runs giving inconsistent results)? I would think that a model with 25,986 parameters would always be able to overfit the training set (here, 151 images of 224x224 after splitting into training, validation, and test sets) and reach 100% accuracy on the training set (although probably with poor validation accuracy).
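The 25,986 figure can be reproduced by hand. The sketch below is a back-of-the-envelope count in plain Python under several assumptions not stated in the post: `data_format` is 'channels_first', the pooling layers use the Keras default 'valid' padding, BatchNormalization keeps its default `axis=-1`, and the figure counts trainable parameters only. The helper names `conv_params` and `pool_out` are made up for illustration.

```python
def conv_params(in_ch, out_ch, k=3):
    # k*k weights per (input, output) channel pair, plus one bias per output channel
    return in_ch * k * k * out_ch + out_ch

def pool_out(size, pool, stride):
    # Output length of a pooling layer with 'valid' padding
    return (size - pool) // stride + 1

side = 224
channels = [1, 8, 16, 32, 64]
pools = [(3, 3), (2, 3), (2, 3), (2, 3)]  # (pool size, stride) per block

conv_total = 0
for (c_in, c_out), (pool, stride) in zip(zip(channels, channels[1:]), pools):
    conv_total += conv_params(c_in, c_out)  # 'same' conv keeps the spatial size
    side = pool_out(side, pool, stride)

# BatchNormalization with axis=-1 on a (1, 224, 224) input learns
# gamma and beta over the last axis (length 224).
bn_trainable = 2 * 224
# Dense layer: flattened 64 x side x side features into 2 outputs, plus 2 biases.
dense = channels[-1] * side * side * 2 + 2

total = conv_total + bn_trainable + dense
print(side, conv_total, bn_trainable, dense, total)  # 3 24384 448 1154 25986
```

Under these assumptions the total matches the figure quoted above. Note that 448 of those parameters come from batch normalization acting over an axis of length 224; normalizing over the single channel axis instead would contribute only 2.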

Any thoughts on this?

Cheers.

Matthew · November 4, 2017, 1:28pm

If you’re seeing large swings in loss, I would think that your weights are changing too dramatically, an effect of a learning rate that is too large. Maybe decrease it, or try an optimizer that adapts the learning rate for you, like RMSProp or Adam.
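To illustrate how a large learning rate makes weights swing, here is a toy sketch: plain gradient descent on the one-dimensional function f(w) = w**2. It is not the model from this thread, and the function name `gradient_descent` and the specific learning rates are chosen only for illustration.

```python
def gradient_descent(lr, w0=1.0, steps=6):
    """Minimize f(w) = w**2 with plain gradient descent; return the visited weights."""
    w = w0
    path = [w]
    for _ in range(steps):
        grad = 2.0 * w  # derivative of w**2
        w = w - lr * grad
        path.append(w)
    return path

small = gradient_descent(lr=0.1)      # each step multiplies w by 0.8
large = gradient_descent(lr=0.9)      # each step multiplies w by -0.8
too_large = gradient_descent(lr=1.1)  # each step multiplies w by -1.2

for name, path in [("lr=0.1", small), ("lr=0.9", large), ("lr=1.1", too_large)]:
    print(name, [round(w, 3) for w in path])
```

With the small rate the weight approaches the minimum from one side. With the larger rate it jumps across the minimum on every step, and with the largest one the jumps grow instead of shrinking. On a real network with noisy mini-batch gradients, overshooting of this kind can show up as a loss that bounces up and down.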

Edit:
One paper claims that adaptive gradient methods are often worse than SGD:

The Marginal Value of Adaptive Gradient Methods in Machine Learning (May 23, 2017)

“We observe that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance. These results suggest that practitioners should reconsider the use of adaptive methods to train neural networks.”

A blog post discussing the paper (with learning curves)

“TL;DR: The paper is right!”

Other thoughts:

Since the task involves faces I’d try a VOC-trained FCN, since “person” was a class in the VOC dataset. Since you’re doing image classification and not image segmentation, you’ll have to do some surgery on the FCN. I’d probably try chopping off the upsampling layers to start off.

Here are some Keras implementations of FCNs:

I’m working on a portrait segmentation task right now, and using the person class alone from FCN-8s (a vanilla FCN) is doing pretty well, even though much of the data used to train the FCN was of people’s entire bodies and not just their portraits.

My claim here is less about the relevance of FCNs to your task, and more about the relevance of the data that many FCNs were trained on. By the way, trained FCNs usually come from pretrained VGGs, and so you’ll get benefits from the VOC dataset and the ImageNet dataset by using an FCN.

Other than transfer learning, here are some other thoughts that come to mind:

- If you’re using SGD, try using an adaptive learning rate optimizer like RMSProp or Adam.
  - Edit: Maybe not (see the first section of this post).
- Have you tried data augmentation? Flips, rotations, zooms, and so on.
- This is very domain specific, which is sometimes the wrong approach if you’re working on general data science skills, but maybe try using a facetracker to add a useful input along with the input image. (Warning: getting that facetracker to work in Python on a non-Windows machine was a pain. Let me know if you end up trying to use it and I’ll try to remember all my steps.)
- If you’re using batchnorm with VGG, consider trying to train without it. Jeremy claimed that using batchnorm with transfer learning can hurt performance without using some other tricks (although I don’t know the extent of the claim or the reasons behind it, or what the tricks are):

[We] found that the combination of batch normalisation (which nearly all modern CNN architectures use) and model pretraining and fine-tuning (which you should use in every project if possible) can result in a 500% decrease in accuracy using standard training approaches. - Jeremy
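The data augmentation suggestion above can be sketched with NumPy alone. This is a minimal illustration, assuming 2-D grayscale images and covering only horizontal flips and small shifts (rotations and zooms would need an image library). The function name `augment` and the `max_shift` parameter are hypothetical, and the image is dummy data.

```python
import numpy as np

def augment(image, rng, max_shift=8):
    """Return a randomly flipped and shifted copy of a 2-D grayscale image."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1]  # horizontal flip
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    # Pad with edge values, then crop a window offset by (dy, dx).
    padded = np.pad(out, max_shift, mode="edge")
    top, left = max_shift + dy, max_shift + dx
    h, w = image.shape
    return padded[top:top + h, left:left + w]

rng = np.random.default_rng(0)
face = np.arange(224 * 224, dtype="float32").reshape(224, 224)  # dummy image
batch = np.stack([augment(face, rng) for _ in range(4)])
print(batch.shape)  # (4, 224, 224)
```

Each call produces a slightly different view of the same face, which is one way to stretch a dataset of a couple of hundred images. Whether a given transformation preserves the expression label is a judgment call for the task at hand.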

manu · February 27, 2018, 7:48pm

Better late than never (I had completely forgotten about this). When I posted this, I was working on this project, but then something else came up… Anyway, I just wanted to say thank you for helping me out here. I intend to go back to this when I find some time (probably using fastai/torch).

Cheers.