Tips on improving a CNN model for a regression problem?

Hello,

I’m trying to use the material in course 1 to train a robot to predict its position inside an arena. Its input consists of pictures taken at various points inside a soccer arena, and its output consists of the x, y coordinates and the angle of its position.

My approach was to take the convolutional layers of vgg_bn and add a dense model on top of them based on the fisheries multi-output model from lesson 7 that was used to predict the bounding box of fishes. I’d like to show you the model, the results, my tweaks so far, and to ask if you can suggest other improvements.

The model

p = 0.6
inp = Input(conv_layers[-1].output_shape[1:])
x = MaxPooling2D()(inp)
x = BatchNormalization(axis=1, mode=2)(x)
x = Dropout(p/4)(x)
x = Flatten()(x)
x = Dense(512, activation='relu')(x)
x = BatchNormalization(mode=2)(x)
x = Dropout(p)(x)
x = Dense(512, activation='relu')(x)
x = BatchNormalization(mode=2)(x)
x = Dropout(p/2)(x)

Results

Initially, I trained on 2,000 computer-generated images and validated against another 200 that the model had never seen. I’d like to link to an image showing what my training images look like and how the predictions compare to the validation labels. Also, these are my best results:

loss: 1471.4486 - acc: 0.9250 - val_loss: 1055.4781 - val_acc: 0.8816

Attempts to improve

• The original resolution was 720x576 (to match the resolution of the robot camera), but the best results I got were at 180x144. I also tried 90x72 and 360x288 with slightly worse results; I’m not sure what to make of this.
• The initial set of 2.2k images was taken with larger spacing between positions and 15 degrees between angles, and took about 10 minutes to train. I then generated 70k images at 3-4 times closer spacing and 5 degrees between angles (which forced me to switch my code to generators, because the training data no longer fit in memory), but after 6 hours and 8 epochs of training the results were much worse (loss: 5017.9255 - acc: 0.6843 - val_loss: 5130.9542 - val_acc: 0.6850), and I don’t have a good explanation for that.
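For reference, the generator switch boiled down to something like this sketch (the `load_image` helper, the file list and the label array are hypothetical stand-ins for my actual loading code):

```python
import numpy as np

def batch_generator(filenames, labels, batch_size, load_image):
    """Yield (images, labels) batches forever, the contract fit_generator expects.

    `filenames` is a list of image paths, `labels` an array of [x, y, angle]
    targets, and `load_image` a caller-supplied function that reads one image
    into a numpy array -- all placeholders for this sketch.
    """
    n = len(filenames)
    while True:
        # reshuffle once per pass so each epoch sees the data in a new order
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            images = np.stack([load_image(filenames[i]) for i in idx])
            yield images, labels[idx]
```

Because the generator loops forever, the training loop just asks for as many batches per epoch as there are samples divided by the batch size.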

Where to go from here?

My next ideas are the following:

• Use the weights from the small dataset as a starting point for the large dataset
• Apply smart data augmentation to the images (in the sense of changing the background outside of the arena, or placing objects inside the arena)
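A minimal sketch of the background-swap part of that augmentation, assuming a binary arena mask is available (the mask is a hypothetical input; in practice it could come from the arena’s known geometry or a simple colour threshold):

```python
import numpy as np

def swap_background(image, arena_mask, background):
    """Replace every pixel outside the arena with a new background.

    `image` and `background` are HxWx3 arrays, `arena_mask` is an HxW
    boolean array that is True inside the arena. All three are assumed
    inputs for this sketch.
    """
    # broadcast the mask over the colour channel and pick per pixel
    out = np.where(arena_mask[..., None], image, background)
    return out.astype(image.dtype)
```

Feeding the model many background variants of the same position should teach it to ignore everything outside the arena.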

Can you explain why using a larger dataset results in worse performance? Can you suggest other things to try?
Thank you.


Very cool idea.

Just curious … but is this for the FIRST robotics competitions?

Hi,

It’s for Robochallenge.

Also,

How similar are the images you are capturing from the robot to the images that were used to train the VGG models?

In fine-tuning for a classification task such as Statefarm, we may have to go deeper. The Statefarm dataset tasks you with identifying different activities a distracted driver may be involved in. This is not similar to the original imagenet challenge, and as such it’s probably a smart idea to retrain even more fully connected layers. The idea here is that imagenet has learned to identify things that are not useful for classifying driver actions, and we would like to train them further to find things that are useful.
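Concretely, retraining deeper layers just means flipping their trainable flags before recompiling. Here is a framework-agnostic sketch of the pattern (the `Layer` class is a stand-in, not Keras’s real class; with a real model you would loop over `model.layers` the same way):

```python
class Layer:
    """Stand-in for a framework layer; only name and trainable flag matter here."""
    def __init__(self, name):
        self.name = name
        self.trainable = True

def freeze_up_to(layers, first_trainable):
    """Freeze every layer before `first_trainable`, unfreeze the rest.

    This mirrors the usual fine-tuning pattern: earlier, generic layers
    keep their pretrained weights, later task-specific layers get retrained.
    """
    seen = False
    for layer in layers:
        if layer.name == first_trainable:
            seen = True
        layer.trainable = seen
    return layers
```

Moving the `first_trainable` boundary earlier in the network is exactly what "retraining even more layers" means.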

Have you considered training from the last convolutional layer?

Never heard of that.

I mentor a HS FIRST Robotics team myself. Would be very interested to see what you come up with if you’re interested in sharing your findings and final conclusions with our team.

Good luck.

Thank you! I’ll post my results in this thread.
I’ve tried training from the last convolutional layer after you suggested, so far I see slight improvements. I’ve also reduced dropout from p=0.6 to p=0.2, which also improves the `mse` score. I’ll keep at it.

I’m back with decent results: 2.4 mse loss for the rotation-angle prediction and 9.2 mse loss for the x and y position (expressed as percentages of the arena’s width and height).
My previous one-model-to-fit-all approach didn’t get good results. I ended up basing my approach on a paper called “Image Orientation Estimation with Convolutional Networks” and settled on using a pipeline of 4 models:

1. First model would take the original image and try to classify it in 6 classes depending on which of the following is the closest angle to the actual rotation angle: 0, 60, 120, 180, 240, 300. This coarse prediction is done to make the next task easier. Got val_acc=1.00.
2. Second model would rotate the original image by an amount dictated by the class predicted by M1 and would do regression prediction on the rotated image for angles between -30 and 30 degrees. Got mse=2.4.
3. M3 and M4 would do regression on the x and y position, respectively. The training/validation set is obtained by placing the original image in the center of a larger, square image (with the side equal to the diagonal of the original image), then rotating the composite image by the negative of the sum of the angles predicted by M1 and M2, see example. This is done to eliminate rotation from the image, to make the position prediction easier, while keeping all the information of the original image and avoid cropping. Got mse for both x and y under 10.
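The canvas-and-rotation step in point 3 can be sketched like this (pure numpy; the rotation itself can then be done with any image library, so it is left out here):

```python
import math
import numpy as np

def embed_on_square_canvas(image):
    """Centre `image` on a square canvas whose side equals the image diagonal,
    so that a subsequent rotation never crops away any original pixels."""
    h, w = image.shape[:2]
    side = int(math.ceil(math.hypot(h, w)))
    canvas = np.zeros((side, side) + image.shape[2:], dtype=image.dtype)
    top = (side - h) // 2
    left = (side - w) // 2
    canvas[top:top + h, left:left + w] = image
    return canvas

def derotation_angle(coarse_class, fine_angle):
    """Angle to rotate the composite by: the negative of the total predicted
    rotation, i.e. the coarse class times 60 degrees plus the fine offset."""
    return -(coarse_class * 60 + fine_angle)
```

Rotating the composite by `derotation_angle(...)` gives M3 and M4 an upright image with no information lost to cropping.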

I’m also going to list some tips that helped me:

• I used ResNet in all models instead of the paper’s authors’ choice (AlexNet), because I got better results
• choosing a higher batch_size was by far the biggest driver of improvement of all the hyperparameters
• splitting the model into a conv part and a dense part, precomputing the conv features and training only the dense part saves a huge amount of time, as opposed to pushing the images through the whole model during training (on my machine, 13 sec vs 30 min per epoch)
• for position regression, setting the labels in the range [-50%, +50%] made the model train faster than the range [0%, 100%]
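The feature-precomputation tip amounts to running the frozen conv part once and training the dense head on cached arrays. A framework-agnostic sketch of the pattern (`conv_predict` is a stand-in for the conv model’s predict call):

```python
import os
import numpy as np

def cached_conv_features(conv_predict, images, cache_path):
    """Run the frozen convolutional part once and cache the result to disk,
    so later epochs only train the small dense head on the saved arrays.

    `conv_predict` stands in for the conv model's predict function; this is
    a sketch of the caching pattern, not any particular framework's API.
    """
    if os.path.exists(cache_path):
        # later runs skip the expensive conv forward pass entirely
        return np.load(cache_path)
    features = conv_predict(images)
    np.save(cache_path, features)
    return features
```

Since the conv weights are frozen, the features never change, which is why this is safe to cache once per dataset.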