Finding a needle in a haystack

Hugues1965 · July 15, 2018, 7:21pm

Hi guys,

I’m using lesson 1 code to run with my data.

In lesson 1, the train folders held about 11,500 cat pictures and 11,500 dog pictures, and no “others” pictures.

My data is a bit different:

I have 3 categories: Neutral, Up, Down.
My data is about 99% Neutral, and the rest is split 50/50 between Up and Down (hence the needle-in-a-haystack title).

When I run the lesson 1 code, I get excellent accuracy, 99.14%, but when I look at the confusion matrix, the model only ever predicts Neutral: the Up and Down images are classified as Neutral too.
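To make the gap between accuracy and the confusion matrix concrete, here is a minimal sketch in plain Python. The label counts (9,900 / 50 / 50) and the always-Neutral "model" are made up for illustration; they are not the real data or the lesson 1 model.

```python
CLASSES = ["Neutral", "Up", "Down"]

def confusion_matrix(y_true, y_pred, classes=CLASSES):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Share of all examples that sit on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def per_class_recall(matrix, classes=CLASSES):
    """Fraction of each true class that was predicted correctly."""
    recall = {}
    for i, c in enumerate(classes):
        row_total = sum(matrix[i])
        recall[c] = matrix[i][i] / row_total if row_total else 0.0
    return recall

# Hypothetical labels with roughly the imbalance described above.
y_true = ["Neutral"] * 9900 + ["Up"] * 50 + ["Down"] * 50
# A degenerate model that always answers "Neutral".
y_pred = ["Neutral"] * len(y_true)

cm = confusion_matrix(y_true, y_pred)
print(accuracy(cm))          # 0.99
print(per_class_recall(cm))  # recall is 1.0 for Neutral, 0.0 for Up and Down
```

With this kind of imbalance, per-class recall (or the confusion matrix itself) is a more useful thing to watch than overall accuracy.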

How can I solve this? The point is not only to recognize Neutral images but, above all, Up and Down.

Is the only way to go about this to get more Up and Down data? I read about data augmentation, but my images are charts: I cannot rotate them, and I cannot zoom in or out.

Guidance appreciated.

Patrick · July 15, 2018, 7:24pm

Put more importance in the training process on correct classification of Up and Down. The hackiest way to do this is to just duplicate the images in Up and Down in proportion to how important their classification is. A less hacky way would be to have your data loader pull a higher proportion of images from Up and Down than from Neutral, or to assign an ‘importance weight’ to each observation so that a misclassification of Up or Down is penalized more. Finally, there are more sophisticated techniques such as SMOTE that you could look up.
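A rough sketch of the weighting idea in plain Python, assuming an inverse-frequency heuristic (one common choice, not the only one). The function names and the label counts are hypothetical. The same per-class weights could in principle be handed to a loss function that accepts class weights instead of being used for sampling.

```python
import random
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by total / (n_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {c: total / (n_classes * n) for c, n in counts.items()}

def weighted_sample_indices(labels, num_samples, seed=0):
    """Draw example indices with replacement, favouring rare classes."""
    class_weights = inverse_frequency_weights(labels)
    per_example = [class_weights[label] for label in labels]
    rng = random.Random(seed)
    return rng.choices(range(len(labels)), weights=per_example, k=num_samples)

# Hypothetical labels: 99% Neutral, the rest split between Up and Down.
labels = ["Neutral"] * 9900 + ["Up"] * 50 + ["Down"] * 50

weights = inverse_frequency_weights(labels)
indices = weighted_sample_indices(labels, num_samples=3000)
drawn = Counter(labels[i] for i in indices)
print(weights)  # Up and Down get a much larger weight than Neutral
print(drawn)    # each class is drawn roughly 1,000 times
```

Because the rare classes are drawn with replacement, the same Up and Down images are seen many times per epoch, which is the data-loader version of duplicating the files.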

TheShadow29 · July 15, 2018, 7:25pm

This is an example of class imbalance. I wrote a small notebook on this here https://github.com/TheShadow29/FAI-notes/blob/master/notebooks/Using-Sampler-For-Class-Imbalance.ipynb which uses a sampler to combat this.

The easiest way is to take the same proportion of neutral, positive, and negative samples. That is, throw away about 98% of the neutral examples.
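A minimal sketch of that undersampling step in plain Python, assuming the data is a list of `(path, label)` pairs. The file names and counts below are invented for illustration.

```python
import random
from collections import Counter, defaultdict

def undersample(samples, seed=0):
    """Keep the same number of examples per class (the size of the rarest class)."""
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append((path, label))
    n_keep = min(len(items) for items in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for items in by_class.values():
        # Random subset without replacement; the rest is simply discarded.
        balanced.extend(rng.sample(items, n_keep))
    rng.shuffle(balanced)
    return balanced

# Hypothetical dataset: 9,900 Neutral, 50 Up, 50 Down chart images.
samples = (
    [(f"neutral_{i}.png", "Neutral") for i in range(9900)]
    + [(f"up_{i}.png", "Up") for i in range(50)]
    + [(f"down_{i}.png", "Down") for i in range(50)]
)

balanced = undersample(samples)
print(Counter(label for _, label in balanced))  # 50 examples of each class
```

The trade-off is that most of the Neutral data is never seen by the model, so this is best treated as a quick baseline.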

Hugues1965 · July 15, 2018, 8:09pm

Thanks for the quick replies, guys. I’ll start with the easy way: discard most of the neutrals and duplicate some Ups and Downs to reduce the imbalance.

Thanks

machinethink · July 16, 2018, 8:43am

A third option is to not change the data that you have but to make sure each mini-batch has the same number of examples from each class.
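One way this could look, as a plain-Python sketch: a generator that yields lists of example indices with a fixed number per class. The name `balanced_batches` and its parameters are hypothetical, and the rare classes are sampled with replacement whenever they have fewer examples than one batch needs.

```python
import random
from collections import Counter, defaultdict

def balanced_batches(labels, per_class, num_batches, seed=0):
    """Yield index batches containing exactly `per_class` examples of every class."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    rng = random.Random(seed)
    for _ in range(num_batches):
        batch = []
        for indices in by_class.values():
            if len(indices) >= per_class:
                # Enough examples: no repeats within the batch.
                batch.extend(rng.sample(indices, per_class))
            else:
                # Too few examples: allow repeats within the batch.
                batch.extend(rng.choices(indices, k=per_class))
        rng.shuffle(batch)
        yield batch

# Hypothetical labels: 9,900 Neutral, 50 Up, 50 Down.
labels = ["Neutral"] * 9900 + ["Up"] * 50 + ["Down"] * 50

batches = list(balanced_batches(labels, per_class=8, num_batches=5))
for batch in batches:
    print(Counter(labels[i] for i in batch))  # 8 of each class per batch
```

The dataset on disk stays untouched; only the order and frequency with which examples reach the model change.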

asotov · July 16, 2018, 11:29am

@Hugues1965 and what about your results? Does the confusion matrix look good now, after discarding neutrals?

Hugues1965 · July 16, 2018, 12:08pm

Hi Alexey,

The confusion matrix has improved: it now predicts some of the Up and Down pics properly, although the accuracy is only 0.25 currently, so I have more work to do.

I found a way to increase the number of Ups and Downs by changing the way I create my data. Currently I have fewer than 800 of each, and I should be able to get upwards of 3,000, which should improve my accuracy.