Why is keras more accurate than fastai for the fish competition?

Great to see a big improvement in accuracy on dogs/cats from the v1 course last year to the v2 course this year. However, I see the opposite effect for the fisheries competition.

On the fisheries competition last year, during the v1 course, I got 98% accuracy using Keras with VGG and size=224. Using fastai with resnet34, size=350 and cropping, I get 95% accuracy. If I switch fastai to VGG with size=224, I only get 92% accuracy.
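For reference, this is roughly the kind of fastai setup being compared (a minimal sketch using the fastai 0.7-era API from the v2 course; the data path, batch size and epoch count are illustrative assumptions, not the exact notebook settings):

```python
# Sketch only: fastai 0.7-era API; PATH, bs and epochs are illustrative.
from fastai.conv_learner import *

PATH = 'data/fisheries/'   # hypothetical directory with train/ and valid/ folders
arch, sz = resnet34, 350   # swap to vgg16, 224 to mimic the Keras v1 setup

tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
data = ImageClassifierData.from_paths(PATH, tfms=tfms, bs=64)
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(1e-2, 3)         # a few epochs on the precomputed activations
```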

I have checked the original notebooks on GitHub and they confirm similar results: the lesson 7 fish competition notebook scores 98%, while the new v2 fastai fish notebook achieves 94%.

I know the models are not exactly comparable, but these are quite large differences, so it would be useful to know what is going wrong.


Welcome to the very high-dimensional world of hyperparameter tuning!

Every machine learning problem has some unique attributes (complexity, quantity of data, distribution of data). No matter what your deep learning API is (Fast.ai with PyTorch, Fast.ai with Keras, pure Keras, pure TensorFlow, pure PyTorch, Caffe, …), performance is mostly determined by the hyperparameters. A very small change to just one hyperparameter can completely change your result on a specific problem.

Consequently, I don’t think we can claim that one API is more accurate than another. Fast.ai (v1 or v2) and pure Keras are great prototyping APIs for quickly getting strong results on many problems. Fast.ai v2 with PyTorch is probably the fastest prototyping API overall at the moment (it requires the least amount of code) for many problems. But unfortunately (or fortunately, for the CS job market), you still need to do some hyperparameter tuning.


Do you suppose there is a way of scoring 98% by tuning the hyperparameters of fastai? Or perhaps by using a different underlying model? I note that very little tuning was needed to achieve these results in either framework, but just tweaking the learning rate and dropout does not seem to make a difference.

If a small change in hyperparameters makes this much difference, then either there is a structured, manual process to optimise, or we need to search the hyperparameter space automatically within resource constraints.
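For example, a crude, budget-limited random search over learning rate and dropout could look like this (a sketch only; `build_and_score` is a hypothetical helper that trains a learner with the given settings and returns validation accuracy):

```python
# Budget-limited random search over two hyperparameters (sketch).
import random

def random_search(build_and_score, n_trials=8, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):              # resource constraint: fixed trial budget
        lr = 10 ** rng.uniform(-4, -1)     # log-uniform learning rate
        ps = rng.uniform(0.1, 0.6)         # dropout probability
        acc = build_and_score(lr, ps)      # hypothetical: train and return val accuracy
        if best is None or acc > best[0]:
            best = (acc, lr, ps)
    return best

# Usage: best_acc, best_lr, best_ps = random_search(build_and_score)
```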


Choosing or designing a model architecture is itself a form of hyperparameter.

Your observations are pointing in the right direction. If you are more interested in the subject, you can take a look at section 11.4 (p. 422) of this book chapter: http://www.deeplearningbook.org/contents/guidelines.html

Google Brain developed AutoML to tune hyperparameters automatically (and presumably optimally): https://research.googleblog.com/2017/11/automl-for-large-scale-image.html

But to start with, I recommend playing with different hyperparameters yourself to see their effect on different problems. Participating in various Kaggle competitions is a great way to improve at this tuning game.

I agree that the architecture is a hyperparameter. However, in this case I have used VGG in both, and the gap is not just a fraction of a percent but from 94% to 98%, so there is surely something more to this.

If it is caused by a different design decision in fastai versus Keras, then that decision should be parameterised so that it can be included in manual/automated tuning. It does not seem sensible to have to try out a model in both Keras and fastai because one may happen to be better than the other in some circumstances.


Hi @simoneva, are you referring to 98% train or validation accuracy? I remember participating in that competition, and it was notoriously hard to beat the benchmark (just predicting the average frequency of each type of fish).

Validation accuracy. You can see the v1 result of 98.8% validation accuracy here:

I see, thanks. There was a big problem with that dataset: it contained numerous similar-looking images, which appear to be sequential frames from webcams or something similar. As a result, some of those images ended up in the training set while other, very similar-looking ones landed in the validation set, inflating the validation score. Unfortunately, all those models performed very poorly on unseen images.

Ah yes, I seem to recall that some images were from the same boat. Nevertheless, the same data is used in both the v1 and v2 example notebooks, and it appears to perform better with the v1 model. Assuming there is some bias, I do not see any obvious reason why the bias would differ between v1 and v2.


To tell the truth, I wouldn’t bother with this dataset at all. It’s too “dirty”. Without strong pre-processing of the data (separating similar images into groups, as sketched below), all metrics are useless, as they are overly optimistic due to the leakage of training-set data into the validation set.
Cats & Dogs seems to be much better in this regard (I didn’t use it personally, though).
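For what it’s worth, the “group similar images before splitting” idea could be sketched like this (assumptions: the `imagehash` and scikit-learn packages, a hypothetical data/fisheries/train layout, and grouping by identical perceptual hash, which only catches near-identical frames; a stricter version would cluster hashes within a small Hamming distance):

```python
# Group near-duplicate frames by perceptual hash, then split by group so a
# group never spans both train and validation (sketch only).
from pathlib import Path
from PIL import Image
import imagehash
from sklearn.model_selection import GroupShuffleSplit

paths = sorted(Path('data/fisheries/train').rglob('*.jpg'))   # hypothetical layout
groups = [str(imagehash.phash(Image.open(p), hash_size=8)) for p in paths]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, valid_idx = next(splitter.split(paths, groups=groups))
train_files = [paths[i] for i in train_idx]
valid_files = [paths[i] for i in valid_idx]
```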