Using fastai in the wild - a report from the Kaggle frontline

Hello all,

I wanted to write about my experience using fastai in the recent TGS Salt competition - a semantic segmentation challenge with a binary mask target. I won a silver medal, finishing 76th of 3234, and found the competition to be a challenging but accessible one for someone tackling their first image and pure deep learning competition.

I also wanted to give an honest appraisal of using fastai for a live Kaggle competition, as I hope others will find the feedback useful. I should say upfront this is the first time I’ve properly used fastai for anything other than the part one lessons, so I accept full responsibility for any shortcomings that were my own (I’m sure it will be evident to those more knowledgeable than me that at least some were), and some of the issues I faced may now be fixed in v1. I was using an older version of fastai with PyTorch <0.4, as once I got things running I was too paranoid to change mid-competition in case it broke anything.

Summary of usage

In the end I went with minimal usage of fastai, using it for the following three things:

  • dataset/dataloader

  • learner object (not conv learner)

  • augmentations: lr flips, zoom, lighting

This means I didn’t end up using fastai for upsampling images or masks, any learning rate schedule policy, discriminative learning rates, or TTA, nor did I use most of fastai’s pretrained models except SEResNext50. I am pretty sure my usage of fastai was not optimal; however, as I had a single GPU it wasn’t possible to test ideas in parallel or always get to the bottom of issues, as all time spent working interactively was time the models weren’t training (and they typically needed 12+ hours!).

Side note on the nature of kaggle competitions

Almost by their very nature, Kaggle competitions are like financial markets in their self-adjusting behaviour - advanced techniques that work out of the box from libraries, or are posted as kernels, quickly become the baseline for most of the stronger participants and everything advances. This means we almost necessarily get pushed down somewhat complicated rabbit holes of quirky things that happen to work in this competition but might not be best practice if you were deploying a deep learning system in the wild. I think this point is more fundamental than it seems. If fastai wants to be something used for winning Kaggle competitions then I think that’s a (not quite orthogonal, but pretty) different aim to using it for most mainstream deep learning projects. This means either extending fastai to handle the peculiarities of a Kaggle competition, or making it easily extensible in all manner of weird, non-mainstream ways. A tricky situation (which I almost stumbled into a few times) is to invest time heavily in a competition with one library only to find that to score highly everyone is doing technique X, and technique X is either not supported by your library or requires a lot of workaround.

I will provide more detail on my observations below.


Reiteration: it may turn out most of these are me being an idiot (which happens far too often), my lack of full familiarity with fastai, or things that are fixed in v1 - all mistakes are my own and I apologise in advance. I offer this feedback in the best possible way - to help improve the project. Furthermore, it might (quite rightly in some cases) be that fastai has no desire to support some of the things I raise below - that in itself is also fine and very useful to know for the future. It’s almost certainly not a good design philosophy for fastai to worry overly about the individual quirks of a Kaggle competition, but knowing where those boundaries are is helpful.

  • Support for k-fold: a reasonably minor issue with a simple enough workaround, but it does raise other issues which crop up in Kaggle competitions, such as dropping (training) data per fold. For example, I was dropping images from the training data whose masks were less than X pixels - obviously I kept them in the validation set. It would be nice to have this handled naturally; it also comes up quite a bit in other, non-image competitions with poor data.
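
As a rough illustration of the per-fold filtering described in this point, here is a hedged sketch (pure numpy; the `min_pixels` threshold and function names are my own, not from the author's pipeline): the filter is applied to the training indices only, so the validation split always keeps every image.

```python
import numpy as np

def kfold_indices(n, k, seed=42):
    """Yield (train_idx, val_idx) pairs for k folds over n items."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

def filter_small_masks(train_idx, masks, min_pixels=10):
    """Drop training images whose binary mask has fewer than min_pixels positives."""
    return np.array([i for i in train_idx if masks[i].sum() >= min_pixels])
```

The key design point is that `filter_small_masks` only ever sees `train_idx`, so the dropped images still contribute to validation scores.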

  • Resizing binary masks: I’m pretty sure using the default OpenCV setting causes upsampled masks to not be just 0 or 1, and @wdhorton kindly suggested cv2.INTER_NEAREST as a solution, which I did outside of fastai (I don’t actually think this was a big issue for performance).
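
To see why this happens: interpolating resizes (bilinear is OpenCV's default) blend 0s and 1s into fractional values, whereas nearest-neighbour just picks an existing pixel, so the mask stays strictly binary. A minimal numpy sketch of nearest-neighbour upsampling (with OpenCV you would call `cv2.resize(mask, (w, h), interpolation=cv2.INTER_NEAREST)` instead):

```python
import numpy as np

def resize_nearest(mask, out_h, out_w):
    """Nearest-neighbour resize of a 2-D mask via index mapping."""
    in_h, in_w = mask.shape
    rows = np.arange(out_h) * in_h // out_h  # source row for each output row
    cols = np.arange(out_w) * in_w // out_w  # source col for each output col
    return mask[rows[:, None], cols]
```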

  • Modifying pretrained models: it turned out removing the first maxpool layer in the resnet architecture was a good idea for this competition. I found it non-trivial to figure out how to do this in fastai, but eventually used the default torchvision models (i.e. dropped the sfs.features stuff) and wrote my own layer group wrapper. This is something I probably found harder due to my lack of knowledge, but it certainly seemed like an extra hurdle. That said, I learned a lot doing it. :grinning:

  • Cosine annealing: a minor point that I’m sure could be accommodated if people think it’s a good idea, but why does cosine annealing drop the learning rate to 0? I would prefer to specify the min and max lr and use the schedule in that manner (note: clr allows a min lr).
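
The schedule being asked for here is just the standard cosine curve with an explicit floor, e.g. (a pure-math sketch, not fastai API):

```python
import math

def cosine_lr(step, total_steps, lr_min, lr_max):
    """Cosine-anneal from lr_max (step 0) down to lr_min (final step)."""
    frac = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * frac))
```

Setting `lr_min=0` recovers the annealing-to-zero behaviour the author is questioning.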

  • Callbacks: once I got my head around how these work I ended up just writing my own for everything I wanted to do, but useful functionality would include things like saving your n best models subject to some specified metric (e.g. not necessarily val loss), in order to average predictions within the fold from each of these. In this competition it was important to monitor the actual competition metric whilst training and use this as your guide (training was very noisy). Also, something I raised in another thread: at the moment it’s not clear on what basis models are being saved (and sometimes there’s an implicit assumption that the first metric passed is accuracy and is being maximized, and models are saved on that basis).
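
The "save your n best models by an arbitrary metric" idea could be sketched as a framework-agnostic callback like the following (`save_fn` stands in for whatever actually writes a checkpoint - e.g. `learn.save` or `torch.save` - and the class name is my own, not a fastai API):

```python
import heapq

class TopNSaver:
    """Keep checkpoints for the n best epochs by a chosen metric."""

    def __init__(self, n, save_fn, maximize=True):
        self.n, self.save_fn, self.maximize = n, save_fn, maximize
        self.heap = []  # min-heap of (score, epoch); worst kept model on top

    def on_epoch_end(self, epoch, metric):
        score = metric if self.maximize else -metric
        if len(self.heap) < self.n:
            heapq.heappush(self.heap, (score, epoch))
            self.save_fn(epoch)
        elif score > self.heap[0][0]:
            heapq.heapreplace(self.heap, (score, epoch))
            self.save_fn(epoch)

    def best_epochs(self):
        """Kept epochs, best first - e.g. to load and average predictions."""
        return [e for _, e in sorted(self.heap, reverse=True)]
```

Making `maximize` explicit also addresses the "is the first metric assumed to be accuracy?" ambiguity the author raises.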

  • Models with multiple outputs: I got this to work on my version by returning a list of outputs from the forward pass, which I then parsed in a custom loss function; however, when I hit learn.predict() I’m pretty sure it only returned the first list item and not them all. I might be wrong on this as it’s something I didn’t pursue, but returning multiple outputs might be needed in other competitions when jointly training models for multiple purposes and then using them upstream.
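
The pattern described - forward pass returns several outputs, a custom loss unpacks them - might look like this in plain PyTorch (a sketch only; the tiny architecture and the auxiliary-loss weight are illustrative, not the author's model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Conv2d(1, 8, 3, padding=1)
        self.mask_head = nn.Conv2d(8, 1, 1)  # per-pixel mask logits
        self.aux_head = nn.Linear(8, 1)      # e.g. an "is mask empty?" logit

    def forward(self, x):
        feats = F.relu(self.body(x))
        # return multiple outputs; a custom loss function parses them
        return self.mask_head(feats), self.aux_head(feats.mean(dim=(2, 3)))

def multi_output_loss(outputs, mask, aux_target, aux_weight=0.5):
    mask_logits, aux_logits = outputs  # unpack the tuple of outputs
    loss = F.binary_cross_entropy_with_logits(mask_logits, mask)
    loss = loss + aux_weight * F.binary_cross_entropy_with_logits(aux_logits, aux_target)
    return loss
```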

  • Having a mask target with 2 channels: this is something I couldn’t figure out in the end (I tried passing masks with 3 channels) without a real hack (creating the 2nd mask channel on the fly inside the forward pass - interestingly, it wasn’t that slow), which I only tried at the last minute and ultimately didn’t use. It was suggested in this competition (at the last minute) that training on the boundary of the mask as an additional task helped performance.

  • Reading images and masks as numpy arrays: almost certainly my own limitation here, but I couldn’t figure out how to read data from numpy arrays for both the images and the masks. I wanted to do this for stacking the models at the end, where my “images” were now probability predictions per pixel from multiple models and the targets were the original masks. Note in this case all the data fit in memory, but that won’t always be the case.
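
For the in-memory case described, a minimal dataset over numpy arrays follows the PyTorch `Dataset` protocol (`__len__`/`__getitem__`), so a torch `DataLoader` could wrap it directly - a sketch with no fastai API assumed:

```python
import numpy as np

class ArrayDataset:
    """In-memory (inputs, targets) pairs - e.g. stacked per-pixel
    probabilities from several models as xs, original masks as ys."""

    def __init__(self, xs, ys):
        assert len(xs) == len(ys)
        self.xs, self.ys = xs, ys

    def __len__(self):
        return len(self.xs)

    def __getitem__(self, i):
        return self.xs[i], self.ys[i]
```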

  • TTA: as far as I am aware it isn’t/wasn’t possible to use TTA straight out of the box with fastai in this competition, so a workaround was needed.
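
The segmentation-specific wrinkle (expanded on later in the thread) is that predictions on a transformed image must have the transform undone before averaging, since pixel positions matter. A sketch of lr-flip TTA for 2-D (H, W) inputs, with `predict` standing in for any model inference function:

```python
import numpy as np

def flip_tta(predict, image):
    """Average predictions over the identity and a left-right flip."""
    p = predict(image)
    p_flip = predict(image[:, ::-1])    # predict on the lr-flipped input
    return 0.5 * (p + p_flip[:, ::-1])  # undo the flip before averaging
```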

  • The notion of an experiment: this is something that I think the fastai community does well informally, but there isn’t a formal way to handle it. In the end I used my own logging and wrappers around the training loop to keep track of the basis of each experiment. I used attrdict for the first time - very neat.
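
At its simplest, the kind of experiment tracking described is just a record of hyperparameters and results serialised per run - a sketch (names illustrative, not the author's setup):

```python
import json
import time

def log_experiment(params, metrics, path):
    """Write one experiment's params and results to a JSON file."""
    record = {"timestamp": time.time(), "params": params, "metrics": metrics}
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
    return record
```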

Just to be clear, most of the above are minor in the context of a brilliant deep learning library; however, some of the issues felt like they could have become acute in the context of a Kaggle competition.

Final comments

Something positive now! Overall I did enjoy interacting with the library and have certainly learned a great deal from trying to delve under the hood - a lot more than if I’d just run it with no changes. I should also say that I fully buy into the fastai philosophy and am really looking forward to taking the Live course starting next week from London. I also believe the world needs more people teaching and working in the manner the fastai community and @jeremy do. There were others from the fastai community taking part, and they almost certainly utilised the library in a more optimal way than I did, so I’d be very grateful to hear what they think - notably @VishnuSubramanian, @wdhorton and @radek.

I hope the above observations don’t sound overly critical, but given how awesome I think fastai currently is/has the potential to become, I think it’s crucial to not just get the good (but often echo chamber) feedback in order to become stronger.

Update: Oh, I should also say that I recognise fastai doesn’t develop itself, so if I were able to help contribute to the library in any way I’d be delighted to (though I’m not a developer by trade, I am focusing a fair bit on trying to up my Python skills).

Happy to take feedback myself!



Thank you for sharing your thoughts Mark :slight_smile: Was great to read the above as well as your comments on Kaggle.

Thank you @maw501. Congrats on the competition!

Really nice report!

Thanks, bookmarked :wink:

Super helpful post @maw501! Currently working through the DL1 course and was curious to see if anyone had actually used fastai for any active Kaggle competitions.

I do have a question re: TTA. Did you receive an error when you tried to use it, or was there another reason why it wasn’t possible to use it for the TGS Salt competition?

Regarding your Callbacks point and The notion of an experiment point - I wanted to recommend our platform for saving your modeling work as experiments. I work on the team as Product Lead and have personally been using it to track the code, hyperparameters, results, etc. for the different iterations of models I produce during the course.

You can check out the layout/setup with one of our public projects.

Hi @ceceshao1 - thanks for your comments. I will take a look, though I have gone down the path of doing things myself for now - will see how I get on (it’s more for my learning than anything).

Re. TTA: it’s because the version I was using didn’t support it for image segmentation, per here. Note for a classification problem it’s fine, as a label of cat is still a cat when you (for example) rotate it, but for pixel-level predictions you need to know the transform and be able to sensibly reverse it. As such, people only did lr flips as far as I know (but you couldn’t determine in the TTA preds which images were flipped). Things might have changed now in v1.

Hi @maw501,

Jakub from neptune.ml here.
I have just added a simple callback that lets you monitor fastai training in Neptune to our neptune-contrib library. I explain how it works in this blog post, but basically, with no change to your workflow, you can track code, hyperparameters, metrics and more.

Before you ask, Neptune is now open and free for non-organizations.
Read more about it on the docs page to get a better view.