Practice walk-thru 6 / chp1 on Kaggle!

radek · June 7, 2022, 9:12am

Hey!

Using Paperspace, I put together a repository that I think might be a fun way to practice what we learned in the walk-thru today.

It provides starter code for submitting to the Paddy Doctor community competition, which is currently underway.

If you are interested, I suggest you use your git-fu and fork the repo to store your work! This way, you can practice using git AND can also potentially contribute to my repo, if that is something you would like to do.

Git is covered, so now for Kaggle. Here are the steps to install and configure the Kaggle CLI on Paperspace (and go all the way to submitting to the competition):

pip install --user kaggle
I have no clue why I get this warning on Paperspace, but with our PATH modification skills we can fix it! (We need to fix it by modifying pre-run.sh, and not bash.local, so that Jupyter notebooks can see the change as well.)
error: WARNING: The script slugify is installed in '/root/.local/bin' which is not on PATH.
fix: run from console or edit pre-run.sh manually
echo 'export PATH="$HOME/.local/bin:$PATH"' >> /storage/pre-run.sh
Now this next part is tricky. You need to go to Kaggle, generate an API token under your account, and copy it to ~/.kaggle/kaggle.json on Paperspace. You might need to create the kaggle.json file yourself; it is just a text file. If you have issues with this, please post in the thread below.
Finally, fork my repo from here.
Run the notebook and submit to kaggle!
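
If creating kaggle.json by hand is awkward, the token step can also be done from Python. This is only a minimal sketch: `write_kaggle_token` is a hypothetical helper (not part of the Kaggle CLI), and the username and key below are placeholders for the values in the token you download from your Kaggle account page. The demo writes into a temporary directory so it does not touch your real home directory.

```python
import json
import os
import stat
import tempfile
from pathlib import Path


def write_kaggle_token(username, key, home=None):
    """Write Kaggle API credentials to <home>/.kaggle/kaggle.json (hypothetical helper)."""
    home = Path(home) if home is not None else Path.home()
    kaggle_dir = home / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)
    token_path = kaggle_dir / "kaggle.json"
    # The token is a small JSON file holding a "username" and a "key".
    token_path.write_text(json.dumps({"username": username, "key": key}))
    # Owner read/write only; the Kaggle CLI warns if other users can read the file.
    token_path.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return token_path


# Demo with placeholder credentials in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_home:
    path = write_kaggle_token("your-username", "your-api-key", home=tmp_home)
    print(path.name, oct(stat.S_IMODE(os.stat(path).st_mode)))
```

To write the real file, call `write_kaggle_token(...)` without the `home` argument so it targets ~/.kaggle/kaggle.json.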

Working on the competition can be a fun way to learn and to get started with Kaggle. It is a community competition, so there are no prizes or points, but the excitement of climbing up the leaderboard and trying out new things can certainly be there!

There are a lot of things one could try on this dataset:

To what accuracy can you train the current model? (resnet34)
Can you find a different architecture that will work better?
How do you pick a good learning rate? Will the learning rate differ if you use SGD vs Adam?
What augmentations can you use to improve results?
Can you experiment with different image sizes? How does that affect results?
Can you combine predictions from two archs?
Can you train with lr decaying? Say for each subsequent epoch take 0.9 of previous lr.
Can you reduce the learning rate if you don’t see improvements for x number of epochs?
Can you stop training if the model starts to overfit? (early stopping)
How do you hook up the model to weights & biases?
How do you train with parameter sweep (to find good hyperparams) using w&b?
Can you leverage fastai functionality (low level functions) to create a PyTorch dataset and feed it to the fastai learner?
Can you use lower level fastai functionality to arrive at fastai DataLoaders (how to construct dataloaders without using ImageDataLoaders or ImageBlock)?

The list of questions could go on, and on, and on.
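
To make two of the questions above concrete (decaying the lr to 0.9 of the previous epoch's value, and early stopping), here is a minimal plain-Python sketch of the logic only. The function names are made up for illustration and this is not how fastai implements it; as far as I know, fastai covers similar ground with its tracker callbacks such as `EarlyStoppingCallback` and `ReduceLROnPlateau`.

```python
def decayed_lrs(base_lr, n_epochs, factor=0.9):
    """Learning rate per epoch: base_lr, base_lr*factor, base_lr*factor**2, ..."""
    return [base_lr * factor ** epoch for epoch in range(n_epochs)]


def early_stop_epoch(valid_losses, patience=2):
    """Return the epoch index at which training would stop, or None if it never stops.

    Training stops once the validation loss has failed to beat its best
    value for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(valid_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up numbers, purely for illustration.
print(decayed_lrs(1e-3, n_epochs=5))
print(early_stop_epoch([1.0, 0.8, 0.9, 0.85, 0.7], patience=2))
```

Reducing the learning rate on a plateau follows the same pattern as `early_stop_epoch`: instead of returning when patience runs out, you would shrink the lr and reset the counter.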

In the notebook, I tried to leave clues about how I went about getting answers to some of my questions while working on it.

Maybe, if there is interest, we could work together towards answering the questions from my list above? And Jeremy could comment on some things we could do better?

I am not so much interested in what the answers are as in the method of arriving at them. For instance, I genuinely didn’t know where to look for the mapping from idxs to classes (this simply slipped my mind). I guess the answer probably is that I should re-read the dev nbs for fastai? I jumped around in the code but didn’t manage to steer myself to where this is defined.

I am not sure, but maybe some of the questions are not really suited to part 1 of the course.

Anyhow, maybe someone will find this useful.

jeremy · June 7, 2022, 10:00am

Might be fun to look at this tomorrow…

Daniel · June 8, 2022, 4:19am

Thanks for starting this post! I am looking forward to seeing the discussion grow here.

I do feel some of the questions above are somewhat at part 2 level, but they are all good and there is no harm in thinking about them. I also have a list of questions which may be too novice, but I still want to know what you and others think of them and how they can be answered.

Questions that come up first

When I read the competition and dataset descriptions, here are the questions that pop into my mind first:

How do paddy experts use metadata (age and variety of normal and diseased crops)? And how should we?
How do paddy experts utilize these two sets of images? And how do they use age and variety to help in the classification process?
Or should we ignore how human experts classify altogether?
With the Zeiler and Fergus images in the back of your mind, what would you imagine the features of your model could look like?
How should we use the two sets of images (visual and infrared)?
Since there are more visual images, does it make more sense to train the model on visual images before training on infrared ones?
How do we build a model that makes use of metadata for classification?
Or in how many ways can we build a model to use it?
Will the different number of images in different categories affect the performance of the model?
What do you usually do about the unequal number of images in each category when training the model?
Did the competition curators intentionally keep the unequal number of images this way?
Or is it simply because some diseases (categories) occurred more frequently or over larger areas in the years when the curators collected the data?
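
On the unequal class counts, one common option (an assumption about what might help, not something the competition or the starter notebook prescribes) is to weight each class inversely to its frequency, for example when computing the loss. A minimal sketch with made-up labels; `inverse_frequency_weights` is a hypothetical helper, and the formula is the same "balanced" heuristic scikit-learn uses for class weights:

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Map each class to n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up label list: class "a" is six times as common as class "c".
labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
weights = inverse_frequency_weights(labels)
for cls in sorted(weights):
    print(cls, round(weights[cls], 3))
```

Rare classes get weights above 1 and common classes get weights below 1, so mistakes on rare classes count for more.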

The Kaggle process

As for the process or steps for tackling a Kaggle competition, Jeremy has talked about it in the lectures, so I will later review the lecture notes on how Jeremy tackles Kaggle. For actual practice, I wonder how people implement those steps in this Kaggle competition.

How does Radek tackle this competition step by step from scratch?
What’s other people’s Kaggle competition working process?

Daniel · June 8, 2022, 4:22am

By the way, what do you mean by walk-thru6 / chp1? @radek

radek · June 8, 2022, 5:16am

walk-thru 6: yesterday’s walk through
chp1: chapter 1 from fastbook

This is a very high level (but potentially useful) framework for approaching a Kaggle competition:

twitter.com

Radek Osmulski 🇺🇦

@radekosmulski

How to win @kaggle:
✅ join a competition early
✅ read forums daily
✅ make small improvements every day
✅ find validation split that tracks LB
✅ posts by top kagglers will take you 80% of the way
✅ papers, blog posts, creativity 👉 remaining 20%
✅ ensemble results
https://t.co/dwLHQl2z1F

Radek Osmulski 🇺🇦 @radekosmulski

1yr ago I gave up on ML. I didn't know what to learn nor how After a 5 mths break I decided to give ML one last try. If it would not work out I would need to let it go to not continue to waste my time - maybe I am unable to learn this I then signed up for the @fastdotai course pic.twitter.com/wjORNbkctx

5:32 AM - 3 Dec 2020

Frankly speaking, I wouldn’t worry too much about all this. The motions Jeremy is showing us in the walk-thru will take you a very long way. In some sense, everything above is a step improvement.

With a CV competition such as this, it might be fun to see how well you can train a single model and then possibly to ensemble predictions from 2 or 3 models, just to see how it goes.
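
As a rough illustration of what ensembling predictions can mean in its simplest form, here is a plain-Python sketch that averages per-class probabilities from two models and takes the highest-scoring class. The probabilities are made up, and `average_predictions` and `argmax` are hypothetical helpers rather than fastai functions; with real models you would feed in the predicted probabilities for the same images in the same order.

```python
def average_predictions(*model_probs):
    """Average per-class probabilities from several models, image by image.

    Each argument is a list of rows (one row per image), and each row is a
    list of per-class probabilities. All models must use the same class order.
    """
    n_models = len(model_probs)
    return [
        [sum(values) / n_models for values in zip(*rows)]
        for rows in zip(*model_probs)
    ]


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up probabilities for one image and three classes.
model_a = [[0.5, 0.25, 0.25]]
model_b = [[0.0, 0.75, 0.25]]

ensemble = average_predictions(model_a, model_b)
print(ensemble)
print([argmax(row) for row in ensemble])
```

Here model A alone would pick class 0, but the averaged probabilities favour class 1, which is the kind of disagreement an ensemble is meant to smooth out.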

I see the objective here as learning, and the ideas I listed in my OP (original post) can potentially be quite useful in a competition while also providing a way to dive deeper into fastai.

miwojc · June 8, 2022, 9:50am

Today I learned from Radek’s notebook that you can use some bash commands in a notebook without the ! or % magic.

Here is the explanation from this fantastic book: IPython and Shell Commands | Python Data Science Handbook

Besides %cd, other available shell-like magic functions are %cat, %cp, %env, %ls, %man, %mkdir, %more, %mv, %pwd, %rm, and %rmdir, any of which can be used without the % sign if automagic is on. This makes it so that you can almost treat the IPython prompt as if it’s a normal shell:

(screenshot: an IPython session using shell-like commands without the % prefix)

Daniel · June 8, 2022, 12:44pm

Great advice! Thanks Radek!

zymoide1 · June 9, 2022, 2:18am

This is very exciting! Thanks for putting this together for us. I hope to start working on this either tmrw or on Friday. These walkthrus have been challenging, but these types of tasks (along with Jeremy’s excellent explanations) will prove fruitful.

radek · June 9, 2022, 12:14pm

Wow, I think English as a second language has tripped me up here. I thought “step” meant tiny improvement. That is what I wanted to say.

The “basics” can take you a very long way. In some sense, “basics” are all there is. Or maybe all this might seem basic, but still most people don’t know it.

Well, doesn’t matter. What I wanted to say: in my experience, what we are learning is what matters, and going from playing around in this competition to effectively participating in a ranked one is just a tiny difference.

Daniel · June 9, 2022, 10:08pm

This is brilliant, can’t agree more!

juanchoalric · July 31, 2022, 3:58am

Hey @radek. I added the timm library to the installs in the pre-run.sh bash script so that I don’t get into trouble while using fastai.
However, I don’t know how to make my notebooks persist. I don’t know why, but Paperspace opens my notebooks in the root directory and I don’t have persistence, as you can see in the image. I can move my notebooks to the /notebook directory, but later I can’t open them.

bencoman · August 1, 2022, 12:56pm

Does your pre-run.sh return the current directory to /notebook?
One way is to use pushd/popd at the start and end of your script.

juanchoalric · August 1, 2022, 5:13pm

Thanks, that was the problem. The pre-run.sh stayed in the root directory.

daveramseymusic · January 19, 2024, 2:26pm

Just went through this with the Data Philly Fast.AI study group last night and it was super useful as I was walking everyone through how to use paperspace. Thanks Radek!