Live coding 8

Yeah that’s what I’ve been doing so far tbh. I could probably add it to my startup script on the host side and apply it to the folders I mount into the guest container before firing it up. The issue arose when I git cloned a repo onto the mounted disk: the container wouldn’t see it until I stopped the container, chmod-ed on the host side and relaunched. So, if there are any pointers on that I’d appreciate it (nothing fully worked out, just a general pointer in that direction).

P.S. I found this article which sort of explains what I need to do, but it assumes building things up from a Dockerfile (RUN etc.), whereas I actually just run the vanilla container right off of the Paperspace registry… (maybe I can just add these addgroup directives to my run command on the host side)

I have downloaded the latest version of timm. After importing it I can show a list of models (timm.list_models()).
When I try to create a learner using
learn = vision_learner(dls, 'convnext_small_in22k')
I get an error:
NameError: name 'timm' is not defined
at line 169: model = timm.create_model(arch, pretrained=pretrained, num_classes=0, in_chans=n_in)

Any suggestions?

Is there a way to clear out any CUDA memory? I’m using the same server and want to try out a few other architectures, but am getting an error that my memory is full (even though I’m not training anything right now).

Ok, you can simply restart the kernel and that’ll do the job. If anyone knows a command we can run, please do share. :slight_smile: Thanks.
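For reference, one partial option with PyTorch (it only releases cached blocks, not tensors you still hold references to; the learner name here is assumed from the earlier cells):

import gc, torch

del learn                  # drop references to big objects first
gc.collect()               # let Python reclaim the freed objects
torch.cuda.empty_cache()   # release cached blocks back to the GPU driver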

Can’t remember on which one, but this was covered pretty well in one of the recent walkthroughs. The most likely reason is that you might have forgotten to import timm. Python is basically telling you that a name called timm is not available.

If you were able to run timm.list_models(), then this error should not really show up. It could be the case that you restarted the kernel, but did not run the cell with the timm import again?
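For example, a minimal sanity check in the notebook (dls as built in the walkthru) would be:

import timm                      # must be re-run after every kernel restart
from fastai.vision.all import *

print(timm.__version__)          # confirm timm is importable in this kernel
learn = vision_learner(dls, 'convnext_small_in22k', metrics=error_rate)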

Restarting the notebook (and any other notebooks that might still be running & holding on to GPU memory) is a sure way to clear allocated CUDA memory.

If it’s the only notebook you’re running and you keep running into this error even after restarts, then you need to decrease the batch size so that a batch fits in GPU memory.
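For example, with the from_folder pipeline used later in this thread, the batch size is just the bs argument (32 here is an illustrative value; the default is 64):

dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, seed=42,
                                   item_tfms=Resize(224), bs=32)  # smaller bs, less GPU memory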

Talking more about Docker problems here feels like hijacking this topic, so maybe we should continue the conversation elsewhere, but if you want to see a sample directory/structure I’ve set up then you can follow it here.
This one points to the custom entrypoint I have for the uid:gid switch logic, but the directory also contains a bunch of supporting files around it (eg. Dockerfile, compose, makefile etc.)
https://github.com/suvash/nixos-nvidia-cuda-python-docker-compose/blob/main/05-files/bin/entrypoint.sh

Hopefully, this points you in the right direction. Let me know if you have more questions. :raised_hands:

Thanks Suvash, but no luck. timm.list_models() works correctly but the error still occurs.

That sounds odd. Can you share an example of the failure, maybe the whole failing notebook (in a GitHub gist)?

Thanks Suvash, you were right about it sounding odd. Reinstalling timm as mentioned at the start of walkthru 9 did the job.
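For anyone hitting the same thing, that reinstall is just a notebook cell along these lines (walkthru 9 may pin a specific pre-release version):

!pip install -U timm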

Correct me if I’m wrong: linking the predictions (idxs) to file names can be tricky using get_preds, as this method cannot return the actual file names. The decode method can only trace things back to image objects, not file names. If we could somehow obtain the file name from get_preds, there would be no need to sort the test images before calling this method, an approach which is susceptible to a linking error like the one that happened during the first kaggle submission.

We could simply loop through each image and call get_preds to keep track of file names, but this might increase the inference time.

Or we could link idxs with file names in the dls, provided the dls doesn’t shuffle files in the first place.

Or, I don’t know, are there best practices for linking predictions to actual input files?

Any ideas?

I replicated the parallel execution example at [19:17] while I had dmon running, but dmon didn’t show even a blip of activity - i.e. sm=0 mem=0. Is parallel running not monitored? Or did I accidentally start a CPU-only instance, which begs the follow-up question…

  • Is there some way from within an instance to tell what instance-type is running?

As I understand it, parallel does things on the CPU side. On the instance-type question, I would probably just check the number of CPUs and check for the availability of a GPU (nvidia-smi) and extrapolate from that. I’m not sure if Paperspace provides a command for this. Sometimes the /proc filesystem has interesting system-related information in it (on Linux OSes).
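A rough sketch of that from inside a notebook (assuming PyTorch is installed; Paperspace may well provide something nicer):

import os, torch

print('CPUs:', os.cpu_count())
if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_name(0))   # which card the instance has
else:
    print('No GPU visible - possibly a CPU-only instance')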

[Edit:] Whoops, that was meant to reply to OP by Mattr.

Different from what you asked, but you reminded me that I always liked delete-inner and delete-around. i.e. starting with… aaa "b|bb" ccc

  • <esc>di" ==> aaa "" ccc
  • <esc>da" ==> aaa ccc

The items attr of a dataset will contain the file names in the order used in the dataloader.

Thanks Jeremy. It worked. I combined dl.items and idxs and then used a mapping dictionary to add the labels column. Perfect, this is what I wanted to achieve.

No need to worry about a linking mismatch. Cool.
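For anyone else reading along, the combination looks roughly like this (learn, tst_dl and dls as built earlier in the thread):

import pandas as pd

probs, _, idxs = learn.get_preds(dl=tst_dl, with_decoded=True)
mapping = dict(enumerate(dls.vocab))                 # idx -> label name
labels = pd.Series(idxs.numpy(), name='label').map(mapping)

# tst_dl.items holds the file paths, in the same order as the predictions
df = pd.DataFrame({'image_id': [p.name for p in tst_dl.items],
                   'label': labels})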

A more detailed note on walkthru 8

00:00 starting with question and answer session

How to get things set up on a local machine?
04:44 - How to set up kaggle on a local machine?

09:47 - Setting up to run on your own GPU server locally and remotely

pathlib

14:17 What is pathlib? From where do we import this library? How do we use Path(), and what can we pass in as parameters? (I need to experiment in a notebook and visualize examples myself)
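A tiny sketch of the bits used in the walkthru (plain pathlib; fastai re-exports Path and adds extras like .ls() on top):

from pathlib import Path

path = Path.home()/'paddy'                  # '/' is overloaded to join path segments
print(path, path.exists())
for p in (path/'train_images').iterdir():   # iterate the directory contents
    print(p.name, p.stat().st_size)         # file name and size in bytes
    break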

check multiple file sizes in normal and parallel ways

15:40

How to time a program in jupyter cell?

Explore fastcore.parallel

16:50

When parallel makes a difference
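A minimal sketch of timing the file-size check serially vs with fastcore.parallel (trn_path assumed from the walkthru):

from fastai.vision.all import get_image_files
from fastcore.parallel import parallel

files = get_image_files(trn_path)
def size(p): return p.stat().st_size

# in a notebook, prefix each of these with %time to compare:
sizes_serial   = [size(p) for p in files]
sizes_parallel = parallel(size, files, n_workers=8)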

How to write a lambda function inside another function

20:52

Build an ImageDataLoaders from folder and do image transform
22:03

How to pick a better model from TIMM

23:02

How to install TIMM and find exact model names

24:28

Don’t forget metrics when building a vision_learner
28:34

Explore learn.fine_tune

30:52

Explore half-precision floating point

32:11

Why use it?

When to use it?

How to use it?

How much better/faster can it get us?
(jump in time 41:29)

Install the latest TIMM on paperspace

33:56

Explore fit_one_cycle

34:32

What does scheduler do?

Why do we start with a very small learning rate, even for pretrained models?

When and how to increase learning rate?

When and why to decrease learning rate again?

How to increase and decrease learning rate? (in a cycle, through cosine)

How to choose a one-cycle policy?
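For orientation, a minimal call looks like this (the epoch count and lr_max are illustrative, not the walkthru’s values):

learn.fit_one_cycle(3, lr_max=1e-3)   # lr ramps up, then decays along a cosine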

Explore learn.lr_find

41:13

How does lr_find differ from fit_one_cycle?

How to read the graph of lr_find in terms of slope, bottom, the suggested lr, etc.?

Why we shouldn’t pick the bottom point for learning rate?

What are the 4 suggested points for learning rate and the ideas behind them?

When to use or not use default learning rate? and why and how?
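In recent fastai versions the four suggestion points can be requested explicitly; a sketch, assuming the learner from above:

from fastai.callback.schedule import minimum, steep, valley, slide

lrs = learn.lr_find(suggest_funcs=(minimum, steep, valley, slide))
print(lrs)   # a namedtuple with one suggested lr per function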

small points: update learner by lr_find and a second tab

48:20

Will we create a new learner when we run learn.lr_find? yes

Why should we make another copy of the notebook when the original is running? to work on the next thing while the model is training

Explore DataLoaders.test_dl

48:57

What should you do when you want to explore a method but don’t remember which class it belongs to?

What does DataLoaders.test_dl do?

How to apply DataLoaders.test_dl to test dataset?

Explore learn.get_preds

51:46

What’s the difference between learn.predict and learn.get_preds?

How to quickly check all the parameters of learn.predict? shift + tab

How to get the kaggle submission format right

52:47

Let’s check with the kaggle submission sample csv file

How to autocomplete filenames when you do things like pd.read_csv("")? just write something and press tab

Where to look in order to be certain about the kaggle submission format? the sample csv and the kaggle site’s evaluation page

Explore learn.get_preds continued

54:13

How to make learn.get_preds give us the specific label answer instead of the probabilities of all labels for each test item?

What can the with_decoded parameter do to get us the label?

How to access each part when learn.get_preds returns 3 parts?

How to access all the labels of the dataset or DataLoaders? dls.vocab

How to map idxs with label vocab with pandas

55:05

How to turn a list or a TensorBase into a pandas Series and add a name to it with the name parameter?

How to use pd.Series.map? (exploration is needed)

How to check the type of dls.vocab? type(dls.vocab)

How to turn dls.vocab into a list? list(dls.vocab)

How to create a dictionary on dls.vocab? 57:53

How to use a dictionary with pd.Series.map (a neat trick of Jeremy’s that no one knows)? 59:46

Why are we not advised to use a function or lambda with pd.Series.map?

How would we do a lambda with pd.Series.map anyhow?

How to add our prediction results into the kaggle submission csv file?

Visually check results and submission format

1:00:17

What does Jeremy normally do for checking results for correctness? learn.show_batch()

How to turn the final result pandas DataFrame into a csv file? ss.to_csv("subm.csv")

How to check the format correctness? do it in terminal with ! head subm.csv

What can index=False do for our submission format?
ss.to_csv("subm.csv", index=False)

How to submit with kaggle CLI

1:02:11

How to use kaggle -h, kaggle competitions -h, kaggle competitions submit -h to learn the command we need to use?

What to do when test dataset get shuffled

1:06:15

Where did the dataset get shuffled? get_image_files()

Can we just sort our files to have the same order as the submission file? tst_files.sorted()

When won’t this kind of sorting work?

How does sort() differ from sorted()? inplace or not

What to do when timm is not defined

1:10:23

You can either import timm again or restart the kernel in the notebook

I’ve lost the earlier reference, but at [41:44] Jeremy mentions again the speed advantage of fp16. So I was curious to try both standard 32-bit and 16-bit for my first kaggle submission, and discovered that both submissions got the same score, while the former was five times slower than the latter…

I’ve double-checked that I didn’t accidentally upload the same subm.csv. Downloading both from kaggle shows some differences in individual lines. For reference, this was the code (excluding imports)…

download = Path('/storage/download/paddy-disease-classification.zip')
Path.BASE_PATH = path = Path.home()/'paddy'
if not download.exists():
    !cd /storage/download && kaggle competitions download -c paddy-disease-classification
if not path.exists():
    nfiles = !unzip -l {download} | wc -l
    !unzip -o {download} -d {path} | pv -l -s {nfiles[0]} > /dev/null
trn_path = path/'train_images'
tst_files = get_image_files(path/'test_images').sorted()

dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, seed=42, item_tfms=Resize(224))
tst_dl = dls.test_dl(tst_files)
# Alternative #1 Full precision
learn32 = vision_learner( dls, 'convnext_small_in22k', metrics=error_rate)
learn32.fine_tune(10)
preds = learn32.get_preds(dl=tst_dl, with_decoded=True)
epoch train_loss valid_loss error_rate time
0 1.424497 0.746180 0.246997 01:58
epoch train_loss valid_loss error_rate time
0 0.760931 0.472575 0.160019 05:43
1 0.598440 0.365380 0.123498 05:41
2 0.450244 0.281377 0.089861 05:41
3 0.339465 0.225487 0.072561 05:40
4 0.254239 0.203521 0.064873 05:40
5 0.191238 0.171012 0.048534 05:40
6 0.145411 0.162492 0.049015 05:40
7 0.118857 0.153959 0.046132 05:40
8 0.096845 0.149728 0.040846 05:40
9 0.095460 0.150064 0.042287 05:40
# Alternative #2 Half precision
learn16 = vision_learner( dls, 'convnext_small_in22k', metrics=error_rate).to_fp16()
learn16.fine_tune(10)
preds = learn16.get_preds(dl=tst_dl, with_decoded=True)
epoch train_loss valid_loss error_rate time
0 1.411435 0.697065 0.234022 00:45
epoch train_loss valid_loss error_rate time
0 0.738877 0.465542 0.156175 00:56
1 0.575676 0.368840 0.121096 00:56
2 0.432381 0.268040 0.088419 00:56
3 0.339179 0.228899 0.069678 00:56
4 0.250381 0.193647 0.060067 00:56
5 0.213334 0.171387 0.047573 00:56
6 0.144941 0.151021 0.041326 00:56
7 0.124611 0.144137 0.040365 00:56
8 0.099918 0.135473 0.037963 00:56
9 0.092904 0.135797 0.038443 00:56

The following cell was run for both alternatives:

probs,_,idxs = preds
idxs = pd.Series(idxs.numpy(), name="idxs")

mapping = {k:v for k,v in enumerate(dls.vocab)}
results = idxs.map(mapping)

ss = pd.read_csv(path/'sample_submission.csv')
ss['label'] = results
ss.to_csv('subm.csv', index=False)

!kaggle competitions submit -f subm.csv \
    -m 'initial convnext-small 10 epoch ft sortedXX' paddy-disease-classification

It’s been a while since I’ve done any serious bash automation, so apart from the ML, I’ve been having lots of fun streamlining my pre-run.sh to be able to recreate my paperspace environment after deleting ~/.local and ~/.conda. But it’s now time to put that down, or risk “ongoing development”…

I don’t expect to need to change anything in future, except the four leading variables. I’m not sure it really simplifies things, since it comes with its own complexity, but it was gratifying to get it working.

#!/usr/bin/env bash
# set -x
PERSIST_DIRS="  .local  .conda  .ssh  .kaggle "
PERSIST_FILES=" .bash.history  .bash_aliases  .git.config "
PIP="   kaggle  timm>=0.6.2dev"
CONDA=" universal-ctags  unzip  fzf  pv "

# Ensure persistent storage configuration folder exists
mkdir -p /storage/cfg

# User folder is wiped when machine restarts.
# Restore links from user folder to persistent storage.
# The `mv` is only executed on first run to capture factory setup.

for dir in $PERSIST_DIRS ; do
    [ ! -d /storage/cfg/$dir ] && mkdir -p ~/$dir && mv ~/$dir /storage/cfg/$dir
    ln -sf /storage/cfg/$dir ~/$dir
done
chmod 700 /storage/cfg/.ssh

for file in $PERSIST_FILES ; do
    [ ! -f /storage/cfg/$file ] && touch ~/$file && mv ~/$file /storage/cfg/$file
    ln -sf /storage/cfg/$file ~/$file
done

# Install PIP and CONDA packages into user home.
# Note $? is 1 when grep doesn't find package in list

export PATH=~/.local/bin:~/.conda/bin:$PATH

for pkgver in $PIP ; do
    pkg=$(echo $pkgver | sed 's/[<>=].*//') # strips off version e.g. ">=0.2"
    [ $(pip list --user | grep -q $pkg; echo $? ) -eq 1 ] \
          && pip install -U --user $pkgver
done

for pkgver in $CONDA ; do
    pkg=$(echo $pkgver | sed 's/[<>=].*//')
    [ $(conda list -p ~/.conda | grep -q $pkg; echo $? ) -eq 1 ] \
          && conda install --yes -p ~/.conda -c conda-forge $pkgver
done

I’m not sure I understand how kaggle calculates the score.

Does that mean that for the current score they only mark a random 75% of the lines from our submitted subm.csv, and for the final standings they ignore that 75% and use the 25% they never marked before? Is it the same “random set” of items for every participant?
