Faster experimentation for better learning

I find it extremely helpful - especially when learning something new - to be able to quickly see what happens when I try this or that.

I am now going over the lesson 1 notebook again, but to speed things up I am running everything on a small subset of the data (200 cat images and 200 dog images - still probably way more than I need).

Not really sure if anyone would be interested in this, but just in case, here are the instructions (BTW, this uses a slightly modified version of an awesome command @jeremy shared on Twitter a couple of weeks ago):

  1. From the directory of your notebook (the one from which the data folder is accessible), run the following:
mkdir -p data/dogscats_sample/{valid,train}/{cats,dogs}
shuf -n 200 -e data/dogscats/train/cats/* | xargs -i cp {} data/dogscats_sample/train/cats
shuf -n 200 -e data/dogscats/train/dogs/* | xargs -i cp {} data/dogscats_sample/train/dogs
shuf -n 100 -e data/dogscats/valid/cats/* | xargs -i cp {} data/dogscats_sample/valid/cats
shuf -n 100 -e data/dogscats/valid/dogs/* | xargs -i cp {} data/dogscats_sample/valid/dogs
  2. In your notebook, change the path to PATH = "data/dogscats_sample/"
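
To sanity-check the copy, you can count the files in the sample - with the numbers above this should print 600 (2 × 200 train + 2 × 100 valid):

find data/dogscats_sample -type f | wc -l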

The awesome command @jeremy shared on Twitter was this (note the mv - moving is what you normally want when creating the train / valid / test splits, whereas above we cp so the original data stays untouched):

shuf -n 5000 -e all/*.* | xargs -i mv {} all_val/

BTW not sure if anyone would find this useful, but I really like toy examples for getting a grasp on things. I was thinking of maybe training the original LeNet on CIFAR10 or MNIST using fastai. Would anyone be interested in a notebook walking through the techniques we learn applied to such toy examples?

31 Likes

@radek, I needed to see this today.

I was banging my head against the wall last night, having spent 4+ hours running two notebooks on two different instances (Crestle - planets; AWS - dog breeds). I never got as far as a submission in either because of small problems that resulted in restarts, after 45+ minute training runs, then more restarts... And suddenly it’s after midnight and I am no further than when I started :cry:

So this will be my new strategy tonight, getting things to work on a sample set before committing to the entire dataset!

And yes, I think a notebook walk-through would be helpful, particularly for the Part1-V1-beginner forum.

Thanks for all your great posts – I appreciate your contributions to the community!
Maureen

6 Likes

Thank you very much for your kind words @memetzgz and glad this might be of help :slight_smile:

BTW, in case anyone feels ‘oh, this is deep learning - with this little data I will not do anything interesting’, these are the results from training the first model :slight_smile:

Also, on a related note, I know of this one silly dude on the Internet who applied the techniques from the first lesson of part 1 v1 and trained on just 3 images of cats and 3 images of dogs :slight_smile: And it worked!

9 Likes

Any ideas on how to create a sample dataset when we are using the from_csv method rather than storing the images in separate directories?

Looking at the dataset.py file, I don’t see a way to select a subset of images rather than the whole lot of them in the from_csv method. Also, I would imagine one would want a random selection of images, so we’d need to shuffle, but the get_ds function doesn’t have a shuffle parameter like get_dl does.

If I were a better coder, I’d code up a function to do this, but I’m still a rookie – I’m still getting lost just looking at the code :crazy_face:

1 Like

I am getting lost as well to be honest :slight_smile:

I think your observations with regards to from_csv are spot on. I do not know how images are stored in csv files, but I would imagine each image gets its own line - this could be verified by doing a line count on the csv file; something like wc -l file.csv should do the trick, but I'm not sure :slight_smile:

If it is indeed the case that 1 file == 1 line in the csv file, then there must be some way to select a subset of lines from the terminal, or one could write a dumb Python script that goes through the file and randomly keeps a line, for example if np.random.rand() > 0.8 (or whatever method exists for generating floats from the uniform distribution [0, 1)) - this would be equivalent to randomly grabbing ~20% of the lines.

We could then save whatever lines we decide to keep back to csv and read it using the fastai api without any problems :slight_smile:
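
A minimal sketch of that idea (the file names here are made up for illustration; assumes the csv has a header row):

import numpy as np

# copy ~20% of the data lines at random, keeping the header
with open('data/labels.csv') as f_in, open('data/labels_sample.csv', 'w') as f_out:
    f_out.write(f_in.readline())    # header line
    for line in f_in:
        if np.random.rand() > 0.8:  # each line kept with probability ~0.2
            f_out.write(line)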

EDIT: Instructions on how to do something like this from terminal :slight_smile:

1 Like

Pandas has a sample() method IIRC.

6 Likes

Thanks, @jeremy!

So would it be as simple as reading the csv file into a DataFrame, using the pandas sample method to select a random sample for train, validation, and test, saving those back to CSVs, and then feeding those filenames as parameters in the .from_csv call within the get_batches function? Am I on the right track at least?

Sounds about right to me :slight_smile: I kept thinking CSVs with images are somehow special but I guess they are just…CSVs :wink:
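
Something along these lines, perhaps (a minimal sketch - the file names and the 20% fraction are just placeholders):

import pandas as pd

df = pd.read_csv('data/labels.csv')
sample_df = df.sample(frac=0.2, random_state=42)  # random 20% of the rows
sample_df.to_csv('data/labels_sample.csv', index=False)

You would then point from_csv at labels_sample.csv instead of the full file.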

1 Like

Okay I will give it a try :slight_smile:

1 Like

They’re not really CSVs with images. Just filenames, and labels. So nothing special at all…

2 Likes

It might be a useful feature to add for from_csv to be able to take in a pandas DF that was created in the notebook rather than currently only accepting a path to a CSV file. That way you could more easily manipulate the training set using pandas functions without needing to save to CSV each time.

What do you guys think?

2 Likes

That’s a great idea!

1 Like

Cool let me see if I can figure out how to do it and I’ll submit a PR :grin:

2 Likes

Ideally you’d factor out the common stuff from from_csv so that your new from_df and from_csv would be only a line or so of code each…
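
A rough sketch of the shape that refactoring could take (everything here is hypothetical illustration, not the actual fastai internals):

import pandas as pd

class ImageData:
    # hypothetical stand-in for the real class
    def __init__(self, path, fnames, labels):
        self.path, self.fnames, self.labels = path, fnames, labels

    @classmethod
    def from_df(cls, path, df):
        # all the shared parsing / dataset-building logic lives here
        return cls(path, df.iloc[:, 0].values, df.iloc[:, 1].values)

    @classmethod
    def from_csv(cls, path, csv_fname):
        # from_csv becomes a one-liner delegating to from_df
        return cls.from_df(path, pd.read_csv(f'{path}/{csv_fname}'))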

I love the conciseness of the code to create the directory structure and copy/move the examples!

Question: How could you put this in a notebook given that in python 3.6, it looks at the { } as placeholders for string interpolation?

1 Like

Use double exclamation marks at line start: !!mkdir ... :slight_smile:
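
If the double-bang route gives you trouble, another option is the %%bash cell magic - the cell body is handed to bash untouched, so the braces need no escaping at all:

%%bash
mkdir -p data/dogscats_sample/{valid,train}/{cats,dogs}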

1 Like

Ah geez … I was trying everything in the python universe to escape the { to no avail! Never thought it might be a Jupyter notebook thing!

Thanks much!

1 Like

Thank you radek!
I was thinking about how to automate your script so that it can be applied to any dataset with a minimum of modifications:

DATASET="dogscats"

mkdir -p data/${DATASET}_sample/{valid,train}/{cats,dogs}

shuf -n 200 -e data/${DATASET}/train/cats/* | xargs -i cp {} data/${DATASET}_sample/train/cats
shuf -n 200 -e data/${DATASET}/train/dogs/* | xargs -i cp {} data/${DATASET}_sample/train/dogs
shuf -n 100 -e data/${DATASET}/valid/cats/* | xargs -i cp {} data/${DATASET}_sample/valid/cats
shuf -n 100 -e data/${DATASET}/valid/dogs/* | xargs -i cp {} data/${DATASET}_sample/valid/dogs

I am still trying to figure out how to remove the hardcoded dogs and cats labels, since other datasets have other classes.

1 Like

Ah this is really great @alessa :slight_smile:

I am not very good with bash scripting, and I suspect that iterating over a list of strings might be troublesome, though definitely doable. Maybe combining the best of both worlds (jupyter notebook and linux programs) would be worth exploring? I am thinking about something along these lines:

categories = ['category_1', 'category_2']

# note: xargs' {} placeholder needs doubling ({{}}) so that IPython
# does not treat it as Python string interpolation
for category in categories:
    !mkdir -p path/to/data/dataset_name_sample/train/{category}
    !mkdir -p path/to/data/dataset_name_sample/valid/{category}
    !shuf -n 200 -e path/to/data/dataset_name/train/{category}/* | xargs -i cp {{}} path/to/data/dataset_name_sample/train/{category}
    !shuf -n 100 -e path/to/data/dataset_name/valid/{category}/* | xargs -i cp {{}} path/to/data/dataset_name_sample/valid/{category}

Either way - this is untested, so sorry for the babbling, but maybe some of it can be of help!

BTW here is a solution for iterating over strings in bash if you’d prefer to go that route :slight_smile:

1 Like

Thanks for your reply!
I finally managed to get a clean bash solution for this:

#!/bin/sh
DATASET="the-nature-conservancy-fisheries-monitoring"  # or e.g. dogscats

dirpath="data/${DATASET}/train/*"

# iterate over every class directory found under train/
for dir in ${dirpath}; do
    dname="${dir##*/}"  # strip the path, keeping only the class name

    mkdir -p data/${DATASET}_sample/valid/${dname}
    mkdir -p data/${DATASET}_sample/train/${dname}

    # copy a random sample of each class into the sample directories
    shuf -n 200 -e data/${DATASET}/train/${dname}/* | xargs -i cp {} data/${DATASET}_sample/train/${dname}
    shuf -n 100 -e data/${DATASET}/valid/${dname}/* | xargs -i cp {} data/${DATASET}_sample/valid/${dname}
done

10 Likes