Faster experimentation for better learning

Pandas has a sample() method IIRC.


Thanks, @jeremy!

So would it be as simple as reading the csv file into a DataFrame, using the pandas sample method to select a random sample for train, validation, and test, saving those back to CSVs, and then feeding those filenames as parameters in the .from_csv call within the get_batches function? Am I on the right track at least?
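In case it helps anyone following along, here is a minimal sketch of that idea. The function name `split_csv`, the output file naming, and the fixed seed are my own assumptions for illustration; only `read_csv`, `sample`, and `to_csv` come from pandas.

```python
import pandas as pd

def split_csv(csv_path, out_prefix, n_train, n_valid, n_test, seed=42):
    """Write three disjoint random samples of csv_path to new CSV files."""
    df = pd.read_csv(csv_path)
    # Draw all rows in a single sample so the three subsets cannot overlap.
    picked = df.sample(n=n_train + n_valid + n_test, random_state=seed)
    parts = {
        "train": picked.iloc[:n_train],
        "valid": picked.iloc[n_train:n_train + n_valid],
        "test": picked.iloc[n_train + n_valid:],
    }
    paths = {}
    for name, part in parts.items():
        # Hypothetical naming scheme: <out_prefix>_train.csv, and so on.
        paths[name] = f"{out_prefix}_{name}.csv"
        part.to_csv(paths[name], index=False)
    return paths
```

The returned paths are what you would then pass on as the CSV filenames.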

Sounds about right to me :slight_smile: I kept thinking CSVs with images are somehow special but I guess they are just…CSVs :wink:

Okay I will give it a try :slight_smile:

They’re not really CSVs with images. Just filenames, and labels. So nothing special at all…


It might be useful for from_csv to accept a pandas DataFrame that was created in the notebook, rather than only a path to a CSV file as it does currently. That way you could more easily manipulate the training set using pandas functions without needing to save to CSV each time.

What do you guys think?


That’s a great idea!

Cool let me see if I can figure out how to do it and I’ll submit a PR :grin:


Ideally you’d factor out the common stuff from from_csv so that your new from_df and from_csv would be only a line or so of code each…
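A rough sketch of what that factoring could look like. `LabeledFiles` and its `fn_col`/`label_col` arguments are made up for illustration and are not the library's actual class or signature; the point is only that from_csv shrinks to a load plus a call to from_df.

```python
import pandas as pd

class LabeledFiles:
    """Hypothetical stand-in for the dataset object, not the library's real class."""

    def __init__(self, fnames, labels):
        self.fnames, self.labels = list(fnames), list(labels)

    @classmethod
    def from_df(cls, df, fn_col=0, label_col=1):
        # The shared logic lives here: pull filenames and labels out of the frame.
        return cls(df.iloc[:, fn_col], df.iloc[:, label_col])

    @classmethod
    def from_csv(cls, csv_path, **kwargs):
        # The CSV variant only loads the file and delegates to from_df.
        return cls.from_df(pd.read_csv(csv_path), **kwargs)
```

With something like this, a DataFrame that was sampled or filtered in the notebook can be passed straight to from_df without a round trip through a file.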

I love the conciseness of the code to create the directory structure and copy/move the examples!

Question: How could you put this in a notebook given that in python 3.6, it looks at the { } as placeholders for string interpolation?

Use double exclamation marks at line start: !!mkdir ... :slight_smile:

Ah geez … I was trying everything in the python universe to escape the { to no avail! Never thought it might be a Jupyter notebook thing!

Thanks much!

Thank you radek!
I was wondering how I could automate your script so that it can be applied to any dataset with a minimum of modifications:


mkdir -p data/${DATASET}_sample/{valid,train}/{cats,dogs}

shuf -n 200 -e data/${DATASET}/train/cats/* | xargs -i cp {} data/${DATASET}_sample/train/cats
shuf -n 200 -e data/${DATASET}/train/dogs/* | xargs -i cp {} data/${DATASET}_sample/train/dogs
shuf -n 100 -e data/${DATASET}/valid/cats/* | xargs -i cp {} data/${DATASET}_sample/valid/cats
shuf -n 100 -e data/${DATASET}/valid/dogs/* | xargs -i cp {} data/${DATASET}_sample/valid/dogs

I am still looking for a way to remove the hardcoded dogs and cats labels, since other datasets have other classes.

Ah this is really great @alessa :slight_smile:

I am not very good with bash scripting and I suspect that iterating over a list of strings might be troublesome, though definitely doable. Maybe combining the best of both worlds (Jupyter notebook and Linux programs) would be worth exploring? I am thinking about something along these lines:

categories = ['category_1', 'category_2']
!!mkdir -p path/to/data/dataset_name/{category_1,category_2} # would be nice to automate this
# as well but I don't know how, need some string interpolation at some point somewhere
# maybe something like this (Jupyter fills in {category} from the Python variable):
for category in categories:
    !mkdir -p path/to/data/dataset_name/{category}
    !shuf ... # and string interpolation again I guess

Either way - I don’t have an answer and sorry for my babbling but maybe some of this can be of help!

BTW here is a solution for iterating over strings in bash if you’d prefer to go that route :slight_smile:
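If staying in Python is preferable, the whole thing can also be done without shell commands. This is only a sketch under a few assumptions: the function name `make_sample` is invented, the class names are read from the folders under train/, and valid/ is assumed to contain the same class folders.

```python
import random
import shutil
from pathlib import Path

def make_sample(src, dst, n_train=200, n_valid=100, seed=None):
    """Copy a random subset of each class folder from src into dst."""
    src, dst = Path(src), Path(dst)
    rng = random.Random(seed)
    # Class names come from the folder names under train/, so nothing is hardcoded.
    categories = sorted(p.name for p in (src / "train").iterdir() if p.is_dir())
    for split, n in (("train", n_train), ("valid", n_valid)):
        for category in categories:
            files = sorted(f for f in (src / split / category).iterdir() if f.is_file())
            target = dst / split / category
            target.mkdir(parents=True, exist_ok=True)
            # Take at most n files; folders with fewer files are copied whole.
            for f in rng.sample(files, min(n, len(files))):
                shutil.copy(f, target / f.name)
    return categories
```

Calling `make_sample("data/dogscats", "data/dogscats_sample")` would mirror the 200/100 split used in this thread.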

Thanks for your reply!
I finally managed to get a clean bash solution for that:

DATASET="the-nature-conservancy-fisheries-monitoring"; #dogscats

# One sub-directory per class under train/ (valid/ is assumed to have the same ones)
dirpath="data/${DATASET}/train/*"

for dir in ${dirpath}; do
        dname=$(basename "${dir}")   # class name taken from the directory name

        mkdir -p data/${DATASET}_sample/valid/${dname}
        mkdir -p data/${DATASET}_sample/train/${dname}

        shuf -n 200 -e data/${DATASET}/train/${dname}/* | xargs -i cp {} data/${DATASET}_sample/train/${dname}
        shuf -n 100 -e data/${DATASET}/valid/${dname}/* | xargs -i cp {} data/${DATASET}_sample/valid/${dname}
done


Did you submit a PR? I’ve written a similar function for this purpose, and just saw this.

Thanks for sharing this radek. It is indeed very useful, especially when we want to work using only our laptop without internet connection, and one has to rely only on one’s own machine.

I used it and it worked but there is a line that I don’t understand very well:

shuf -n 200 -e data/dogscats/train/cats | xargs -i cp {} data/dogscats_sample/train/cats

I’m really not an expert in bash scripting so I apologize for the naivety of my question. I looked at the man entries of the different words used there (shuf, xargs) and I get more or less what they do, but I don’t really understand the role of the line. I’m not saying it shouldn’t be there (again I’m really a beginner) I’m just asking if you could provide me some explanation.


Hey @abercher - happy to hear you are finding this useful :slight_smile:

The shell expands the /* into the paths of all the files in the specified directory (I think you are missing the /* after the directory name for this to work). With -e, shuf treats each of those paths as an input line, shuffles them, and with -n 200 outputs only 200 of them. That output is then piped into xargs, which runs the cp command once for each path, substituting the path for {}.
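To make the role of the /* more concrete, here is a small Python imitation of what shuf receives in each case. `shuf_e` is a made-up helper that only approximates `shuf -n N -e ARGS`, and `glob.glob` stands in for the shell's wildcard expansion.

```python
import glob
import random

def shuf_e(args, n):
    """Rough stand-in for `shuf -n N -e ARGS`: up to n of the arguments, in random order."""
    return random.sample(args, min(n, len(args)))

# With the trailing /* the shell hands shuf one argument per file.
with_glob = shuf_e(glob.glob("data/dogscats/train/cats/*"), 200)

# Without it there is a single argument, the directory itself, so that is all
# shuf can output, and plain cp (no -r) will not copy a directory.
without_glob = shuf_e(["data/dogscats/train/cats"], 200)
```

So without the wildcard, the only path that reaches cp is the directory name, and no files get copied.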


Hey @radek!
Thanks a lot for your answer. I copy-pasted this line from your first post (November 17).

The next lines of your script do have this /* and I understand them better, but I have trouble with the second one (the one I copied in my previous message). Could it be a typo?


I think you are right - that line doesn’t do anything as far as I can tell :wink:

Mhmm the forum is not letting me edit the original post (I think too much time has elapsed), but if anyone else finds this thread anytime down the road, this is what the lines should be:

mkdir -p data/dogscats_sample/{valid,train}/{cats,dogs}

shuf -n 200 -e data/dogscats/train/cats/* | xargs -i cp {} data/dogscats_sample/train/cats
shuf -n 200 -e data/dogscats/train/dogs/* | xargs -i cp {} data/dogscats_sample/train/dogs
shuf -n 100 -e data/dogscats/valid/cats/* | xargs -i cp {} data/dogscats_sample/valid/cats
shuf -n 100 -e data/dogscats/valid/dogs/* | xargs -i cp {} data/dogscats_sample/valid/dogs

Thx for spotting this @abercher and my apologies for the confusion!