Faster experimentation for better learning

Ah geez … I was trying everything in the Python universe to escape the { to no avail! Never thought it might be a Jupyter notebook thing!

Thanks much!

1 Like

Thank you radek!
I was thinking about how I could automate your script so it can be applied to any dataset with minimal modifications


mkdir -p data/${DATASET}_sample/{valid,train}/{cats,dogs}

shuf -n 200 -e data/${DATASET}/train/cats/* | xargs -i cp {} data/${DATASET}_sample/train/cats
shuf -n 200 -e data/${DATASET}/train/dogs/* | xargs -i cp {} data/${DATASET}_sample/train/dogs
shuf -n 100 -e data/${DATASET}/valid/cats/* | xargs -i cp {} data/${DATASET}_sample/valid/cats
shuf -n 100 -e data/${DATASET}/valid/dogs/* | xargs -i cp {} data/${DATASET}_sample/valid/dogs

I am still looking for a way to remove the hard-coded cats and dogs labels, since other datasets have other classes

1 Like

Ah this is really great @alessa :slight_smile:

I am not very good with bash scripting, and I suspect that iterating over a list of strings might be troublesome, though definitely doable. Maybe combining the best of both worlds (Jupyter notebook and Linux programs) would be worth exploring? I am thinking about something along these lines:

categories = ['category_1', 'category_2']
# in Jupyter a literal brace has to be doubled for it to reach the shell:
!mkdir -p path/to/data/dataset_name/{{category_1,category_2}} # would be nice to automate this
# as well - IPython interpolates Python expressions inside single {}, so the
# list can be joined inline, maybe something like this:
!mkdir -p {' '.join('path/to/data/dataset_name/' + c for c in categories)}
for category in categories:
    !shuf ... # and string interpolation again, e.g. {category} inside the command

Either way - I don’t have an answer, and sorry for the babbling, but maybe some of this can be of help!

BTW here is a solution for iterating over strings in bash if you’d prefer to go that route :slight_smile:

1 Like

Thanks for your reply!
I finally managed to put together a clean bash solution for this:

DATASET="the-nature-conservancy-fisheries-monitoring"; # or e.g. dogscats


for dir in data/${DATASET}/train/*/; do
        dname=$(basename ${dir})

        mkdir -p data/${DATASET}_sample/valid/${dname}
        mkdir -p data/${DATASET}_sample/train/${dname}

        shuf -n 200 -e data/${DATASET}/train/${dname}/* | xargs -i cp {} data/${DATASET}_sample/train/${dname}
        shuf -n 100 -e data/${DATASET}/valid/${dname}/* | xargs -i cp {} data/${DATASET}_sample/valid/${dname}
done


Did you submit a PR? I’ve written a similar function for this purpose and just saw this.

1 Like

Thanks for sharing this radek. It is indeed very useful, especially when working on a laptop without an internet connection, relying only on one’s own machine.

I used it and it worked but there is a line that I don’t understand very well:

shuf -n 200 -e data/dogscats/train/cats | xargs -i cp {} data/dogscats_sample/train/cats

I’m really not an expert in bash scripting, so I apologize for the naivety of my question. I looked at the man pages of the commands used there (shuf, xargs), and I more or less get what they do individually, but I don’t really understand the role of that line. I’m not saying it shouldn’t be there (again, I’m really a beginner); I’m just asking if you could provide me some explanation.


Hey @abercher - happy to hear you are finding this useful :slight_smile:

shuf takes all the paths to files in the specified directory (I think you are missing the /* after the directory name for this to work), picks 200 of them at random, and shuffles their order. The result is then piped into xargs, which runs the cp command, supplying it the paths that were piped in.
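To make that concrete, here is a toy run anyone can try in a scratch directory (the /tmp paths and file names here are made up purely for the demo):

```shell
# start clean, then create 5 dummy files to sample from
rm -rf /tmp/shuf_demo
mkdir -p /tmp/shuf_demo/src /tmp/shuf_demo/dst
for i in $(seq 1 5); do touch /tmp/shuf_demo/src/img_$i.jpg; done

# shuf -e shuffles its arguments (the glob-expanded paths) and -n 2 keeps
# two of them; xargs -i substitutes each incoming path for {} in cp
shuf -n 2 -e /tmp/shuf_demo/src/* | xargs -i cp {} /tmp/shuf_demo/dst

ls /tmp/shuf_demo/dst   # two randomly chosen files
```

Since cp is used, the source directory is left untouched; only copies land in dst.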


Hey @radek!
Thanks a lot for your answer. I copy-pasted this line from your first post (November 17).

The next lines of your script do have this /*, and I understand them better, but I have trouble with the second one (the one I copied in my previous message). Could it be a typo?


1 Like

I think you are right - that line doesn’t do anything as far as I can tell :wink:

Mhmm, the forum is not letting me edit the original post (I think too much time has elapsed), but if anyone else finds this thread down the road, this is what the lines should be:

mkdir -p data/dogscats_sample/{valid,train}/{cats,dogs}

shuf -n 200 -e data/dogscats/train/cats/* | xargs -i cp {} data/dogscats_sample/train/cats
shuf -n 200 -e data/dogscats/train/dogs/* | xargs -i cp {} data/dogscats_sample/train/dogs
shuf -n 100 -e data/dogscats/valid/cats/* | xargs -i cp {} data/dogscats_sample/valid/cats
shuf -n 100 -e data/dogscats/valid/dogs/* | xargs -i cp {} data/dogscats_sample/valid/dogs

Thx for spotting this @abercher and my apologies for the confusion!


Thanks a lot @radek! No problem, since the script works very well.
Have a nice day!

1 Like

Really elegant use of bash code!

I had been writing my own version, but this is much more elegant. That being said, I think the code needs to be modified to change cp to mv. Otherwise, the samples in train and valid are not completely independent. Specifically, it is possible [likely] that the same sample will appear in both train and valid. The downside of this change is that the original samples are irrevocably altered.
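A toy sketch of the mv variant (the /tmp paths and counts are made up for the demo): because each chosen file is moved out of the source pool, the second draw cannot pick a file the first draw already took.

```shell
# start clean: 10 files in a single source pool
rm -rf /tmp/mv_demo
mkdir -p /tmp/mv_demo/pool /tmp/mv_demo/train /tmp/mv_demo/valid
for i in $(seq 1 10); do touch /tmp/mv_demo/pool/img_$i.jpg; done

# mv removes each sampled file from the pool, so the two draws are disjoint
shuf -n 6 -e /tmp/mv_demo/pool/* | xargs -i mv {} /tmp/mv_demo/train
shuf -n 2 -e /tmp/mv_demo/pool/* | xargs -i mv {} /tmp/mv_demo/valid
```

As noted above, this mutates the source pool, so it is safest to run it on a copy of the originals.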

1 Like

Damn, that is a really nice bash command. Pipes are so handy.

@radek wonderful tip.

I am planning to create/debug my notebook on my MacBook Pro (16 GB RAM, no GPU), then execute it in the cloud, where I play with architectures/hyperparameters. Is this something practitioners do?

If so are there some tips?
E.g. [a] selecting a subset of the images [b] selecting a ‘cheaper’ cnn architecture?

I think you might want to take the approach you describe in a situation where you want to limit the running time of your cloud instance as much as possible. It sounds tempting, but the overhead it adds can be quite substantial. I am also not sure what the slowdown on a CPU is - whether it is 10x or 100x. If it takes forever to process 500 images on a CPU, then I think this has limited viability, especially early in the course, where just running things with different hyperparameters and settings is probably the best way to learn.

I think if I were in your shoes I would consider running the lesson notebooks on cloud instances and see what I can learn from that. Probably even an hour or so per lesson of working with a notebook can initially get you a long way.

If money is a concern, you might want to look into spinning up spot instances on AWS. Last I used this I was getting a P2 for around $0.20 per hour. If you go for the 20 GB detachable volume, I think this will set you back $2 per month just to keep it, but you could get by with a smaller one, I think. Setting this up takes a bit of work but can get you quite far. I do not know what other cost-efficient options exist; maybe someone could chime in.

I remember there being free GPU instances made available by Google? Are they still a thing? You literally went to a URL and a Jupyter notebook opened, running on an instance with a GPU. A quick Google search revealed this:

For cost saving this would probably be the best if it is still available.

If you would like to do something on your laptop - which I completely agree might be an interesting direction - then once you get your feet wet with the course, I would look at building toy examples that run on the CPU. You could probably build something interesting from scratch using the MNIST dataset. Or, in lesson 4 (I think it’s lesson 4?), when we get to IMDB sentiment analysis, you could probably also do something on CPU only.

Not sure if this is helpful, but this is the way I see it now. You should probably look for a setup that you can comfortably afford and that allows for the quickest learning. Building nets locally to then run them on the full dataset on a cloud instance seems to require too much overhead to be practical. But at the same time, I haven’t tried it, so maybe you can make it work :slight_smile:

1 Like

What a wonderful community. Just like the teacher!

Thanks @radek for the gems of advice. Like you said, I’ll find a way to experiment initially on GPU.

I am guessing I have enough credits in my Azure subscription to last out the month. The problem is connectivity - I am in Brisbane, and there are Azure GPU instances only in East US. And my connection is decent, at ~15 Mbps.

Each keystroke over SSH or a terminal server seems to take forever.

1 Like

I am not sure, but I don’t think a lot can be done about the latency over SSH. There are a couple of suggestions in this serverfault thread that might be useful.

I suspect the latency is due to getting off the continent, and I am not sure much can be done about that apart from switching ISPs (and that might not help either). There does seem to be an AWS region in Sydney, and they have p2 instances, but they are much more expensive than in any other region I have looked at. The only other solution might be going in a different direction - here are the availability regions

BTW, if you are a student, I think there should be a way of getting some free AWS credits. IIRC there is a program through GitHub, and there was also one directly from Amazon, but I haven’t checked in quite a while whether they are still ongoing.

1 Like

I just made a Python script to do it. :snake:

You can set your preferred splits too: 60/20/20, 70/15/15, etc.
It works pretty well on a small test sample and on a set of just under 200 .png files.

I used this Google images download library to get pictures, so the Python script works on the resulting directory downloads.
That reminds me, I’m using Linux. Haven’t tried it on anything else, though I don’t see why it wouldn’t work elsewhere.
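The same kind of proportional split can also be sketched directly in bash, for anyone who prefers that route (the 60/20/20 percentages and the /tmp paths here are made up for the demo; mv keeps the three sets disjoint):

```shell
# start clean: 10 files to split 60/20/20
rm -rf /tmp/split_demo
mkdir -p /tmp/split_demo/src /tmp/split_demo/train /tmp/split_demo/valid /tmp/split_demo/test
for i in $(seq 1 10); do touch /tmp/split_demo/src/img_$i.png; done

# compute the draw sizes from the total, draw twice, the remainder is test
total=$(ls /tmp/split_demo/src | wc -l)
n_train=$(( total * 60 / 100 ))
n_valid=$(( total * 20 / 100 ))
shuf -n "${n_train}" -e /tmp/split_demo/src/* | xargs -i mv {} /tmp/split_demo/train
shuf -n "${n_valid}" -e /tmp/split_demo/src/* | xargs -i mv {} /tmp/split_demo/valid
mv /tmp/split_demo/src/* /tmp/split_demo/test
```

Changing the two percentages gives other splits (70/15/15, etc.); whatever is not drawn for train or valid ends up in test.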


If you want to use Bing or Google, I wrote a script to do so. Also, because it uses Selenium, it can download more images from Google - typically 700-800 vs. 200 with the direct Google API.



Very nice. I’ll check it out. I’d like to pair that with mine to download and split that many photos.
Edit: Oh, it looks like yours does everything!

I’ve made a script that creates a sample dataset from data already divided into train, validation and test folders (the exact names should not matter).
I will try to implement that functionality for datasets with .csv files later.