Bash scripts for creating sample datasets

Hi everyone. I’ve added some bash commands to my .bashrc to make it easier to set up sample datasets. The main helpers are cpn and mvn, which stand for “copy n” and “move n” respectively. I thought others might be interested, so here is a gist:

Here is how it works. Assuming you have already set up your full train/valid/test directories and you want to:

  • create the sample directory
  • copy 200 random dogs and cats into both the training and validation sample sets
  • copy 500 random images into the sample test set

You would just need to do this:

$ sampletree
$ mkdir sample/train/{dogs,cats} sample/valid/{dogs,cats}
$ cpn 200 train/dogs
$ cpn 200 valid/dogs
$ cpn 200 train/cats
$ cpn 200 valid/cats
$ cpn 500 test
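
For anyone who wants to see the idea without opening the gist, here is a rough sketch of what a "copy n" helper could look like. This is not the code from the gist: the name cpn_sketch, the use of shuf, and the assumption that files land in the same relative path under sample/ are all my own guesses based on the usage above.

```shell
# Hypothetical helper (not the gist's cpn): copy N random files from DIR
# into sample/DIR. Assumes GNU coreutils (shuf) and file names without newlines.
cpn_sketch() {
  local n="$1" dir="$2"
  # Create the destination folder under sample/ if it is missing.
  mkdir -p "sample/$dir"
  # List the files directly inside DIR, pick N at random, and copy each one.
  find "$dir" -maxdepth 1 -type f | shuf -n "$n" | while IFS= read -r f; do
    cp "$f" "sample/$dir/"
  done
}
```

With this sketch, something like cpn_sketch 200 train/dogs would copy 200 random files from train/dogs into sample/train/dogs. A "move n" version would presumably be the same with mv in place of cp.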

Good one! I’ve found a command that moves a fixed share of the files from each subfolder into another master folder. I used it to take 10% of the samples from each training subfolder (10 subfolders for 10 classes) and move them to the validation folder.

Here’s the link to the StackOverflow -

Here’s my code:

kg config -g -u "username" -p "password" -c "state-farm-distracted-driver-detection"
kg download


mkdir valid

cd train

# For each subfolder, move the first 10% of its files to the same subfolder under ../valid
find . -type f -exec dirname {} + | sort | uniq -c | while read -r n d; do
  echo "Directory:$d Files:$n Moving first:$((n / 10))"
  mkdir -p "../valid${d:1}"
  find "$d" -type f | head -n "$((n / 10))" | while read -r file; do
    mv "$file" "../valid${d:1}/"
  done
done
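
One thing to keep in mind: head takes whichever files find happens to list first, so the command above does not pick a random 10%. If you want a random split, a variant along these lines might work. It is only a sketch: move_share is a made-up name, and it assumes GNU coreutils (shuf), one folder per class directly under the source directory, and file names without newlines.

```shell
# Hypothetical helper: move a random 1/DIVISOR share of each class folder
# under SRC into the matching class folder under DST.
move_share() {
  local src="$1" dst="$2" divisor="$3"
  local d n class
  for d in "$src"/*/; do
    # Skip the unexpanded pattern when SRC has no subfolders.
    [ -d "$d" ] || continue
    class=$(basename "$d")
    # Count the files directly inside this class folder.
    n=$(find "$d" -maxdepth 1 -type f | wc -l)
    mkdir -p "$dst/$class"
    # Pick n/divisor files at random (integer division) and move them.
    find "$d" -maxdepth 1 -type f | shuf -n "$((n / divisor))" | while IFS= read -r f; do
      mv "$f" "$dst/$class/"
    done
  done
}
```

Run from the directory that contains train, move_share train valid 10 would move a random 10% of each class into valid. Because the share uses integer division, a folder with fewer than 10 files contributes nothing.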