Managing data & directories for raw/training/validation

Hey folks,

@jeremy has a nice workflow for moving files around in his different notebooks, but it rarely worked perfectly for me because my setup was slightly different or I made a typo or whatever. So I'd have to start over, and the bash commands weren't easily repeatable, even with a bash script.

How are you dealing with this? I ended up writing this Python script, which takes the raw cats/dogs input we downloaded and moves it into the right directories for training/validation/sampling.

You can use it like this:

>>> import prep_data as prep
>>> prep.main('~/nbs/data/redux', '~/nbs/data/redux/clean')
Copied 11500 dogs files
Copied 1000 dogs files
Sampling dogs
Copied 100 dogs files
Copied 20 dogs files
Copied 11500 cats files
Copied 1000 cats files
Sampling cats
Copied 100 cats files
Copied 20 cats files
Copying test data

Thanks for sharing. I tend to do my moving stuff around in a notebook, and then I zip up the processed directory. If I need to recreate it from scratch, I just delete and unzip.


nice script! I've started to write my own helpers too. I might steal some of yours :wink:

As soon as I download the images, I create "backup_train" and "backup_test" directories where I store the unzipped images. But I don't create any subdirectories yet. I do that in the notebooks since each competition will be different and I want the whole process to be completely repeatable every time. If I screw up somewhere and need to "reset", I just delete train or test and cp -r backup_train train
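For anyone who wants that reset step callable from a notebook, here is a rough Python sketch of it. `reset_from_backup` is a made-up helper name, and it assumes the backups sit next to the working directories as `backup_<name>`, as described above.

```python
import os
import shutil


def reset_from_backup(data_dir, name):
    """Delete data_dir/<name> and recreate it from data_dir/backup_<name>."""
    target = os.path.join(data_dir, name)
    backup = os.path.join(data_dir, 'backup_' + name)
    # Refuse to delete anything if there is no backup to restore from.
    if not os.path.isdir(backup):
        raise FileNotFoundError('no backup directory at ' + backup)
    shutil.rmtree(target, ignore_errors=True)  # same idea as: rm -rf train
    shutil.copytree(backup, target)            # same idea as: cp -r backup_train train
    return target
```

Calling `reset_from_backup(DATA_HOME_DIR, 'train')` would then throw away whatever is in train and restore the untouched copy.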

In my notebooks I try to stay in the same directory the whole time. No cd... But at the beginning I create a reference variable DATA_HOME_DIR.

```
import os

current_dir = os.getcwd()
DATA_HOME_DIR = current_dir+'/data/statefarm'
```

After that I can reference that directory anywhere in my code: move, copy files, load images, etc.

I'll also sometimes add this:

```
#Set Paths - Sample or Prod
root = DATA_HOME_DIR+'/sample' #or nothing
test_path = DATA_HOME_DIR+'/test/'
results_path = root + '/results/'
train_path = root + '/train/'
valid_path = root + '/valid/'
models_path = root + '/models/'
```

Which lets me switch between prod and sample datasets quickly.
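One way to make that switch a single flag is to wrap the path setup in a function. This is only a sketch: `make_paths` and `use_sample` are hypothetical names, and it assumes the same layout as the snippet above (the test set is shared, everything else lives under either the data directory or its sample/ subdirectory).

```python
import os


def make_paths(data_home, use_sample=True):
    """Return the working paths, rooted at data_home/sample or at data_home."""
    root = os.path.join(data_home, 'sample') if use_sample else data_home
    # The test set is shared between the sample and full runs.
    paths = {'test': os.path.join(data_home, 'test', '')}
    for name in ('results', 'train', 'valid', 'models'):
        # The empty last component keeps the trailing separator.
        paths[name] = os.path.join(root, name, '')
    return paths


DATA_HOME_DIR = os.path.join(os.getcwd(), 'data', 'statefarm')
paths = make_paths(DATA_HOME_DIR, use_sample=True)  # flip to False for the full data
```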

Nice! How do you handle moving the files around? Bash commands or Python?

Btw I always use os.path.join to create paths because I find string addition with slashes error prone. The extra typing is made up for in never having silly path errors :slight_smile:
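To show the kind of silly path error I mean, here is a small comparison. The directory names are just the ones used earlier in the thread, and the variable names are for illustration only.

```python
import os

DATA_HOME_DIR = os.path.join(os.getcwd(), 'data', 'statefarm')

# String addition: forgetting one slash silently gives a wrong path.
broken = DATA_HOME_DIR + 'train'  # ends in 'statefarmtrain'

# os.path.join inserts the separator for you.
train_path = os.path.join(DATA_HOME_DIR, 'train')

# A trailing separator on an earlier piece does not get doubled.
same = os.path.join(DATA_HOME_DIR + os.sep, 'train') == train_path

# Caveat: a piece that starts with a separator is treated as absolute and
# discards the earlier pieces (on Windows the drive letter is kept).
surprise = os.path.join(DATA_HOME_DIR, os.sep + 'train')

print(broken)
print(train_path)
print(same)
print(surprise)
```

So os.path.join is not magic either: the pieces you pass in after the first one should not start with a slash.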

Hi Robin,

Thanks for sharing. I'm doing something similar. All my file wrangling is in Python and called from the notebook.

The general flow:

  1. Download the data from kaggle if necessary.
  2. Unzip to a train/ directory if necessary.
  3. Move some portion to a valid/ directory.
  4. Copy some fraction from train/ to sample/{train,valid}/{dogs,cats}/

When I'm ready to train on all the data (minus the validation set) I increase the sample fraction to 1.0 and leave my code reading from the sample/ files. Feel free to use any of this you find useful, and let me know if you find any bugs or have any questions.
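Steps 3 and 4 of that flow can be sketched roughly as below. This is not the code referred to above, just a minimal illustration under some assumptions: train/ is a flat directory of files named like cat.12.jpg and dog.7.jpg, the class folder is derived from that prefix, and the function names and default fractions are invented for the example.

```python
import os
import random
import shutil


def pick_fraction(filenames, fraction, seed=0):
    """Return a reproducible random subset holding `fraction` of the names."""
    names = sorted(filenames)
    count = int(len(names) * fraction)
    return random.Random(seed).sample(names, count)


def split_and_sample(data_dir, valid_fraction=0.1, sample_fraction=0.05):
    train_dir = os.path.join(data_dir, 'train')
    valid_dir = os.path.join(data_dir, 'valid')
    os.makedirs(valid_dir, exist_ok=True)

    # Step 3: move a portion of train/ into valid/, unless that was already done.
    if not os.listdir(valid_dir):
        for name in pick_fraction(os.listdir(train_dir), valid_fraction):
            shutil.move(os.path.join(train_dir, name),
                        os.path.join(valid_dir, name))

    # Step 4: copy a fraction of each split into sample/<split>/<class>/.
    for split in ('train', 'valid'):
        src_dir = os.path.join(data_dir, split)
        for name in pick_fraction(os.listdir(src_dir), sample_fraction):
            label = name.split('.')[0] + 's'  # 'cat.12.jpg' -> 'cats'
            dst_dir = os.path.join(data_dir, 'sample', split, label)
            os.makedirs(dst_dir, exist_ok=True)
            shutil.copy(os.path.join(src_dir, name),
                        os.path.join(dst_dir, name))
```

With `sample_fraction=1.0` every file is copied, which matches the idea of training on the full set while still reading from sample/.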



Nice code @dennisobrien !

great script! thanks for sharing! I did it manually on the command line...

shuf -zen1250 ~/data/dogscats/train/cats/* | xargs -0 mv -t ~/data/dogscats/valid/cats/
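If GNU shuf and `mv -t` are not available, roughly the same thing can be done in Python. This is a sketch, not a drop-in replacement: `move_random_files` is a hypothetical helper, and the optional `seed` is an extra so the same files get picked on a rerun.

```python
import random
import shutil
from pathlib import Path


def move_random_files(src, dst, n, seed=None):
    """Move up to n randomly chosen files from src into dst; return their names."""
    src, dst = Path(src).expanduser(), Path(dst).expanduser()
    dst.mkdir(parents=True, exist_ok=True)
    # Sort first so a fixed seed always selects the same files.
    files = sorted(p for p in src.iterdir() if p.is_file())
    chosen = random.Random(seed).sample(files, min(n, len(files)))
    for path in chosen:
        shutil.move(str(path), str(dst / path.name))
    return [p.name for p in chosen]
```

The one-liner above would then correspond to something like `move_random_files('~/data/dogscats/train/cats', '~/data/dogscats/valid/cats', 1250)`.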


On macOS the following commands work, assuming you have installed coreutils (which provides g-prefixed GNU versions of cp, mv, ls and shuf: gcp, gmv, gls and gshuf).

# before shuffling and moving
├── shuf_dest
└── shuf_origin
    ├── 1.jpeg
    ├── 2.jpeg
    ├── 3.jpeg
    ├── 4.jpeg
    ├── 5.jpeg
    ├── 6.jpeg
    ├── 7.jpeg
    ├── 8.jpeg
    └── 9.jpeg

# you have to be inside the source folder    
cd shuf_origin
gls ./ | gshuf -n 5 | xargs gmv -t ../shuf_dest/

# if you are interested in copying instead of moving using `cp`
# gls ./ | gshuf -n 5 | xargs gcp -t ../shuf_dest/

# after moving 5 images to the dest folder
❯ cd ..
├── shuf_dest
│   ├── 1.jpeg
│   ├── 3.jpeg
│   ├── 5.jpeg
│   ├── 6.jpeg
│   └── 7.jpeg
└── shuf_origin
    ├── 2.jpeg
    ├── 4.jpeg
    ├── 8.jpeg
    └── 9.jpeg