Managing data & directories for raw/training/validation

robin · November 11, 2016, 8:19pm

Hey folks,

@jeremy has a nice workflow for moving files around in his different notebooks, but it rarely worked perfectly for me because my setup was slightly different, or I made a typo, or whatever. So I’d have to start over, and the bash commands weren’t easily repeatable, even with a bash script.

How are you dealing with this? I ended up writing this Python script, which takes the raw cats/dogs input we downloaded and moves it into the right directories for training/validation/sampling.

You can use it like this:

>>> import prep_data as prep
>>> prep.main('~/nbs/data/redux', '~/nbs/data/redux/clean')
Copied 11500 dogs files
Copied 1000 dogs files
Sampling dogs
Copied 100 dogs files
Copied 20 dogs files
Copied 11500 cats files
Copied 1000 cats files
Sampling cats
Copied 100 cats files
Copied 20 cats files
Copying test data

jeremy · November 11, 2016, 8:43pm

Thanks for sharing. I tend to do my moving stuff around in a notebook, and then I zip up the processed directory. If I need to recreate it from scratch, I just delete and unzip.
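
For anyone who wants the zip-and-restore step in Python rather than by hand, here is a minimal sketch using the standard library. The function names `archive_dir` and `restore_dir` are made up for illustration, and it assumes the archive is written outside the processed directory.

```python
import shutil
from pathlib import Path


def archive_dir(processed_dir, archive_base):
    """Zip up processed_dir and return the path of the archive that was written."""
    processed_dir = Path(processed_dir)
    # make_archive appends ".zip" to archive_base itself.
    return shutil.make_archive(
        str(archive_base),
        "zip",
        root_dir=str(processed_dir.parent),  # paths in the zip are relative to this
        base_dir=processed_dir.name,         # only this directory is archived
    )


def restore_dir(processed_dir, archive_path):
    """Delete processed_dir (if present) and recreate it from the archive."""
    processed_dir = Path(processed_dir)
    if processed_dir.exists():
        shutil.rmtree(processed_dir)
    # The archive stores "<dirname>/..." entries, so unpack into the parent.
    shutil.unpack_archive(str(archive_path), extract_dir=str(processed_dir.parent))
```

Calling `restore_dir` with the same directory that was archived puts everything back where it was, which is the "delete and unzip" reset described above.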

brendan · November 12, 2016, 3:45am

nice script! I’ve started to write my own helpers too. I might steal some of yours

As soon as I download the images, I create “backup_train” and “backup_test” directories where I store the unzipped images. But I don’t create any subdirectories yet. I do that in the notebooks, since each competition will be different and I want the whole process to be completely repeatable every time. If I screw up somewhere and need to “reset”, I just delete train or test and run `cp -r backup_train train`.
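
The same reset can be done from inside a notebook without shelling out. This is only a sketch: `reset_from_backup` is a hypothetical helper, and it assumes the backup directories follow the `backup_<name>` naming described above.

```python
import shutil
from pathlib import Path


def reset_from_backup(data_home, name):
    """Replace <data_home>/<name> with a fresh copy of <data_home>/backup_<name>."""
    data_home = Path(data_home)
    target = data_home / name
    backup = data_home / f"backup_{name}"
    # Refuse to delete anything unless the backup is really there.
    if not backup.is_dir():
        raise FileNotFoundError(f"no backup directory at {backup}")
    if target.exists():
        shutil.rmtree(target)
    # Equivalent of `cp -r backup_train train`.
    shutil.copytree(backup, target)
    return target
```

For example, `reset_from_backup(DATA_HOME_DIR, "train")` would throw away the current `train` directory and rebuild it from `backup_train`.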

In my notebooks I try to stay in the same directory the whole time. No cd… But at the beginning I create a reference variable DATA_HOME_DIR.

```
import os

current_dir = os.getcwd()
DATA_HOME_DIR = current_dir+'/data/statefarm'
```

After that I can reference that directory anywhere in my code: move, copy files, load images, etc.

I'll also sometimes add this:

```
# Set paths - sample or prod
root = DATA_HOME_DIR+'/sample'  # or just DATA_HOME_DIR for the full (prod) dataset
test_path = DATA_HOME_DIR+'/test/'
results_path = root + '/results/'
train_path = root + '/train/'
valid_path = root + '/valid/'
models_path = root + '/models/'
```

This lets me switch between prod and sample datasets quickly.

robin · November 13, 2016, 12:41am

Nice! How do you handle moving the files around? Bash commands or Python?

Btw I always use `os.path.join` to create paths because I find string addition with slashes error-prone. The extra typing is made up for by never having silly path errors.
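
As an illustration, brendan's path setup from earlier in the thread could be written with `os.path.join` roughly like this. The function `build_paths` and its `use_sample` flag are hypothetical names introduced for this sketch; the directory names are the ones from his post.

```python
import os

# Same base directory as in brendan's post, built without manual slashes.
DATA_HOME_DIR = os.path.join(os.getcwd(), "data", "statefarm")


def build_paths(data_home, use_sample=True):
    """Return the working directories for either the sample or the full dataset."""
    # The sample set lives under <data_home>/sample; the full set under <data_home>.
    root = os.path.join(data_home, "sample") if use_sample else data_home
    return {
        "test": os.path.join(data_home, "test"),  # test data is shared by both
        "results": os.path.join(root, "results"),
        "train": os.path.join(root, "train"),
        "valid": os.path.join(root, "valid"),
        "models": os.path.join(root, "models"),
    }


paths = build_paths(DATA_HOME_DIR, use_sample=True)
```

Unlike the string version, these paths carry no trailing slash, so file names should also be appended with `os.path.join(paths["train"], filename)` rather than with `+`.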

dennisobrien · November 13, 2016, 3:12am

Hi Robin,

Thanks for sharing. I’m doing something similar. All my file wrangling is in Python and called from the notebook.


https://gist.github.com/dennisobrien/c72ef0f0c1fe125bb49e07b6b2834927

kaggle_dogs_cats_data_munging.py

from getpass import getpass
from glob import glob
import numpy as np
import os
import sh
import shutil


def get_data_dir(*args, relative=False):
    """Return the path to the data directory.

(The preview is truncated here; see the gist linked above for the full file.)

The general flow:

1. Download the data from Kaggle if necessary.
2. Unzip to a train/ directory if necessary.
3. Move some portion to a valid/ directory.
4. Copy some fraction from train/ to sample/{train,valid}/{dogs,cats}/

When I’m ready to train on all the data (minus the validation set) I increase the sample fraction to 1.0 and leave my code reading from the sample/ files. Feel free to use any of this you find useful, and let me know if you find any bugs or have any questions.
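
The sample-copying step might look something like the sketch below. This is not the code from the gist: `copy_sample` is a hypothetical helper, and it assumes a flat source directory with Kaggle-style file names such as `cat.0.jpg` and `dog.0.jpg`, so that the class can be read from the text before the first dot.

```python
import random
import shutil
from pathlib import Path


def copy_sample(src_dir, sample_dir, fraction, seed=0):
    """Copy a random `fraction` of the files in src_dir into sample_dir/<label>s/.

    Returns the number of files copied. The originals stay in src_dir.
    """
    src_dir, sample_dir = Path(src_dir), Path(sample_dir)
    # Sort first so that a fixed seed always picks the same files.
    files = sorted(p for p in src_dir.iterdir() if p.is_file())
    chosen = random.Random(seed).sample(files, int(len(files) * fraction))
    for path in chosen:
        # Assumed naming: "cat.0.jpg" -> label "cat" -> directory "cats".
        label_dir = sample_dir / (path.name.split(".")[0] + "s")
        label_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, label_dir / path.name)
    return len(chosen)
```

With `fraction=1.0` every file is copied, which matches the idea of leaving the training code pointed at `sample/` and simply raising the fraction when it is time to train on everything.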

cheers,
Dennis

jeremy · November 13, 2016, 3:57am

Nice code @dennisobrien !

yzhao76 · January 8, 2017, 11:51pm

Great script, thanks for sharing! I did it manually on the command line:

shuf -zen1250 ~/data/dogscats/train/cats/* | xargs -0 mv -t ~/data/dogscats/valid/cats/
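
A rough Python equivalent of that one-liner, for systems without GNU `shuf` or `mv -t`, could look like this. `move_random_files` is a hypothetical helper name, and the optional `seed` parameter is an addition that the shell command does not have.

```python
import random
import shutil
from pathlib import Path


def move_random_files(src_dir, dest_dir, n, seed=None):
    """Move up to n randomly chosen files from src_dir to dest_dir.

    Returns the names of the files that were moved.
    """
    # expanduser lets callers pass paths such as "~/data/dogscats/train/cats".
    src_dir = Path(src_dir).expanduser()
    dest_dir = Path(dest_dir).expanduser()
    dest_dir.mkdir(parents=True, exist_ok=True)
    files = sorted(p for p in src_dir.iterdir() if p.is_file())
    # Like `shuf -n`, take everything if fewer than n files are available.
    chosen = random.Random(seed).sample(files, min(n, len(files)))
    for path in chosen:
        shutil.move(str(path), str(dest_dir / path.name))
    return [p.name for p in chosen]
```

For the command above, the call would be `move_random_files("~/data/dogscats/train/cats", "~/data/dogscats/valid/cats", 1250)`. Passing a fixed `seed` makes the split reproducible when rebuilding from scratch.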

abi · November 12, 2017, 11:05pm

On macOS the following commands work. They assume that you have installed coreutils (which provides GNU/Linux-equivalent versions of cp, mv, ls and shuf, prefixed with `g`).

#before shuffling and moving
❯tree
.
├── shuf_dest
└── shuf_origin
    ├── 1.jpeg
    ├── 2.jpeg
    ├── 3.jpeg
    ├── 4.jpeg
    ├── 5.jpeg
    ├── 6.jpeg
    ├── 7.jpeg
    ├── 8.jpeg
    └── 9.jpeg

# you have to be inside the source folder    
cd shuf_origin
gls ./ | gshuf -n 5 | xargs gmv -t ../shuf_dest/

# if you are interested in copying instead of moving using `cp`
# gls ./ | gshuf -n 5 | xargs gcp -t ../shuf_dest/

# after moving 5 images to the dest folder
❯cd ..
❯tree
.
├── shuf_dest
│   ├── 1.jpeg
│   ├── 3.jpeg
│   ├── 5.jpeg
│   ├── 6.jpeg
│   └── 7.jpeg
└── shuf_origin
    ├── 2.jpeg
    ├── 4.jpeg
    ├── 8.jpeg
    └── 9.jpeg