@jeremy has a nice workflow for moving files around in his different notebooks, but it rarely worked perfectly for me because my setup was slightly different or I made a typo or whatever. So I'd have to start over, and the bash commands weren't easily repeatable even with a bash script.
How are you dealing with this? I ended up writing this Python script, which takes the raw cats/dogs input we downloaded and moves it into the right directories for training/validation/sampling.
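Since the script itself isn't reproduced here, a rough sketch of the idea (`sort_by_label` is a hypothetical helper, not the poster's actual code; it assumes the Kaggle `cat.*.jpg` / `dog.*.jpg` naming):

```python
import os
import shutil

def sort_by_label(src_dir, dest_dir):
    """Move files named like cat.123.jpg / dog.456.jpg into per-class
    subdirectories (cats/, dogs/) under dest_dir."""
    for name in os.listdir(src_dir):
        label = name.split('.')[0]                       # 'cat' or 'dog'
        label_dir = os.path.join(dest_dir, label + 's')  # cats/ or dogs/
        os.makedirs(label_dir, exist_ok=True)
        shutil.move(os.path.join(src_dir, name),
                    os.path.join(label_dir, name))
```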
Thanks for sharing. I tend to do my file moving in a notebook, and then I zip up the processed directory. If I need to recreate it from scratch, I just delete and unzip.
Nice script! I've started to write my own helpers too. I might steal some of yours.
As soon as I download the images, I create backup_train and backup_test directories where I store the unzipped images. But I don't create any subdirectories yet; I do that in the notebooks, since each competition is different and I want the whole process to be completely repeatable every time. If I screw up somewhere and need to "reset", I just delete train or test and cp -r backup_train train.
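That reset step is easy to wrap in Python too, if you'd rather call it from the notebook (`reset` is a hypothetical helper mirroring `rm -rf train && cp -r backup_train train`):

```python
import shutil

def reset(dirname, backup_prefix='backup_'):
    """Delete a working directory and restore it from its backup copy,
    e.g. reset('train') restores train/ from backup_train/."""
    shutil.rmtree(dirname, ignore_errors=True)
    shutil.copytree(backup_prefix + dirname, dirname)
```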
In my notebooks I try to stay in the same directory the whole time (no cd). But at the beginning I create a reference variable DATA_HOME_DIR:

```
current_dir = os.getcwd()
DATA_HOME_DIR = current_dir + '/data/statefarm'
```
After that I can reference that directory anywhere in my code: move, copy files, load images, etc.
I'll also sometimes add this:
```
# Set paths - sample or prod
root = DATA_HOME_DIR + '/sample'  # or nothing, for the full dataset
test_path = DATA_HOME_DIR + '/test/'
results_path = root + '/results/'
train_path = root + '/train/'
valid_path = root + '/valid/'
models_path = root + '/models/'
```
Which lets me switch between prod and sample datasets quickly.
Nice! How do you handle moving the files around? Bash commands or Python?
Btw, I always use os.path.join to create paths because I find string addition with slashes error-prone. The extra typing is made up for by never having silly path errors.
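For example (using the DATA_HOME_DIR value from above):

```python
import os

root = '/data/statefarm'

# String addition is easy to get subtly wrong:
bad = root + 'train'                      # '/data/statefarmtrain' -- missing slash

# os.path.join inserts the separator for you:
train_path = os.path.join(root, 'train')  # '/data/statefarm/train'
```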
Thanks for sharing. I'm doing something similar. All my file wrangling is in Python and called from the notebook.
The general flow:
Download the data from Kaggle if necessary.
Unzip to a train/ directory if necessary.
Move some portion to a valid/ directory.
Copy some fraction from train/ to sample/{train,valid}/{dogs,cats}/
When I'm ready to train on all the data (minus the validation set), I increase the sample fraction to 1.0 and leave my code reading from the sample/ files. Feel free to use any of this you find useful, and let me know if you find any bugs or have any questions.
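The move/copy steps in that flow could be sketched roughly like this (`split_train` is a hypothetical helper, not the actual script; it works on a single flat directory for brevity):

```python
import os
import random
import shutil

def split_train(train_dir, valid_dir, sample_dir,
                valid_count, sample_frac, seed=0):
    """Move `valid_count` random files from train/ to valid/, then copy
    `sample_frac` of the remaining files into the sample directory.
    With sample_frac=1.0 the 'sample' becomes the full training set."""
    random.seed(seed)
    os.makedirs(valid_dir, exist_ok=True)
    os.makedirs(sample_dir, exist_ok=True)
    files = sorted(os.listdir(train_dir))
    for name in random.sample(files, valid_count):
        shutil.move(os.path.join(train_dir, name),
                    os.path.join(valid_dir, name))
    remaining = sorted(os.listdir(train_dir))
    for name in random.sample(remaining, int(len(remaining) * sample_frac)):
        shutil.copy(os.path.join(train_dir, name),
                    os.path.join(sample_dir, name))
```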
On Mac OS X the following commands work, assuming you have installed coreutils (which provides GNU/Linux equivalents of cp, mv, ls, and shuf).
```
# before shuffling and moving
❯ tree
.
├── shuf_dest
└── shuf_origin
    ├── 1.jpeg
    ├── 2.jpeg
    ├── 3.jpeg
    ├── 4.jpeg
    ├── 5.jpeg
    ├── 6.jpeg
    ├── 7.jpeg
    ├── 8.jpeg
    └── 9.jpeg

# you have to be inside the source folder
cd shuf_origin
gls ./ | gshuf -n 5 | xargs gmv -t ../shuf_dest/
# if you are interested in copying instead of moving:
# gls ./ | gshuf -n 5 | xargs gcp -t ../shuf_dest/
```
```
# after moving 5 images to the dest folder
❯ cd ..
❯ tree
.
├── shuf_dest
│   ├── 1.jpeg
│   ├── 3.jpeg
│   ├── 5.jpeg
│   ├── 6.jpeg
│   └── 7.jpeg
└── shuf_origin
    ├── 2.jpeg
    ├── 4.jpeg
    ├── 8.jpeg
    └── 9.jpeg
```