This topic is for discussion of the seventh live coding session
Important note
During this video at one point I install mamba into my local conda env. This turns out to be a mistake, so later in the video I remove it and all the stuff it adds, and use micromamba instead. To skip all that headache, when you run through these steps don’t run the command that installs mamba at all!
Links from the walk-thru
- (please contribute here)
What was covered
- (please contribute here)
Video timeline - thank you @Daniel
00:00 Radek intro
Practice walk-thru 6 / chp1 on Kaggle
03:22 Jeremy introduces Kaggle competitions and reminds us that we have to join a competition before we can download its dataset
06:05 Paperspace had some issues even on the paid servers, so the session continues on a local machine
10:08 How to install the Kaggle CLI (presumably the same command works both locally and on Paperspace)? pip install --user kaggle
10:58 What's special about the things installed/stored in a bin/ directory? They are programs you can execute
11:17 Why can't we run kaggle in the terminal just yet? Because the bin/ directory it was installed into is not on $PATH
11:41 How to get the bin/ directory where kaggle is installed onto $PATH? On Paperspace we can add the bin directory to $PATH via the /storage/.bash.local file (Jeremy explains what that file does for you), which makes kaggle work in every new terminal from then on; on a local machine you can do the same with the .bashrc file; Radek and Jeremy also confirmed that if you want to run !kaggle inside a Jupyter notebook, you need to add the bin/ directory to $PATH in the pre-run.sh file as well.
12:31 Another question answered
13:14 How to edit .bashrc: 1. vim .bashrc (or /storage/.bash.local on Paperspace); shift + g to go to the bottom of the file; o to open a new line and start editing; type export PATH=~/.local/bin:$PATH to add the bin/ directory to $PATH; 2. press Esc, then type :wq to save and exit vim; 3. in the terminal type source !$ (meaning "source the last command's argument") to apply the change, or just close and reopen the terminal
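Whichever of the three files you edit, the change itself is the same single line (a minimal sketch; the Paperspace file locations are the ones from the earlier walkthru setup):

export PATH=~/.local/bin:$PATH   # append to ~/.bashrc (local), /storage/.bash.local (Paperspace terminals), or pre-run.sh (for notebooks); no spaces around the =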
14:12 How to get kaggle.json and put it in the right directory? 1. typing kaggle will remind you to put kaggle.json in the right place; 2. go to your Kaggle account page and click "Create New API Token" to download the kaggle.json file; 3. sudo cp .kaggle/kaggle.json ~anotherUser/.kaggle copies a file (here kaggle.json) from one user's home to another's; 4. followed by another command (chown) to change the ownership of the copied file
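A minimal sketch of the single-user version of this setup (it assumes the token was downloaded to ~/Downloads; the multi-user cp/chown variant shown in the video differs):

mkdir -p ~/.kaggle                        # the kaggle CLI expects its token here
mv ~/Downloads/kaggle.json ~/.kaggle/     # put the downloaded token in place
chmod 600 ~/.kaggle/kaggle.json           # keep the credentials readable only by you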
18:09 How to use the kaggle CLI to download a competition dataset? cd git; mkdir paddy; cd !$; kaggle competitions download -c paddy-disease-classification creates a folder and downloads the dataset into it; How big is the Paddy Doctor dataset? About 1GB
18:52 mamba vs pip for installing kaggle? For simple Python packages, pip install is a more obvious choice than mamba install
20:35 How to unzip the dataset file? paddy$ unzip -q paddy-disease-classification.zip unzips the file; -q (quiet) suppresses the per-file progress messages
20:54 How to make kaggle.json available on Paperspace? 1. use the upload button in Paperspace JupyterLab to upload kaggle.json and store it as ~/.kaggle/kaggle.json; 2. make sure the file permissions are -rw------- (search for chmod and how Jeremy taught us to change file permissions); 3. as a different solution, Radek suggested in his post writing out your own kaggle.json directly on Paperspace
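A sketch of that alternative: write the token file by hand instead of uploading it (the username and key values are placeholders to replace with the contents of your own API token):

mkdir -p ~/.kaggle
cat > ~/.kaggle/kaggle.json <<'EOF'
{"username":"YOUR_KAGGLE_USERNAME","key":"YOUR_KAGGLE_KEY"}
EOF
chmod 600 ~/.kaggle/kaggle.json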
21:46 How to copy files to a different local server, and how to make use of the .ssh/config file? (involves cp, chown, scp, etc.)
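A minimal sketch of the idea (the host alias, address, and user here are hypothetical; an entry like this lets scp use a short alias instead of the full address):

# ~/.ssh/config
Host othermachine
    HostName 192.168.1.10
    User youruser

# then, for example (assumes ~/.kaggle already exists on the remote machine):
scp ~/.kaggle/kaggle.json othermachine:~/.kaggle/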
24:14 How to check your local GPU? type nvidia-smi
25:06 How to check out the paddy dataset? mv ~/paddy-disease-classification.zip ./ moves the zip file from the home directory into the current folder; unzip it with unzip -q paddy-disease-classification.zip; and ls to see what's inside
25:32 How to explore the dataset folder in the terminal? ls train_images/ | head takes the output of ls train_images/ and sends it to head, which prints just the first few lines; How to look into files in a subfolder? ls train_images/bacterial_leaf_blight/ | head; How to count the number of files inside a subfolder? ls train_images/bacterial_leaf_blight/ | wc -l, in which | wc -l takes the output of ls folderName and counts the number of lines; What are other useful pipes like | head? | tail outputs the last few file names, and | grep 33 outputs only the file names with '33' in them; cat fileName prints a file's contents, so cat train.csv | head gives us the first few rows of the csv file, and cat train.csv | grep ADT45 | wc -l searches for rows containing 'ADT45' in train.csv and counts them
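The same exploration commands gathered in one place, with what each pipe does (the grep example is just one illustration of filtering a listing):

ls train_images/ | head                              # first few entries of the folder listing
ls train_images/bacterial_leaf_blight/ | head        # peek inside one class subfolder
ls train_images/bacterial_leaf_blight/ | wc -l       # count the files in that subfolder
ls train_images/bacterial_leaf_blight/ | tail        # last few entries instead of the first
ls train_images/bacterial_leaf_blight/ | grep 33     # only names containing '33'
cat train.csv | head                                 # first few rows of the csv
cat train.csv | grep ADT45 | wc -l                   # count the rows mentioning ADT45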
Go through the above steps on Paperspace
29:09 How to use the correct version of pip to install kaggle? 1. check which pip is found with which pip: if it is not the one from opt/conda/bin/, move the stray version out of the way (Jeremy used mv on root/conda/bin/pip, for example), then restart the terminal and run which pip again to confirm; 2. ctrl + r and type pip to find pip install kaggle --user and install kaggle; 3. the warning message confirms that kaggle was installed into /root/.local/bin, which is not on $PATH, so we need to get it onto $PATH; 4. we could add it to $PATH via .bashrc on a local machine or .bash.local on Paperspace, but here we try Radek's approach with pre-run.sh: add export PATH=~/.local/bin:$PATH at the end of that file, and remember that bash is sensitive to spaces, so don't put spaces around the =; 5. if you also run export PATH=~/.local/bin:$PATH directly in the terminal, you don't have to close and restart the terminal to pick up the new $PATH; 6. type kaggle: it should run, and it also tells us to put kaggle.json in the right directory; 7. upload kaggle.json to Paperspace; 8. did kaggle create a ~/.kaggle folder for us? yes, we can confirm with cd ~/.kaggle; ls; 9. move the uploaded kaggle.json into ~/.kaggle with mv /notebooks/kaggle.json ./; 10. ls -la shows the permissions of kaggle.json are wrong (-rw-r--r--); 11. fix the permissions with chmod 600 kaggle.json; 12. now copy the download command from the paddy competition site and run ~/.kaggle# kaggle competitions download -c paddy-disease-classification; 13. move the zip file into /notebooks/paddy/ using the trick Jeremy taught earlier in walkthru 6: ~/.kaggle# mkdir -p ../notebooks/paddy/; mv paddy-disease-classification.zip ../notebooks/paddy/; 14. unzip the dataset zip file with unzip -q paddy-disease-classification.zip
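Pulled together, the Paperspace flow above looks roughly like this (a sketch using absolute paths; it assumes kaggle.json was uploaded to /notebooks, that pip install --user kaggle has already run, and that unzip is installed, which the next entry covers):

export PATH=~/.local/bin:$PATH                 # make the freshly installed kaggle visible in this shell
mkdir -p ~/.kaggle
mv /notebooks/kaggle.json ~/.kaggle/           # move the uploaded token into place
chmod 600 ~/.kaggle/kaggle.json                # fix the -rw-r--r-- permissions
cd ~/.kaggle
kaggle competitions download -c paddy-disease-classification
mkdir -p /notebooks/paddy
mv paddy-disease-classification.zip /notebooks/paddy/
cd /notebooks/paddy
unzip -q paddy-disease-classification.zip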
34:47 How to install unzip on Paperspace? micromamba install -c conda-forge -p ~/conda unzip; hopefully by July or August 2022 mamba and unzip will be preinstalled by Paperspace so we don't have to do this manually
34:50 How to get the terminal keyboard shortcuts working in the Paperspace terminal too? (covered in later sessions?)
36:59 How to deal with a large dataset and the cost of persistent storage? 1. if the dataset sits in /notebooks/ it is charged at $0.29 per GB per month; 2. if you don't want to spend any money, move the dataset into the home directory ~/; you will lose it when the notebook/machine shuts down and have to download it again on the next start, but you could write a script to automate the downloading; 3. if you will work on this dataset for a month, the $0.29 is surely worth your time.
38:04 It takes a long time for Paperspace to store large files into persistent storage, at least for folders with a lot of files. How to make using the dataset faster on Paperspace without much trouble? 1. How to find out how much disk space a folder takes? du -sh train_images/; 2. why does Jeremy think it may be better to move the dataset back to the home directory ~? we don't want to spend a long time just reading the dataset while training a model; 3. how to delete multiple folders and files in one go? rm -rf test_images/ train* sample_submission.csv, where train* covers train_images/ and train.csv, and even deleting them takes a while; 4. how long does it take to unzip the 1GB dataset file in the home directory? time unzip -q paddy-disease-classification.zip (only 8 seconds, and only 5 seconds to download), so we should automate the download and unzip into the home directory when starting the paddy notebook.
41:14 How can we create a script to automate downloading and unzipping the paddy dataset into the home directory (when starting the paddy notebook)? 1. create a directory paddy inside /notebooks/, and save the paddy jupyter notebook there along with the automation script get_data.sh; 2. What does get_data.sh look like? See below ( #question shouldn't we do pushd ~; popd in the script below?); 3. how to make get_data.sh executable? chmod u+x get_data.sh; 4. now you can run get_data.sh every time you start the paddy notebook/machine to automatically download and unzip the dataset into the home directory (or you can call get_data.sh from pre-run.sh so that you don't need to run it yourself; a one-line sketch of that follows the script).
#!/usr/bin/env bash
cd
mkdir paddy
cd paddy
kaggle competitions download -c paddy-disease-classification
unzip -q paddy-disease*
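If you'd rather not run it by hand, a single line like this at the end of pre-run.sh would do it (a sketch; it assumes get_data.sh lives in /notebooks/paddy and has been made executable):

/notebooks/paddy/get_data.sh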
43:08 Create a jupyter notebook for the paddy competition. 1. What's the first thing Jeremy usually does for an image competition/task like this? from fastai.vision.all import * to get all the vision classes and methods ready for use; 2. How to set up the path to the dataset? path = Path.home()/'paddy', then evaluate path to check the dataset directory; 3. use path.ls() to show us what's in there, and if we set Path.BASE_PATH = path, then path.ls() leaves the directory prefix out and makes the listing more readable; 4. How to take a look at the train.csv file? df = pd.read_csv(path/'train.csv'); df shows the first and last few rows of the csv file; (continued in the next paragraph below)
from fastai.vision.all import *
path = Path.home()/'paddy'
path
path.ls()
Path.BASE_PATH = path
path.ls()
df = pd.read_csv(path/'train.csv')
df
45:19 How to take a look at the images listed in the csv file? 1. How to get the path to train_images/ and to a particular category, bacterial_leaf_blight? trn_path = path/'train_images'; blb = trn_path/'bacterial_leaf_blight'; 2. How to display an image from its path? img = PILImage.create(blb/'100330.jpg'); img; 3. What is the size of the image? img.size (size is a property, not a method); 4. How to get all the image paths into a list? files = get_image_files(trn_path); files; img = PILImage.create(files[0]); 5. How to check whether the image size is consistent across images? [PILImage.create(o).size for o in files[:10]]
from fastai.vision.all import *
path = Path.home()/'paddy'
path
path.ls()
Path.BASE_PATH = path
path.ls()
df = pd.read_csv(path/'train.csv')
df
trn_path = path/'train_images'
blb = trn_path/'bacterial_leaf_blight'
img = PILImage.create(blb/'100330.jpg')
img
img.size
files = get_image_files(trn_path)
files
img = PILImage.create(files[0])
img
[PILImage.create(o).size for o in files[:10]]
df.variety.value_counts()
dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, seed=42, item_tfms=Resize(224))
dls.show_batch()
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)
49:40 What to think about image sizes? The paddy image sizes are consistent, which is handy and interesting (why interesting?), and the images seem fairly big. Jeremy's advice is to start with smaller sizes and then move on to the original size, but 480x640 is not terribly large.
51:00 What shall we do with the variety metadata of the paddy/rice? 1. our model may not need the variety data to train for this task, as the images may give the model enough information to figure out the variety by itself; 2. but if there were many varieties, this metadata might become useful or even necessary; 3. how to count the unique varieties using the variety column? df.variety.value_counts(); 4. given there are only 10 unique varieties and about 70% of the images belong to a single variety, variety should be a low priority for training the model;
53:10 Build an ImageDataLoaders from train_images: 1. since chapter 1 (the 01_intro notebook) of fastbook builds a vision model, some of its code can be borrowed; 2. How to create an ImageDataLoaders from the train_images/ folder? dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, seed=42, item_tfms=Resize(224)), and we can show a batch of images with dls.show_batch();
56:50 Build a learner and fine-tune it: 1. How to build a vision classification model with ResNet34 on our dls? learn = vision_learner(dls, resnet34, metrics=error_rate) downloads the resnet34 weights for us and creates the learner; 2. How to fine-tune this model for one epoch? learn.fine_tune(1)
57:07 How do we know our model is using the GPU efficiently? 1. in the terminal run nvidia-smi dmon and focus on two columns, sm and mem (memory); 2. sm is GPU utilisation; we want that number to be high, 70-90% is good, and if the training error rate is also coming down nicely we can assume we are training the model successfully; 3. if sm is lower than 50%, it suggests the GPU is not being used properly; 4. if that happens, what are the likely causes? The most likely one is that we are not reading and processing images fast enough for the model, which can be a result of storing the dataset inside /storage/ or /notebooks/, since those are network drives and therefore slow;
1:00:11 What are the potential solutions to improve sm? 1. move the dataset from /storage/ or /notebooks/ to local storage such as the home directory; 2. resize the images ahead of time (to make them smaller); 3. decrease the amount of augmentation; 4. pick an instance/machine with more CPUs
1:02:03 Next session: Kaggle submission and working in a Kaggle notebook
