This topic is for discussion of the seventh live coding session
During this video, at one point I install mamba into my local conda env. This turns out to be a mistake, so later in the video I remove it and all the stuff it adds, and use micromamba instead. To skip all that headache, when you run through these steps don't run the command that installs mamba at all!
- (please contribute here)
- (please contribute here)
Video timeline - thank you @Daniel
03:22 Jeremy introduces Kaggle competitions and reminds us that we have to join the competition before downloading the dataset
06:05 Paperspace had some issues even on the paid servers, so Jeremy continues the session on his local machine
10:08 How to install kaggle (presumably the command works both locally and on Paperspace)?
pip install --user kaggle
10:58 What's special about the things installed/stored in a bin/ directory? They are things/programs you can execute
11:17 Why can't we run kaggle in the terminal just now? Because the bin/ directory it was installed into is not in $PATH
11:41 How to get the bin/ directory where kaggle is installed into $PATH? On Paperspace, we can add the bin directory to the /storage/.bash.local file (here is Jeremy explaining what the file does for you), which enables kaggle to work in a new terminal next time; on a local machine, you can do the same with the .bashrc file; Radek and Jeremy also confirmed that if you want to run !kaggle in a jupyter notebook, you also need to add the bin/ directory to $PATH
12:31 Another question answered
13:14 How to edit .bashrc (sketch below): 1. vim .bashrc (or /storage/.bash.local on Paperspace); shift + g to go to the bottom of the file; o to open a new line and start editing; type export PATH=~/.local/bin:$PATH to add the bin/ directory to $PATH; 2. type :wq to save and exit vim; 3. in the terminal type source !$ (meaning: source the last argument of the previous command) to run .bashrc, or you can just close and reopen the terminal
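As a quick sketch, the whole edit-and-reload sequence looks roughly like this (the comment lines mark vim keystrokes rather than shell commands):

```bash
vim ~/.bashrc              # or vim /storage/.bash.local on Paperspace
# shift+g   -> jump to the bottom of the file
# o         -> open a new line in insert mode, then type:
#   export PATH=~/.local/bin:$PATH
# :wq       -> save and quit vim
source !$                  # !$ expands to the last argument, i.e. the file we just edited
```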
14:12 How to get kaggle.json and put it in the right directory? 1. typing kaggle will remind you to have kaggle.json ready in the right place; 2. go to your Kaggle account and click Create New API Token to download the kaggle.json file; 3. sudo cp .kaggle/kaggle.json ~anotherUser/.kaggle copies a file (here kaggle.json) from one user to another user; 4. use chown to change the ownership of the file (see the sketch below)
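For example, a minimal sketch of steps 3 and 4, assuming a hypothetical second user called otheruser:

```bash
# hypothetical second user "otheruser"; create their .kaggle dir first if needed
sudo mkdir -p ~otheruser/.kaggle
sudo cp ~/.kaggle/kaggle.json ~otheruser/.kaggle/
sudo chown otheruser ~otheruser/.kaggle/kaggle.json  # hand ownership to that user
```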
18:09 How to use the kaggle cli to download a competition dataset? cd git; mkdir paddy; cd !$; kaggle competitions download -c paddy-disease-classification creates a folder and downloads the dataset into it; How big is the paddy doctor dataset? About 1GB
18:52 mamba vs pip for installing kaggle? For simple python packages, pip install is the more obvious choice than mamba
20:35 How to unzip the dataset file? paddy$ unzip -q paddy-disease-classification.zip unzips the file without displaying the processing messages (-q for quiet)
20:54 How to get kaggle.json available on Paperspace? 1. use the upload button in the Paperspace jupyter lab to upload kaggle.json and store it as ~/.kaggle/kaggle.json; 2. make sure the file permissions are -rw------- (search for chmod to see how Jeremy taught us to change file permissions); 3. as a different solution, Radek suggested in his post writing out your own kaggle.json and saving it on Paperspace (see the sketch below)
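A minimal sketch of Radek's write-it-yourself approach; the credential values are placeholders you fill in from your Kaggle API token:

```bash
# placeholders below: copy the real values from your downloaded API token
mkdir -p ~/.kaggle
echo '{"username":"YOUR_KAGGLE_USERNAME","key":"YOUR_KAGGLE_KEY"}' > ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json   # gives the -rw------- permissions mentioned above
```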
21:46 How to copy files to a different local server? How to make use of the .ssh/config file? (involves scp and ssh host aliases; sketch below)
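A sketch of how an .ssh/config entry makes this convenient; the host alias, address, and user below are made-up examples:

```bash
# ~/.ssh/config (hypothetical entry)
Host myserver
    HostName 192.168.1.20
    User jeremy

# with the alias in place, copying the token to that machine is just:
scp ~/.kaggle/kaggle.json myserver:~/.kaggle/
```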
24:14 How to check your local GPU? type nvidia-smi
25:06 How to check out the paddy dataset? mv ~/paddy-disease-classification.zip ./ moves the zip file from the home directory to the current folder; unzip it with unzip -q paddy-disease-classification.zip; and ls to see what's inside
25:32 How to explore the dataset folder in the terminal? ls train_images/ | head takes the output of ls train_images/ and sends it to head, which prints just the first few lines; How to look into files in a subfolder? ls train_images/bacterial_leaf_blight/ | head; How to count the number of files inside a subfolder? ls train_images/bacterial_leaf_blight/ | wc -l, in which | wc -l takes the output from ls folderName and counts the number of lines; What are other useful commands? | tail outputs the last few file names, | grep 33 outputs the filenames with '33' in them; cat fileName prints what's inside a file; cat train.csv | head gives us the first few rows of the csv file; cat train.csv | grep ADT45 | wc -l searches for rows containing 'ADT45' in train.csv and counts them (a small combined example follows below)
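Combining the pieces above, a small sketch (assuming the paddy folder layout) that counts the images in every class subfolder:

```bash
# pipe each per-class listing into wc -l, as in the examples above
for d in train_images/*/; do
    echo "$d: $(ls "$d" | wc -l)"
done
```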
Go through the above steps on Paperspace
29:09 How to use the correct version of pip to install kaggle? 1. check which pip is found first with which pip: if it is not from the directory /opt/conda/bin/ then move the stray version out of the way (for example mv /root/conda/bin/pip /root/conda/bin/pip.bak), restart the terminal and run which pip again to confirm; 2. ctrl + r then type pip to find pip install kaggle --user in the shell history, and run it to install kaggle; 3. however, the warning message confirms that kaggle is installed into /root/.local/bin, which is not on $PATH, so we need to get it onto $PATH; 4. we could add it to .bashrc on a local machine or .bash.local on Paperspace, but we will try Radek's approach with pre-run.sh: just add export PATH=~/.local/bin:$PATH to the end of that file, and be aware that bash is very sensitive about spaces, so don't put spaces inside the command; 5. if you run export PATH=~/.local/bin:$PATH directly in the terminal, you don't have to close and restart the terminal to activate the new $PATH; 6. type kaggle: it should run, and it also tells us to put kaggle.json in the right directory; 7. upload kaggle.json to Paperspace; 8. did kaggle create a ~/.kaggle folder for us? yes, we can confirm with cd ~/.kaggle; ls; 9. move the uploaded file there with mv /notebooks/kaggle.json ./; 10. ls -la shows that the permissions on kaggle.json are wrong; 11. fix them with chmod 600 kaggle.json; 12. now copy the download command from the paddy competition site and run it from ~/.kaggle: kaggle competitions download -c paddy-disease-classification; 13. move the zip file into the paddy folder using a trick Jeremy taught earlier in walkthru 6: mkdir -p ../notebooks/paddy/; mv paddy-disease-classification.zip ../notebooks/paddy/; 14. unzip the dataset with unzip -q paddy-disease-classification.zip (a consolidated sketch of steps 9-14 follows below)
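Put together, the sequence above looks roughly like this (paths as used in the video; adjust to your setup):

```bash
mv /notebooks/kaggle.json ~/.kaggle/                          # step 9
chmod 600 ~/.kaggle/kaggle.json                               # step 11
cd ~/.kaggle
kaggle competitions download -c paddy-disease-classification  # step 12
mkdir -p ../notebooks/paddy/                                  # step 13
mv paddy-disease-classification.zip ../notebooks/paddy/
cd ../notebooks/paddy/ && unzip -q paddy-disease-classification.zip  # step 14
```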
34:47 How to install unzip on Paperspace? micromamba install -c conda-forge -p ~/conda unzip, and hopefully by July or August 2022 unzip will be preinstalled by Paperspace so we won't have to do it manually
34:50 How to get the terminal keyboard shortcuts working in the Paperspace terminal too? (covered in later sessions?)
36:59 How to deal with a large dataset and the cost of persistent storage? 1. if the dataset sits in /notebooks/ it is charged at $0.29 per GB/month; 2. if you don't want to spend any money, move the dataset into the home directory ~/, but you will lose it when the notebook/machine shuts down and will have to download it again next time (you could write a script to automate the downloading); 3. if you will work on this dataset for a month, then $0.29 is surely worth your time.
38:04 It takes a long time for Paperspace to store large files into persistent storage, at least for folders with a lot of files. How to make using the dataset faster on Paperspace without much trouble? 1. How to find the disk space a folder takes? du -sh train_images/; 2. why does Jeremy think it may be a better idea to move the dataset back to the home directory ~? because we don't want slow reads from the dataset while training a model; 3. how to delete multiple folders and files in one go? rm -rf test_images/ train* sample_submission.csv (the train* glob matches train_images/ and train.csv), and even deleting them takes a while; 4. How long does it take the home directory to unzip a 1GB dataset file? time unzip -q paddy-disease-classification.zip (only 8 seconds, and only 5 seconds to download), so we should automate the download and unzip into the home directory when starting the paddy notebook.
41:14 How can we create a script to automate the download and unzip of the paddy dataset into the home directory (when starting the paddy notebook)? 1. in /notebooks/, save the paddy jupyter notebook and the automation script get_data.sh; 2. What does get_data.sh look like? (see the script below; #question shouldn't we do pushd ~; popd in it?) 3. make it executable with chmod u+x get_data.sh; 4. so now you can run get_data.sh every time you start the paddy notebook/machine to automatically download and unzip the dataset into the home directory (or call it from pre-run.sh so that you don't need to run it manually)

```bash
#!/usr/bin/env bash
cd
mkdir -p paddy
cd paddy
kaggle competitions download -c paddy-disease-classification
unzip -q paddy-disease*
```
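On the #question above: yes, pushd/popd would bring you back to where you started; a variant under that assumption:

```bash
#!/usr/bin/env bash
pushd ~          # remember the current directory and jump home
mkdir -p paddy
cd paddy
kaggle competitions download -c paddy-disease-classification
unzip -q paddy-disease*
popd             # return to wherever we started
```

Note that when run as ./get_data.sh the script executes in its own shell, so the caller's directory is unchanged either way; pushd/popd only matters if you source the script.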
43:08 Create a jupyter notebook for the paddy competition. 1. What's the first thing Jeremy usually does for an image competition/task like this? from fastai.vision.all import * to get all the vision classes and methods ready for use; 2. How to set up the path to the dataset? path = Path.home()/'paddy', and evaluate path to check the dataset directory; 3. use path.ls() to show what's in there, and if we set Path.BASE_PATH = path, then path.ls() will leave the base directory part out and make the names more readable; 4. How to take a look at the csv file? df = pd.read_csv(path/'train.csv'); df shows the first and last few rows of the csv file (full code in the block below)
```python
from fastai.vision.all import *

path = Path.home()/'paddy'
path
path.ls()
Path.BASE_PATH = path
path.ls()
df = pd.read_csv(path/'train.csv')
df
```
45:19 How to take a look at an image listed in the csv file? 1. How to get the path for train_images/ and a path to a particular category? trn_path = path/'train_images'; blb = trn_path/'bacterial_leaf_blight'; 2. How to display an image given its path? img = PILImage.create(blb/'100330.jpg'); img; 3. What is the size of the image? img.size (size is an attribute, not a method, despite how it looks); 4. How to get all the image paths into a list? files = get_image_files(trn_path); files, and open the first one with img = PILImage.create(files[0]); 5. How to check whether the image size is consistent across images? [PILImage.create(o).size for o in files[:10]]
```python
from fastai.vision.all import *

path = Path.home()/'paddy'
path
path.ls()
Path.BASE_PATH = path
path.ls()
df = pd.read_csv(path/'train.csv')
df
trn_path = path/'train_images'
blb = trn_path/'bacterial_leaf_blight'
img = PILImage.create(blb/'100330.jpg')
img
img.size
files = get_image_files(trn_path)
files
img = PILImage.create(files[0]); img
[PILImage.create(o).size for o in files[:10]]
df.variety.value_counts()
dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, seed=42, item_tfms=Resize(224))
dls.show_batch()
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)
```
49:40 What to think about image sizes? The paddy image size is consistent, which is handy and interesting (why interesting?), and the images seem big. Jeremy's advice is to start with smaller sizes and then move on to the original size, but 480x640 is not terribly large.
51:00 What shall we do with the variety metadata of the paddy/rice? 1. our model may not need the variety data to train itself for the task, as the images may give the model enough information to figure out variety on its own; 2. but if the number of varieties were large, the variety data might be useful or even necessary to our model; 3. how to count the unique varieties of paddy? df.variety.value_counts() (see the one-liner below); 4. given only 10 unique varieties, and about 70% of images belonging to a single variety, variety should be a low priority for training the model
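As a small aside (not from the video), value_counts can report fractions directly:

```python
df.variety.value_counts(normalize=True)  # the dominant variety (ADT45) comes out near 0.70
```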
53:10 Build an ImageDataLoaders from train_images: 1. since chapter 1 (the 01_intro notebook) of fastbook covers a vision model, maybe some code can be borrowed from there; 2. How to create the DataLoaders from the folder? dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, seed=42, item_tfms=Resize(224)), and we can show a batch of images with dls.show_batch()
56:50 Build a learner and fine-tune it: 1. How to build a vision classification model with Resnet34 using our DataLoaders? learn = vision_learner(dls, resnet34, metrics=error_rate) downloads the resnet34 weights for us and creates the model learn; 2. How to fine-tune this model for one epoch? learn.fine_tune(1)
57:07 How do we know our model is using the GPU efficiently? 1. in the terminal type nvidia-smi dmon and focus on reading two columns: sm and mem (memory); 2. sm is GPU compute utilisation; we want the number to be high, 70-90% is good, and if the error rate from training is also low we can assume we are training successfully; 3. if sm is lower than 50%, it suggests the GPU is not being used properly by our model; 4. if that happens, the most likely cause is that we are not reading and processing images fast enough for the model to use, which could be a result of storing the dataset inside /notebooks/, since that is network storage and therefore slow
1:00:11 What are the potential solutions to improve sm? 1. move the dataset to local storage like the home directory, out of /notebooks/; 2. resize images ahead of time to make them smaller (a sketch follows below); 3. decrease the amount of augmentation; 4. pick an instance/machine with more CPUs
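For option 2, a hedged sketch using fastai's resize_images helper; the 256-pixel cap and the destination folder are arbitrary choices, not from the video:

```python
from fastai.vision.all import *

# write smaller copies of every training image into a new folder,
# recursing into the per-class subfolders
resize_images(path/'train_images', max_size=256, recurse=True,
              dest=path/'train_images_256')
```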
1:02:03 Next session: kaggle submission and kaggle notebooks