What is kind of confusing is that if you look at its constructor, it takes a tokenizer … which is a class that does the actual work of tokenizing your inputs, and has an interface you need to implement should you make your own. In fastai parlance, it is referred to as the “tokenization function”.
If you look at the SpacyTokenizer and SentencePieceTokenizer classes, you can get an idea of how that interface is to be implemented. The minimal implementation requirements are here.
Sorry for the confusion … by that I meant that a custom Tokenizer needs to implement the functions you see in BaseTokenizer.
And again, remember that this is the “tokenization function” … the one used by the Tokenizer transform. As I said, the nomenclature is a bit confusing.
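Based on the pattern you see in SpacyTokenizer, a minimal custom tokenization function can be sketched like this — note the whitespace splitting and the class name here are just an illustration, not fastai code:

```python
class WhitespaceTokenizer:
    "Sketch of a custom 'tokenization function': a callable over a batch of texts."
    def __init__(self, split_char=" "):
        self.split_char = split_char

    def __call__(self, items):
        # Like SpacyTokenizer, take an iterable of strings and
        # yield a list of tokens for each one.
        return (txt.split(self.split_char) for txt in items)

# You would then hand an instance of this to the Tokenizer transform,
# the same way fastai wires up SpacyTokenizer.
```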
```
if lr = 1e-3            -> [0.001, 0.001, 0.001]
if lr = slice(1e-3)     -> [0.0001, 0.0001, 0.001]
if lr = slice(1e-5,1e-3) -> array([1.e-05, 1.e-04, 1.e-03])
```
Correction: instead of lr/3 it should be lr/10.
I think a better way to put it is: train the first param group with 1e-6, the last param group with 1e-4, and the middle param group(s) with rates evenly geometrically spaced between 1e-6 and 1e-4.
Usually we divide the parameters into 3 groups.
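The “evenly geometrically spaced” part can be sketched with NumPy — `np.geomspace` here is just a stand-in to show the idea, not the actual fastai internals:

```python
import numpy as np

def group_lrs(low, high, n_groups):
    """Discriminative learning rates for n parameter groups:
    first group gets the lowest lr, last group the highest,
    middle groups geometrically spaced in between."""
    return np.geomspace(low, high, n_groups)

print(group_lrs(1e-6, 1e-4, 3))  # -> [1.e-06 1.e-05 1.e-04]
```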
In general, Jeremy ran some experiments with this to find what worked best on Rossmann. What he found was a general rule, captured in the code:
```python
def emb_sz_rule(n_cat):
    "Rule of thumb to pick embedding size corresponding to `n_cat`"
    return min(600, round(1.6 * n_cat**0.56))
```
So as you can see, it’s 1.6 times the cardinality of the variable raised to the 0.56 power, capped at 600. Some people have had success making it just a little bit higher than what’s given, but this rule works very well. If this looks a little different from in the past, that’s because it was actually changed and updated.
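To get a feel for the numbers, here is what the rule gives for a few cardinalities (values computed directly from the formula quoted above):

```python
# emb_sz_rule reproduced from the post above
def emb_sz_rule(n_cat):
    "Rule of thumb to pick embedding size corresponding to `n_cat`"
    return min(600, round(1.6 * n_cat**0.56))

for n in (10, 1000, 100_000):
    print(n, emb_sz_rule(n))
# 10 -> 6, 1000 -> 77, 100000 -> 600 (the cap kicks in)
```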
Cardinality would work for columns with categorical data, right? How would you use this for setting emb_sz in collaborative filtering, for latent factors? @muellerzr
You should create a file in /root/.kaggle/ named kaggle.json … that looks like this:
{"username":"your username", "key":"your api key"}
Jeremy’s notebooks have some code to create this for you (if I recall from the last lecture), but you may have to change it to save to the correct place in paperspace.
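Here is a small sketch of creating that file from Python — the target directory is a parameter so you can point it at /root/.kaggle on Paperspace, or wherever your platform expects it, and the username/key values are of course placeholders:

```python
import json
from pathlib import Path

def write_kaggle_creds(username, key, cred_dir):
    "Write kaggle.json with the restrictive permissions the Kaggle API expects."
    cred_dir = Path(cred_dir)
    cred_dir.mkdir(parents=True, exist_ok=True)
    cred_file = cred_dir / "kaggle.json"
    cred_file.write_text(json.dumps({"username": username, "key": key}))
    cred_file.chmod(0o600)  # the kaggle CLI warns if this file is world-readable
    return cred_file

# e.g. write_kaggle_creds("your username", "your api key", "/root/.kaggle")
```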
My guess, without seeing everything, is that what you are giving predict is not an image object (in particular, I think it has to be a fastai2.vision.core.PILImage object).
Dear @wgpubs, sorry if I missed this yesterday.
I was wondering about this chart: are the green bars the ones that influence our predictions the most?
Or should we just be looking at the length of the bars?
Also, this part doesn’t make sense to me: he says ModelID and fiModelDesc have a high number…
so why did he decide to remove fiModelDescriptor?
I am a bit confused.
I think it may be a typo … what they meant to do is drop fiModelDesc because, based on the numbers, it seems like it may have a strong correlation with ModelID.
Could you link to this post from the official lecture thread? Would be nice if either Jeremy or Sylvain can confirm things either way.
As I understand things, what this shows is, for a given row (so for a specific record), how we got to the final prediction for it (e.g., the column labeled “net”) from the mean value of our target/dependent variable (the prediction before any splits).
So in the example, that mean value is 10.10 (that is where we start) … and the prediction for this record is 9.98 (where we end up). “YearMade” had a negative effect on the price … dropping it by .423, whereas “ProductSize” had a positive effect … raising it by .201 over that mean. The waterfall plot shows what effect each of our independent variables played in getting us to that 9.98 prediction.