It doesn’t, both are treated the same way.
Good question @marii. The scaling method you propose would be problematic because it gives undue weight to outliers.
For example, suppose we have a database of statistics about men, where one of the features is weight. Most men weigh between 120 and 200 pounds, but some [weigh much more](https://en.wikipedia.org/wiki/List_of_heaviest_people).
What happens if you apply this method to standardize the weights, by dividing each by the weight of the heaviest man (1400 pounds)? The relatively small number of very heavy men would have standardized weights near 1.0, while most men’s standardized weights would fall between 120/1400 and 200/1400, i.e. roughly on the interval [0.09, 0.14]. So the high end of the scale, though sparsely populated, would be too heavily weighted compared to the range that contains most of the population. Pardon the pun!
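A quick toy computation makes the squeeze visible (the weights below are hypothetical values for illustration):

```python
# Toy illustration of the outlier problem with divide-by-max scaling.
weights = [120, 150, 160, 180, 200, 1400]  # one extreme outlier
max_scaled = [w / max(weights) for w in weights]
print([round(x, 3) for x in max_scaled])
# The bulk of the data is squeezed into roughly [0.086, 0.143],
# while the single outlier sits alone at 1.0 and dominates the scale.
```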
What a fantastic, well-organized, action-packed adventure this lecture is! The best lesson yet, IMHO. Jeremy leads a deep dive into state-of-the-art classical machine learning and deep learning techniques for collaborative filtering and learning from structured time series data sets.
Along the way, Master Chef Jeremy (and his talented fastai
sous-chefs) serve up a delightful smorgasbord of techniques, tricks and insights, all the while showing us how to do things the fastai
way – that is, with beautiful, crisp, clean software engineering.
Incredibly, Jeremy covers all of this material at a relaxed and deliberate pace in two hours, without making us feel that he is rushing.
If you want to get the most out of this lecture:
- Listen to it a few times to make sure you don’t miss anything! Chew the food slowly.
- Run the two notebooks 08_collab.ipynb and 09_tabular.ipynb in whatever environment you have set up.
- Spend enough time to study these notebooks closely, and make it your business to understand them as well as you can.
- Ask questions on the Forum, if you need help.
- Challenge yourself with the Questionnaire, and
- Try some of the Further Research at the end
- Finally, don’t feel that you have to leave this lesson behind and move on to the next thing. Keep coming back until you’ve gotten the marrow of it. This might take several weeks, but it will be worth it.
I’m getting this kaggle related error at the outset: “Missing username in configuration.” Does anybody know how to resolve?
Thank you!
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-1-84bf1621b01a> in <module>
1 #hide
2 from utils import *
----> 3 from kaggle import api
4 from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype
5 from fastai2.tabular.all import *
/opt/conda/envs/fastai/lib/python3.7/site-packages/kaggle/__init__.py in <module>
21
22 api = KaggleApi(ApiClient())
---> 23 api.authenticate()
/opt/conda/envs/fastai/lib/python3.7/site-packages/kaggle/api/kaggle_api_extended.py in authenticate(self)
150
151 # Step 3: load into configuration!
--> 152 self._load_config(config_data)
153
154 def read_config_environment(self, config_data=None, quiet=False):
/opt/conda/envs/fastai/lib/python3.7/site-packages/kaggle/api/kaggle_api_extended.py in _load_config(self, config_data)
191 for item in [self.CONFIG_NAME_USER, self.CONFIG_NAME_KEY]:
192 if item not in config_data:
--> 193 raise ValueError('Error: Missing %s in configuration.' % item)
194
195 configuration = Configuration()
ValueError: Error: Missing username in configuration.
Did you do !pip install kaggle?
I am also having other issues https://forums.fast.ai/t/kaggle-json/70088
In the tabular chapter the cont_cat_split method is called with different max_card parameters for the DT/RF model and the NN model, as follows:
#DT/RF
cont,cat = cont_cat_split(df, 1, dep_var=dep_var)
#NN
cont_nn,cat_nn = cont_cat_split(df_nn_final, max_card=9000, dep_var=dep_var)
The chapter does say that categorical columns are treated differently for the NN, since it needs to create embeddings, and it indicates that embeddings of size greater than 10k should not be used, hence the max cardinality of 9000.
So I am having trouble understanding how the limit on embedding size can be the criterion for deciding whether a feature/column is continuous or categorical.
Also, a max_card of 1 for the random forest seems too low in my opinion. Wouldn’t any categorical column have more than 1 unique value?
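As I understand it, max_card is a threshold on the number of distinct values: a numeric column with more than max_card unique values is treated as continuous, and everything else (low-cardinality numerics and all string columns) as categorical. A rough pure-Python sketch of that rule (not fastai’s actual code; the column names are made-up examples):

```python
# Hedged sketch of the rule I believe cont_cat_split applies:
# a column is categorical if it is non-numeric OR has <= max_card unique values.
def split_by_cardinality(columns, max_card):
    """columns: dict mapping column name -> list of values."""
    cont, cat = [], []
    for name, values in columns.items():
        numeric = all(isinstance(v, (int, float)) for v in values)
        if numeric and len(set(values)) > max_card:
            cont.append(name)
        else:
            cat.append(name)
    return cont, cat

data = {
    "saleWeek": list(range(52)) * 2,       # numeric, 52 unique values
    "couplerSystem": ["yes", "no"] * 52,   # string column -> always categorical
    "yearMade": [1990 + i for i in range(104)],
}
print(split_by_cardinality(data, max_card=1))
# -> (['saleWeek', 'yearMade'], ['couplerSystem'])
```

With max_card=1, every numeric column ends up continuous, which is consistent with the tree-based call in the chapter: trees consume raw numbers directly and need no embeddings, so there is no reason to treat low-cardinality numerics as categorical there.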
Check to make sure ~/.kaggle/kaggle.json has the correct settings.
@jcatanza thanks for your reply.
If I understood well, we have three kinds of ensembles:
- Bagging: train weak learners in parallel on subsamples of the data
- Boosting: train weak learners sequentially, each using the result of the previous learner
- Stacking: train some weak learners and aggregate them with a meta-learner
See this great article for reference: https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205
I think the methods you are proposing are more likely to be classified as stacking (indeed the second one, but not the first) than boosting. What do you think about that?
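The three patterns above can be caricatured in a few lines of plain Python. This is only a toy sketch: the "weak learner" here is just a constant predictor (the sample mean), whereas real ensembles use trees or other models.

```python
import random
random.seed(0)

ys = [2.0, 4.0, 6.0, 8.0, 10.0]   # toy 1-D targets

def fit_mean(sample):             # weakest possible learner: a constant predictor
    return sum(sample) / len(sample)

# Bagging: train learners in parallel on bootstrap resamples, average predictions.
bags = [fit_mean(random.choices(ys, k=len(ys))) for _ in range(100)]
bagged_pred = sum(bags) / len(bags)

# Boosting: train learners sequentially, each on the residuals left by the
# ensemble so far, and add their outputs.
pred = 0.0
for _ in range(3):
    learner = fit_mean([y - pred for y in ys])  # fit what is still unexplained
    pred += learner
boosted_pred = pred

# Stacking: a meta-learner combines the base learners' outputs.
base_preds = [fit_mean(ys[:3]), fit_mean(ys[2:])]  # two "diverse" base learners
stacked_pred = fit_mean(base_preds)                # here the meta-learner is also a mean
```

The structural difference is what each pattern feeds its learners: bagging feeds resamples of the same data, boosting feeds the previous ensemble’s errors, and stacking feeds the base learners’ predictions to a second-level model.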
In notebook 09_tabular.ipynb, the command draw_tree(m, xs, size=7, leaves_parallel=True, precision=2) throws NameError: name 'draw_tree' is not defined.
Also, the command cluster_columns(xs_imp) throws NameError: name 'cluster_columns' is not defined.
Has anyone encountered these issues, or can anyone suggest a workaround? Thanks!
I was able to get notebook 09_tabular.ipynb to run in Google Colab. Here is the shareable link to the revised notebook.
That said, the commands draw_tree (tree visualization) and cluster_columns (hierarchical cluster plot) both fail with NameError. So the notebook runs, minus those two plots.
Update: thanks to @muellerzr Zachary for gently but insistently pointing out that I needed to properly install utils.py from fastbook, which made both draw_tree and cluster_columns work properly.
The notebook now executes without error.
If you haven’t already, make sure the following packages are installed for this notebook:
- treeinterpreter
- waterfallcharts
- kaggle
- dtreeviz
These can all be installed with:
pip install treeinterpreter waterfallcharts kaggle dtreeviz
I figured this out by getting a new key, saving it in my storage folder, then using the terminal to move it to ~/.kaggle, and then ensuring proper permissions with chmod 600 ~/.kaggle/kaggle.json
Yup. All installed.
Yes, with
from fastcore.utils import *
It’s in the utils module from fastbook, not fastcore
How to save a model for further training later on?
I am halfway through Lesson 7, but I have not yet found an example of how to save a partially trained model. I would like to be able to load it later to continue training.
This is what I tried:
However, I am not sure what the filename should look like when saving, or what parameters load_model expects (i.e. if I load the model in a new session, I no longer have the learner or the optimizer…)
Could somebody help me out with an example? Thanks a lot
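For what it’s worth: in fastai v2 the usual route is learn.save('some_name') to write the model and optimizer state under learn.path/models/, then re-create the Learner in the new session and call learn.load('some_name') to resume (learn.export/load_learner are meant for inference rather than continued training). The underlying idea is the generic checkpoint pattern, which can be illustrated in plain Python (a sketch only, not fastai code; the state dict below is made up):

```python
import os
import pickle
import tempfile

# Checkpoint pattern: save everything needed to resume training
# (parameters, optimizer state, progress), then load it back later.
state = {"weights": [0.1, 0.2], "opt_state": {"lr": 1e-3}, "epochs_done": 5}

path = os.path.join(tempfile.gettempdir(), "checkpoint.pkl")
with open(path, "wb") as f:
    pickle.dump(state, f)          # "save": one file, any name you like

with open(path, "rb") as f:
    resumed = pickle.load(f)       # "load": recreate the training state

# Resume from where we left off, e.g. start at epoch resumed["epochs_done"]
```

The key point for your question: the loader cannot conjure the learner or optimizer out of thin air, so either everything needed is stored in the file (the export route) or you rebuild the Learner first and only restore its state (the save/load route).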
Thanks, Zachary. So I did this:
# install the utils.py from fastbook
%cd '/content/drive/My Drive/fastbook/'
pip install utils
%cd ..
But I still get the NameErrors for those two lines.
You don’t install it, you simply import it (as it’s just a .py file you already have on the system!)
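To see why importing works where pip doesn’t: pip install fetches packages from PyPI (so pip install utils likely grabs an unrelated package of that name), while a local .py file just needs its folder to be on sys.path. A minimal self-contained illustration, using a throwaway module in place of the real utils.py:

```python
import os
import sys
import tempfile

# Create a throwaway module to stand in for a local utils.py
folder = tempfile.mkdtemp()
with open(os.path.join(folder, "my_local_utils.py"), "w") as f:
    f.write("def draw_something():\n    return 'drawn'\n")

sys.path.insert(0, folder)          # the folder holding the .py must be importable
from my_local_utils import draw_something

print(draw_something())  # -> drawn
```

In Colab, the equivalent is either %cd into the fastbook folder before importing, or sys.path.insert(0, '<path to fastbook>') so that from utils import * can find the file.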