Lesson 7 - Official topic

That is a good question, @tonibagur.
I think that the functionality of boosting can be built into neural nets.

As an example, for CNNs, you can

  • Form residuals using skip connections (sketched below)
  • Average an ensemble of weak learners by increasing the number of filter channels.
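
As a rough PyTorch sketch of the skip-connection point (the module and layer sizes here are illustrative, not from the lesson): the convolutions learn a correction that is added back onto the input, which is the same "fit the residual" idea that boosting uses.

import torch
import torch.nn as nn

class ResBlock(nn.Module):
    "The conv stack learns a residual correction to x; the skip connection adds it back."
    def __init__(self, ch):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1))
    def forward(self, x):
        return torch.relu(x + self.convs(x))  # x + residual, as in boosting's stage-wise sum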

It doesn’t; both are treated the same way.

Good question @marii. The scaling method you propose would be problematic because it gives undue weight to outliers.

For example, suppose we have a database of statistics about men, where one of the features is weight. Most men weigh between 120 and 200 pounds, but some [weigh much more](https://en.wikipedia.org/wiki/List_of_heaviest_people).

What happens if you apply this method to standardize the weights, by dividing each by the weight of the heaviest man (1,400 pounds)? The relatively small number of very heavy men would have standardized weights near 1.0, while most men’s standardized weights would be between 120/1400 and 200/1400, i.e. roughly on the interval [0.09, 0.14]. So the high end of the scale, though sparsely populated, would be too heavily weighted compared to the range that contains most of the population. Pardon the pun :rofl:
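
A quick numpy illustration of the effect, with made-up numbers:

import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(160, 20, size=10_000)         # most men: roughly 120-200 lb
weights = np.append(weights, [1000, 1200, 1400])   # a handful of extreme outliers

scaled = weights / weights.max()                   # the proposed divide-by-max scaling
print(np.percentile(scaled, [1, 99]))              # bulk of the data squeezed into ~[0.08, 0.15]
print(scaled[-3:])                                 # the three outliers span ~[0.71, 1.0]

z = (weights - weights.mean()) / weights.std()     # the usual z-score alternative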

1 Like

What a fantastic, well-organized, action-packed adventure this lecture is! The best lesson yet, IMHO. Jeremy leads a deep dive into state-of-the-art classical machine learning and deep learning techniques for collaborative filtering and learning from structured time series data sets.

Along the way, Master Chef Jeremy (and his talented fastai sous-chefs) serve up a delightful smorgasbord of techniques, tricks and insights, all the while showing us how to do things the fastai way – that is, with beautiful, crisp, clean software engineering.

Incredibly, Jeremy covers all of this material at a relaxed and deliberate pace in two hours, without making us feel that he is rushing.

If you want to get the most out of this lecture:

  • Listen to it a few times to make sure you don’t miss anything! Chew the food slowly.
  • Run the two notebooks 08_collab.ipynb and 09_tabular.ipynb in whatever environment you have set up.
  • Spend enough time to study these notebooks closely, and make it your business to understand them as well as you can.
  • Ask questions on the Forum if you need help.
  • Challenge yourself with the Questionnaire.
  • Try some of the Further Research at the end.
  • Finally, don’t feel that you have to leave this lesson behind and move on to the next thing. Keep coming back until you’ve gotten the marrow of it. This might take several weeks, but it will be worth it.
6 Likes

I’m getting this Kaggle-related error at the outset: “Missing username in configuration.” Does anybody know how to resolve it?
Thank you!


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-84bf1621b01a> in <module>
      1 #hide
      2 from utils import *
----> 3 from kaggle import api
      4 from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype
      5 from fastai2.tabular.all import *

/opt/conda/envs/fastai/lib/python3.7/site-packages/kaggle/__init__.py in <module>
     21 
     22 api = KaggleApi(ApiClient())
---> 23 api.authenticate()

/opt/conda/envs/fastai/lib/python3.7/site-packages/kaggle/api/kaggle_api_extended.py in authenticate(self)
    150 
    151         # Step 3: load into configuration!
--> 152         self._load_config(config_data)
    153 
    154     def read_config_environment(self, config_data=None, quiet=False):

/opt/conda/envs/fastai/lib/python3.7/site-packages/kaggle/api/kaggle_api_extended.py in _load_config(self, config_data)
    191         for item in [self.CONFIG_NAME_USER, self.CONFIG_NAME_KEY]:
    192             if item not in config_data:
--> 193                 raise ValueError('Error: Missing %s in configuration.' % item)
    194 
    195         configuration = Configuration()

ValueError: Error: Missing username in configuration.
1 Like

Did you do !pip install kaggle?
I am also having other issues: https://forums.fast.ai/t/kaggle-json/70088

In the tabular chapter, the cont_cat_split method is called with different max_card values for the DT/RF model and the NN model, as follows:

#DT/RF
cont,cat = cont_cat_split(df, 1, dep_var=dep_var)

#NN
cont_nn,cat_nn = cont_cat_split(df_nn_final, max_card=9000, dep_var=dep_var)

The chapter does say that categorical columns are treated differently because the neural net needs to create embeddings for them, and it indicates that embeddings of size greater than 10,000 should not be used; hence 9,000 is used as the max cardinality.

So I am having trouble understanding why whether a feature/column is continuous or categorical should be decided by a limit that is really about how large an embedding is allowed to be.

Also, a max_card of 1 for the random forest seems too low in my opinion? Wouldn’t any categorical column have more than 1 unique value?
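
For context, my reading of what the function does is roughly this sketch (a paraphrase in plain pandas/numpy, possibly inexact, not the library source):

import numpy as np

def cont_cat_split_sketch(df, max_card=20, dep_var=None):
    "Floats -> continuous; ints -> continuous only if nunique > max_card; all else categorical."
    cont, cat = [], []
    for col in df.columns:
        if col == dep_var: continue
        if np.issubdtype(df[col].dtype, np.floating) or (
           np.issubdtype(df[col].dtype, np.integer) and df[col].nunique() > max_card):
            cont.append(col)
        else:
            cat.append(col)
    return cont, cat

If that reading is right, then with max_card=1 every integer column with more than one distinct value counts as continuous, and the embedding-size limit only bites on the neural-net side.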

1 Like

Check to make sure ~/.kaggle/kaggle.json has the correct settings.
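
For reference, that file is a small JSON document with exactly two fields (placeholder values shown here); you can regenerate it from your Kaggle account page via “Create New API Token”:

{"username": "YOUR_KAGGLE_USERNAME", "key": "YOUR_KAGGLE_API_KEY"}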

@jcatanza thanks for your reply.

If I understood well, we have three kinds of ensembles:

  • Bagging: train weak learners in parallel on subsamples of the data
  • Boosting: train weak learners sequentially, each using the results of the previous learner
  • Stacking: train some weak learners and aggregate them with a meta-learner

See this great article for reference: https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205

I think that the methods you are proposing are more likely to be classified as stacking (indeed the second one, but not the first) than boosting. What do you think about that? (A small scikit-learn sketch of the three families follows below.)
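
Just to make the distinction concrete, here is a toy scikit-learn sketch of the three families (default settings on synthetic data; nothing here is from the lecture):

from sklearn.datasets import make_regression
from sklearn.ensemble import (BaggingRegressor, GradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

bagging  = BaggingRegressor(n_estimators=50, random_state=0)           # parallel trees on bootstrap subsamples
boosting = GradientBoostingRegressor(n_estimators=50, random_state=0)  # each new tree fits the previous residuals
stacking = StackingRegressor(                                          # a meta-learner aggregates base predictions
    estimators=[('tree', DecisionTreeRegressor(random_state=0)),
                ('rf', RandomForestRegressor(random_state=0))],
    final_estimator=Ridge())

for model in (bagging, boosting, stacking):
    model.fit(X, y)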
1 Like

In notebook 09_tabular.ipynb, the command
draw_tree(m, xs, size=7, leaves_parallel=True, precision=2)

throws NameError: name 'draw_tree' is not defined

Also the command
cluster_columns(xs_imp)
throws NameError: name 'cluster_columns' is not defined

Has anyone encountered these issues, or can anyone suggest a workaround? Thanks!

I was able to get notebook 09_tabular.ipynb to run in Google Colab. Here is the shareable link to the revised notebook.

That said, the commands draw_tree (tree visualization) and cluster_columns (hierarchical cluster plot) both fail with NameError. So the notebook runs, minus those two plots.

Update – Thanks to @muellerzr Zachary for gently but insistently pointing out that I needed to properly import utils.py from fastbook, which made both draw_tree and cluster_columns work properly.

The notebook now executes without error.

3 Likes

@jcatanza are you importing utils?

1 Like

If you haven’t already, make sure the following packages are installed for this notebook:

treeinterpreter
waterfallcharts
kaggle
dtreeviz

these can all be installed with:
pip install treeinterpreter waterfallcharts kaggle dtreeviz

1 Like

I figured this out by getting a new key, saving it in my storage folder, then using the terminal to move it to ~/.kaggle, and then ensuring proper permissions with chmod 600 ~/.kaggle/kaggle.json
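
In notebook form, those steps look roughly like this (the source path is illustrative):

!mkdir -p ~/.kaggle
!mv /path/to/downloaded/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json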

Yup. All installed.

Yes, with
from fastcore.utils import *

It’s in utils.py from fastbook, not fastcore :slight_smile:

How to save a model for further training later on?

I am halfway through Lesson 7, but I have not yet found an example of how to save a model that I am partway through training. I would like to be able to load it later to continue the process.

This is what I tried:

I am, however, not sure what the filename should look like when saving, or what parameters load_model expects (i.e. if I am loading the model in a new session, I no longer have the learner or the optimizer… )

Could somebody help me out with an example? Thanks a lot :hugs:
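
A minimal sketch of the usual fastai pattern, assuming a Learner named learn (the file names are placeholders; as far as I can tell, learn.save/learn.load wrap the lower-level save_model/load_model):

learn.save('halfway')            # saves weights + optimizer state under learn.path/models/halfway.pth
# in a new session: rebuild the same DataLoaders and Learner first, then
learn = learn.load('halfway')    # restores weights and optimizer state, so training can resume

# alternatively, export the whole Learner for inference later:
learn.export('export.pkl')
learn_inf = load_learner('export.pkl')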

Thanks, Zachary. So I did this:

# install the utils.py from fastbook
%cd '/content/drive/My Drive/fastbook/'
pip install utils
%cd ..

But I still get the NameErrors for those two lines :thinking:

You don’t install, simply import :slight_smile: (as it’s just a .py file you already have in the system!)
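
Something like this should be all that’s needed, assuming the working directory is the cloned fastbook folder:

%cd '/content/drive/My Drive/fastbook/'
from utils import *   # brings in draw_tree, cluster_columns, etc.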