Is there a way to get the tabular learner to take a sliding window over rows of the dataset?
Could you use that for predicting a time series, like machine failure? https://www.kaggle.com/c/machine-failure-prediction/data
You can simply use what I call a "Time Step" and pass in those previous rows as inputs. I did some work with this on movement identification with very good results. E.g., if we have a window of 3 and 8 variables, one row becomes 24 variables. You'd probably need to rearrange the table, but it does work.
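A minimal sketch of that rearrangement, assuming a plain pandas DataFrame with 8 sensor columns (the column names and window size below are illustrative, not from the post):

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(100, 8), columns=[f'sensor_{i}' for i in range(8)])
window = 3
# shift(k) pulls in the row from k steps back; concatenate the shifted copies side by side
frames = [df.shift(k).add_suffix(f'_t-{k}') for k in range(window)]
flat = pd.concat(frames, axis=1).dropna()  # drop rows that lack a full window
print(flat.shape)  # (98, 24): each row now holds the current row plus the 2 previous rows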
Another option may be to use a 1d convolutional neural network to learn the most relevant filters (i.e., sliding windows). Since CNNs often reduce to dense layers at the end, you could even concatenate activations from the time-series CNN model with activations from the tabular model of the other metadata, or fashion it as a Siamese network.
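A rough PyTorch sketch of that idea, assuming 8 input channels and a series length of 64 (layer sizes are illustrative, not a recommendation):

import torch
import torch.nn as nn

class TimeSeriesCNN(nn.Module):
    def __init__(self, n_channels=8, n_out=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, 16, kernel_size=3, padding=1),  # learned "sliding windows"
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse the time dimension
        )
        self.head = nn.Linear(16, n_out)

    def forward(self, x):  # x: (batch, channels, time)
        feats = self.conv(x).squeeze(-1)  # (batch, 16); these activations could be
        return self.head(feats)           # concatenated with a tabular model's activations

model = TimeSeriesCNN()
print(model(torch.randn(4, 8, 64)).shape)  # torch.Size([4, 2])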
I have to say that having seen all 3 previous iterations of this class, the addition of decision trees and random forests is an awesome development.
Possible Typo in 09_tabular notebook:
I believe the cell near the end that contains:
xs_filt2 = xs_filt.drop('fiModelDescriptor', axis=1)
valid_xs_time2 = valid_xs_time.drop('fiModelDescriptor', axis=1)
m2 = rf(xs_filt2, y_filt)
m_rmse(m, xs_filt2, y_filt), m_rmse(m2, valid_xs_time2, valid_y)
contains a typo in the last line, because xs_filt2 is not the same set of features that was used to create m. I believe that line should be:
m_rmse(m, xs_filt, y_filt), m_rmse(m2, valid_xs_time2, valid_y)
There are two types of categorical variables:
- Ordered, where the categories are implicitly numerically ordered. Example: Jack, Queen, King, Ace
- Unordered, where numerical order is immaterial. Example: Spades, Clubs, Hearts, Diamonds
How does fastai distinguish between these types?
I found a partial answer to my question in the 09_tabular.ipynb notebook, where Jeremy shows that fastai does handle ordered categoricals differently than unordered ones, by means of the ordered=True input to the .cat.set_categories() method:
df['ProductSize'] = df['ProductSize'].astype('category')
df['ProductSize'].cat.set_categories(sizes, ordered=True, inplace=True)
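For a self-contained illustration using the card ranks from my question (note that sizes in the snippet above is defined earlier in the notebook; newer pandas versions also drop the inplace argument, so assignment is the safer form):

import pandas as pd

s = pd.Series(['Queen', 'Ace', 'Jack', 'King']).astype('category')
s = s.cat.set_categories(['Jack', 'Queen', 'King', 'Ace'], ordered=True)
print(s.cat.codes.tolist())      # [1, 3, 0, 2]: integer codes follow the declared order
print(s.sort_values().tolist())  # ['Jack', 'Queen', 'King', 'Ace']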
I haven't yet looked into the fastai2 library to understand the details of how the two types of categoricals are treated.
That is a good question, @tonibagur.
I think that the functionality of boosting can be built into neural nets.
As an example, for CNNs, you can:
- Form residuals using skip connections
- Average an ensemble of weak learners by increasing the number of filter channels.
It doesn't; both are treated the same way.
Good question @marii. The scaling method you propose would be problematic because it gives undue weight to outliers.
For example, suppose we have a database of statistics about men, where one of the features is weight. Most men weigh between 120 and 200 pounds, but some [weigh much more](https://en.wikipedia.org/wiki/List_of_heaviest_people).
What happens if you apply this method to standardize the weights, by dividing each by the weight of the heaviest man (1400 pounds)? The relatively small number of very heavy men would have standardized weights near 1.0, while most men's standardized weights would be between 120/1400 and 200/1400, i.e., roughly on the interval [1/14, 2/14]. So the high end of the scale, though sparsely populated, would be too heavily weighted, compared to the range which contains most of the population. Pardon the pun.
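A quick numeric check of this argument (the sample weights below are made up for illustration):

import numpy as np

weights = np.array([120, 150, 160, 175, 200, 1400])  # pounds; illustrative sample
scaled = weights / weights.max()
print(scaled.round(3))  # [0.086 0.107 0.114 0.125 0.143 1.   ]
# Typical men are squeezed into roughly [0.07, 0.14]; the single outlier sits at 1.0.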
What a fantastic, well-organized, action-packed adventure this lecture is! The best lesson yet, IMHO. Jeremy leads a deep dive into state-of-the-art classical machine learning and deep learning techniques for collaborative filtering and learning from structured time series data sets.
Along the way, Master Chef Jeremy (and his talented fastai sous-chefs) serve up a delightful smorgasbord of techniques, tricks and insights, all the while showing us how to do things the fastai way: that is, with beautiful, crisp, clean software engineering.
Incredibly, Jeremy covers all of this material at a relaxed and deliberate pace in two hours, without making us feel that he is rushing.
If you want to get the most out of this lecture:
- Listen to it a few times to make sure you don't miss anything! Chew the food slowly.
- Run the two notebooks 08_collab.ipynb and 09_tabular.ipynb in whatever environment you have set up
- Spend enough time to study these notebooks closely and make it your business to understand them as well as you can.
- Ask questions on the Forum, if you need help.
- Challenge yourself with the Questionnaire, and
- Try some of the Further Research at the end
- Finally, don't feel that you have to leave this lesson behind and move on to the next thing. Keep coming back until you've gotten the marrow of it. This might take several weeks, but it will be worth it.
I'm getting this Kaggle-related error at the outset: "Missing username in configuration." Does anybody know how to resolve it?
Thank you!
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-1-84bf1621b01a> in <module>
1 #hide
2 from utils import *
----> 3 from kaggle import api
4 from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype
5 from fastai2.tabular.all import *
/opt/conda/envs/fastai/lib/python3.7/site-packages/kaggle/__init__.py in <module>
21
22 api = KaggleApi(ApiClient())
---> 23 api.authenticate()
/opt/conda/envs/fastai/lib/python3.7/site-packages/kaggle/api/kaggle_api_extended.py in authenticate(self)
150
151 # Step 3: load into configuration!
--> 152 self._load_config(config_data)
153
154 def read_config_environment(self, config_data=None, quiet=False):
/opt/conda/envs/fastai/lib/python3.7/site-packages/kaggle/api/kaggle_api_extended.py in _load_config(self, config_data)
191 for item in [self.CONFIG_NAME_USER, self.CONFIG_NAME_KEY]:
192 if item not in config_data:
--> 193 raise ValueError('Error: Missing %s in configuration.' % item)
194
195 configuration = Configuration()
ValueError: Error: Missing username in configuration.
Did you do !pip install kaggle?
I am also having other issues: https://forums.fast.ai/t/kaggle-json/70088
In the tabular chapter, the cont_cat_split function is called with different max_card parameters for the DT/RF model and the NN model, as follows:
#DT/RF
cont,cat = cont_cat_split(df, 1, dep_var=dep_var)
#NN
cont_nn,cat_nn = cont_cat_split(df_nn_final, max_card=9000, dep_var=dep_var)
The chapter does say that categorical columns are treated differently for the NN because it needs to create embeddings, and indicates that embeddings of size greater than 10,000 should not be used, hence the max cardinality of 9,000.
So I am having trouble understanding how a feature/column is decided to be continuous or categorical based on a limit that is really about embedding size.
Also, a max_card of 1 for the random forest seems too low in my opinion. Wouldn't any categorical column have more than 1 unique value?
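For reference, here is a simplified sketch of how cont_cat_split appears to decide, paraphrased from my reading of the library source (check your installed version for the exact logic). With max_card=1, every numeric column is treated as continuous, so only non-numeric columns remain categorical for the random forest:

import numpy as np
import pandas as pd

def cont_cat_split_sketch(df, max_card=20, dep_var=None):
    cont, cat = [], []
    for col in df.columns:
        if col == dep_var:
            continue
        if np.issubdtype(df[col].dtype, np.floating) or \
           (np.issubdtype(df[col].dtype, np.integer) and df[col].nunique() > max_card):
            cont.append(col)  # floats, and integers with many distinct values
        else:
            cat.append(col)   # strings, and low-cardinality integers
    return cont, cat

df = pd.DataFrame({'price': [9.5, 12.0, 7.25], 'doors': [2, 4, 4], 'color': ['red', 'blue', 'red']})
print(cont_cat_split_sketch(df, max_card=1))   # (['price', 'doors'], ['color'])
print(cont_cat_split_sketch(df, max_card=20))  # (['price'], ['doors', 'color'])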
Check to make sure ~/.kaggle/kaggle.json has the correct settings.
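For reference, kaggle.json should be a single JSON object with your credentials (the values below are placeholders):

{"username": "YOUR_KAGGLE_USERNAME", "key": "YOUR_API_KEY"}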
@jcatanza thanks for your reply.
If I understood well we have three kinds of ensembles:
- Bagging: train weak learners in parallel on subsamples of the data
- Boosting: sequentially train weak learners, each using the result of the previous learner
- Stacking: train some weak learners and aggregate them with a meta-learner.
See this great article for reference: https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205
I think that the methods you are proposing are more likely to be classified as stacking (indeed the second one, but not the first) than boosting. What do you think about that?
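A quick scikit-learn sketch contrasting the three (the base estimators are arbitrary choices for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

bagging = BaggingClassifier(n_estimators=10).fit(X, y)            # parallel, on bootstrap samples
boosting = GradientBoostingClassifier(n_estimators=10).fit(X, y)  # sequential, each tree fits the previous trees' residuals
stacking = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=10))],
    final_estimator=LogisticRegression(),                         # the meta-learner
).fit(X, y)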
In notebook 09_tabular.ipynb, the command
draw_tree(m, xs, size=7, leaves_parallel=True, precision=2)
throws NameError: name 'draw_tree' is not defined
Also the command
cluster_columns(xs_imp)
throws NameError: name 'cluster_columns' is not defined
Has anyone encountered these issues, or can anyone suggest a workaround? Thanks!
I was able to get notebook 09_tabular.ipynb to run in Google Colab. Here is the shareable link to the revised notebook.
That said, the commands draw_tree (tree visualization) and cluster_columns (hierarchical cluster plot) both fail with NameError. So the notebook runs, minus those two plots.
Update: thanks to @muellerzr Zachary for gently but insistently pointing out that I needed to properly install utils.py from fastbook, which made both draw_tree and cluster_columns work properly. The notebook now executes without error.
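For anyone hitting the same NameError, this is roughly the Colab setup that worked for me (repo URL is the standard fastbook repository; adjust paths to your environment):

!pip install -q fastbook
!git clone https://github.com/fastai/fastbook.git
%cd fastbook
from utils import *  # the course's utils.py defines draw_tree and cluster_columns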
@jcatanza are you importing utils?
If you haven't already, make sure the following packages are installed for this notebook:
treeinterpreter
waterfallcharts
kaggle
dtreeviz
These can all be installed with a single command:
pip install treeinterpreter waterfallcharts kaggle dtreeviz
I figured this out by getting a new key, saving it in my storage folder, then using the terminal to move it into ~/.kaggle, and then ensuring proper permissions with chmod 600 ~/.kaggle/kaggle.json