Is there a way to get the tabular learner to take a sliding window over rows of the dataset?
Could you use that for predicting a time series, like machine failure? https://www.kaggle.com/c/machine-failure-prediction/data
You can simply use what I call a "Time Step" and pass in those previous rows as inputs. I did some work with this on movement identification with very good results. E.g., if we have a window of 3 and 8 variables, one row becomes 24 variables. You'd probably need to rearrange the table, but it does work.
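A minimal sketch of that rearrangement, assuming a plain pandas DataFrame with 8 sensor columns (the column names and window size below are illustrative, not from the post):

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(100, 8), columns=[f'sensor_{i}' for i in range(8)])
window = 3
# shift(k) pulls in the row from k steps back; concatenate the shifted copies side by side
frames = [df.shift(k).add_suffix(f'_t-{k}') for k in range(window)]
flat = pd.concat(frames, axis=1).dropna()  # drop rows that lack a full window
print(flat.shape)  # (98, 24): each row now holds the current row plus the 2 previous rows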
Another option may be to use a 1d convolutional neural network to learn the most relevant filters (i.e., sliding windows). Since CNNs often reduce to dense layers at the end, you could even concatenate activations from the time-series CNN model with activations from the tabular model of the other metadata, or fashion it as a Siamese network.
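A rough PyTorch sketch of that idea, assuming 8 input channels and a series length of 64 (layer sizes are illustrative, not a recommendation):

import torch
import torch.nn as nn

class TimeSeriesCNN(nn.Module):
    def __init__(self, n_channels=8, n_out=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, 16, kernel_size=3, padding=1),  # learned "sliding windows"
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse the time dimension
        )
        self.head = nn.Linear(16, n_out)

    def forward(self, x):  # x: (batch, channels, time)
        feats = self.conv(x).squeeze(-1)  # (batch, 16); these activations could be
        return self.head(feats)           # concatenated with a tabular model's activations

model = TimeSeriesCNN()
print(model(torch.randn(4, 8, 64)).shape)  # torch.Size([4, 2])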
I have to say that having seen all 3 previous iterations of this class, the addition of decision trees and random forests is an awesome development.
Possible Typo in 09_tabular notebook:
I believe the cell near the end that contains:
xs_filt2 = xs_filt.drop('fiModelDescriptor', axis=1)
valid_xs_time2 = valid_xs_time.drop('fiModelDescriptor', axis=1)
m2 = rf(xs_filt2, y_filt)
m_rmse(m, xs_filt2, y_filt), m_rmse(m2, valid_xs_time2, valid_y)
contains a typo in the last line, because xs_filt2 is not the same set of features that was used to create m. I believe that line should be:
m_rmse(m, xs_filt, y_filt), m_rmse(m2, valid_xs_time2, valid_y)
There are two types of categorical variables:
- Ordered, where the categories are implicitly numerically ordered. Example: Jack, Queen, King, Ace
- Unordered, where numerical order is immaterial. Example: Spades, Clubs, Hearts, Diamonds
How does fastai distinguish between these types?
I found a partial answer to my question in the 09_tabular.ipynb notebook, where Jeremy shows that fastai does handle ordered categoricals differently than unordered ones, by means of the ordered=True input to the .cat.set_categories() method:
df['ProductSize'] = df['ProductSize'].astype('category')
df['ProductSize'].cat.set_categories(sizes, ordered=True, inplace=True)
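For a self-contained illustration using the card ranks from my question (note that sizes in the snippet above is defined earlier in the notebook; newer pandas versions also drop the inplace argument, so assignment is the safer form):

import pandas as pd

s = pd.Series(['Queen', 'Ace', 'Jack', 'King']).astype('category')
s = s.cat.set_categories(['Jack', 'Queen', 'King', 'Ace'], ordered=True)
print(s.cat.codes.tolist())      # [1, 3, 0, 2]: integer codes follow the declared order
print(s.sort_values().tolist())  # ['Jack', 'Queen', 'King', 'Ace']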
I haven't yet looked into the fastai2 library to understand the details of how the two types of categoricals are treated.
That is a good question, @tonibagur.
I think that the functionality of boosting can be built into neural nets.
As an example, for CNNs, you can:
- Form residuals using skip connections
- Average an ensemble of weak learners by increasing the number of filter channels.
It doesn't; both are treated the same way.
Good question @marii. The scaling method you propose would be problematic because it gives undue weight to outliers.
For example, suppose we have a database of statistics about men, where one of the features is weight. Most men weigh between 120 and 200 pounds, but some [weigh much more](https://en.wikipedia.org/wiki/List_of_heaviest_people).
What happens if you apply this method to standardize the weights, by dividing each by the weight of the heaviest man (1400 pounds)? The relatively small number of very heavy men would have standardized weights near 1.0, while most men's standardized weights would be between 120/1400 and 200/1400, i.e., roughly on the interval [1/14, 2/14]. So the high end of the scale, though sparsely populated, would be too heavily weighted, compared to the range which contains most of the population. Pardon the pun.
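A quick numeric check of this argument (the sample weights below are made up for illustration):

import numpy as np

weights = np.array([120, 150, 160, 175, 200, 1400])  # pounds; illustrative sample
scaled = weights / weights.max()
print(scaled.round(3))  # [0.086 0.107 0.114 0.125 0.143 1.   ]
# Typical men are squeezed into roughly [0.07, 0.14]; the single outlier sits at 1.0.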
What a fantastic, well-organized, action-packed adventure this lecture is! The best lesson yet, IMHO. Jeremy leads a deep dive into state-of-the-art classical machine learning and deep learning techniques for collaborative filtering and learning from structured time series data sets.
Along the way, Master Chef Jeremy (and his talented fastai sous-chefs) serve up a delightful smorgasbord of techniques, tricks and insights, all the while showing us how to do things the fastai way: that is, with beautiful, crisp, clean software engineering.
Incredibly, Jeremy covers all of this material at a relaxed and deliberate pace in two hours, without making us feel that he is rushing.
If you want to get the most out of this lecture:
- Listen to it a few times to make sure you don't miss anything! Chew the food slowly.
- Run the two notebooks 08_collab.ipynb and 09_tabular.ipynb in whatever environment you have set up
- Spend enough time to study these notebooks closely and make it your business to understand them as well as you can.
- Ask questions on the Forum, if you need help.
- Challenge yourself with the Questionnaire, and
- Try some of the Further Research at the end
- Finally, don't feel that you have to leave this lesson behind and move on to the next thing. Keep coming back until you've gotten the marrow of it. This might take several weeks, but it will be worth it.
I'm getting this Kaggle-related error at the outset: "Missing username in configuration." Does anybody know how to resolve it?
Thank you!
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-1-84bf1621b01a> in <module>
1 #hide
2 from utils import *
----> 3 from kaggle import api
4 from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype
5 from fastai2.tabular.all import *
/opt/conda/envs/fastai/lib/python3.7/site-packages/kaggle/__init__.py in <module>
21
22 api = KaggleApi(ApiClient())
---> 23 api.authenticate()
/opt/conda/envs/fastai/lib/python3.7/site-packages/kaggle/api/kaggle_api_extended.py in authenticate(self)
150
151 # Step 3: load into configuration!
--> 152 self._load_config(config_data)
153
154 def read_config_environment(self, config_data=None, quiet=False):
/opt/conda/envs/fastai/lib/python3.7/site-packages/kaggle/api/kaggle_api_extended.py in _load_config(self, config_data)
191 for item in [self.CONFIG_NAME_USER, self.CONFIG_NAME_KEY]:
192 if item not in config_data:
--> 193 raise ValueError('Error: Missing %s in configuration.' % item)
194
195 configuration = Configuration()
ValueError: Error: Missing username in configuration.
Did you do !pip install kaggle?
I am also having other issues: https://forums.fast.ai/t/kaggle-json/70088
In the tabular chapter, the cont_cat_split function is called with different max_card parameters for the DT/RF model and the NN model, as follows:
#DT/RF
cont,cat = cont_cat_split(df, 1, dep_var=dep_var)
#NN
cont_nn,cat_nn = cont_cat_split(df_nn_final, max_card=9000, dep_var=dep_var)
The chapter does say that categorical columns are treated differently for the NN because it needs to create embeddings, and indicates that embeddings of size greater than 10,000 should not be used, hence the max cardinality of 9,000.
So I am having trouble understanding how a feature/column is decided to be continuous or categorical based on a limit that is really about embedding size.
Also, a max_card of 1 for the random forest seems too low in my opinion. Wouldn't any categorical column have more than 1 unique value?
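For reference, here is a simplified sketch of how cont_cat_split appears to decide, paraphrased from my reading of the library source (check your installed version for the exact logic). With max_card=1, every numeric column is treated as continuous, so only non-numeric columns remain categorical for the random forest:

import numpy as np
import pandas as pd

def cont_cat_split_sketch(df, max_card=20, dep_var=None):
    cont, cat = [], []
    for col in df.columns:
        if col == dep_var:
            continue
        if np.issubdtype(df[col].dtype, np.floating) or \
           (np.issubdtype(df[col].dtype, np.integer) and df[col].nunique() > max_card):
            cont.append(col)  # floats, and integers with many distinct values
        else:
            cat.append(col)   # strings, and low-cardinality integers
    return cont, cat

df = pd.DataFrame({'price': [9.5, 12.0, 7.25], 'doors': [2, 4, 4], 'color': ['red', 'blue', 'red']})
print(cont_cat_split_sketch(df, max_card=1))   # (['price', 'doors'], ['color'])
print(cont_cat_split_sketch(df, max_card=20))  # (['price'], ['doors', 'color'])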
Check to make sure ~/.kaggle/kaggle.json has the correct settings.
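For reference, kaggle.json should be a single JSON object with your credentials (the values below are placeholders):

{"username": "YOUR_KAGGLE_USERNAME", "key": "YOUR_API_KEY"}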
@jcatanza thanks for your reply.
If I understood well we have three kinds of ensembles:
- Bagging: train weak learners in parallel on subsamples of the data
- Boosting: sequentially train weak learners, each using the result of the previous learner
- Stacking: train some weak learners and aggregate them with a meta-learner.
See this great article for reference: https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205
I think that the methods you are proposing are more likely to be classified as stacking (indeed the second one, but not the first) than boosting. What do you think about that?
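A quick scikit-learn sketch contrasting the three (the base estimators are arbitrary choices for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

bagging = BaggingClassifier(n_estimators=10).fit(X, y)            # parallel, on bootstrap samples
boosting = GradientBoostingClassifier(n_estimators=10).fit(X, y)  # sequential, each tree fits the previous trees' residuals
stacking = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=10))],
    final_estimator=LogisticRegression(),                         # the meta-learner
).fit(X, y)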
In notebook 09_tabular.ipynb, the command
draw_tree(m, xs, size=7, leaves_parallel=True, precision=2)
throws NameError: name 'draw_tree' is not defined
Also the command
cluster_columns(xs_imp)
throws NameError: name 'cluster_columns' is not defined
Has anyone encountered these issues, or can anyone suggest a workaround? Thanks!
I was able to get notebook 09_tabular.ipynb to run in Google Colab. Here is the shareable link to the revised notebook.
That said, the commands draw_tree (tree visualization) and cluster_columns (hierarchical cluster plot) both fail with NameError. So the notebook runs, minus those two plots.
Update: thanks to @muellerzr Zachary for gently but insistently pointing out that I needed to properly install utils.py from fastbook, which made both draw_tree and cluster_columns work properly. The notebook now executes without error.
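For anyone hitting the same NameError, this is roughly the Colab setup that worked for me (repo URL is the standard fastbook repository; adjust paths to your environment):

!pip install -q fastbook
!git clone https://github.com/fastai/fastbook.git
%cd fastbook
from utils import *  # the course's utils.py defines draw_tree and cluster_columns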
@jcatanza are you importing utils?
If you haven't already, make sure the following packages are installed for this notebook:
treeinterpreter
waterfallcharts
kaggle
dtreeviz
These can all be installed with a single command:
pip install treeinterpreter waterfallcharts kaggle dtreeviz
I figured this out by getting a new key, saving it in my storage folder, then using the terminal to move it into ~/.kaggle, and then ensuring proper permissions with chmod 600 ~/.kaggle/kaggle.json