Good question @marii. The scaling method you propose would be problematic because it gives undue weight to outliers.
For example, suppose we have a database of statistics about men, where one of the features is weight. Most men weigh between 120 and 200 pounds, but some [weigh much more](https://en.wikipedia.org/wiki/List_of_heaviest_people).
What happens if you apply this method to standardize the weights, dividing each by the weight of the heaviest man (1,400 pounds)? The relatively small number of very heavy men would have standardized weights near 1.0, while most men's standardized weights would fall between roughly 100/1400 and 200/1400, i.e. on the interval [1/14, 2/14]. So the high end of the scale, though sparsely populated, would be too heavily weighted compared to the range that contains most of the population. Pardon the pun.
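To make the distortion concrete, here is a tiny sketch with invented numbers (the weights below are made up for illustration):

```python
# Made-up weights illustrating the point above: dividing by the maximum
# squeezes the bulk of the population into a narrow band near zero.
weights = [120, 150, 160, 180, 200, 1400]  # typical men plus one extreme outlier
scaled = [w / max(weights) for w in weights]
print(scaled)  # the outlier maps to 1.0; everyone else is crowded below ~0.14
```

Standardizing with the mean and standard deviation (or using robust statistics like the median and IQR) keeps one extreme point from dictating the whole scale.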
What a fantastic, well-organized, action-packed adventure this lecture is! The best lesson yet, IMHO. Jeremy leads a deep dive into state-of-the-art classical machine learning and deep learning techniques for collaborative filtering and learning from structured time series data sets.
Along the way, Master Chef Jeremy (and his talented fastai sous-chefs) serve up a delightful smorgasbord of techniques, tricks and insights, all the while showing us how to do things the fastai way – that is, with beautiful, crisp, clean software engineering.
Incredibly, Jeremy covers all of this material at a relaxed and deliberate pace in two hours, without making us feel that he is rushing.
If you want to get the most out of this lecture:
Listen to it a few times to make sure you don’t miss anything! Chew the food slowly.
Run the two notebooks 08_collab.ipynb and 09_tabular.ipynb in whatever environment you have set up.
Spend enough time to study these notebooks closely, and make it your business to understand them as well as you can.
Ask questions on the Forum if you need help.
Challenge yourself with the Questionnaire, and
Try some of the Further Research at the end.
Finally, don’t feel that you have to leave this lesson behind and move on to the next thing. Keep coming back until you’ve gotten the marrow of it. This might take several weeks, but it will be worth it.
The chapter does say that categorical columns are treated differently because embeddings must be created for them, and it indicates that embeddings of size greater than 10k should not be used, hence 9k is used as the max cardinality.
So I am having trouble understanding how a feature/column gets decided to be continuous or categorical based on a limit that is really about embedding size.
Also, a max_card of 1 for the random forest seems too low in my opinion. Wouldn't any categorical column have more than 1 unique value?
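My understanding (please correct me if this is off) is that max_card is just a heuristic threshold on the number of unique values, so with max_card=1 every numeric column lands in the continuous list, which is what you want for a random forest since trees have no use for embeddings. A plain-pandas sketch of the rule as I understand it (fastai's real cont_cat_split also excludes the dependent variable):

```python
import pandas as pd

# Toy data (made up): "sex" is non-numeric, "doors" is numeric with 2 unique
# values, "age" is numeric with many unique values.
df = pd.DataFrame({"sex": ["m", "f", "m", "f"],
                   "doors": [2, 2, 4, 4],
                   "age": [23, 45, 31, 52]})

def split(df, max_card):
    # A numeric column with more than max_card unique values is treated as
    # continuous; everything else (low-cardinality or non-numeric) as categorical.
    cont, cat = [], []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]) and df[col].nunique() > max_card:
            cont.append(col)
        else:
            cat.append(col)
    return cont, cat

print(split(df, max_card=2))  # "doors" falls below the threshold -> categorical
print(split(df, max_card=1))  # any numeric column with >1 value -> continuous
```

So the limit isn't saying a categorical column has 1 unique value; it is saying "don't bother treating any numeric column as categorical" for the tree models.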
If I understood correctly, we have three kinds of ensembles:
Bagging: train weak learners in parallel on subsamples of the data
Boosting: train weak learners sequentially, each using the results of the previous learner
Stacking: train some weak learners and aggregate them with a meta-learner.
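A minimal scikit-learn sketch of the three flavors (toy data and default hyperparameters, just to show the shape of each API):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Bagging: many trees trained independently on bootstrap samples, then averaged.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                        random_state=0).fit(X, y)

# Boosting: trees trained sequentially, each one correcting the previous errors.
boost = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X, y)

# Stacking: base learners' predictions become input features for a meta-learner.
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("lr", LogisticRegression())],
    final_estimator=LogisticRegression(),
).fit(X, y)
```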
See this great article for reference: https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205
I think the methods you are proposing are more likely to be classified as stacking (indeed the second one, but not the first) than boosting. What do you think?
I was able to get notebook 09_tabular.ipynb to run in Google Colab. Here is the shareable link to the revised notebook.
That said, the commands draw_tree (tree visualization) and cluster_columns (hierarchical cluster plot) both fail with NameError. So the notebook runs, minus those two plots.
Update: thanks to @muellerzr Zachary for gently but insistently pointing out that I needed to properly install `utils.py` from `fastbook`, which made both `draw_tree` and `cluster_columns` work properly.
I figured this out by getting a new key, saving it in my storage folder, then using the terminal to move it to ~/.kaggle, and then ensuring proper permissions with chmod 600 ~/.kaggle/kaggle.json.
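For anyone else doing the same, the terminal steps look roughly like this (the source path is an assumption, and the `printf` line is only a stand-in for the key you download from Kaggle):

```shell
# Stand-in for the downloaded API key -- replace with your real kaggle.json path.
SRC="$HOME/storage/kaggle.json"
mkdir -p "$(dirname "$SRC")"
printf '{"username":"demo","key":"demo"}' > "$SRC"

# Move the key into place and lock down its permissions;
# the Kaggle CLI refuses keys that are readable by other users.
mkdir -p "$HOME/.kaggle"
mv "$SRC" "$HOME/.kaggle/kaggle.json"
chmod 600 "$HOME/.kaggle/kaggle.json"
```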
How to save a model for further training later on?
I am halfway through Lesson 7, but I have not yet found an example of how to save a model partway through training. I would like to be able to load it later and continue the process.
I am, however, not sure what the filename should look like when saving, or what parameters load_model expects (i.e. if I am loading the model in a new session, I no longer have the learner or the optimizer…).
Could somebody help me out with an example? Thanks a lot
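Not an authoritative answer, but my understanding is that fastai's `learn.save('name')` writes the model and (by default) the optimizer state to `learn.path/models/name.pth`, and in a new session you recreate the Learner the same way and call `learn.load('name')` before resuming training. Underneath it is plain PyTorch state dicts, which you can also use directly; a minimal sketch with a toy model:

```python
import torch
from torch import nn

model = nn.Linear(4, 2)
opt = torch.optim.Adam(model.parameters())

# Save both model and optimizer state so training can resume where it left off.
torch.save({"model": model.state_dict(), "opt": opt.state_dict()}, "checkpoint.pth")

# Later, in a new session: rebuild the same objects, then load the saved states.
model2 = nn.Linear(4, 2)
opt2 = torch.optim.Adam(model2.parameters())
state = torch.load("checkpoint.pth")
model2.load_state_dict(state["model"])
opt2.load_state_dict(state["opt"])
```

If you only need the model for inference (no further training), `learn.export()` / `load_learner()` is the usual fastai route instead, since it also bundles the preprocessing.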