Lesson 6 official topic

Here are the tiers of the API as I see them:

10 Likes

No. This violates the validation principle. Your trees are trained on subsets of data drawn from a distribution. On a particular subset a tree may look worse, yet it may still be a good tree for test data sourced from the same distribution.
If we selectively delete the apparently worse trees, that can even lead to overfitting.

1 Like

Interesting question. Would like to know the answer

A rough but detailed set of notes, framed as questions

00:00 Review: lecture 5 and OneR from scratch

02:09 How to build a TwoR model manually (towards a decision tree)?

04:43 How to create a decision tree with 4 leaves and draw the graph? How to interpret the graph? Why are decision trees so loved for exploratory analysis?

What does Jeremy think of sklearn?

07:02 What is gini, as a measure of how good a split is? What does its source code look like? How to think of gini in terms of the probability of grabbing items of the same class multiple times in a row?
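(For reference, a minimal sketch of the gini measure roughly as it appears in the lesson notebook, assuming df is the Titanic training dataframe and dep is the dependent column name, e.g. 'Survived':)

def gini(cond):
    act = df.loc[cond, dep]
    # act.mean() is the proportion of 1s, so this is 1 minus the probability
    # of grabbing two items of the same class in a row
    return 1 - act.mean()**2 - (1-act).mean()**2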

08:27 Why is the mean_absolute_error of the decision tree actually worse than the OneR version's?

09:31 How to build a decision tree with 50 leaves? How well does it work?

10:54 How to make predictions and prepare a CSV to upload to the Kaggle leaderboard? We should start with a baseline model like this and improve it every day.

12:19 How do decision trees free you from dummy variables, taking the log of the fare, and worrying about outliers and long-tailed distributions? Jeremy always uses decision trees to create a baseline model, which is hard to mess up. How do decision trees handle categories like Embarked, by treating the strings as sortable numbers? How many levels deep does Jeremy usually grow a decision tree?

15:52 How to make the model more accurate? What's the problem if we want to grow the tree further? Leo Breiman's bagging comes to the rescue. How does Jeremy explain the bagging approach, and why is it so mind-blowing? (18:47)

19:06 How to build many unbiased and uncorrelated models (decision trees) for bagging? This approach to building all these models is called a random forest; see the sketch after the next entry.

20:09 How to create a random forest with 100 decision trees? How to make the trees random? How to average the predictions of all the trees and submit to Kaggle?
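(A minimal from-scratch sketch of these two entries, assuming trn_xs/trn_y and val_xs are the Titanic training and validation data; the get_tree name and 75% sample follow the lesson notebook:)

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def get_tree(prop=0.75):
    # each tree sees a different random 75% sample, which keeps the trees uncorrelated
    idxs = np.random.choice(len(trn_y), int(len(trn_y)*prop))
    return DecisionTreeClassifier(min_samples_leaf=5).fit(trn_xs.iloc[idxs], trn_y.iloc[idxs])

trees = [get_tree() for _ in range(100)]
# averaging the per-tree predictions gives the ensemble's prediction
avg_preds = np.stack([t.predict(val_xs) for t in trees]).mean(0)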

22:38 What does feature importance (a favorite of Jeremy's) do? Does it care about distributions, or numeric vs categorical? How does Jeremy use it? Plus an amazing story of Jeremy using feature importance on a credit default problem.
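(A sketch of how feature importances are usually read off a fitted sklearn forest; trn_xs/trn_y are the same assumed training data as above:)

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(100).fit(trn_xs, trn_y)
# one importance score per column, accumulated over every split in every tree
fi = pd.DataFrame(dict(cols=trn_xs.columns, imp=rf.feature_importances_))
fi.sort_values('imp', ascending=False)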

26:37 Does increasing the number of trees always increase accuracy? Yes, with tiny bumps. Do the returns diminish? Yes. The more trees you have, the longer the inference time, though you can speed that up with good code. Does Jeremy often use more than 100 trees? See chapter 8 of fastbook.

29:32 What is OOB (out-of-bag) error? How does bagging get away without a validation set, given each tree only uses ~75% of the dataset for training? Does sklearn make using OOB easy?
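(In sklearn it's a single flag; a sketch with the same assumed data:)

from sklearn.ensemble import RandomForestClassifier

# each row is scored only by the trees that did NOT see it during training,
# giving a validation-like measure without holding out a validation set
rf = RandomForestClassifier(100, oob_score=True).fit(trn_xs, trn_y)
rf.oob_score_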

30:37 What is bagging compared with a random forest? Is a random forest just bagging (a meta-model) applied to lots of decision trees on tabular data? Can we bag not just decision trees (into a random forest) but also lots of neural nets? Will we (fastai people) do it, given most people don't?

32:08 What insights or model interpretations can a random forest give us?

34:21 How does bagging help us find out how confident we are about the prediction for a row of tabular data?
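(One common sketch: the spread of the per-tree predictions shows how much the trees disagree; rf is the assumed fitted forest from the sketches above:)

import numpy as np

preds = np.stack([t.predict(val_xs) for t in rf.estimators_])
# a high standard deviation means the trees disagree, i.e. low confidence for that row
preds_std = preds.std(0)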

35:13 After you find the important features, what should you do with the less important columns of the tabular dataset?

35:47 Check out the book section on removing the redundant features

35:59 What does partial dependence do? How is each column/feature related to the dependent variable? Is this particular to random forests? Why is calculating partial dependence not as easy as it sounds? How does partial dependence work behind the scenes? Can we compute partial dependence for more than one feature at a time?
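(A sketch using sklearn's built-in helper, available in recent sklearn versions; the 'Age' column is just an illustrative assumption:)

from sklearn.inspection import PartialDependenceDisplay

# sweep one column across its range while leaving all the other values of each
# row unchanged, and average the model's predictions at each swept value
PartialDependenceDisplay.from_estimator(rf, val_xs, ['Age'])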

39:22 Can you explain why a particular prediction was made? Can treeinterpreter give us the per-feature contributions (along the path from root to leaf) behind the model's prediction for one row of data?
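(A sketch of the treeinterpreter call from the book; note the sklearn compatibility issue discussed further down this thread:)

from treeinterpreter import treeinterpreter

row = val_xs.iloc[:1]
# prediction = bias (the dataset-wide baseline) + the sum of per-feature contributions
prediction, bias, contributions = treeinterpreter.predict(rf, row.values)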

41:56 Would you delete a tree which does not perform well? No, doing so would bias your bagging.

42:48 Will bagging bags do better than one bag? No, an average of averages is still just an average.

43:48 What does Jeremy think of random forest feature importance vs other model explainability techniques? When should you use feature importance and the other explainability techniques?

46:07 The tabular section is in chapter 9. Can you overfit a random forest? No, more trees make it more accurate; but too few trees, each grown very deep, can make your random forest overfit.

47:06 Can you confuse a random forest by adding lots of noise columns/features?

48:26 What don't you need to worry about with random forests? Things like interactions (as in logistic regression) and normalization.

49:03 What is gradient boosting? How does boosting work? Are bagging and boosting both meta-models which can be applied to decision trees? Random forests vs gradient-boosted trees. Can gradient boosting overfit, given it is more accurate? What's Jeremy's take on random forests vs gradient boosting?
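(A minimal sketch of the boosting idea, for contrast with bagging; the depth-3 trees and the 0.1 learning rate are arbitrary assumptions:)

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# boosting: each small tree is fit to the RESIDUALS of the ensemble so far,
# and the predictions are summed rather than averaged as in bagging
preds = np.zeros(len(trn_y))
trees = []
for _ in range(100):
    resid = trn_y - preds
    tree = DecisionTreeRegressor(max_depth=3).fit(trn_xs, resid)
    trees.append(tree)
    preds += 0.1 * tree.predict(trn_xs)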

51:56 Introducing the walkthrus on the paddy competition, and what is so thrilling about it

53:54 What is the basic process extracted from the walkthrus?

54:28 What does fastkaggle do for us? How to install and update it? Can it download Kaggle data for us regardless of whether we are on Kaggle or not?
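(Roughly as in the walkthru notebooks:)

from fastkaggle import setup_comp

comp = 'paddy-disease-classification'
# inside a Kaggle notebook this returns the mounted input path; locally it
# downloads and unpacks the competition data (and pip-installs what's requested)
path = setup_comp(comp, install='fastai')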

56:02 There are many benefits to keeping at Kaggle competitions: they force you to face the truth and stop lying to yourself about how good your model is, etc.

58:39 What are the two things we should focus on? A good validation set and the ability to iterate within a minute. Why is iterating fast so important (told with a story)?

1:00:54 When does Jeremy use seed=42 and when not?

1:01:45 Do recognise that PyTorch and PILImage describe the shape of a tensor/image differently: PyTorch reports (640 rows x 480 columns) while PILImage reports (480 columns x 640 rows).
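(A quick check of the difference, assuming fastai.vision.all is imported and files is a list of image paths:)

im = PILImage.create(files[0])
im.size           # PIL: (width, height), e.g. (480, 640)
tensor(im).shape  # PyTorch: (rows, cols, channels), e.g. (640, 480, 3)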

1:02:52 Does it take a lot of compute to figure out the shape or size of an image? How does fastcore.parallel help Jeremy figure out the sizes of all the images much faster?
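(A sketch of the trick: PIL only reads the file header to report a size, and fastcore's parallel maps the function across worker processes:)

from fastcore.parallel import parallel
from PIL import Image

def f(fname): return Image.open(fname).size  # header-only read, very cheap
sizes = parallel(f, files, n_workers=8)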

1:04:12 What's the easiest thing to do with the images? What does item_tfms=Resize(480, method='squish') do? What does batch_tfms=aug_transforms(size=128, min_scale=0.75) do? Can we use dls.show_batch(max_n=6) for any kind of data?
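(Putting those pieces together, roughly as in the lecture, with trn_path assumed to be the training images folder:)

from fastai.vision.all import *

dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, seed=42,
    item_tfms=Resize(480, method='squish'),               # squish each file to 480px on load
    batch_tfms=aug_transforms(size=128, min_scale=0.75))  # augment and shrink on the GPU
dls.show_batch(max_n=6)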

1:05:48 Why does Jeremy usually build a model very early on, and choose one he can iterate on fast?

1:06:20 What is the project Jeremy and Thomas created to find the best models for fine-tuning? How many different architectures were examined? How different are the two datasets they used?

1:07:22 What are the criteria for evaluating the models? How are they compared? Which model architecture did Jeremy choose for his first model, and why? What does Jeremy think of studying the structure of model architectures like resnet26?

1:08:41 How did Jeremy create his first model? How does Jeremy use lr_find to pick a more appropriate learning rate? How fast is Jeremy's first model? Why does Jeremy want it this way?
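(A sketch of that first model; the 'resnet26d' string assumes a recent timm is installed:)

learn = vision_learner(dls, 'resnet26d', metrics=error_rate, path='.').to_fp16()
# does a mini training run with an exponentially growing learning rate and plots
# the losses; pick a rate comfortably before the loss blows up
learn.lr_find()
learn.fine_tune(3, 0.01)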

1:10:22 Should we submit as soon as we can? How do we check the submission format first? How should we build a dataloader for the test set? How do we predict on the whole test set and get back a list of indices pointing to the most probable disease type? How do we create a dictionary from dls.vocab and use pandas' map to turn those indices into disease-type strings? How do we put the final result into a dataframe, save it as a CSV file, and check the result from the terminal?
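(The whole submission pipeline as a sketch; sample_submission.csv and the 'label' column follow this competition's format:)

ss = pd.read_csv(path/'sample_submission.csv')
tst_files = get_image_files(path/'test_images').sorted()
tst_dl = dls.test_dl(tst_files)

# with_decoded=True also returns the argmax class index for each test image
probs, _, idxs = learn.get_preds(dl=tst_dl, with_decoded=True)

# map the integer indices to disease-name strings via the vocab
mapping = dict(enumerate(dls.vocab))
ss['label'] = pd.Series(idxs.numpy()).map(mapping)
ss.to_csv('subm.csv', index=False)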

1:14:02 How do we make even submitting to Kaggle fast and automated?
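(When running locally, the official Kaggle API client can push the file, assuming your Kaggle credentials are configured:)

from kaggle import api
api.competition_submit_cli('subm.csv', 'initial resnet26d 128px', comp)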

1:14:40 A baseline model which iterates fast, trains within a minute, and gets us into the top 80% (i.e. just out of the bottom 20%) is not bad, and is a good starting point.

1:15:15 How to automate even the process of sharing Kaggle notebooks? And why would you publish your notebooks on Kaggle (why is this so beneficial)?

1:17:06 How does Jeremy iterate models with notebooks (both local and on Kaggle) in a really simple but practically effective style?

1:20:17 What does Jeremy think of AutoML? How does Jeremy approach hyperparameter optimization? How did Jeremy find out, without a grid search, that squish beats cropping in most cases? How does Jeremy find a good learning rate fast without a grid search?

1:22:48 What is Jeremy's rule of thumb? Deep learning models for computer vision problems; random forests for tabular datasets (he rarely bothers with GBMs).

1:24:16 Why did the first model run so slowly on Kaggle GPUs? How do we make our model/notebook run faster on Kaggle GPUs and CPUs? How to first resize (down-size) all the training data into a different folder? How much faster did Jeremy get after this?
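(A sketch using fastai's resize_images, following the walkthru; the 'sml' folder name and 256px size are that notebook's choices:)

trn_path = Path('sml')
# write a shrunk copy of every training image into a separate folder once, so
# each epoch no longer pays the cost of decoding huge JPEGs on two CPUs
resize_images(path/'train_images', dest=trn_path, max_size=256, recurse=True)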

1:26:21 How badly did the first model utilize the Kaggle GPU? Were Kaggle's 2 CPUs exhausted?

1:26:44 How did Jeremy pick the second model architecture for the second iteration?

1:27:53 How much can a novel new architecture improve accuracy versus the first model (resnet26)?

1:28:33 Why should we move on from the resnet era to the new convnext era? How to pick appropriate models from the convnext family for our iterations?

1:30:01 How to iterate the model with different settings fast, by putting everything into a single train function? How to quickly try resizing with random cropping instead of squish? What did Jeremy find out from this iteration?
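(Roughly the shape of that function; the defaults shown are illustrative assumptions:)

def train(arch, item, batch, epochs=5):
    # one call = one experiment: swap the architecture, resize method, or augmentation
    dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, seed=42,
        item_tfms=item, batch_tfms=batch)
    learn = vision_learner(dls, arch, metrics=error_rate).to_fp16()
    learn.fine_tune(epochs, 0.01)
    return learn

learn = train('convnext_small_in22k',
              item=Resize(192),  # default method is crop, i.e. no squish
              batch=aug_transforms(size=128, min_scale=0.75))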

1:31:10 How to iterate the model with padding? What's special about padding versus cropping versus squish? What are its downsides? And how well did this iteration do?

1:32:01 What does our data augmentation do to images? How to understand test-time augmentation (TTA) as a kind of mini-bagging? How easy does fastai make TTA? TTA should work better, but in this particular Kaggle run it didn't; Jeremy said he would come back to this next time.
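(fastai makes it one call; a sketch evaluated on the validation set:)

# average the model's predictions over several augmented versions of each image,
# i.e. a "mini-bagging" over augmentations of a single model
preds, targs = learn.tta(dl=learn.dls.valid)
error_rate(preds, targs)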

1:34:12 How to iterate the model with larger images and more epochs? How much better did this iteration get us? Up to this point, the mechanism behind all the iterations above applies universally across very different problems.

1:36:08 How does pandas indexing make the mapping from indices to vocab strings super fast? Then submit to Kaggle the usual way.

1:38:16 Do we always do data augmentation for images? What data augmentation does TTA use?

1:39:29 Why does Jeremy use different aspect ratios during different iterations? What better approaches has Jeremy been experimenting with?

1:41:07 Why didn't Jeremy create more image-like padded images, instead of using simple padding (i.e. black bars)?

4 Likes

Not easy, but thanks to your teaching I was able to get close to the top on the Paddy Doctor competition. Thanks for everything, Jeremy.

4 Likes

Hi all, I was wondering whether there is a fastai way to handle large TIFF images, since they are all very large and cannot be fed directly into the neural network for training, and/or a way to get rid of the white space in the images.

Thanks

@jeremy: concerning random forests, I'm thinking about a small correction: I would rather use the estimated probabilities to make an actual prediction first. The notebook has:

all_probs = [t.predict(val_xs) for t in trees]
avg_probs = np.stack(all_probs).mean(0)

mean_absolute_error(val_y, avg_probs)

I'd like to change that to:

preds = avg_probs > 0.5
mean_absolute_error(val_y, preds)

This gives you an even closer result to RandomForestClassifier

What do you think?

My environment is different from Jeremy's: a mamba install of fastbook with Python 3.10 on Ubuntu 20.04.4 LTS, with sklearn 1.1.2. This issue arises because I use that version of sklearn. If you just need to run the book as Jeremy describes, read no further; but if you want your project to work with later versions of sklearn and still analyse with treeinterpreter, the following workaround may be useful. I did this and the notebook completed without fault, albeit with the pip install line at the top commented out.

I came upon this issue when running the clean version 09_tabular.ipynb.

The treeinterpreter module has not kept pace with the changes in sklearn.

sklearn.ensemble.forest was renamed to sklearn.ensemble._forest in 437ca05 on Oct 16, 2019. You need to install an older sklearn. Try version 0.21.3, released on Jul 30, 2019:

The from treeinterpreter import treeinterpreter line imports sklearn.ensemble.forest, which was long ago renamed to sklearn.ensemble._forest, so the import fails. The options are to pin a pre-2019 version of sklearn, or to make the EDIT in the file treeinterpreter.py as described below.

  1. Comment out the import for the old versions of sklearn and add the new import statement:
# from sklearn.ensemble.forest import ForestClassifier, ForestRegressor
from sklearn.ensemble._forest import ForestClassifier, ForestRegressor

NOTE: The advice (in the link below) to find the file by printing its location in the notebook failed for me; I had to look for the location manually via a terminal. I found it in
$home/mambaforge/envs/book/python3.10/site-packages/treeinterpreter/treeinterpreter.py

Working Work Around for treeinterpreter

2 Likes

In the Kaggle notebook's gini section it says:

If the group is all the same, the probability is 1.0, and 0.0 if they're all different.

Isn't it the opposite here? The higher the gini, the more chaotic the group is.

1 Like

Hi!

I am trying to use the TabularPandas tool to predict future sales in the Kaggle playground competition Tabular Playground Series - Sep 2022.

The results are currently poor, despite relatively thorough analysis and feature engineering, so any tips on how to improve the notebook would be really appreciated.

However, I was wondering about the need to use "procs" when creating the TabularPandas object, because I get improved results for RandomForest and XGBoost when not adding "procs", instead using sklearn's LabelEncoder to categorize the relevant columns. But I am unable to train a neural network using the fastai framework: after creating a learner object with tabular_learner from a dls without procs, I get this error:

AttributeError                            Traceback (most recent call last)
Input In [174], in <cell line: 1>()
----> 1 learn = tabular_learner(dls, layers=[1000,500], config=config_tabular,
      2                         n_out=1,
      3                         #loss_func=F.mse_loss,
      4                         metrics=[exp_rmspe])

File ~/mambaforge/envs/fastai2/lib/python3.10/site-packages/fastai/tabular/learner.py:42, in tabular_learner(dls, layers, emb_szs, config, n_out, y_range, **kwargs)
     40 if layers is None: layers = [200,100]
     41 to = dls.train_ds
---> 42 emb_szs = get_emb_sz(dls.train_ds, {} if emb_szs is None else emb_szs)
     43 if n_out is None: n_out = get_c(dls)
     44 assert n_out, "`n_out` is not defined, and could not be inferred from data, set `dls.c` or pass `n_out`"

File ~/mambaforge/envs/fastai2/lib/python3.10/site-packages/fastai/tabular/model.py:32, in get_emb_sz(to, sz_dict)
     27 def get_emb_sz(
     28     to:Tabular|TabularPandas,
     29     sz_dict:dict=None # Dictionary of {'class_name' : size, ...} to override default emb_sz_rule
     30 ) -> list: # List of embedding sizes for each category
     31     "Get embedding size for each cat_name in Tabular or TabularPandas, or populate embedding size manually using sz_dict"
---> 32     return [_one_emb_sz(to.classes, n, sz_dict) for n in to.cat_names]

File ~/mambaforge/envs/fastai2/lib/python3.10/site-packages/fastai/tabular/model.py:32, in <listcomp>(.0)
     27 def get_emb_sz(
     28     to:Tabular|TabularPandas,
     29     sz_dict:dict=None # Dictionary of {'class_name' : size, ...} to override default emb_sz_rule
     30 ) -> list: # List of embedding sizes for each category
     31     "Get embedding size for each cat_name in Tabular or TabularPandas, or populate embedding size manually using sz_dict"
---> 32     return [_one_emb_sz(to.classes, n, sz_dict) for n in to.cat_names]

File ~/mambaforge/envs/fastai2/lib/python3.10/site-packages/fastcore/basics.py:491, in GetAttr.__getattr__(self, k)
    489 if self._component_attr_filter(k):
    490     attr = getattr(self,self._default,None)
--> 491     if attr is not None: return getattr(attr,k)
    492 raise AttributeError(k)

File ~/mambaforge/envs/fastai2/lib/python3.10/site-packages/fastcore/transform.py:212, in Pipeline.__getattr__(self, k)
--> 212 def __getattr__(self,k): return gather_attrs(self, k, 'fs')

File ~/mambaforge/envs/fastai2/lib/python3.10/site-packages/fastcore/transform.py:173, in gather_attrs(o, k, nm)
    171 att = getattr(o,nm)
    172 res = [t for t in att.attrgot(k) if t is not None]
--> 173 if not res: raise AttributeError(k)
    174 return res[0] if len(res)==1 else L(res)

AttributeError: classes

Hey all,
I was trying to understand how gradient boosting works; I wrote a post here which sums up my understanding.
https://prashantmdgl9.github.io/ml_experiments/2022/09/12/How-to-Explain-Gradient-Boosting.html

Please feel free to use it and also, if there are any gaps in my understanding, I will be happy if you point them out.

Cheers!

I had a doubt about the _side_score function.
The notebook mentions: "We'll then multiply this by the number of rows, since a bigger group has more impact than a smaller group." But here tot is the total number of survivors (1's), not the total number of records:

def _side_score(side, y):
    tot = side.sum()
    if tot<=1: return 0
    return y[side].std()*tot

Am I missing something? Thanks

Probably a very nooby point, but a heads up to anyone trying to use convnext models, whilst using conda as your package manager:

I kept getting this RuntimeError: Unknown model when running vision_learner(dls, 'convnext_small_in22k', metrics=error_rate, path='.').to_fp16().
The reason turned out to be that I had an old version of timm installed. When I installed it via conda it defaulted to 0.4.12, as this is the most recent version on Timm :: Anaconda.org. But PyPI has a newer version, 0.6.7, which includes the convnext models. You can pip install timm -U in your conda environment to get access to them. :slight_smile:

1 Like

Not sure if I missed this somewhere but is there anything like treeinterpreter for XGBoost?

I had the same thought, and I think you're right, so it could just be a typo. We can try an example of a df where we have a completely pure column:

dep = 'Survived'  # as in the notebook

# Let's redefine the gini function so it takes the dataframe as a parameter
# (so we can pass our own "pure" df)
def gini(cond, df):
    act = df.loc[cond, dep]
    return 1 - act.mean()**2 - (1-act).mean()**2

pure_df = pd.DataFrame({'Sex':['female' for i in range(10)],'Survived':np.ones(10)})
pure_df

>>
Sex	Survived
0	female	1.0
1	female	1.0
2	female	1.0
3	female	1.0
4	female	1.0
5	female	1.0
6	female	1.0
7	female	1.0
8	female	1.0
9	female	1.0

pure_gini = gini(pure_df.Sex=='female', pure_df)  # condition built on pure_df, not the original df
pure_gini
>> 0.0

Here’s a link that helped me understand gini impurity:

Hey Prashant - I get a 404 error when I click on that link. Have you moved it to a different URL by any chance?

The plotting functions in the xgboost library look promising (it seems like xgboost.plot_tree() might be what you're looking for).

Python API Reference — xgboost 2.0.0-dev documentation
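(For example, assuming model is a fitted booster:)

import xgboost
# renders the split structure of a single tree from the trained ensemble
xgboost.plot_tree(model, num_trees=0)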

1 Like

The notebook comment is referring to the number of rows that belong to that side (i.e. lhs or rhs) of the split: side is a boolean mask, so side.sum() counts the True values, i.e. the rows on that side, not the survivors. We are calculating the std deviation of that side of the tree with y[side].std(), so we want to weight this by the number of rows on that side only.

1 Like

Hey there!
Here is the URL - I will update it in the original post too.

https://prashantmdgl9.github.io/ml_experiments/2022/09/10/How-to-Explain-Gradient-Boosting.html