Lesson 6 official topic

Not easy, but thanks to your teachings I was able to get close to the top on Paddy Doctor. Thanks for everything, Jeremy.


Hi all, I was wondering whether there is a fastai way to handle large TIFF images. The images are all very large and cannot be fed directly into the neural network for training. I would also like to get rid of the white space in the images.

Thanks

@jeremy: concerning Random Forests, I'm thinking about a small correction. I would rather use the estimated probabilities to make a prediction. The notebook currently has:

all_probs = [t.predict(val_xs) for t in trees]
avg_probs = np.stack(all_probs).mean(0)

mean_absolute_error(val_y, avg_probs)

I'd like to change that to:

preds = .5 < avg_probs
mean_absolute_error(val_y, preds)

This gives a result that is even closer to that of RandomForestClassifier.

What do you think?
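
A small self-contained sketch of the difference between the two calculations. The arrays `val_y` and `all_probs` below are made up for illustration (they are not the notebook's data), and the error is written out in plain NumPy instead of calling sklearn's mean_absolute_error:

```python
import numpy as np

# Hypothetical validation labels and per-tree 0/1 predictions (3 trees, 6 rows).
val_y = np.array([1, 0, 1, 1, 0, 0])
all_probs = [
    np.array([1, 0, 1, 0, 0, 1]),
    np.array([1, 0, 0, 1, 0, 0]),
    np.array([1, 1, 1, 1, 0, 0]),
]

# Average the trees' outputs: each entry is the fraction of trees voting 1.
avg_probs = np.stack(all_probs).mean(0)

def mae(y_true, y_pred):
    """Mean absolute error, written out so the sketch needs only NumPy."""
    return np.abs(y_true - y_pred).mean()

# Error of the raw averaged votes: every dissenting tree adds to the error.
mae_probs = mae(val_y, avg_probs)

# Error after thresholding the averaged votes into hard 0/1 predictions.
preds = (avg_probs > 0.5).astype(int)
mae_preds = mae(val_y, preds)
```

With 0/1 labels and 0/1 predictions, the mean absolute error is simply the misclassification rate, which is why the thresholded figure is the one that is comparable to a classifier's hard predictions.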

My environment is different from Jeremy's: it is a mamba install of fastbook with Python 3.10 on Ubuntu 20.04.4 LTS with sklearn 1.1.2, and the issue below arises because of that sklearn version. If you just need to run the book as Jeremy describes, read no further. But if you want to use your project with later versions of sklearn and analyse it with treeinterpreter, the following workaround may be useful. With it, the notebook completed without fault, albeit with the pip install line at the top commented out.

I came upon this issue when running the clean version of 09_tabular.ipynb.

The treeinterpreter module has not kept pace with the changes in sklearn.

sklearn.ensemble.forest was renamed to sklearn.ensemble._forest in 437ca05 on Oct 16, 2019. You need to install an older sklearn. Try version 0.21.3 released on Jul 30, 2019:

The line "from treeinterpreter import treeinterpreter" imports sklearn.ensemble.forest, which was renamed to sklearn.ensemble._forest long ago, so the import fails. The options are to switch to a version of sklearn from before the rename, or to edit the file treeinterpreter.py as described below.

  1. Comment out the import for the old versions of sklearn and add the new import statement:
# from sklearn.ensemble.forest import ForestClassifier, ForestRegressor
from sklearn.ensemble._forest import ForestClassifier, ForestRegressor

NOTE:
The advice in the link below to find the file by printing its location in the notebook failed for me; I had to look for the location manually via a terminal.
I found it in
$home/mambaforge/envs/book/python3.10/site-packages/treeinterpreter/treeinterpreter.py

Working Work Around for treeinterpreter
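
For anyone who would rather not hunt through a terminal, here is a possible way to find the file from Python without triggering the failing import. It is only a sketch: `locate_module_file` is a helper name made up for this post, and it assumes the package is installed as an ordinary directory on sys.path.

```python
import importlib.util
from pathlib import Path

def locate_module_file(package, filename):
    """Return the path of `filename` inside an installed top-level package, or None.

    find_spec on a top-level package name only searches for the package;
    it does not execute the package's code, so a broken import inside one
    of its modules is not triggered.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    for folder in spec.submodule_search_locations:
        candidate = Path(folder) / filename
        if candidate.exists():
            return candidate
    return None

# In the environment described above one would call (hypothetical usage):
# print(locate_module_file("treeinterpreter", "treeinterpreter.py"))
```

The returned path is the file in which to comment out the old import and add the new one.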


In the gini section of the Kaggle notebook:

If the group is all the same, the probability is 1.0 , and 0.0 if they’re all different:

Isn't it the opposite here? The higher the gini, the more mixed the group is.


Hi!

I am trying to use the TabularPandas tool for predicting future sales in the playground Kaggle competition Tabular Playground Series - Sep 2022.

The results are currently poor, despite relatively thorough analysis and feature engineering, so any tips on how to improve the notebook would be really appreciated.

However, I was wondering about the need to use “procs” when creating the TabularPandas object, because I get better results for RandomForest and XGBoost when I omit “procs” and instead use sklearn’s LabelEncoder to encode the relevant columns. The problem is that I am then unable to train a neural network with the fastai framework: after creating a learner with tabular_learner from a dls built without procs, I get an error:

AttributeError                            Traceback (most recent call last)
Input In [174], in <cell line: 1>()
----> 1 learn = tabular_learner(dls, layers=[1000,500], config=config_tabular,
      2                         n_out=1,
      3                         #loss_func=F.mse_loss,
      4                         metrics=[exp_rmspe])

File ~/mambaforge/envs/fastai2/lib/python3.10/site-packages/fastai/tabular/learner.py:42, in tabular_learner(dls, layers, emb_szs, config, n_out, y_range, **kwargs)
     40 if layers is None: layers = [200,100]
     41 to = dls.train_ds
---> 42 emb_szs = get_emb_sz(dls.train_ds, {} if emb_szs is None else emb_szs)
     43 if n_out is None: n_out = get_c(dls)
     44 assert n_out, "n_out is not defined, and could not be inferred from data, set dls.c or pass n_out"

File ~/mambaforge/envs/fastai2/lib/python3.10/site-packages/fastai/tabular/model.py:32, in get_emb_sz(to, sz_dict)
     27 def get_emb_sz(
     28     to:Tabular|TabularPandas,
     29     sz_dict:dict=None # Dictionary of {'class_name' : size, ...} to override default emb_sz_rule
     30 ) -> list: # List of embedding sizes for each category
     31     "Get embedding size for each cat_name in Tabular or TabularPandas, or populate embedding size manually using sz_dict"
---> 32     return [_one_emb_sz(to.classes, n, sz_dict) for n in to.cat_names]

File ~/mambaforge/envs/fastai2/lib/python3.10/site-packages/fastai/tabular/model.py:32, in <listcomp>(.0)
     27 def get_emb_sz(
     28     to:Tabular|TabularPandas,
     29     sz_dict:dict=None # Dictionary of {'class_name' : size, ...} to override default emb_sz_rule
     30 ) -> list: # List of embedding sizes for each category
     31     "Get embedding size for each cat_name in Tabular or TabularPandas, or populate embedding size manually using sz_dict"
---> 32     return [_one_emb_sz(to.classes, n, sz_dict) for n in to.cat_names]

File ~/mambaforge/envs/fastai2/lib/python3.10/site-packages/fastcore/basics.py:491, in GetAttr.__getattr__(self, k)
    489 if self._component_attr_filter(k):
    490     attr = getattr(self,self._default,None)
--> 491     if attr is not None: return getattr(attr,k)
    492 raise AttributeError(k)

File ~/mambaforge/envs/fastai2/lib/python3.10/site-packages/fastcore/transform.py:212, in Pipeline.__getattr__(self, k)
--> 212 def __getattr__(self,k): return gather_attrs(self, k, 'fs')

File ~/mambaforge/envs/fastai2/lib/python3.10/site-packages/fastcore/transform.py:173, in gather_attrs(o, k, nm)
    171 att = getattr(o,nm)
    172 res = [t for t in att.attrgot(k) if t is not None]
--> 173 if not res: raise AttributeError(k)
    174 return res[0] if len(res)==1 else L(res)

AttributeError: classes

Hey all,
I was trying to understand how gradient boosting works and wrote a post that sums up my understanding:
https://prashantmdgl9.github.io/ml_experiments/2022/09/12/How-to-Explain-Gradient-Boosting.html

Please feel free to use it and also, if there are any gaps in my understanding, I will be happy if you point them out.

Cheers!

I had a doubt about the side score function.
The notebook mentions: "We’ll then multiply this by the number of rows, since a bigger group has more impact than a smaller group." But here tot looks like the total number of survivors (1’s) and not the total number of records.

def _side_score(side, y):
    tot = side.sum()
    if tot<=1: return 0
    return y[side].std()*tot

Am I missing something? Thanks

Probably a very nooby point, but a heads up to anyone trying to use convnext models, whilst using conda as your package manager:

I kept getting this RuntimeError: Unknown model when running vision_learner(dls, 'convnext_small_in22k', metrics=error_rate, path='.').to_fp16().
The reason turned out to be that I had an old version of timm installed. When I installed it via conda it defaulted to 0.4.12, as this is the most recent version on Anaconda.org. PyPI, however, has a newer version, 0.6.7, which includes the convnext models. You can run pip install timm -U in your conda environment to get access to them.
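
A quick way to check which version is installed before creating the learner. This is a rough sketch: the helper names are made up for this post, the parsing only handles plain dotted versions such as 0.6.7 (not dev or rc suffixes), and the two version numbers are simply the ones quoted above.

```python
from importlib import metadata

def parse_version(text):
    """Turn a plain dotted version such as '0.6.7' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))

def is_at_least(installed, required):
    """Compare two plain dotted version strings numerically, not as text."""
    return parse_version(installed) >= parse_version(required)

def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

conda_default = "0.4.12"  # version the post says conda installed
pypi_version = "0.6.7"    # version the post says includes the convnext models

# Hypothetical usage in the affected environment:
# v = installed_version("timm")
# if v is None or not is_at_least(v, pypi_version): print("upgrade timm")
```

Comparing tuples of ints rather than raw strings matters once a component reaches two digits, since "0.10.0" sorts before "0.6.7" as text.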


Not sure if I missed this somewhere but is there anything like treeinterpreter for XGBoost?

I had the same thought and I think you’re right, so it could just be a typo. We can try out an example of a df where we have a completely pure column:

import numpy as np
import pandas as pd

dep = 'Survived'  # name of the dependent variable, as in the notebook

# Let's redefine the gini function so it takes the dataframe as a parameter (so we can pass our own "pure" df)
def gini(cond, df):
    act = df.loc[cond, dep]
    return 1 - act.mean()**2 - (1-act).mean()**2

pure_df = pd.DataFrame({'Sex':['female' for i in range(10)],'Survived':np.ones(10)})
pure_df

>>
Sex	Survived
0	female	1.0
1	female	1.0
2	female	1.0
3	female	1.0
4	female	1.0
5	female	1.0
6	female	1.0
7	female	1.0
8	female	1.0
9	female	1.0

# Build the condition from pure_df itself, not from the original df
pure_gini = gini(pure_df.Sex=='female', pure_df)
pure_gini
>> 0.0

Here’s a link that helped me understand gini impurity:

Hey Prashant - I get a 404 error when I click on that link. Have you moved it to a different URL by any chance?

The plotting functions in the xgboost library look promising (it seems like xgboost.plot_tree() might be what you’re looking for):

Python API Reference — xgboost 2.0.0-dev documentation


The notebook comment is referring to the number of rows that belong to that side (i.e. lhs or rhs) of the split. We are calculating the std deviation of that side of the tree with y[side].std() - so we want to weight this by the number of rows on that side only.
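
A tiny illustration of that point with made-up numbers (the arrays `col` and `y` are invented for this post): `side` is a boolean mask over rows, so summing it counts the rows on that side, which is not the same as summing the target.

```python
import numpy as np

# Hypothetical column values and 0/1 targets for 6 rows.
col = np.array([22, 38, 26, 35, 54, 2])
y = np.array([0, 1, 1, 1, 0, 0])

# Boolean mask: True for rows that fall on the left-hand side of the split.
lhs = col <= 30

rows_on_lhs = lhs.sum()          # True counts as 1, so this counts rows
survivors_on_lhs = y[lhs].sum()  # this is the number of 1s in the target instead
```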


Hey there!
Here is the URL. I will update it in the original post too.

https://prashantmdgl9.github.io/ml_experiments/2022/09/10/How-to-Explain-Gradient-Boosting.html

That works for me - great article! I submitted a pull request to fix some small typos - I think it would be good to play about with the styling a little bit so the text is centered in the page and the tables are formatted a bit more clearly. I’m not familiar with fastpages but I’m sure there’ll be a way to configure that somewhere in the setup!

Thanks! I merged your pull request.

I agree with you on the formatting issues. Internally it uses Jekyll, which isn’t rendering well on GitHub. In the Markdown source it looks fine: the tables have borders and the other formatting is intact, but once pushed to GitHub it shows up in raw format.


Hello! I have a question about the side score function below:
def _side_score(side, y):
    tot = side.sum()
    if tot<=1: return 0
    return y[side].std()*tot

I am struggling to understand why multiplying the standard deviation by “tot” matters, and why we are penalising the side score if a lot of elements end up on one side of the split.

I see that we of course normalise by dividing the final score by the length of the dataset, but why did we choose to multiply in the first place?


At 19:30 in regards to bagging and random forests, Jeremy says creating subsets by grabbing a random 50% will create uncorrelated models, but there must be a limit to this, right? At some point I’ll have grabbed all of the possible subsets and any additional subset I grab will be a copy of a previous subset, and thus have a correlation of 1 with another subset.

This is probably unlikely for any reasonably sized dataset, but I assume the intuition still holds if I by chance grab a subset where only one element is different. Slightly less for 2, 3, or 4 elements, and even less for 50, 100, etc. What does reasonably uncorrelated look like in practice? How big does the dataset have to be for the models to meet that bar?

Also, I now see that the get_tree function in the notebook uses random.choice(n, int(prop*n)) to get a prop-sized bucket, but it does so with replacement. Would it be better or worse to pass random.choice(n, int(prop*n), False) to get samples without replacement? In my tests on the Titanic set I think I’m getting better results, but does selecting with replacement make the buckets more or less correlated?
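
To make the difference concrete, here is a small sketch that counts how many distinct rows each sampling mode picks. It uses NumPy's seeded Generator.choice rather than the random.choice call quoted above, and the values of `n` and `prop` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
n, prop = 1000, 0.75
size = int(prop * n)

# Default behaviour: sampling with replacement, so rows can repeat.
with_repl = rng.choice(n, size)

# Sampling without replacement: every selected row index is distinct.
without_repl = rng.choice(n, size, replace=False)

uniq_with = len(np.unique(with_repl))
uniq_without = len(np.unique(without_repl))
```

Without replacement, all 750 indices are distinct. With replacement, the expected share of distinct rows is 1 - (1 - 1/n)^size, roughly 53% of n for these values, so each tree sees fewer different rows even though the bucket has the same length. Whether that helps or hurts on a given dataset is something the sketch does not settle.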

If it takes you 1000 subsets to encounter such a duplicate, then the impact of that correlation=1 subset on the result will be about 0.1%. YMMV, IANAM**

**Mathematician