Lesson 5 official topic

Think of it as instantiating a class without storing the instance in a variable, and then passing an argument directly to the new instance.

So instead of…

random_splitter = RandomSplitter(seed=42)
random_splitter(df)

…we’re doing…

RandomSplitter(seed=42)(df)
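This works in Python because the class defines `__call__`, which makes the instance itself callable. A minimal sketch (a toy stand-in, not fastai's actual implementation):

```python
import random

class RandomSplitter:
    """Toy stand-in for fastai's RandomSplitter: calling the instance does the split."""
    def __init__(self, seed=42):
        self.seed = seed

    def __call__(self, items):
        # Shuffle indices deterministically, then take an 80/20 train/valid split
        rng = random.Random(self.seed)
        idxs = list(range(len(items)))
        rng.shuffle(idxs)
        cut = int(len(idxs) * 0.8)
        return idxs[:cut], idxs[cut:]

# The two forms from the post are equivalent:
splitter = RandomSplitter(seed=42)
train_a, valid_a = splitter(list(range(10)))
train_b, valid_b = RandomSplitter(seed=42)(list(range(10)))
assert (train_a, valid_a) == (train_b, valid_b)
```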

I believe the Alone feature is supposed to be df['Alone'] = df.Family==0 i.e. if the number of people in the family is 0, then the person is alone?

>>> sorted(df['Family'].unique())
[0, 1, 2, 3, 4, 5, 6, 7, 10]

The problem with df['Alone'] = df.Family!=1 is that df['Alone'] will be True not only for 0 but also for every other value except 1.
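A tiny pandas sketch (with made-up Family values) makes the difference visible:

```python
import pandas as pd

df = pd.DataFrame({'Family': [0, 1, 2, 5]})

wrong = df.Family != 1  # True for 0, 2 and 5 -- not just for people who are alone
right = df.Family == 0  # True only when there are no other family members aboard

print(wrong.tolist())  # [True, False, True, True]
print(right.tolist())  # [True, False, False, False]
```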

I am not sure if I am missing something, but in the course22 repo, the ‘clean’ folder simply doesn’t exist. Where is it?

Hi all,

I’m working through the “Why you should use a framework” notebook and was unclear how Normalize works.

In this notebook we are working with Kaggle’s Titanic competition dataset, and create a DataLoaders object as follows:

dls = TabularPandas(
    df,
    splits=splits,
    procs=[Categorify, FillMissing, Normalize],
    cat_names=["Sex", "Pclass", "Embarked", "Deck", "Title"],
    cont_names=["Age", "SibSp", "Parch", "LogFare", "Alone", "TicketFreq", "Family"],
    y_names="Survived",
    y_block=CategoryBlock()
).dataloaders(path=".")

After I create dls I call .show_batch and notice that the Age column for example is not normalized:

Does that mean Normalize takes place during training?

The docs on TabularProc say:

These transforms are applied as soon as the data is available rather than as data is called from the DataLoader

However I’m not sure what that means.
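As I understand it, the procs run once, eagerly, when the TabularPandas object is built, and show_batch displays decoded (original-scale) values, which is why Age looks un-normalized there. What Normalize itself does can be sketched in plain pandas (toy data; the key point is that the training split's statistics are computed once and reused for every split):

```python
import pandas as pd

df = pd.DataFrame({'Age': [22.0, 38.0, 26.0, 35.0, 54.0, 2.0]})
train_idx, valid_idx = [0, 1, 2, 3], [4, 5]

# Normalize stores the *training-set* statistics once, up front...
mean = df.loc[train_idx, 'Age'].mean()
std = df.loc[train_idx, 'Age'].std()

# ...and applies them to every split, so validation/test rows use the same scaling
normed = (df['Age'] - mean) / std
print(normed.round(3).tolist())
```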

When I was running this notebook, I got an error: get_dummies() produces True/False instead of 1/0s.

I had to change cell 17 from

indep_cols = ['Age', 'SibSp', 'Parch', 'LogFare'] + added_cols

t_indep = tensor(df[indep_cols].values, dtype=torch.float)
t_indep

to

indep_cols = ['Age', 'SibSp', 'Parch', 'LogFare'] + added_cols

t_indep = tensor(df[indep_cols].astype(float).values, dtype=torch.float)
t_indep

to make it work. Do you have the same issue, or is it a version problem with my local Python libraries?

Hi everyone,

Recently I tried to run the code from 09_tabular.ipynb in Colab. There were some obstacles.

Maybe someone will find this useful :slight_smile:

  1. There is a part where we use the dtreeviz library. In my instance it didn’t run as-is, but after rewriting it a little it works:
!pip install -q -U dtreeviz
import dtreeviz
samp_idx = np.random.permutation(len(y))[:500]

viz_cmodel = dtreeviz.model(m,
                           tree_index=3,
                           X_train=xs.iloc[samp_idx],
                           y_train=y.iloc[samp_idx],
                           feature_names=xs.columns,
                           target_name=dep_var)
viz_cmodel.view(scale=3)
# and also with different flags
viz_cmodel.view(orientation='LR', scale=3)

I found all of this, plus more examples, in the TensorFlow docs.

FYI: there are a lot of warnings: “WARNING:matplotlib.font_manager:findfont: Font family ‘Arial’ not found.” I tried some things from Stack Overflow but nothing helped; if you figure it out, please let me know :slight_smile: (!sudo apt install msttcorefonts -qq, !rm ~/.cache/matplotlib -rf, !sudo apt install font-manager)

  2. (minor) Somehow Colab didn’t want to download the files, but after I changed the path to another place in Google Drive everything worked.

  3. I also rewrote the partial-dependence plotting like this:

from sklearn.inspection import PartialDependenceDisplay

fig,ax = plt.subplots(figsize=(12, 4))
PartialDependenceDisplay.from_estimator(m, valid_xs_final, ['YearMade','ProductSize'],
                        grid_resolution=20, ax=ax);

Yes, that is a faster way of doing it, though he loses many people who don’t program all day long.

I’ve got the same issue. I’ve run this before with no problem at all. It also just worked in the YouTube lesson.

During lesson 5, Jeremy mentioned that when using k-1 dummy variables you have to include a constant. Does anyone know why this is?

So, following this lecture, I went one step deeper and built a neural network entirely from scratch (no PyTorch, TensorFlow, or Keras — just NumPy and pandas) to use on the Titanic dataset. Pretty good results for an NN, as I got the same result as when the model was a carefully designed scikit-learn DecisionTreeClassifier (see “Results” in the notebook).

See the notebook here: ANN in NumPy on Titanic Dataset [0.765] | Kaggle

Here are the results I’ve gotten so far (from-scratch_submission.csv was this model):

If you liked the notebook, please give it an upvote! Thanks!


Great question! It got me thinking too.
(Source: ChatGPT)
A constant is included and optimized to give a value when all the dummy variables/predictors (independent variables) are zero. A constant need not be added when you expect the output to be zero whenever the predictors are zero.

These are two reasons I understood and found intuitive among other reasons.

Hello, I wanted to read chapter 9 of the book but when i try to launch this code:

from kaggle import api

if not path.exists():
    path.mkdir(parents=True)
    api.competition_download_cli(comp, path=path)
    shutil.unpack_archive(str(path/f'{comp}.zip'), str(path))

#path.ls(file_type='text')
path.ls()

the output i receive is this:
(#0) []
so it does not download anything. I joined the Kaggle competition and provided my username and key for authentication, and I don’t think I changed the code other than the username and key. Do you have any suggestions about what the problem might be?

What happens if you put a print statement on the line before the mkdir?

On the topic of encoding categorical variables:
We can use n dummy variables (i.e. one for each possible category) or n-1 (i.e. omitting one and letting it be implicit in the bias term). Why isn’t there a clear preference for n-1 in order to reduce the dimensionality? (google “curse of dimensionality”) Is it because neural networks are flexible enough to work around it?
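For reference, pandas can produce either encoding (toy data; drop_first gives the n-1 form, with the dropped category absorbed into the constant/bias term):

```python
import pandas as pd

s = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

full = pd.get_dummies(s)                      # n columns: C, Q, S
reduced = pd.get_dummies(s, drop_first=True)  # n-1 columns: Q, S (C becomes the baseline)

print(list(full.columns))     # ['C', 'Q', 'S']
print(list(reduced.columns))  # ['Q', 'S']
```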

I had this same issue.
The if not path.exists(): check could be preventing the actual download from happening. I had this problem when my API token was borked but the folder had already been created by the first line inside that statement.

I would go into your terminal and remove the /root/.fastai/archive/bulldozers<WHATEVER> folder

Hey gang!
I’ve been blogging about every lesson and I recently published week 5. It includes a simple recap as well as a new tenacious animal for your inspiration.

Bias weights in each layer

I am a little confused by the biases in the deep version of the NN. My understanding is that each layer should have either a single bias unit or none, but that each node in the next layer has a weight for that single bias unit in the previous layer, as shown in the following diagram:

500px-Network3322

[From Rohan #5: What are bias units?. This is the fifth entry in my journey… | by Rohan Kapur | A Year of Artificial Intelligence]

That means that a layer with N nodes should have N trainable bias weights if the previous layer indeed had a bias unit. If there was no bias unit in the previous layer, there are no trainable bias weights.

For example, in the Excel version of the single layer NN for the Titanic Competition [starts at 1:16:25 in the Lesson 3 video] the input layer has a single bias unit (the ‘Ones’ column in the input) but there are two trainable bias parameters (the ‘Const’ column in the Parameter block, which has two rows) for the first (and only) hidden layer.

However, in the deep NN for this lesson, there is only a single trainable bias parameter for each of the hidden layers. If we were following the lead of the Excel version, there would be no bias parameters for the first hidden layer (because the input layer has no bias unit), 10 bias parameters for the second hidden layer (assuming we stay with the default architecture) and 2 bias parameters for the output layer. At least, that’s my understanding.

Here’s another example that uses the same pattern:

I modified the existing NN to have the extra bias parameters just described. The result was exactly the same as for the original NN, which I suppose isn’t surprising given that the original deep NN also gets the same result as the single-layer NN.

Irrespective of the result, am I wrong about the number of bias weights per layer?
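For what it’s worth, this matches how PyTorch counts them: each nn.Linear layer with N outputs owns N bias parameters, independent of any “bias unit” in the previous layer (the layer sizes below are illustrative, not the lesson’s exact architecture):

```python
import torch.nn as nn

# A two-hidden-layer net roughly like the lesson's deep model
net = nn.Sequential(
    nn.Linear(12, 10),  # hidden layer 1: 10 outputs -> 10 bias weights
    nn.ReLU(),
    nn.Linear(10, 10),  # hidden layer 2: 10 outputs -> 10 bias weights
    nn.ReLU(),
    nn.Linear(10, 1),   # output layer: 1 output -> 1 bias weight
)

for m in net:
    if isinstance(m, nn.Linear):
        print(m.bias.shape)
```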

I ran into the same issue (TypeError: can’t convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.) and came up with the same solution.

The video shows that get_dummies() filled in integers:

pd.get_dummies() has a parameter, dtype, which determines the output type of the dummy variables. The current default is bool.

So, if you run the code now, you get the following:

It is the combination of float columns and bool columns that creates the problem when the DataFrame is converted to a NumPy array via:

The output type is numpy.ndarray; unlike a DataFrame, all the elements must be the same type. When creating an ndarray from a DataFrame, the DataFrame’s values property has to find a common element type.

That is why the error message says TypeError: can’t convert np.ndarray of type numpy.object_.


The tensor constructor wants all the elements to be one of the following: The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

My best guess as to why the code works in the video is that the default for get_dummies() dtype parameter was int at the time and that the default is now bool.

The cleanest solution is to use the get_dummies() dtype parameter to set the dummy values to int.

And creating the tensor now works just fine:
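For anyone who can’t see the screenshots, a minimal sketch of the fix (toy columns standing in for the Titanic ones; assumes pandas and torch):

```python
import pandas as pd
import torch

df = pd.DataFrame({'Age': [22.0, 38.0], 'Sex': ['male', 'female']})

# Force integer dummies instead of the (newer) bool default
df = pd.get_dummies(df, columns=['Sex'], dtype=int)

# The float + int mix coerces cleanly, so the tensor constructor is happy
t = torch.tensor(df.values, dtype=torch.float)
print(t.dtype)  # torch.float32
```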


An alternative solution is to use DataFrame.to_numpy as instructed to do in the DataFrame.values documentation. Left on its own, it will also convert all the elements to object_, but you can provide a dtype parameter to force the type of the output. In this case, you set it to float and it will convert bool elements to float.

So, using get_dummies() with the dtype default of bool:

You can still create the tensor by using to_numpy() with dtype = float:
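A minimal sketch of this alternative (toy columns; assumes pandas and torch):

```python
import pandas as pd
import torch

df = pd.DataFrame({'Age': [22.0, 38.0], 'Sex': ['male', 'female']})
df = pd.get_dummies(df, columns=['Sex'])  # bool dummy columns (current default)

# to_numpy(dtype=float) coerces the bool columns to 0.0/1.0
t = torch.tensor(df.to_numpy(dtype=float), dtype=torch.float)
print(t[0].tolist())  # [22.0, 0.0, 1.0]
```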

Using sigmoid to adjust binary classification results requires an adjustment to the input.

In the Linear Model and Neural Net from Scratch notebook, Jeremy adds a call to sigmoid() to force the range of predicted values to remain strictly between 0 and 1.

Looking at the plot for sigmoid, I noticed the input value goes from minus infinity to infinity and is centered on zero, while the output ranges from 0 to 1.
[plot: the sigmoid curve]

The mapping drew my attention because an input value of 0 maps to an output of 0.5:

In [49]: torch.sigmoid(torch.tensor(0))
Out[49]: tensor(0.5000)

Any input greater than 0 maps to an output greater than 0.5.

In [51]: torch.sigmoid(torch.tensor(0.25))
Out[51]: tensor(0.5622)

We are performing binary classification in which a predicted value <= 0.5 means one class (Not Survived in this case) and a predicted value > 0.5 means the other (Survived). The result is that whereas a predicted value between 0 and 0.5 was previously classified as Not Survived, after being passed through sigmoid it becomes a Survived.
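The flip is easy to check numerically (using the notebook’s 0.5 threshold):

```python
import torch

x = torch.tensor(0.25)  # a raw prediction the 0.5 threshold calls "not survived"

flipped = torch.sigmoid(x)          # ~0.5622 -> crosses the threshold: now "survived"
preserved = torch.sigmoid(x - 0.5)  # ~0.4378 -> stays on the "not survived" side

assert flipped > 0.5 and preserved <= 0.5
```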

It is best illustrated by histograms.

Here is the distribution of predicted values before sigmoid:
[histogram: raw predicted values]

Let’s add, on top of that, the distribution of the predicted values after they have been passed through sigmoid:
[histogram: sigmoid outputs overlaid on the raw predictions]

The goal behind using sigmoid() was met: all the values are now within the [0,1] range (unlike the original predicted values).

Unfortunately, in addition to shrinking the range of values, the entire distribution was shifted right.

The obvious solution is to subtract 0.5 from the predicted values before being passed into sigmoid() so that the middle of the distribution is now at 0; here is what the adjusted sigmoid() output distribution looks like on top of the original predicted values:
[histogram: shifted-sigmoid outputs overlaid on the raw predictions]

This adjustment both brings in the range within [0,1] and keeps the center of the distribution where it was, maintaining the same classification for all of the values.

Here is how calc_preds() now looks:
def calc_preds(coeffs, indeps): return torch.sigmoid((indeps*coeffs).sum(axis=1) - 0.5)

I made the change to the notebook and carried it through, and it didn’t change any of the results in terms of loss function trajectory or achieved accuracy in all of the different implementations (linear model, matrix, neural net, and deep learning), despite that rather large shift in the distribution with respect to the classification criterion (i.e., predicted value > 0.5 for survived).

Even though there wasn’t any change in predictive accuracy for this problem, it is important not to change how predicted values are classified when applying transformations.