Lesson 5 official topic

Think of it as instantiating a class without storing the instance in a variable, and then passing an argument directly to the new instance.

So instead of…

random_splitter = RandomSplitter(seed=42)
random_splitter(df)

…we’re doing…

RandomSplitter(seed=42)(df)
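This works in Python because the class defines `__call__`, which makes the instance itself callable. A minimal sketch (a toy stand-in, not fastai's actual implementation):

```python
import random

class RandomSplitter:
    """Toy stand-in for fastai's RandomSplitter: calling the instance does the split."""
    def __init__(self, seed=42):
        self.seed = seed

    def __call__(self, items):
        # Shuffle indices deterministically, then take an 80/20 train/valid split
        rng = random.Random(self.seed)
        idxs = list(range(len(items)))
        rng.shuffle(idxs)
        cut = int(len(idxs) * 0.8)
        return idxs[:cut], idxs[cut:]

# The two forms from the post are equivalent:
splitter = RandomSplitter(seed=42)
train_a, valid_a = splitter(list(range(10)))
train_b, valid_b = RandomSplitter(seed=42)(list(range(10)))
assert (train_a, valid_a) == (train_b, valid_b)
```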

I believe the Alone feature is supposed to be df['Alone'] = df.Family==0 i.e. if the number of people in the family is 0, then the person is alone?

>>> sorted(df['Family'].unique())
[0, 1, 2, 3, 4, 5, 6, 7, 10]

The problem with df['Alone'] = df.Family!=1 is that df['Alone'] will be True not only for 0 but also for every other value except 1.
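A tiny pandas sketch (with made-up Family values) makes the difference visible:

```python
import pandas as pd

df = pd.DataFrame({'Family': [0, 1, 2, 5]})

wrong = df.Family != 1  # True for 0, 2 and 5 -- not just for people who are alone
right = df.Family == 0  # True only when there are no other family members aboard

print(wrong.tolist())  # [True, False, True, True]
print(right.tolist())  # [True, False, False, False]
```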

I am not sure if I am missing something, but in the course22 repo, the ‘clean’ folder simply doesn’t exist. Where is it?

Hi all,

I’m working through the “Why you should use a framework” notebook and was unclear how Normalize works.

In this notebook we are working with Kaggle’s Titanic competition dataset, and create a DataLoaders object as follows:

dls = TabularPandas(
    df,
    splits=splits,
    procs=[Categorify, FillMissing, Normalize],
    cat_names=["Sex", "Pclass", "Embarked", "Deck", "Title"],
    cont_names=["Age", "SibSp", "Parch", "LogFare", "Alone", "TicketFreq", "Family"],
    y_names="Survived",
    y_block=CategoryBlock()
).dataloaders(path=".")

After I create dls I call .show_batch and notice that the Age column for example is not normalized:

Does that mean Normalize takes place during training?

The docs on TabularProc say:

These transforms are applied as soon as the data is available rather than as data is called from the DataLoader

However I’m not sure what that means.
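As I understand it, the procs run once, eagerly, when the TabularPandas object is built, and show_batch displays decoded (original-scale) values, which is why Age looks un-normalized there. What Normalize itself does can be sketched in plain pandas (toy data; the key point is that the training split's statistics are computed once and reused for every split):

```python
import pandas as pd

df = pd.DataFrame({'Age': [22.0, 38.0, 26.0, 35.0, 54.0, 2.0]})
train_idx, valid_idx = [0, 1, 2, 3], [4, 5]

# Normalize stores the *training-set* statistics once, up front...
mean = df.loc[train_idx, 'Age'].mean()
std = df.loc[train_idx, 'Age'].std()

# ...and applies them to every split, so validation/test rows use the same scaling
normed = (df['Age'] - mean) / std
print(normed.round(3).tolist())
```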

When I was running this notebook, I got an error: get_dummies() produces True/False instead of 1/0s.

I had to change cell 17 from

indep_cols = ['Age', 'SibSp', 'Parch', 'LogFare'] + added_cols

t_indep = tensor(df[indep_cols].values, dtype=torch.float)
t_indep

to

indep_cols = ['Age', 'SibSp', 'Parch', 'LogFare'] + added_cols

t_indep = tensor(df[indep_cols].astype(float).values, dtype=torch.float)
t_indep

to make it work. Do you have the same issue, or is it a version problem with my local Python libraries?

Hi everyone,

Recently I tried to run the code from 09_tabular.ipynb in Colab. There were some obstacles.

Maybe someone will find this useful :slight_smile:

  1. There is a part where we use the dtreeviz library. In my instance it didn’t run as-is, but after rewriting it a little it works:
!pip install -q -U dtreeviz
import dtreeviz
samp_idx = np.random.permutation(len(y))[:500]

viz_cmodel = dtreeviz.model(m,
                           tree_index=3,
                           X_train=xs.iloc[samp_idx],
                           y_train=y.iloc[samp_idx],
                           feature_names=xs.columns,
                           target_name=dep_var)
viz_cmodel.view(scale=3)
# and also with different flags
viz_cmodel.view(orientation='LR', scale=3)

I found all of this, plus more examples, in the TensorFlow docs.

FYI: there are a lot of warnings: “WARNING:matplotlib.font_manager:findfont: Font family ‘Arial’ not found.” I tried some things from Stack Overflow but nothing helped; if you figure it out, please let me know :slight_smile: (!sudo apt install msttcorefonts -qq, !rm ~/.cache/matplotlib -rf, !sudo apt install font-manager)

  2. (minor) Somehow Colab didn’t want to download the files, but after I changed the path to another place in Google Drive everything worked.

  3. I also rewrote the partial-dependence plotting like this:

from sklearn.inspection import PartialDependenceDisplay

fig,ax = plt.subplots(figsize=(12, 4))
PartialDependenceDisplay.from_estimator(m, valid_xs_final, ['YearMade','ProductSize'],
                        grid_resolution=20, ax=ax);

Yes, that is a faster way of doing it, though he loses many people who don’t program all day long.

I’ve got the same issue. I’ve run this before with no problem at all. It also just worked in the YouTube lesson.

During lesson 5, Jeremy mentioned that when using k-1 dummy variables you have to include a constant. Does anyone know why this is?

So, following this lecture, I went one step deeper and built a neural network entirely from scratch (no PyTorch, TensorFlow, or Keras — just NumPy and pandas) to use on the Titanic dataset. Pretty good results for an NN, as I got the same result as when the model was a carefully designed scikit-learn DecisionTreeClassifier (see “Results” in the notebook).

See the notebook here: ANN in NumPy on Titanic Dataset [0.765] | Kaggle

Here are the results I’ve gotten so far (from-scratch_submission.csv was this model):

If you liked the notebook, please give it an upvote! Thanks!


Great question! It got me thinking too.
(Source: ChatGPT)
A constant is included and optimized to give a value when all the dummy variables/predictors (independent variables) are zero. A constant need not be added when you expect the output to be zero whenever the predictors are zero.

These are two reasons I understood and found intuitive among other reasons.

Hello, I wanted to read chapter 9 of the book but when i try to launch this code:

from kaggle import api

if not path.exists():
    path.mkdir(parents=True)
    api.competition_download_cli(comp, path=path)
    shutil.unpack_archive(str(path/f'{comp}.zip'), str(path))

#path.ls(file_type='text')
path.ls()

the output i receive is this:
(#0) []
so it does not download anything. I joined the Kaggle competition and provided my username and key for authentication, and I don’t think I changed the code other than the username and key. Do you have any suggestions about what the problem might be?

What happens if you put a print statement on the line before the mkdir?

On the topic of encoding categorical variables:
We can use n dummy variables (i.e. one for each possible category) or n-1 (i.e. omitting one and letting it be implicit in the bias term). Why isn’t there a clear preference for n-1 in order to reduce the dimensionality? (google “curse of dimensionality”) Is it because neural networks are flexible enough to work around it?
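For reference, pandas can produce either encoding (toy data; drop_first gives the n-1 form, with the dropped category absorbed into the constant/bias term):

```python
import pandas as pd

s = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

full = pd.get_dummies(s)                      # n columns: C, Q, S
reduced = pd.get_dummies(s, drop_first=True)  # n-1 columns: Q, S (C becomes the baseline)

print(list(full.columns))     # ['C', 'Q', 'S']
print(list(reduced.columns))  # ['Q', 'S']
```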

I had this same issue.
The if not path.exists(): check could be preventing the actual download from happening. I had this problem when my API token was borked but the folder had already been created by the first line inside that statement.

I would go into your terminal and remove the /root/.fastai/archive/bulldozers<WHATEVER> folder

Hey gang!
I’ve been blogging about every lesson and I recently published week 5. It includes a simple recap as well as a new tenacious animal for your inspiration.

Bias weights in each layer

I am a little confused by the biases in the deep version of the NN. My understanding is that each layer should have either a single bias unit or none, but that each node in the next layer has a weight for that single bias unit in the previous layer, as shown in the following diagram:

500px-Network3322

[From Rohan #5: What are bias units?. This is the fifth entry in my journey… | by Rohan Kapur | A Year of Artificial Intelligence]

That means that a layer with N nodes should have N trainable bias weights if the previous layer indeed had a bias unit. If there was no bias unit in the previous layer, there are no trainable bias weights.

For example, in the Excel version of the single layer NN for the Titanic Competition [starts at 1:16:25 in the Lesson 3 video] the input layer has a single bias unit (the ‘Ones’ column in the input) but there are two trainable bias parameters (the ‘Const’ column in the Parameter block, which has two rows) for the first (and only) hidden layer.

However, in the deep NN for this lesson, there is only a single trainable bias parameter for each of the hidden layers. If we were following the lead of the Excel version, there would be no bias parameters for the first hidden layer (because the input layer has no bias unit), 10 bias parameters for the second hidden layer (assuming we stay with the default architecture) and 2 bias parameters for the output layer. At least, that’s my understanding.

Here’s another example that uses the same pattern:

I modified the existing NN to have the extra bias parameters just described. The result was exactly the same as for the original NN, which I suppose isn’t surprising given that the original deep NN also gets the same result as the single-layer NN.

Irrespective of the result, am I wrong about the number of bias weights per layer?
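For what it’s worth, this matches how PyTorch counts them: each nn.Linear layer with N outputs owns N bias parameters, independent of any “bias unit” in the previous layer (the layer sizes below are illustrative, not the lesson’s exact architecture):

```python
import torch.nn as nn

# A two-hidden-layer net roughly like the lesson's deep model
net = nn.Sequential(
    nn.Linear(12, 10),  # hidden layer 1: 10 outputs -> 10 bias weights
    nn.ReLU(),
    nn.Linear(10, 10),  # hidden layer 2: 10 outputs -> 10 bias weights
    nn.ReLU(),
    nn.Linear(10, 1),   # output layer: 1 output -> 1 bias weight
)

for m in net:
    if isinstance(m, nn.Linear):
        print(m.bias.shape)
```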

I ran into the same issue (TypeError: can’t convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.) and came up with the same solution.

The video shows that get_dummies() filled in integers:

pd.get_dummies() has a parameter, dtype, which determines the output type of the dummy variables. The current default is bool.

So, if you run the code now, you get the following:

It is the combination of float columns and bool columns that creates the problem when the DataFrame is converted to a NumPy array via:

The output type is numpy.ndarray; unlike a DataFrame, all the elements must be the same type. When creating an ndarray from a DataFrame, the DataFrame’s values property has to find a common element type.

That is why the error message says TypeError: can’t convert np.ndarray of type numpy.object_.


The tensor constructor wants all the elements to be one of the following: The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

My best guess as to why the code works in the video is that the default for get_dummies() dtype parameter was int at the time and that the default is now bool.

The cleanest solution is to use the get_dummies() dtype parameter to set the dummy values to int.

And creating the tensor now works just fine:
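For anyone who can’t see the screenshots, a minimal sketch of the fix (toy columns standing in for the Titanic ones; assumes pandas and torch):

```python
import pandas as pd
import torch

df = pd.DataFrame({'Age': [22.0, 38.0], 'Sex': ['male', 'female']})

# Force integer dummies instead of the (newer) bool default
df = pd.get_dummies(df, columns=['Sex'], dtype=int)

# The float + int mix coerces cleanly, so the tensor constructor is happy
t = torch.tensor(df.values, dtype=torch.float)
print(t.dtype)  # torch.float32
```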


An alternative solution is to use DataFrame.to_numpy as instructed to do in the DataFrame.values documentation. Left on its own, it will also convert all the elements to object_, but you can provide a dtype parameter to force the type of the output. In this case, you set it to float and it will convert bool elements to float.

So, using get_dummies() with the dtype default of bool:

You can still create the tensor by using to_numpy() with dtype = float:
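A minimal sketch of this alternative (toy columns; assumes pandas and torch):

```python
import pandas as pd
import torch

df = pd.DataFrame({'Age': [22.0, 38.0], 'Sex': ['male', 'female']})
df = pd.get_dummies(df, columns=['Sex'])  # bool dummy columns (current default)

# to_numpy(dtype=float) coerces the bool columns to 0.0/1.0
t = torch.tensor(df.to_numpy(dtype=float), dtype=torch.float)
print(t[0].tolist())  # [22.0, 0.0, 1.0]
```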

Using sigmoid to adjust binary classification results requires an adjustment to the input.

In the Linear Model and Neural Net from Scratch notebook, Jeremy adds a call to sigmoid() to force the range of predicted values to remain strictly between 0 and 1.

Looking at the plot for sigmoid, I noticed the input value goes from minus infinity to infinity and is centered on zero, while the output ranges from 0 to 1.
[plot: the sigmoid curve]

The mapping drew my attention because an input value of 0 maps to an output of 0.5:

In [49]: torch.sigmoid(torch.tensor(0))
Out[49]: tensor(0.5000)

Any input greater than 0 maps to an output greater than 0.5.

In [51]: torch.sigmoid(torch.tensor(0.25))
Out[51]: tensor(0.5622)

We are performing binary classification in which a predicted value <= 0.5 means one class (Not Survived in this case) and a predicted value > 0.5 means the other (Survived). The result is that whereas a predicted value between 0 and 0.5 was previously classified as Not Survived, after being passed through sigmoid it becomes a Survived.
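The flip is easy to check numerically (using the notebook’s 0.5 threshold):

```python
import torch

x = torch.tensor(0.25)  # a raw prediction the 0.5 threshold calls "not survived"

flipped = torch.sigmoid(x)          # ~0.5622 -> crosses the threshold: now "survived"
preserved = torch.sigmoid(x - 0.5)  # ~0.4378 -> stays on the "not survived" side

assert flipped > 0.5 and preserved <= 0.5
```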

It is best illustrated by histograms.

Here is the distribution of predicted values before sigmoid:
[histogram: raw predicted values]

Let’s add, on top of that, the distribution of the predicted values after they have been passed through sigmoid:
[histogram: sigmoid outputs overlaid on the raw predictions]

The goal behind using sigmoid() was met: all the values are now within the [0,1] range (unlike the original predicted values).

Unfortunately, in addition to shrinking the range of values, the entire distribution was shifted right.

The obvious solution is to subtract 0.5 from the predicted values before being passed into sigmoid() so that the middle of the distribution is now at 0; here is what the adjusted sigmoid() output distribution looks like on top of the original predicted values:
[histogram: shifted-sigmoid outputs overlaid on the raw predictions]

This adjustment both brings in the range within [0,1] and keeps the center of the distribution where it was, maintaining the same classification for all of the values.

Here is how calc_preds() now looks:
def calc_preds(coeffs, indeps): return torch.sigmoid((indeps*coeffs).sum(axis=1) - 0.5)

I made the change to the notebook and carried it through, and it didn’t change any of the results in terms of loss function trajectory or achieved accuracy in all of the different implementations (linear model, matrix, neural net, and deep learning), despite that rather large shift in the distribution with respect to the classification criterion (i.e., predicted value > 0.5 for survived).

Even though there wasn’t any change in predictive accuracy for this problem, it is important not to change how predicted values are classified when applying transformations.