Wiki thread: lesson 1

Hi All. I had a play using fast.ai v1 instead and seem to be able to get everything to work. The short of it is that a bunch of functions from structured.py need to be copy + pasted over, and the feather loading is slightly different. Nothing else major had to change that I came across.

I have made a condensed gist of lessons 1 + 2 notebooks into one, that works with the current version of fast.ai. Hope it helps: https://gist.github.com/mnye/bb1653562b6e2d85ee44478cfdf0f5a1

I am not sure why these functions were completely thrown away from the repo, but there is a new tabular section for NN which might be worth taking a look at. It would be interesting to hear from @jeremy what his plan is for this course and in particular if things are in for a shake up now v1 is out?

Thanks for doing this! I would love to see a fastai v1 compatible version of the whole course. If there are important bits of missing functionality, I’d be happy to discuss ways to make them work. I’d like to find a more integrated way of doing things overall. fastai v1 is much more carefully designed than 0.7, so hopefully we can find neat ways of incorporating all the functionality required.

(This will require a community effort however - it’s not something I have time to do myself at the moment.)

I was secretly hoping you would have run a course this year with v1 or will be soon, and would update accordingly :slight_smile: As mentioned, there is just a handful of helper functions required (at least for the random forest portion of the course), so I think it would not be hard to keep it working / alive.

Integrating it with the new structure (which looks quite impressive!) I can’t comment on, but I’m hoping to play with the new features in the coming weeks. I have found the random forest portion of the course fascinating though (such a good insight despite already having been exposed to them previously) and it would be great to keep the simple functionality of them alive.

Can someone please help me download the data for lesson 1 of the machine learning course?
Kaggle is asking for my phone number, and I am in India, so the SMS PIN can't reach me.
Please help.

Hi @harrshjain – All 5 of the notebooks (and the associated datasets) for the ML lessons are available on Kaggle, if that helps.

Wow, you really saved the day!
Thank you, sir!

I am getting this error in Jupyter notebook, although I have already installed all the fastai packages and updated them. What should I do?

You want to use v0.7 of fastai, not v1. Make sure you have the right version of the libraries installed.

Hi,

When doing the initial processing of a dataframe, is it better to run the function add_datepart on all columns of dtype ‘datetime64’?

I’ve come up with the following loop, which runs add_datepart() on every column of the datetime dtype:

# df_raw and add_datepart come from the lesson notebook (fastai 0.7).
# Snapshot the column names first, since add_datepart adds and drops columns.
for col in list(df_raw.columns):
    if df_raw[col].dtype == '<M8[ns]':
        add_datepart(df_raw, col)

Do you think this is good?
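
One way to make the loop above reusable is to wrap it in a function that takes the per-column step as an argument. This is only a sketch: `apply_to_datetime_columns` and `year_only` are hypothetical names, and `year_only` is a stand-in for `add_datepart` so the example runs without fastai installed.

```python
import pandas as pd


def apply_to_datetime_columns(df, fn):
    """Call fn(df, column_name) for every datetime column of df."""
    # Snapshot the names first, because fn may add or drop columns.
    date_cols = list(df.select_dtypes(include=["datetime64"]).columns)
    for col in date_cols:
        fn(df, col)
    return date_cols


def year_only(df, col):
    """Hypothetical stand-in for add_datepart: keep the year, drop the original."""
    df[col + "_year"] = df[col].dt.year
    df.drop(columns=col, inplace=True)


df = pd.DataFrame({
    "saledate": pd.to_datetime(["2011-11-16", "2004-03-26"]),
    "price": [66000, 57000],  # made-up values
})
processed = apply_to_datetime_columns(df, year_only)
```

In the lesson notebook you would presumably pass `add_datepart` itself instead of `year_only`.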

Hi,

I would like to know how one can add one's own functions to the fastai library to make them available to all notebooks.

For example, I have written the following small function to convert every datetime column into categories:

# df_raw and add_datepart come from the lesson notebook (fastai 0.7).
# Snapshot the column names first, since add_datepart adds and drops columns.
for col in list(df_raw.columns):
    if df_raw[col].dtype == '<M8[ns]':
        add_datepart(df_raw, col)

I’d like to save it somewhere so I can use it in future notebooks. I guess I can just write a python file, but I don’t know where to save it. Also, I’m afraid that it will be overwritten whenever I git pull. Does anyone have any advice to give me on this?
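
One possible approach, sketched here rather than taken from the course: keep your helpers in a Python file in a directory outside the fastai repo (so a git pull never touches it) and put that directory on `sys.path` at the top of each notebook. The names `register_helper_dir`, `~/my_ml_utils` and `my_helpers` are made up for illustration.

```python
import sys
from pathlib import Path


def register_helper_dir(directory):
    """Put `directory` on sys.path so modules saved there can be imported."""
    path = str(Path(directory).expanduser().resolve())
    if path not in sys.path:
        sys.path.insert(0, path)
    return path


# Hypothetical usage in a notebook, assuming ~/my_ml_utils/my_helpers.py exists:
# register_helper_dir("~/my_ml_utils")
# from my_helpers import my_function
```

Setting the `PYTHONPATH` environment variable to the same directory should have a similar effect without any code in the notebook.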

Hi,

I’d like to know how best to handle the order of categories after running the function train_cats(). In lesson 1, Jeremy rearranges the order of the category ‘UsageBand’.

Do we have to look at each category created and update their order if it is wrong? It seems like a slow process to do this for each category column.

Does anyone have any experience with this?
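
A minimal sketch of one option, assuming only a few columns are genuinely ordinal: list the intended order for those columns in a dictionary and apply it in a loop, leaving every other categorical column as it is. The dictionary name `ordinal_orders` and the toy data are made up; only `UsageBand` and its High/Medium/Low values come from the lesson.

```python
import pandas as pd

df = pd.DataFrame({"UsageBand": ["Low", "High", None, "Medium"]})
# Roughly what train_cats does: turn string columns into pandas categoricals.
df["UsageBand"] = df["UsageBand"].astype("category")

# Hypothetical mapping: only columns with a meaningful order are listed here.
ordinal_orders = {"UsageBand": ["High", "Medium", "Low"]}

for col, order in ordinal_orders.items():
    df[col] = df[col].cat.set_categories(order, ordered=True)

codes = df["UsageBand"].cat.codes  # missing values get the code -1
```

With this layout, adding another ordinal column later means adding one dictionary entry rather than another block of code.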

Based on a fork of this and the work @mrbruce did above, I got Lesson 1 working in a kaggle kernel.

The kernel is here: https://www.kaggle.com/beezus666/fast-ai-machine-learning-lesson-1

Just one little tweak from @mrbruce’s work was I had to change is_string_dtype to pd.api.types.is_string_dtype

Also, I might be doing something wrong, as I’m getting pretty different results in some spots from what others were getting. I’m going to go through the lesson again with this working and see what’s what.

I just completed Lesson 1. In proc_df, why are we replacing the missing values with the median of the column?

In the notebook, SalePrice is converted to the log of SalePrice. But that means our predicted values are also log(SalePrice), so if we submit this output to Kaggle as it is, we will get a very bad result. So it may be better to compute the log inside the definition of RMSE.

def rmse(x,y): return math.sqrt(((np.log(x) - np.log(y)) ** 2).mean())

Correct me if I’m wrong
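
A small sketch of the two options, using made-up prices. Training on log(SalePrice) and taking a plain RMSE in log space gives the same score as taking the log inside the metric; in the first case the predictions just need `np.exp` before they go into a submission. The names `rmsle`, `prices` and `log_preds` are assumptions for illustration.

```python
import math

import numpy as np


def rmse(x, y):
    """Plain root mean squared error."""
    return math.sqrt(((x - y) ** 2).mean())


def rmsle(x, y):
    """RMSE computed on the logs of both inputs."""
    return math.sqrt(((np.log(x) - np.log(y)) ** 2).mean())


prices = np.array([10000.0, 25000.0, 60000.0])  # made-up actual prices
# Pretend these came from a model trained on log(SalePrice).
log_preds = np.log(np.array([11000.0, 24000.0, 58000.0]))

score_log_space = rmse(log_preds, np.log(prices))
score_price_space = rmsle(np.exp(log_preds), prices)

# Undo the log before writing predictions to a submission file.
submission = np.exp(log_preds)
```

Both scores are the same quantity; the only real difference is whether the model sees the raw or the logged target during training.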

Hi,
I have not been able to get past the first lesson for over three days. I keep getting this error, and I have tried my best to solve it. Please, I need help so I can progress in the course. I have tried on my local computer and also on Crestle, and I get the same error on both:
ModuleNotFoundError: No module named 'fastai.structured'

Great lecture. I have a question about using proc_df on test data. Let's say that in the train data the categories for the variable var1 are [NaN, 'A', 'B', 'C']. We run proc_df on the train data and get {NaN:0, 'A':1, 'B':2, 'C':3}. But let's say that in the test data the categories for var1 are [NaN, 'A', 'C'], with 'B' absent. If we apply proc_df to the test data, will we get {NaN:0, 'A':1, 'C':2}, which does not match the train data? Would that be an issue?

Thanks!
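
To illustrate the concern with plain pandas (a sketch, not the fastai internals): if the test column is encoded on its own, 'C' gets a different code than in train, but if the test column reuses the categories from train, the codes line up. If I remember correctly, fastai 0.7 has an apply_cats helper intended for this; the code below only assumes pandas.

```python
import pandas as pd

train = pd.DataFrame({"var1": [None, "A", "B", "C"]})
test = pd.DataFrame({"var1": [None, "A", "C"]})

train["var1"] = train["var1"].astype("category")

# Encoding test independently reproduces the mismatch: 'C' becomes 2.
independent_codes = test["var1"].astype("category").cat.codes + 1

# Reusing the train categories keeps 'C' at 3, even though 'B' is absent.
test["var1"] = pd.Categorical(test["var1"], categories=train["var1"].cat.categories)

train_codes = train["var1"].cat.codes + 1  # +1 so that NaN maps to 0
test_codes = test["var1"].cat.codes + 1
```

Note that a category present in test but never seen in train would be encoded as missing with this approach.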

Hi,

Could you please try running the following command in your notebook:
!pip install fastai==0.7.0

Note: Remove the ! from the beginning of the command if you are running pip install in your terminal.

The likely reason for this error is that you have fastai v1 installed on your system, while the course notebooks were written using fastai 0.7.0. Many APIs changed in v1, and it is not backward compatible.
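
A quick way to check which version the notebook's environment actually sees, sketched with the standard library (Python 3.8+). The helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


fastai_version = installed_version("fastai")
if fastai_version is None:
    print("fastai is not installed in this environment")
elif not fastai_version.startswith("0.7"):
    print(f"fastai {fastai_version} found; the course notebooks expect 0.7.x")
```

Running this inside the notebook matters, because the notebook kernel and the terminal do not always share the same environment.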

Also, if you are using fastai v1 for other projects, you can create a separate virtualenv for each project, each with its own version of fastai.

Reference: https://docs.python-guide.org/dev/virtualenvs/

Hope this solves your issue :slight_smile:

What I understood from the explanation was:

For categorical columns, missing values get the code -1 from pandas, so they are already handled, but for continuous columns the random forest won’t work correctly with missing values. Hence, we replace the missing values with the median of that particular column.

There are multiple ways to fill in missing values; here we have used the median of the column. One can use the mean or any other aggregate as well.
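
The idea can be sketched in plain pandas as follows. This is not the proc_df source, and the column name `age` and its values are made up. The extra boolean column records which rows were originally missing, so that information is not lost when the gap is filled.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [3.0, np.nan, 5.0, 10.0]})  # made-up values

# Record where the value was missing before filling it in.
df["age_na"] = df["age"].isna()

# Fill the gap with the median of the observed values.
df["age"] = df["age"].fillna(df["age"].median())
```

Swapping `median()` for `mean()` would give the mean-based variant mentioned above.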

Hi,

I can see people getting many different errors when setting up the ML course environment. I successfully reproduced about a dozen (roughly half) of them :-) (e.g. ModuleNotFoundError: No module named 'bcolz')
Compared to the DL courses, which just work once you start them, here I followed the instructions and tried various tricks, but nothing fully fixed it; I just got one error after another. I started from a new DL Ubuntu image on AWS (used p2large) and launched a new instance for each of several possible solutions.

I was wondering if there is a clean copy (container?) of the course somewhere, even one with older dependencies (fastai 0.7, whatever…). Until we have a working new version, that would be great…

Thanks very much,
Alex

Question about proc_df

Hi everyone,
I’ve got a question about how to handle the case where different columns contain NaN in the train and test data sets. I used the Kaggle data from the housing price competition. In the training data neither column A nor column B contains NaN, but in the test data column B has a few NaN values, so proc_df creates the column B_na. Now the test data set has one more column than the training set and can’t be used.

To make it work I just dropped all the feature_na columns that proc_df created in the test and training data sets. Is there a better way? Perhaps create a _na column for every column, containing only False when no value is NaN?

Thanks for your help,
Jonas
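
One possible alternative to dropping the columns, sketched in plain pandas: decide once which columns need a _na flag (any column with NaN in either set), take the medians from the training data, and apply the same fill to both sets so the columns always match. The function `fill_missing` and the toy frames are made up for illustration. As far as I recall, proc_df in fastai 0.7 also returns the medians it used and accepts them back through its na_dict argument, which is worth checking in structured.py.

```python
import numpy as np
import pandas as pd


def fill_missing(df, medians):
    """Add a <col>_na flag and fill NaN for every column listed in medians."""
    df = df.copy()
    for col, median in medians.items():
        df[col + "_na"] = df[col].isna()
        df[col] = df[col].fillna(median)
    return df


train = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]})
test = pd.DataFrame({"A": [4.0, 5.0], "B": [np.nan, 40.0]})

# Flag a column if it has NaN in either set; the medians come from train only.
na_cols = [c for c in train.columns
           if train[c].isna().any() or test[c].isna().any()]
medians = {c: train[c].median() for c in na_cols}

train_filled = fill_missing(train, medians)
test_filled = fill_missing(test, medians)
```

Here the training set gets a B_na column containing only False, which is essentially the idea suggested in the question.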
