Wiki/lesson thread: Lesson 2

Worked on Ubuntu as well. Thank you!

Hello, as Jeremy mentioned, we should practice a lot on other datasets. I tried to do it with the https://www.kaggle.com/camnugent/california-housing-prices data.
However, the results are quite poor. These are the steps:
df_raw.median_house_value = np.log(df_raw.median_house_value)

train_cats(df_raw)

df, y, _ = proc_df(df_raw, 'median_house_value')

n_valid = 6000
n_trn = len(df) - n_valid
X_train, X_valid = split_vals(df, n_trn)  # split as in the lesson notebook
y_train, y_valid = split_vals(y, n_trn)

m = RandomForestRegressor(n_estimators=20, max_depth=3)
%time m.fit(X_train, y_train)
print_score(m)

The result is:
[0.3588849961920751, 0.40417080529648597, 0.5900486565650618, 0.5157139520074927]

which is pretty bad. I know this looks like a linear regression problem, but Jeremy said that random forests come quite close to being a general-purpose machine learning algorithm, so why is it so bad? Did I miss something?


@jeremy Could you please tell us if split_vals is required? Also, where can we get complete documentation for the functions that were removed, so that we can understand what is available in the fast.ai library? Thanks

split_vals is required to have a validation set. In this case Jeremy chose his validation set to match Kaggle's, which is the most recent 12,000 entries. You can choose a different validation set if you want, but always make sure the validation set is fit for purpose. For example, in a time-series problem like Blue Book for Bulldozers, you always have to keep the most recent entries as your validation set, because that is what will happen in the real world. For classification problems, make sure your validation set is balanced and truly represents the test set and the real world in all cases.
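For reference, the split_vals from the lesson notebook is just a positional slice with no shuffling, which is exactly what keeps the most recent rows together as the validation set:

def split_vals(a, n):
    # first n rows for training; the remaining (most recent) rows for validation
    return a[:n].copy(), a[n:].copy()

n_valid = 12000               # same size as Kaggle's test set for bulldozers
n_trn = len(df) - n_valid
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)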

Maybe Docker helps.

To make it work with virtualenv (Windows for me):

import os
os.environ["PATH"] += os.pathsep + 'C:\\Users\\user\\.conda\\envs\\fastai\\Library\\bin\\graphviz\\'

Just uninstalling and reinstalling with conda didn’t work for me :slight_smile:


I am getting this error after starting Jupyter Notebook

@peterkoman
I had the same question, but Jeremy’s comment here clarified it.

In lesson 2 it was said that the next split is determined by finding the feature and split point that minimise the weighted average of the MSEs of the two resulting groups. Is it possible that a split which does not yield the immediate minimum MSE could still give a smaller MSE for the overall tree?
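To check my understanding, this is what I think the criterion computes for one candidate split (mse and split_score are just my own names for illustration):

import numpy as np

def mse(v):
    # squared error of predicting the group's mean
    return ((v - v.mean()) ** 2).mean() if len(v) else 0.0

def split_score(y, mask):
    # weighted average of the two groups' MSEs; the tree greedily picks the
    # feature/threshold whose mask minimises this, one split at a time
    n = len(y)
    return mask.sum() / n * mse(y[mask]) + (~mask).sum() / n * mse(y[~mask])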

Thanks for the information. I will try with the same data and I can tell you how it goes.

If you continue practicing with other datasets, maybe you can share your results with us, so we can try the same and see how it works.

Best regards!

@andreist I have more or less the same results… I’m fairly sure I haven’t messed anything up in the code…
I think maybe we can review all the code and step through the proc_df function to understand it, and see if there is anything different from the function Jeremy used.

Also, I get different values for the oob_score; the results are negative, different from Jeremy’s kernel.

I got an oob_score of -0.131 (using 10 estimators), with this message:

UserWarning: Some inputs do not have OOB scores. This probably means too few trees were used to compute any reliable oob estimates.
warn("Some inputs do not have OOB scores. "

Then I used 40 estimators, and I got an oob_score of 0.906 with 40 trees…
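This is roughly what I ran to compare (a sketch; X_train and y_train come from the proc_df/split step as before):

from sklearn.ensemble import RandomForestRegressor

for n in (10, 40):
    m = RandomForestRegressor(n_estimators=n, oob_score=True, n_jobs=-1)
    m.fit(X_train, y_train)
    # with few trees, some rows are never left out-of-bag by any tree, so
    # their OOB prediction is undefined and the OOB R^2 can even be negative
    print(n, m.oob_score_)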

Can someone help us?

I checked the proc_df function in the GitHub repo at


It is calling get_sample to get the subset, and get_sample actually shuffles the data. So I don’t understand how the issue is rectified?
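For reference, my reading of the function is roughly this (paraphrased from memory, so please check the repo for the exact code):

def get_sample(df, n):
    # permutes the row indices to pick a random subset of n rows; note the
    # sorted(), which returns the sampled indices in ascending order
    idxs = sorted(np.random.permutation(len(df))[:n])
    return df.iloc[idxs].copy()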

LINUX - UBUNTU
I solved this without having to install conda on my system:
sudo apt-get install graphviz
This should work.


I’m confused about what to do. I had no trouble downloading/uploading my data, but when I trained the model it had an error rate of 57%! Also, when I ran the line that says “learn.lr_find()” the results showed training losses consistently above 1.5 and the column for validation losses just read “#na#”. What does that mean? I tried changing my data sets but the problem persists. Thanks for the help!

learn.lr_find() only uses the training dataset, so don’t worry about the #na# in the validation column!
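For context, the fastai v1 pattern being discussed looks like this (a sketch, assuming you have already built a `data` ImageDataBunch):

from fastai.vision import *

learn = cnn_learner(data, models.resnet34, metrics=error_rate)
learn.lr_find()        # short sweep over candidate learning rates; it only
                       # runs on the training set, hence valid_loss = #na#
learn.recorder.plot()  # pick a rate where the loss curve slopes down steeply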
Regarding your error rate: you may be missing something; it’s hard to tell without looking at your steps and your data. Sometimes the problem itself is hard for the model to classify, so it’s possible you may not be able to push your accuracy beyond a point, but you can certainly improve (at least do better than a 57% error rate)! What kind of data are you working on?
Cheers!

Good to know! I’m using photos of giraffes and dogs that I downloaded from Google Images per the lecture’s instructions. I initially thought the problem was too hard because I was using photos of labradoodles and goldendoodles, but I had the same output when I replaced the goldendoodles with giraffes.

Oh, I see. Well, that’s quite a simple problem by modern deep learning standards, so you should definitely expect better outcomes. How do you build your model? Have you used a proper split? Is the image labelling correct in your data? You should check everything… you’ll probably find a mistake somewhere, and having solved that, your accuracy will go well above 90%!
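A quick way to audit the labels is with the fastai v1 interpretation tools from the lesson (a sketch, assuming a trained `learn` as above):

# mislabelled images tend to surface among the highest-loss examples
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_top_losses(9, figsize=(7, 7))
interp.plot_confusion_matrix()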

I found inaccuracies in my image labelling. Now I have an error rate of only 2%! Thanks for the help!
