Problems with tabular learner - accuracy 59% always

Hello,

my name is Sascha and I am a newbie.
I first want to say thanks for this great software and the great course videos.

I'm trying to use the tabular learner for the first time, without success.
My problem is that the accuracy will not improve beyond 59%.

Data: I use predictions from 5 different models.
I have 5 columns with values between 0 and 1,
and a 6th column containing the target (Y).
220k rows are in the dataset.

Used code:

from fastai.tabular import *

dep_var = 'label'
cont_names = ['label_1', 'label_2', 'label_3', 'label_4', 'label_5']

data = (TabularList.from_df(ptrain, cont_names=cont_names)
            .split_by_rand_pct(seed=78)
            .label_from_df(cols=dep_var)
            .add_test(TabularList.from_df(ptest, cont_names=cont_names))
            .databunch())

learntab = tabular_learner(data, layers=[100,200,300], emb_drop=0., metrics=accuracy)

from fastai.callbacks import *

learntab.fit_one_cycle(3, 1e-2, callbacks=[SaveModelCallback(learntab, monitor='accuracy', mode='max'), CSVLogger(learntab, filename='ensemble')])

|epoch|train_loss|valid_loss|accuracy|time|
|---|---|---|---|---|
|0|0.679971|0.681155|0.593958|00:39|
|1|0.678074|116.269554|0.595208|00:46|
|2|0.672019|1.463650|0.596004|00:46|

data.show_batch()

|label_1|label_2|label_3|label_4|label_5|target|
|---|---|---|---|---|---|
|0.0000|0.1037|0.7720|0.1834|0.0008|0|
|0.9990|0.4025|0.4366|0.0093|0.0012|1|
|0.4017|0.0004|0.1756|0.2037|0.0066|0|
|0.9954|0.0009|0.2378|0.0168|0.0049|0|
|0.0068|0.0108|0.1024|0.1048|0.0053|0|

(cat_x,cont_x),y = next(iter(data.train_dl))
for o in (cat_x, cont_x, y): print(to_np(o[:5]))

[0 0 0 0 0]
[[2.129060e-02 2.522550e-06 4.058339e-02 1.559591e-01 1.870525e-03]
[9.996858e-01 2.867426e-03 8.365842e-01 3.037727e-02 6.861029e-03]
[9.870045e-01 9.996849e-01 3.015203e-02 3.265378e-02 4.049588e-03]
[2.546746e-02 9.881952e-01 3.170503e-02 9.409481e-02 5.366698e-03]
[9.984887e-01 1.000000e+00 5.436885e-01 1.209527e-02 6.735846e-03]]
[0 0 0 1 0]

learntab.data.valid_ds.items
array([0, 1, 2, 3, ..., 43989, 43990, 43991, 43992], dtype=object)

learntab.data.train_ds.items
array([0, 1, 2, 3, ..., 175971, 175972, 175973, 175974], dtype=object)

I use Kaggle kernels,
fastai version 1.0.50.post1.

I tried different layer structures from [10,20] to [100,200,300],
different dropouts, and with/without batch normalisation,
and I always got the same accuracy (59%) :wink:

Thanks for your help,
Sascha

Try these two things and report back …

  1. Remove all the callbacks from your fit calls (this is just to simplify things and ensure nothing there is causing the issue).
  2. Run learn.lr_find() before fitting your model. Make the lr you set in fit_one_cycle = the "best" LR reported from the LR finder (e.g., where the slope is steepest before going up).

If you look at your validation loss, it skyrockets in epoch 2 and is still worse in epoch 3 than where it started. Classic signs that the LR is too high.
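To illustrate the divergence pattern, here's a toy sketch in plain Python (not fastai): gradient descent on f(x) = x² shrinks the loss with a small learning rate but blows it up with a too-large one, which is the same shape as the epoch-2 spike in your table.

```python
def run_sgd(lr, steps=10, x=1.0):
    """Minimise f(x) = x^2 by gradient descent; the gradient is 2x."""
    for _ in range(steps):
        x = x - lr * 2 * x  # each step multiplies x by (1 - 2*lr)
    return x * x  # final loss

low = run_sgd(lr=0.1)   # |1 - 0.2| < 1, so the loss shrinks every step
high = run_sgd(lr=1.5)  # |1 - 3.0| = 2, so the loss quadruples every step
print(low, high)
```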

Also, what are you trying to predict? A Categorical or a Continuous dep. variable? You may need to update your DataBunch API code depending on the answer to that.

Hi wgpubs,

thanks for your help.
I already tried without the callbacks and with different learning rates from the LR finder,
but I'm always at 59% accuracy.
I try to predict a categorical (binary 0 or 1) value.

Now I tried XGBoost and guess what: the accuracy is 59.6% :slight_smile:

Maybe I messed up the pandas dataframes?
I have several csv files which are predicted training sets from several models.
Each file contains ID(filename) ,prediction (float between 0 and 1) and the label.

I loaded each file into a dataframe.
Set ID (filename of image) as index.
set_index(['id'])

Joined all frames together:

ptrain = pd.concat([df1, df2,df3,df4,df5], axis=1, join='inner')
ptrain.reset_index(inplace=True)

For the test data (predictions of the test set) I did the same.
I just added the missing column "label" with value 0 because the test rows have no label.

After this I reset the index "id" using
reset_index(inplace=True)
and removed the id column, which contained only a unique filename for each row.
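In case it helps to check my merge, here is a minimal sketch of what I did, with two tiny made-up prediction frames (column names and values are illustrative only):

```python
import pandas as pd

# Two stand-in per-model prediction files, already loaded and indexed by id.
df1 = pd.DataFrame({'id': ['a.jpg', 'b.jpg', 'c.jpg'],
                    'label_1': [0.10, 0.90, 0.40],
                    'label': [0, 1, 0]}).set_index('id')
df2 = pd.DataFrame({'id': ['a.jpg', 'b.jpg', 'c.jpg'],
                    'label_2': [0.20, 0.80, 0.30]}).set_index('id')

# join='inner' keeps only ids present in every frame; rows missing from any
# one model's file silently drop out, so the row count is worth checking.
ptrain = pd.concat([df1, df2], axis=1, join='inner')
ptrain.reset_index(inplace=True)
ptrain = ptrain.drop(columns='id')  # the filename is not a feature

print(len(ptrain), list(ptrain.columns))
```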

ptrain.head()

| |label_1|label_2|label_3|label_4|label_5|label|
|---|---|---|---|---|---|---|
|0|0.064980|0.996665|0.119776|0.018277|0.006172|0|
|1|0.000265|0.000263|0.668452|0.010450|0.003120|1|
|2|0.012991|0.009983|0.000413|0.022464|0.006293|0|
|3|0.000280|0.999998|0.218882|0.019264|0.004348|0|
|4|0.011094|0.999940|0.663192|0.102229|0.012752|0|

Maybe I misunderstood the concept.
My plan was:
Creating a merged table containing all "train set" predictions from all models (one column per model)
to train a tabular model.
Of course, some data has to be preserved as a validation set.

After this, predict on the test data (merged in the same format) using the trained tabular model.

Best regards,
Sascha

I'd take a look at the Rossmann notebook where the tabular bits are explored to get a better sense of whether you are perhaps right. I'd especially compare the Rossmann DataFrame with yours to see if anything looks off … because you're right: if the data isn't structured right, you're not going to get good results regardless of what model you use.

You are splitting randomly, which I found doesn't work too well with tabular data, particularly if it is time series. You need something like this:

def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=None):
    np.random.seed(seed)
    perm = np.random.permutation(df.index)
    m = len(df.index)
    train_end = int(train_percent * m)
    validate_end = int(validate_percent * m) + train_end
    train = df.loc[perm[:train_end]]
    validate = df.loc[perm[train_end:validate_end]]
    test = df.loc[perm[validate_end:]]
    return train, validate, test

train, validate, test = train_validate_test_split(working, train_percent=.8, validate_percent=0, seed=42)
train = train.sort_values(["date", "Field2", "Field3"])
test = test.sort_values(["date", "Field2", "Field3"])
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

Then just follow Rossmann in the next notebook:

cut = train_df['date'][(train_df['date'] == train_df['date'][len(test_df)])].index.max()
cut

date can be another sequentially increasing field.
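As a quick sanity check of the sizes the split produces, here it is run on a made-up 100-row frame (the function is copied verbatim from above; with validate_percent=0 the validate frame comes back empty):

```python
import numpy as np
import pandas as pd

def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=None):
    np.random.seed(seed)
    perm = np.random.permutation(df.index)
    m = len(df.index)
    train_end = int(train_percent * m)
    validate_end = int(validate_percent * m) + train_end
    train = df.loc[perm[:train_end]]
    validate = df.loc[perm[train_end:validate_end]]
    test = df.loc[perm[validate_end:]]
    return train, validate, test

df = pd.DataFrame({'x': range(100)})  # made-up frame
train, validate, test = train_validate_test_split(df, train_percent=.8,
                                                  validate_percent=0, seed=42)
print(len(train), len(validate), len(test))  # 80 0 20
```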

Hi,

I have encountered the same issue: my accuracy does not change at all.

I am trying to train a model where all of the tabular data is continuous, including my target.
I am using a tabular learner configured the same way as in lesson 4.

I have tried a lot more stuff, but nothing works.

Please help

Can we have a bit more information? If you are doing regression you should be following Rossmann (lesson 7), which is regression, not classification.

What I have is a large set of tabular data, where every column is a continuous variable (0 to 2000+), with a target being a value between 0 and 450.

One of the things to note is that a significant amount of data is zeroes

Nevertheless I was able to train a model in MS Azure Machine Learning Studio to a mean absolute error of 1.16.

Hey Keithhays,

Could you please let me know how I can implement your split function above? I have the same problem, but my accuracy is stuck at 0.803382.

dls = TabularDataLoaders.from_csv(
    "data.csv",
    y_names = "y",
    cont_names = cont_names_data,
    procs = [Normalize, FillMissing]
)
splits = RandomSplitter(valid_pct=0.2)(range_of(data))
to = TabularPandas(
    data,
    cont_names = cont_names_data,
    procs = [Normalize, FillMissing],
    splits = splits,
    y_names= 'y'
)

dls = to.dataloaders(bs=64)
dls.show_batch()

I have this right now. But how do I use your solution here? I have given it an attempt but don't know where to put the train and test. If you could help me that would be great!

def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=None):
    np.random.seed(seed)
    perm = np.random.permutation(df.index)
    m = len(df.index)
    train_end = int(train_percent * m)
    validate_end = int(validate_percent * m) + train_end
    train = df.loc[perm[:train_end]]
    validate = df.loc[perm[train_end:validate_end]]
    test = df.loc[perm[validate_end:]]
    return train, validate, test

train, validate, test = train_validate_test_split(data,train_percent=.8, validate_percent=0, seed=42)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)
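My attempt so far is to rebuild the (train_idx, valid_idx) index lists that the splits argument expects, sketched here with tiny stand-in frames (pandas only, names hypothetical):

```python
import pandas as pd

train = pd.DataFrame({'x': [1, 2, 3]})  # stand-ins for the real split frames
validate = pd.DataFrame({'x': [4, 5]})

# Stack train then validate, so the first len(train) row positions are the
# training indices and the remaining positions are the validation indices.
stacked = pd.concat([train, validate], ignore_index=True)
splits = (list(range(len(train))),
          list(range(len(train), len(stacked))))

print(splits)  # ([0, 1, 2], [3, 4])
```

The idea would be to hand stacked to TabularPandas and pass splits in place of the RandomSplitter call, but I'm not sure this is right.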

Please let me know how I pass in the new train and test.

Thank you,
Abhisar Anand

What kind of variable is your y? @Abhisar
Remember to pass y_block = CategoryBlock to TabularPandas if you have a classification problem.

Hey @marcossantana,
I am trying to train a time-series classification model: I have 178 pieces of data that I feed into the model, and the output should be either 1 or 0. I have modified the code; here is the latest version:

dls = TabularDataLoaders.from_csv(
    "data.csv",
    y_names = "y",
    cont_names = cont_names_data,
    procs = [Normalize, FillMissing, Categorify]
)
splits = RandomSplitter(valid_pct=0.2, seed=None)(range_of(data))

to = TabularPandas(
    data,
    cont_names = cont_names_data,
    procs = [Normalize, FillMissing, Categorify],
    splits = splits,
    y_block = CategoryBlock,
    y_names= "y"
)

dls = to.dataloaders(bs=256)
dls.show_batch()

Here I have tried to change y_names but it keeps failing and giving me several errors. Let me know if I am doing anything wrong.

Thank you,
Abhisar Anand

What kind of errors? I solved a similar problem a few minutes ago and the problem was with the CategoryBlock.
