Problems with tabular learner - accuracy 59% always

Hello,

my name is Sascha and I am a newbie.
I first want to say thanks for this great software and the great course videos.

I'm trying to use the tabular learner for the first time, without success.
My problem is that the accuracy will not improve beyond 59%.

Data: I use predictions from 5 different models.
I have 5 columns with values between 0 and 1,
and a 6th column containing the target (Y).
220k rows are in the dataset.

Used code:

from fastai.tabular import *

dep_var = 'label'
cont_names = ['label_1', 'label_2', 'label_3', 'label_4', 'label_5']

data = (TabularList.from_df(ptrain, cont_names=cont_names)
            .split_by_rand_pct(seed=78)
            .label_from_df(cols=dep_var)
            .add_test(TabularList.from_df(ptest, cont_names=cont_names))
            .databunch())

learntab = tabular_learner(data, layers=[100,200,300], emb_drop=0., metrics=accuracy)

from fastai.callbacks import *

learntab.fit_one_cycle(3, 1e-2, callbacks=[SaveModelCallback(learntab, monitor='accuracy', mode='max'), CSVLogger(learntab, filename='ensemble')])

|epoch|train_loss|valid_loss|accuracy|time|
|---|---|---|---|---|
|0|0.679971|0.681155|0.593958|00:39|
|1|0.678074|116.269554|0.595208|00:46|
|2|0.672019|1.463650|0.596004|00:46|

data.show_batch()

|label_1|label_2|label_3|label_4|label_5|target|
|---|---|---|---|---|---|
|0.0000|0.1037|0.7720|0.1834|0.0008|0|
|0.9990|0.4025|0.4366|0.0093|0.0012|1|
|0.4017|0.0004|0.1756|0.2037|0.0066|0|
|0.9954|0.0009|0.2378|0.0168|0.0049|0|
|0.0068|0.0108|0.1024|0.1048|0.0053|0|

(cat_x,cont_x),y = next(iter(data.train_dl))
for o in (cat_x, cont_x, y): print(to_np(o[:5]))

[0 0 0 0 0]
[[2.129060e-02 2.522550e-06 4.058339e-02 1.559591e-01 1.870525e-03]
[9.996858e-01 2.867426e-03 8.365842e-01 3.037727e-02 6.861029e-03]
[9.870045e-01 9.996849e-01 3.015203e-02 3.265378e-02 4.049588e-03]
[2.546746e-02 9.881952e-01 3.170503e-02 9.409481e-02 5.366698e-03]
[9.984887e-01 1.000000e+00 5.436885e-01 1.209527e-02 6.735846e-03]]
[0 0 0 1 0]

learntab.data.valid_ds.items
array([0, 1, 2, 3, ..., 43989, 43990, 43991, 43992], dtype=object)

learntab.data.train_ds.items
array([0, 1, 2, 3, ..., 175971, 175972, 175973, 175974], dtype=object)

I use Kaggle kernels,
fastai version 1.0.50.post1.

I tried different layer structures from [10,20] to [100,200,300],
different dropouts, and with/without batch normalisation,
and I always got the same accuracy (59%) :wink:

Thanks for your help,
Sascha

Try these two things and report back …

  1. Remove all the callbacks from your fit calls (this is just to simplify things and ensure nothing there is causing the issue).
  2. Run learn.lr_find() before fitting your model. Make the lr you set in fit_one_cycle = the "best" LR reported from the LR finder (e.g., where the slope is steepest before going up).

If you look at your validation loss, it skyrockets in epoch 2 and is still worse in epoch 3 than where it started. Classic signs that the LR is too high.
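To illustrate the divergence pattern, here's a toy sketch in plain Python (not fastai): gradient descent on f(x) = x² shrinks the loss with a small learning rate but blows it up with a too-large one, which is the same shape as the epoch-2 spike in your table.

```python
def run_sgd(lr, steps=10, x=1.0):
    """Minimise f(x) = x^2 by gradient descent; the gradient is 2x."""
    for _ in range(steps):
        x = x - lr * 2 * x  # each step multiplies x by (1 - 2*lr)
    return x * x  # final loss

low = run_sgd(lr=0.1)   # |1 - 0.2| < 1, so the loss shrinks every step
high = run_sgd(lr=1.5)  # |1 - 3.0| = 2, so the loss quadruples every step
print(low, high)
```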

Also, what are you trying to predict? A Categorical or a Continuous dep. variable? You may need to update your DataBunch API code depending on the answer to that.

Hi wgpubs,

thanks for your help.
I already tried without the callbacks and with different learning rates from the LR finder,
but I'm always at 59% accuracy.
I try to predict a categorical (binary 0 or 1) value.

Now I tried XGBoost and guess what: the accuracy is 59.6% :slight_smile:

Maybe I messed up the pandas dataframes?
I have several csv files which are predicted training sets from several models.
Each file contains ID(filename) ,prediction (float between 0 and 1) and the label.

I loaded each file into a dataframe.
Set ID (filename of image) as index.
set_index(['id'])

Joined all frames together:

ptrain = pd.concat([df1, df2,df3,df4,df5], axis=1, join='inner')
ptrain.reset_index(inplace=True)

For the test data (predictions of the test set) I did the same.
I just added the missing column "label" with value 0 because the test rows have no label.

After this I reset the index "id" using
reset_index(inplace=True)
and removed the id column, which contained only a unique filename for each row.
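In case it helps to check my merge, here is a minimal sketch of what I did, with two tiny made-up prediction frames (column names and values are illustrative only):

```python
import pandas as pd

# Two stand-in per-model prediction files, already loaded and indexed by id.
df1 = pd.DataFrame({'id': ['a.jpg', 'b.jpg', 'c.jpg'],
                    'label_1': [0.10, 0.90, 0.40],
                    'label': [0, 1, 0]}).set_index('id')
df2 = pd.DataFrame({'id': ['a.jpg', 'b.jpg', 'c.jpg'],
                    'label_2': [0.20, 0.80, 0.30]}).set_index('id')

# join='inner' keeps only ids present in every frame; rows missing from any
# one model's file silently drop out, so the row count is worth checking.
ptrain = pd.concat([df1, df2], axis=1, join='inner')
ptrain.reset_index(inplace=True)
ptrain = ptrain.drop(columns='id')  # the filename is not a feature

print(len(ptrain), list(ptrain.columns))
```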

ptrain.head()

| |label_1|label_2|label_3|label_4|label_5|label|
|---|---|---|---|---|---|---|
|0|0.064980|0.996665|0.119776|0.018277|0.006172|0|
|1|0.000265|0.000263|0.668452|0.010450|0.003120|1|
|2|0.012991|0.009983|0.000413|0.022464|0.006293|0|
|3|0.000280|0.999998|0.218882|0.019264|0.004348|0|
|4|0.011094|0.999940|0.663192|0.102229|0.012752|0|

Maybe I misunderstood the concept.
My plan was:
Creating a merged table containing all "train set" predictions from all models (one column per model)
to train a tabular model.
Of course, some data has to be preserved as a validation set.

After this, predict on the test data (merged in the same format) using the trained tabular model.

Best regards,
Sascha

I'd take a look at the Rossmann notebook where the tabular bits are explored to get a better sense of whether you are perhaps right. I'd especially compare the Rossmann DataFrame with yours to see if anything looks off … because you're right: if the data isn't structured right, you're not going to get good results regardless of what model you use.

You are splitting randomly, which I found doesn't work too well with tabular data, particularly if it is time series. You need something like this:

def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=None):
    np.random.seed(seed)
    perm = np.random.permutation(df.index)
    m = len(df.index)
    train_end = int(train_percent * m)
    validate_end = int(validate_percent * m) + train_end
    train = df.loc[perm[:train_end]]
    validate = df.loc[perm[train_end:validate_end]]
    test = df.loc[perm[validate_end:]]
    return train, validate, test

train, validate, test = train_validate_test_split(working, train_percent=.8, validate_percent=0, seed=42)
train = train.sort_values(["date", "Field2", "Field3"])
test = test.sort_values(["date", "Field2", "Field3"])
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

Then just follow Rossmann in the next notebook:

cut = train_df['date'][(train_df['date'] == train_df['date'][len(test_df)])].index.max()
cut

date can be another sequentially increasing field.
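As a quick sanity check of the sizes the split produces, here it is run on a made-up 100-row frame (the function is copied verbatim from above; with validate_percent=0 the validate frame comes back empty):

```python
import numpy as np
import pandas as pd

def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=None):
    np.random.seed(seed)
    perm = np.random.permutation(df.index)
    m = len(df.index)
    train_end = int(train_percent * m)
    validate_end = int(validate_percent * m) + train_end
    train = df.loc[perm[:train_end]]
    validate = df.loc[perm[train_end:validate_end]]
    test = df.loc[perm[validate_end:]]
    return train, validate, test

df = pd.DataFrame({'x': range(100)})  # made-up frame
train, validate, test = train_validate_test_split(df, train_percent=.8,
                                                  validate_percent=0, seed=42)
print(len(train), len(validate), len(test))  # 80 0 20
```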

Hi,

I have encountered the same issue: my accuracy does not change at all.

I am trying to train a model where all of the tabular data is continuous, including my target.
I am using a tabular learner configured the same way as in lesson 4.

I have tried a lot more stuff, but nothing works.

Please help

Can we have a bit more information? If you are doing regression you should be following Rossmann (lesson 7), which is regression, not classification.

What I have is a large set of tabular data, where every column is a continuous variable (0 to 2000+), with a target being a value between 0 and 450.

One of the things to note is that a significant amount of data is zeroes

Nevertheless I was able to train a model in MS Azure Machine Learning Studio to a mean absolute error of 1.16.

Hey Keithhays,

Could you please let me know how I can implement your split function above? I have the same problem, but my accuracy is stuck at 0.803382.

dls = TabularDataLoaders.from_csv(
    "data.csv",
    y_names = "y",
    cont_names = cont_names_data,
    procs = [Normalize, FillMissing]
)
splits = RandomSplitter(valid_pct=0.2)(range_of(data))
to = TabularPandas(
    data,
    cont_names = cont_names_data,
    procs = [Normalize, FillMissing],
    splits = splits,
    y_names= 'y'
)

dls = to.dataloaders(bs=64)
dls.show_batch()

I have this right now. But how do I use your solution here? I have given it an attempt but don't know where to put the train and test. If you could help me that would be great!

def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=None):
    np.random.seed(seed)
    perm = np.random.permutation(df.index)
    m = len(df.index)
    train_end = int(train_percent * m)
    validate_end = int(validate_percent * m) + train_end
    train = df.loc[perm[:train_end]]
    validate = df.loc[perm[train_end:validate_end]]
    test = df.loc[perm[validate_end:]]
    return train, validate, test

train, validate, test = train_validate_test_split(data,train_percent=.8, validate_percent=0, seed=42)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)
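My attempt so far is to rebuild the (train_idx, valid_idx) index lists that the splits argument expects, sketched here with tiny stand-in frames (pandas only, names hypothetical):

```python
import pandas as pd

train = pd.DataFrame({'x': [1, 2, 3]})  # stand-ins for the real split frames
validate = pd.DataFrame({'x': [4, 5]})

# Stack train then validate, so the first len(train) row positions are the
# training indices and the remaining positions are the validation indices.
stacked = pd.concat([train, validate], ignore_index=True)
splits = (list(range(len(train))),
          list(range(len(train), len(stacked))))

print(splits)  # ([0, 1, 2], [3, 4])
```

The idea would be to hand stacked to TabularPandas and pass splits in place of the RandomSplitter call, but I'm not sure this is right.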

Please let me know how I pass in the new train and test.

Thank you,
Abhisar Anand

What kind of variable is your y? @Abhisar
Remember to pass y_block = CategoryBlock to TabularPandas if you have a classification problem.

Hey @marcossantana,
I am trying to train a time-series classification model: I have 178 pieces of data that I feed into the model, and the output should be either 1 or 0. I have modified the code; here is the latest version:

dls = TabularDataLoaders.from_csv(
    "data.csv",
    y_names = "y",
    cont_names = cont_names_data,
    procs = [Normalize, FillMissing, Categorify]
)
splits = RandomSplitter(valid_pct=0.2, seed=None)(range_of(data))

to = TabularPandas(
    data,
    cont_names = cont_names_data,
    procs = [Normalize, FillMissing, Categorify],
    splits = splits,
    y_block = CategoryBlock,
    y_names= "y"
)

dls = to.dataloaders(bs=256)
dls.show_batch()

Here I have tried to change y_names but it keeps failing and giving me several errors. Let me know if I am doing anything wrong.

Thank you,
Abhisar Anand

What kind of errors? I solved a similar problem a few minutes ago and the problem was with the CategoryBlock.
