Problems with tabular learner - accuracy 59% always

Hello,

my name is Sascha and I am a newbie.
I first want to say thanks for this great software and the great course videos.

I'm trying to use the tabular learner for the first time, without success.
My problem is that the accuracy will not improve beyond 59%.

Data: I use predictions from 5 different models.
I have 5 columns with values between 0 and 1
And a 6th column containing the target (Y).
220k rows are in the dataset.

Used code:

from fastai.tabular import *

dep_var = 'label'
cont_names = ['label_1', 'label_2', 'label_3', 'label_4', 'label_5']

data = (TabularList.from_df(ptrain, cont_names=cont_names)
            .split_by_rand_pct(seed=78)
            .label_from_df(cols=dep_var)
            .add_test(TabularList.from_df(ptest, cont_names=cont_names))
            .databunch())

learntab = tabular_learner(data,layers=[100,200,300],emb_drop=0.,metrics=accuracy)

from fastai.callbacks import *

learntab.fit_one_cycle(3, 1e-2,callbacks=[SaveModelCallback(learntab,monitor='accuracy',mode='max'),CSVLogger(learntab,filename='ensemble')])

|epoch|train_loss|valid_loss|accuracy|time|
|0|0.679971|0.681155|0.593958|00:39|
|1|0.678074|116.269554|0.595208|00:46|
|2|0.672019|1.463650|0.596004|00:46|

data.show_batch()

|label_1|label_2|label_3|label_4|label_5|target|

|0.0000|0.1037|0.7720|0.1834|0.0008|0|
|0.9990|0.4025|0.4366|0.0093|0.0012|1|
|0.4017|0.0004|0.1756|0.2037|0.0066|0|
|0.9954|0.0009|0.2378|0.0168|0.0049|0|
|0.0068|0.0108|0.1024|0.1048|0.0053|0|

(cat_x,cont_x),y = next(iter(data.train_dl))
for o in (cat_x, cont_x, y): print(to_np(o[:5]))

[0 0 0 0 0]
[[2.129060e-02 2.522550e-06 4.058339e-02 1.559591e-01 1.870525e-03]
[9.996858e-01 2.867426e-03 8.365842e-01 3.037727e-02 6.861029e-03]
[9.870045e-01 9.996849e-01 3.015203e-02 3.265378e-02 4.049588e-03]
[2.546746e-02 9.881952e-01 3.170503e-02 9.409481e-02 5.366698e-03]
[9.984887e-01 1.000000e+00 5.436885e-01 1.209527e-02 6.735846e-03]]
[0 0 0 1 0]

learntab.data.valid_ds.items
array([0, 1, 2, 3, ..., 43989, 43990, 43991, 43992], dtype=object)

learntab.data.train_ds.items
array([0, 1, 2, 3, ..., 175971, 175972, 175973, 175974], dtype=object)

I use Kaggle kernels
fastai version 1.0.50.post1

I tried different layer structures, from [10, 20] to [100, 200, 300],
different dropouts, and with/without batch normalisation,
and I always got the same accuracy (59%) :wink:

Thanks for your help,
Sascha

Try these two things and report back …

  1. Remove all the callbacks from your fit calls (this is just to simplify things and ensure nothing there is causing the issue).
  2. Run learn.lr_find() before fitting your model. Set the LR you pass to fit_one_cycle to the “best” LR reported by the LR finder (e.g., where the slope is steepest before the loss goes up).

If you look at your validation loss, it skyrockets in the second epoch and in the third it is still worse than where it started. Those are classic signs that the LR is too high.
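To make the "LR too high" symptom concrete, here is a minimal sketch in plain Python, not fastai. It runs gradient descent on the toy loss f(w) = w**2, which is an assumption chosen only because the behaviour is easy to follow: a small step size shrinks the loss, while a step size that is too large makes every update overshoot and the loss grow.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimise the toy loss f(w) = w**2 and return the loss after each step."""
    w = w0
    losses = []
    for _ in range(steps):
        grad = 2.0 * w        # derivative of w**2
        w = w - lr * grad     # plain gradient descent update
        losses.append(w * w)
    return losses

small_lr = gradient_descent(lr=0.1)  # each step multiplies w by 0.8
large_lr = gradient_descent(lr=1.1)  # each step multiplies w by -1.2

print(f"lr=0.1: first loss {small_lr[0]:.4f}, last loss {small_lr[-1]:.6f}")
print(f"lr=1.1: first loss {large_lr[0]:.4f}, last loss {large_lr[-1]:.2f}")
```

A real network is not a one-dimensional quadratic, so this only illustrates the mechanism; the LR finder plot remains the practical way to pick a value.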

Also, what are you trying to predict? A Categorical or a Continuous dep. variable? You may need to update your DataBunch API code depending on the answer to that.

Hi wgpubs,

thanks for your help.
I already tried it without the callbacks and with different learning rates from the LR finder,
but I'm always at 59% accuracy.
I am trying to predict a categorical (binary, 0 or 1) value.

Now I tried XGBoost and guess what: the accuracy is 59.6% :slight_smile:
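When two very different models land on the same number, one check that may help is to compare that number with two simple baselines: always predicting the most frequent class, and thresholding each base model's probability at 0.5 on its own. The sketch below assumes a frame shaped like ptrain; the values are made up for illustration and only two of the five prediction columns are included.

```python
import pandas as pd

# Hypothetical stand-in for ptrain; column names follow the post, values are invented.
ptrain = pd.DataFrame({
    "label_1": [0.02, 0.99, 0.40, 0.97, 0.01, 0.85, 0.10, 0.60, 0.05, 0.95],
    "label_2": [0.10, 0.40, 0.77, 0.001, 0.01, 0.90, 0.20, 0.30, 0.60, 0.80],
    "label":   [0, 1, 0, 0, 0, 1, 0, 1, 0, 1],
})

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = float(ptrain["label"].value_counts(normalize=True).max())

# Accuracy of each base model alone, thresholding its probability at 0.5.
pred_cols = [c for c in ptrain.columns if c != "label"]
single_model_acc = {
    c: float(((ptrain[c] > 0.5).astype(int) == ptrain["label"]).mean())
    for c in pred_cols
}

print("majority-class baseline:", majority_baseline)
print("per-model accuracy:", single_model_acc)
```

If the stacked model only matches the majority-class baseline while the individual models score clearly higher on their own data, the rows and labels are probably misaligned somewhere in the merge.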

Maybe I messed up the pandas dataframes?
I have several CSV files, which are the training-set predictions from several models.
Each file contains an ID (the filename), a prediction (float between 0 and 1) and the label.

I loaded each file into a dataframe.
Set ID (filename of image) as index.
set_index(['id'])

Joined all frames together:

ptrain = pd.concat([df1, df2,df3,df4,df5], axis=1, join='inner')
ptrain.reset_index(inplace=True)

For the test data (predictions on the test set) I did the same.
I just added the missing column “label” with value 0 because the test rows have no label.

After this I reset the index “id” using
reset_index(inplace=True)
and removed the id column, which contained only a unique filename for each row.
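A minimal sketch of the merge described above, with one extra check: since every file carries its own copy of the label, the copies should agree for each id after joining. The column names "prediction" and "label", the helper names true_1, true_2 and the two tiny frames are assumptions for illustration; the real files may be named differently.

```python
import pandas as pd

# Hypothetical per-model files, already loaded; each has id, prediction, label.
df1 = pd.DataFrame({"id": ["a.png", "b.png", "c.png"],
                    "prediction": [0.1, 0.9, 0.4],
                    "label": [0, 1, 0]})
df2 = pd.DataFrame({"id": ["c.png", "a.png", "b.png"],   # different row order
                    "prediction": [0.3, 0.2, 0.8],
                    "label": [0, 0, 1]})

frames = []
for i, df in enumerate([df1, df2], start=1):
    # Index by id and give each model its own column names before joining.
    frames.append(df.set_index("id").rename(
        columns={"prediction": f"label_{i}", "label": f"true_{i}"}))

# Align rows on the id index; keep only ids present in every file.
merged = pd.concat(frames, axis=1, join="inner")

# Every file should carry the same ground truth for a given id.
true_cols = [c for c in merged.columns if c.startswith("true_")]
labels_agree = bool(merged[true_cols].nunique(axis=1).eq(1).all())

# Keep one label column and drop the id index, as in the post.
ptrain = merged.drop(columns=true_cols)
ptrain["label"] = merged[true_cols[0]]
ptrain = ptrain.reset_index(drop=True)

print("labels agree across files:", labels_agree)
print(ptrain)
```

If labels_agree comes out False on the real data, the files disagree about the ground truth for some ids, which would be worth resolving before training anything on the merged table.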

ptrain.head()

| |label_1|label_2|label_3|label_4|label_5|label|

|0|0.064980|0.996665|0.119776|0.018277|0.006172|0|
|1|0.000265|0.000263|0.668452|0.010450|0.003120|1|
|2|0.012991|0.009983|0.000413|0.022464|0.006293|0|
|3|0.000280|0.999998|0.218882|0.019264|0.004348|0|
|4|0.011094|0.999940|0.663192|0.102229|0.012752|0|

Maybe I misunderstood the concept.
My plan was:
Creating a merged table containing all “train set” predictions of all models (each model one column)
to train a tabular model.
Of course some data has to be preserved as validation set.

After this, predict using test data (merged in same format) using the trained tabular model.

Best regards,
Sascha

I’d take a look at the Rossmann notebook, where the tabular bits are explored, to get a better sense of whether you are right. I’d especially compare the Rossmann DataFrame with yours to see if anything looks off … because you’re right: if the data isn’t structured correctly, you’re not going to get good results regardless of what model you use.

You are splitting randomly, which I found doesn’t work too well with tabular data, particularly if it is a time series. You need something like this:
import numpy as np

def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=None):
    # Shuffle the index once, then cut it into train / validate / test slices.
    np.random.seed(seed)
    perm = np.random.permutation(df.index)
    m = len(df.index)
    train_end = int(train_percent * m)
    validate_end = int(validate_percent * m) + train_end
    train = df.loc[perm[:train_end]]
    validate = df.loc[perm[train_end:validate_end]]
    test = df.loc[perm[validate_end:]]
    return train, validate, test

train, validate, test = train_validate_test_split(working, train_percent=.8, validate_percent=0, seed=42)
train = train.sort_values(["date", "Field2", "Field3"])
test = test.sort_values(["date", "Field2", "Field3"])
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)
Then just follow Rossmann in the next notebook
cut = train_df['date'][(train_df['date'] == train_df['date'][len(test_df)])].index.max()
cut

date can be another sequentially increasing field.
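For comparison, a minimal sketch of a split that holds out the most recent rows rather than a random sample, assuming the frame has a sortable column such as date. The helper split_by_time and the toy frame are hypothetical and not part of fastai or the Rossmann notebook.

```python
import pandas as pd

def split_by_time(df, order_col, valid_pct=0.2):
    """Sort by order_col and hold out the last valid_pct of rows for validation."""
    ordered = df.sort_values(order_col).reset_index(drop=True)
    cut = len(ordered) - int(len(ordered) * valid_pct)
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Toy frame with dates deliberately stored in reverse order.
working = pd.DataFrame({
    "date": pd.date_range("2019-01-01", periods=10, freq="D")[::-1],
    "value": range(10),
})

train, valid = split_by_time(working, "date", valid_pct=0.2)
print(len(train), "training rows up to", train["date"].max().date())
print(len(valid), "validation rows from", valid["date"].min().date())
```

This only makes sense when the rows have a natural order; for a table of independent per-image predictions, a random split is not obviously wrong.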

Hi,

I have encountered the same issue: my accuracy does not change at all.

I am trying to train a model where all of the tabular data is continuous, including my target.
I am using a tabular learner configured the same way as in lesson 4.

I have tried a lot more stuff, but nothing works.

Please help

Can we have a bit more information? If you are doing regression you should be following Rossmann (lesson 7), which is regression, not classification.

What I have is a large set of tabular data where every column is a continuous variable (0 to 2000+), with the target being a value between 0 and 450.

One thing to note is that a significant amount of the data is zeroes.

Nevertheless, I was able to train a model in MS Azure Machine Learning Studio to a Mean Absolute Error of 1.16.
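For a continuous target, an error metric such as MAE is the meaningful number to watch, and with many zeroes in the data it helps to know what a trivial constant prediction scores. A minimal NumPy sketch with invented target values; the helper mean_absolute_error is defined here and is not a fastai function.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Invented zero-heavy target in the 0..450 range described above.
y_true = np.array([0, 0, 0, 0, 0, 0, 10, 0, 450, 40], dtype=float)

# Baseline 1: always predict zero.
zero_baseline = mean_absolute_error(y_true, np.zeros_like(y_true))

# Baseline 2: always predict the median of the targets.
median_baseline = mean_absolute_error(
    y_true, np.full_like(y_true, np.median(y_true)))

print("MAE when always predicting 0:", zero_baseline)
print("MAE when always predicting the median:", median_baseline)
```

A trained model is only useful if its MAE on held-out data is clearly below these constant baselines computed on the same data.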