Tabular: set up new train and valid `dls` with initially prepared `to` object

DmitryG · September 28, 2020, 12:59pm

Hi,

I am trying to find a way to set up new train and valid `dls` from an already prepared `to` object.

Below are two versions of the code. The first is a simple example from the tutorial and it works well;
the second is an attempt to build the dataloaders from the already prepared `to`.

I expected both snippets to behave exactly the same, but the second snippet shows no learning improvement at all.

This basic example works as expected:

df is a pd.DataFrame

from fastai.tabular.all import *

# Continuous and categorical feature columns; the target 'sold' is kept out of the categoricals
cont_names = df.select_dtypes(['float64','int64']).columns.tolist()
cat_names = df.select_dtypes(['bool','category']).columns.tolist()
cat_names = [c for c in cat_names if c not in ['sold']]
# Random 80/20 train/valid split over the row indices
splits = RandomSplitter(valid_pct=0.2)(range_of(df))
to = TabularPandas(df, procs=[Categorify, FillMissing,Normalize],
                   cat_names = cat_names,
                   cont_names = cont_names,
                   y_names='sold',
                   splits=splits,
                   inplace=True,
                   reduce_memory=False)

dls = to.dataloaders(bs=64*50).cuda()
learn = tabular_learner(dls, metrics=accuracy)
learn.fit_one_cycle(40, lr_max=1e-04)

This code causes the accuracy metric to bounce around 0.5 the whole time:

# create `to` object with setup data

cont_names = df.select_dtypes(['float64','int64']).columns.tolist()
cat_names =  df.select_dtypes(['bool','category']).columns.tolist()
cat_names = [c for c in cat_names if c not in ['sold']]

splits = RandomSplitter(valid_pct=0.2)(range_of(df))
to = TabularPandas(df, procs=[Categorify, FillMissing,Normalize],
                   cat_names = cat_names,
                   cont_names = cont_names,
                   y_names='sold',
                   splits=splits,
                   inplace=True,
                   reduce_memory=False)



splits = RandomSplitter(valid_pct=0.2)(range_of(df))

train_ds = df[df.index.isin(splits[0])]
valid_ds = df[df.index.isin(splits[1])]

new_train_to = to.new(train_ds)
new_valid_to = to.new(valid_ds)
new_train_to.process()
new_valid_to.process()

trn_dl = TabDataLoader(new_train_to,bs=64*50)
val_dl = TabDataLoader(new_valid_to,bs=64*50)

dls =  DataLoaders(trn_dl, val_dl).cuda()
learn = tabular_learner(dls, metrics=accuracy)
learn.fit_one_cycle(40, lr_max=1e-04)

Thank you!

krasin · December 21, 2020, 7:07pm

Hi Dmitry,

Have you managed to solve the problem? I have a similar one.

Thanks

DmitryG · December 30, 2020, 6:03am

Hi krasin, the reason was that you need to shuffle the train dataloader. Building the dataloaders with to.dataloaders() does that for you automatically, but once you do it yourself with TabDataLoader, you have to add shuffle=True:

new_train_to = to.new(train_ds)
new_valid_to = to.new(valid_ds)
new_train_to.process()
new_valid_to.process()

trn_dl = TabDataLoader(new_train_to,bs=64*50, shuffle=True)
val_dl = TabDataLoader(new_valid_to,bs=64*50)

dls =  DataLoaders(trn_dl, val_dl).cuda()
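
The thread does not say why the missing shuffle hurt training so much. One possible mechanism, assuming the rows happen to be ordered by the target, is that every unshuffled batch then contains a single class. The sketch below uses plain Python rather than fastai to show that effect; the `batches` and `is_single_class` helpers and the toy `labels` list are hypothetical and only serve the illustration.

```python
import random

def batches(items, bs, shuffle=False, seed=0):
    """Yield lists of at most `bs` items, optionally visiting them in shuffled order."""
    order = list(range(len(items)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), bs):
        yield [items[i] for i in order[start:start + bs]]

def is_single_class(batch):
    """True when every label in the batch is the same."""
    return len(set(batch)) == 1

# Hypothetical toy target column: 8 rows of class 0 followed by 8 rows of class 1.
labels = [0] * 8 + [1] * 8

unshuffled = list(batches(labels, bs=4))
shuffled = list(batches(labels, bs=4, shuffle=True))

print("unshuffled:", unshuffled)
print("shuffled:  ", shuffled)
print("single-class batches without shuffle:", sum(map(is_single_class, unshuffled)))
print("single-class batches with shuffle:   ", sum(map(is_single_class, shuffled)))
```

If your own data is not ordered by the target, this particular effect does not apply, but passing shuffle=True to the train TabDataLoader as above is still what matches the behaviour of to.dataloaders().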