Multilabel classification with tabular data

randy912 · September 21, 2020, 9:35pm

I’ve seen various blog posts and a few posts on this forum about this topic but none have answered my question. I am doing multilabel classification on tabular data.

Here is what I have. train is the training data (800 columns) and train_targets are the labels (206 columns, all values are either 0 or 1):

cat_names = ['cat1', 'cat2', 'cat3']
cont_names = [x for x in train.columns if x not in cat_names]

train_label_col = []
for row in train_targets.itertuples():
  vals = [','.join(str(ele).split()) for ele in row[1:]]
  train_label_col.append(' '.join(vals))

train['label'] = train_label_col

procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(train))

to = TabularPandas(train, procs, cat_names, cont_names, y_names="label", y_block=MultiCategoryBlock(), splits=splits)

All of the above works fine, but when I run
dls = to.dataloaders(bs=1024)
I get the “Could not do one pass in your dataloader, there is something wrong in it” warning, and when I run dls.show_batch(3) it throws TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

When I run learn = tabular_learner(dls, y_range=(0,1), layers=[500, 250], n_out=1, loss_func=F.binary_cross_entropy) it works, but learn.fit_one_cycle(5, 1e-2) throws the same error as above.

Any help is greatly appreciated.

randy912 · September 22, 2020, 1:25am

Fixed it. The block of code that builds train_label_col is completely unnecessary. All I had to do was concatenate train and train_targets and then pass all of the train_targets column names to the y_names parameter of TabularPandas.

from fastai.tabular.all import *
import pandas as pd

cat_names = ['cp_type', 'cp_time', 'cp_dose']
cont_names = [x for x in train.columns if x not in cat_names]

procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter(valid_pct=0.2)(range_of(train))

data = pd.concat([train, train_targets], axis=1)

to = TabularPandas(data, procs=procs,
                   cat_names=cat_names,
                   cont_names=cont_names,
                   y_names=list(train_targets.columns),
                   splits=splits)
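
Here is a toy-sized sketch of the concatenation step on its own, using only pandas. The column names g_0, target_a and target_b and all of the values are made up for illustration; only cp_type comes from the code above. It assumes train and train_targets share the same row index, because pd.concat with axis=1 aligns rows on the index rather than on position.

```python
import pandas as pd

# Hypothetical stand-ins for the real frames (values are invented).
train = pd.DataFrame({"cp_type": ["a", "b", "a"], "g_0": [0.5, -1.2, 0.3]})
train_targets = pd.DataFrame({"target_a": [0, 1, 0], "target_b": [1, 1, 0]})

# Build the feature lists from train alone, before the labels are attached.
cat_names = ["cp_type"]
cont_names = [c for c in train.columns if c not in cat_names]

# concat(axis=1) aligns on the index, so check that the indexes match first.
assert train.index.equals(train_targets.index)
data = pd.concat([train, train_targets], axis=1)

# Every label column becomes one entry in y_names.
y_names = list(train_targets.columns)
```

Because cont_names is computed from train.columns before the concatenation, none of the label columns end up in the feature list.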

gautam_e · August 18, 2021, 7:54am

Thanks for this post and the solution. If I’m not mistaken, this works only when the labels are already one-hot encoded. Did you try it with labels that are not one-hot encoded? Unfortunately, I could not get that to work.
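
One possible workaround, which is not confirmed anywhere in this thread, is to expand the labels into 0/1 indicator columns first so that the concatenation approach above applies. The sketch below assumes the labels sit in a single string column (hypothetically named label) with space-separated label names; the frame and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical frame: one feature column and one space-delimited label column.
df = pd.DataFrame({
    "feat": [0.1, 0.2, 0.3],
    "label": ["a b", "b", "a c"],
})

# Expand the delimited strings into one 0/1 indicator column per label.
one_hot = df["label"].str.get_dummies(sep=" ")

# Swap the string column for the indicator columns.
data = pd.concat([df.drop(columns="label"), one_hot], axis=1)

# These names could then be passed as y_names.
y_names = list(one_hot.columns)
```

If the labels are stored some other way, for example as Python lists per row, the expansion step would need to change accordingly.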

Adhe11 · June 30, 2023, 2:03pm

Try using to.train.ys instead of to.train.y. to.train.y returns only the first target column, whereas to.train.ys returns all of them.