Tabular data with multiple output classes

pl3 · March 16, 2019, 8:27pm

I have a dataframe with multiple target columns as categories.

Dtypes:

tar_0    category
tar_1    category
tar_2    category
dtype: object

Example .head(3):

	tar_0	tar_1	tar_2
0	2	4	7
1	3	4	6
2	1	4	8

I can’t seem to get the datablock api to work with this setup using .label_from_df(cols=tar_cols). Is it currently possible or do I have to use another approach?

Thanks

sgugger · March 17, 2019, 2:10am

Since this is multilabel (with always three labels from what you show), you should just reshape your data frame so that all your targets are in one column, with the different labels separated by a space (or any delimiter you like). This way, you can use the data block API with label_delim='...' in your call.
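A minimal sketch of that reshaping, written in plain Python rather than pandas so that it runs on its own. The rows are copied from the .head(3) output in the first post; the helper name join_targets and the column name "targets" are illustrative and not part of fastai.

```python
# Rows copied from the .head(3) output in the first post.
rows = [
    {"tar_0": 2, "tar_1": 4, "tar_2": 7},
    {"tar_0": 3, "tar_1": 4, "tar_2": 6},
    {"tar_0": 1, "tar_1": 4, "tar_2": 8},
]
tar_cols = ["tar_0", "tar_1", "tar_2"]


def join_targets(row, cols, delim=" "):
    """Collapse several target columns into one delimited label string."""
    return delim.join(str(row[c]) for c in cols)


# "targets" is a hypothetical name for the new single label column.
for row in rows:
    row["targets"] = join_targets(row, tar_cols)

print([row["targets"] for row in rows])  # ['2 4 7', '3 4 6', '1 4 8']
```

The data block call would then label from the single "targets" column with label_delim=' '. Note that the joined string records which labels are present, not which column each one came from.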

wgpubs · March 17, 2019, 2:48am

But I don’t think this is really a multi-label problem, since he isn’t trying to predict just 1 or 0 for each column … it looks more like he’s trying to solve multiple multi-class classification problems at the same time. I’m not sure if fastai handles that scenario out-of-the-box.

One idea is to turn this into a regression problem, which I think would work.

pl3 · March 17, 2019, 3:37am

I’m trying to model the top 3 positions of a jai alai match. The classes are for the player/team that gets in 1st/2nd/3rd position. In jai alai, the positions that place (top 3) have some interdependencies based on how the games play out since they play in an order.

I’m testing a few different approaches with the targets:

the combination of top 3 positions as a single class (order doesn’t matter), output of bs*56 (# of class combinations)
binary indicators for each position placing in top 3, output of bs*8 (# of teams) with BCEWithLogitsLoss
a class for each of the top 3 positions, output of bs*3*8 (# of top spots * # of positions)

I got the first two so far, but haven’t been able to do the 3rd with the data block yet. I’m not familiar enough with the API to figure it out at the moment, but I can do it with a manual data class/bunch.

wgpubs · March 17, 2019, 3:50am

Can you describe how many columns you are trying to predict and the range for each column?

You have three listed above but I’m not sure if there are only three and what the range is for each.

pl3 · March 17, 2019, 4:37am

Just 3 columns (1st, 2nd, and 3rd place), with each having 8 classes (# of teams)

wgpubs · March 17, 2019, 4:43am

Yah, I’m not sure if the framework supports this right out of the box (we’ll see what Sylvain says).

I have seen folks doing something similar by defining it as a regression problem vs. a classification problem. Something to consider.

pl3 · March 17, 2019, 1:11pm

I don’t think doing a regression problem makes sense since it’s the team numbers 1-8. I’ll try it though since the network might just be able to figure it out if I give it enough layers.

sgugger · March 17, 2019, 2:43pm

There is no ItemList type in fastai that matches your need so you will need to write your own. Note that you’ll also need to adapt your model to return 3 by 8 probs and write a custom loss function.
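To make the "3 by 8" shape concrete, here is a plain-Python sketch of one possible loss: the model emits 24 logits per row, they are viewed as 3 positions by 8 teams, and a softmax cross-entropy is computed per position and averaged. Averaging (rather than summing or weighting the positions) is an assumption, as are the function names. In PyTorch the same idea would normally be written with tensors reshaped to (bs, 3, 8).

```python
import math

N_POS, N_TEAMS = 3, 8  # 1st/2nd/3rd place, 8 teams


def cross_entropy(logits, target):
    """Softmax cross-entropy for one position; target is a 0-based class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def multi_position_loss(flat_logits, targets):
    """Average the per-position cross-entropy over the 3 positions."""
    assert len(flat_logits) == N_POS * N_TEAMS and len(targets) == N_POS
    total = 0.0
    for pos, target in enumerate(targets):
        start = pos * N_TEAMS
        total += cross_entropy(flat_logits[start:start + N_TEAMS], target)
    return total / N_POS


# With all-zero logits every team is equally likely, so the loss is log(8).
uniform_loss = multi_position_loss([0.0] * 24, [1, 3, 6])
print(abs(uniform_loss - math.log(8)) < 1e-12)  # True
```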

pl3 · March 17, 2019, 2:57pm

That’s what I figured when I needed the extra dimension. Thanks for your help, I’ll post what I come up with when I write it.

ecatkins · August 13, 2019, 3:11pm

@pl3 Did you ever make progress on this?

pl3 · August 19, 2019, 3:22am

It’s been a while since I looked at this code. I tried a few ways to model the target variable, but never the way I described in the 3rd bullet above. Here’s what I had:

# individual tri box cat
# tar_cols = ['ID_exa_box']
tar_cols = ['ID_tri_box']
out_sz = len(win_cats['tri'])

# each position in box
# tar_cols = ['tri_box']
# tar_cols = ['ID_exa_box_0','ID_exa_box_1']
# tar_cols = ['ID_tri_box_0','ID_tri_box_1','ID_tri_box_2']
# out_sz = 3 * len(win_cats['pos'])

# each position indicator
# tar_cols = [f'exa_box_{i}_ind' for i in range(1,9)]
# tar_cols = [f'tri_box_{i}_ind' for i in range(1,9)]
# out_sz = len(win_cats['pos'])

I was testing 3 target variable approaches, on exacta box (top 2 spots, any order) and trifecta box (top 3 spots, any order). I created the columns for all approaches in the dataframe, then ran the code, testing each block in turn.

The first block of code was an ID variable for the combination (e.g. 123=0, 124=1, etc.).

The second block was setting the ID of the lowest numbered team in the box for each position (e.g. 124=0 1 3.)

The last chunk, I set a binary flag for each team that is 1 if they placed in the box.

With this approach I was able to use the datablock api:

data = (TabularList
        .from_df(df_trn, cat_names=inp_cols)
        .split_by_idxs(idx_trn, idx_val)
        .label_from_df(cols=tar_cols)
        .add_test(test)
        .databunch(bs=32)
       )

This seemed to work, but I never got great results. I never spent much more time on it though.
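For reference, the three trifecta-box encodings described in this post could be derived from a single result roughly as follows. This is a reconstruction from the examples given (123=0, 124=1 and 124=0 1 3), not the original preprocessing code; the function and variable names are made up, and numbering the combinations in lexicographic order is an assumption that happens to match those examples.

```python
from itertools import combinations

TEAMS = range(1, 9)  # teams are numbered 1-8

# Assumption: combinations are numbered in lexicographic order,
# which reproduces the "123=0, 124=1" example.
COMBO_ID = {combo: i for i, combo in enumerate(combinations(TEAMS, 3))}


def encode_trifecta_box(top3):
    """Return the three target encodings for one trifecta box."""
    box = tuple(sorted(top3))  # order does not matter in a box
    combo_id = COMBO_ID[box]                     # 1) single class, 56 options
    per_slot = [team - 1 for team in box]        # 2) 0-based team ID per slot
    indicators = [int(t in box) for t in TEAMS]  # 3) one binary flag per team
    return combo_id, per_slot, indicators


print(len(COMBO_ID))                   # 56
print(encode_trifecta_box((4, 1, 2)))  # (1, [0, 1, 3], [1, 1, 0, 1, 0, 0, 0, 0])
```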

bikashg · December 20, 2019, 10:26pm

Can you please point to an example like that? I am trying to model the problem as a regression but can’t build the proper databunch so far.

wgpubs · December 20, 2019, 10:30pm

" If your task is regression, be sure the pass label_cls = FloatList so that learners created from your databunch initialize correctly."

Does that help?

mmr · January 11, 2020, 9:46am

Can you give an example of how to use it?
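For the regression route, here is a hedged sketch that mirrors the data block call posted earlier in this thread and changes only the labelling step. It assumes fastai v1, targets stored as numbers, and the same kind of inputs as that earlier call. The function name regression_databunch is made up, and the snippet has not been checked against a specific fastai release.

```python
def regression_databunch(df, cat_names, tar_cols, idx_trn, idx_val, bs=32):
    """Build a fastai v1 tabular DataBunch whose labels are floats.

    The imports live inside the function so that this file can be loaded
    without fastai installed; calling the function requires fastai v1.
    """
    from fastai.data_block import FloatList
    from fastai.tabular import TabularList

    return (TabularList
            .from_df(df, cat_names=cat_names)
            .split_by_idxs(idx_trn, idx_val)
            # label_cls=FloatList makes the targets floats instead of categories
            .label_from_df(cols=tar_cols, label_cls=FloatList)
            .databunch(bs=bs))
```

Called with the three target columns, this should give a three-value float target per row, which pairs with a regression loss such as mean squared error.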

tb450 · September 11, 2020, 4:39pm

I have a tabular dataset. Training inputs are mostly continuous, but I have some categorical and integer inputs too. Outputs are multiple classes, so each row could have zero, one or many classes.

I’ve managed to convert the outputs to a single delimited column, but I can’t figure out how to use label_delim for tabular data.

How can I use the data block API for a tabular dataset with multiclass outputs? Also, is there a way to make sure that a test dataset is normalized and categorified consistently with the training set?

Thanks

Edit: Link to data
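On the second question, the general pattern is to compute the normalization statistics and category codes on the training set only and reuse them for every other split. The sketch below shows that pattern in plain Python. It is not fastai's implementation, and the function names, column names and toy data are invented for illustration; fastai v1's tabular procs follow the same split, with apply_train and apply_test methods.

```python
import statistics


def fit_preprocessing(train_rows, cont_col, cat_col):
    """Learn normalization stats and a category mapping from training rows only."""
    values = [r[cont_col] for r in train_rows]
    cats = sorted({r[cat_col] for r in train_rows})
    return {
        "mean": statistics.mean(values),
        "std": statistics.pstdev(values) or 1.0,  # guard against zero variance
        # code 0 is reserved for categories never seen in training
        "codes": {c: i + 1 for i, c in enumerate(cats)},
    }


def apply_preprocessing(rows, state, cont_col, cat_col):
    """Transform any split (train, validation or test) with the fitted state."""
    return [
        {
            cont_col: (r[cont_col] - state["mean"]) / state["std"],
            cat_col: state["codes"].get(r[cat_col], 0),
        }
        for r in rows
    ]


train = [{"x": 1.0, "kind": "a"}, {"x": 3.0, "kind": "b"}]
test = [{"x": 2.0, "kind": "c"}]  # "c" never appears in training

state = fit_preprocessing(train, "x", "kind")
print(apply_preprocessing(test, state, "x", "kind"))  # [{'x': 0.0, 'kind': 0}]
```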