Tabular data with multiple output classes

(Phil Lynch) #1

I have a dataframe with multiple target columns as categories.

Dtypes:

tar_0    category
tar_1    category
tar_2    category
dtype: object

Example .head(3):

tar_0 tar_1 tar_2
0 2 4 7
1 3 4 6
2 1 4 8

I can’t seem to get the datablock api to work with this setup using .label_from_df(cols=tar_cols). Is it currently possible or do I have to use another approach?

Thanks

#2

Since this is multilabel (with always three labels from what you show), you should just change your data frame to have all your targets in one column, with the different labels separated by a space (or any delimiter you like). This way, you can use the data block API with label_delim='...' in your call.
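
A minimal sketch of that reshaping with pandas, using the three example rows from the first post. The column name `targets` is an assumption made for illustration, not something from the thread.

```python
import pandas as pd

# Example rows from the first post, stored as categories
df = pd.DataFrame(
    {"tar_0": [2, 3, 1], "tar_1": [4, 4, 4], "tar_2": [7, 6, 8]}
).astype("category")
tar_cols = ["tar_0", "tar_1", "tar_2"]

# Join the three target columns into one space-delimited string column.
# "targets" is a hypothetical column name.
df["targets"] = df[tar_cols].astype(str).apply(lambda row: " ".join(row), axis=1)

print(df["targets"].tolist())
```

The new column could then be passed as cols='targets' together with label_delim=' '. Note that joining the labels this way treats them as a set, so which column a label came from is not preserved.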

(WG) #3

But I don’t think this is really a multi-label problem since he isn’t trying to predict just 1 or 0 for each column … it looks more like he’s trying to solve multiple multi-class classification problems at the same time. I’m not sure if fastai handles that scenario out-of-the-box.

One idea is to turn this into a regression problem, which I think would work.

(Phil Lynch) #4

I’m trying to model the top 3 positions of a jai alai match. The classes are for the player/team that gets in 1st/2nd/3rd position. In jai alai, the positions that place (top 3) have some interdependencies based on how the games play out since they play in an order.

I’m testing a few different approaches with the targets:

  • the combination of top 3 positions as a single class (order doesn’t matter), output of bs*56 (# of class combinations)
  • binary indicators for each position placing in top 3, output of bs*8 (# of teams) with BCEWithLogitsLoss
  • a class for each of the top 3 in each position, output of bs*3*8 (# of top spots * # of positions)
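
The output sizes in the list above follow from simple counting. A quick check, assuming 8 teams and 3 placing positions as stated in the thread (the variable names are illustrative):

```python
from math import comb

n_teams = 8    # teams in a match
n_places = 3   # 1st, 2nd and 3rd position

# 1) unordered top-3 combination treated as a single class
n_combo_classes = comb(n_teams, n_places)
# 2) one binary indicator per team
n_indicator_outputs = n_teams
# 3) one 8-way classification per placing position
n_per_position_outputs = n_places * n_teams

print(n_combo_classes, n_indicator_outputs, n_per_position_outputs)
```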

I have the first two working so far, but haven’t been able to do the 3rd with the data block API yet. I’m not familiar enough with the API to figure it out at the moment, but I can do it with a manual data class/bunch.

(WG) #5

Can you describe how many columns you are trying to predict and the range for each column?

You have three listed above but I’m not sure if there are only three and what the range is for each.

(Phil Lynch) #6

Just 3 columns (1st, 2nd, and 3rd place), with each having 8 classes (# of teams)

(WG) #7

Yah, I’m not sure if the framework supports this right out of the box (we’ll see what Sylvain says).

I have seen folks doing something similar by defining it as a regression problem vs. a classification problem. Something to consider.

(Phil Lynch) #8

I don’t think treating this as a regression problem makes sense since the targets are team numbers 1-8, which are labels rather than quantities. I’ll try it though, since the network might just be able to figure it out if I give it enough layers.

#9

There is no ItemList type in fastai that matches your need so you will need to write your own. Note that you’ll also need to adapt your model to return 3 by 8 probs and write a custom loss function.
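
As a rough illustration of the model and loss side only, here is a plain PyTorch sketch (not fastai API). It assumes a single linear head reshaped to 3 by 8 logits and a loss that averages cross-entropy over the three positions; the class name `MultiPositionNet`, the function `multi_position_loss`, and the layer sizes are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_POS, N_TEAMS = 3, 8  # 3 placing positions, 8 teams (from the thread)

class MultiPositionNet(nn.Module):
    """Hypothetical model: one linear head reshaped to (batch, 3, 8) logits."""
    def __init__(self, n_in, n_hidden=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU())
        self.head = nn.Linear(n_hidden, N_POS * N_TEAMS)

    def forward(self, x):
        return self.head(self.body(x)).view(-1, N_POS, N_TEAMS)

def multi_position_loss(logits, targets):
    # logits: (batch, 3, 8) raw scores; targets: (batch, 3) integer classes 0..7.
    # Flattening makes this the mean cross-entropy over all positions.
    return F.cross_entropy(logits.reshape(-1, N_TEAMS), targets.reshape(-1))

torch.manual_seed(0)
model = MultiPositionNet(n_in=10)
x = torch.randn(4, 10)                        # dummy batch of 4 rows
y = torch.randint(0, N_TEAMS, (4, N_POS))     # dummy targets
logits = model(x)
loss = multi_position_loss(logits, y)
print(tuple(logits.shape))
```

The sketch returns logits rather than probabilities because cross-entropy applies the softmax internally; a softmax over the last dimension would give the 3 by 8 probabilities at inference time.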

(Phil Lynch) #10

That’s what I figured when I needed the extra dimension. Thanks for your help, I’ll post what I come up with when I write it.

(Edward Atkins) #11

@pl3 Did you ever make progress on this?

(Phil Lynch) #12

It’s been a while since I looked at this code. I tried a few ways to model the target variable, but never the way I described in the 3rd bullet above. Here’s what I had:

# individual tri box cat
# tar_cols = ['ID_exa_box']
tar_cols = ['ID_tri_box']
out_sz = len(win_cats['tri'])

# each position in box
# tar_cols = ['tri_box']
# tar_cols = ['ID_exa_box_0','ID_exa_box_1']
# tar_cols = ['ID_tri_box_0','ID_tri_box_1','ID_tri_box_2']
# out_sz = 3 * len(win_cats['pos'])

# each position indicator
# tar_cols = [f'exa_box_{i}_ind' for i in range(1,9)]
# tar_cols = [f'tri_box_{i}_ind' for i in range(1,9)]
# out_sz = len(win_cats['pos'])

I was testing 3 target variable approaches on exacta box (top 2 spots, any order) and trifecta box (top 3 spots, any order). I created the columns for all approaches in the dataframe, then ran the code, testing one block at a time.

The first block of code was an ID variable for the combination (e.g. 123=0, 124=1, etc.).

The second block was setting the ID of the lowest numbered team in the box for each position (e.g. 124=0 1 3.)

In the last chunk, I set a binary flag for each team that is 1 if they placed in the box.

With this approach I was able to use the datablock api:

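from fastai.tabular import *  # fastai v1; provides TabularList

# Build the DataBunch: categorical inputs, index-based train/valid split,
# labels taken from the target column(s) selected above
data = (TabularList
        .from_df(df_trn, cat_names=inp_cols)
        .split_by_idxs(idx_trn, idx_val)
        .label_from_df(cols=tar_cols)
        .add_test(test)
        .databunch(bs=32)
       )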

This seemed to work, but I never got great results. I never spent much more time on it though.
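
For reference, the three trifecta-box encodings described in this post can be sketched as follows. This assumes teams numbered 1-8 and combination IDs assigned in lexicographic order, which matches the examples given (123=0, 124=1); the helper `encode_box` is hypothetical and not taken from the original code.

```python
from itertools import combinations

TEAMS = range(1, 9)  # teams numbered 1-8

# Lexicographic ID for every unordered trio: (1,2,3) -> 0, (1,2,4) -> 1, ...
COMBO_ID = {c: i for i, c in enumerate(combinations(TEAMS, 3))}

def encode_box(finishers):
    """Return the three target encodings for one set of top-3 finishers."""
    box = tuple(sorted(finishers))                      # order does not matter
    combo_id = COMBO_ID[box]                            # single-class ID
    per_position = [t - 1 for t in box]                 # zero-based team per slot
    indicators = [1 if t in box else 0 for t in TEAMS]  # one flag per team
    return combo_id, per_position, indicators

combo_id, per_position, indicators = encode_box([4, 1, 2])
print(combo_id, per_position, indicators)
```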