I don't know how to make a multi output model for structured data using FastAI library

fakhirali · June 11, 2018, 4:31pm

I can’t get a model working which outputs more than one value. I use the column_data and structured parts of the library but I get errors like “input and target shapes do not match” and “boolean index did not match indexed array along dimension 0”.

Processing the data using “proc_df” was also difficult with multiple values of y. It would be great if anyone could guide me.
Thank you

whamp · June 11, 2018, 4:38pm

I have this same question. I see the “is_multi” parameter but am struggling to understand the source code on this.

johnri99 · June 11, 2018, 4:52pm

Can you provide more details of what you have tried and what doesn’t work? These errors are typical of not getting the input or output structure correct, but it’s very difficult to help without more information.

whamp · June 11, 2018, 5:25pm

So I have tried passing a list of targets to proc_df as a variable “target_vars”. Seems like proc_df doesn’t like lists so I just went into the source for proc_df and made that work for my use case.

However where I’m confused is where to indicate in the modeldata object or in the learner object that I have multiple targets.

Here is a screenshot of my code which works for a single target. I’m trying to alter it to multi-output, preferably with a higher weighting on an individual target within the multi-output target list.

So I guess I’m asking: is there a way to pass a list of targets to proc_df and I’m just doing it wrong?
Do I need to do anything else besides setting is_multi=True and setting out_szs to len(list of targets)?
What about upweighting a given target in the list of targets? Is that the wds parameter?

Unfortunately I’ve deleted the sample code that had errors in it to focus on getting single output working. I will try again and post here when I have sample code for multi-output that fails.

fakhirali · June 11, 2018, 5:58pm

For me I first found out that I could not pass a list to proc_df() as y_fld. I was able to fix that by running proc_df() multiple times on single y_flds and then combining all the resulting y values into a single array of shape (no of targets,no of rows).

Then I ran into a problem with the from_data_frame() function, which calls a function named split_by_idx(). split_by_idx() requires the arrays to have the same length along the first dimension. I fixed that by reshaping the y array to (no of rows, no of targets).
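
One caution on this step: if the combined array is built as (no of targets, no of rows), getting to (no of rows, no of targets) needs a transpose rather than a plain reshape, since reshape keeps the element order and would mix values from different rows together. A minimal NumPy sketch, where `y_a`, `y_b` and `y_c` are made-up stand-ins for the y arrays returned by separate proc_df calls:

```python
import numpy as np

# Hypothetical targets, one 1-D array per proc_df call (4 rows each).
y_a = np.array([1.0, 2.0, 3.0, 4.0])
y_b = np.array([10.0, 20.0, 30.0, 40.0])
y_c = np.array([100.0, 200.0, 300.0, 400.0])

# Stacking along axis 0 gives shape (no of targets, no of rows) = (3, 4).
y_targets_first = np.stack([y_a, y_b, y_c], axis=0)

# Transposing keeps each row's targets together: shape (4, 3),
# and row 0 holds the first value of every target (1, 10, 100).
y_rows_first = y_targets_first.T

# reshape also gives shape (4, 3), but row 0 is now 1, 2, 3:
# three values of the SAME target, which is not what the model should see.
y_scrambled = y_targets_first.reshape(4, 3)

# Stacking along axis 1 produces the (rows, targets) layout directly.
y_direct = np.stack([y_a, y_b, y_c], axis=1)
```

Stacking along axis 1 in the first place avoids the intermediate (targets, rows) array altogether.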

After that, whenever I run a learn function on the learner object it gives me an error. The error is related to the PyTorch loss function. It says:

input and target shapes do not match: input [128 x 5], target [128 x 1 x 5] at c:\programdata\miniconda3\conda-bld\pytorch_1524549877902\work\aten\src\thcunn\generic/MSECriterion.cu:15

johnri99 · June 12, 2018, 10:39am

My reading of this is that proc_df should work OK if the variables are part of the dataframe and are of the right type. If that is the case, I don’t see why passing a list of columns should not work. If the columns are numerical then it probably won’t, since there is a flag in the routine to prevent this.

Are you doing regression or classification? If regression, then you need to set the is_reg flag to True. Apart from that, my understanding is that if you set is_multi then it will select the appropriate loss function, and the number of outputs will be set by out_sz. For binary cross entropy, PyTorch will expect one column per class in the y array.

At present I can’t see a convenient way to pass a set of loss weights into the criterion function, although clearly PyTorch allows this and I have used it in the past in standalone (non-fastai) models. If anybody could shed light on whether this is possible at present it would be helpful. I don’t think it would be too much work to add this as a feature, but there is no point in doing so if there is already a way.
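
For reference, a per-target weighted loss is short to write in plain PyTorch. This is a minimal sketch outside fastai: the function name `weighted_mse` and the weight values are made up for illustration, and how to plug such a function into a fastai learner is not shown here.

```python
import torch

def weighted_mse(pred, target, weights):
    """Mean squared error with one weight per output column.

    pred, target: tensors of shape (batch, n_targets)
    weights: tensor of shape (n_targets,), broadcast across the batch
    """
    return (weights * (pred - target) ** 2).mean()

pred = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
target = torch.tensor([[0.0, 2.0], [3.0, 2.0]])
weights = torch.tensor([3.0, 1.0])  # hypothetical: upweight the first target

# Squared errors are [[1, 0], [0, 4]]; weighted they become [[3, 0], [0, 4]],
# so the mean over the four entries is 7 / 4 = 1.75.
loss = weighted_mse(pred, target, weights)
```

Because the weights broadcast over the batch dimension, the same function works for any batch size as long as the last dimension matches the number of targets.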

fakhirali · June 15, 2018, 6:16am

I got it working by setting is_reg to False. This made my y values the correct shape (no of rows, no of targets) instead of (no of rows, 1, no of targets). is_multi is still set to True. Another thing I noticed was that the model started using log_loss instead of MSE.
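
The extra middle dimension is consistent with a new axis being inserted into an array that is already two-dimensional. Whether that is exactly what the library does when is_reg is True is an assumption here, not something confirmed in this thread; the NumPy sketch below only shows how a (no of rows, 1, no of targets) shape can arise and how to undo it.

```python
import numpy as np

# Hypothetical multi-target array: 128 rows, 5 targets.
y = np.zeros((128, 5), dtype=np.float32)

# Adding an axis is a common step for single-target regression, where y is 1-D.
# Applied to a 2-D array it yields the 3-D shape from the error message.
y_extra = y[:, None]
print(y_extra.shape)  # (128, 1, 5)

# Removing the length-1 axis restores the shape that matches the model output.
y_fixed = y_extra.squeeze(axis=1)
print(y_fixed.shape)  # (128, 5)
```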

rwfilice · June 16, 2018, 11:14pm

Any chance you’d be willing to share your code for how you created your array of y’s (from proc_df) and then passed that to ColumnarModelData.from_data_frame? I’m having a bit of trouble working that out. Thanks much.

fakhirali · June 17, 2018, 6:50am

Sure!
I just did it multiple times. It’s probably not the best way but it worked.

rwfilice · June 19, 2018, 2:59am

Thanks much - and how are you setting torney_df? I run into index errors when I try to set my test dataframe with multiple variables like you’ve done. Thanks again.

fakhirali · June 19, 2018, 2:58pm

I think the test dataframe should not contain targets. Try removing them. This parameter is meant for something like the Kaggle submission dataset.

If you have a validation set then combine it with the training data and then pass the indexes of the validation rows.
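
A minimal pandas sketch of that idea, using made-up frames (`train_df`, `valid_df` and the column names are hypothetical): concatenate the two sets and record which row positions belong to the validation part.

```python
import pandas as pd

# Hypothetical training and validation frames with the same columns.
train_df = pd.DataFrame({"x": [1, 2, 3], "y1": [0.1, 0.2, 0.3], "y2": [1.0, 2.0, 3.0]})
valid_df = pd.DataFrame({"x": [4, 5], "y1": [0.4, 0.5], "y2": [4.0, 5.0]})

# Stack them into one frame; ignore_index gives a clean 0..n-1 index.
full_df = pd.concat([train_df, valid_df], ignore_index=True)

# The validation rows are the last len(valid_df) positions.
val_idxs = list(range(len(train_df), len(full_df)))
print(val_idxs)  # [3, 4]
```

Resetting the index matters here: without it the validation rows would keep their original labels 0 and 1 and the positions would no longer line up with the labels.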

rwfilice · June 20, 2018, 1:32am

Thanks for all your help. For what it’s worth, I modified your approach slightly and have something that appears to be working, though I need to do some more validation. Instead of passing the dataframe to proc_df multiple times (which didn’t seem quite right to me, because each time you are rescaling the dataframe), I modified the proc_df function so that it iterates through a list of y_flds and returns a list of ys, which you can then reshape as appropriate to pass to ColumnarModelData with is_multi=True. See my tweaked function below. I passed both the train and test datasets through this function, as they did in the Rossmann example.

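# Assumes fastai 0.7, where these helpers are defined in fastai.structured.
import pandas as pd
from pandas.api.types import is_numeric_dtype
from fastai.structured import get_sample, fix_missing, scale_vars, numericalize

def proc_dfs(df, y_flds=None, skip_flds=None, ignore_flds=None, do_scale=False, na_dict=None, preproc_fn=None, max_n_cat=None, subset=None, mapper=None):
    # Copy list arguments so the caller's lists are never mutated.
    y_flds = list(y_flds) if y_flds else []
    ignore_flds = list(ignore_flds) if ignore_flds else []
    skip_flds = list(skip_flds) if skip_flds else []
    # Copy before dropping columns so the caller's dataframe is left untouched.
    if subset: df = get_sample(df,subset)
    else: df = df.copy()
    ignored_flds = df.loc[:, ignore_flds]
    df.drop(ignore_flds, axis=1, inplace=True)
    if preproc_fn: preproc_fn(df)
    # Collect one target array per column in y_flds, with or without preproc_fn.
    ys = []
    for y_fld in y_flds:
        if not is_numeric_dtype(df[y_fld]): df[y_fld] = df[y_fld].cat.codes
        y = df[y_fld].values
        ys.append(y)
        skip_flds += [y_fld]
    df.drop(skip_flds, axis=1, inplace=True)

    # Fill missing values, keeping track of which columns needed it.
    if na_dict is None: na_dict = {}
    else: na_dict = na_dict.copy()
    na_dict_initial = na_dict.copy()
    for n,c in df.items(): na_dict = fix_missing(df, c, n, na_dict)
    if len(na_dict_initial.keys()) > 0:
        df.drop([a + '_na' for a in list(set(na_dict.keys()) - set(na_dict_initial.keys()))], axis=1, inplace=True)
    if do_scale: mapper = scale_vars(df, mapper)
    for n,c in df.items(): numericalize(df, c, n, max_n_cat)
    df = pd.get_dummies(df, dummy_na=True)
    df = pd.concat([ignored_flds, df], axis=1)
    # ys is a list with one 1-D array per target column.
    res = [df, ys, na_dict]
    if do_scale: res = res + [mapper]
    return res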

fakhirali · June 22, 2018, 8:28am

This is fantastic. I just set do_scale to False but your function is great. I am going to be using this from now on if you don’t mind. Thanks!

msmedes · July 19, 2018, 9:24pm

For this approach are you one-hot encoding the targets? (assuming you are classifying)

rwfilice · July 20, 2018, 2:24pm

Sorry for the late response - but of course you can use the code - glad it’s helpful!

rwfilice · July 20, 2018, 4:39pm

My targets were continuous variables - not classification.

msmedes · July 20, 2018, 5:09pm

Gotcha, I have 4 classification categories (1-4 in a single df column) which I’m one hot encoding into 4 different columns. When I try to create a ColumnarModelData object I get the error TypeError: only integer scalar arrays can be converted to a scalar index, which I’m assuming has something to do with the shape of y, but I can’t seem to get it into a shape it accepts. Any ideas?
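
For reference, one-hot encoding a single label column with values 1 to 4 into an array of shape (no of rows, 4) can be sketched as follows. The column name `label` and the data are made up, and this only shows how to build the target array itself, not whether ColumnarModelData accepts it.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one categorical target taking values 1-4.
df = pd.DataFrame({"label": [1, 3, 4, 2, 1]})

n_classes = 4
# Shift labels to 0-based codes, then pick the matching rows of the identity matrix.
codes = df["label"].values - 1
y_onehot = np.eye(n_classes, dtype=np.float32)[codes]

print(y_onehot.shape)  # (5, 4)
```

Each row of `y_onehot` has exactly one 1, in the column of its class, which matches the one-column-per-class layout mentioned earlier for binary cross entropy.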

tcapelle · September 11, 2018, 12:10pm

Hello, I am trying to do the same. Can you post your call to

md.get_learner

Shouldn’t the is_reg variable be True, since this is a regression?

My actual problem may be solved with another approach: my input is 100 (x,y) points and my output is 100 (x,y) points. Any suggestions?

fakhirali · September 17, 2018, 7:07pm

If you know how to make a custom model as an nn.Module class then you can use
m = Learner.from_model_data(Model() , md)
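
As a rough sketch of what such a custom model could look like for the 100-points-in, 100-points-out case, the module below flattens each set of (x, y) points to 200 values and maps them to 200 outputs. The class name `PointsNet`, the hidden size and the flattening are assumptions made for illustration; only the plain PyTorch part is shown, not the wrapping into a learner.

```python
import torch
import torch.nn as nn

class PointsNet(nn.Module):
    """Hypothetical fully connected net: 100 (x, y) points in, 100 (x, y) points out."""

    def __init__(self, n_points=100, hidden=256):
        super().__init__()
        self.n_points = n_points
        self.net = nn.Sequential(
            nn.Linear(n_points * 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_points * 2),
        )

    def forward(self, x):
        # x: (batch, n_points, 2) -> flatten to (batch, 2 * n_points) -> predict
        out = self.net(x.view(x.size(0), -1))
        # Reshape the 2 * n_points outputs back into (x, y) pairs.
        return out.view(-1, self.n_points, 2)

model = PointsNet()
batch = torch.randn(8, 100, 2)  # made-up batch of 8 samples
pred = model(batch)
print(pred.shape)  # torch.Size([8, 100, 2])
```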

Sorry for the late response though. Feel free to ask any more questions.

tcapelle · September 18, 2018, 7:08am

I am actually using ConvLearner.from_model_data(simplenet, md). Looking at the code, Learner.from_model_data is:

    @classmethod
    def from_model_data(cls, m, data, **kwargs):
        self = cls(data, BasicModel(to_gpu(m)), **kwargs)
        self.unfreeze()
        return self

and ConvLearner inherits from Learner, so it is the same call.