Build mixed databunch and train end-to-end model for Tabular (categorical + continuous data) and Text data

Hey @Andreas_Daiminger, I was busy with other things and haven’t had a chance to look back at it. Do you have a list of the functionality you want (besides single data point prediction)? I have some time this weekend to play around with version 2 a bit, though I am not sure whether I can make v2 match all of v1’s functionality, because v2 is fundamentally different.

I am not sure what you mean by ‘to classify dep_var without matching label columns …’.
Can you give me an example of what your dataset is like?

So essentially, I’ve got a database of 48,000 e-commerce items. Vendors send me price sheets with updated pricing information on them, and none of the columns are ever labeled consistently. The only thing that is generally consistent is the SKU or model number, which I have trained as the dep_var from the database. These SKUs share the same row as the pricing information, but since the column labels can change, it makes things a bit tricky. Any insights would be greatly appreciated!

Hey @quan.tran!
I would like to use v2 in production, so everything related to that would be a top priority.
First single data point prediction, and then model.export (difficult … I know!!).
I had a look myself but could not come up with a simple way to make single data point prediction work. If you point me in the right direction, I can help you develop a solution.
Thanks for your continued interest!

So the column names are not consistent? Or the values within each column? I am still not sure what your dataset looks like. Can you provide the first 10 records of that dataset?


Hey! I finally had time to work on it a bit. The code for version 2 is factored into modules, there’s a new, cleaner notebook for it (https://github.com/anhquan0412/fastai-tabular-text-demo/blob/master/mercari-tabular-text-version-2-complete.ipynb), and the repo is updated. I have also added the predict_one_item function for version 2: all you need to do is provide a Series with column names (like the output of df.loc[some_index]) and it will spit out the prediction and the raw prediction. I haven’t tried it on a classification task yet, so give it a go and let me know if it works!
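For illustration, a minimal usage sketch (the exact signature is in the notebook above; the call shape, `learner`, `df`, and `some_index` here are assumptions):

row = df.loc[some_index]  # one raw record; the Series index holds the column names
pred, raw_pred = predict_one_item(learner, row)  # assumed call shape: returns the prediction and the raw model output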

And about the export function, I am not sure what the purpose of that function is: do you want to save the model so that you can load it somewhere else? Is it somewhat similar to ‘model.save()’?


@quan.tran Wow that was fast! Thanks a lot for the quick response!

The purpose of the export function is to prepare the model for inference. It’s like a lighter version of the learner: it can forget about learner.data and only needs to remember the model, its weights, and the transforms and normalization used on the training data.
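For reference, this is roughly how the built-in export/load cycle looks for a plain fastai v1 learner (a sketch only; `learn`, `path`, and `one_item` are placeholders, and the hybrid learner does not support this yet):

from fastai.basic_train import load_learner

# export() serializes the model, its weights, and the data transforms /
# normalization stats, but drops the training data itself
learn.export('export.pkl')

# later, in an inference-only process:
inf_learn = load_learner(path)      # `path` is the folder containing export.pkl
pred = inf_learn.predict(one_item)  # single data point prediction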

So I just took a look at the export function in the fastai docs + source code, and I have an approach, though I’m not sure it’s going to work: since v2 is basically just a combination of a Tabular learner and a Text learner with a concat head, you can export these two learners using the existing export() function and (now the hard part) write a function to join them back with the concat head. All the data transformation will be taken care of by these two learners, and the concat head is just an nn.Sequential.
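To make that concrete, here is a minimal, hypothetical sketch of the “join them back” step; ConcatHead and the body/head split are illustrative names, not the repo’s actual classes:

import torch
import torch.nn as nn

class ConcatHead(nn.Module):
    "Illustrative hybrid module: two reloaded bodies joined by the trained concat head."
    def __init__(self, tab_body, text_body, head):
        super().__init__()
        self.tab_body, self.text_body, self.head = tab_body, text_body, head

    def forward(self, tab_inputs, text_inputs):
        tab_feats = self.tab_body(*tab_inputs)    # features from the exported tabular model
        text_feats = self.text_body(text_inputs)  # features from the exported text model
        # the concat head itself is just an nn.Sequential over the joined features
        return self.head(torch.cat([tab_feats, text_feats], dim=1))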


@quan.tran I understand. I can give it a try, but I have not done a lot of low-level PyTorch programming, so this is hard for me.
I visualised the model architecture of v2. This might be helpful for new collaborators who want a quick high-level overview.

Okay, sorry it took so long to get back to you. So… I can’t give you the exact items requested, as there are privacy standards we need to adhere to. But I did create a mock training set to show you what the 48,000 rows look like, as well as a typical example of the format in which we receive pricing changes and new items.

https://drive.google.com/drive/folders/1PAjj0l2n0AH_VukMMjLo6HK0u0oRCIcE?usp=sharing

I am still a bit unsure about your dataset, but overall: if the class that you are trying to predict (dep_var) is subject to change, then there is no point in classifying it at all, but I don’t think this is what you mean. If the ‘labels’ are not consistent (by labels I guess you are referring to numerical/categorical features), it’s kinda tough: I guess you can take the most recent snapshot of the dataset, or try to pick the most appropriate values for those inconsistent features. In the end, the dataset has to be consistent (with a moderate amount of erroneous/mislabeled records) for any deep neural net to work well.

By label I mean the name of the column in which the piece of information resides; the same text characters would still appear in the same order. I guess the flow of decisions that need to be made looks something like this (a rough pandas sketch follows the list):

  1. For each dep_var, find and match it to an existing dep_var in the training set; if the character set does not exist in training, create a new one.
  2. For each matched dep_var, concat its row, with the information provided in the test set, to the training set.
  3. For each new dep_var, create a row with the predicted information from the rest of the test set being passed in.
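Roughly, in pandas terms, something like this (the frames and the ‘sku’ column are hypothetical stand-ins for the real data):

import pandas as pd

# hypothetical stand-ins for the 48,000-row database and a vendor price sheet
train_df = pd.DataFrame({'sku': ['A1', 'B2'], 'price': [9.99, 4.50]})
sheet = pd.DataFrame({'sku': ['B2', 'C3'], 'price': [4.75, 12.00]})

# step 1: split the incoming sheet into SKUs already known vs. brand new ones
known = train_df['sku'].unique()
matched = sheet[sheet['sku'].isin(known)]     # existing SKUs with fresh pricing
new_rows = sheet[~sheet['sku'].isin(known)]   # unseen SKUs become new dep_var classes

# steps 2-3: append both groups to the training set
train_df = pd.concat([train_df, matched, new_rows], ignore_index=True)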

The reason I originally asked whether this was the thing to use was that I saw the mixed databunch, which would include text, and I thought the tokenization might be able to help with the first portion.

Hi! Is there an update on this project? Did you figure out how to save and load the model?

Hi @dohait,
Currently I am not working on this project anymore. However, since fastai version 2 has been released and I am learning the source code, I might convert this to the new library version, and hopefully it will be easier to save and load models there. I will update this thread once I begin writing it!


Hey @quan.tran

I have a use case where I am also looking to combine text + tabular data. I’m interested in developing an unsupervised architecture to learn embeddings for the text and tabular data together that can later be used in downstream tasks.

I’m also just getting up to speed with fastai v2. I’d be interested in getting involved and developing things further.


Sure, I will keep you in the loop. Right now I have to focus more on work, but I will get back to this eventually.

Hi @quan.tran, I have been testing your code in mercari-tabular-text-version-2-complete.ipynb on both the mercari dataset for regression and my own dataset for classification, where I change metrics=[root_mean_squared_error] to

f1 = FBeta()
precision = Precision()
recall = Recall()
metrics = [accuracy, precision, recall, f1]

When I ran

lin_layers = [500]
ps = [0.]

# 50 is the default lin_ftrs in AWD_LSTM
lin_layers[-1] += 50 if 'lin_ftrs' not in text_params else text_params['lin_ftrs']

# be careful here: if no lin_ftrs is specified, the default lin_ftrs is 50
learner = get_tabtxt_learner(data, tab_learner, text_learner, lin_layers, ps)

learner.freeze()
learner.fit_one_cycle(5, 1e-2, moms=(0.8, 0.7))

I encounter this error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-67-257eb45edebd> in <module>()
      1 learner.freeze()
----> 2 learner.fit_one_cycle(5, 1e-2, moms=(0.8, 0.7))

7 frames
/usr/local/lib/python3.7/dist-packages/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, final_div, wd, callbacks, tot_epochs, start_epoch)
     21     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor, pct_start=pct_start,
     22                                        final_div=final_div, tot_epochs=tot_epochs, start_epoch=start_epoch))
---> 23     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
     24 
     25 def fit_fc(learn:Learner, tot_epochs:int=1, lr:float=defaults.lr,  moms:Tuple[float,float]=(0.95,0.85), start_pct:float=0.72,

/usr/local/lib/python3.7/dist-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    198         else: self.opt.lr,self.opt.wd = lr,wd
    199         callbacks = [cb(self) for cb in self.callback_fns + listify(defaults.extra_callback_fns)] + listify(callbacks)
--> 200         fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
    201 
    202     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

/usr/local/lib/python3.7/dist-packages/fastai/basic_train.py in fit(epochs, learn, callbacks, metrics)
    104             if not cb_handler.skip_validate and not learn.data.empty_val:
    105                 val_loss = validate(learn.model, learn.data.valid_dl, loss_func=learn.loss_func,
--> 106                                        cb_handler=cb_handler, pbar=pbar)
    107             else: val_loss=None
    108             if cb_handler.on_epoch_end(val_loss): break

/usr/local/lib/python3.7/dist-packages/fastai/basic_train.py in validate(model, dl, loss_func, cb_handler, pbar, average, n_batch)
     61             if not is_listy(yb): yb = [yb]
     62             nums.append(first_el(yb).shape[0])
---> 63             if cb_handler and cb_handler.on_batch_end(val_losses[-1]): break
     64             if n_batch and (len(nums)>=n_batch): break
     65         nums = np.array(nums, dtype=np.float32)

/usr/local/lib/python3.7/dist-packages/fastai/callback.py in on_batch_end(self, loss)
    306         "Handle end of processing one batch with `loss`."
    307         self.state_dict['last_loss'] = loss
--> 308         self('batch_end', call_mets = not self.state_dict['train'])
    309         if self.state_dict['train']:
    310             self.state_dict['iteration'] += 1

/usr/local/lib/python3.7/dist-packages/fastai/callback.py in __call__(self, cb_name, call_mets, **kwargs)
    248         "Call through to all of the `CallbakHandler` functions."
    249         if call_mets:
--> 250             for met in self.metrics: self._call_and_update(met, cb_name, **kwargs)
    251         for cb in self.callbacks: self._call_and_update(cb, cb_name, **kwargs)
    252 

/usr/local/lib/python3.7/dist-packages/fastai/callback.py in _call_and_update(self, cb, cb_name, **kwargs)
    239     def _call_and_update(self, cb, cb_name, **kwargs)->None:
    240         "Call `cb_name` on `cb` and update the inner state."
--> 241         new = ifnone(getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs), dict())
    242         for k,v in new.items():
    243             if k not in self.state_dict:

/usr/local/lib/python3.7/dist-packages/fastai/metrics.py in on_batch_end(self, last_output, last_target, **kwargs)
    158         if self.cm is None: self.cm = torch.zeros((self.n_classes, self.n_classes), device=torch.device('cpu'))
    159         cm_temp_numpy = self.cm.numpy()
--> 160         np.add.at(cm_temp_numpy, (targs ,preds), 1)
    161         self.cm = torch.from_numpy(cm_temp_numpy)
    162 

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

Do you think it’s a problem with the code or with my data? Any advice would be appreciated!

Hi @Andreas_Daiminger,
Fascinating work! Do you have any updates on productionizing the hybrid learner, especially on the newest version of fastai?

I am facing a decision about productionizing a fastai tabular model: either use unsupervised sentence embeddings for the text columns, or use the hybrid learner (which is pretty buggy at this point).

I would love to chat more about this if you are available.


Hi @wjlgatech !

I think the project still has a lot of potential, but it has not been actively maintained. We would have to put a bunch of work into it to make it work with the latest fastai. I would be happy to get back to this.

@Andreas_Daiminger it’s been a while, how are you doing, Andreas? I’d be glad to get back to this as well. @wjlgatech, I have only recently looked at the infrastructure of the new fastai library, and it’s going to take a while to reimplement this, though I think there’s a lot of potential here.
