Build mixed databunch and train end-to-end model for Tabular (categorical + continuous data) and Text data

So the column names are not consistent, or the values within each column? I am still not sure what your dataset looks like. Can you provide the first 10 records of that dataset?

Hey! I finally had time to work on it a bit. The code for version 2 is factored into modules, there’s a new, cleaner notebook for it (https://github.com/anhquan0412/fastai-tabular-text-demo/blob/master/mercari-tabular-text-version-2-complete.ipynb), and the repo is updated. I have also added the predict_one_item function for version 2: all you need to do is provide a series with column names (like the output of df.loc[some_index]) and it will spit out the prediction and raw_prediction. I haven’t tried it on a classification task yet, so give it a go and let me know if it works!
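
Usage looks roughly like this (a minimal sketch using the df and learner from the notebook; the exact call shape may differ slightly, and index 42 is arbitrary):

```python
# pass one row of the dataframe as a pandas Series, column names included,
# which is exactly what df.loc gives you
row = df.loc[42]

# returns the decoded prediction plus the raw model output
pred, raw_pred = learner.predict_one_item(row)
print(pred, raw_pred)
```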

And about the export function, I am not sure what the purpose of that function is: do you want to save the model so that you can load it somewhere else? Is it somewhat similar to ‘model.save()’?

@quan.tran Wow that was fast! Thanks a lot for the quick response!

The purpose of the export function is to prepare the model for inference. It’s like a lighter version of the learner: it can forget about learner.data and only needs to remember the model + its weights, plus the transforms it used (e.g. the normalization from the training data).
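
For a single learner, fastai v1 already gives you that flow (a quick sketch; learn, path, and item stand for whatever was trained and whatever you want to score):

```python
from fastai.basic_train import load_learner

# export() drops learner.data and pickles the model, its weights, and the
# data transforms into a single file
learn.export('export.pkl')

# later, rebuild an inference-only learner from that file and predict
learn_inf = load_learner(path, 'export.pkl')
pred = learn_inf.predict(item)
```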

So I just took a look at the export function in the fastai docs + source code, and I have an approach, though I’m not sure if it’s gonna work: since v2 is basically just a combination of a Tabular learner and a Text learner with a concat head, you can export these 2 learners using the existing export() function and (now the hard part) write a function to join them back with the concat head. All the data transformation will be taken care of by these 2 learners, and the concat head is just an nn.Sequential.
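
Something like this is what I have in mind (a rough, untested sketch; the sizes below are placeholders, and the real ones would come from the two learners’ final layers):

```python
import torch
import torch.nn as nn

# hypothetical sizes: outputs of the tabular body, the text body, and the task
n_tab_out, n_text_out, n_out = 500, 50, 1

class ConcatModel(nn.Module):
    "Run both exported bodies, concatenate their outputs, apply the shared head."
    def __init__(self, tab_model, text_model, head):
        super().__init__()
        self.tab_model, self.text_model, self.head = tab_model, text_model, head

    def forward(self, tab_x, text_x):
        t = self.tab_model(*tab_x)     # tabular body takes (x_cat, x_cont)
        x = self.text_model(text_x)    # v1 text models may return a tuple
        if isinstance(x, tuple): x = x[0]
        return self.head(torch.cat([t, x], dim=1))

# the concat head really is just an nn.Sequential
head = nn.Sequential(nn.ReLU(), nn.Linear(n_tab_out + n_text_out, n_out))
```

The joining function would then call load_learner() twice and wrap the two .model attributes in something like ConcatModel.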

@quan.tran I understand. I can give it a try, but I have not done a lot of low-level PyTorch programming, so this is hard for me.
I visualised the model architecture of v2. This might be helpful for new collaborators who want a quick high-level overview.

Okay, sorry it took so long to get back to you. So… I can’t give you the exact items requested, as there are privacy standards that we need to adhere to. But I did create a mock training set to show you what the 48,000 rows look like, as well as a general example of how we receive pricing changes and new items.

https://drive.google.com/drive/folders/1PAjj0l2n0AH_VukMMjLo6HK0u0oRCIcE?usp=sharing

I am still a bit unsure about your dataset, but overall: if the class that you are trying to predict (dep_var) is subject to change, then there is no point in classifying it at all, but I don’t think this is what you mean. If the labels are not consistent (by label I guess you are referring to the numerical/categorical features), it’s kinda tough: I guess you can take the most recent snapshot of the dataset, or try to pick the most appropriate values for those inconsistent features. In the end, the dataset has to be consistent (with only a moderate amount of errors/mislabeled records) for any deep neural net to work well.

By label I mean the name of the column in which the piece of information resides, though the same text characters would appear in the same order. I guess the flow of decisions that need to be made looks something like this (a rough pandas sketch follows the list):

  1. for each dep_var, find & match it to an existing dep_var in the training set; if the character set does not exist in training, create a new one.
  2. for each dep_var, concat the row with the information provided in the test set onto the training set.
  3. for each new dep_var, create a row with information predicted from the rest of the test set being passed.
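
In pandas terms, something like this (a hedged sketch: dep_var and the toy frames below are made-up names from this discussion, not my real columns):

```python
import pandas as pd

# toy stand-ins; the real train/test frames come from the shared folder
train = pd.DataFrame({'dep_var': ['a', 'b'], 'price': [1.0, 2.0]})
test = pd.DataFrame({'dep_var': ['b', 'c'], 'price': [2.1, None]})

known = set(train['dep_var'])

# 1. split test rows into those whose dep_var already exists in training
#    and those that introduce a brand-new class
seen = test['dep_var'].isin(known)
matched, new_items = test[seen], test[~seen]

# 2. fold the matched rows into the training set
train = pd.concat([train, matched], ignore_index=True)

# 3. each new dep_var gets a row whose missing fields come from the model's
#    prediction on the rest of the test row (placeholder below)
for _, row in new_items.iterrows():
    pass  # run the trained model on `row` and append its prediction
```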

The reason why I originally asked if this was the thing to use was because I saw the mixed databunch, which would include text, and I thought that the tokenization might be able to help with the first step.

Hi! Is there an update on this project? Did you figure out how to save and load the model?

Hi @dohait,
Currently I am not working on this project anymore. However, since fastai version 2 has been released and I am learning its source code, I might convert this to the new library version, and hopefully it will be easier to save and load the model there. I will update this thread once I begin writing it!

Hey @quan.tran

I have a use case where I am also looking to combine text + tabular data. I’m interested in developing an unsupervised architecture to learn embeddings for the text and tabular data together that can be later used in downstream tasks.

I’m also just getting up to speed with fastai v2. I’d be interested in getting involved and developing things further.

Sure, I will keep you in the loop. Right now I have to focus more on work but I will get back to this eventually.

Hi @quan.tran, I have been testing your code from mercari-tabular-text-version-2-complete.ipynb on both the mercari dataset for regression and my own dataset for classification, where I change metrics=[root_mean_squared_error] to

f1 = FBeta()
precision = Precision()
recall = Recall()
metrics = [accuracy, precision, recall, f1]

When I ran

lin_layers = [500]
ps = [0.]

# be careful here: if no lin_ftrs is specified, the default lin_ftrs in AWD_LSTM is 50
lin_layers[-1] += 50 if 'lin_ftrs' not in text_params else text_params['lin_ftrs']

learner = get_tabtxt_learner(data, tab_learner, text_learner, lin_layers, ps)

learner.freeze()
learner.fit_one_cycle(5, 1e-2, moms=(0.8, 0.7))

I encounter this error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-67-257eb45edebd> in <module>()
      1 learner.freeze()
----> 2 learner.fit_one_cycle(5, 1e-2, moms=(0.8, 0.7))

7 frames
/usr/local/lib/python3.7/dist-packages/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, final_div, wd, callbacks, tot_epochs, start_epoch)
     21     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor, pct_start=pct_start,
     22                                        final_div=final_div, tot_epochs=tot_epochs, start_epoch=start_epoch))
---> 23     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
     24 
     25 def fit_fc(learn:Learner, tot_epochs:int=1, lr:float=defaults.lr,  moms:Tuple[float,float]=(0.95,0.85), start_pct:float=0.72,

/usr/local/lib/python3.7/dist-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    198         else: self.opt.lr,self.opt.wd = lr,wd
    199         callbacks = [cb(self) for cb in self.callback_fns + listify(defaults.extra_callback_fns)] + listify(callbacks)
--> 200         fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
    201 
    202     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

/usr/local/lib/python3.7/dist-packages/fastai/basic_train.py in fit(epochs, learn, callbacks, metrics)
    104             if not cb_handler.skip_validate and not learn.data.empty_val:
    105                 val_loss = validate(learn.model, learn.data.valid_dl, loss_func=learn.loss_func,
--> 106                                        cb_handler=cb_handler, pbar=pbar)
    107             else: val_loss=None
    108             if cb_handler.on_epoch_end(val_loss): break

/usr/local/lib/python3.7/dist-packages/fastai/basic_train.py in validate(model, dl, loss_func, cb_handler, pbar, average, n_batch)
     61             if not is_listy(yb): yb = [yb]
     62             nums.append(first_el(yb).shape[0])
---> 63             if cb_handler and cb_handler.on_batch_end(val_losses[-1]): break
     64             if n_batch and (len(nums)>=n_batch): break
     65         nums = np.array(nums, dtype=np.float32)

/usr/local/lib/python3.7/dist-packages/fastai/callback.py in on_batch_end(self, loss)
    306         "Handle end of processing one batch with `loss`."
    307         self.state_dict['last_loss'] = loss
--> 308         self('batch_end', call_mets = not self.state_dict['train'])
    309         if self.state_dict['train']:
    310             self.state_dict['iteration'] += 1

/usr/local/lib/python3.7/dist-packages/fastai/callback.py in __call__(self, cb_name, call_mets, **kwargs)
    248         "Call through to all of the `CallbakHandler` functions."
    249         if call_mets:
--> 250             for met in self.metrics: self._call_and_update(met, cb_name, **kwargs)
    251         for cb in self.callbacks: self._call_and_update(cb, cb_name, **kwargs)
    252 

/usr/local/lib/python3.7/dist-packages/fastai/callback.py in _call_and_update(self, cb, cb_name, **kwargs)
    239     def _call_and_update(self, cb, cb_name, **kwargs)->None:
    240         "Call `cb_name` on `cb` and update the inner state."
--> 241         new = ifnone(getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs), dict())
    242         for k,v in new.items():
    243             if k not in self.state_dict:

/usr/local/lib/python3.7/dist-packages/fastai/metrics.py in on_batch_end(self, last_output, last_target, **kwargs)
    158         if self.cm is None: self.cm = torch.zeros((self.n_classes, self.n_classes), device=torch.device('cpu'))
    159         cm_temp_numpy = self.cm.numpy()
--> 160         np.add.at(cm_temp_numpy, (targs ,preds), 1)
    161         self.cm = torch.from_numpy(cm_temp_numpy)
    162 

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

Do you think it’s a problem with the code or with my data? Any advice would be appreciated!

Hi @Andreas_Daiminger,
Fascinating work! Do you have any updates on productionizing the hybrid learner, especially on the newest version of fastai?

I am facing a decision about productionizing a fastai tabular model: either use unsupervised sentence embeddings for the text columns, or use the hybrid learner (which is pretty buggy at this point).

I would love to chat more about this if you are available.

Hi @wjlgatech!

I think the project still has a lot of potential, but it has not been actively maintained. We would have to put a bunch of work into it to make it work with the latest fastai. I would be happy to get back to this.

@Andreas_Daiminger it’s been a while, how are you doing Andreas? I’d be glad to get back to this as well. @wjlgatech I have only recently looked at the infrastructure of the new fastai library, and it’s gonna take a while to reimplement this, though I think there’s a lot of potential here.

Hi @quan.tran and @Andreas_Daiminger,

Great to hear of your interest! I am very motivated to put some effort in this direction, for a few reasons:

  • in many real-life situations (e.g. recommendation, social media ranking), data is getting more and more hybrid (continuous, categorical, datetime, text, image, voice…). The demand is growing! I recently saw some cases at work and had to use a workaround (as a temporary solution)

  • this project can naturally be followed up with fastai hyperparameter tuning and fastai classifier calibration as post-processing steps to further improve the model’s performance.

  • writing it up in a blog post would give this work great visibility, an obvious big boost to our career/market value :smile:

In short, it’s a high-value hackathon project. If you are interested, we can have a video call this weekend to talk about the details. I can send you an invite through LinkedIn.

Just thought I’d throw this down here to help you guys out: I did do something like this already in fastai v2 :slight_smile: I’m not 100% sure if it still works today, but you should be able to steal the concept from it, as it’s fairly straightforward.

@muellerzr awesome work! I really appreciate it. I will try it out and keep you updated. @quan.tran and @Andreas_Daiminger, welcome to the party to make it production-ready if you are interested.

Right on! Let’s give this a second wind.
Awesome work @muellerzr.
The fastai community keeps amazing me.

@wjlgatech I am off the grid this weekend. But happy to get on a call next week.

@quan.tran Good to hear from you! A lot has happened. Got acquired two times in less than a year :crazy_face:. But your work on the Concat Model is still one of the most interesting things I have come across lol! The combination of structured and unstructured data deserves attention!
