I am using FastAI v0.7 for a text classifier that makes fairly accurate predictions. However, we got another data source that is numerical (6 numbers, standard-normalized using the training-set mean and std) and has been independently verified to have good predictive performance on the same task.
So, I was investigating how to combine the text and the numerical data, and how the combination performs. Here are some ideas I had:
1. For each example, concatenate the 6 numbers with the concat-pooling output in the PoolingLinearClassifier. So basically, the concat-pooling output would look like this:
`[output[-1], mxpool, avgpool, the_6_numbers]`
The rationale for the concatenation here was that I didn't want the numerical data to get pooled. I tried this, and the performance dropped significantly. I am not sure why, and any possible explanation would be appreciated.
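For clarity, here is a minimal sketch of what I mean, assuming the RNN output tensor has shape `(seq_len, batch, hidden)` as in fastai v0.7's head; the function name and shapes are illustrative, not the library's API:

```python
import torch

def concat_pool_with_nums(output, nums):
    # output: (seq_len, batch, hidden) RNN outputs; nums: (batch, 6) numeric features
    avgpool = output.mean(dim=0)
    mxpool = output.max(dim=0)[0]
    # concat pooling (last state + max pool + avg pool), with the raw 6 numbers appended
    return torch.cat([output[-1], mxpool, avgpool, nums], dim=1)

x = torch.randn(10, 4, 400)   # seq_len=10, batch=4, hidden=400
n = torch.randn(4, 6)         # the 6 standardized numbers per example
feat = concat_pool_with_nums(x, n)   # shape: (4, 3*400 + 6)
```

The 6 raw numbers end up as 6 features among 1206, which may also be part of why they get drowned out.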
2. Have a LinearBlock on top of the 6 numerical inputs, maybe of size 50 or 100, and then concatenate the output of this block with the concat-pooling output as in idea 1. I had tried a similar approach with a CNN-based classifier and it worked great; I have yet to investigate this experimentally here.
EDIT 1: I tried this with ULMFiT, but I get similar results to idea 1. The validation loss after 1 epoch was 8749143523.580736, while the training loss hovered in the range 0.4-0.6. I do not understand what's wrong.
3. Have 2 independent classifiers for the text data and the numerical data, both of which have exhibited good predictive performance independently, and then add a 2-layer classifier on top of the outputs of the first 2 classifiers.
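A sketch of the third idea, assuming both base classifiers are pretrained (and probably frozen) and emit per-class outputs; all names and dimensions here are made up for illustration:

```python
import torch
import torch.nn as nn

class EnsembleHead(nn.Module):
    # 2-layer classifier on top of the concatenated outputs of the
    # pretrained text and numerical classifiers
    def __init__(self, text_dim, num_dim, hidden, n_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + num_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, text_out, num_out):
        return self.net(torch.cat([text_out, num_out], dim=1))

head = EnsembleHead(text_dim=2, num_dim=2, hidden=50, n_classes=2)
logits = head(torch.randn(8, 2), torch.randn(8, 2))  # batch of 8
```

One design question with this setup is whether to feed the head logits, probabilities, or some earlier hidden layer of each base classifier.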
I would appreciate any feedback and comments on the best way to go about this.
EDIT 2: It turns out I was normalizing the validation data incorrectly, which made its distribution very different from the training data. That caused the erroneous training I mentioned in 1 and 2, so I would consider those issues closed. However, I would still welcome feedback on which of methods 1, 2, or 3 would be best.
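For anyone hitting the same bug: the key point is that the validation split must be normalized with statistics computed on the training split only. A minimal sketch (the helper name is mine):

```python
import numpy as np

def normalize_splits(train, valid):
    # compute mean/std on the training set only, then apply to both splits
    mean, std = train.mean(axis=0), train.std(axis=0)
    return (train - mean) / std, (valid - mean) / std

tr = np.random.randn(100, 6) * 5 + 3   # toy training features
va = np.random.randn(20, 6) * 5 + 3    # toy validation features
tr_n, va_n = normalize_splits(tr, va)  # va_n is NOT exactly zero-mean, and that's correct
```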
I am facing the exact same problem, where I want to combine metadata with the text source. I will start doing some experiments with fast.ai 1.0 and will let you know how it goes.
Have you found any helpful resources in the meantime?
So, I did manage to get it to work with fastai v0.7. The major change in the network was in the PoolingLinearClassifier, where I added two LinearBlocks for the numerical data, the output of which is combined with the text features as I mentioned above. There were some changes in the Dataset classes as well. We eventually shelved the idea as it didn't improve accuracy; in fact, the accuracy sometimes degraded slightly*. You can see the major changes I made here:
Let me know if you want to ask something about it.
*I am not sure why that would happen: the network should learn to disregard, or give a low weight to, the numerical input in that case, but it doesn't. I plotted the weights from the concat layer to the following fully connected layers, and the weights on the text part and the numerical part were more or less the same magnitude.
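For reference, the modified head looked roughly like this. This is a sketch rather than my exact diff; `LinearBlock` here is a stand-in for fastai v0.7's linear-plus-dropout block, and the sizes are illustrative:

```python
import torch
import torch.nn as nn

class LinearBlock(nn.Module):
    # stand-in for fastai v0.7's LinearBlock (dropout + linear)
    def __init__(self, ni, nf, drop=0.1):
        super().__init__()
        self.drop = nn.Dropout(drop)
        self.lin = nn.Linear(ni, nf)

    def forward(self, x):
        return self.lin(self.drop(x))

class PoolingLinearClassifierWithNums(nn.Module):
    # sketch: two LinearBlocks over the 6 numbers, whose output is
    # concatenated with the usual concat-pooled text features
    def __init__(self, hidden, n_classes, num_feats=6, num_hidden=50):
        super().__init__()
        self.num_net = nn.Sequential(
            LinearBlock(num_feats, num_hidden), nn.ReLU(),
            LinearBlock(num_hidden, num_hidden), nn.ReLU(),
        )
        self.head = nn.Linear(3 * hidden + num_hidden, n_classes)

    def forward(self, output, nums):
        # output: (seq_len, batch, hidden) RNN outputs; nums: (batch, 6)
        avgpool = output.mean(dim=0)
        mxpool = output.max(dim=0)[0]
        x = torch.cat([output[-1], mxpool, avgpool, self.num_net(nums)], dim=1)
        return self.head(x)

m = PoolingLinearClassifierWithNums(hidden=400, n_classes=2)
out = m(torch.randn(10, 4, 400), torch.randn(4, 6))
```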