Share your work here ✅

Hi,
After lesson 4, I tried to combine tabular data with NLP, particularly in spanish.

I took a tabular dataset from an e-commerce marketplace with the objective of predicting products’ condition (new or used) based on listings’ features. It includes 100k records and after some data pre-processing (not included in the attached notebooks), I ended up with 30 features, including: 17 categorical, 12 continuous and 1 text field (listing’s title).

The process included:

  1. Creating a tabular model without the text feature (accuracy: 91.5%).
  2. Creating an NLP model to predict from the listing title:
    2.2. Training a language model in spanish from scratch: I used a Wiki corpus trimmed to around 130 million tokens (training for 6 epochs tooks 10 hours on a GTX 1080TI, reaching an accuracy of 30.5%).
    2.3. Appling ULMFiT: First training a domain language model (accuracy: 34.3%) and then classifier itself (accuracy: 81.5%). Then, the classifier was used to predict on the entire data set (probability of the product being new given the title).
  3. Creating a new tabular model, this time adding as a new feature the prediction coming from the NLP model (final accuracy: 92.4%).

I tried also extracting the last linear layer’s activations (50) from the NLP model and feeding them in the tabular model, but it didn’t improve accuracy. Something that I didn’t reach to try was removing the output layer of both models, concatenating the outputs and feed it in a linear model (unlike my simpler model, this would backprop to both models).

In this case, the effort of training the NLP model (particularly the spanish model from scratch) just improved something below 1%. However, it was nice learning exercise and now I have a spanish pre-trained model, that hopefully will be useful for others projects thanks to ULMFiT. :slight_smile:

25 Likes