Share your work here ✅

joshfp · November 16, 2018, 8:48pm

Hi,
After lesson 4, I tried to combine tabular data with NLP, specifically in Spanish.

I took a tabular dataset from an e-commerce marketplace with the objective of predicting products’ condition (new or used) based on listings’ features. It includes 100k records and after some data pre-processing (not included in the attached notebooks), I ended up with 30 features, including: 17 categorical, 12 continuous and 1 text field (listing’s title).

The process included:

1. Creating a tabular model without the text feature (accuracy: 91.5%).
2. Creating an NLP model to predict from the listing title:
2.2. Training a Spanish language model from scratch: I used a Wiki corpus trimmed to around 130 million tokens (training for 6 epochs took 10 hours on a GTX 1080 Ti, reaching an accuracy of 30.5%).
2.3. Applying ULMFiT: first training a domain language model (accuracy: 34.3%) and then the classifier itself (accuracy: 81.5%). The classifier was then used to predict on the entire data set (probability of the product being new given the title).
3. Creating a new tabular model, this time adding the NLP model’s prediction as a new feature (final accuracy: 92.4%).
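The last step boils down to appending one continuous column to the tabular features. Here is a minimal sketch of that idea in NumPy, not the code from the notebooks: the function name `add_text_prediction` and the toy numbers are made up for illustration, and the probabilities stand in for whatever the title classifier outputs.

```python
import numpy as np


def add_text_prediction(tabular_features, p_new_from_title):
    """Append P(new | title) from the text model as one extra column.

    Hypothetical helper for illustration; both arguments are assumed to be
    row-aligned, i.e. row i of each refers to the same listing.
    """
    features = np.asarray(tabular_features, dtype=float)
    probs = np.asarray(p_new_from_title, dtype=float).reshape(-1)
    if probs.shape[0] != features.shape[0]:
        raise ValueError("need exactly one probability per listing")
    if np.any((probs < 0.0) | (probs > 1.0)):
        raise ValueError("probabilities must lie in [0, 1]")
    # The new column goes last, so existing column indices stay valid.
    return np.column_stack([features, probs])


# Toy data: 3 listings with 2 already-encoded tabular features each.
toy_features = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0]])
toy_probs = np.array([0.9, 0.2, 0.5])  # made-up classifier outputs
stacked = add_text_prediction(toy_features, toy_probs)
```

One thing to keep in mind with this kind of stacking: if the text classifier predicts on rows it was trained on, the new feature can look better on those rows than it will on unseen listings, so generating it out-of-fold is a common precaution.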

I also tried extracting the last linear layer’s activations (50) from the NLP model and feeding them into the tabular model, but it didn’t improve accuracy. Something I didn’t get around to trying was removing the output layer of both models, concatenating their outputs and feeding them into a linear model (unlike my simpler approach, this would backprop through both models).
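For anyone curious, the untried idea could look roughly like the PyTorch sketch below. This is only an illustration under assumptions: `ConcatHead` is a made-up name, and the two small `nn.Sequential` encoders are stand-ins for the real tabular and ULMFiT models with their output layers removed (the layer sizes are arbitrary, apart from the 50 activations mentioned above).

```python
import torch
import torch.nn as nn


class ConcatHead(nn.Module):
    """Joins two encoders with a single linear classifier (hypothetical)."""

    def __init__(self, tab_encoder, text_encoder, tab_dim, text_dim, n_classes=2):
        super().__init__()
        self.tab_encoder = tab_encoder
        self.text_encoder = text_encoder
        self.classifier = nn.Linear(tab_dim + text_dim, n_classes)

    def forward(self, tab_x, text_x):
        tab_feats = self.tab_encoder(tab_x)      # (batch, tab_dim)
        text_feats = self.text_encoder(text_x)   # (batch, text_dim)
        joint = torch.cat([tab_feats, text_feats], dim=1)
        return self.classifier(joint)            # (batch, n_classes) logits


torch.manual_seed(0)
# Stand-in encoders; the real ones would be the trained models minus their heads.
tab_encoder = nn.Sequential(nn.Linear(29, 16), nn.ReLU())
text_encoder = nn.Sequential(nn.Linear(50, 50), nn.ReLU())
model = ConcatHead(tab_encoder, text_encoder, tab_dim=16, text_dim=50)

# Random toy batch of 8 listings with binary labels (new / used).
tab_x = torch.randn(8, 29)
text_x = torch.randn(8, 50)
y = torch.randint(0, 2, (8,))

logits = model(tab_x, text_x)
loss = nn.functional.cross_entropy(logits, y)
loss.backward()  # gradients reach both encoders, not just the linear head
```

Because the concatenation sits inside one module, a single optimizer over `model.parameters()` would update both encoders and the head together, which is the difference from passing a fixed prediction in as a feature.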

In this case, the effort of training the NLP model (particularly the Spanish model from scratch) improved accuracy by only about 0.9 percentage points (91.5% to 92.4%). However, it was a nice learning exercise, and now I have a pre-trained Spanish model that will hopefully be useful for other projects thanks to ULMFiT.


https://gist.github.com/joshfp/b62b76eae95e6863cb511997b5a63118

Notebooks in the gist:

1.tabular.ipynb - Model based on tabular data only
2.lm-spanish.ipynb - ULMFiT: Train spanish LM
3.nlp.ipynb - NLP model to predict from title (ULMFiT)