A code snippet for regression with structured data + text

ranih · September 21, 2018, 4:52am

Hey everyone!

I wrote a new module, based on MultiBatchRNN and MixedInputModel, for regression (or classification) on structured data combined with a text input. The module handles three types of data: categorical, continuous, and text tokens. The final module is:

class RNN_Structured_regressor(nn.Module):

    def __init__(self, text_bptt, text_max_seq, text_ntoken, text_emb_sz, text_n_hid, text_n_layers, text_pad_token, 
                 struct_emb_szs, struct_n_cont, y_range, struct_layers_szs=[1000,500]):
        super().__init__()
        self.rnn_enc = MyMultiBatchRNN(bptt=text_bptt, max_seq=text_max_seq, ntoken=text_ntoken, emb_sz=text_emb_sz, 
                                     n_hid=text_n_hid, n_layers=text_n_layers, pad_token=text_pad_token, 
                                     dropouth=0.3, dropouti=0.65, dropoute=0.1, wdrop=0.5, qrnn=False) 
        
        self.structured_model = MixedInputModelWithText(struct_emb_szs, struct_n_cont, emb_drop=0.04, out_sz=1, 
                                                        szs=struct_layers_szs, drops=[0.001,0.01], y_range=y_range, 
                                                        use_bn=False, is_reg=True, is_multi=False, n_text=text_emb_sz)
        
    def forward(self, x_cat, x_cont, text_inp):
        raw_outputs, outputs = self.rnn_enc(torch.t(text_inp))
        encoded_text = outputs[-1][-1]  # last layer, last time step (TODO: add max pooling)
        return self.structured_model(x_cat, x_cont, encoded_text)
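
The "add max pooling" note in forward could be handled roughly as follows. This is only a sketch, not part of the gist: pool_encoder_output is a hypothetical helper, and it assumes the last-layer output has shape (seq_len, batch, emb_sz), as it does after the torch.t call above. It concatenates the last time step with a max pool and a mean pool over time, similar in spirit to the concat pooling in fastai's text classifier.

```python
import torch

def pool_encoder_output(output):
    """Hypothetical helper: (seq_len, batch, emb_sz) -> (batch, 3 * emb_sz)."""
    last = output[-1]                 # final time step, shape (batch, emb_sz)
    max_pool = output.max(dim=0)[0]   # max over the time dimension
    avg_pool = output.mean(dim=0)     # mean over the time dimension
    return torch.cat([last, max_pool, avg_pool], dim=1)

# Toy tensor standing in for outputs[-1] from the RNN encoder.
seq_len, bs, emb_sz = 5, 2, 4
output = torch.arange(seq_len * bs * emb_sz, dtype=torch.float32).view(seq_len, bs, emb_sz)
pooled = pool_encoder_output(output)
print(pooled.shape)
```

If something like this replaced outputs[-1][-1], the n_text argument passed to MixedInputModelWithText would presumably need to become 3 * text_emb_sz. Note that this simple version also pools over padding positions.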

The entire code can be found here:

https://gist.github.com/ranihorev/8a25b8038f14c96cbba5fc3717247245

Structured_with_text.py

from fastai.text import *
from fastai.structured import proc_df
import pandas as pd
import numpy as np

class MixedInputModelWithText(nn.Module):
    def __init__(self, emb_szs, n_cont, emb_drop, out_sz, szs, drops,
                 y_range=None, use_bn=False, is_reg=True, is_multi=False, n_text=0):
        super().__init__()
        for i, (c, s) in enumerate(emb_szs): assert c > 1, f"cardinality must be >=2, got emb_szs[{i}]: ({c},{s})"

This file has been truncated; see the gist for the full source.

my_dataset.py

class MyDataset(Dataset):
    def __init__(self, cats, conts, texts, y, is_reg, is_multi, reverse_text=False):
        n = len(cats[0]) if cats else len(conts[0])
        self.cats  = np.stack(cats,  1).astype(np.int64) if cats  else np.zeros((n,1))
        self.conts = np.stack(conts, 1).astype(np.float32) if conts else np.zeros((n,1))
        self.texts = np.zeros((n,1)) if texts is None else np.array(texts)
        self.y     = np.zeros((n,1)) if y is None else np.array(y).reshape(-1, 1).astype(np.float32)
        if is_reg:
            self.y =  self.y[:,None]
        self.is_reg = is_reg

This file has been truncated; see the gist for the full source.

my_learner.py

class MyModel(BasicModel):
    def get_layer_groups(self):
        m=self.model
        return [m.rnn_enc, m.structured_model]
    
class MyLearner(Learner):
    def __init__(self, data, models, **kwargs):
        super().__init__(data, models, **kwargs)

    def _get_crit(self, data):

This file has been truncated; see the gist for the full source.

You can also initialize the RNN encoder with pretrained language-model weights (the Wikipedia LM, or any other LM) instead of starting from random weights:
load_model(learner.model.rnn_enc, '<your_LM>')
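
For anyone unsure what loading weights into just the encoder means, here is a plain PyTorch sketch of the same idea. It does not use fastai's load_model; TinyEncoder and TinyRegressor are hypothetical stand-ins for the real modules, and an in-memory buffer replaces the weights file.

```python
import io
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Hypothetical stand-in for the RNN encoder (not fastai's MultiBatchRNN)."""
    def __init__(self, ntoken=50, emb_sz=8, n_hid=16):
        super().__init__()
        self.emb = nn.Embedding(ntoken, emb_sz)
        self.rnn = nn.LSTM(emb_sz, n_hid)

class TinyRegressor(nn.Module):
    """Hypothetical combined model: an encoder plus a regression head."""
    def __init__(self):
        super().__init__()
        self.rnn_enc = TinyEncoder()
        self.head = nn.Linear(16, 1)

# Pretend this encoder was pretrained as a language model, then saved.
pretrained = TinyEncoder()
buf = io.BytesIO()  # stands in for a weights file on disk
torch.save(pretrained.state_dict(), buf)
buf.seek(0)

# Load the saved weights into the encoder submodule only; the head keeps
# its random initialization.
model = TinyRegressor()
state = torch.load(buf, map_location="cpu")
model.rnn_enc.load_state_dict(state)

weights_match = all(
    torch.equal(p, q)
    for p, q in zip(pretrained.parameters(), model.rnn_enc.parameters())
)
print(weights_match)
```

The shapes have to line up for this to work, so the pretrained LM needs the same embedding size, hidden size, number of layers, and vocabulary size as the encoder it is loaded into.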

I’d love to get your feedback!
Hope it helps someone!

knesgood · September 21, 2018, 2:05pm

This looks very cool - I’ll be sure to check it out!

britton · December 14, 2018, 5:59am

This is exactly what I’m trying to do as well! I’m trying to run it with my dataset now and working through a few hiccups. I may post some questions for you here.

britton · January 5, 2019, 3:12am

Hey there, I’m currently working out the ‘build learner’ part of your code that calls RNN_Structured_regressor.

I’m defining things like batch size, emb size, etc. but I’m confused about that text_ntoken argument.

Is it the total number of tokens in the text columns of my dataset? Or is it per text column, or something? Do you know of a way to find that number? What does it do down the line in the code? I don’t see that argument actually used anywhere further down in the custom MultiBatchRNN that is called.

Thanks!

ranih · January 5, 2019, 6:25am

Oh, I forgot to add some of the hyper-parameters to that code (I added them below). text_ntoken is the size of your vocabulary (the number of distinct tokens in your dictionary), probably somewhere between 10k and 50k.

text_emb_sz = 400
text_n_hid = 1150 # size of hidden layer
text_n_layers = 3
text_pad_token = 1 # index of the padding token in the vocabulary
struct_layers_szs=[1000, 500]

wd=1e-7
text_bptt=16
text_max_seq=text_bptt*10
text_ntoken = len(itos) # itos: the index-to-token vocabulary list
bs=30
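
Since itos is not defined in the snippet above, here is a minimal sketch of how such a vocabulary could be built, assuming a single vocabulary shared by all text fields and already-tokenized documents. build_vocab, max_vocab, and min_freq are hypothetical names for illustration. Placing `_unk_` at index 0 and `_pad_` at index 1 is an assumption chosen to agree with text_pad_token = 1.

```python
from collections import Counter, defaultdict

def build_vocab(tokenized_docs, max_vocab=60000, min_freq=1):
    """Hypothetical helper: build itos (index -> token) and stoi (token -> index)."""
    freq = Counter(tok for doc in tokenized_docs for tok in doc)
    itos = [tok for tok, count in freq.most_common(max_vocab) if count >= min_freq]
    # Assumed special tokens: index 0 for unknown words, index 1 for padding.
    itos = ["_unk_", "_pad_"] + itos
    # Tokens missing from the vocabulary map to index 0 (_unk_).
    stoi = defaultdict(int, {tok: i for i, tok in enumerate(itos)})
    return itos, stoi

docs = [["the", "car", "is", "red"], ["the", "car", "is", "fast"]]
itos, stoi = build_vocab(docs)

text_ntoken = len(itos)          # 5 distinct tokens + 2 special tokens
text_pad_token = stoi["_pad_"]   # index of the padding token
token_ids = [[stoi[tok] for tok in doc] for doc in docs]
print(text_ntoken, text_pad_token)
```

With this setup, text_ntoken counts distinct tokens across the whole corpus, not the total number of tokens and not a per-column figure.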

britton · January 5, 2019, 6:42am

Thanks! I’ll give it a try with those updates.