A code snippet for regression with structured data + text

Hey everyone!

I wrote a new module, based on MultiBatchRNN and MixedInputModel, for regression (or classification) on structured data combined with a text input. It handles three types of data: categorical, continuous, and text tokens. The final module is:

import torch
from torch import nn

class RNN_Structured_regressor(nn.Module):

    def __init__(self, text_bptt, text_max_seq, text_ntoken, text_emb_sz, text_n_hid, text_n_layers, text_pad_token,
                 struct_emb_szs, struct_n_cont, y_range, struct_layers_szs=[1000,500]):
        super().__init__()
        # Text encoder: the custom MultiBatchRNN variant from the full code
        self.rnn_enc = MyMultiBatchRNN(bptt=text_bptt, max_seq=text_max_seq, ntoken=text_ntoken, emb_sz=text_emb_sz,
                                     n_hid=text_n_hid, n_layers=text_n_layers, pad_token=text_pad_token,
                                     dropouth=0.3, dropouti=0.65, dropoute=0.1, wdrop=0.5, qrnn=False)

        # Structured-data model that also receives the encoded text (n_text features)
        self.structured_model = MixedInputModelWithText(struct_emb_szs, struct_n_cont, emb_drop=0.04, out_sz=1,
                                                        szs=struct_layers_szs, drops=[0.001,0.01], y_range=y_range,
                                                        use_bn=False, is_reg=True, is_multi=False, n_text=text_emb_sz)

    def forward(self, x_cat, x_cont, text_inp):
        # Transpose the 2-D token tensor before passing it to the encoder
        raw_outputs, outputs = self.rnn_enc(torch.t(text_inp))
        encoded_text = outputs[-1][-1]  # last layer, last time step; add max pooling afterwards
        return self.structured_model(x_cat, x_cont, encoded_text)
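
The `# add max pooling afterwards` note could be handled with concat pooling: joining the last time step with the per-feature max and mean over all time steps. Below is a framework-free sketch of the idea using nested lists. `concat_pool` is a hypothetical helper, not part of the module above, and in the real model the same thing would be done with tensor ops on `outputs[-1]`.

```python
def concat_pool(steps):
    """Pool a sequence of hidden states into one vector per batch item.

    `steps` is indexed as steps[t][b] -> list of floats (a hidden vector),
    mirroring a (seq_len, batch, hidden) layout.
    Hypothetical helper, for illustration only.
    """
    n_batch = len(steps[0])
    pooled = []
    for b in range(n_batch):
        seq = [step[b] for step in steps]            # all time steps for item b
        last = list(seq[-1])                         # what outputs[-1][-1] keeps
        max_pool = [max(col) for col in zip(*seq)]   # per-feature max over time
        avg_pool = [sum(col) / len(col) for col in zip(*seq)]  # per-feature mean
        pooled.append(last + max_pool + avg_pool)
    return pooled


# 3 time steps, batch of 1, hidden size 2
steps = [[[1.0, 4.0]], [[3.0, 0.0]], [[2.0, 2.0]]]
features = concat_pool(steps)
```

Pooling this way triples the width of the text features, so the `n_text` value passed to `MixedInputModelWithText` would presumably need to be `3 * text_emb_sz` instead of `text_emb_sz`.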

The entire code can be found here:

You can also load Wikipedia LM weights (or those of any other LM) to give the RNN encoder a better starting point:
load_model(learner.model.rnn_enc, '<your_LM>')

I’d love to get your feedback!
Hope it helps someone 🙂

This looks very cool - I’ll be sure to check it out!

This is exactly what I’m trying to do as well! I’m running it on my dataset now and working through a few hiccups. I may post some questions for you here 🙂

Hey there, I’m currently working out the ‘build learner’ part of your code that calls RNN_Structured_regressor.

I’m defining things like batch size, emb size, etc. but I’m confused about that text_ntoken argument.

Is it the total number of tokens in the text columns of my dataset? Or is it per text column, or something? Do you know of a way to find that number? What does it do down the line in the code? I don’t see that argument actually used anywhere further down in the custom MultiBatchRNN that is called.

Thanks!

Oh, I forgot to add some of the hyperparameters to that code (I added them below). text_ntoken is the size of your dictionary (vocabulary), probably somewhere between 10k and 50k tokens.
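
As a rough sketch of where that number comes from, assuming the vocabulary is built the way the fastai language-model notebooks do it (most frequent tokens first, with `_unk_` and `_pad_` inserted at positions 0 and 1). The toy corpus, `max_vocab`, and `min_freq` are made-up values for illustration:

```python
from collections import Counter, defaultdict

# Toy tokenized corpus; in practice this would cover the text fields
# of the whole training set.
tokenized = [
    ["great", "product", "fast", "shipping"],
    ["great", "price", "slow", "shipping"],
    ["great", "product"],
]

max_vocab = 50000  # assumed cap on dictionary size
min_freq = 1       # assumed threshold: keep tokens seen more than once

freq = Counter(tok for doc in tokenized for tok in doc)
itos = [tok for tok, count in freq.most_common(max_vocab) if count > min_freq]
itos.insert(0, "_pad_")  # ends up at index 1, matching text_pad_token = 1
itos.insert(0, "_unk_")  # index 0, used for out-of-vocabulary tokens

# Tokens missing from the dictionary fall back to index 0 (_unk_).
stoi = defaultdict(int, {tok: i for i, tok in enumerate(itos)})

text_ntoken = len(itos)  # one embedding row per dictionary entry
text_emb_sz = 400
embedding_weights = text_ntoken * text_emb_sz
ids = [stoi[tok] for tok in ["great", "unseen", "shipping"]]
```

So it is a single number for the whole dictionary, not one per text column. In an encoder of this kind it typically sets the number of rows in the embedding matrix, which is why it may not appear again after the constructor.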

text_emb_sz = 400
text_n_hid = 1150 # size of hidden layer
text_n_layers = 3
text_pad_token = 1 # position of padding token
struct_layers_szs = [1000, 500]

wd = 1e-7 # weight decay
text_bptt = 16
text_max_seq = text_bptt*10
text_ntoken = len(itos) # size of the dictionary
bs = 30 # batch size
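
To make `text_bptt` and `text_max_seq` a bit more concrete: the stock fastai `MultiBatchRNN` walks a long sequence in chunks of `bptt` tokens and only keeps the outputs of chunks that start within the last `max_seq` tokens. The sketch below reproduces just that index arithmetic in plain Python, assuming the custom `MyMultiBatchRNN` keeps the same loop. `chunk_plan` is a hypothetical helper name, and the sequence length of 400 is made up.

```python
def chunk_plan(seq_len, bptt, max_seq):
    """Return (start, end, kept) for each chunk the encoder would process.

    Hypothetical helper: mirrors only the slicing logic, no model involved.
    """
    plan = []
    for start in range(0, seq_len, bptt):
        end = min(start + bptt, seq_len)
        kept = start > (seq_len - max_seq)  # only late chunks reach the output
        plan.append((start, end, kept))
    return plan


text_bptt = 16
text_max_seq = text_bptt * 10  # 160, as in the settings above

plan = chunk_plan(seq_len=400, bptt=text_bptt, max_seq=text_max_seq)
kept_chunks = [(s, e) for s, e, kept in plan if kept]
```

Under these assumptions a 400-token text is processed in 25 chunks, but only the chunks starting after position 240 contribute outputs, so a larger `text_max_seq` trades memory for more context in the encoded text.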

Thanks! I’ll give it a try with those updates.