Tabular Transfer Learning and/or retraining with fastai

jeremyeast · April 10, 2019, 3:43pm

Hello to the amazing fast.ai community! I usually enjoy helping out here, but this time I have a question for those of you with more advanced experience on similar tabular problems.

For my job, I successfully developed tabular models with fast.ai that use over 60 categorical features (country, device, etc.) and around 20 continuous features (e.g. datepart). These models are now in production, but we want to go further: several new values are added each day, which requires us to retrain the models.

I was really inspired by the performance and progress achieved with transfer learning in vision (ResNet) and text (ULMFiT), but I have not seen any research on tabular.

Similarly to the work done by Pinterest and Instacart, I would like to reuse the fast.ai categorical embeddings to train new models with fewer data points or on similar problems. Exporting the PKL and extracting the weights is simple…

But how do I prune the model and load it into a new model, efficiently, while keeping the categorical cat_codes in the same order?

Alternatively, we could simply retrain models from scratch every time, but we feel that would be a waste of compute… Or we could load the .PTH file, but that does not seem efficient to store on AWS and still does not tell me how to add the new DataBunch.

I’ve followed 2018 FL pt1&2 and DL 2019, and I have searched the forums several times with different keywords, as well as Google and GitHub, without finding a clear way to do it.

I would greatly appreciate some help!

sgugger · April 11, 2019, 4:23pm

It’s a bit tricky if you have new categorical codes, as it will require you to change the embeddings. There is no pre-written function in fastai to help, but you should have a look at the function load_pretrained in fastai.text.learner, as this function matches word ids from an old vocab to a new one and creates the corresponding embedding matrix. You would need the same for all the categorical variables.

As for not loading the pth file, there is no workaround for that for now. You can probably implement some pruning, but there is nothing like this in fastai.
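A minimal sketch of that vocab-matching idea for a single categorical variable, in plain PyTorch. The helper name remap_embedding and the toy class lists are hypothetical, and initialising unseen categories with the mean of the old rows is an assumption made here, not something fastai provides for tabular models.

```python
import torch
import torch.nn as nn

def remap_embedding(old_emb, old_classes, new_classes):
    """Build an embedding for new_classes, reusing rows of old_emb
    for every class that already existed in old_classes."""
    old_index = {c: i for i, c in enumerate(old_classes)}
    new_emb = nn.Embedding(len(new_classes), old_emb.embedding_dim)
    # Assumption: unseen classes start from the mean of the old rows
    mean_row = old_emb.weight.detach().mean(dim=0)
    with torch.no_grad():  # copy values without recording the copy in autograd
        for i, c in enumerate(new_classes):
            j = old_index.get(c)
            new_emb.weight[i] = old_emb.weight[j] if j is not None else mean_row
    return new_emb

# Toy example: 'CA' is new, so the codes of 'FR' and 'US' shift by one
old_classes = ['#na#', 'FR', 'US']
new_classes = ['#na#', 'CA', 'FR', 'US']
old_emb = nn.Embedding(len(old_classes), 4)
new_emb = remap_embedding(old_emb, old_classes, new_classes)
```

The same function would have to be applied once per categorical variable, each with its own pair of class lists.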

jeremyeast · April 11, 2019, 9:02pm

Thanks for the pointers, I will research and share progress.

I think there is an opportunity here. Could producing general-purpose categorical embeddings (categories, products, geos, datetime, etc.) for use in general areas offer faster convergence and better performance? I see this the same way the ULMFiT language models are being used today.

spacecadet · April 26, 2019, 10:30am

I’ve been wondering the same thing, so I did some research, but nothing I’ve come across seems to show a clear way to do transfer learning for tabular.
Did you manage to find any further resources on the topic?

jeremyeast · June 13, 2019, 4:49am

Hi Sylvain, I’ve made a lot of progress on tabular transfer learning. However, there are significant differences between text, vision and tabular in terms of layers. I would like to know if I need to transfer more than the embeds in the module list…

In fast.ai text, the function load_pretrained() contains several elements we are transferring from the old state_dict() to the new state_dict() :

0.encoder.weight
1.decoder.bias
1.decoder.weight

We get those, for example, through:
dec_bias, enc_wgts = wgts.get('1.decoder.bias', None), wgts['0.encoder.weight']

On the Adult Dataset tabular example, here are the layers I get from state_dict. We can see that they do not match the layers (e.g. the decoder bias) used in text:

embeds.0.weight
embeds.1.weight
embeds.2.weight
embeds.3.weight
embeds.4.weight
embeds.5.weight
embeds.6.weight
embeds.7.weight
embeds.8.weight
bn_cont.weight
bn_cont.bias
bn_cont.running_mean
bn_cont.running_var
bn_cont.num_batches_tracked
layers.0.weight
layers.0.bias
layers.2.weight
layers.2.bias
layers.2.running_mean
layers.2.running_var
layers.2.num_batches_tracked
layers.3.weight
layers.3.bias
layers.5.weight
layers.5.bias
layers.5.running_mean
layers.5.running_var
layers.5.num_batches_tracked
layers.6.weight
layers.6.bias

TabularModel(
(embeds): ModuleList(
(0): Embedding(10, 6)
(1): Embedding(17, 8)
(2): Embedding(17, 8)
(3): Embedding(8, 5)
(4): Embedding(16, 8)
(5): Embedding(7, 5)
(6): Embedding(6, 7)
(7): Embedding(3, 3)
(8): Embedding(43, 10)
)
(emb_drop): Dropout(p=0.0)
(bn_cont): BatchNorm1d(5, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(layers): Sequential(
(0): Linear(in_features=65, out_features=200, bias=True)
(1): ReLU(inplace)
(2): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Linear(in_features=200, out_features=100, bias=True)
(4): ReLU(inplace)
(5): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(6): Linear(in_features=100, out_features=2, bias=True)
)
)

And the layer biases do not match any structure at all, for example:
'layers.6.bias', tensor([ 0.1803, -0.2174])

So again my question would be, which other layers do I need to transfer ?

Will share my code as soon as I’ve fully rewritten functions !
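
One way to transfer only the embedding weights can be sketched in plain PyTorch. TinyTabular below is a hypothetical stand-in that merely mirrors the embeds / bn_cont / layers key names listed above; it is not fastai's TabularModel, and the sketch assumes the old and new models share the same categories in the same order.

```python
import torch
import torch.nn as nn

class TinyTabular(nn.Module):
    """Hypothetical stand-in that reuses the same state_dict key names."""
    def __init__(self, emb_szs, n_cont, n_out):
        super().__init__()
        self.embeds = nn.ModuleList([nn.Embedding(n, d) for n, d in emb_szs])
        self.bn_cont = nn.BatchNorm1d(n_cont)
        n_emb = sum(d for _, d in emb_szs)
        self.layers = nn.Sequential(nn.Linear(n_emb + n_cont, n_out))

# Same categorical inputs, different task (2 outputs vs 3 outputs)
old = TinyTabular([(10, 6), (17, 8)], n_cont=5, n_out=2)
new = TinyTabular([(10, 6), (17, 8)], n_cont=5, n_out=3)

# Keep only the embedding entries of the old state_dict
emb_state = {k: v for k, v in old.state_dict().items() if k.startswith('embeds.')}

# strict=False leaves every key that is absent from emb_state untouched
result = new.load_state_dict(emb_state, strict=False)
print(result.missing_keys)  # the bn_cont.* and layers.* keys that were not transferred
```

Everything reported in missing_keys keeps its fresh initialisation, so the new model's linear and batch-norm layers still have to be trained.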

sgugger · June 13, 2019, 12:51pm

In a language model, the decoder is tied to the encoder (the embedding matrix used for encoding is reused for decoding, just before the softmax). There is nothing like this in a regular tabular model from the fastai library; you would have to write the equivalent yourself.
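
For readers unfamiliar with weight tying, here is a minimal sketch in plain PyTorch. TinyTiedLM is a hypothetical toy model written for illustration, not fastai's language model.

```python
import torch
import torch.nn as nn

class TinyTiedLM(nn.Module):
    """Toy model whose decoder shares its weight matrix with the embedding."""
    def __init__(self, vocab_sz, emb_sz):
        super().__init__()
        self.encoder = nn.Embedding(vocab_sz, emb_sz)  # weight: (vocab_sz, emb_sz)
        self.decoder = nn.Linear(emb_sz, vocab_sz)     # weight: (vocab_sz, emb_sz)
        self.decoder.weight = self.encoder.weight      # tie: one shared Parameter

    def forward(self, ids):
        # Returns logits over the vocabulary; softmax is applied afterwards
        return self.decoder(self.encoder(ids))

lm = TinyTiedLM(vocab_sz=10, emb_sz=4)
logits = lm(torch.tensor([1, 2, 3]))
```

Because the encoder weight and decoder weight are the same tensor here, remapping the vocabulary changes both at once, and only the decoder bias needs separate handling.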

Jumonji · June 15, 2019, 4:48pm

I’m facing a similar problem with transfer learning of embeddings. I’ve taken the approach of copying the tensor values from an embedding to a CSV file and reloading them into a new embedding which may have some different categories. I’m still having a problem freezing and unfreezing them, but otherwise it seems to work. Here’s what I have so far (I would appreciate ANY critique on the approach or the code itself.)

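 import csv
 import numpy as np
 import torch
 def write_encoding_dict(filename,df,cat,input_embeds):
     # Write one "class,embedding values" row per category of df[cat]
     embeds=input_embeds.cpu()
     source_vocab= df[cat].astype('category').cat.categories.values
     # The with-block closes the file, so no explicit close() is needed
     with open(filename, 'w') as csvFile:
         writer = csv.writer(csvFile, lineterminator='\n')
         for i in range(len(source_vocab)):
             # tolist() also works on a tensor that tracks gradients
             myvals = embeds(torch.tensor(i)).tolist()
             writer.writerow([source_vocab[i],*myvals])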

In my model, I want to save the first embedding variable, and I do it like this:

write_encoding_dict('embedding0.csv', panda_dataframe, category_var0, learn.model.embeds[0])

Then the file contains rows of “class,embeddings value list” like this:

ACE,-0.00013918841432314366, 3.610396379372105e-05, -7.69308189774165e-06, -2.2517966499435715e-05, -2.284333822899498e-05

Then to read them back in and load the embedding values into a different model:

 from collections import OrderedDict
 def get_encoding_dict(filename):
     # Map each class name to its list of embedding values
     with open(filename, 'r') as csvFile:
         reader = csv.reader(csvFile)
         lines = list(reader)
         d = OrderedDict()
         for i in range(len(lines)):
             d[lines[i][0]] = [float(lines[i][j]) for j in range(1,len(lines[i]))]
         return d
 
 def load_embed_weights(df, cat, embeds, file):
     encodings = get_encoding_dict(file)
     target_vocab = df[cat].astype('category').cat.categories.values
     weights_matrix = embeds.weight
     #weights_matrix.requires_grad = False
     emb_dim=weights_matrix.shape[1]
     words_found = 0
     for i, word in enumerate(target_vocab):
         try: 
             enc = encodings[word]
             for j in range(emb_dim):
                 weights_matrix[i][j] = enc[j]
             words_found += 1
         except KeyError:
             for j in range(emb_dim):
                 weights_matrix[i][j] = np.random.normal(scale=0.6)
     print(weights_matrix.shape[0], words_found)

So - seems to work. The problem I’m having is when I try to freeze the weights in the new model, like this:

weights_matrix.requires_grad = False

I get an error that I can’t freeze a non-leaf node. So when I try to freeze the embedded tensor directly, like this:

weights_matrix.data.requires_grad = False

I get a different error that the optimizer can’t optimize a non-leaf variable.

I feel like I’ve made real progress, but this last hurdle is killing me…

Jumonji · June 15, 2019, 5:58pm

Ok - I’ve figured out that this works if I don’t reset the embedding values:

model.embeds[0].weight.requires_grad = False

so the problem is how I’m doing the reset. Apparently it’s creating a dependency from my initialization value to the embedding value I’m trying to replace. hmm…

Now, if I use this wrapper to copy the weight values in:

with torch.no_grad():

I don’t get any errors. However, setting requires_grad to False isn’t having any effect. It removes the gradient from the Tensors, but the values keep adapting.

Jumonji · June 15, 2019, 10:11pm

Ok, this is strange. If I turn off the gradient after loading, it doesn’t change during learning even if I turn it back on later.

learn.model.embeds[0].weight.requires_grad = False

However, if I turn it on after loading, it keeps changing even if I turn it off later. I’m stumped.

Is there some reason that setting requires_grad only works once?

jeremyeast · June 16, 2019, 6:43am

Hi Jumonji, I will soon share my take on it, however I have not gotten around to freezing layers.

Jumonji · June 17, 2019, 1:49pm

Well, (duh!) fastai always recreates the optimizer after freezing layers, which (I suspect) is reloading just the unfrozen parameters to be optimized. So just turning off the gradient will not automatically remove those parameters from being optimized, one must also recreate the optimizer. So I tried that and (drum roll…) it worked!

You can’t just use the built-in freeze function, for two reasons. First, I only want to freeze the embeddings, not all inputs in the first layer. Second, the tabular data model is all wrapped within a SequenceEx wrapper, so it’s all one big layer group anyway. You can only freeze all or none with the built-in function.

So, to freeze and unfreeze a specific embedding you must use the correct index based on the category order, like so:

learn.model.embeds[index].weight.requires_grad = False (or True)
learn.create_opt(defaults.lr)

Voilà! It works!
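
The general point can be reproduced in plain PyTorch, without fastai. This is a sketch under assumptions: it uses a bare SGD optimizer and a toy model, so it shows one way an optimizer can keep updating a parameter after requires_grad is switched off (a stale .grad is still applied); fastai's own optimizer wrapper may behave differently in its details.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(4, 3)
head = nn.Linear(3, 1)
params = list(emb.parameters()) + list(head.parameters())
ids = torch.tensor([0, 1])

# One normal training step with every parameter in the optimizer
opt = torch.optim.SGD(params, lr=0.1)
head(emb(ids)).sum().backward()
opt.step()

# Freeze the embedding but keep using the OLD optimizer
emb.weight.requires_grad = False
before = emb.weight.clone()
opt.step()  # emb.weight.grad from the previous backward is still there
moved_with_old_opt = not torch.equal(before, emb.weight)

# Recreate the optimizer from the parameters that still require gradients
opt = torch.optim.SGD([p for p in params if p.requires_grad], lr=0.1)
before = emb.weight.clone()
head(emb(ids)).sum().backward()
opt.step()
moved_with_new_opt = not torch.equal(before, emb.weight)

print(moved_with_old_opt, moved_with_new_opt)
```

With the rebuilt optimizer the frozen embedding is simply not in any parameter group, which is the effect that calling learn.create_opt after changing requires_grad achieves in the snippet above.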

Now I just need to understand why the weight matrix for the embedding has an extra row in it (one more than the number of classes in that category.) Any ideas?

jeremyeast · June 17, 2019, 5:21pm

Hi Jumonji, I am curious which other layers you have transferred, other than embeds[index].weight?

To answer your question, the extra row in each embedding is #na#, which serves as a default placeholder when you try to predict on a new value that is not present in your embedding dictionary.

You can see those with learn.data.train_ds.x.classes
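
As an illustration of what that placeholder does, here is a sketch in pure Python. The helper names build_codes and encode and the airport codes are hypothetical, and this is not fastai's implementation; it only assumes that #na# sits at index 0 of the class list.

```python
def build_codes(classes):
    """Map each class name to its row index in the embedding matrix."""
    return {c: i for i, c in enumerate(classes)}

def encode(value, codes):
    """Return the row for a known value, or row 0 ('#na#') for an unseen one."""
    return codes.get(value, 0)

classes = ['#na#', 'ATL', 'JFK', 'LAX']  # '#na#' occupies index 0
codes = build_codes(classes)

known = encode('JFK', codes)    # a category seen during training
unseen = encode('SFO', codes)   # a category never seen, falls back to '#na#'
print(known, unseen)
```

This is why an embedding for a variable with three observed categories has four rows.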

Will soon share my code, I have asked somebody to review it beforehand.

Jumonji · June 17, 2019, 10:22pm

Thanks, Jeremy. It looks like the #na# is prepended to the classes at index zero, so that’s what I’m doing now.

I have only been transferring the embeddings themselves. I’m working in the airline domain and I’m trying to come up with a generic airport encoding, starting by using destination volume analogously to word order in NLP.

Any comments on my code thus far? Here’s the latest version:

import csv
import numpy as np
import torch
from collections import OrderedDict

def write_encoding_dict(filename, df, cat, embeds):
    source_vals = ['#na#', *df[cat].astype('category').cat.categories.values]
    weight_matrix = embeds.weight
    with open(filename, 'w') as csvFile:
        writer = csv.writer(csvFile, lineterminator='\n')
        for i in range(len(source_vals)):
            writer.writerow([source_vals[i],*weight_matrix[i].tolist()])
        
def get_encoding_dict(filename):
    with open(filename, 'r') as csvFile:
        reader = csv.reader(csvFile)
        lines = list(reader)
        d = OrderedDict()
        for i in range(len(lines)):
            d[lines[i][0]] = [float(lines[i][j]) for j in range(1,len(lines[i]))]
        return d

def load_embed_weights(filename, df, cat, embeds):
    encodings = get_encoding_dict(filename)
    target_vals = ['#na#', *df[cat].astype('category').cat.categories.values]
    weights_matrix = embeds.weight
    emb_dim=weights_matrix.shape[1]
    vals_found = 0
    with torch.no_grad():
        for i, value in enumerate(target_vals):
            try: 
                enc = encodings[value]
                for j in range(emb_dim):
                    weights_matrix[i][j] = enc[j]
                vals_found += 1
            except KeyError:
                for j in range(emb_dim):
                    weights_matrix[i][j] = np.random.normal(scale=0.6)

maral · June 18, 2019, 11:57pm

When you say you’re adding new values each day, are you adding more training data (rows), or are you changing the structure of the model, i.e. adding more columns?

jeremyeast · June 19, 2019, 4:16pm

Hi, both. Models could have new data with new categorical values that were never observed in the past (for example a new car model), or we could be transferring the weights to a new kind of problem that reuses the same rows.

maral · June 20, 2019, 10:50am

I think you have to ditch your embeddings if you want to avoid retraining. If you one-hot encode your categorical variables instead, you should be able to add new connections to the network while preserving the existing weights, and then train the model using the validation data from the original model (inference only) to re-calibrate it so that the loss between the outputs of the original model and the new model on that validation data is minimised. That should retain the knowledge acquired by the original model’s training while expanding it into a new model that can support new inputs. I am using a similar approach right now, except I have the inverse problem: I am shrinking a GAN by 50% so I can perform real-time inference, by removing entire resblocks and re-calibrating.

So the steps are:

1. Copy original model (O) to (N)
2. Add new connections to (N)
3. Get validation data from (O)'s previous training loop
4. Run validation data through (O) and get outputs
5. Train (N) on validation data and calculate MSE loss between (O) and (N) outputs

(N) should learn how to imitate (O)
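
The steps above can be sketched in plain PyTorch as follows. The layer sizes, the random stand-in for validation data and the names model_o / model_n are all assumptions chosen for illustration; the "new connections" are modelled as two extra input columns.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# (O): original model with 4 inputs; (N): same shape plus 2 new input columns
model_o = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model_n = nn.Sequential(nn.Linear(6, 8), nn.ReLU(), nn.Linear(8, 2))

# Steps 1-2: copy (O)'s weights into (N); the new columns keep their random init
with torch.no_grad():
    model_n[0].weight[:, :4] = model_o[0].weight
    model_n[0].bias.copy_(model_o[0].bias)
    model_n[2].load_state_dict(model_o[2].state_dict())

# Step 3: stand-in for validation data, plus values for the new inputs
val_x = torch.randn(64, 4)
extra = torch.randn(64, 2)

# Step 4: outputs of (O), inference only
with torch.no_grad():
    targets = model_o(val_x)

# Step 5: train (N) to reproduce (O)'s outputs with an MSE loss
opt = torch.optim.Adam(model_n.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()
losses = []
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(model_n(torch.cat([val_x, extra], dim=1)), targets)
    loss.backward()
    opt.step()
    losses.append(loss.item())

print(losses[0], losses[-1])
```

After this re-calibration (N) approximates (O) on the old inputs, and it would then be fine-tuned on data that actually carries signal in the new columns.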

jeremyeast · June 21, 2019, 1:04am

Hi @sgugger, I am happy to share with the community a basic demo of tabular transfer learning with fast.ai. Thanks for pointing me in the direction of the fastai text code. I am still unsure how to handle the bias layers. I would really appreciate any help on how to modify the model architecture or layers (requires_grad?) to improve transfer accuracy. Could you suggest any paths for improvement?

@Jumonji: here is my version of the code. As you can see, I work directly from a pickled dictionary instead of a CSV, and I only take care of the embedding weights, not other layers yet.

You can see the model automatically starts at ~0.30 loss instead of ~0.7, and everything runs and trains smoothly. I am ready to work on other problems, but I would first appreciate some feedback from anybody here!

CODE:
https://colab.research.google.com/drive/1yvA6pFPbmtwUUw1VDtPixoqWPTgkEfpM

Jumonji · June 21, 2019, 1:47am

Thanks @Jeremyeast - you’re solving a slightly different problem than I am. I want to use lists of category classes with their embedding vectors that possibly haven’t been created in fastai models, i.e., similar to GloVe vectors for NLP word embeddings that may be created and shared from many different sources. The CSV file format was just a start to see if I could make it work; I don’t want to depend on having a pickled model to start with. GloVe actually uses space-delimited records.

I do think I’ll try your technique of getting the class list instead of building it from pandas, just to verify my results if nothing else. Cheers!

jeremyeast · June 21, 2019, 5:46am

I have done a lot of work on representing tabular entities in a 2D space. I used t-SNE and matplotlib and had some success grouping entities with DBSCAN (which does not require passing the number of clusters). GloVe is probably superior, but it’s hard to see tangible applications with this… I would love to see how you apply this to the airline industry.

You could easily take my code and extend it so that a new category gets a mean or empty vector until you have the data to learn a vector for it. I would recommend looking at my code to see how to pass a uniform number of columns for the category names you want to transfer (n classes / 2, max 50).
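
The sizing rule in parentheses can be written as a one-liner. The function name emb_size is hypothetical, and rounding the half up is an assumption; note that the Adult model printed earlier in the thread has Embedding(10, 6), so fastai's own default evidently follows a different rule.

```python
def emb_size(n_classes, max_dim=50):
    """Half the number of classes (rounded up, by assumption), capped at max_dim."""
    return min(max_dim, (n_classes + 1) // 2)

# Using one fixed rule on both sides keeps the widths of saved and
# reloaded vectors identical for every category that is transferred.
sizes = {n: emb_size(n) for n in (7, 10, 500)}
print(sizes)
```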

Jumonji · June 24, 2019, 2:27am

My “final” version with no pandas dependencies. Pretty minimal, if I do say so myself:

import csv
import numpy as np
import torch
from collections import OrderedDict
from fastai.basic_train import Learner

defaultlr = 1e-3

def write_encoding_dict(filename, learner, cat_names, cat):
    classes = learner.data.label_list.train.x.classes[cat]
    weight_matrix = learner.model.embeds[cat_names.index(cat)].weight
    with open(filename, 'w') as csvFile:
        writer = csv.writer(csvFile, lineterminator='\n')
        for i in range(len(classes)):
            writer.writerow([classes[i],*weight_matrix[i].tolist()])
        
def get_encoding_dict(filename):
    with open(filename, 'r') as csvFile:
        reader = csv.reader(csvFile)
        lines = list(reader)
        d = OrderedDict()
        for i in range(len(lines)):
            d[lines[i][0]] = [float(lines[i][j]) for j in range(1,len(lines[i]))]
        return d

def load_embed_weights(filename, learner, cat_names, cat):
    encodings = get_encoding_dict(filename)
    classes = learner.data.label_list.train.x.classes[cat]
    weight_matrix = learner.model.embeds[cat_names.index(cat)].weight
    emb_dim=weight_matrix.shape[1]
    with torch.no_grad():
        for i, value in enumerate(classes):
            try: 
                enc = encodings[value]
                for j in range(emb_dim):
                    weight_matrix[i][j] = enc[j]
            except KeyError:
                for j in range(emb_dim):
                    weight_matrix[i][j] = np.random.normal(scale=0.6)
                    
def freeze_embedding(learner:Learner,index=0):
    learner.model.embeds[index].weight.requires_grad = False
    learner.create_opt(defaultlr)
                    
def unfreeze_embedding(learner:Learner,index=0):
    learner.model.embeds[index].weight.requires_grad = True
    learner.create_opt(defaultlr)