Transfer learning with tabular data (my experience)

Some time ago at work I had to look into transfer learning for tabular data, as well as problems with columns of a repeated type (think “client 1”, “client 2”, where the result should often be independent of the order of the values but may not be).

Here is a brief summary of some of the things we tried, in case it is helpful for anyone.

Firstly, people have looked into this from different perspectives. Right here, for example, we have

It is also interesting how others have tried to convert part of the problem into images (where transfer learning is standard). (See for example.)

The work by Levin et al. (2022) is also relevant. They look into applying a model trained on x columns to a new problem with x + 1 columns in a biomedical context. Cool stuff!

But we want the opposite, which is transferring one or a few field embeddings to a new context. Again, not exactly ground-breaking.

We used three very different datasets: a private one for micro-credit risk, a housing dataset, and another for predicting drug synergies (DREAM challenge).

To swap embeddings we play with TabularPandas. In particular, we load the trained model and get the category map for the source variable:

source_category_map = self._learner.dls.procs.categorify.classes[source_variable]

and then assign it to the target variable on the new data:

def update_variable_categories(self, variable: str, source_category_map: CategoryMap) -> None:
    self.procs.categorify.classes[variable] = source_category_map

This allows us to initialize the new model with the right values for the target_variable. This is especially important since the size of the embeddings depends on the number of values the categorical variable can take, so the two need to match before we initialize the new model. Of course, the embeddings themselves are still new.
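
To make the size dependence concrete: as far as I recall, fastai's default heuristic (`emb_sz_rule`) picks the embedding width from the number of categories. The sketch below re-implements that formula in plain Python so it runs without fastai; the helper `embedding_shape` and the category counts are made up for illustration.

```python
def emb_sz_rule(n_cat: int) -> int:
    # Re-implementation of fastai's default heuristic as I remember it;
    # check your fastai version before relying on the exact constants.
    return min(600, round(1.6 * n_cat ** 0.56))


def embedding_shape(n_cat: int) -> tuple:
    # Hypothetical helper: (rows, columns) of the embedding matrix.
    return (n_cat, emb_sz_rule(n_cat))


source_shape = embedding_shape(500)  # e.g. many zip codes in the source problem
target_shape = embedding_shape(40)   # the same variable in a smaller dataset
shapes_match = source_shape == target_shape  # False: weights cannot be copied over
```

Once both models share the same category map, both shapes are computed from the same count and the weights can be copied directly.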

We can get the old embeddings from


We just need to know the index:

def _embedding_idx(self, variable: str) -> int:
    categoricals = list(self._learner.dls.procs.categorify.classes.keys())
    return categoricals.index(variable)

In the same way that we got the embeddings from the old model, we can update the embeddings in the new model.

Does this work? Yes, but it’s not impressive. We trained a model for the risk prediction problem and transplanted the zip-code embeddings into a model for housing prices, where we improve from mrse 0.292 to 0.288 (average of 10 tests).

An interesting idea is that recycling embeddings could potentially let us get reasonable results for values outside of the training range. In our example, even if the housing data only covers Seville, we could see what happens with zip codes that were covered by the original model. It did not really work in our particular case, though.

Improvement! Recycling embeddings has a limitation. Imagine our first problem considers values A, B, C, and our new problem has values B, C, D. If we use embeddings from the first problem then D will be “unknown” (the way we applied them it won’t even be in the dictionary, so the new model can’t even train a new embedding for D!).

What we do is first identify all new values, like [D] in the example. We also get the average embedding as a starting point:

abs_weights = np.abs(self.value.weight.detach()) 

With this we can expand our old embeddings before inserting them into a new model. This is perhaps a bit cumbersome, but it worked:

def _append_weights(self, new_rows: np.ndarray) -> None:
    weights = self.value.weight.detach()
    new_rows = torch.from_numpy(new_rows)
    new_rows = new_rows.type_as(weights)
    full_weights = torch.cat((weights, new_rows), dim=0)
    full_weights = Parameter(full_weights, requires_grad=True)
    self.value = torch.nn.Embedding(full_weights.shape[0], self._row_len)
    self.value.weight = full_weights

The results are not very different from before, to be honest. Averaging over 100 repetitions this time, we improve from mrse 0.294 to 0.287.
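
The expansion step described above (find the values the source model never saw, then append one row per new value) can be sketched without fastai or torch. This is only an illustration under assumptions: the function name and toy vectors are invented, and it starts new rows at the plain column mean, whereas the original snippet starts from absolute values.

```python
from typing import Dict, List


def expand_embeddings(
    old_vocab: List[str],
    old_rows: List[List[float]],
    new_values: List[str],
) -> Dict[str, List[float]]:
    """Return a value -> vector table covering the old vocabulary plus unseen values."""
    width = len(old_rows[0])
    # Column-wise mean of the existing rows, used as the starting point
    # for every value the source model never saw.
    mean_row = [sum(row[j] for row in old_rows) / len(old_rows) for j in range(width)]

    table = {value: list(row) for value, row in zip(old_vocab, old_rows)}
    for value in new_values:
        if value not in table:  # only genuinely new values get a fresh row
            table[value] = list(mean_row)
    return table


old_vocab = ["A", "B", "C"]
old_rows = [[0.0, 2.0], [1.0, 4.0], [2.0, 0.0]]
table = expand_embeddings(old_vocab, old_rows, ["B", "C", "D"])
```

Here B and C keep their trained vectors, while D gets a trainable row of its own instead of falling into "unknown".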

We looked into applying all of this to a small train dataset (10 examples per class of the target variable). Again, we saw a consistent but modest improvement (mrse 0.3823 to 0.3785).

Now we look at problems with categorical variables of the same type. For example, in the drugs problem we need to predict the synergy of COMPOUND_A and COMPOUND_B. In this case the original dataset does not discriminate which goes first, so the prediction should be the same for A, B and B, A. By default, this is not the case.

The idea here is to train only one list of embeddings instead of two. For this, we first get all values for COMPOUND_A and COMPOUND_B as a CategoryMap. Like in the transfer learning tests, we update the possible values for A and B. Finally, we link the embeddings before training (after getting their indices, like we did before). This is probably not the best way to do it, but it worked for a quick test. First we record any links in a dict:

self._embed_redirect: Dict[int, int] = {
    i: i for i, _ in enumerate(self.embeds)}

and use this in the forward pass (the part that applies to categorical variables):

if self.n_emb != 0:
    for i, _ in enumerate(self.embeds):
        embedding_idx = self._embed_redirect[i]
        embedding = self.embeds[embedding_idx]
        x.append(embedding(x_cat[:, i]))
    x = torch.cat(x, 1)
    x = self.emb_drop(x)

In this case we ran 200 tests and the improvement is marginal, probably not even statistically significant: from msre 0.6261 to 0.6246.
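
The redirect idea can be illustrated with plain dictionaries standing in for embedding layers. Everything here (the toy tables, `embed_row`) is hypothetical and only meant to show what linking column 1 to column 0 does.

```python
from typing import Dict, List

# One table per categorical column; column 0 = COMPOUND_A, column 1 = COMPOUND_B.
tables: List[Dict[int, List[float]]] = [
    {0: [0.1, 0.2], 1: [0.3, 0.4]},  # rows for COMPOUND_A
    {0: [9.0, 9.0], 1: [8.0, 8.0]},  # rows for COMPOUND_B (unused once linked)
]

# Identity mapping by default, then link column 1 to column 0's table.
embed_redirect: Dict[int, int] = {i: i for i, _ in enumerate(tables)}
embed_redirect[1] = 0


def embed_row(x_cat: List[int]) -> List[float]:
    """Look up each column in its redirected table and concatenate the vectors."""
    out: List[float] = []
    for i, code in enumerate(x_cat):
        out.extend(tables[embed_redirect[i]][code])
    return out


ab = embed_row([0, 1])  # compound 0 first, compound 1 second
ba = embed_row([1, 0])  # same pair, swapped order
```

Note that `ab` and `ba` contain the same vectors but in different positions, so the layers after the embeddings can still treat the two orderings differently. Sharing embeddings alone does not guarantee symmetric predictions.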

BUT it is interesting to see that our results are more symmetric! Taking the average prediction difference (A, B vs B, A), we reduce asymmetry by 8%.
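
One plausible way to compute such an average prediction difference, assuming it means the mean absolute gap between the two orderings, is sketched below. `mean_asymmetry` and `toy_predict` are illustrative names, and the toy model is deliberately order dependent.

```python
from typing import Callable, List, Tuple


def mean_asymmetry(
    predict: Callable[[str, str], float],
    pairs: List[Tuple[str, str]],
) -> float:
    """Average |predict(a, b) - predict(b, a)| over the given pairs."""
    gaps = [abs(predict(a, b) - predict(b, a)) for a, b in pairs]
    return sum(gaps) / len(gaps)


# Toy stand-in for a trained model: the first argument counts double.
def toy_predict(a: str, b: str) -> float:
    return 2.0 * len(a) + 1.0 * len(b)


pairs = [("aspirin", "ibuprofen"), ("x", "yz")]
asymmetry = mean_asymmetry(toy_predict, pairs)
```

A perfectly symmetric model scores 0 on this measure, so relative reductions like the one above can be read as moving toward that value.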

If we duplicate our train set (so we add a B, A example for every A, B) we slightly improve overall metrics (0.6261 to 0.6045). It is puzzling that, while sharing embeddings has lower asymmetry in this case (15%), asymmetry is actually doubled with respect to the original training!!
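
Duplicating the train set amounts to appending a mirrored copy of every row. A minimal sketch with plain dicts, where the numerical columns DOSE_A and DOSE_B and the target SYNERGY are hypothetical names standing in for the real per-compound columns:

```python
from typing import Dict, List

Row = Dict[str, object]

# Hypothetical column pairs that trade places when the compounds are swapped.
SWAP_PAIRS = [("COMPOUND_A", "COMPOUND_B"), ("DOSE_A", "DOSE_B")]


def swapped(row: Row) -> Row:
    """Return a copy of the row with every A/B column pair exchanged."""
    out = dict(row)
    for left, right in SWAP_PAIRS:
        out[left], out[right] = row[right], row[left]
    return out


def duplicate_with_swaps(rows: List[Row]) -> List[Row]:
    # Keep the originals and append one mirrored copy per row.
    return rows + [swapped(r) for r in rows]


train = [
    {"COMPOUND_A": "X", "COMPOUND_B": "Y", "DOSE_A": 1.0, "DOSE_B": 5.0, "SYNERGY": 12.3}
]
augmented = duplicate_with_swaps(train)
```

The target stays the same in the mirrored row, which is exactly the invariance we want the model to learn.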

But there’s actually a simpler approach here. What if instead of COMPOUND_A, COMPOUND_B, we use a combined variable COMPOUNDS that simply joins them (alphabetically sorted)? Then A, B and B, A both take the same value AB.
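
A minimal sketch of such a combined key. The function name and the separator are my own additions; the separator is there so that names like "AB" + "C" and "A" + "BC" cannot collide.

```python
def combined_key(compound_a: str, compound_b: str, sep: str = "|") -> str:
    """Order-independent key: sort the two names, then join them."""
    first, second = sorted((compound_a, compound_b))
    return f"{first}{sep}{second}"


key_ab = combined_key("A", "B")
key_ba = combined_key("B", "A")
```

Both calls produce the same key, so the model sees a single categorical value per unordered pair.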

This gives a small performance improvement (2%, msre 0.6139) and a nice reduction in asymmetry (10%). Note that some asymmetry remains, since there are still separate numerical variables for each compound.

So now we tried a fully symmetrical approach. Like we did before, we store all linked variables, categorical and continuous, and modify our forward pass:

def forward(self, x_cat, x_cont=None):
    first_pred = self._forward_base(x_cat, x_cont)
    x_cat_swapped = self._swap_cat(x_cat)
    x_cont_swapped = self._swap_cont(x_cont)
    swapped_pred = self._forward_base(x_cat_swapped, x_cont_swapped)
    return (first_pred + swapped_pred) / 2.0

@staticmethod
def _swap_columns(data, redirects: List[int]):
    cols = [data[:, i] for i in range(data.shape[1])]
    cols_swapped = [cols[i].unsqueeze(1) for i in redirects]
    return torch.cat(cols_swapped, 1)

def _swap_cat(self, x_cat):
    return self._swap_columns(x_cat, self._cat_redirect)

def _swap_cont(self, x_cont):
    return self._swap_columns(x_cont, self._cont_redirect)

The performance improvement is about the same as the one we had with the combined variable (2%, msre 0.6138) but of course now the predictions are perfectly symmetric.
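
Why averaging over both orderings gives exact symmetry can be checked on a toy example without torch. The sketch below uses invented names and a deliberately order-dependent base function, and it assumes the redirect list is its own inverse (swapping twice gives back the original row).

```python
from typing import Callable, List


def swap_columns(row: List[float], redirects: List[int]) -> List[float]:
    """Column i of the result is column redirects[i] of the input."""
    return [row[i] for i in redirects]


def symmetric(base: Callable[[List[float]], float], redirects: List[int]):
    """Wrap `base` so the prediction is averaged over both orderings."""
    def wrapped(row: List[float]) -> float:
        return (base(row) + base(swap_columns(row, redirects))) / 2.0
    return wrapped


# Toy order-dependent "model" over [feat_a, feat_b, shared_feature].
def base_model(row: List[float]) -> float:
    return 3.0 * row[0] + 1.0 * row[1] + row[2]


redirects = [1, 0, 2]  # swap columns 0 and 1, keep column 2
model = symmetric(base_model, redirects)
pred_ab = model([1.0, 2.0, 10.0])
pred_ba = model([2.0, 1.0, 10.0])
```

Both inputs evaluate the same two base predictions, just in the opposite order, so their averages are identical. The price is two passes through the base model per prediction.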

And that’s about it! Sorry this is not super complete, but I’m reading from last year’s notes.