Replicating fastai results manually

The fastai library is incredibly helpful, but obviously it can’t be used for every scenario. I’ve run into situations where it seems either impossible or overly convoluted to use fastai for a particular use-case (sample-wise loss and multimodal models are two examples I’m currently struggling with). However, when I try to move away from fastai I fail to replicate fastai’s results. The point of this thread is to go through how to manually implement/replicate fastai results without fastai. I think this will give a lot of users a deeper understanding of what fastai is doing under the hood and allow us all to be more flexible with the principles we learn from it.

This first post will be asking for specific help on one step of this process of manually replicating fastai results. My goal is to continue to post on this thread as I move throughout the process. I’ll use the multimodal problem I’m working on as the use-case.

I have been using a fastai TabularLearner on just a tabular data set (all continuous features) and getting respectable results (about equivalent to my results using a random forest model) - this fastai TabularLearner is my baseline. I then tried combining this in a multimodal model with associated text data and got poor results. To debug, I went back to only tabular data but did not use fastai - got poor results. I double and triple checked my normalization technique, fit function, etc.; they all seemed good. I then tried a ridiculous thing, the results of which are baffling to me. I used fastai and left everything the same as my baseline, except I manually inserted the model:

import torch
import fastai.tabular as tab
import fastai.layers as fast_layers

# Make databunch 
procs = [tab.Normalize]
train_val_data_df = d_in.loc[train_val_df_indices].merge(d_out.loc[train_val_df_indices][clf_target], left_index=True, right_index=True)
fast_data = tab.TabularDataBunch.from_df('.', train_val_data_df, clf_target, \
                                         valid_idx=val_iloc_indices, test_df=full_test_df[in_cols], procs=procs)
# Model architecture
layers = [337, 290, 82] 
dropouts = [0.3, 0.3, 0.07]
learner = tab.tabular_learner(fast_data, layers=layers, ps=dropouts, metrics=tab.accuracy)
weights = torch.Tensor([0.413250185, 0.586749815]).cuda()
learner.loss_func = fast_layers.CrossEntropyFlat(weight=weights)
# Added the line below and that's the only change I made
learner.model = tab.TabularModel(emb_szs=[], n_cont=len(in_cols), out_sz=2, layers=layers, ps=dropouts).cuda() 

This model should be exactly the same as the one created inside tab.tabular_learner(...), and yet I get poor results when doing this. The losses look worse, as do my domain-specific results.
Baseline loss: [loss plot omitted]
Manually inserted model loss: [loss plot omitted]

The results of executing learner.lr_find(1e-10, 1, num_it=1000) also change drastically.
Baseline lr_find: [lr_find plot omitted]
Manually inserted model lr_find: [lr_find plot omitted]

The domain-specific metrics were substantially worse.

It seems to me like something must be happening in fastai’s tabular_learner that messes with the model after it instantiates it, but looking through the source code I can’t find what it is. Any suggestions? What fastai magic am I not replicating? Recall that the only thing I changed was manually inserting the model that tabular_learner builds; I still used the same databunch and learner.fit calls as my baseline.

Seems like I am following you around on these forums. But I like your style, thoroughness, and a good mystery.

The main difference I see is in the initialization of weights, though I don’t know if it can account for such a large discrepancy. Also, the DataLoader will generate training samples in different orders, and CUDA is slightly non-deterministic.

I would first try copying the weights from the original learner model into the new model. Then evaluate several single training samples to see whether their outputs are the same or very close.
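
Something along these lines (a rough sketch; new_model is a placeholder for your manually created model, and the batch handling assumes a fastai v1 tabular databunch):

import torch

# Rough sketch: copy the learner's weights into the manually created model,
# then check that both models produce the same output for one batch.
new_model.load_state_dict(learner.model.state_dict())
learner.model.eval(); new_model.eval()

xb, yb = next(iter(learner.data.train_dl))   # tabular batch: [x_cat, x_cont], targets
with torch.no_grad():
    print(torch.allclose(learner.model(*xb), new_model(*xb), atol=1e-6))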

If they are, then try setting the random seeds along with num_workers=1 right before training to get the same training sequence. And the fastai docs show code for making CUDA deterministic, someplace.
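
Something like this for the seeds (a minimal sketch; the fastai docs have a fuller recipe):

import random
import numpy as np
import torch

# Minimal "fix every seed" helper (hypothetical name); full reproducibility also
# needs num_workers set to 0 or 1 on the DataLoader, as mentioned above.
def seed_everything(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False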

These steps should at least give you more clues. Good luck!

P.S. Somewhere fastai initializes the optimizer, one hopes the same way for each run you are doing.

I tried the same thing some months ago. What I noticed is that fastai is doing many things in the background when creating a learner (at least in vision which is what I use):

  • It initializes weights using a specific distribution
  • It creates layer groups to allow discriminative learning rate
  • It freezes part of the model

Still, all these things are not done in tabular_learner, so you’re probably wondering why I’m writing all this. The reason is that being able to do these things implies some changes in the optimization of the model, which are hidden in the Learner class. The most important thing is that fastai requires the Learner to have a layer_groups attribute so that it can use discriminative learning rates and freezing no matter what. When exploring the fit method, we then notice the call to a create_opt method, which creates an OptimWrapper object that wraps the optimizer. Long story short, the parameter groups of the optimizer are created using the layer_groups object, which is instantiated in the constructor of Learner. That means that the optimizer will optimize the parameters of the layers contained in layer_groups.

The consequence of all this is that when you create a Learner it uses the layers of the model passed to it to later create the optimizer. But when you change the model, you don’t change these layers, which means that it will optimize parameters that have no relation whatsoever with the current model. It basically doesn’t learn anything.
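
You can actually see the mismatch with a quick check (my own snippet, using only the model and layer_groups attributes discussed above):

# After manually replacing learner.model, compare the parameters the model owns
# with the parameters referenced by layer_groups (which feed the optimizer).
model_params = {id(p) for p in learner.model.parameters()}
group_params = {id(p) for group in learner.layer_groups for p in group.parameters()}
print(model_params == group_params)  # False here: the optimizer would train stale parameters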

If you want to change a learner’s model, you need to change both the model and the layer_groups attribute, like this:

import torch.nn as nn
from fastai.torch_core import flatten_model

learner.model = tab.TabularModel(emb_szs=[], n_cont=len(in_cols), out_sz=2, layers=layers, ps=dropouts).cuda()
learner.layer_groups = [nn.Sequential(*flatten_model(learner.model))]

flatten_model is a function defined in fastai.torch_core if you want to use it. With these two lines defined, your training should yield similar results.

One more thing: if you want to change the model after calling fit, you also have to reset the optimizer by calling learner.create_opt; otherwise the parameters of the optimizer still won’t match those of your model.
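
For example (the learning rate below is just a placeholder; use whatever you plan to train with):

# Rebuild the optimizer so its parameter groups come from the new layer_groups
learner.create_opt(lr=1e-3)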

Hope this makes your case work; don’t hesitate to ask if you have any questions. I spent a lot of time trying to recreate fastai’s behavior, so I noticed a lot of details along the way.

@florobax, thanks for your reply. It is far more in depth than mine was, and based on actual experience.

…the parameter groups of the optimizer are created using the layer_groups object, which is instantiated in the constructor of Learner. That means that the optimizer will optimize the parameters of the layers contained in layer_groups.

This part was upsetting. I have nearly always redone layer_groups after altering the model in order to use freeze_to(). So, just by accident, I never ran into these problems with the optimizer. But maybe I have run into the issue occasionally and misinterpreted it as a failed training.

The frozen state is already directly embedded in the model’s parameters as requires_grad, yet fastai takes the parameters for the optimizer from layer_groups. There should be a big red warning about this “gotcha”: model and layer_groups getting out of sync.
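
For what it’s worth, the frozen state can be inspected straight from the parameters, independent of layer_groups:

# requires_grad reflects each parameter's frozen state, regardless of layer_groups
print([p.requires_grad for p in learner.model.parameters()])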

Can anyone explain the reasons fastai is designed this way? It seems to invite a difficult-to-diagnose error. I am certainly open to explanations and corrections.

I am pretty sure the layer groups are created so that there can be consistent behavior for creating param groups in the optimizer directly from layers (which is more readable than working with raw parameters). However, I agree there should at least be a warning when layer groups, the model and the optimizer are out of sync. Another option would be to make model a property, which would work something like:

@property
def model(self):
    return self._model

@model.setter
def model(self, model):
    # Swapping the model automatically rebuilds layer_groups so they stay in sync
    self._model = model
    self.layer_groups = [nn.Sequential(*flatten_model(self._model))]

@property
def layer_groups(self):
    return self._layer_groups

@layer_groups.setter
def layer_groups(self, layer_groups):
    self._layer_groups = layer_groups
    # Invalidate the old optimizer so it gets recreated from the new layer_groups
    if hasattr(self, 'opt'):
        delattr(self, 'opt')

This would allow consistent behavior when changing model and/or layer_groups. I guess a PR could be suggested, but I believe the focus is on v2 now, more than on fixing v1.

@florobax thank you so much for the reply! That makes total sense now and resolves this specific problem. As a side note, you can of course also sidestep this problem by creating your learner directly with basic_train.Learner rather than tabular_learner, and then pass in whatever kind of model you want:

from fastai import basic_train

# cont_features, layers, dropouts, weights, fast_data, and device (e.g. torch.device('cuda')) are defined as before
model = tab.TabularModel(emb_szs=[], n_cont=len(cont_features), out_sz=2, layers=layers, ps=dropouts).to(device)
learner = basic_train.Learner(fast_data, model, loss_func=fast_layers.CrossEntropyFlat(weight=weights), metrics=tab.accuracy)

The good news is that since posting this, I’ve worked through the process end-to-end without fastai and am getting good results (actually slightly better domain results). I’m not entirely sure what my problem was before (as I started mostly from scratch to debug), which is disappointing, but I’m happy to respond to anyone who is having trouble replicating good tabular results without fastai.

Thanks to everyone for posting your suggestions! If I run into issues on multimodal I may post in this thread and continue the discussion on manually implementing best practices that fastai is doing behind the scenes.

@jaxondk I know I’m digging up dead threads, but you seem to have figured out the problem… which I am facing right now…

Just as an educational exercise, I’ve been trying to replicate fastai tabular results with plain PyTorch, or rather PyTorch Lightning…

I used the same model configuration, learning rates, OneCycleScheduler, etc., but the gap between fastai tabular and the plain PyTorch model is still quite substantial.
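
For reference, here is roughly the kind of loop I mean (a sketch; model, train_dl, loss_func, and n_epochs are placeholders, and the fastai default values noted in the comments are my best recollection for v1, so worth double-checking against your version):

import torch
from torch.optim.lr_scheduler import OneCycleLR

# Sketch of a plain-PyTorch one-cycle setup. fastai v1's fit_one_cycle defaults are,
# as far as I remember, pct_start=0.3, div_factor=25 and moms=(0.95, 0.85); fastai also
# applies decoupled weight decay by default, which AdamW mimics more closely than Adam.
max_lr = 1e-3
optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr, betas=(0.9, 0.99), weight_decay=1e-2)
scheduler = OneCycleLR(optimizer, max_lr=max_lr,
                       epochs=n_epochs, steps_per_epoch=len(train_dl),
                       pct_start=0.3, div_factor=25.0,
                       base_momentum=0.85, max_momentum=0.95,
                       anneal_strategy='cos')

for epoch in range(n_epochs):
    model.train()
    for xb, yb in train_dl:            # batch structure is a placeholder
        optimizer.zero_grad()
        loss = loss_func(model(xb), yb)
        loss.backward()
        optimizer.step()
        scheduler.step()               # fastai steps the schedule every batch, not every epoch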

This is indeed a pretty dead thread haha. Without more info I’m not sure how to help, as I’m sure there isn’t some silver bullet for replicating the fastai results. fastai does a lot behind the scenes, and any one of those pieces could be the discrepancy for you. Maybe you can give some more details about your problem?

What I’m doing now is mostly using fastai to train my models, and then extracting the trained PyTorch models and using them outside the fastai context. So I might not be the best person to help, but I can try, and hopefully others on this forum will have insight for you.
