Fastai V2 Upgrade Review Guide. What's new in Fastai Version 2?

Fastai2 was released on August 21st, 2020 (Fastai2 and new course now released). As Jeremy writes, "fastai v2 is not API-compatible with fastai v1 (it's a from-scratch rewrite)". Fortunately, the overall architecture of fastai remained the same, so upgrading to fastai2 is less of a hassle than the announcement makes it sound.
Since I would have wished for an upgrade guide myself, and none exists so far, I am writing this post to guide you through upgrading your fastai1 code to the latest fastai2. My focus is on the core and vision packages, as those are the ones I have been working with. Please add anything that's missing from this guide in the comments as a resource for others.
Overall, most of the changes I noticed amount to moving the same functionality to different places and renaming things.
This guide will hopefully also show you some useful functionality of fastai that you haven't discovered yet; there are many hidden gems in the library, which make it highly effective for production use. You get many great defaults and helpers that give you a bunch of additional performance (both accuracy and speed) for free.
All of this was done as part of my work at https://www.intuitionmachines.com

Setup

Fastai2 requires you to do from fastai.vision.all import *. The reason is that the library does a lot of monkey patching, and things fall apart if you don't use the all import. This goes against common coding standards, and one can only hope that this gets refactored in the future.
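For reference, a minimal setup then looks like this (the wildcard import is what the fastai2 docs recommend; explicit imports still work if you prefer to track down each symbol yourself):

from fastai.vision.all import *

# If you prefer explicit imports, the individual modules are still importable, e.g.:
# from fastai.vision.learner import cnn_learner
# from fastai.vision.augment import aug_transforms, Resize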

Data loading

The low-level API for handling data structurally changed between fastai version 1 and fastai version 2.
In fastai version 1 we created a DataBunch by chaining four steps: provide inputs, split the data, label the inputs, and finally convert to a DataBunch, which is then passed to the learner (https://fastai1.fast.ai/data_block.html).
In fastai version 2 the steps are to create a DataBlock object, which is then converted to a Datasets or DataLoaders object (https://docs.fast.ai/tutorial.datablock). The DataLoaders object can be passed to a learner; the use case for the Datasets object remains unclear to me (it is a member of the DataLoaders object and seems to take the role of the data source for the DataLoader).
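To make the new flow concrete, here is a minimal sketch of the DataBlock route for an image classification task; the folder path, splitter and labelling function are assumptions, adapt them to your data:

from fastai.vision.all import *

dblock = DataBlock(
    blocks=(ImageBlock, CategoryBlock),      # input and target types
    get_items=get_image_files,               # provide inputs
    splitter=RandomSplitter(valid_pct=0.2),  # split the data
    get_y=parent_label,                      # label inputs from the parent folder name
    item_tfms=Resize(256),
    batch_tfms=aug_transforms(size=224),
)
dls = dblock.dataloaders(Path("data/images"), bs=64)  # DataLoaders, ready for the learner
# dblock.datasets(Path("data/images")) would give you the intermediate Datasets object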
In both versions there are helper functions to create datasets suitable for most use cases. In fastai version 1 these were the DataBunch factory methods (https://fastai1.fast.ai/vision.data.html#Factory-methods), which became the ImageDataLoaders (https://docs.fast.ai/vision.data#ImageDataLoaders). The arguments are mostly the same: size=224, ds_tfms=vision.get_transforms() became batch_tfms=[*aug_transforms(size=224)], item_tfms=Resize(256) (from fastai.vision.augment import aug_transforms, Resize). It's unfortunate that one now needs to supply the resize and cropping sizes independently. If you haven't tried the predefined augmentations so far, you should; they will likely give you some additional performance improvements.
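As a sketch, the factory-method route before and after (path is a placeholder folder with the usual train/valid layout):

from fastai.vision.all import *

# fastai version 1:
# data = ImageDataBunch.from_folder(path, size=224, ds_tfms=get_transforms())

# fastai version 2:
dls = ImageDataLoaders.from_folder(
    path,
    item_tfms=Resize(256),                   # per-item resize on the CPU
    batch_tfms=[*aug_transforms(size=224)],  # augmentation and final size per batch on the GPU
)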
Within the learner, data has been renamed to dls: learn.data → learn.dls.
DataLoaders still has the train_ds, valid_ds and test_ds attributes. But Datasets no longer has x and y attributes; it only contains the underlying dataframe in .items, so in order to obtain all labels one needs to do this: labels = learn.dls.train_ds.items[learn.dls.train_ds.cols[1].items].
The number of classes and the class names dictionary have been moved from the Dataset to the DataLoaders object and renamed: DataLoaders.c and DataLoaders.vocab (was dataset.classes).
As a side note, because it's not documented: the vocab attribute is a CategoryMap (when will fastai finally introduce type annotations everywhere?), but the vocab function argument (e.g. in CategoryBlock) accepts a dictionary of the form {class_name_str: class_idx_int}.
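In practice, reading the class information back out of a trained learner looks like this (a small sketch based on the attributes above):

num_classes = learn.dls.c      # number of classes
class_names = learn.dls.vocab  # a CategoryMap; behaves like a list of class names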

Callbacks

Callbacks moved from fastai.callbacks to fastai.callback, and the LearnerCallback functionality was merged into the general Callback.
Fastai version 1 required the learner as an argument to all callbacks; this argument has been removed.
Callback functions have been renamed in the following way: on_*_begin → before_* and on_*_end → after_*, and *_train → *_fit.
And the tracker operator is now comp: t.operator → t.comp.
Finally, the dict return status codes were replaced by exceptions, e.g. return {"stop_training": True} → raise CancelFitException().
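Putting the renamed hooks and the exception-based control flow together, a minimal custom callback might look like the following sketch (StopAfterNEpochs is a hypothetical example, not part of fastai):

from fastai.vision.all import *

class StopAfterNEpochs(Callback):
    "Hypothetical callback showing the renamed hooks and exception-based control flow."
    def __init__(self, n_epochs=3): self.n_epochs = n_epochs

    def before_fit(self):    # was on_train_begin in fastai version 1
        print(f"Training for at most {self.n_epochs} epochs")

    def after_epoch(self):   # was on_epoch_end in fastai version 1
        if self.epoch + 1 >= self.n_epochs:
            raise CancelFitException()  # was: return {"stop_training": True}

# learn = cnn_learner(dls, resnet34, cbs=StopAfterNEpochs(2))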

Metrics

from fastai.metrics import auc_roc_score plus a callback → from fastai.metrics import RocAuc
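So attaching the metric to a learner now looks like this (a sketch; dls is assumed to be a DataLoaders built as in the data loading section):

from fastai.vision.all import *
from fastai.metrics import RocAuc

# RocAuc() is a regular metric now, no extra callback required
learn = cnn_learner(dls, resnet34, metrics=[accuracy, RocAuc()])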

Distributed

As mentioned before, fastai version 2 does a lot of monkey patching, so instead of inheriting from a distributed learner, the learner is now patched: from fastai.distributed import Learner → from fastai.distributed import to_parallel (and call the to_parallel() function on the learner).
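A sketch of what that looks like in code, based on the description above (check the fastai.distributed docs of your version for the exact options):

from fastai.vision.all import *
from fastai.distributed import *  # monkey-patches Learner with the parallel helpers

learn = cnn_learner(dls, resnet34, metrics=accuracy)
learn.to_parallel()               # was an inheritance/wrapper-based setup in fastai version 1
learn.fit_one_cycle(1)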

wandb

Weights & Biases (www.wandb.com) is a great library to track your machine learning experiments, results, models and data. In fastai version 1 the wandb library shipped a fastai callback; for fastai version 2 the appropriate callback is in the fastai library itself.
from wandb.fastai import WandbCallback → from fastai.callback.wandb import WandbCallback
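Usage is otherwise unchanged; a minimal sketch (the project name is a placeholder):

import wandb
from fastai.vision.all import *
from fastai.callback.wandb import WandbCallback

wandb.init(project="my-fastai2-project")  # placeholder project name
learn = cnn_learner(dls, resnet34, cbs=WandbCallback())
learn.fit_one_cycle(1)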

Prediction

This became a lot easier! Then again, I never understood why making predictions for unlabeled/new data was so complicated in the first place.
In fastai1 one needed to add_test to the label_list of learn.data with an empty label, create a DataLoader for that label_list, and wrap the DataLoader in a DeviceDataLoader before one was able to call learn.predict() or get_preds() to obtain the results. Now all this mess has become two lines of code:
dl = learn.dls.test_dl(filenames)
logits, _ = learn.get_preds(dl=dl, drop_last=False) (fastai version 2 currently has a bug where, if you don't explicitly supply drop_last=False, it drops the last batch during prediction)
Why I still need to create a test_dl as a user of this core functionality remains unclear to me; imho this should be as simple as learn.predict(filenames).
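Putting it together with the vocab from above, decoding the returned predictions back to class names takes one more line (a sketch, using the logits from the two lines above):

predicted_classes = [learn.dls.vocab[i] for i in logits.argmax(dim=1).tolist()]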

Renaming / Moving

Many packages, classes and functions have been renamed or moved around; here's a non-exhaustive list of everything I discovered during my migration:
fastai version 1 → fastai version 2
fastai.datasets.URLs → fastai.data.external.URLs
fastai.vision.untar_data → fastai.data.external.untar_data
fastai.callbacks.csv_logger.CSVLogger → fastai.callback.progress.CSVLogger
fastai.core.camel2snake → fastcore.utils.camel2snake
fastai.core.defaults → fastcore.foundation.defaults
fastai.basic_train.load_learner → fastai.learner.load_learner (see the example after this list)
fastai.vision.cnn_learner → fastai.vision.learner.cnn_learner
fastai.basic_train.get_preds → fastai.learner.Learner.get_preds (now a method of the learner object)
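For example, loading an exported model now looks like this (a sketch; 'export.pkl' is whatever path you passed to learn.export()):

# fastai version 1:
# from fastai.basic_train import load_learner
# learn = load_learner(path, 'export.pkl')

# fastai version 2:
from fastai.learner import load_learner
learn = load_learner('export.pkl')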

Removed / not yet implemented

self.learner.destroy(): with fastai version 1 and PyTorch 1.4 there was a memory leak that required destroying the learner object to release all memory. This was necessary in situations where one wanted to run several experiments in a row (e.g. for hyperparameter search). destroy() is not implemented yet, and I still need to verify whether the memory leak still exists.

@jeremy I hope this is still relevant for anybody, despite being late. If you see use for it, I am happy to create a PR to add it to the documentation, e.g. in "migrating from other libraries"?

Thanks for sharing, I wish I had had this a few weeks ago (particularly to learn how to generate predictions for unlabeled datasets)!

@jeremy What do you think, shall I create a PR to add this to the documentation, e.g. in "migrating from other libraries"?

Hello, I hope you are doing well! I'm working to upgrade a codebase from FastAI v1 to v2 and am really confused about some things, can you please help me @dreamflasher @jeremy? The code is:

def create_databunch(config, df_trn, df_valid):
  bert_tok = BertTokenizer.from_pretrained(config.model_name)
  print(bert_tok.vocab.keys())
  fastai_tokenizer = Tokenizer(tok_func=FastAiBertTokenizer(bert_tok, max_seq_len=config.max_seq_len), pre_rules=[], post_rules=[])
  fastai_bert_vocab = Vocab(list(bert_tok.vocab.keys()))
  print(fastai_bert_vocab)
  return BertDataBunch.from_df(
    ".",
    train_df=df_trn,
    valid_df=df_valid,
    tokenizer=fastai_tokenizer,
    vocab=fastai_bert_vocab,
    bs=config.bs,
    text_cols=input_col,
    label_cols=config_data.label_column,
    collate_fn=partial(pad_collate, pad_first=False, pad_idx=0),
  )

def create_learner(config, databunch):
  model = BertTextClassifier(config.model_name, config.num_labels)

  optimizer = partial(AdamW)
  if config.es:
    learner = Learner(
      databunch, model,
      optimizer,
      wd=config.weight_decay,
      metrics=FBeta(beta=1),  # accuracy (metric to optimize on)
      loss_func=config.loss_func,
      callback_fns=[partial(EarlyStoppingCallback, monitor='f_beta', min_delta=config.min_delta, patience=config.patience)],
    )
  else:
    learner = Learner(
      databunch, model,
      optimizer,
      wd=config.weight_decay,
      metrics=FBeta(beta=1),  # accuracy (metric to optimize on)
      loss_func=config.loss_func,
    )

  return learner

Create the classifier

def create_classifier(config, df):
  df_trn, df_valid = split_dataframe(df, train_size=config.train_size, random_state=config.seed)
  # print(df_trn.iloc[0])
  databunch = create_databunch(config, df_trn, df_valid)
  print(databunch.show_batch())

  return create_learner(config, databunch)

I'm really confused about how to upgrade this to v2.
I made the following changes, but I feel it is not correct:

def create_databunch(config, df):
  bert_tok = BertTokenizer.from_pretrained(config.model_name)
  fastai_tokenizer = Tokenizer(tok=FastAiBertTokenizer(bert_tok, max_seq_len=config.max_seq_len), rules=[])
  fastai_bert_vocab = Categorize(vocab=list(bert_tok.vocab.keys())).vocab

  return TextDataLoaders.from_df(
    df=df,
    tok_tfm=fastai_tokenizer,
    text_vocab=fastai_bert_vocab,
    text_col=input_col,
    label_col=config_data.label_column)

#No padding in text data loader *** Delta-2, changed function signature

def create_learner(config, databunch):
  model = BertTextClassifier(config.model_name, config.num_labels)

  optimizer = partial(AdamW)
  learner = Learner(
    dls=databunch,
    model=model,
    wd=config.weight_decay,
    metrics=FBeta(beta=1),  # accuracy (metric to optimize on)
    loss_func=config.loss_func,
  )

  return learner

Create the classifier

def create_classifier(config, df):
  print("Config.train ", config.train_size)
  df_trn, df_valid = split_dataframe(df, train_size=config.train_size, random_state=config.seed)
  databunch = create_databunch(config, df_trn)

  # train, valid split where?

  return create_learner(config, databunch)

Can you help point out what I am doing wrong?

Are you sure this is correct? I don’t think it is.

None of this is true.

I agree that none of that is literally true, but going through the courses I had a similar feeling: from module import * goes against popular Python coding practices, and there is a lot of * importing in the courses. But no one forces you to import * if you don't want to.