Moving BERT into FastAI - good article

I’m planning to do some work with BERT but wanted to use the FastAI library…I found this article about how to do that:

I started a new thread as I saw other people asking about doing the same. I haven’t worked through the above yet but plan to do so tomorrow and will post if I find any other tips/tricks involved (and please do the same if you integrate).
Hopefully we can establish an optimal process for integrating BERT, RoBERTa, etc.


Hi LessW2020, hope you are well!
Thanks for the very informative write-up of your work. It will help many people like me who are completing part 1 of the course and want to build on the NLP lessons.

Cheers mfabulous1 :smiley::smiley:

Hi @LessW2020,

Thanks for pointing to such a wonderful guide. After spending a couple of hours working through it from the beginning, a few things seem odd to me (and Jeremy said that if something feels odd, you should just say so).

Here they are:

  1. Why use the factory DataBunch methods? Why not use the data block API?

      processor = [OpenFileProcessor(),TokenizeProcessor(tokenizer=fastai_tokenizer,include_bos=False),NumericalizeProcessor(max_vocab=40000)] 

This is actually the only line we need in order to create a processor to pass to the data block API.

For anyone wondering what I am talking about:

data = (TextList
  2. Why set pre_rules and post_rules to []?
    As far as I can see, if you leave the default pre_rules/post_rules in place, some of the rules destroy some of the BERT tokens. For example, [CLS] becomes xxup[cls], but this can easily be fixed by dropping the caps-handling rules (replace_all_caps, deal_caps) from post_rules.

I don’t know if I am missing something or my approach is wrong.

I guess my question is:
Will every pre_rule and post_rule that adds fastai special tokens break the BERT tokenization? Should we therefore just leave them all out?

But what about HTML and repeated-text issues? Take < br >: if you don’t have fix_html() in the pre_rules, all the HTML-related tokens stay in the text. Does BERT cope well with this?

Thanks, any input is appreciated.


And just to follow up: it is actually very easy to understand once you load the BERT model and print it out.
It is just an encoder up to the final dropout and classifier. My other question is how you figured out the splitting method proposed in the guide.

For the EfficientNet model, I have tried many different splits, but none of them works. By “works” I don’t mean a runtime error: once you apply different layer groups and then unfreeze the body, the EfficientNet becomes untrainable.

Hmm, sorry for my English, I really need to rephrase. For EfficientNet, I tried different ways to split the model (and the model was split into the parts I wanted).

But after splitting the model, I apply discriminative learning rates, fine-tune the head, then unfreeze and fine-tune the whole model. In the last phase, the loss stalls (train loss is improving but valid loss is not), and the accuracy is the same as when I only train the head without unfreezing.

So I was wondering whether the splitting is actually working here. Do we have a guide on how to split a model?
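
For context on the discriminative learning rate step: as far as I understand it, when you pass a range like slice(lo, hi), fastai v1 spaces the rates geometrically over the layer groups, so the split decides which layers get which rate. Below is a minimal plain-Python sketch of that spacing. The helper name `spread_lrs`, the group names, and the example values are all made up for illustration; this is not fastai code.

```python
def spread_lrs(lo, hi, n_groups):
    """Return n_groups learning rates, geometrically spaced from lo to hi."""
    if n_groups == 1:
        return [hi]  # design choice for this sketch: a single group gets the top rate
    step = (hi / lo) ** (1 / (n_groups - 1))
    return [lo * step ** i for i in range(n_groups)]

# Hypothetical 3-way split: the earliest layers get the smallest rate.
lrs = spread_lrs(1e-5, 1e-3, 3)
for group, lr in zip(["early layers", "middle layers", "head"], lrs):
    print(f"{group}: {lr:.1e}")
```

If a split puts nearly all the parameters in one group, most of the model ends up training at a single rate, which may be worth checking when an unfrozen model stops improving.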

Thanks in advance!


Hi all,

Just to follow up with my findings before getting to the training step. Here are some things I found interesting.

  1. BERT uses WordPiece for tokenization, which splits words into sub-words, e.g. ‘playing’ becomes ‘play’ + ‘##ing’.

  2. What we learned with fastai ULMFiT uses spaCy. I didn’t dig deep enough to figure out what the differences are. However, as we learned from Rachel in the NLP course, different tokenization leads to different results.

  3. Pre_rules, Post_rules.
    Let’s take a look at an example first:

[CLS] the premise of cabin fever starts like it might have something to offer . a group of college teens after finals ( in the fall ? ) goes to a resort cabin in the woods where one by one they are attacked by an unseen flesh eating virus . < br / > < br / > unfortunately , the first paragraph is where any remote elements of film

This is one of the IMDB reviews, processed with the method introduced in the great article that ports BERT into fastai. As we can see, since pre_rules and post_rules are set to empty lists, we get this strange HTML formatting. I also think the BERT tokenizer (it is WordPiece, but I will call it the BERT tokenizer from now on) just turns each piece into a token, so ‘<’, ‘br’, ‘/’, ‘>’. My question is: does keeping HTML-related formatting help downstream tasks? Not to mention that many recent competitions and datasets have emoji in them.
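
To make the sub-word splitting concrete, here is a toy sketch of the greedy longest-match-first idea behind WordPiece. The function name `wordpiece` and the tiny vocabulary are made up for illustration. The real BERT tokenizer uses a much larger vocabulary and also splits on punctuation before this step, so treat this as a sketch rather than the actual implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matched: the whole word is treated as unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary, far smaller than a real BERT vocab.
toy_vocab = {"play", "##ing", "##ed", "br", "<", ">", "/"}
print(wordpiece("playing", toy_vocab))  # ['play', '##ing']
print(wordpiece("br", toy_vocab))       # ['br']
```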

I took my exploration a bit further and read the fastai source code. Here is the list of pre_rules and post_rules:

defaults.text_pre_rules = [fix_html, replace_rep, replace_wrep, spec_add_spaces, rm_useless_spaces] 
defaults.text_post_rules = [replace_all_caps, deal_caps]

Let’s go over them one by one:

  1. fix_html()
    Cleans up the HTML artifacts and introduces none of the fastai tokens. (I vote for keep.)

  2. replace_rep()
    Handles repeated characters, like cooooool and ##, and also introduces the fastai token TK_REP. (I vote for discard. Joking aside that BERT doesn’t like fastai tokens, it will remove the ## marker that BERT uses for sub-words, which is essential for the BERT model, so I don’t think we should keep it.)

  3. replace_wrep()
    My understanding is that it collapses repeated words, so ‘this is so cool cool cool’ becomes ‘this is so TK_WREP cool’. (I vote for discard, because it introduces a new fastai token, and I think the BERT tokenizer handles repeated words differently? If anyone can comment on this one, that would be great.)

  4. spec_add_spaces()
    This doesn’t introduce any new tokens, but the problem is that it breaks BERT’s ## marker by adding spaces around each #. I discard it.

  5. rm_useless_spaces()
    I am not sure about this one. I discard it, but it seems to me we could keep it? It might be that the IMDB dataset doesn’t have this issue (or I just didn’t find a solid example in my tests, so I simply removed it).

  6. Post rules

None of the post rules can be kept, because they change BERT’s [CLS] and [SEP] tokens to lowercase. These two tokens are essential for BERT models, so I simply discard both rules.
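
Here is a small standalone sketch of why the caps rules clash with BERT’s special tokens. The two functions are simplified stand-ins that I wrote to mimic what replace_all_caps and deal_caps appear to do; they are not the fastai implementations, and the names `all_caps_rule` and `capitalized_rule` are hypothetical.

```python
TK_UP, TK_MAJ = "xxup", "xxmaj"

def all_caps_rule(tokens):
    """Stand-in for replace_all_caps: flag ALL-CAPS tokens and lowercase them."""
    out = []
    for tok in tokens:
        if tok.isupper() and len(tok) > 1:
            out.extend([TK_UP, tok.lower()])
        else:
            out.append(tok)
    return out

def capitalized_rule(tokens):
    """Stand-in for deal_caps: flag Capitalized tokens, then lowercase every token."""
    out = []
    for tok in tokens:
        if len(tok) > 1 and tok[0].isupper() and tok[1:].islower():
            out.append(TK_MAJ)
        out.append(tok.lower())
    return out

tokens = ["[CLS]", "The", "premise", "[SEP]"]
print(all_caps_rule(tokens))      # ['xxup', '[cls]', 'The', 'premise', 'xxup', '[sep]']
print(capitalized_rule(tokens))   # ['[cls]', 'xxmaj', 'the', 'premise', '[sep]']
```

Either rule on its own lowercases [CLS] and [SEP] in this sketch, which matches the reasoning for dropping both.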

With that said, this is what I have for pre and post rules:

fastai_tokenizer = Tokenizer(tok_func=FastaiBertTokenizer(bert_tok,max_seq=256),pre_rules=[fix_html],post_rules=[])

Let’s compare the same review with the version shown in the BERT-fastai article:

[CLS] the premise of cabin fever starts like it might have something to offer . a group of college teens after finals ( in the fall ? ) goes to a resort cabin in the woods where one by one they are attacked by an unseen flesh eating virus . unfortunately , the first paragraph is where any remote elements of film quality stop . cabin fever is little more
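
For anyone who wants to try the effect without fastai installed, here is a minimal standard-library sketch of the kind of cleanup an fix_html-style pre rule gives you before the BERT tokenizer runs. The function `strip_html_breaks` is a hypothetical stand-in, not the fastai function, and the raw string is my guess at what the untokenized review looked like.

```python
import html
import re

def strip_html_breaks(text):
    """Stand-in for a fix_html-style pre rule: unescape entities, drop <br /> tags."""
    text = html.unescape(text)
    text = re.sub(r"<\s*br\s*/?\s*>", " ", text)   # replace each break tag with a space
    return re.sub(r" {2,}", " ", text).strip()     # collapse the doubled spaces left behind

# Assumed raw input, reconstructed from the tokenized example above.
raw = "flesh eating virus.<br /><br />Unfortunately, the first paragraph"
print(strip_html_breaks(raw))  # flesh eating virus. Unfortunately, the first paragraph
```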

What do you guys think about it?




Nice research, thank you!

I’ve been able to get both BERT and RoBERTa to run using fastai guidelines. However, for the purposes of my research, instead of trying to predict a class I’m trying to predict a float score. So instead of positive/negative, think of it as how positive or negative on a scale from 0 to 1.0. I tried setting num_classes to 1, but I’m getting an error. I have been able to get this to work in ULMFiT by simply converting the ints from 1-5 to floats between 0 and 1, but that doesn’t seem to work with the transformer models. I’d love any thoughts.


Have you tried changing the label class in the datablock to

.label_from_df(cols=dep_var, label_cls=FloatList)
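
On the rescaling side, the post above mentions turning 1-5 ints into floats between 0 and 1 without saying how. Here is a minimal sketch that assumes plain min-max scaling; the helper name `rating_to_unit` is made up for illustration.

```python
def rating_to_unit(rating, lo=1, hi=5):
    """Min-max scale an integer rating in [lo, hi] to a float in [0.0, 1.0]."""
    if not lo <= rating <= hi:
        raise ValueError(f"rating {rating} outside [{lo}, {hi}]")
    return (rating - lo) / (hi - lo)

targets = [rating_to_unit(r) for r in [1, 2, 3, 4, 5]]
print(targets)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

A column of floats like this is the kind of target that a float label class expects.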

Thank you! I felt like I had tried this earlier and it didn’t work, but I tried it again and it did this time. Much appreciated!