Just following up with some findings before I get to the training step. A few things turned out to be interesting.
BERT uses WordPiece for tokenization, which splits words into sub-word units; for example, ‘playing’ can become ‘play’ + ‘##ing’.
What we learned in fastai's ULMFiT uses spaCy instead. I didn’t dig far into what the differences are. However, as we learned from Rachel in the NLP course, different tokenization leads to different results.
Let’s look at an example first:
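To make the sub-word idea concrete, here is a minimal sketch of greedy longest-match-first splitting, which is how I understand WordPiece to work at inference time. The function name `wordpiece_sketch` and the tiny vocabulary are made up for illustration; the real BERT vocabulary is much larger and may well keep a common word like ‘playing’ as a single token.

```python
def wordpiece_sketch(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption, not the real BERT vocab).
toy_vocab = {"play", "##ing", "##ed", "cabin"}
print(wordpiece_sketch("playing", toy_vocab))
```

The ## prefix is what marks a piece as the continuation of a word, which is why any rule that touches the ‘#’ character matters for BERT.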
[CLS] the premise of cabin fever starts like it might have something to offer . a group of college teens after finals ( in the fall ? ) goes to a resort cabin in the woods where one by one they are attacked by an unseen flesh eating virus . < br / > < br / > unfortunately , the first paragraph is where any remote elements of film
This is one of the IMDB reviews, tokenized with the method from the great article that ports BERT into fastai. As we can see, since pre_rules and post_rules are both disabled there, the HTML markup is left in the text. I also think the BERT tokenizer (it is WordPiece, but I will call it the BERT tokenizer from now on) simply turns each piece into its own token: ‘<’, ‘br’, ‘/’, ‘>’. My question is: does keeping HTML markup help downstream tasks? Not to mention that many recent competitions and datasets have emoji in them.
I took the exploration a bit further and read the fastai source code. Here are the default pre_rules and post_rules:
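Here is a rough sketch of why the markup ends up as four separate tokens. It assumes the tokenizer lowercases the text and splits every punctuation character into its own token before the sub-word step; `split_punct_sketch` is a hypothetical helper, not the actual BERT tokenizer code.

```python
import re


def split_punct_sketch(text):
    """Lowercase, then emit runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


print(split_punct_sketch("virus.<br />Unfortunately"))
```

With this kind of splitting, every `<br />` in a review costs four positions of the sequence length.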
defaults.text_pre_rules = [fix_html, replace_rep, replace_wrep, spec_add_spaces, rm_useless_spaces]
defaults.text_post_rules = [replace_all_caps, deal_caps]
Let’s go over them one by one:
fix_html: cleans up the HTML remnants and does not introduce any fastai tokens. (I vote to keep it.)
replace_rep: handles repeated characters, like cooooool, and ##, and also introduces the fastai token TK_REP. (I vote to discard it. Jokingly, BERT doesn’t like fastai tokens; seriously, I believe it would remove the ## marker that BERT uses for sub-words, which is essential for the BERT model, so I don’t think we should keep it.)
replace_wrep: my understanding is that it collapses repeated words, so ‘this is so cool cool cool’ becomes ‘this is so TK_WREP cool’. (I vote to discard it, because it introduces a new fastai token, and I think the BERT tokenizer handles repeated words differently? If anyone can comment on this one, that would be great.)
spec_add_spaces: this doesn’t introduce any new tokens, but the problem is that it breaks BERT’s ## marker by adding spaces around each #. I discard it.
rm_useless_spaces: I’m not sure about this one. I discarded it, but it seems to me we could keep it? It might be that the IMDB dataset doesn’t have this problem (or I just didn’t find a solid example in my test, so I simply removed it).
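To illustrate the concern about the ‘#’ character, here is a small sketch of what a space-padding rule and a space-collapsing rule do to a string that contains a ## marker. These two functions are simplified stand-ins written for this post, based on my reading of the rules; they are not copied from the fastai source, so check the real implementations before relying on the details.

```python
import re


def add_spaces_sketch(text):
    """Stand-in for a rule that pads every '/' and '#' with spaces."""
    return re.sub(r"([/#])", r" \1 ", text)


def collapse_spaces_sketch(text):
    """Stand-in for a rule that squeezes runs of spaces into a single space."""
    return re.sub(r" {2,}", " ", text)


padded = add_spaces_sketch("play ##ing")
print(collapse_spaces_sketch(padded))
```

After padding, the two ‘#’ characters are no longer attached to ‘ing’, so the sub-word marker is gone. The space-collapsing stand-in on its own leaves a string like ‘play ##ing’ untouched.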
None of the post rules can be kept, because they would change BERT’s [CLS] and [SEP] tokens to lowercase. These two tokens are essential for BERT models, so I simply discard both rules.
With that said, this is what I have for the pre and post rules:
fastai_tokenizer = Tokenizer(tok_func=FastaiBertTokenizer(bert_tok,max_seq=256),pre_rules=[fix_html],post_rules=[])
Let’s compare the same review with the version shown in the BERT-fastai article:
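A quick sketch of the problem, assuming an all-caps rule that works on the token list, lowercases any all-uppercase token, and puts a marker token in front of it. `all_caps_rule_sketch` and the "xxup" marker are placeholders for illustration, not the fastai implementation.

```python
def all_caps_rule_sketch(tokens, marker="xxup"):
    """Lowercase all-caps tokens longer than one character, prefixed by a marker."""
    out = []
    for tok in tokens:
        # "[CLS]".isupper() is True: every cased character in it is uppercase.
        if tok.isupper() and len(tok) > 1:
            out.extend([marker, tok.lower()])
        else:
            out.append(tok)
    return out


print(all_caps_rule_sketch(["[CLS]", "the", "film", "[SEP]"]))
```

The special tokens come out as ‘[cls]’ and ‘[sep]’, which no longer match the entries BERT expects in its vocabulary.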
[CLS] the premise of cabin fever starts like it might have something to offer . a group of college teens after finals ( in the fall ? ) goes to a resort cabin in the woods where one by one they are attacked by an unseen flesh eating virus . unfortunately , the first paragraph is where any remote elements of film quality stop . cabin fever is little more
What do you guys think about it?