Custom tokenization using FastAI tokenizer


I am trying to create a language model using clinical notes. The dataset I’m using contains redacted fields for personally identifiable information such as names, SSNs, etc. I have created functions which use regex and custom rules to detect these fields and replace them with special tokens. For example, the redacted information for a first name is replaced by the special token xxfn as shown:

[**First Name (Titles) 137**] ----> xxfn

As mentioned earlier, I have custom rules and regex for each and every redacted piece of information (and there are a lot; for a full list you can check here), such that when this preprocessing is done, a regex to capture [** **] would return nothing.
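A minimal sketch of this kind of rule-based replacement (the patterns and helper name here are illustrative stand-ins, not the actual rules from the full list):

```python
import re

# Hypothetical rules mapping redaction patterns to special tokens;
# the real rule set covers many more field types.
REDACTION_RULES = [
    (re.compile(r"\[\*\*First Name.*?\*\*\]"), "xxfn"),
    (re.compile(r"\[\*\*Last Name.*?\*\*\]"), "xxln"),
    (re.compile(r"\[\*\*Social Security Number.*?\*\*\]"), "xxssn"),
]

def replace_redactions(text):
    """Replace each redacted [** ... **] field with its special token."""
    for pattern, token in REDACTION_RULES:
        text = pattern.sub(token, text)
    return text

note = "Seen by [**First Name (Titles) 137**] [**Last Name (Titles) 138**]."
print(replace_redactions(note))  # Seen by xxfn xxln.
```

After all such rules have run, a check like `re.search(r"\[\*\*.*?\*\*\]", text)` should come back empty.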

I did this preprocessing prior to the release of FastAI 1.0. I have some questions about how to move forward with creating my language model.

  1. I initially thought that I could use this preprocessed text as input to create the language model. However, I have a question about how the tokenization process would handle my special tokens. Would it replace tokens such as xxfn and xxln with xxunk?
  2. I see that the tokenizer has an option for special_cases. If I pass a list of all my special tokens, would they then be spared from being replaced?
  3. Would it be better to use FastAI’s API, using the rules option in the tokenizer, to accomplish the same thing I have already done in a custom tokenizer?



The tokens won’t be replaced. Placing them in special_cases will tell the spaCy tokenizer (the default) that those are special tokens, so it’s best to do it.

If you put your rules in a fastai tokenizer, you will only have one step of preprocessing, which will also help at inference (you can now call learn.predict directly on a text, but it needs to know all the rules to process it properly).
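Such rules are essentially string-to-string functions applied in order before tokenization, so existing preprocessing can be folded in. A rough sketch of the idea (the function names are illustrative stand-ins for the actual rules):

```python
import re

# Hypothetical pre-tokenization rules mirroring the custom preprocessing.
def replace_first_names(t):
    return re.sub(r"\[\*\*First Name.*?\*\*\]", "xxfn", t)

def replace_last_names(t):
    return re.sub(r"\[\*\*Last Name.*?\*\*\]", "xxln", t)

custom_rules = [replace_first_names, replace_last_names]

def apply_rules(text, rules):
    """Apply each rule in order, the way the tokenizer would before splitting."""
    for rule in rules:
        text = rule(text)
    return text

processed = apply_rules("Seen by [**First Name (Titles) 137**].", custom_rules)
print(processed)  # Seen by xxfn.
```

A list like custom_rules is what would be handed to the tokenizer's rules option, so preprocessing and tokenization become a single step.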


Thank you for your reply.

It is my understanding that xxunk is used for OOV (out-of-vocabulary) words. I would think that my custom tokens would not be found in a vocabulary trained on Wikipedia. If that is the case, how would my custom tokens avoid being replaced?

What exactly is the special_cases option used for?

The vocabulary used is the one built on your dataset; we then change the embeddings in the pretrained model by only keeping the ones that appear in the new dataset’s vocab. The new tokens get an embedding equal to the mean of all embeddings, and those are learned during the first stage of training.
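The embedding transfer described above can be sketched in a few lines (a simplified toy illustration; fastai does this on the model's actual weight matrices):

```python
# Simplified sketch of transferring pretrained embeddings to a new vocab.
pretrained = {          # token -> embedding vector (toy 3-d examples)
    "the":   [0.1, 0.2, 0.3],
    "blood": [0.4, 0.5, 0.6],
}
new_vocab = ["the", "blood", "xxfn"]   # xxfn is unseen by the pretrained model

# Mean of all pretrained embeddings, used to initialise new tokens.
dim = 3
mean_emb = [sum(v[i] for v in pretrained.values()) / len(pretrained)
            for i in range(dim)]

# Known tokens keep their pretrained vector; xxfn starts at the mean
# embedding and gets refined during the first stage of fine-tuning.
new_embeddings = [pretrained.get(tok, mean_emb) for tok in new_vocab]
```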


Of course! How stupid of me to ask a silly question. I must’ve forgotten Jeremy’s IMDB lesson from part 2 earlier this year. Sorry about that.

I agree with this totally. However, I will be preprocessing (i.e., applying my custom rules to) ALL my texts initially and will only use the processed texts for all target tasks. This is because there is only a limited number of medical notes, which will be split into training/validation/testing. So I don’t need to worry too much about prediction, since the processed text is what will be fed into it. Having said that, I will look into how to customize the FastAI tokenizer, as I see lots of benefits in doing so.


  1. In the steps for language modeling, where do we pass a customized tokenizer?

data_lm = (TextList.from_df(sample, path, cols=['DESCRIPTION', 'PROC_TEXT'])

  2. In addition, I would like custom values for max_vocab, min_freq, etc. Where do I pass all those options?
  3. show_batch throws a StopIteration error. I was operating on a sample of the entire data, so I thought the sample size might be too small and show_batch may have tried to retrieve more data than was available; however, even when I increase the sample size to 10,000, the error persists:
StopIteration                             Traceback (most recent call last)
<ipython-input-15-f346687833ba> in <module>
----> 1 data_lm.show_batch()

~/fastai/fastai/text/ in show_batch(self, sep, ds_type, rows, max_len)
    224         from IPython.display import display, HTML
    225         dl = self.dl(ds_type)
--> 226         x,y = next(iter(dl))
    227         items = [['idx','text']]
    228         for i in range(rows):


Any thoughts/suggestions?


For 1. and 2., you need to design your custom processor. This is fairly easy: just pass the list [TokenizeProcessor(...), NumericalizeProcessor(...)] with all the arguments you want. Then you can give this processor in the first call, TextList.from_df.
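For intuition, what NumericalizeProcessor's max_vocab and min_freq control can be sketched in plain Python (a simplified illustration of the vocabulary build, not fastai's implementation):

```python
from collections import Counter

def build_vocab(tokens, max_vocab=60000, min_freq=2):
    """Keep at most max_vocab tokens seen at least min_freq times;
    everything else will later map to xxunk."""
    counts = Counter(tokens)
    vocab = ["xxunk"] + [tok for tok, c in counts.most_common(max_vocab)
                         if c >= min_freq]
    return vocab

tokens = ["pain", "pain", "pain", "fever", "fever", "rash"]
vocab = build_vocab(tokens, max_vocab=2, min_freq=2)
print(vocab)  # ['xxunk', 'pain', 'fever']
```

Here "rash" is dropped because it appears fewer than min_freq times, and max_vocab caps how many frequent tokens are kept.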

show_batch tries to get a number of samples equal to rows, so you should ask for fewer rows than the batch size. If your error comes from the specific line x,y = next(iter(dl)), though, it’s a problem with the setup of your data, and you’ll never be able to train your model.


Thank you for your reply.

Since the show_batch error seems to be the bigger concern, I decided to focus on that first before worrying about custom tokenization. Consequently, I started debugging my code to figure out why I get a StopIteration exception even when I call show_batch with rows=1 on a sample dataset of size 1,000.

What I discovered is that the length of the dataloader, len(data_lm.train_dl), is 0, i.e., there is no data to iterate over, hence the StopIteration.

As part of my debugging strategy, I kept increasing my sample size, essentially doing a binary search for the smallest sample size that doesn’t throw an error on show_batch, calling show_batch(rows=1) each time. I found that when sampling 1320 data points from my dataset (basically doing sample = texts.sample(1320)), I did not get an error when I called show_batch, and I got an output of 1 row. But when I sampled 1319 data points, I got the error again, which is very weird.

Since the full size of my dataset is over 2 million, I’m guessing this shouldn’t be a big problem.

Any ideas on why this is happening? I would have guessed that even with a dataset of size 64 (since 64 is the default batch size), show_batch(rows=1) should output 1 row rather than raise a StopIteration error.

You’re forgetting the bptt: the first batch especially is augmented by 25, so you need (70 (default bptt) + 25) = 95 tokens, multiplied by your batch size.


Ah! So if the default batch size is 64, then show_batch needs 95 * 64 = 6080 tokens to not throw an error. That’s good to know. Where would I pass a parameter to change the default batch size in the data block?

At the databunch call; you can also set your bptt there.

Until recently, when I looked at the processed output using show_batch, I saw individual fields marked with tokens such as xxfld 1 and so on. But now I don’t see them (I always do a git pull; pip install -e .[dev] before I start working). Digging into the code, I found that it is now an option passed via the variable mark_fields, which is set to False by default.

  1. Why has this change been made? Is it better for me to mark fields or not when fine-tuning the LM?
  2. If I mark (or don’t mark) the fields during fine-tuning, should I also do the same for the target task during processing?


We changed the default since people often have only one field, so it didn’t make any sense to add that token. If you have several fields, you should add that token, both for fine-tuning (so that the language model learns what this fld token means) and for the target task.
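What mark_fields=True produces can be illustrated with a small sketch (the helper below is hypothetical; the real marking happens inside fastai's tokenization rules):

```python
def mark_fields_text(fields):
    """Join multiple text fields, prefixing each with an xxfld marker
    so the model can tell where one field ends and the next begins."""
    return " ".join(f"xxfld {i} {text}" for i, text in enumerate(fields, 1))

# A toy two-field row, e.g. DESCRIPTION and PROC_TEXT.
row = ["Chest pain on exertion", "ECG normal, troponin negative"]
marked = mark_fields_text(row)
print(marked)
# xxfld 1 Chest pain on exertion xxfld 2 ECG normal, troponin negative
```

With a single field, the marker carries no information, which is why the default was changed.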


@sgugger Can you please elaborate on the use of TokenizeProcessor(…)? I am working with the Arabic language.

ar_tok = TokenizeProcessor(tokenizer=Tokenizer(lang='ar'))
data_cls = (TextList.from_csv('./dataset/', 'ec_train+dev+test_clean_ar.csv', cols='clean_tweet', vocab=data_lm.vocab,processor=ar_tok)
                .label_from_df(cols=['anger', 'anticipation', 'disgust', 'fear', 'joy','love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust'], label_cls=MultiCategoryList, one_hot=True)

This gives me TypeError: list indices must be integers or slices, not str

However, if I use no processor, the code runs just fine. Please advise on what I’m doing wrong.