Lesson 12 (2019) discussion and wiki

In my prior (and current) NLP work, “no” and other forms of negation are SUPER important. And they would be filtered out as “stop words” if I applied such a preprocessing technique.

3 Likes

I’ve been using ULMFiT for genomics data. You need to write your own processing functions and train your own language models. I needed to write custom functions for:

Tokenizer
Vocab
NumericalizeProcessor
_join_texts
TokenizeProcessor
_get_processor
TextLMDataBunch
TextClasDataBunch

But really that’s stuff for turning your data into a form you can feed into the model. Everything that happens after you tokenize/numericalize your data is the same. Same AWD-LSTM model, same ULMFiT training process.
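
For anyone wondering what one of those custom pieces looks like, here’s a minimal sketch of the tokenizer part, assuming fastai v1’s BaseTokenizer/Tokenizer interface. The k-mer scheme, the k value and the stride are purely illustrative, not necessarily what I used:

```python
from fastai.text import BaseTokenizer, Tokenizer

class KmerTokenizer(BaseTokenizer):
    "Split a genomic sequence into overlapping k-mers (illustrative scheme)."
    def __init__(self, lang='en', k=5, stride=1):
        super().__init__(lang)
        self.k, self.stride = k, stride

    def tokenizer(self, t):
        t = t.upper().replace(' ', '')
        return [t[i:i + self.k] for i in range(0, len(t) - self.k + 1, self.stride)]

    def add_special_cases(self, toks):
        pass  # no special cases needed for DNA/RNA

# Drop fastai's English-specific pre/post rules; they don't apply to sequence data.
tok = Tokenizer(tok_func=lambda lang: KmerTokenizer(lang, k=5), pre_rules=[], post_rules=[])
```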

13 Likes

I liked that Jeremy mentioned he noticed a reduction in performance when he tried to remove spaCy and use a simpler tokenizer. Did anyone work out which components of spaCy added the biggest boost to performance? Perhaps one could use a subset of spaCy’s careful tokenization together with other tokenization ideas to further boost the performance of language models.
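
For anyone who wants to see what that careful tokenization actually does, here’s a quick comparison of a naive whitespace split against spaCy’s tokenizer (a blank English pipeline, which is roughly what fastai’s default SpacyTokenizer wraps). The sentence and the outputs shown in the comments are just illustrative:

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no tagger/parser needed

text = "Don't panic, it's well-tested (I think)."

print(text.split())
# naive split keeps punctuation glued to words: ["Don't", 'panic,', "it's", ...]

print([t.text for t in nlp(text)])
# spaCy splits contractions and punctuation into their own tokens,
# e.g. something like ['Do', "n't", 'panic', ',', 'it', "'s", ...]
```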

Speaking of alternative tokenizers, here is one that trended in HN today: https://github.com/Microsoft/BlingFire/blob/master/README.md

And here is a course on spaCy:

11 Likes

Try bla :wink:
Both are important in their own way, so I’d recommend a good mix.

3 Likes

Assuming GPU memory is not an issue, do you need to set bptt at all?

For what it’s worth, the AWD-LSTM GitHub page says BPTT doesn’t impact final results.

3 Likes

Can language models be used for non-NLP data? Any experience doing that?

1 Like

Surprising. What about modeling long-term dependencies?

2 Likes

I don’t think you have a GPU that can accommodate the total length of Wikipedia, even divided by batch size :wink:

4 Likes

No, that is misleading. They say the bptt in validation doesn’t impact results. That’s because there are no gradients in validation.
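
To spell that out: in training, the hidden state is carried across bptt-sized chunks but its history is cut with detach() at each chunk boundary, so bptt caps how far back gradients flow; in validation there is no backward pass, so the chunk size only changes how the data is fed. A rough PyTorch sketch (the model here is any stateful LM that takes and returns a hidden state, purely for illustration):

```python
import torch

def train_epoch(model, data, targets, bptt, opt, loss_fn):
    "Truncated BPTT: carry the hidden state forward, cut its gradient history each chunk."
    hidden = None
    for i in range(0, data.size(0), bptt):
        x, y = data[i:i + bptt], targets[i:i + bptt]
        if hidden is not None:
            hidden = tuple(h.detach() for h in hidden)  # keep state, drop gradient history
        out, hidden = model(x, hidden)
        loss = loss_fn(out.view(-1, out.size(-1)), y.view(-1))
        opt.zero_grad()
        loss.backward()   # gradients reach back at most `bptt` steps
        opt.step()

@torch.no_grad()
def validate(model, data, targets, bptt, loss_fn):
    "No backward pass here, so bptt is just a chunk size and doesn't change the loss."
    model.eval()
    hidden, total, n = None, 0.0, 0
    for i in range(0, data.size(0), bptt):
        x, y = data[i:i + bptt], targets[i:i + bptt]
        out, hidden = model(x, hidden)
        total += loss_fn(out.view(-1, out.size(-1)), y.view(-1)).item() * y.numel()
        n += y.numel()
    return total / n
```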

8 Likes

Since we already have bos and eos tokens, why not just start looking at the next document instead of trying to pad?

1 Like

That’s for classification, so you need to classify each document.

2 Likes

So for just the language modeling part, we don’t need this fancy sorting?

1 Like

Nope, it stops at the LM_Dataset; the fancy sorting is only for the classification part.

1 Like

Can the fast.ai API handle variable-length sequences? Most of my work is in healthcare, using longitudinal datasets that are variable-length when processed at the patient level. I often have to work around this on my own, as PyTorch’s packed padding doesn’t work for the nested structure of my data.

3 Likes

Are there any recommended pre-trained Transformer/XL/BERT models available to use in the same way we can use fastai’s pre-trained AWD-LSTM model?

2 Likes

Hmm, I may need to revisit that. I didn’t get good results the first time I tried it. It just kept terminating really quickly without generating interesting text (the predict output seemed much richer).

If PyTorch’s packed sequences don’t work, we don’t have anything better.
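
For reference, here’s the standard PyTorch recipe for flat variable-length batches (shapes are made up); it’s the nested structure described above that this doesn’t cover, not the padding itself:

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Three sequences of different lengths, 8 features per timestep (illustrative shapes)
seqs = [torch.randn(5, 8), torch.randn(3, 8), torch.randn(7, 8)]
lengths = torch.tensor([s.size(0) for s in seqs])

padded = pad_sequence(seqs, batch_first=True)   # (3, 7, 8), zero-padded to the longest
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

lstm = torch.nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
packed_out, (h, c) = lstm(packed)               # the LSTM skips the padded positions
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
```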

Depends on how you define long-term. I don’t have any issues with sequences of 1000-2000 tokens. I’ve tested one dataset that had extra-long challenge sequences (up to 10,000 or so). At first I had worse performance, but it turned out to be due to batching different-length sequences together and the max_len parameter in the MultiBatchEncoder being too low.

That said, this is all in the context of classification. The model only needs to focus on a single output for each input. A more interesting challenge would be to see if the model can learn long-term interactions (i.e. promoters and enhancers), but I haven’t tried anything like that yet.
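
To make the max_len point concrete: the idea behind fastai’s MultiBatchEncoder is roughly to run the encoder over the sequence in bptt-sized chunks and keep only the last max_len timesteps of output for the classifier head. The sketch below is a paraphrase of that mechanism, not fastai’s actual implementation (the real one also tracks padding masks):

```python
import torch

def encode_in_chunks(encoder, x, bptt=70, max_len=1400):
    """Encode a (bs, seq_len) batch of token ids chunk by chunk, keeping at most
    the last `max_len` timesteps of output for the classifier head.
    Assumes a stateful, AWD-LSTM-style encoder with a .reset() method that
    returns a (bs, chunk_len, hidden) tensor -- a sketch, not fastai's code."""
    encoder.reset()
    outputs = []
    for i in range(0, x.size(1), bptt):
        out = encoder(x[:, i:i + bptt])
        # only keep chunks that can contribute to the last max_len positions
        if i + bptt > x.size(1) - max_len:
            outputs.append(out)
    return torch.cat(outputs, dim=1)[:, -max_len:]
```

If max_len is set too low, the head simply never sees the earlier part of the sequence, which matches the performance drop I saw.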

3 Likes

Oh, that makes me think we have a Transformer-XL pretrained model sitting somewhere.
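
Swapping it in would look just like the AWD-LSTM call; a sketch assuming fastai v1’s text API (whether pretrained=True actually finds Transformer-XL weights depends on the release):

```python
from fastai.text import language_model_learner, TransformerXL

# data_lm is a TextLMDataBunch built beforehand; just swap the arch in place of AWD_LSTM
learn = language_model_learner(data_lm, TransformerXL, pretrained=True, drop_mult=0.3)
learn.fit_one_cycle(1, 1e-3)
```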

4 Likes