In my prior (and current) NLP work, “no” and other forms of negation are SUPER important. And they would be filtered out as “stop words” if I applied such a preprocessing technique.
I’ve been using ULMFiT for genomics data. You need to write your own processing functions and train your own language models. I needed to write custom functions for:
Tokenizer
Vocab
NumericalizeProcessor
_join_texts
TokenizeProcessor
_get_processor
TextLMDataBunch
TextClasDataBunch
But really that’s stuff for turning your data into a form you can feed into the model. Everything that happens after you tokenize/numericalize your data is the same. Same AWD-LSTM model, same ULMFiT training process.
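To give a flavour of what one of those custom pieces looks like: here is a rough sketch of a k-mer tokenizer for genomic sequences (the names and k-mer scheme here are illustrative, not my exact code; in fastai v1 you would wrap something like this in a `BaseTokenizer` subclass and hand it to your custom `TokenizeProcessor`):

```python
def kmer_tokenize(seq, k=3, stride=1):
    """Split a genomic sequence into overlapping k-mer 'words'.

    Plain-Python sketch of the kind of custom tokenizer genomics
    ULMFiT needs in place of Spacy's word tokenizer.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

# e.g. kmer_tokenize("ATGCGT", k=3) -> ['ATG', 'TGC', 'GCG', 'CGT']
```

Once the sequences are tokenized like this, the vocab and numericalization steps work just as they do for English text.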
I liked that Jeremy mentioned he noticed a reduction in performance when he tried to remove Spacy and use a simpler tokenizer. Did anyone work out what components of Spacy added the biggest boost to performance? Perhaps one could use a subset of Spacy’s careful tokenization together with other tokenization ideas to further boost the performance of language models.
Speaking of alternative tokenizers, here is one that trended in HN today: https://github.com/Microsoft/BlingFire/blob/master/README.md
And here is a course on spacy:
Both are important in their own way, so I’d recommend a good mix.
Assuming GPU memory is not an issue, do you need to set bptt at all?
For what it’s worth, the AWD-LSTM github page says BPTT doesn’t impact final results.
Can language models be used for non-NLP data? Any experience doing that?
Surprising. What about modeling long-term dependencies?
I don’t think you have a GPU that can accommodate the total length of Wikipedia, even divided by batch size.
No, that is misleading: they say that bptt during validation doesn’t impact results, and that’s because there are no gradients in validation.
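To make the distinction concrete, here is a minimal sketch (illustrative, not the fastai loader itself) of the usual LM batching scheme: concatenate the corpus, cut it into `batch_size` parallel streams, and step through them in windows of `bptt` tokens. During training, bptt bounds how far gradients flow within a window; at validation there are no gradients, so the window length there doesn’t change the loss.

```python
def lm_batches(tokens, batch_size, bptt):
    """Arrange a flat token stream into (input, target) LM batches.

    Each of the batch_size streams is read left to right in windows
    of bptt tokens; targets are the inputs shifted by one position.
    """
    n = len(tokens) // batch_size               # length of each stream
    streams = [tokens[i * n:(i + 1) * n] for i in range(batch_size)]
    for start in range(0, n - 1, bptt):
        end = min(start + bptt, n - 1)
        x = [s[start:end] for s in streams]           # inputs
        y = [s[start + 1:end + 1] for s in streams]   # next-token targets
        yield x, y
```

Note that the hidden state is carried across windows, so the model can still pick up dependencies longer than bptt even though gradients are truncated at the window boundary.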
Since we already have bos and eos tokens, why not just start looking at the next document instead of trying to pad?
That’s for classification, so you need to classify each document.
So for just the language-modeling part, we don’t need this fancy sorting?
Nope, it stops at the LM_Dataset; the fancy sorting is only for the classification part.
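For anyone curious what that sorting buys you, here is a rough sketch of the “sortish” idea used on the classification side (the names here are illustrative, not the fastai `SortishSampler` API): shuffle, then sort within large chunks, so each batch holds documents of similar length and needs little padding, while the order still varies between epochs.

```python
import random

def sortish_batches(texts, batch_size, seed=42):
    """Group tokenized documents into batches of similar length."""
    rng = random.Random(seed)
    idxs = list(range(len(texts)))
    rng.shuffle(idxs)
    # sort within large chunks: mostly-by-length, but not exactly sorted
    chunk = batch_size * 50
    idxs = [i for c in range(0, len(idxs), chunk)
            for i in sorted(idxs[c:c + chunk],
                            key=lambda i: len(texts[i]), reverse=True)]
    return [idxs[b:b + batch_size] for b in range(0, len(idxs), batch_size)]
```

For the language model none of this is needed, since the corpus is concatenated into one long stream and every batch is the same shape anyway.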
Can the fast.ai API handle variable length sequences? Most of my work is in healthcare using longitudinal datasets that when processed on the patient level are variable length. I often have to work around it on my own as Pytorch’s packed padding doesn’t work for the nested structure of my data.
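The workaround I usually end up with is padding the nested structure to a dense rectangle plus a mask, since `pack_padded_sequence` only handles one level of variable length. A pure-Python sketch (illustrative names; in practice you’d build tensors the same way) for data shaped like [patients][visits][codes]:

```python
def pad_nested(patients, pad=0):
    """Pad a two-level variable-length structure to a dense rectangle.

    Returns (data, mask) with shape (n_patients, max_visits, max_codes);
    the mask is 1 where a real code is present, 0 on padding, so the
    model can ignore padded positions.
    """
    max_v = max(len(p) for p in patients)
    max_c = max((len(v) for p in patients for v in p), default=0)
    data, mask = [], []
    for p in patients:
        rows, mrows = [], []
        for v in p:
            rows.append(v + [pad] * (max_c - len(v)))
            mrows.append([1] * len(v) + [0] * (max_c - len(v)))
        for _ in range(max_v - len(p)):        # pad missing visits
            rows.append([pad] * max_c)
            mrows.append([0] * max_c)
        data.append(rows)
        mask.append(mrows)
    return data, mask
```

It wastes some memory relative to a packed representation, but it keeps the batch a plain dense tensor that any model can consume.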
Are there any recommended pre-trained Transformer-XL/BERT models available to use in the same way we can use fastai’s pre-trained AWD-LSTM model?
Hmm, I may need to revisit that. I didn’t get good results the first time I tried it: it just kept terminating really quickly without generating interesting text (the output of `predict` seemed much richer).
If packed sequence from Pytorch doesn’t work, we don’t have anything better.
Depends on how you define long term. I don’t have any issues with sequences of 1000-2000 tokens. I’ve tested one dataset that had extra long challenge sequences (up to 10000 or so). At first I had worse performance, but it turned out to be due to batching different-length sequences together and the `max_len` parameter in the `MultiBatchEncoder` being too low.
That said, this is all in the context of classification, where the model only needs to produce a single output for each input. A more interesting challenge would be to see if the model can learn long-term interactions (i.e. promoters and enhancers), but I haven’t tried anything like that yet.
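For context on why `max_len` mattered: the idea behind `MultiBatchEncoder` (sketched below with a stand-in for the real AWD-LSTM encoder; the names and details here are my approximation, not the fastai source) is to run the encoder over a long document a chunk at a time, but only keep the outputs for the last `max_len` tokens for the classifier head. If the signal lives earlier in the sequence than `max_len` reaches back, performance drops, which matched what I saw on the extra-long sequences.

```python
def chunked_encode(tokens, bptt, max_len):
    """Encode a long document chunk by chunk, keeping only the tail."""
    def encode(chunk):           # placeholder for the real RNN encoder
        return [t * 2 for t in chunk]
    outputs = []
    for i in range(0, len(tokens), bptt):     # process bptt tokens at a time
        outputs.extend(encode(tokens[i:i + bptt]))
    return outputs[-max_len:]    # classifier head only sees the last max_len
```

So raising `max_len` (at the cost of memory) was enough to recover the performance on the 10000-token sequences.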
Oh, that makes me think we have a Transformer-XL pretrained model sitting somewhere.