Do you actually set the bptt length in an RNN, or is it figured out from the length of your sequence (i.e. time steps)?
How do you handle larger contexts?
It’s one of the things you set, because it depends on the GPU memory you have.
What are the tradeoffs to consider between bs and bptt?
For example, bptt 10 with bs 100 vs. bptt 100 with bs 10: both would pass 1,000 tokens at a time to the model. But what should you consider when tuning the ratio?
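Roughly, the corpus gets laid out as bs parallel streams and each batch is a bptt-long window over all of them, so bs × bptt tokens per step. Here's a minimal sketch of that layout (illustrative only, not fastai's actual implementation; the function name and numbers are made up):

```python
import numpy as np

# Minimal sketch (not fastai's actual implementation) of how a token stream
# is laid out for truncated BPTT: the corpus is split into `bs` parallel
# streams, and each training step consumes a window of `bptt` tokens from
# every stream, so a batch is bs x bptt tokens.
def make_lm_batches(token_ids, bs, bptt):
    n = len(token_ids) // bs * bs                       # drop the ragged tail
    streams = np.array(token_ids[:n]).reshape(bs, -1)   # bs parallel streams
    for i in range(0, streams.shape[1] - 1, bptt):
        seq_len = min(bptt, streams.shape[1] - 1 - i)
        x = streams[:, i : i + seq_len]                  # inputs
        y = streams[:, i + 1 : i + 1 + seq_len]          # targets, shifted by one
        yield x, y

# bptt=100 with bs=10 and bptt=10 with bs=100 both feed 1000 tokens per batch.
tokens = list(range(100_000))
for x, y in make_lm_batches(tokens, bs=10, bptt=100):
    pass  # most batches: x.shape == (10, 100)
```

Same number of tokens per batch either way, but the longer bptt lets gradients flow through ten times as many time steps (better for long-range patterns), while the larger bs gives more independent streams and usually a smoother gradient estimate.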
In my prior (and current) NLP work, “no” and other forms of negation are SUPER important. And they would be filtered out as “stop words” if I applied such a preprocessing technique.
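For a concrete (made-up) illustration of why that matters, here's what a naive stop-word filter does to a sentence whose meaning hinges on negation. The stop-word list below is invented, but most off-the-shelf lists do include words like "no" and "not":

```python
# Illustrative only: a tiny made-up stop-word list that, like many
# off-the-shelf lists, contains negation words.
stop_words = {"the", "a", "is", "no", "not", "of", "in"}

def remove_stop_words(text):
    return " ".join(w for w in text.lower().split() if w not in stop_words)

print(remove_stop_words("The drug showed no benefit"))
# -> "drug showed benefit"   (the negation, and with it the meaning, is gone)
```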
I’ve been using ULMFiT for genomics data. You need to write your own processing functions and train your own language models. I needed to write custom functions for:
Tokenizer
Vocab
NumericalizeProcessor
_join_texts
TokenizeProcessor
_get_processor
TextLMDataBunch
TextClasDataBunch
But really that’s stuff for turning your data into a form you can feed into the model. Everything that happens after you tokenize/numericalize your data is the same. Same AWD-LSTM model, same ULMFiT training process.
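The poster doesn't share their code, but as a rough illustration of the kind of custom Tokenizer/Vocab work involved, here's a hypothetical k-mer tokenizer for DNA sequences (the k-mer approach and all names here are my assumptions, not necessarily what they did):

```python
# Hypothetical sketch of a genomics tokenizer: split a DNA sequence into
# overlapping k-mers so the language model sees a fixed vocabulary of
# subsequences instead of single bases. Illustration only, not the
# poster's actual code.
def kmer_tokenize(seq, k=5, stride=1):
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

tokens = kmer_tokenize("ATGCGTACGTTAG", k=5)
# ['ATGCG', 'TGCGT', 'GCGTA', 'CGTAC', 'GTACG', 'TACGT', 'ACGTT', 'CGTTA', 'GTTAG']
vocab = sorted(set(tokens))                      # custom Vocab
stoi = {t: i for i, t in enumerate(vocab)}       # numericalization mapping
ids = [stoi[t] for t in tokens]
```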
I liked that Jeremy mentioned he noticed a reduction in performance when he tried to remove Spacy and use a simpler tokenizer. Did anyone work out what components of Spacy added the biggest boost to performance? Perhaps one could use a subset of Spacy’s careful tokenization together with other tokenization ideas to further boost the performance of language models.
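I don't know which spaCy component accounts for the boost, but it's easy to see the kind of thing its rule-based tokenizer handles that a naive split doesn't (contractions, punctuation, infix hyphens). A quick sketch, assuming spaCy is installed:

```python
import spacy

# Quick comparison (not an answer to which spaCy component matters most):
# a naive whitespace split vs spaCy's rule-based tokenizer.
nlp = spacy.blank("en")          # a blank pipeline still includes the tokenizer

text = "I don't think it's over-rated, do you?"
print(text.split())
# ['I', "don't", 'think', "it's", 'over-rated,', 'do', 'you?']
print([t.text for t in nlp(text)])
# e.g. ['I', 'do', "n't", 'think', 'it', "'s", 'over', '-', 'rated', ',', 'do', 'you', '?']
```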
Speaking of alternative tokenizers, here is one that trended in HN today: https://github.com/Microsoft/BlingFire/blob/master/README.md
And here is a course on spacy:
Both are important in their own way, so I’d recommend a good mix.
Assuming GPU memory is not an issue, do you need to set it?
For what it’s worth, the AWD-LSTM github page says BPTT doesn’t impact final results.
Can language models be used for non-NLP tasks? Any experience doing that?
Surprising. What about modeling long-term dependencies?
I don’t think you have a GPU that can accommodate the total length of Wikipedia, even divided by batch size.
No, that is misleading. They say the bptt used in validation doesn’t impact the result; that’s because there are no gradients in validation.
Since we already have bos and eos tokens, why not just start looking at the next document instead of trying to pad?
That’s for classification, so you need to classify each document.
So for just the language modeling part, we don’t need this fancy sorting?
Nope, it stops at the LM_Dataset; the fancy sorting is only for the classification part.
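For anyone curious, here's the idea behind that fancy sorting in sketch form: group documents of similar length into the same batch so the padding needed for whole-document classification is minimal. This is just the concept, not fastai's exact SortishSampler:

```python
# Sketch of the idea behind the "fancy sorting" for classification:
# put documents of similar length in the same batch so that padding
# (needed because every doc must be classified whole) is minimized.
def length_sorted_batches(docs, bs):
    order = sorted(range(len(docs)), key=lambda i: len(docs[i]), reverse=True)
    for i in range(0, len(order), bs):
        batch_idx = order[i:i + bs]
        max_len = len(docs[batch_idx[0]])
        # pad every doc in the batch only up to the longest doc in that batch
        batch = [docs[j] + ["xxpad"] * (max_len - len(docs[j])) for j in batch_idx]
        yield batch

docs = [["xxbos", "good", "movie", "xxeos"],
        ["xxbos", "terrible", "xxeos"],
        ["xxbos", "a", "truly", "great", "film", "xxeos"]]
for batch in length_sorted_batches(docs, bs=2):
    print([len(d) for d in batch])   # lengths are equal within each batch
```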
Can the fast.ai API handle variable length sequences? Most of my work is in healthcare using longitudinal datasets that when processed on the patient level are variable length. I often have to work around it on my own as Pytorch’s packed padding doesn’t work for the nested structure of my data.
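For flat (non-nested) variable-length sequences, plain PyTorch padding plus packing looks like the sketch below. This is ordinary PyTorch, not a statement about what the fastai API does internally, and it doesn't solve the nested-structure problem you mention; shapes and sizes here are made up:

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Plain-PyTorch sketch: pad variable-length per-patient sequences to a
# common length, then pack them so the LSTM skips the padded steps.
seqs = [torch.randn(5, 8), torch.randn(3, 8), torch.randn(7, 8)]  # (time, features)
lengths = torch.tensor([s.shape[0] for s in seqs])

padded = pad_sequence(seqs, batch_first=True)           # (batch, max_time, features)
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

lstm = torch.nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
out, (h, c) = lstm(packed)                              # h: (1, batch, 16)
```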
Are there any recommended pre-trained Transformer/XL/BERT models available to use in the same way we can use fastai’s pre-trained AWD-LSTM model?