In my prior (and current) NLP work, “no” and other forms of negation are SUPER important. And they would be filtered out as “stop words” if I applied such a preprocessing technique.
I’ve been using ULMFiT for genomics data. You need to write your own processing functions and train your own language models. I needed to write custom functions for:
Tokenizer
Vocab
NumericalizeProcessor
_join_texts
TokenizeProcessor
_get_processor
TextLMDataBunch
TextClasDataBunch
But really that’s stuff for turning your data into a form you can feed into the model. Everything that happens after you tokenize/numericalize your data is the same. Same AWD-LSTM model, same ULMFiT training process.
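To give a flavour of what one of those custom pieces looks like: here is a rough sketch of a k-mer tokenizer for genomic sequences (the names and k-mer scheme here are illustrative, not my exact code; in fastai v1 you would wrap something like this in a `BaseTokenizer` subclass and hand it to your custom `TokenizeProcessor`):

```python
def kmer_tokenize(seq, k=3, stride=1):
    """Split a genomic sequence into overlapping k-mer 'words'.

    Plain-Python sketch of the kind of custom tokenizer genomics
    ULMFiT needs in place of Spacy's word tokenizer.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

# e.g. kmer_tokenize("ATGCGT", k=3) -> ['ATG', 'TGC', 'GCG', 'CGT']
```

Once the sequences are tokenized like this, the vocab and numericalization steps work just as they do for English text.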
I liked that Jeremy mentioned he noticed a reduction in performance when he tried to remove Spacy and use a simpler tokenizer. Did anyone work out what components of Spacy added the biggest boost to performance? Perhaps one could use a subset of Spacy’s careful tokenization together with other tokenization ideas to further boost the performance of language models.
Speaking of alternative tokenizers, here is one that trended in HN today: https://github.com/Microsoft/BlingFire/blob/master/README.md
And here is a course on spacy:
Both are important in their own way, so I’d recommend a good mix.
Assuming GPU memory is not an issue, do you need to set bptt at all?
For what it’s worth, the AWD-LSTM github page says BPTT doesn’t impact final results.
Can language models be used for non-NLP data? Any experience doing that?
Surprising. What about modeling long-term dependencies?
I don’t think you have a GPU that can accommodate the total length of Wikipedia, even divided by batch size.
No, that is misleading: they say that bptt during validation doesn’t impact results, and that’s because there are no gradients in validation.
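To make the distinction concrete, here is a minimal sketch (illustrative, not the fastai loader itself) of the usual LM batching scheme: concatenate the corpus, cut it into `batch_size` parallel streams, and step through them in windows of `bptt` tokens. During training, bptt bounds how far gradients flow within a window; at validation there are no gradients, so the window length there doesn’t change the loss.

```python
def lm_batches(tokens, batch_size, bptt):
    """Arrange a flat token stream into (input, target) LM batches.

    Each of the batch_size streams is read left to right in windows
    of bptt tokens; targets are the inputs shifted by one position.
    """
    n = len(tokens) // batch_size               # length of each stream
    streams = [tokens[i * n:(i + 1) * n] for i in range(batch_size)]
    for start in range(0, n - 1, bptt):
        end = min(start + bptt, n - 1)
        x = [s[start:end] for s in streams]           # inputs
        y = [s[start + 1:end + 1] for s in streams]   # next-token targets
        yield x, y
```

Note that the hidden state is carried across windows, so the model can still pick up dependencies longer than bptt even though gradients are truncated at the window boundary.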
Since we already have bos and eos tokens, why not just start looking at the next document instead of trying to pad?
That’s for classification, so you need to classify each document.
So for just the language-modeling part, we don’t need this fancy sorting?
Nope, it stops at the LM_Dataset; the fancy sorting is only for the classification part.
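For anyone curious what that sorting buys you, here is a rough sketch of the “sortish” idea used on the classification side (the names here are illustrative, not the fastai `SortishSampler` API): shuffle, then sort within large chunks, so each batch holds documents of similar length and needs little padding, while the order still varies between epochs.

```python
import random

def sortish_batches(texts, batch_size, seed=42):
    """Group tokenized documents into batches of similar length."""
    rng = random.Random(seed)
    idxs = list(range(len(texts)))
    rng.shuffle(idxs)
    # sort within large chunks: mostly-by-length, but not exactly sorted
    chunk = batch_size * 50
    idxs = [i for c in range(0, len(idxs), chunk)
            for i in sorted(idxs[c:c + chunk],
                            key=lambda i: len(texts[i]), reverse=True)]
    return [idxs[b:b + batch_size] for b in range(0, len(idxs), batch_size)]
```

For the language model none of this is needed, since the corpus is concatenated into one long stream and every batch is the same shape anyway.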
Can the fast.ai API handle variable length sequences? Most of my work is in healthcare using longitudinal datasets that when processed on the patient level are variable length. I often have to work around it on my own as Pytorch’s packed padding doesn’t work for the nested structure of my data.
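The workaround I usually end up with is padding the nested structure to a dense rectangle plus a mask, since `pack_padded_sequence` only handles one level of variable length. A pure-Python sketch (illustrative names; in practice you’d build tensors the same way) for data shaped like [patients][visits][codes]:

```python
def pad_nested(patients, pad=0):
    """Pad a two-level variable-length structure to a dense rectangle.

    Returns (data, mask) with shape (n_patients, max_visits, max_codes);
    the mask is 1 where a real code is present, 0 on padding, so the
    model can ignore padded positions.
    """
    max_v = max(len(p) for p in patients)
    max_c = max((len(v) for p in patients for v in p), default=0)
    data, mask = [], []
    for p in patients:
        rows, mrows = [], []
        for v in p:
            rows.append(v + [pad] * (max_c - len(v)))
            mrows.append([1] * len(v) + [0] * (max_c - len(v)))
        for _ in range(max_v - len(p)):        # pad missing visits
            rows.append([pad] * max_c)
            mrows.append([0] * max_c)
        data.append(rows)
        mask.append(mrows)
    return data, mask
```

It wastes some memory relative to a packed representation, but it keeps the batch a plain dense tensor that any model can consume.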
Are there any recommended pre-trained Transformer-XL/BERT models available to use in the same way we can use fastai’s pre-trained AWD-LSTM model?
Hmm, I may need to revisit that. I didn’t get good results the first time I tried it: it just kept terminating really quickly without generating interesting text (the output of `predict` seemed much richer).
If packed sequence from Pytorch doesn’t work, we don’t have anything better.
Depends on how you define long term. I don’t have any issues with sequences of 1000-2000 tokens. I’ve tested one dataset that had extra long challenge sequences (up to 10000 or so). At first I had worse performance, but it turned out to be due to batching different-length sequences together and the `max_len` parameter in the `MultiBatchEncoder` being too low.
That said, this is all in the context of classification, where the model only needs to produce a single output for each input. A more interesting challenge would be to see if the model can learn long-term interactions (i.e. promoters and enhancers), but I haven’t tried anything like that yet.
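For context on why `max_len` mattered: the idea behind `MultiBatchEncoder` (sketched below with a stand-in for the real AWD-LSTM encoder; the names and details here are my approximation, not the fastai source) is to run the encoder over a long document a chunk at a time, but only keep the outputs for the last `max_len` tokens for the classifier head. If the signal lives earlier in the sequence than `max_len` reaches back, performance drops, which matched what I saw on the extra-long sequences.

```python
def chunked_encode(tokens, bptt, max_len):
    """Encode a long document chunk by chunk, keeping only the tail."""
    def encode(chunk):           # placeholder for the real RNN encoder
        return [t * 2 for t in chunk]
    outputs = []
    for i in range(0, len(tokens), bptt):     # process bptt tokens at a time
        outputs.extend(encode(tokens[i:i + bptt]))
    return outputs[-max_len:]    # classifier head only sees the last max_len
```

So raising `max_len` (at the cost of memory) was enough to recover the performance on the 10000-token sequences.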
Oh, that makes me think we have a Transformer-XL pretrained model sitting somewhere.