Google AI has released BERT, a new architecture for transfer learning in NLP: https://www.reddit.com/r/MachineLearning/comments/9nfqxz/r_bert_pretraining_of_deep_bidirectional/
I think it is worth checking out, as it boasts SOTA results on a range of tasks with minimal architecture changes. On the other hand, the paper says the models are huge: 110M and 340M parameters for the base and large models respectively. Still, once Google releases the pretrained models, the language-modelling and transfer-learning approach could really take off.
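As a sanity check on those numbers, a back-of-the-envelope count (a sketch; it assumes the paper's reported configurations of L=12/H=768 for base and L=24/H=1024 for large plus a ~30k WordPiece vocabulary, and ignores biases, LayerNorm, and task heads) lands close to them:

```python
def bert_param_estimate(n_layers, hidden, vocab=30000, max_pos=512):
    """Back-of-the-envelope parameter count for a BERT-style encoder.
    Ignores biases, LayerNorm, and task heads, so it undershoots slightly."""
    embeddings = (vocab + max_pos + 2) * hidden   # token + position + segment tables
    per_layer = 12 * hidden ** 2                  # attention (4*H^2) + feed-forward (8*H^2)
    return embeddings + n_layers * per_layer

print(bert_param_estimate(12, 768) / 1e6)    # ~108M -> close to the reported 110M (base)
print(bert_param_estimate(24, 1024) / 1e6)   # ~333M -> close to the reported 340M (large)
```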
What do you guys think?
This paper seems really cool, but I'm a bit worried about the time it will take to run inference with the model. I am working with a somewhat similar architecture (OpenAI's GPT), and with a context of size 512 (the maximum length for this network) inference is extremely slow.
We can get around this problem by reducing the context size. I benchmarked performance with a context of size 80 on different GPUs and got the following results (in samples per second); a sketch of the timing harness follows the list:
- 173 with a GTX 1070
- 288 with a P100
- 435 with a V100
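For anyone who wants to reproduce this, here is a minimal sketch of the kind of timing harness I mean (the model, batch size, and warm-up count are placeholders; a vocab of ~40k roughly matches GPT's BPE vocabulary):

```python
import time
import torch

def samples_per_second(model, batch_size=16, context_len=80,
                       vocab_size=40000, n_batches=50, device="cuda"):
    """Rough inference throughput in samples per second."""
    model = model.to(device).eval()
    # Random token ids stand in for real inputs; shape is (batch, context).
    tokens = torch.randint(vocab_size, (batch_size, context_len), device=device)
    with torch.no_grad():
        for _ in range(5):              # warm-up iterations
            model(tokens)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(n_batches):
            model(tokens)
        torch.cuda.synchronize()
    return batch_size * n_batches / (time.time() - start)
```

Note that throughput depends heavily on the batch size you can fit, so absolute numbers will vary with your setup.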
I think that's something to keep in mind before trying to apply this kind of network to real-world applications.
If anyone is interested, I used a PyTorch implementation of this paper (full disclosure: I contributed to it).
Fine-tuning the PyTorch model for 3 epochs on ROCStories takes 10 minutes on a single NVIDIA K80.
That doesn't sound like a lot, but I haven't read the paper yet, so I might be missing something.
Can you help me put the numbers in context? If I think about ULMFiT, we get around 1k sentences per second when training a language model, so probably around 2k for inference on a single 1080 Ti. But that was a language-modelling task on Wikipedia, so I'm not sure how your numbers compare. What task have you run GPT against? How slow/fast are other models on the same task?
Regarding the context size of 512, do you have any numbers? How slow was it running?
Trying to load the weights from that converted model seems to work a little differently though, right?
For example, I'm getting an error that the key "0.encoder.weight" doesn't exist. The converted state dict has multiple weight keys under different names instead.
Is anyone familiar enough with TF-to-PyTorch conversion to tell me how to use the converted model in the fast.ai projects?
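To be concrete, this is the kind of key inspection and remapping I mean (a sketch; the "transformer." to "0.encoder." rename is a hypothetical example, and the real prefixes have to be read off the printed keys):

```python
import torch

def load_converted(model, checkpoint_path):
    """Inspect a converted checkpoint and remap key prefixes before loading.
    `model` is the fast.ai module being loaded into; the rename below is a
    placeholder that must be adapted to the actual key names."""
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    print(list(state_dict.keys())[:10])   # see what the checkpoint really contains
    remapped = {k.replace("transformer.", "0.encoder."): v
                for k, v in state_dict.items()}
    # strict=False reports, rather than errors on, keys that still don't match
    missing, unexpected = model.load_state_dict(remapped, strict=False)
    print("missing:", missing, "\nunexpected:", unexpected)
```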
I've run some tests with BERT on a CPU machine. My baseline was a simple feed-forward neural network with a single hidden layer (64 neurons) and some dropout on top of TF-IDF vectorization (sketched below). That simple model gives me 80-90% accuracy, while BERT gives only 56-65%.
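For reference, the baseline looks roughly like this (a sketch with toy stand-in data; in practice the texts, labels, and hyperparameters are the real ones):

```python
import torch
import torch.nn as nn
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in data; in practice these are the real documents and labels.
texts = ["first toy document about one topic", "second toy document about another"]
labels = torch.tensor([0, 1])
n_classes = 2

vectorizer = TfidfVectorizer(max_features=20000)
features = torch.tensor(vectorizer.fit_transform(texts).toarray(),
                        dtype=torch.float32)

model = nn.Sequential(
    nn.Linear(features.shape[1], 64),  # single hidden layer, 64 neurons
    nn.ReLU(),
    nn.Dropout(0.5),                   # "some dropout"
    nn.Linear(64, n_classes),
)

# One training step's worth of loss and gradients, just to show the wiring.
loss = nn.functional.cross_entropy(model(features), labels)
loss.backward()
```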
The documents in my datasets are several pages long. In fact, it seems BERT may perform poorly with longer texts; since its input is capped at 512 tokens, most of each document gets truncated away.
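One common workaround (not something tried in this thread) is to run the classifier over 512-token windows and average the predictions; a minimal sketch, assuming a model that maps a (1, seq_len) tensor of token ids to (1, n_classes) logits:

```python
import torch

def chunked_logits(model, token_ids, max_len=512):
    """Split a long document's token ids into max_len windows, classify
    each window, and average the logits across windows."""
    windows = [token_ids[i:i + max_len]
               for i in range(0, len(token_ids), max_len)]
    with torch.no_grad():
        logits = [model(torch.tensor(w).unsqueeze(0)) for w in windows]
    return torch.stack(logits).mean(dim=0)   # shape: (1, n_classes)
```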