Google BERT Language Models


There is a new architecture for transfer learning in NLP released by Google AI called BERT:
I think it is worth checking out, as it boasts SOTA results on a range of tasks with minimal architecture changes. On the other hand, the paper says the models are huge: 110M and 340M parameters for the base and large models respectively. Still, once Google releases them, the language modelling and transfer learning approach can really take off.
What do you guys think?


Thanks for the info. It sounds really interesting, and it makes ULMFiT look like a paper from last year :wink:
I hope they release their pre-trained LM and its source code soon, so we can try it out and make comparisons.

Thank you for sharing the information.

According to Table 8 of the WideResNet paper, ResNet152 has 60.2M parameters.

What huge models Google AI trained…

BERT is heavily based on the Transformer architecture, and the paper recommends this link as a good place to get started:

I’m looking forward to the code release and any PyTorch ports, so keep me posted if any of y’all hear anything.


Look at how much compute power is required to train the BERT language models.

It was actually 256 TPU-days; he corrected it in a subsequent tweet.

He also said that using the model would be accessible to everyone; we just wouldn’t be able to retrain the whole thing.


Yes, I am aware of his subsequent tweet.

Correct. Transfer learning from the pre-trained model is more practical.

That helps, but most of the interesting translation problems are not English-to-French and the like.

I highly doubt they are training Bengali or Xhosa models, but would definitely use the pre-trained model if they did!


Here’s a PyTorch implementation if anyone’s interested.


This paper seems really cool, but I’m a bit worried about the time it will take to run inference with the model. I am working with a somewhat similar architecture (the GPT from OpenAI), and with a context size of 512 (the maximum length for this network) inference is extremely slow.

We can get around this problem by reducing the context size. I benchmarked performance with a context size of 80 on different GPUs and got the following results (in samples per second):

  • 173 with a GTX1070
  • 288 with a P100
  • 435 with a V100

I think that’s something you need to keep in mind before trying to apply this kind of network to real-world applications.

If anyone is interested, I used a PyTorch implementation of this paper (full disclosure: one I contributed to).
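For anyone who wants to reproduce this kind of measurement, here’s a minimal sketch of a throughput benchmark. The tiny `nn.TransformerEncoder` below is just a stand-in for GPT/BERT (the real models are far larger), and the batch size, model width, and context lengths are arbitrary assumptions for illustration:

```python
import time

import torch
import torch.nn as nn

# Hypothetical stand-in model: a small transformer encoder.
# Swap in the real GPT/BERT model you want to profile.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4), num_layers=2
)
model.eval()


def samples_per_second(model, context_len, batch_size=8, n_batches=5):
    """Rough inference throughput in samples per second."""
    # Random input of shape (seq_len, batch, d_model), the default
    # layout for nn.TransformerEncoder.
    x = torch.randn(context_len, batch_size, 64)
    with torch.no_grad():
        start = time.time()
        for _ in range(n_batches):
            model(x)
        elapsed = time.time() - start
    return (batch_size * n_batches) / elapsed


print(samples_per_second(model, context_len=80))
print(samples_per_second(model, context_len=512))
```

On any given machine, the context-512 figure should come out well below the context-80 one, which is the effect described above.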


In the repo you linked it says:

Finetuning the PyTorch model for 3 Epochs on ROCStories takes 10 minutes to run on a single NVidia K-80.

This doesn’t sound like a lot, but I haven’t read the paper yet, so I might be missing something.

Can you help me put the numbers in context? If I think about ULMFiT, we got around 1k sentences per second when training a language model, so we probably have around 2k for inference on a single 1080 Ti. But that was a language modelling task on Wikipedia, so I’m not sure how your numbers compare. What task have you run GPT against? How slow/fast are other models on the same task?

Regarding the context size of 512, do you have any numbers? How slow was it running?

Here’s a Keras implementation if anyone’s interested. (The data reader is framework-independent and can easily be used in PyTorch.)


Google has released the official TensorFlow code and pre-trained models for BERT.


They just released it. They are covering 102 languages:


Here is a PyTorch version with a conversion script for loading the Google checkpoints.


Trying to load the weights from that converted model seems to work a little differently though, right?
For example, I’m getting an error that the “0.encoder.weight” key doesn’t exist.

It has multiple weights like
Is anyone familiar enough with TF-to-PyTorch conversion to tell me how to use the converted model in these projects?
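One common cause of missing-key errors like this is a prefix mismatch between the converted checkpoint’s `state_dict` keys and the names the target model expects. A hedged sketch of stripping such a prefix; the toy models and the `"0."` prefix below are illustrative stand-ins, not the real BERT parameter names:

```python
import torch
import torch.nn as nn

# Toy stand-ins: wrapping a module in nn.Sequential prepends "0." to its
# state_dict keys ("0.weight"), while the bare module expects "weight" --
# the same kind of mismatch as "0.encoder.weight" vs "encoder.weight".
source = nn.Sequential(nn.Embedding(10, 4))  # keys like "0.weight"
target = nn.Embedding(10, 4)                 # keys like "weight"


def remap_keys(state_dict, prefix="0."):
    """Strip a leading prefix so converted keys match the target model."""
    return {
        (k[len(prefix):] if k.startswith(prefix) else k): v
        for k, v in state_dict.items()
    }


target.load_state_dict(remap_keys(source.state_dict()))
```

If the key names differ by more than a prefix, printing both models’ `state_dict().keys()` side by side usually makes the required mapping obvious.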

Rani Horev’s first attempt at summarizing this paper:

PyTorch implementation: github_pytorch
Keras implementation: github_keras


I wonder why BERT, being a SOTA architecture, performs so badly (compared to ULMFiT) on the IMDB movie review dataset.

At least their Colab shows just 86% accuracy (granted, they used only 5,000 reviews out of the 25,000 to train and test, but I got 88% accuracy when I used the whole dataset).

ULMFiT in Part 1 of this course showed >95% accuracy (also here).

Any ideas why there’s such a big difference? Could it be because in the example above BERT used only the first 128 tokens of each review? Does that mean ULMFiT is generally better suited to longer texts?
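To illustrate the truncation point: with a fixed maximum sequence length, everything past the cutoff is simply discarded before the model sees it. A toy sketch (the 600-token review length is an assumption for illustration; real IMDB reviews vary widely):

```python
def truncate(tokens, max_len=128):
    """BERT-style fixed-length input: tokens past max_len are discarded."""
    return tokens[:max_len]


# A hypothetical long review of 600 tokens.
review = ["tok"] * 600
kept = truncate(review)
print(f"kept {len(kept)} of {len(review)} tokens "
      f"({100 * len(kept) / len(review):.0f}%)")
# → kept 128 of 600 tokens (21%)
```

If the sentiment-bearing part of a review falls in the discarded tail, the classifier never gets to see it, which could plausibly explain part of the gap.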


I’ve run some tests with BERT on a CPU machine. My baseline was a simple feed-forward neural network with a single hidden layer (64 neurons), some dropout, and TF-IDF vectorization. That simple model gives me 80-90% accuracy, while BERT gives only 56-65%.

The documents in my datasets are several pages long. In fact, it seems BERT may perform poorly with longer texts.
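For reference, a baseline along those lines can be sketched in a few lines of scikit-learn. The corpus below is a toy stand-in for the real documents, and the hyperparameters (64 hidden units, default TF-IDF settings, no dropout, which plain `MLPClassifier` doesn’t support) only roughly mirror the setup described above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus; in practice, fit on the real documents.
texts = ["great movie", "terrible film", "loved it", "awful plot"] * 10
labels = [1, 0, 1, 0] * 10

# TF-IDF features feeding a single-hidden-layer feed-forward network.
baseline = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0),
)
baseline.fit(texts, labels)
print(baseline.score(texts, labels))
```

A baseline this cheap is a useful sanity check: if a large pre-trained model can’t beat it on your data, something about the setup (e.g. truncation of long documents) is probably at fault.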
