Google BERT Language Models


There is a new architecture for transfer learning in NLP released by Google AI called BERT:
I think it is worth checking out, as it boasts SOTA results on a range of tasks with minimal architecture changes. On the other hand, the paper says the models are huge: 110M and 340M parameters for the base and large models respectively. Still, once Google releases them, the language modelling and transfer learning approach can really take off.
What do you guys think?


Thanks for the info. It sounds really interesting, and it makes ULMFiT look like a paper from last year :wink:
I hope they release their pre-trained LM and its source code soon, so we can try it out and make comparisons.

Thank you for sharing the information.

According to Table 8 of the WideResNet paper, ResNet152 has 60.2M parameters.

What huge models Google AI trained…

BERT is heavily based on the Transformer architecture, and the paper recommends this link as a good place to get started:

I’m looking forward to the code release and any PyTorch ports, so keep me posted if any of y’all hear anything.


Look at how much compute power is required to train the BERT language models.

It was actually 256 TPU-days; he corrected it in a subsequent tweet.

He also said that using the model would be accessible to everyone; we just wouldn’t be able to retrain the whole thing.


Yes, I am aware of his subsequent tweet.

Correct. Transfer learning from the pre-trained model is more practical.

That helps, but most of the interesting translation problems are not English-to-French and the like.

I highly doubt they are training Bengali or Xhosa models, but would definitely use the pre-trained model if they did!


Here’s a PyTorch implementation if anyone’s interested.


This paper seems really cool, but I’m a bit worried about the time it will take to run inference with the model. I am working with a somewhat similar architecture (the GPT from OpenAI), and with a context size of 512 (the maximum length for this network) inference is extremely slow.

We can get around this problem by reducing the context size. I benchmarked performance with a context size of 80 on different GPUs and got the following results (in samples per second):

  • 173 with a GTX1070
  • 288 with a P100
  • 435 with a V100

I think that’s something you need to keep in mind before trying to apply this kind of network to real-world applications.

If anyone is interested, I used a PyTorch implementation of this paper (full disclosure: one I contributed to).
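For anyone who wants to reproduce this kind of measurement, here’s a minimal sketch of a throughput benchmark. The tiny `nn.TransformerEncoder` below is just a stand-in for GPT/BERT (the real models are far larger), and the batch size, model width, and context lengths are arbitrary assumptions for illustration:

```python
import time

import torch
import torch.nn as nn

# Hypothetical stand-in model: a small transformer encoder.
# Swap in the real GPT/BERT model you want to profile.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4), num_layers=2
)
model.eval()


def samples_per_second(model, context_len, batch_size=8, n_batches=5):
    """Rough inference throughput in samples per second."""
    # Random input of shape (seq_len, batch, d_model), the default
    # layout for nn.TransformerEncoder.
    x = torch.randn(context_len, batch_size, 64)
    with torch.no_grad():
        start = time.time()
        for _ in range(n_batches):
            model(x)
        elapsed = time.time() - start
    return (batch_size * n_batches) / elapsed


print(samples_per_second(model, context_len=80))
print(samples_per_second(model, context_len=512))
```

On any given machine, the context-512 figure should come out well below the context-80 one, which is the effect described above.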


In the repo you linked it says:

Finetuning the PyTorch model for 3 Epochs on ROCStories takes 10 minutes to run on a single NVidia K-80.

This doesn’t sound like a lot, but I haven’t read the paper yet, so I might be missing something.

Can you help me put the numbers in context? If I think about ULMFiT, we got around 1k sentences per second when training a language model, so we probably have around 2k for inference on a single 1080 Ti. But that was a language modelling task on Wikipedia, so I’m not sure how your numbers compare. What task have you run GPT against? How slow/fast are other models on the same task?

Regarding the context size of 512, do you have any numbers? How slow was it running?

Here’s a Keras implementation if anyone’s interested. (The data reader is framework-independent and can easily be used in PyTorch.)


Google has released the official TensorFlow code and pre-trained models for BERT.


They just released it. They are covering 102 languages:


Here is a PyTorch version with a conversion script for loading the Google checkpoints.


Trying to load the weights from that converted model seems to work a little differently though, right?
For example, I’m getting an error that the “0.encoder.weight” key doesn’t exist.

It has multiple weights like
Is anyone familiar enough with TF-to-PyTorch conversion to tell me how to use the converted model in these projects?
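One common cause of missing-key errors like this is a prefix mismatch between the converted checkpoint’s `state_dict` keys and the names the target model expects. A hedged sketch of stripping such a prefix; the toy models and the `"0."` prefix below are illustrative stand-ins, not the real BERT parameter names:

```python
import torch
import torch.nn as nn

# Toy stand-ins: wrapping a module in nn.Sequential prepends "0." to its
# state_dict keys ("0.weight"), while the bare module expects "weight" --
# the same kind of mismatch as "0.encoder.weight" vs "encoder.weight".
source = nn.Sequential(nn.Embedding(10, 4))  # keys like "0.weight"
target = nn.Embedding(10, 4)                 # keys like "weight"


def remap_keys(state_dict, prefix="0."):
    """Strip a leading prefix so converted keys match the target model."""
    return {
        (k[len(prefix):] if k.startswith(prefix) else k): v
        for k, v in state_dict.items()
    }


target.load_state_dict(remap_keys(source.state_dict()))
```

If the key names differ by more than a prefix, printing both models’ `state_dict().keys()` side by side usually makes the required mapping obvious.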

Rani Horev’s first attempt at summarizing this paper:

PyTorch implementation: github_pytorch
Keras implementation: github_keras


I wonder why BERT, being a SOTA architecture, performs so badly (compared to ULMFiT) on the IMDB movie review dataset.

At least their Colab shows just 86% accuracy (granted, they used only 5,000 reviews out of the 25,000 to train and test, but I got 88% accuracy when I used the whole dataset).

ULMFiT in Part 1 of this course showed >95% accuracy (also here).

Any ideas why there’s such a big difference? Could it be because in the example above BERT used only the first 128 tokens of each review? Does that mean ULMFiT is generally better suited to longer texts?
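To illustrate the truncation point: with a fixed maximum sequence length, everything past the cutoff is simply discarded before the model sees it. A toy sketch (the 600-token review length is an assumption for illustration; real IMDB reviews vary widely):

```python
def truncate(tokens, max_len=128):
    """BERT-style fixed-length input: tokens past max_len are discarded."""
    return tokens[:max_len]


# A hypothetical long review of 600 tokens.
review = ["tok"] * 600
kept = truncate(review)
print(f"kept {len(kept)} of {len(review)} tokens "
      f"({100 * len(kept) / len(review):.0f}%)")
# → kept 128 of 600 tokens (21%)
```

If the sentiment-bearing part of a review falls in the discarded tail, the classifier never gets to see it, which could plausibly explain part of the gap.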


I’ve run some tests with BERT on a CPU machine. My baseline was a simple feed-forward neural network with a single hidden layer (64 neurons), some dropout, and TF-IDF vectorization. That simple model gives me 80-90% accuracy, while BERT gives only 56-65%.

The documents in my datasets are several pages long. In fact, it seems BERT may perform poorly with longer texts.
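For reference, a baseline along those lines can be sketched in a few lines of scikit-learn. The corpus below is a toy stand-in for the real documents, and the hyperparameters (64 hidden units, default TF-IDF settings, no dropout, which plain `MLPClassifier` doesn’t support) only roughly mirror the setup described above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus; in practice, fit on the real documents.
texts = ["great movie", "terrible film", "loved it", "awful plot"] * 10
labels = [1, 0, 1, 0] * 10

# TF-IDF features feeding a single-hidden-layer feed-forward network.
baseline = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0),
)
baseline.fit(texts, labels)
print(baseline.score(texts, labels))
```

A baseline this cheap is a useful sanity check: if a large pre-trained model can’t beat it on your data, something about the setup (e.g. truncation of long documents) is probably at fault.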
