Training large models (like GPT2-1.5B) with distributed training?

Considering:

  • the SOTA NLP results from training larger models (e.g. OpenAI’s GPT-2-1.5B model [1])

  • the current high cost of training larger models (e.g. for GPT-2-1.5B, Jeremy estimated $50k-$100k to train it ‘in a hurry’, i.e. in about a month, or around $20k to train it in 10 months [2])

  • the new research and applications that fast.ai students could engage in, using transfer learning with such larger models

  • Jeremy and Sylvain’s existing research in training smaller, GPT-2-like transformer models [3]

  • OpenAI’s GPT-2-1.5B dataset, which should be easy to replicate [4]

And also:

  • TensorFlow has an API to distribute training [5]

  • fast.ai’s newly announced adoption of Swift for TensorFlow [6], which could be combined with that API

  • SenseTime’s recently announced distributed ImageNet/AlexNet training in 1.5 minutes on a cluster of 512 Volta GPUs (arXiv paper dated 19 Feb 2019 [7])

I can’t help but wonder if it’s possible to train a large model, like GPT-2-1.5B, both quickly and cheaply, using distributed training that utilises the compute resources of fast.ai students and anyone else who wants to contribute GPU/TPU/IPU resources.
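For a sense of scale, the cost figures quoted in [2] can be reproduced with some simple arithmetic. The sketch below is only a back-of-envelope check: the prices are the ones quoted in [2], the 30-day month is an assumption, and the function names are made up for illustration.

```python
# Back-of-envelope check of the cost figures quoted in source [2].
# Prices are the ones quoted there; nothing here is measured.

HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month


def cloud_cost(servers, price_per_server_hour, months=1):
    """Rental cost for `servers` 8-GPU spot instances (hypothetical helper)."""
    return servers * price_per_server_hour * HOURS_PER_MONTH * months


def owned_cost(cards, price_per_card, box_cost):
    """Up-front cost of buying GPUs plus a host machine (hypothetical helper)."""
    return cards * price_per_card + box_cost


one_server_month = cloud_cost(1, 7.34)    # one 8-GPU server for a month
in_a_hurry = cloud_cost(10, 7.34)         # ten servers = 80 GPUs for a month
cheap_box = owned_cost(8, 500, 10_000)    # 8 x RTX 2070 plus a $10k box
pricey_box = owned_cost(8, 1300, 10_000)  # 8 x RTX 2080 Ti plus a $10k box

print(f"one server, one month: ${one_server_month:,.0f}")
print(f"80 GPUs, one month:    ${in_a_hurry:,.0f}")
print(f"own 8 GPUs:            ${cheap_box:,.0f} to ${pricey_box:,.0f}")
```

These land close to the rounded figures in [2] (about $5,300 per server-month, about $50k for the one-month run, and up to about $20k for the self-built box), before the extra trial-and-error cost that [2] mentions.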

I’m not sure how amenable a model like GPT-2 is to data parallelism and model parallelism. Even if it is, widely distributed asynchronous training over the internet may be unworkable (e.g. because of latency and bandwidth issues) compared to, say, using Mesh-TensorFlow [8] on a ‘local’ supercomputer.

Despite this, I thought it worth mentioning. I would love to know what’s stopping such distributed training, and whether those obstacles are insurmountable or not.
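One way to get a feel for the communication concern is to estimate how long it takes to ship a single full copy of the gradients over different links. This is a deliberately naive sketch: it assumes 1.5 billion parameters (from the model name), fp32 gradients, no compression, and it ignores latency and all-reduce topology entirely. The two link speeds are illustrative assumptions, not measurements.

```python
# Naive estimate: time to upload one full gradient copy per training step.

PARAMS = 1.5e9        # parameter count implied by the name "GPT-2-1.5B"
BYTES_PER_PARAM = 4   # assumption: uncompressed fp32 gradients


def sync_seconds(bandwidth_bits_per_s, params=PARAMS,
                 bytes_per_param=BYTES_PER_PARAM):
    """Seconds to send one gradient copy at the given link speed."""
    return params * bytes_per_param * 8 / bandwidth_bits_per_s


# Hypothetical link speeds, chosen only to show the order of magnitude.
links = {
    "home upload, 10 Mbit/s": 10e6,
    "datacentre link, 10 Gbit/s": 10e9,
}

for name, bits_per_s in links.items():
    print(f"{name}: {sync_seconds(bits_per_s):,.1f} s per gradient copy")
```

Under these assumptions the gradient copy is about 6 GB, so a volunteer on the slower link would spend over an hour uploading per step, versus a few seconds inside a cluster. Techniques such as gradient compression or less frequent synchronisation would change these numbers, which is part of what the question above is asking.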

In terms of the ethics of creating, distributing and using large models like GPT-2: my personal view is currently that others are going to replicate GPT-2-like results anyway. It’s just a matter of time, if it hasn’t already been done. I think the days of taking information at face value (including text, audio and video) are long over, and digitally signing information, as Jeremy points out [9], makes sense to me.

As fast.ai students, we could adhere to a Code of Conduct in how we use such models, maybe digitally signing any products derived from them, so that their origin is explicitly traceable and trust is placed in responsibly-managed certificate authorities, rather than in raw information.
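To make the signing idea a little more concrete, here is a minimal sketch of tagging a piece of generated text so that later tampering can be detected. Note the substitution: it uses HMAC-SHA256 from the Python standard library, which is a shared-secret scheme, as a simplified stand-in. A real deployment along the lines described above would use public-key signatures with certificates, which needs a third-party library. The key, the message and the function names are all hypothetical.

```python
import hashlib
import hmac


def sign(text: str, key: bytes) -> str:
    """Return a hex tag for `text`. HMAC stand-in, not a public-key signature."""
    return hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()


def verify(text: str, tag: str, key: bytes) -> bool:
    """Check that `tag` matches `text`, using a constant-time comparison."""
    return hmac.compare_digest(sign(text, key), tag)


key = b"hypothetical-secret-key"                  # placeholder, not a real key
message = "Text produced with a large language model."
tag = sign(message, key)

print(verify(message, tag, key))                  # unmodified text
print(verify(message + " (edited)", tag, key))    # modified text
```

The first check passes and the second fails, because any change to the text changes the tag. What this sketch cannot show is the certificate-authority part: with a shared secret, anyone who can verify can also forge.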

Sources

[1] Language Models are Unsupervised Multitask Learners - https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

[2] Some thoughts on zero-day threats in AI, and OpenAI’s GPT-2 - https://www.fast.ai/2019/02/15/openai-gp2/

If you’re in a hurry and you want to get this done in a month, then you’re going to need 80 GPUs. You can grab a server with 8 GPUs from the AWS spot market for $7.34/hour. That’s around $5300 for a month. You’ll need ten of these servers, so that’s around $50k to train the model in a month. OpenAI have made their code available, and described how to create the necessary dataset, but in practice there’s still going to be plenty of trial and error, so in practice it might cost twice as much.

If you’re in less of a hurry, you could just buy 8 GPUs. With some careful memory handling (e.g. using Gradient checkpointing) you might be able to get away with buying RTX 2070 cards at $500 each, otherwise you’ll be wanting the RTX 2080 ti at $1300 each. So for 8 cards, that’s somewhere between $4k and $10k for the GPUs, plus probably another $10k or so for a box to put them in (with CPUs, HDDs, etc). So that’s around $20k to train the model in 10 months (again, you’ll need some extra time and money for the data collection, and some trial and error).

[3] https://twitter.com/jeremyphoward/status/1100818170716160001

[4] https://github.com/eukaryote31/openwebtext

[5] https://www.tensorflow.org/alpha/guide/distribute_strategy

[6] Jeremy’s announcement - Swift for TensorFlow: The Next-Generation Machine Learning Framework (TF Dev Summit ’19) - https://youtu.be/s65BigoMV_I?t=1739

[7] Optimizing Network Performance for Distributed DNN Training on GPU Clusters - https://arxiv.org/pdf/1902.06855.pdf

[8] Mesh-TensorFlow: Model Parallelism for Supercomputers (TF Dev Summit ’19) - https://www.youtube.com/watch?v=HgGyWS40g-g

[9] https://twitter.com/jeremyphoward/status/1100828136789307394


This could be a business idea… a Kickstarter for model training. :smiley:

Someone puts up a model they want trained, and people volunteer to hook up their GPUs (or their cloud provider’s GPUs) to train it. Bonus points if this is dynamic, with people having the option to connect or disconnect their GPUs over the course of training. You could gamify it by showing the model’s training/validation loss on a web page, along with a real-time count of connected GPUs, so there is more buy-in as the model gets closer to being fully trained.

The model is saved at intermediate stages (e.g. every 10 epochs) and automatically distributed to everyone participating, or even open sourced, based on the terms set by the main admin.
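As a toy illustration of that idea, the sketch below simulates volunteers joining and leaving while a coordinator averages whatever gradients arrive and saves a checkpoint every few steps. It is pure Python with a one-weight model and no networking, so it only shows the control flow. The data, the connection schedule and all names are hypothetical.

```python
# Toy simulation of elastic volunteer training with periodic checkpoints.

def shard_gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def train(shards, schedule, lr=0.05, checkpoint_every=10):
    """schedule[k] lists which volunteers (shard indices) are connected."""
    w, checkpoints = 0.0, {}
    for step, connected in enumerate(schedule, start=1):
        if connected:  # if nobody is connected, the step makes no update
            grads = [shard_gradient(w, shards[i]) for i in connected]
            w -= lr * sum(grads) / len(grads)  # average the gradients
        if step % checkpoint_every == 0:
            checkpoints[step] = w  # stand-in for saving and sharing the model
    return w, checkpoints


# Hypothetical data: four volunteers, each holding points from y = 3x.
shards = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(0.5, 1.5), (1.5, 4.5)],
    [(1.0, 3.0), (3.0, 9.0)],
    [(2.0, 6.0), (2.5, 7.5)],
]

# Hypothetical schedule: a different subset of volunteers is online each step.
schedule = [[i for i in range(4) if (step + i) % 3 != 0] for step in range(40)]

w, checkpoints = train(shards, schedule)
print(f"final weight: {w:.4f}")
print("checkpoint steps:", sorted(checkpoints))
```

In this toy the weight still converges towards 3 even though the set of connected volunteers changes every step. A real system would also have to deal with stragglers, untrusted contributors and the bandwidth needed to move a large model’s gradients around.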


If you build it, they will come!
