RWKV, the Generative LM that could help RNNs make a comeback!

sgaseretto · January 30, 2023, 10:23pm

Hello everyone!

I’m participating in a LAION project called Open-Assistant, led by Yannic Kilcher, to build an open version of ChatGPT.

The thing is that running these models is very resource-demanding; most of them can’t be run on a single GPU. But while the project was deciding which models to use for Open-Assistant, RWKV-LM was presented as a candidate.

It is an RNN that is trained like a Transformer, and what makes this model so interesting is that, in theory, it offers:

Direct training like a GPT (parallelizable)
Fast training and fast inference
VRAM savings
“Infinite” ctx_len
Free sentence embeddings
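
To make the "trained like a GPT, run like an RNN" idea concrete, here is a minimal toy sketch. It is not RWKV's actual formulation (the repo describes that); it only shows, assuming a simple exponentially decayed weighted average, that the same outputs can be computed either from the whole sequence at once or step by step with a small fixed-size state. The names `parallel_form`, `recurrent_form`, and the scalar `decay` are made up for illustration.

```python
import math
import random


def parallel_form(keys, values, decay):
    """Compute each output directly from the full sequence.

    out[t] = sum_{i<=t} a[t,i] * values[i] / sum_{i<=t} a[t,i]
    with a[t,i] = exp(keys[i] - decay * (t - i)).
    No position depends on another position's output, so all of them
    could be computed in parallel during training.
    """
    outputs = []
    for t in range(len(keys)):
        weights = [math.exp(keys[i] - decay * (t - i)) for i in range(t + 1)]
        num = sum(w * v for w, v in zip(weights, values))
        outputs.append(num / sum(weights))
    return outputs


def recurrent_form(keys, values, decay):
    """Compute the same outputs one token at a time.

    Only two running sums (num, den) are carried forward, so memory
    stays constant no matter how long the sequence gets.
    """
    num, den = 0.0, 0.0
    shrink = math.exp(-decay)  # how much the past fades at each step
    outputs = []
    for k, v in zip(keys, values):
        num = shrink * num + math.exp(k) * v
        den = shrink * den + math.exp(k)
        outputs.append(num / den)
    return outputs


random.seed(0)
keys = [random.gauss(0, 1) for _ in range(16)]
values = [random.gauss(0, 1) for _ in range(16)]

a = parallel_form(keys, values, decay=0.5)
b = recurrent_form(keys, values, decay=0.5)
max_diff = max(abs(x - y) for x, y in zip(a, b))
print(f"largest difference between the two forms: {max_diff:.2e}")
```

In this toy, the recurrent form is what would make inference cheap and the context length "infinite" in principle, since the state never grows; the parallel form is what would let training use the whole sequence at once.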

I’m still learning about RNNs and LSTMs so that I can contribute and experiment with it. But the results presented in the repo are truly impressive, and this forum has lots of folks with a lot of experience working with RNNs, so I wanted to share it with this community so that it doesn’t go unnoticed. We may be looking at the model that could power an LLM revolution, like what happened with Stable Diffusion, which “commoditized” image generation: because that model was accessible for everyone to run on their own devices (or in a Colab environment), it enabled the development of amazing new works and improvements.

It could probably even be used in a new course, haha, as an evolution of ULMFiT with generative and embedding capabilities.

@jeremy @muellerzr @sgugger, sorry for mentioning all of you directly, but you were the first people I could think of who might see the potential (or the flaws and limitations) of this project.

jeremy · February 4, 2023, 8:48pm

Thanks @sgaseretto. Is there a paper or post or anything that describes the algorithm in more detail?

ilovescience · February 4, 2023, 10:59pm

sgaseretto · February 22, 2023, 10:00am

Sorry for my very late response! RWKV is inspired by Apple’s AFT (An Attention Free Transformer, arXiv:2105.14103), but they are adding lots of tricks on top of it. They don’t have a published paper as far as I know, but they are detailing everything in their repo:
GitHub - BlinkDL/RWKV-LM: RWKV is a RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding.

Right now, they are training on the Pile, and after that, they are planning to train it with RLHF using the dataset that is currently being collected by Yannic Kilcher and the Open-Assistant team.

If the performance of this model is on par with something like Flan-T5 or similar models, then we are talking about the “Stable Diffusion” of text generation: something that can be run on consumer GPUs with acceptable speed and good quality.