Finetuning pretrained LLMs

So I have llama2.c that I'm using. I can train on TinyStories and match Karpathy's results. However, I find that the model has trouble with terms it has likely seldom or never seen, such as proper names or creatures that aren't in the TinyStories dataset.

My thought is to download a form of Wikipedia, either Simple English Wikipedia (250MB) or the full text (85GB uncompressed), pretrain on that first, and then fine-tune on TinyStories. I'm curious whether this will improve a model with the same parameter count.

When I do this, do I keep the same cosine learning rate schedule with warmup and just treat it like a normal training run, except that my weights start from a closer "known good" point rather than from random initialization?

Do I need to decrease my LR so as not to disturb the existing weights too much?
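
For reference, the schedule I mean is the usual nanoGPT-style linear warmup followed by cosine decay; the knob I'm unsure about for fine-tuning is the peak LR. The values below are placeholders for illustration, not anything taken from train.py:

```python
import math

def get_lr(it, warmup_iters=1000, decay_iters=100_000, max_lr=5e-4, min_lr=5e-5):
    """Linear warmup to max_lr, then cosine decay down to min_lr.
    For fine-tuning from a pretrained checkpoint, the common tweak is to
    lower max_lr (e.g. 5-10x) and shorten warmup, rather than change the
    shape of the schedule."""
    if it < warmup_iters:
        return max_lr * it / warmup_iters            # linear warmup
    if it > decay_iters:
        return min_lr                                 # floor after decay ends
    ratio = (it - warmup_iters) / (decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))   # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)
```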

For a classifier, you would freeze some layers, but that doesn’t seem right for this, does it?
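
(For contrast, what I mean by freezing in the classifier case is just turning off gradients on most of the network, something like this hypothetical PyTorch snippet; the "lm_head" name is made up for illustration, and for continued pretraining you'd normally leave everything trainable:)

```python
import torch.nn as nn

def freeze_all_but(model: nn.Module, trainable_names=("lm_head",)):
    """Freeze every parameter except those whose name contains one of
    trainable_names. Typical for fine-tuning a classifier head, not for
    continued pretraining of the whole model."""
    for name, param in model.named_parameters():
        param.requires_grad = any(t in name for t in trainable_names)
```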

I know of LoRA, but before getting into LoRAs I'm trying to follow, at least on a small scale, the pattern that the foundation models apply:

  • pretrain
  • finetune
  • RLHF/alignment

I can find this through trial and error, but given how long it takes to train even small models (a few hours), it's slow to dial in, so I'm hoping for some best-practice ideas.


In case anyone stumbles across this…

I tried to train a 247M-parameter llama2.c model on the tokenized Simple Wiki (~85MB) and it overfit badly. I could add dropout, but instead I'm going to tokenize the full English Wikipedia and train on that and see how it goes.
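
If anyone is curious, here's a rough sketch of what that pretokenization step can look like (file paths are placeholders, and it assumes the stock SentencePiece tokenizer.model that llama2c ships with):

```python
import numpy as np
import sentencepiece as spm

# Llama 2 SentencePiece tokenizer bundled with llama2.c
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

def pretokenize(text_path="enwiki.txt", out_path="enwiki.bin"):
    """Encode one document per line, prepend BOS, and dump uint16 token ids
    to a flat binary file the training loop can memmap."""
    all_ids = []
    with open(text_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            all_ids.append(sp.bos_id())
            all_ids.extend(sp.encode(line))
    np.array(all_ids, dtype=np.uint16).tofile(out_path)
```

For the full 85GB dump I'd shard the output into multiple .bin files rather than hold all the token ids in RAM, but the idea is the same.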