How do you train a music generator?

mooktakim · October 27, 2020, 4:21pm

In last year's course (2019), Jeremy talked about an alum who built a neural network music generator.
I’ve been thinking about this and I can’t figure out how this was done.

As I understand it, normally you input data into a model, you get an output. The model is trained with data. When you have a certain kind of input, you get a certain kind of output (and hopefully what you expect).
But how do you train a model like this music generator? There is no “right” answer to the data. You generate sound, how is that sound correct? How would you even validate that it is working correctly? How does a model “learn” in this case?

Any explanation is appreciated. It's something I've just been thinking about, and I can't figure out the answer.

GoofyMango · October 27, 2020, 7:13pm

I’m not really sure how they did it, but one way I can think of is training it the way you’d train a language model. Just as a language model predicts the next word in a sentence, you could train a music model to predict the next note in a song. Then, to generate music, you could have the model predict the next note and pass that prediction back in as the next input. Repeat that process and the model effectively generates new music, just as you could use a language model to generate a new sentence by repeatedly feeding its predictions back in as the next input.

Edit: I read the linked post and it looks like this is what she did. As she explains under the “Musical Generator” section:

I train the model by asking it to predict the next note or chord, given an input sequence …
Once the model is trained, I create generations by sampling the output prediction, and then feeding that back into the model, and asking it to predict the next step, and so forth.
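
To make the predict-then-feed-back loop concrete, here is a minimal sketch. It is not the code from the linked post: it swaps the neural network for a simple table of next-note counts, and the toy songs, note names, and function names are all made up for illustration. The loop itself (predict a next note, sample it, append it, use it as the new input) is the same idea.

```python
import random
from collections import Counter, defaultdict


def train_next_note_model(songs):
    """Count how often each note follows each other note in the training songs.

    This count table is a toy stand-in for a trained neural network.
    """
    counts = defaultdict(Counter)
    for song in songs:
        for current, following in zip(song, song[1:]):
            counts[current][following] += 1
    return counts


def generate(counts, start, length, rng):
    """Sample a next note, append it, and feed it back in as the new input."""
    notes = [start]
    while len(notes) < length:
        followers = counts.get(notes[-1])
        if not followers:  # no training example continues from this note
            break
        choices, weights = zip(*followers.items())
        notes.append(rng.choices(choices, weights=weights, k=1)[0])
    return notes


# Hypothetical toy data: each song is a list of note names used as tokens.
songs = [
    ["C", "E", "G", "E", "C"],
    ["C", "D", "E", "D", "C"],
    ["G", "E", "C", "E", "G"],
]

model = train_next_note_model(songs)
melody = generate(model, start="C", length=8, rng=random.Random(0))
print(melody)
```

Because the next note is sampled rather than always taking the most likely one, different seeds give different melodies, but every step is a transition that appeared somewhere in the training songs.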

stefan-ai · October 27, 2020, 7:30pm

@GoofyMango already explained very well how they trained the music generator model analogous to a language model for text data. I don’t have much to add to that.

The same is true for text data. When you feed a stream of text to a language model, in most cases there is not exactly one next word that is correct while all others are wrong. But what you do know is that the next word (or note) in your training data is at least not entirely wrong. The untrained model starts out by making random, wrong predictions, gets a signal from the loss, and updates its parameters. And while there is never an objectively correct next note for a single piece of music, by processing lots and lots of data, the model learns about the underlying structure of the data and builds a probabilistic representation of music. This trained model can then be used to generate new pieces of artificial music from the learned representation and a given starting sequence.
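
As a rough illustration of "gets a signal from the loss", here is a small sketch in plain Python. Everything in it is assumed for the example: a three-note vocabulary, one fixed context, four observed next notes, and a "model" that is just three logits trained with gradient descent on cross-entropy. No single next note is "the" right answer, yet the loss still pushes the predicted probabilities toward how often each note actually followed in the data.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    biggest = max(logits)
    exps = [math.exp(x - biggest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


notes = ["C", "E", "G"]

# Hypothetical training data: the notes seen after one and the same context.
# "E" followed three times, "G" once, "C" never.
observed_next = ["E", "E", "G", "E"]
targets = [notes.index(n) for n in observed_next]


def average_loss(logits):
    """Mean cross-entropy of the predictions against the observed next notes."""
    probs = softmax(logits)
    return -sum(math.log(probs[t]) for t in targets) / len(targets)


logits = [0.0, 0.0, 0.0]  # untrained: every note is equally likely
learning_rate = 0.5
initial_loss = average_loss(logits)

for _ in range(500):
    probs = softmax(logits)
    # Gradient of the mean cross-entropy with respect to the logits:
    # predicted probability minus observed frequency of each note.
    grad = list(probs)
    for t in targets:
        grad[t] -= 1.0 / len(targets)
    logits = [z - learning_rate * g for z, g in zip(logits, grad)]

final_loss = average_loss(logits)
learned = dict(zip(notes, softmax(logits)))
print(initial_loss, final_loss, learned)
```

After training, the probabilities sit close to the observed frequencies (about three quarters for "E", one quarter for "G"), which is the "probabilistic representation" in miniature. A real model does the same thing across many contexts at once.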

Btw, the project you posted eventually turned into OpenAI’s MuseNet.

mooktakim · October 27, 2020, 11:18pm

The music generator is trying to predict the next note. It's been trained on existing music, and can therefore recreate the style of that music. Very clever.

Thank you @GoofyMango and @stefan-ai