Language Model with Rhyme

Hi! I like rap and I want to train a language model that learns how to generate good rap/trap lyrics (in Spanish). I tried an AWD-LSTM model in fastai and the results are encouraging, but the verses do not rhyme (they do in the fine-tuning dataset). I think this is an interesting problem. I found this, this, this and this paper online. However, the most useful link I found is a blogpost.

I am thinking of finding clusters of rhyming words in my dataset (probably manually, or maybe by running an unsupervised model for approximate clusters) and then enforcing a rhyming word at inference before the ‘\n’ token. This is challenging because the predict function does not know beforehand whether the next token will be ‘\n’; it only knows once the last word has been predicted. I am still thinking about the best approach for this, but I am interested in the community’s input!
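One way around the ‘\n’ problem, as a sketch: decide up front that the current word will end the line (for example by sampling the line length or the ‘\n’ position in advance), and then constrain that final prediction to the rhyme cluster by masking the model's scores. Everything below (`logits`, `vocab`, `rhyme_cluster`) is an illustrative stand-in, not fastai's actual predict API:

```python
import math

def pick_rhyming_word(logits, vocab, rhyme_cluster):
    """Mask out every candidate word that is not in the rhyme cluster,
    then take the most probable remaining one."""
    masked = [
        (score if word in rhyme_cluster else -math.inf)
        for score, word in zip(logits, vocab)
    ]
    best = max(range(len(vocab)), key=lambda i: masked[i])
    return vocab[best]

vocab = ["calor", "amor", "casa", "dolor"]
logits = [0.1, 0.5, 2.0, 0.4]          # the model prefers "casa", but it doesn't rhyme
cluster = {"calor", "amor", "dolor"}   # words ending in "-or"
print(pick_rhyming_word(logits, vocab, cluster))  # -> amor
```

You could also sample from the masked distribution instead of taking the argmax, to keep some variety in the endings.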


Would it make sense to run a backwards LSTM model that predicts in the reverse direction, from right to left? You would have to train the whole LM from scratch, I guess, but what wouldn’t one do for a good rhyme? :slight_smile:
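The data prep for such a right-to-left model could be as simple as reversing the token order of each line, so the model starts from the rhyme word and generates toward the beginning of the verse. (If I remember right, fastai's text data API also has a `backwards` option that does this for you, but the idea is just this:)

```python
def reverse_tokens(line):
    """Reverse the word order of a single line."""
    return " ".join(reversed(line.split()))

def reverse_corpus(text):
    """Reverse every line of a lyrics corpus for backwards-LM training."""
    return "\n".join(reverse_tokens(line) for line in text.splitlines())

print(reverse_tokens("la noche me llama con su calor"))
# -> calor su con llama me noche la
```

At inference you would seed the model with the chosen rhyme word, generate leftward, and reverse the output again before showing it.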

Thank you Johannes! I think I will do that, together with enforcing a rhyme on the last word of every verse. Can you think of a way to avoid building a rhyming dictionary manually? I didn’t find any pronouncing dictionaries for Spanish online.
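One possible way around the missing pronouncing dictionary: Spanish spelling is close to phonetic, so an approximate rhyme key can be read straight off the orthography. This is a crude sketch that keys each word on its final vowel-to-end substring (roughly assonant rhyme, accents stripped); a proper version would locate the last *stressed* vowel instead:

```python
import unicodedata
from collections import defaultdict

VOWELS = set("aeiou")

def strip_accents(s):
    """Remove combining accent marks (e.g. 'ó' -> 'o')."""
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

def rhyme_key(word):
    """Approximate rhyme key: the word from its last vowel to the end."""
    w = strip_accents(word.lower())
    for i in range(len(w) - 1, -1, -1):
        if w[i] in VOWELS:
            return w[i:]
    return w

def cluster_by_rhyme(words):
    """Group words that share the same approximate rhyme key."""
    clusters = defaultdict(set)
    for w in set(words):
        clusters[rhyme_key(w)].add(w)
    return dict(clusters)

print(cluster_by_rhyme(["calor", "amor", "corazón", "son", "casa", "pasa"]))
```

This would bucket "calor"/"amor", "corazón"/"son", and "casa"/"pasa" together, which could seed your unsupervised clusters.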

Would it change anything if you put a tag like “xxrhymeword” in front of the last word of each line in your training rap dataset? It might (or might not) make it easier for the network to learn to notice every last word in a line as something that belongs to the same category - one that is a bit different from the other words preceding it.
I’m just suggesting this because I’ve been tinkering with screenplay datasets for a while, and I found that AWD-LSTM (and TransformerXL) learn and generate the formal structure of a screenplay scene really well if I provide the appropriate tags (“xxrole”, “xxdialog”, “xxaction”, “xxscenehead”), just concatenating them along with the running training dataset text.
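The preprocessing for the tagging idea is just a few lines; here is a sketch (the "xxrhymeword" name follows fastai's xx-prefix convention for special tokens):

```python
def tag_last_words(text, tag="xxrhymeword"):
    """Prepend a marker token to the last word of every line."""
    tagged_lines = []
    for line in text.splitlines():
        words = line.split()
        if words:
            words[-1] = f"{tag} {words[-1]}"
        tagged_lines.append(" ".join(words))
    return "\n".join(tagged_lines)

print(tag_last_words("siento el calor\nme llama el amor"))
# siento el xxrhymeword calor
# me llama el xxrhymeword amor
```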

Of course, your network would also have to learn which last words belong together (pairwise, every second line, etc., depending on the rhyme scheme). That’s more difficult, but you’re saying that you might run an unsupervised model for approximate rhyming clusters - then you could even mark every rhymeword with a more specific rhymetag (indicating which words should go together, again according to the rhyme scheme).
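Combining both suggestions might look like this sketch: tag each line-final word with a cluster-specific token (e.g. "xxrhyme_or"), so the network can learn which endings belong together. The `rhyme_key` here is a deliberately crude stand-in (the word from its last vowel to the end); plug in whatever clustering you end up with:

```python
VOWELS = set("aeiou")

def rhyme_key(word):
    """Crude rhyme key: the word from its last vowel onward."""
    w = word.lower()
    for i in range(len(w) - 1, -1, -1):
        if w[i] in VOWELS:
            return w[i:]
    return w

def tag_with_clusters(text):
    """Prepend a cluster-specific rhyme tag to each line's last word."""
    out = []
    for line in text.splitlines():
        words = line.split()
        if words:
            words[-1] = f"xxrhyme_{rhyme_key(words[-1])} {words[-1]}"
        out.append(" ".join(words))
    return "\n".join(out)

print(tag_with_clusters("siento el calor\nme llama el amor"))
# siento el xxrhyme_or calor
# me llama el xxrhyme_or amor
```

At inference, seeing the same tag twice in a row would hint to the model (or to a constrained decoder) that the two line endings should share a rhyme.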
