Text Generation and Beam search

Are there examples (sample code) of text generation implemented in fastai based on ULMFiT (other than simply predicting the next tokens from a language model learner)? I found a few references to beam search:
@bfarzin's post (Developer chat), Improving Text Generation, and Share your work here ✅, but could not find code examples. Any pointers?

beam_search will be in the next release of fastai (and is already present in master).
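For readers wondering what beam_search does under the hood, here is a minimal pure-Python sketch of the general algorithm. The toy model below is made up for illustration; fastai's actual implementation works on batched tensors from the language model:

```python
import math

def beam_search(next_probs, start, n_steps, beam_sz=3):
    """Minimal beam search over a toy next-token model.

    next_probs(seq) returns a dict mapping token -> probability.
    At each step every beam is extended by every candidate token,
    and only the beam_sz highest-scoring sequences are kept,
    scored by summed log-probability.
    """
    beams = [(0.0, [start])]  # (log-probability score, token sequence)
    for _ in range(n_steps):
        candidates = []
        for score, seq in beams:
            for tok, p in next_probs(seq).items():
                candidates.append((score + math.log(p), seq + [tok]))
        # keep only the beam_sz best extensions
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_sz]
    return beams

# toy "language model": "a" strongly prefers to repeat itself
def toy_next_probs(seq):
    if seq[-1] == "a":
        return {"a": 0.7, "b": 0.3}
    return {"a": 0.5, "b": 0.5}

best_score, best_seq = beam_search(toy_next_probs, "a", n_steps=2)[0]
# best_seq == ["a", "a", "a"]
```

Note how greedily repeating the most likely token wins here, which hints at why beam search output can be so repetitive.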

3 Likes

This is great news. How do I pass non-English pretrained files (itos.pkl and lm.pth)? pretrained_fnames is no longer accepted. If AWD_LSTM is specified as the arch with pretrained=True, URLs.WT103_1 gets downloaded. I tried copying my files to data/models/, but that doesn't seem to work.

You’re right, that’s missing now, I forgot to put them back there. Will add that tonight or tomorrow.

1 Like

Is it also feasible to filter out unknown tokens in beam search (similar to no_unk=True in learner.predict)?

It’s not implemented yet, but a PR could add that.

Unfortunately, I am still on the code-consumer side. I hope to cross the border someday. Right now, my contribution is to appreciate the hard work of everyone at https://github.com/fastai/fastai/graphs/contributors, especially the top ones :slight_smile:

This is very cool stuff. I tried to implement beam search by loading a language_model_learner, training it, then running these lines:

data_lm = TextLMDataBunch.load(path, 'data_lm', bs=32)
learn = language_model_learner(data_lm, AWD_LSTM, pretrained=True, drop_mult=0.3)
learn.load('final_fit_render')
learn.beam_search('hello', 1)

But alas, I am getting this error:
RuntimeError: CUDA error: device-side assert triggered

learn.predict('hello', 1) still works, though.

Is there something different I should be doing?

1 Like

Try changing the beam size (1000 by default) and starting with a few words instead of one. I tried generating movie reviews (kernel: https://www.kaggle.com/abedkhooli/textgen-fastai143/) based on the IMDB dataset and got better results from learner prediction. Beam search may either require parameter tuning or have a cyclic issue in the implementation (I am not a moviegoer, but the results have so much repetition that it suggests an obvious problem).

It's a one-line change plus a parameter. I have a PR with this included:
if no_unk: out[:,self.data.vocab.stoi[UNK]] = -float('Inf')
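For illustration, here is what that masking line accomplishes, sketched with a toy vocab in plain Python (the stoi dict and scores below are made up; fastai operates on tensors of logits). Setting a token's score to negative infinity guarantees it can never win an argmax and gets exactly zero probability after softmax:

```python
import math

stoi = {"xxunk": 0, "the": 1, "movie": 2}   # toy vocab, analogous to vocab.stoi
scores = [5.0, 2.0, 1.0]                    # raw logits; xxunk would otherwise win

scores[stoi["xxunk"]] = -float("inf")       # mask the unknown token

# argmax now ignores the masked token
best = max(range(len(scores)), key=lambda i: scores[i])

# softmax assigns the masked token exactly zero probability
exps = [math.exp(s) for s in scores]        # math.exp(-inf) == 0.0
probs = [e / sum(exps) for e in exps]
```

The same trick works for banning any token, not just the unknown one.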

1 Like

That's great, @bfarzin. I guess it's not merged yet. Have you tested the quality of the text generated with beam_search? I am still getting poor results on the IMDB reviews (even compared to learner.predict).
Here are two examples (100 words based on the IMDB train and test sets, using wt103_1, starting with ['when', 'i really'], temperature=0.75, top_k=10, beam_sz=100). Trying to get rid of the repetition, but not there yet:

  1. "when xxbos when i went to see this movie , i thought it was going to be good . i was wrong . The acting was bad , the plot was bad , and the acting was terrible . The only thing that kept me watching was the fact that it was supposed to take place in outer space . In fact , i do n't know if it was supposed to take place in outer space . In fact , i do n't know if it was supposed to take place in outer space . In fact"
  2. "i really xxbos i really wanted to like this movie . However , i found it to be one of the worst movies i have ever seen in my entire life . " The English Patient " is one of the worst movies i have ever seen in my entire life . " The English Patient " is one of the worst movies i have ever seen . SPOILER It happens , you get your first impression when we learn how this film will not go over as long in our life as i can"

Is there any chance of getting impressive text-generation results with ULMFiT like OpenAI's recent ones (https://blog.openai.com/better-language-models/)? :slight_smile:

2 Likes

Is there any way to generate the top answers based on the text using beam search?

For example
learner.beam_search('Hello', top_probs=5)

Hello thing1 0.95
Hello thing2 0.90
Hello thing3 0.80
Hello thing4 0.75
Hello thing5 0.70

There is no flag for that yet, but you can just copy-paste the source code and return:

[(sep.join(decoder(self.data.vocab.textify([i.item() for i in node[1:] ], sep=None))), s) for node,s in zip(nodes,scores)]
1 Like

Looks like the return repeats each node top_k times.
I modified the code to take the 0th item and then step by top_k to the next, avoiding the duplicates.

[(sep.join(decoder(self.data.vocab.textify([i.item() for i in node[1:] ], sep=None))), s) for node,s in zip(nodes[0::top_k],scores[0::top_k])]
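A toy illustration of why that stride works, assuming, as described above, that each hypothesis appears top_k consecutive times in the returned nodes (the node names here are made up):

```python
top_k = 3
# hypothetical node list where each hypothesis repeats top_k times in a row
nodes = ["hyp_a", "hyp_a", "hyp_a", "hyp_b", "hyp_b", "hyp_b"]
unique = nodes[0::top_k]   # every top_k-th entry -> one per hypothesis
```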

1 Like

Does anyone have ideas for generating text for a certain class? In other words, a way to generate positive and negative reviews?

I tried, for example, generating a bunch of reviews with beam_search, but they all ended up negative for my starting sequence. I can't think of a better way to do this right now other than trial and error until a review is predicted positive.

1 Like

Looks like the code doesn't convert the tokens back to their original format.
https://docs.fast.ai/text.transform.html#Vocab.textify

Is there a method available to convert tokens back to the "original" text?

No, the spacy tokenizer isn't reversible.
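Since there is no exact inverse, a rough heuristic detokenizer can still produce readable text from spacy-style tokens. This is my own sketch, not a fastai or spacy API, and the contraction list is far from complete:

```python
import re

def detokenize(tokens):
    """Heuristically join spacy-style tokens back into readable text.

    Not a true inverse of the tokenizer: it just removes the space
    before punctuation and before split-off contractions like "n't".
    """
    text = " ".join(tokens)
    text = re.sub(r" (n't|'s|'re|'ve|'ll|'d|'m)", r"\1", text)  # contractions
    text = re.sub(r" ([.,!?;:])", r"\1", text)                  # punctuation
    return text

# e.g. detokenize(["i", "do", "n't", "know", "."]) -> "i don't know."
```

Information lost at tokenization time (original casing, exact whitespace) cannot be recovered this way, which is why there is no built-in method.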

1 Like

Like @AbuFadl, I'm also running into poor results with the beam_search method: they have a lot of repetition and are generally worse than with predict.
I've tried many different combinations of the parameters (beam_sz, top_k, temperature) with no luck.

Has anyone achieved good results?

1 Like

If you would like to fine-tune GPT-2 on particular text by retraining the 117M model, I made a Colab notebook: https://github.com/ak9250/gpt-2-colab

1 Like