Text generation with FastAI LSTM issue

I have been trying to chase down a bug for some time now concerning the quality of the text generated by my LSTM models. Specifically, the quality of text sampled via learn.predict differs from that of a custom sampler function I wrote. For some models the issue does not appear, and the quality of the text obtained via the two methods is equivalent, as it should be. For other models there is a vast difference in quality, with the default fastai predict method generating text of the expected quality.

I am able to gauge the quality of the text through some domain-specific (cheminformatics) tools that tell whether a piece of text is “valid” or not. Below, I have tried to reduce the bug to as small an example as possible so that no chemistry background is needed.
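For context, the number_valid helper referenced in the debug code below is not shown in this post; a minimal sketch of what such a check can look like, using RDKit, is:

```python
# Minimal sketch of a SMILES validity check (an illustrative stand-in for the
# number_valid helper used below; the real helper may differ).
from rdkit import Chem

def number_valid(smiles):
    """Count how many generated strings parse as valid, non-empty SMILES."""
    return sum(1 for smi in smiles
               if smi and Chem.MolFromSmiles(smi) is not None)
```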

Can anyone think of a reason my sampling function would not agree with the fastai text learner's predict method? To me, it should be as simple as softmaxing the decoded output from the linear layer and sampling from a multinomial. Perhaps there is some callback I am not accounting for in my custom sampler function?
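One way I can think of to narrow this down is to compare the very first next-token distribution each path produces. This is only a rough sketch, and it assumes fastai v1, where (as far as I can tell) pred_batch applies the activation matching the cross-entropy loss and one_item runs the prompt through the library's own preprocessing:

```python
import torch
import torch.nn.functional as F

def compare_first_step(learner):
    """Compare the next-token distribution my sampler uses against the one
    predict works with, for a single starting token."""
    # path 1: my assumption, softmax the raw decoder output after feeding GO
    go_int = learner.data.train_ds.vocab.stoi['GO']
    learner.model.reset()
    learner.model.eval()
    with torch.no_grad():
        xb = torch.tensor([[go_int]], device='cuda')
        manual_probs = F.softmax(learner.model(xb)[0].squeeze(), dim=-1)

    # path 2: what predict sees, one_item runs the (empty) prompt through
    # fastai's preprocessing and pred_batch applies the loss activation
    learner.model.reset()
    xb, yb = learner.data.one_item('')
    predict_probs = learner.pred_batch(batch=(xb, yb))[0][-1]

    # if these disagree, the divergence is in preprocessing or activation,
    # not in the multinomial sampling itself
    print(torch.allclose(manual_probs.cpu(), predict_probs.cpu(), atol=1e-4))
```

And here is the reduced code showing the discrepancy: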

```python
import numpy as np
import torch
import torch.nn.functional as F
from fastai.text import *   # fastai v1

def fastai_get_smiles(learner):
    # sample a long token stream with fastai's built-in predict,
    # then split it into individual SMILES strings on the END marker
    tt = learner.predict('', n_words=50000)
    tt = tt.replace(' GO', '').split(' END')
    smiles = []
    for text in tt:
        smi = text.replace(' ', '')
        smiles.append(smi)
    return smiles

def custom_sampler(learner):
    learner.model.eval()
    batch_sample = 1000
    max_seq_length = 100
    # start every sequence in the batch from the GO token
    go_int = learner.data.train_ds.vocab.stoi['GO']
    xb = np.array([go_int] * batch_sample)
    xb = torch.from_numpy(xb).to(device='cuda').unsqueeze(1)
    actions = torch.zeros((batch_sample, max_seq_length), dtype=torch.long).to(device='cuda')
    learner.model.reset()
    with torch.no_grad():
        for i in range(max_seq_length):
            # softmax the decoded output and sample the next token for every sequence
            output = learner.model(xb)[0].squeeze()
            output_probs = F.softmax(output, dim=-1)
            action = torch.multinomial(output_probs, 1)
            xb = action
            actions[:, i] = action.squeeze()
    return actions

def debug_code(learner):
    smiles = fastai_get_smiles(learner)
    # domain-specific function that tells how many of the generated text items are valid
    valid1 = number_valid(smiles)
    actions = custom_sampler(learner)
    # maps actions back to text and checks whether the text is valid
    valid2 = get_valid(actions, learner.data.train_ds.x.vocab.itos)
    print('fraction of valid text generated via fastai predict:', valid1 / len(smiles))
    print('fraction of valid text from the custom sampler:', valid2 / actions.size(0))
```

```python
# both of the following learners were trained on the same dataset, but with different learning rates
learner = load_learner('./exports/', 'model1.pkl')
# this debug_code call prints 0.97, 0.97! Both sampling methods agree!
debug_code(learner)

learner = load_learner('./exports/', 'model2.pkl')
# this debug_code call prints 0.98, 0.15! The custom sampler is not behaving correctly.
debug_code(learner)
```

For anyone who comes across a similar error in the future: the problem was a bit bizarre. When I tokenized my text, I included a GO token at the beginning of each sequence and an END token at the end. Retraining the models without the END token made the custom batch sampler and the fastai predict method perform consistently.
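Concretely, the change amounts to dropping the END marker when each sequence is framed. A rough sketch (frame_sequence here is just an illustrative stand-in for the actual preprocessing, which is not shown in this post):

```python
# Illustrative sketch of the framing change that resolved the discrepancy.
def frame_sequence(tokens, include_end=False):
    """Wrap a tokenized SMILES string with a leading GO marker and,
    optionally, a trailing END marker."""
    seq = ['GO'] + tokens
    if include_end:
        seq.append('END')
    return seq

# original framing (GO ... END), where the two samplers disagreed
frame_sequence(['C', 'C', 'O'], include_end=True)   # ['GO', 'C', 'C', 'O', 'END']
# new framing (GO only), where both samplers agree
frame_sequence(['C', 'C', 'O'])                     # ['GO', 'C', 'C', 'O']
```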
