How to use a text classifier in production

Hi everyone,
I’m using fastai v1 to perform text classification. At the end of my training, when I’m satisfied with my model, I want to use it for prediction. After reading this thread: [ SOLVED ] Get prediction from the ImageNet model without creating a databunch, I figured that I needed to export my learner.
So, let’s pretend learn is the name of my fully trained Learner.
I did learn.export('clas.pkl') after my training.
Then, to do some text classification, I would do:
learn = load_learner('clas.pkl')
learn.predict(text)
where text contains the text I want to predict on. The weird thing is that I get predictions whether text is a string, a sequence of strings, or a sequence of integers. At the moment, I’m using a dummy (untrained) classifier for testing, so I can’t really interpret the results to tell which input format is the right one.
So my question is: should I feed learn the raw text, the tokenized text, or the numericalized text?
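
For context, the full round trip I have in mind looks roughly like this (just a sketch; path and file names are placeholders, and details may vary with the fastai version):

```python
from fastai.text import *   # fastai v1

# after training, export the Learner (saved under learn.path)
learn.export('clas.pkl')

# later, for inference: reload it without the original DataBunch
learn = load_learner(path, 'clas.pkl')   # path = folder containing clas.pkl
pred_class, pred_idx, probs = learn.predict(text)
```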

It is supposed to take the raw text and will tokenize then numericalize it.

When creating my learner, I used databunch.from_tokens, which means my databunch (and my learner) never saw my tokenization process. That means there is no way the learner I load with load_learner uses the same tokenization as the one I want to use.
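
Concretely, my setup looks roughly like this (a sketch; the variable names are illustrative):

```python
from fastai.text import *

# trn_tok / val_tok: lists of token lists produced by my own tokenizer
# trn_lbls / val_lbls: the matching labels
data = TextClasDataBunch.from_tokens(path,
                                     trn_tok=trn_tok, trn_lbls=trn_lbls,
                                     val_tok=val_tok, val_lbls=val_lbls)
learn = text_classifier_learner(data, AWD_LSTM)
```

So the exported learner carries the vocab built from those tokens, but not my tokenizer itself.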

During my experiments, passing the tokenized version of the text to the learner yielded a prediction. Is that prediction made by numericalizing the tokens I gave the learner (in which case this solves my issue and lets me use my homemade tokenization), or is it produced in some other way (in which case, how should I use this learner to predict)?

You should pass it input in exactly the same form as your dataset. So if you created it from tokens, you should pass tokens.

This morning, I tried to predict the next words of a sentence using the language model I had trained. Following your last comment, I tried to pass it tokens since I created the databunch from the tokens, but got the following error:

~/Documents/fastai/fastai/text/learner.py in predict(self, text, n_words, no_unk, temperature, min_p, sep, decoder)
    125         new_idx.append(idx)
    126         xb = xb.new_tensor([idx])[None]
--> 127     return text + sep + sep.join(decoder(self.data.vocab.textify(new_idx, sep=None)))
    128
    129 def beam_search(self, text:str, n_words:int, no_unk:bool=True, top_k:int=10, beam_sz:int=1000, temperature:float=1.,

TypeError: can only concatenate list (not "str") to list

It works just fine when I give it raw text. However, since my tokenization is not the default one and was never passed to the databunch, the prediction can’t be using the right tokenization. Any thoughts on how to proceed?
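
For reference, the raw-text call that does work is simply:

```python
# goes through fastai's default tokenization, not mine
learn.predict("Some raw text to continue", n_words=20)
```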

You will just have to copy-paste fastai’s predict function and adapt it to your own process (here it can’t return text since it doesn’t even know your vocabulary). If you want to use all of fastai’s functionality, you have to use it end to end, so you need to pass your own tokenizer when you create your DataBunch.
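
For example, a custom tokenization function can be wrapped along these lines (a sketch; my_tokenize, the DataFrames and the column names are placeholders, not fastai names):

```python
from fastai.text import *

class MyTokenizer(BaseTokenizer):
    "Wrap my own tokenization so fastai can apply it end to end."
    def __init__(self, lang:str='en'): self.lang = lang
    def tokenizer(self, t:str): return my_tokenize(t)   # my_tokenize: str -> list of tokens
    def add_special_cases(self, toks): pass

tokenizer = Tokenizer(tok_func=MyTokenizer)
data = TextClasDataBunch.from_df(path, train_df, valid_df,
                                 tokenizer=tokenizer,
                                 text_cols='text', label_cols='label')
```

A learner exported from such a DataBunch should then carry the tokenizer with it, so predict can take raw text directly.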

The issue I have with using fastai end to end is that I have a very large number of categories, and as a result some have few entries. To get meaningful results, K-fold cross-validation works much better for me than a single train/validation split, but I don’t want to tokenize each fold multiple times (my dataset is very messy and I need to experiment with tokenization a lot). At the moment, I tokenize the whole set, do my K-fold split, and pass each split to databunch.from_tokens. Looking at the fastai source, I guess I would be better off numericalizing each split and passing them to databunch.from_ids, since I can pass a processor there. Maybe the best solution is to propose a PR so that we can pass a processor to databunch.from_tokens as well. What do you think?
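
In code, the current approach looks roughly like this (a sketch; the fold count and variable names are illustrative):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from fastai.text import *

# tokens: token lists for the whole dataset, produced once by my own tokenizer
# labels: the matching category labels
tokens = np.array(tokens, dtype=object)
labels = np.array(labels)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for trn_idx, val_idx in skf.split(tokens, labels):
    data = TextClasDataBunch.from_tokens(path,
                                         trn_tok=tokens[trn_idx], trn_lbls=labels[trn_idx],
                                         val_tok=tokens[val_idx], val_lbls=labels[val_idx])
    learn = text_classifier_learner(data, AWD_LSTM)
    learn.fit_one_cycle(1)
    # ... record this fold's metrics ...
```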