Regression using Fine-tuned Language Model

shaun1 · December 16, 2018, 5:47pm

After building the LM, now I’ve started working on the regression problem. Here is a sample of my training data:

train_df.head()
train_id	name	price	item_description
0	0	MLB Cincinnati Reds T Shirt Size XL	10.0	No description yet
1	1	Razer BlackWidow Chroma Keyboard	52.0	This keyboard is in great condition and works ...
2	2	AVA-VIV Blouse	10.0	Adorable top with a hint of lace and a key hol...
3	3	Leather Horse Statues	35.0	New with tags. Leather horses. Retail for [rm]...
4	4	24K GOLD plated rose	44.0	Complete with certificate of authenticity

Testing data has similar structure.

Using the data block API, I think I was able to create the databunch I want but I have a few questions about what I got and where to go from here. These are the things I did:

I initialized my custom tokenize and numericalize processors and loaded up my saved language model databunch:

tok_proc = TokenizeProcessor(mark_fields=True)
num_proc = NumericalizeProcessor(max_vocab=60_091, min_freq=2)
data_lm = TextLMDataBunch.load(path, 'lm-toknum', processor=[tok_proc, num_proc])

I called show_batch on this databunch and everything looked good.

Then, using the vocabulary of my LM databunch (data_lm.vocab), I was able to create the databunch for the regression task. I’m showing the individual steps here to make clear what’s going on. First I created a TextList:

d = TextList.from_df(train_df, path, cols=['name', 'item_description'], vocab=data_lm.vocab)

Question 1: Do I need to pass the custom tokenizer/processor that I used for the LM here? It works even without it, but I don’t see marked fields.

I split by index and perform the labeling. I went with the label_from_df from the tabular databunch creation:

d = d.split_by_idx(valid_idx)
d = d.label_from_df(cols=[dep_var], label_cls=FloatList, log=True)

Question 2: This takes some time, I think because the training and validation sets are being tokenized and numericalized at this point. Is that right?
Question 3: Does passing the dependent variable in the cols argument along with FloatList set this up for a regression problem as I think?
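
To make the effect of log=True concrete, here is a minimal sketch in plain Python. It assumes that log=True simply replaces each target with its natural log (which is how I understand FloatList to behave, though I have not confirmed it here), so model outputs need an exp to get back to prices. The helper name to_price is hypothetical.

```python
import math

# The five prices shown in train_df.head() above.
prices = [10.0, 52.0, 10.0, 35.0, 44.0]

# Assumption: log=True trains on log(price) rather than on price itself.
log_prices = [math.log(p) for p in prices]

def to_price(log_pred):
    """Hypothetical helper: undo the log transform on a model output."""
    return math.exp(log_pred)

# Round trip: exp(log(price)) recovers the original price.
recovered = [to_price(lp) for lp in log_prices]
```

Note that a price of 0 has no logarithm, so rows like that would need handling before this kind of transform.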

Next I add the test set:

d = d.add_test(TextList.from_df(test_df, path, cols=['name', 'item_description'], vocab=data_lm.vocab))

Question 4: Again, do I have to pass my custom tokenize/numericalize processors here?

Finally I create the databunch:

d = d.databunch()

When I call show_batch on this databunch, I see one column of text and another column of floats (i.e., log values of the price variable).

Question 5: There are two columns of text in the original data frame (name and item_description) representing two fields. Have these two been merged to get one full text field?

Question 6: The fields are not marked (i.e., I don’t see xfld 1 and xfld 2 as I do in the LM databunch). I’m guessing I need a custom tokenizer for that. Will that be the one I created for the LM databunch?

Thanks.

shaun1 · December 16, 2018, 8:32pm

So, I decided to go with my intuition and create a databunch for my regression problem. It got created without any errors, but I’m still not 100% sure whether what I have is correct and going to work. Here is the code (pretty much the same as in the previous post, simplified):

data_reg = (TextList.from_df(train_df, path, cols=['name', 'item_description'], vocab=data_lm.vocab, processor=[tok_proc, num_proc])
           .split_by_idx(get_rdm_idx(train_df))
           .label_from_df(cols=['price'], label_cls=FloatList, log=True)
           .add_test(TextList.from_df(test_df, path, cols=['name', 'item_description'], vocab=data_lm.vocab, processor=[tok_proc, num_proc]))
           .databunch())

There is one problem. Since data_reg is using the same vocab as data_lm, I would think that the vocabulary size would also be the same. But I get different lengths for the stoi's (though the same for the itos's):

len(data_lm.vocab.itos)
60093
len(data_lm.vocab.stoi)
60093

len(data_reg.vocab.itos)
60093
len(data_reg.vocab.stoi)
295127

I don’t know why data_reg.vocab.stoi is so much larger than data_reg.vocab.itos. Shouldn’t they actually be the same length, since stoi is created from itos?

sgugger · December 16, 2018, 11:07pm

1. mark_fields is set to False by default, so you should pass a processor that sets it to True. I think this is a bug since I believe we decided to default mark_fields to False when there is only one column and to True where there are several, let me check.
2. Yes the tokenization and numericalization happen at the end of the labelling.
3. Absolutely
4. Same answer as 2
5. Yes, columns are merged to make one big text, with field separators if mark_fields is True.
6. It should be the same for all your tasks, if you want those fields marked.
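
As a rough illustration of what merging the columns with field separators looks like, here is a sketch in plain Python. It is not the library's code: the function join_fields is hypothetical, and the marker tokens xxbos and xxfld are assumptions about what the tokenizer emits.

```python
# Assumed marker tokens: beginning-of-text and field separator.
BOS, FLD = "xxbos", "xxfld"

def join_fields(fields, mark_fields=True):
    """Hypothetical helper: merge several text columns into one string."""
    if not mark_fields:
        # No separators: the columns are simply concatenated.
        return f"{BOS} " + " ".join(fields)
    # With separators: each column is prefixed by the marker and its 1-based index.
    marked_parts = (f"{FLD} {i} {text}" for i, text in enumerate(fields, start=1))
    return f"{BOS} " + " ".join(marked_parts)

row = ["Razer BlackWidow Chroma Keyboard", "This keyboard is in great condition"]
marked = join_fields(row)
plain = join_fields(row, mark_fields=False)
```

With mark_fields=True the model can tell where the name ends and the description begins; without it, both columns run together as one text.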

shaun1 · December 17, 2018, 12:11am

Thank you for your replies. It helps me a lot in using the library to do what I want to do.

I’m still not exactly sure why stoi and itos lengths are different for the regression databunch vocab. My concern is that the LM vocab is not being utilized correctly for the regression task (even though I’m passing it in during creation).

Also, fastai.text has a language_model_learner and a text_classifier_learner. What would I need to do to get a custom learner for the regression problem now that my data is ready? Do I create a custom learner from the base class RNNLearner?

Thanks.

shaun1 · December 17, 2018, 8:49pm

I’m still trying to figure out why the length of the stoi and itos in the vocab are different. After creating my databunch, I check the length:

len(data_reg.vocab.itos)
60093
len(data_reg.vocab.stoi)
295127

But I see that the stoi is created from the itos in line 123 of the file text/data.py:

self.stoi = collections.defaultdict(int,{v:k for k,v in enumerate(self.itos)})

Sure enough, when I execute that command separately, I get the correct length:

len(collections.defaultdict(int,{v:k for k,v in enumerate(data_reg.vocab.itos)}))
60093
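
One possible explanation for the mismatch, offered as a guess rather than something confirmed in this thread: stoi is a defaultdict(int), and a defaultdict stores a new key every time a missing key is looked up with square brackets. If numericalization looks tokens up as stoi[token] (an assumption about the library internals), every out-of-vocabulary token in the training and test text would be added to stoi with index 0, while itos stays untouched. The tiny vocabulary below is made up for illustration.

```python
import collections

# Made-up four-token vocabulary; index 0 plays the role of the unknown token.
itos = ["xxunk", "xxpad", "the", "keyboard"]
stoi = collections.defaultdict(int, {tok: idx for idx, tok in enumerate(itos)})
size_before = len(stoi)

# Numericalize a text containing two tokens that are not in the vocabulary.
tokens = ["the", "razer", "keyboard", "blackwidow"]
ids = [stoi[tok] for tok in tokens]  # each missing key is inserted with value 0

size_after = len(stoi)
print(ids)                      # [2, 0, 3, 0]
print(size_before, size_after)  # 4 6
```

If this is what happens, the extra entries all map to index 0, so the LM vocabulary would still be applied correctly and the larger stoi would be harmless.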

shaun1 · December 17, 2018, 9:01pm

Were you able to get a regressor working?

How can the output size be 0? What would the classifier be outputting then? Also, where did you get the number 50?

Finally, other than the PoolingLinearClassifier, do we have to meddle with the other functions such as MultiBatchRNNCore and SequentialRNN?

Thanks.

shaun1 · December 19, 2018, 9:46pm

I used the code posted by @britton, unfortunately the execution gets stuck at 0% when I call learn.lr_find() . I basically used the PoolingLinearRegressor and created corresponding get_rnn_regressor and text_regressor_learner that calls the appropriate functions (without anything else different from the original source code).

And I still haven’t found why there is a disparity between the stoi and itos of my databunch vocabulary.

I’m trying to figure out the best way to proceed forward and would greatly appreciate some pointers.

Thanks.

britton · December 23, 2018, 4:10pm

Hey Shaun! Sorry I’m just replying– holiday study break

I was able to run regression, though I am not positive that it worked. The loss was decreasing, but my predictions never got very good. What I don’t know is whether that is due to insufficient data (I’m using Twitter text to predict likes, which is a very incomplete prediction) or if it’s due to not having my data, model and loss function set up correctly.

I am not sure how the output of a model can be size 0, that is a good question! Here’s how I got that number: after creating the text_classifier_learner with the factory method, I used learner.model to examine the layers, and the final layer was listed as in_features=50, out_features=0. You’re right– having an output of zero doesn’t seem possible, does it?

I originally edited the PoolingLinearClassifier's mod_layers to hard code the final layer, but I changed my approach to customize the text_classifier_learner, changing the line that reads this:

vocab_size, n_class = len(data.vocab.itos), data.c

to this:

vocab_size, n_class = len(data.vocab.itos), 1

From my digging it looked like establishing n_class here then filters down into text_rnn_classifier to hard code the final layer output size. I didn’t do anything to the PoolingLinearClassifier in this case. That said, I am not yet confident I’ve configured regression correctly! But it runs and my loss goes down, albeit just a bit.

shaun1 · December 23, 2018, 4:50pm

Could you share your complete code? Also, did you fine-tune your language model with your data initially?

britton · December 23, 2018, 5:01pm

I sure can! I’m sorry it’s quite messy– feels a bit like airing my dirty laundry– but this is my code. Some training cells are leftover from previous runs, but I ran one cycle of LM fine-tuning, and two cells of regression.

In old runs of this notebook, yes I did fully fine-tune the LM. It made regression better, but still not great.

The notebook is here: https://nbviewer.jupyter.org/github/buttchurch/like-predictor/blob/master/TwitterLMTrain%20.ipynb

BTW Shaun, thanks for starting and working on this thread! I’ve found it very helpful

shaun1 · December 23, 2018, 5:26pm

You’re welcome. I’m planning to spend a little more time on this, but if I don’t find myself in a good position, I’m planning on converting my regression problem into a classification problem and just using the API that’s already available.

benjaminvdb · March 5, 2019, 6:13pm

@shaun1: I know this is rather late, but perhaps the following will help you.

To try text regression I’ve taken a dataset very similar to the IMDB dataset, but I’ve replaced the labels by their actual rating between 1 and 5 (i.e. integers). I then stored these under 5 separate folders, one for each rating.

(Note: this is probably not the best idea to try regression on, since the labels are sort of rounded, but let’s just use it as an example.)

I then used this to cast the labels to floats:

data_regr = (TextList.from_folder(path, vocab=data_lm.vocab)
             .split_by_folder(valid='test')
             .label_from_folder(label_cls=FloatList)
             .databunch(bs=bs))

I then changed the loss function to MSELossFlat as suggested by Jeremy in class.

learn = text_classifier_learner(data_regr, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('fine_tuned_enc')
learn.loss_func = MSELossFlat()

Here, fine_tuned_enc is the saved encoder after fine-tuning the language model on the target dataset.

Running the learner as usual with fit_one_cycle now works as expected.
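
For anyone wondering what the loss swap changes, here is a plain-Python sketch of the idea as I understand it: the model emits one number per item with shape (batch, 1), the targets have shape (batch,), and the loss flattens the predictions before taking the mean squared error. The function mse_flat is a hypothetical stand-in, not the library's implementation.

```python
def mse_flat(preds, targets):
    """Hypothetical stand-in: flatten (batch, 1) predictions, then take the MSE."""
    flat_preds = [p for row in preds for p in row]  # (batch, 1) -> (batch,)
    if len(flat_preds) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(flat_preds, targets)]
    return sum(squared_errors) / len(targets)

# Three made-up ratings: one prediction per review, each wrapped in its own row.
preds = [[4.5], [2.0], [3.0]]
targets = [5.0, 2.0, 1.0]
loss = mse_flat(preds, targets)  # (0.25 + 0.0 + 4.0) / 3
```

Taking the square root of this value gives the RMSE, which is easier to read because it is in the same units as the ratings.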

shaun1 · March 5, 2019, 7:37pm

Thank you. The problem I had is to predict a price of an item which includes decimals and goes from $3-$2000. So this might not work. But as I mentioned earlier, I could convert this into a classification problem by binning the output prices into several classes, which is similar to what you have done.
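
The binning idea can be sketched like this in plain Python. The bin edges are made up for illustration (they are not tuned to the data), and price_to_class is a hypothetical helper; the resulting integer labels could then be used with the existing classification API.

```python
import bisect

# Hypothetical bin edges in dollars; 7 edges give 8 classes (0..7).
BIN_EDGES = [10, 25, 50, 100, 250, 500, 1000]

def price_to_class(price):
    """Hypothetical helper: map a price to the index of the bin it falls into."""
    # bisect_right counts how many edges are <= price, which is the bin index.
    return bisect.bisect_right(BIN_EDGES, price)

# The five prices from train_df.head().
labels = [price_to_class(p) for p in [10.0, 52.0, 10.0, 35.0, 44.0]]
print(labels)  # [1, 3, 1, 2, 2]
```

The trade-off is that the model can then only predict a bin, not an exact price, so the bin width limits how precise the predictions can be.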

danield · March 7, 2019, 10:54pm

Reaching out to see if anyone here has managed to come up with a successful approach to regression using the Text module. I’m looking to predict a score between 1 and 5 based on a review. I’ve been successful in binning the results and building a classification model (similar to @benjaminvdb’s approach) but would really appreciate it if anyone has working examples of a regression problem. Thanks in advance!

LIBER · March 15, 2019, 10:59am

It has been about 3 months since you posted this, and you still haven’t solved it? I’m trying to do something similar for a competition due in 2 weeks but this really seems like a time black hole and perhaps I should just quit.

shaun1 · March 15, 2019, 12:44pm

No I have not.

danield · March 18, 2019, 2:49pm

@LIBER @shaun1
I managed to figure out a solution to this.

Databunch object:

data_regr = (TextList.from_df(df=df, path=path, cols='text', vocab=data_lm.vocab)
             .split_by_rand_pct()
             .label_from_df('label', label_cls=FloatList)
             .add_test(TextList.from_df(df=test_df, path=path, cols='text'))
             .databunch())

And the Learner object:

learn = text_classifier_learner(data_regr, 
                                arch = AWD_LSTM, 
                                drop_mult=0.3,
                                metrics = rmse)
learn.loss_func=MSELossFlat()
learn.load_encoder('fine_tuned_enc')
learn.freeze()

Hope this helps.

quan.tran · April 8, 2019, 6:40pm

I think it’s a bit too late for your competition (I hope it’s not PetFinder…) but I wrote some code to create a new Databunch for tab + text and train a model with that and it seems to work.