Regression using Fine-tuned Language Model

I believe the code needs to be updated to cast the PyTorch tensors to NumPy arrays on save (and then from NumPy back to PyTorch tensors on load).

Something like:

np.save(cache_path/'train_ids.npy', [i.numpy() for i in self.train_ds.x.items])
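
For completeness, the reverse direction on load would be something like the sketch below (cache_path and the items attribute follow the snippet above; allow_pickle is needed on newer NumPy versions because a list of variable-length arrays is saved as an object array):

ids = np.load(cache_path/'train_ids.npy', allow_pickle=True)  # object array of int arrays
self.train_ds.x.items = [torch.from_numpy(a) for a in ids]    # back to tensors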

Yes, I ended up doing that. I’m just wondering if that’s in fact what the devs intended, and also trying to bring it to their attention so they can fix it in the code.

FYI: Just pulled the latest from master and this looks to be fixed. Not sure when it will be rolled into one of the release branches though.


Yup :slight_smile: I did the same thing and it is fixed!

Yes load and save were broken, and I fixed them this afternoon.


I was able to successfully fine-tune an LM using the pre-trained model with the data block API on a custom dataset. I highlight the (small number of) steps here for documentation:

Assuming our data is in a pandas DataFrame with different fields that need to be added to the text:

# my dataset consists of name and item_description
data_lm = (TextList.from_df(texts, PATH, cols=['name', 'item_description']) 
          .random_split_by_pct(0.1)
          .label_for_lm() # this does the tokenization and numericalization
          .databunch())

data_lm.save('lm-tokens')
# load the data (can be used in the future as well to prevent reprocessing)
data_lm = TextLMDataBunch.load(PATH, 'lm-tokens')
data_lm.show_batch() # take a look at the batch fed into the GPU
learn = language_model_learner(data_lm, pretrained_model=URLs.WT103, drop_mult=0.5, callback_fns=ShowGraph)
learn.lr_find()
learn.recorder.plot()
learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))
learn.recorder.plot_losses()
learn.save('fit-head')
learn.load('fit-head')
learn.unfreeze()
learn.lr_find()
learn.recorder.plot()
learn.fit_one_cycle(11, 1e-3, moms=(0.8,0.7))
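
One extra step worth considering at the end (a sketch; save_encoder is the fastai v1 mechanism for reusing the fine-tuned RNN in a downstream classifier or regressor, and 'fine-tuned-enc' is just a placeholder name):

learn.save_encoder('fine-tuned-enc')  # save the encoder for downstream tasks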

With a dataset of 5,635,745 rows, it took me 21 hours, 22 minutes, and 6 seconds to run this on a V100, with a final training loss of 2.697805, validation loss of 2.571279, and accuracy of 0.524987.


Did you wind up getting regression to work with a pre-trained LM?

Not yet, although I haven’t worked on it for a while. The devs mentioned that the latest data block API makes working on a regression problem easier, but I haven’t yet looked at it. I will in the next couple of days.

Gotcha. I’m trying to figure out how to do regression on the indices of particular tokens, kind of like bounding boxes with category labels in a CNN, but I haven’t yet worked out what the input/head needs to look like for that to work.

I’m working on a similar task (trying to turn the text classifier into a regressor) and, thanks to the posts above, have gotten my databunch set up. I’ve changed the loss function like so:

def rmse(preds, targs):
    "Compute root mean squared error."
    # torch.sqrt (rather than np.sqrt) keeps this working on GPU tensors
    return torch.sqrt(torch.mean((targs - preds).pow(2)))

learn.loss_func = F.mse_loss
learn.metrics = [rmse]
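
As an aside, fastai v1 also ships MSELossFlat (in fastai.layers), which flattens the predictions before computing MSE; if the regression head outputs shape [bs, 1] while the targets are [bs], swapping it in may avoid shape-mismatch problems (a sketch, not something from the original post):

learn.loss_func = MSELossFlat()  # flattens preds before the MSE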

The final layer of the classifier is a Linear layer going from input size 50 to output size 0 (in my case), and I’ve changed that to output size 1 by altering the __init__ method of the PoolingLinearClassifier (the rest is left out for brevity):

class PoolingLinearRegressor(nn.Module):
    "Create a linear regressor with pooling."

    def __init__(self, layers:Collection[int], drops:Collection[float]):
        super().__init__()
        mod_layers = []
        activs = [nn.ReLU(inplace=True)] * (len(layers) - 2) + [None]
        for n_in,n_out,p,actn in zip(layers[:-1], layers[1:], drops, activs):
            mod_layers += bn_drop_lin(n_in, n_out, p=p, actn=actn)
        # swap the final classification layer for a single-output head
        mod_layers[-1] = nn.Linear(in_features=50, out_features=1, bias=True)
        self.layers = nn.Sequential(*mod_layers)

When I try training the learner, I hit an embedding index error:

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1410         # remove once script supports set_grad_enabled
   1411         torch.no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1412     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

RuntimeError: index out of range at /opt/conda/conda-bld/pytorch-nightly_1543482224190/work/aten/src/TH/generic/THTensorEvenMoreMath.cpp:191

I don’t understand yet how my model changes have affected the embeddings.
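
A quick diagnostic for this kind of error (a sketch assuming the standard fastai v1 text model layout, where the first module’s encoder attribute is the token embedding) is to compare the embedding table size with the vocab used to numericalize the data:

emb = learn.model[0].encoder  # token embedding of the RNN encoder
print(emb.num_embeddings, len(learn.data.vocab.itos))
# an index-out-of-range inside torch.embedding usually means some batch
# contains token ids >= emb.num_embeddings, i.e. the model was built with
# a smaller vocab than the one used to numericalize the data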

So, I’ve started working on this problem again. I decided to build a new LM due to lots of API changes and, more importantly, because I like to have updated code. I ran the same code as above, and now I get the following error:

data_lm = (TextList.from_df(texts_df, path, cols=['name', 'item_description'], processor=[tok_proc, num_proc])
         .random_split_by_pct(0.1)
         .label_for_lm(cols=['name', 'item_description'])
         .databunch())
---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/process.py", line 232, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/process.py", line 191, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/process.py", line 191, in <listcomp>
    return [fn(*args) for args in chunk]
  File "/home/su0/fastai/fastai/text/transform.py", line 111, in _process_all_1
    return [self.process_text(t, tok) for t in texts]
  File "/home/su0/fastai/fastai/text/transform.py", line 111, in <listcomp>
    return [self.process_text(t, tok) for t in texts]
  File "/home/su0/fastai/fastai/text/transform.py", line 102, in process_text
    for rule in self.pre_rules: t = rule(t)
  File "/home/su0/fastai/fastai/text/transform.py", line 58, in fix_html
    x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
AttributeError: 'float' object has no attribute 'replace'
"""

The above exception was the direct cause of the following exception:

AttributeError                            Traceback (most recent call last)
<timed exec> in <module>

~/fastai/fastai/data_block.py in _inner(*args, **kwargs)
    391             self.valid = fv(*args, **kwargs)
    392             self.__class__ = LabelLists
--> 393             self.process()
    394             return self
    395         return _inner

~/fastai/fastai/data_block.py in process(self)
    438         "Process the inner datasets."
    439         xp,yp = self.get_processors()
--> 440         for i,ds in enumerate(self.lists): ds.process(xp, yp, filter_missing_y=i==0)
    441         return self
    442 

~/fastai/fastai/data_block.py in process(self, xp, yp, filter_missing_y)
    565             filt = array([o is None for o in self.y])
    566             if filt.sum()>0: self.x,self.y = self.x[~filt],self.y[~filt]
--> 567         self.x.process(xp)
    568         return self
    569 

~/fastai/fastai/data_block.py in process(self, processor)
     66         if processor is not None: self.processor = processor
     67         self.processor = listify(self.processor)
---> 68         for p in self.processor: p.process(self)
     69         return self
     70 

~/fastai/fastai/text/data.py in process(self, ds)
    241         tokens = []
    242         for i in progress_bar(range(0,len(ds),self.chunksize), leave=False):
--> 243             tokens += self.tokenizer.process_all(ds.items[i:i+self.chunksize])
    244         ds.items = tokens
    245 

~/fastai/fastai/text/transform.py in process_all(self, texts)
    115         if self.n_cpus <= 1: return self._process_all_1(texts)
    116         with ProcessPoolExecutor(self.n_cpus) as e:
--> 117             return sum(e.map(self._process_all_1, partition_by_cores(texts, self.n_cpus)), [])
    118 
    119 class Vocab():

/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/process.py in _chain_from_iterable_of_lists(iterable)
    474     careful not to keep references to yielded objects.
    475     """
--> 476     for element in iterable:
    477         element.reverse()
    478         while element:

/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/_base.py in result_iterator()
    584                     # Careful not to keep a reference to the popped future
    585                     if timeout is None:
--> 586                         yield fs.pop().result()
    587                     else:
    588                         yield fs.pop().result(end_time - time.monotonic())

/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433             else:
    434                 raise TimeoutError()

/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

AttributeError: 'float' object has no attribute 'replace'

More specifically, the error is at this step:

.label_for_lm()

I’m trying to debug this, but by default fastai uses multiple CPUs, and it’s hard to figure out errors when multiple processes are involved. I tried passing .label_for_lm(cols=['name', 'item_description'], n_cpus=1), but it kept using multiple CPUs. Furthermore, I couldn’t figure out where exactly the tokenizer is called by following the code.

Any help is appreciated.
Thanks.

Earlier, I was running this piece of code, which gave no errors:

data_lm = (TextList.from_df(texts_df, path, col=['name', 'item_description'], processor=[tok_proc, num_proc])
         .random_split_by_pct(0.1)
         .label_for_lm()
         .databunch())

and now this, which gives the error in the previous post:

data_lm = (TextList.from_df(texts_df, path, cols=['name', 'item_description'], processor=[tok_proc, num_proc])
         .random_split_by_pct(0.1)
         .label_for_lm()
         .databunch())

with the processors:

tok_proc = TokenizeProcessor(mark_fields=True)
num_proc = NumericalizeProcessor(max_vocab=60_091, min_freq=2)

The key part is the argument to TextList.from_df. Initially I had the column argument as just col, which worked, but when I switched to cols it gave the error. I made the switch because I wanted both fields (name and item_description) to show up in the dataset; with just col, only xxfld 1 was produced.

When I used a smaller dataset with cols, it didn’t give an error, and I was able to confirm that both xxfld 1 and xxfld 2 showed up. But when I run it with the full dataset, I get the error.

I’m still trying to debug these errors and I’m hoping @sgugger could clarify these questions:

  1. Why are both col and cols accepted as arguments, and what is the difference between them?
  2. Where exactly is the tokenizer called during .label_for_lm? This is where the error occurs: fix_html is called as part of the tokenizer’s default pre-rules.
  3. How do I force the tokenizer to use only 1 CPU so that it’s easier to debug?

Thanks.


Just an update:

Following my previous post, if I pass col as the argument (where I don’t get the error on the entire dataset) and print a random db.train_ds element, I get the following:

(Text xxbos xxfld 1 xxmaj rave xxmaj outfit xxmaj bundle, Category 0)

Now I try with cols as the argument but with a small dataset (using the entire dataset gives the error I mentioned); printing a random db.train_ds element, I get the following:

(Text xxbos xxfld 1 3 lip gloss for xxunk xxfld 2 lip gloss xxunk 2- c - thru xxunk space,
 Category 0)

As can be seen, the plural cols argument produces xxfld 1 and xxfld 2 markers for name and item_description in the batches (which is what I want), whereas with the singular col argument only the name part seems to show up.

I’m still not sure what difference col versus cols makes when creating the TextList. Will investigate further.

To answer your questions:

  1. col is just ignored; I believe you can pass pretty much any kwarg. In that case, cols defaults to the first column of your dataframe. Are you sure the second column contains string elements all the way to the end? It seems weird that it would fail otherwise.

  2. The tokenizer is called after the labeling in the process call (this line exactly).

  3. To force your own tokenizer, you must pass a custom preprocessor, as you did (see the sketch below).
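
For example, something along these lines (a sketch; it assumes fastai v1’s Tokenizer accepts an n_cpus argument, which controls the ProcessPoolExecutor seen in the traceback):

tok_proc = TokenizeProcessor(tokenizer=Tokenizer(n_cpus=1), mark_fields=True)

Setting defaults.cpus = 1 before building the databunch should also work, since the tokenizer falls back on that default.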

Thank you so much!

Here is a sample of the source dataframe:

	name	item_description
0	MLB Cincinnati Reds T Shirt Size XL	No description yet
1	Razer BlackWidow Chroma Keyboard	This keyboard is in great condition and works ...
2	AVA-VIV Blouse	Adorable top with a hint of lace and a key hol...
3	Leather Horse Statues	New with tags. Leather horses. Retail for [rm]...
4	24K GOLD plated rose	Complete with certificate of authenticity

As you can see, the first column is name and the second column is item_description, both of which are strings:

texts_df.dtypes

name                object
item_description    object
dtype: object

Currently, I’m reading a CSV file containing the texts into a dataframe and using the .from_df class method to create my databunch. Perhaps I could try .from_csv and see if it works.

I don’t like that about Python :frowning:

Update: the problem was with my dataset. There were 10 (just 10!) entries in item_description that were NaNs I hadn’t taken care of, and one of them was causing the problem. Once I fixed that, I was able to create the databunch without any problems, with all the fields marked correctly. Sorry about that!
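
For anyone hitting the same 'float' object has no attribute 'replace' error: pandas reads empty CSV cells as NaN floats, so a quick pre-check along these lines (plain pandas, nothing fastai-specific) catches them before tokenization:

print(texts_df[['name', 'item_description']].isna().sum())    # count NaNs per column
texts_df['item_description'] = texts_df['item_description'].fillna('')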


After building the LM, I’ve now started working on the regression problem. Here is a sample of my training data:

train_df.head()
train_id	name	price	item_description
0	0	MLB Cincinnati Reds T Shirt Size XL	10.0	No description yet
1	1	Razer BlackWidow Chroma Keyboard	52.0	This keyboard is in great condition and works ...
2	2	AVA-VIV Blouse	10.0	Adorable top with a hint of lace and a key hol...
3	3	Leather Horse Statues	35.0	New with tags. Leather horses. Retail for [rm]...
4	4	24K GOLD plated rose	44.0	Complete with certificate of authenticity

Testing data has similar structure.

Using the data block API, I think I was able to create the databunch I want, but I have a few questions about what I got and where to go from here. These are the things I did:

  1. I initialized my custom tokenize and numericalize processors and loaded up my saved language model databunch:
tok_proc = TokenizeProcessor(mark_fields=True)
num_proc = NumericalizeProcessor(max_vocab=60_091, min_freq=2)
data_lm = TextLMDataBunch.load(path, 'lm-toknum', processor=[tok_proc, num_proc])

I called show_batch on this databunch and everything looked good.

  2. Then, using the vocabulary of my LM databunch (data_lm.vocab), I was able to create my databunch. I’m showing the individual steps here to spell out what’s going on. First I created a TextList:
d = TextList.from_df(train_df, path, cols=['name', 'item_description'], vocab=data_lm.vocab)

Question 1: Do I need to pass the custom tokenizer/processor that I used for the LM here? It works even without it, but I don’t see marked fields.

  3. I split by index and performed the labeling. I went with label_from_df, as in tabular databunch creation:
d = d.split_by_idx(valid_idx)
d = d.label_from_df(cols=[dep_var], label_cls=FloatList, log=True)

Question 2: This takes some time; I assume this is where tokenization and numericalization of the training and validation sets happen. Is that right?
Question 3: Does passing the dependent variable in the cols argument along with FloatList set this up as a regression problem, as I think it does?

  4. Next, I added the test set:
d = d.add_test(TextList.from_df(test_df, path, cols=['name', 'item_description'], vocab=data_lm.vocab))

Question 4: Again, do I have to pass my custom tokenize/numericalize processors here?

  5. Finally, I created the databunch:
d = d.databunch()

When I call show_batch on this databunch, I see one column of text and another column of floats (i.e., log values of the price variable).

Question 5: There are two columns of text in the original data frame (name and item_description) representing two fields. Have these two been merged to get one full text field?

Question 6: The fields are not marked (i.e., I don’t see xxfld 1 and xxfld 2 as I do in the LM databunch). I’m guessing I need a custom tokenizer for that. Will that be the one I created for the LM databunch?

Thanks.

So, I decided to go with my intuition and create a databunch for my regression problem. It got created without any errors, but I’m still not 100% sure whether what I have is correct and going to work. Here is the code (pretty much the same as the previous post, simplified):

data_reg = (TextList.from_df(train_df, path, cols=['name', 'item_description'], vocab=data_lm.vocab, processor=[tok_proc, num_proc])
           .split_by_idx(get_rdm_idx(train_df))
           .label_from_df(cols=['price'], label_cls=FloatList, log=True)
           .add_test(TextList.from_df(test_df, path, cols=['name', 'item_description'], vocab=data_lm.vocab, processor=[tok_proc, num_proc]))
           .databunch())

There is one problem. Since data_reg uses the same vocab as data_lm, I would expect the vocabulary sizes to be the same too. But I get different lengths for the stoi's (though the itos's match):

len(data_lm.vocab.itos)
60093
len(data_lm.vocab.stoi)
60093
len(data_reg.vocab.itos)
60093
len(data_reg.vocab.stoi)
295127

I don’t know why data_reg.vocab.stoi is so much bigger than data_reg.vocab.itos. Should they actually be the same, since stoi is created from itos?
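
A likely explanation, assuming fastai v1 builds Vocab.stoi as a defaultdict over itos: every out-of-vocabulary token looked up during numericalization is silently inserted into stoi mapped to index 0 (i.e. xxunk), while itos stays fixed. A minimal sketch of that behaviour:

import collections

itos = ['xxunk', 'xxpad', 'hello']
stoi = collections.defaultdict(int, {s:i for i,s in enumerate(itos)})
len(itos), len(stoi)      # (3, 3)
stoi['never-seen-token']  # returns 0 (xxunk) and inserts the key
len(itos), len(stoi)      # (3, 4)

If that is what’s happening, the larger stoi doesn’t mean the LM vocab was ignored; the extra keys all map to xxunk.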

  1. mark_fields is set to False by default, so you should pass a processor that sets it to True. I think this is a bug, since I believe we decided to default mark_fields to False when there is only one column and to True when there are several; let me check.
  2. Yes, the tokenization and numericalization happen at the end of the labelling.
  3. Absolutely
  4. Same answer as 2 :wink:
  5. Yes, columns are merged to make one big text, with field separators if mark_fields is True.
  6. It should be the same for all your tasks, if you want those fields marked.

Thank you for your replies. It helps me a lot in using the library to do what I want to do.

I’m still not exactly sure why stoi and itos lengths are different for the regression databunch vocab. My concern is that the LM vocab is not being utilized correctly for the regression task (even though I’m passing it in during creation).

Also, fastai.text has a language_model_learner and a text_classifier_learner. What would I need to do to get a custom learner for the regression problem now that my data is ready? Do I create a custom learner from the base class RNNLearner?

Thanks.
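
A note for anyone picking this up later: since FloatList labels set the databunch’s c to 1 and attach an MSE loss, text_classifier_learner may work as a regressor without any custom class. A rough sketch against the fastai v1 API used in this thread, where 'fine-tuned-enc' is a hypothetical name for a saved LM encoder:

learn = text_classifier_learner(data_reg, drop_mult=0.5)
learn.load_encoder('fine-tuned-enc')  # hypothetical saved-encoder name
learn.metrics = [rmse]
learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))

If the head dimensions still come out wrong, the PoolingLinearRegressor approach from earlier in the thread is the fallback.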