Regression using Fine-tuned Language Model

sgugger · November 15, 2018, 3:30pm

Oh yeah, looks very inefficient. Can you change that line to

train_idx = np.setdiff1d(arange_of(self.items), valid_idx)

and report if it solves your problem?
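
For reference, here is a small sketch of what the suggested line computes. `arange_of` is a fastai helper; `np.arange(n_items)` stands in for it here, and the item count and held-out fraction are made up for illustration.

```python
import numpy as np

n_items = 10                                  # stand-in for len(self.items)
rng = np.random.default_rng(0)
valid_idx = rng.permutation(n_items)[:n_items // 5]   # hold out 20% of the indices

# setdiff1d returns the sorted, unique indices that are NOT in valid_idx,
# computed in one vectorized call rather than a Python-level membership loop.
train_idx = np.setdiff1d(np.arange(n_items), valid_idx)

print(len(train_idx), len(valid_idx))
```

The two index arrays are disjoint and together cover every item, which is all the split needs.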

shaun1 · November 15, 2018, 3:43pm

Yup, that worked. It took 4.11 s:

%%time
data_lm = data_lm.random_split_by_pct(0.1)

CPU times: user 19.5 s, sys: 851 ms, total: 20.4 s
Wall time: 4.11 s

Just to confirm that I didn’t run this on a sample:

len(data_lm.train)
5072171

Thank you for that! I’m guessing a future git pull will have this modified code in there.

shaun1 · November 15, 2018, 5:14pm

After the fix, everything went smoothly and I was able to process and create a databunch. However, when I tried to save the result of the processing I got the following error:

data_lm.save('lm-tokens')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-fe083dabc18d> in <module>
----> 1 data_lm.save('lm-tokens')

~/fastai/fastai/text/data.py in save(self, cache_name)
    106         cache_path = self.path/cache_name
    107         pickle.dump(self.train_ds.vocab.itos, open(cache_path/f'itos.pkl', 'wb'))
--> 108         np.save(cache_path/f'train_ids.npy', self.train_ds.x.items)
    109         np.save(cache_path/f'train_lbl.npy', self.train_ds.y.items)
    110         np.save(cache_path/f'valid_ids.npy', self.valid_ds.x.items)

/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer-su0/lib/python3.7/site-packages/numpy/lib/npyio.py in save(file, arr, allow_pickle, fix_imports)
    517 
    518     try:
--> 519         arr = np.asanyarray(arr)
    520         format.write_array(fid, arr, allow_pickle=allow_pickle,
    521                            pickle_kwargs=pickle_kwargs)

/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer-su0/lib/python3.7/site-packages/numpy/core/numeric.py in asanyarray(a, dtype, order)
    551 
    552     """
--> 553     return array(a, dtype, copy=False, order=order, subok=True)
    554 
    555 

ValueError: only one element tensors can be converted to Python scalars

Upon investigating the issue further, I came across this question on Stack Overflow which describes the same issue. The basic problem seems to be that self.train_ds.x.items is a list of PyTorch tensors, which np.save fails to convert into an array. The solution was to explicitly convert the tensors into numpy arrays and then call np.save on the new list of numpy arrays.

I did this in debug mode (i.e., not by changing the actual code) and am not yet sure how this will affect loading. But if the devs have a different solution I’d be glad to hear it!

wgpubs · November 15, 2018, 6:48pm

Also I’m noticing that when in jupyter notebook, if I type data_lm.save( and then shift+tab … jupyter notebook just hangs for several seconds.

wgpubs · November 15, 2018, 7:07pm

I believe the code needs to be updated to cast the pytorch tensors -> numpy (and then from numpy -> pytorch tensor on load).

Something like:

np.save(cache_path/f'train_ids.npy', [i.numpy() for i in self.train_ds.x.items])
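
The save/load round trip for such a list can be sketched with NumPy alone. This is an illustration under assumptions, not the fastai implementation: the id arrays below are made up and stand in for the results of `i.numpy()`, and a temporary directory replaces `cache_path`. Because the texts have different lengths, the arrays are stored in an object array, which must be loaded with `allow_pickle=True`.

```python
import os
import tempfile

import numpy as np

# Made-up numericalized texts of different lengths; in the thread these would
# be the arrays obtained by calling .numpy() on each tensor.
ids = [np.array([2, 5, 9]), np.array([2, 7]), np.array([2, 4, 4, 8])]

# Build the object array explicitly: recent NumPy versions refuse to create
# a ragged array implicitly from a list of unequal-length arrays.
ragged = np.empty(len(ids), dtype=object)
for i, a in enumerate(ids):
    ragged[i] = a

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_ids.npy")
    np.save(path, ragged)
    # Object arrays are pickled on disk, so loading needs allow_pickle=True.
    loaded = np.load(path, allow_pickle=True)

round_trip_ok = all(np.array_equal(a, b) for a, b in zip(ids, loaded))
print(round_trip_ok)
```

On load, each element could then be turned back into a tensor, which is the second half of the cast described above.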

shaun1 · November 15, 2018, 7:37pm

Yes, I ended up doing that. I’m just wondering whether that’s in fact what the devs intended, and I’m also trying to bring it to their attention so that they can fix it in the code.

wgpubs · November 15, 2018, 7:51pm

FYI: Just pulled the latest from master and this looks to be fixed. Not sure when it will be rolled into one of the release branches though.

shaun1 · November 15, 2018, 7:52pm

Yup I did the same thing and it is fixed!

sgugger · November 15, 2018, 11:34pm

Yes load and save were broken, and I fixed them this afternoon.

shaun1 · November 17, 2018, 3:57pm

I was able to successfully fine-tune a LM using the pre-trained model with the datablock API on a custom dataset. I highlight the (small number of) steps here for documentation:

Assuming our data is in a pandas dataframe with several text fields that need to be combined into the text:

# my dataset consists of name and item_description
data_lm = (TextList.from_df(texts, PATH, cols=['name', 'item_description']) 
          .random_split_by_pct(0.1)
          .label_for_lm() # this does the tokenization and numericalization
          .databunch())

data_lm.save('lm-tokens')

# load the data (can be used in the future as well to prevent reprocessing)
data_lm = TextLMDataBunch.load(PATH, 'lm-tokens')
data_lm.show_batch() # take a look at the batch fed into the GPU

learn = language_model_learner(data_lm, pretrained_model=URLs.WT103, drop_mult=0.5, callback_fns=ShowGraph)
learn.lr_find()
learn.recorder.plot()
learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))
learn.recorder.plot_losses()
learn.save('fit-head')

learn.load('fit-head')
learn.unfreeze()
learn.lr_find()
learn.recorder.plot()

learn.fit_one_cycle(11, 1e-3, moms=(0.8,0.7))

With a dataset size of 5,635,745, it took me 21 hours, 22 minutes, and 6 seconds to run this on a V100 with a final training loss of 2.697805, valid loss of 2.571279, and accuracy of 0.524987.

msmedes · November 30, 2018, 11:19pm

Did you wind up getting regression to work with a pre-trained LM?

shaun1 · November 30, 2018, 11:29pm

Not yet. Although I haven’t worked on it for a while. The devs mentioned that the latest data block API made working on a regression problem easier. But I haven’t yet looked at it. I will in the next couple of days.

msmedes · November 30, 2018, 11:34pm

Gotcha. I’m trying to figure out how to do regression on the indices of particular tokens, kind of like bounding boxes with category labels in a CNN but haven’t yet worked out what the input/head needs to look like for that to work.

britton · December 9, 2018, 6:01am

I’m working on a similar task (trying to turn the text classifier into a regressor) and thanks to the posts above, have gotten my databunch set up. I’ve changed the loss function like so:

def rmse(preds, targs):
    """Compute root mean squared error"""
    # torch.sqrt keeps the computation in PyTorch, so it also works on GPU tensors
    return torch.sqrt(torch.mean((targs - preds).pow(2)))

learn.loss_func = F.mse_loss
learn.metrics = [rmse]

The final layer of the classifier is a Linear layer with input size 50 to output size 0, and I’ve changed that to output size 1 by altering the __init__ method of the PoolingLinearClassifier: (left out the rest for brevity)

class PoolingLinearRegressor(nn.Module):
    "Create a linear regressor with pooling."

    def __init__(self, layers:Collection[int], drops:Collection[float]):
        super().__init__()
        mod_layers = []
        activs = [nn.ReLU(inplace=True)] * (len(layers) - 2) + [None]
        for n_in,n_out,p,actn in zip(layers[:-1],layers[1:], drops, activs):
            mod_layers += bn_drop_lin(n_in, n_out, p=p, actn=actn)
        # Replace the final layer with a single-output layer for regression
        mod_layers[-1] = nn.Linear(in_features=50, out_features=1, bias=True)
        self.layers = nn.Sequential(*mod_layers)

When I try training the learner, I hit an embedding index error:

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1410         # remove once script supports set_grad_enabled
   1411         torch.no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1412     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1413 
   1414 

RuntimeError: index out of range at /opt/conda/conda-bld/pytorch-nightly_1543482224190/work/aten/src/TH/generic/THTensorEvenMoreMath.cpp:191

I don’t understand yet how my model changes have affected the embeddings.

shaun1 · December 13, 2018, 1:15pm

So, I’ve started working on this problem again. I decided to build a new LM due to lots of API changes and more importantly, I like to have updated code. I ran the same code as above and I get the following error now:

data_lm = (TextList.from_df(texts_df, path, cols=['name', 'item_description'], processor=[tok_proc, num_proc])
         .random_split_by_pct(0.1)
         .label_for_lm(cols=['name', 'item_description'])
         .databunch())

---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/process.py", line 232, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/process.py", line 191, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/process.py", line 191, in <listcomp>
    return [fn(*args) for args in chunk]
  File "/home/su0/fastai/fastai/text/transform.py", line 111, in _process_all_1
    return [self.process_text(t, tok) for t in texts]
  File "/home/su0/fastai/fastai/text/transform.py", line 111, in <listcomp>
    return [self.process_text(t, tok) for t in texts]
  File "/home/su0/fastai/fastai/text/transform.py", line 102, in process_text
    for rule in self.pre_rules: t = rule(t)
  File "/home/su0/fastai/fastai/text/transform.py", line 58, in fix_html
    x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
AttributeError: 'float' object has no attribute 'replace'
"""

The above exception was the direct cause of the following exception:

AttributeError                            Traceback (most recent call last)
<timed exec> in <module>

~/fastai/fastai/data_block.py in _inner(*args, **kwargs)
    391             self.valid = fv(*args, **kwargs)
    392             self.__class__ = LabelLists
--> 393             self.process()
    394             return self
    395         return _inner

~/fastai/fastai/data_block.py in process(self)
    438         "Process the inner datasets."
    439         xp,yp = self.get_processors()
--> 440         for i,ds in enumerate(self.lists): ds.process(xp, yp, filter_missing_y=i==0)
    441         return self
    442 

~/fastai/fastai/data_block.py in process(self, xp, yp, filter_missing_y)
    565             filt = array([o is None for o in self.y])
    566             if filt.sum()>0: self.x,self.y = self.x[~filt],self.y[~filt]
--> 567         self.x.process(xp)
    568         return self
    569 

~/fastai/fastai/data_block.py in process(self, processor)
     66         if processor is not None: self.processor = processor
     67         self.processor = listify(self.processor)
---> 68         for p in self.processor: p.process(self)
     69         return self
     70 

~/fastai/fastai/text/data.py in process(self, ds)
    241         tokens = []
    242         for i in progress_bar(range(0,len(ds),self.chunksize), leave=False):
--> 243             tokens += self.tokenizer.process_all(ds.items[i:i+self.chunksize])
    244         ds.items = tokens
    245 

~/fastai/fastai/text/transform.py in process_all(self, texts)
    115         if self.n_cpus <= 1: return self._process_all_1(texts)
    116         with ProcessPoolExecutor(self.n_cpus) as e:
--> 117             return sum(e.map(self._process_all_1, partition_by_cores(texts, self.n_cpus)), [])
    118 
    119 class Vocab():

/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/process.py in _chain_from_iterable_of_lists(iterable)
    474     careful not to keep references to yielded objects.
    475     """
--> 476     for element in iterable:
    477         element.reverse()
    478         while element:

/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/_base.py in result_iterator()
    584                     # Careful not to keep a reference to the popped future
    585                     if timeout is None:
--> 586                         yield fs.pop().result()
    587                     else:
    588                         yield fs.pop().result(end_time - time.monotonic())

/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433             else:
    434                 raise TimeoutError()

/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

AttributeError: 'float' object has no attribute 'replace'

More specifically, the error is at this step:

.label_for_lm()

I’m trying to debug this, but by default fastai uses multiple CPUs and it’s hard to track down errors when multiple CPUs are involved. I tried passing .label_for_lm(cols=['name', 'item_description'], n_cpus=1) but it kept using multiple CPUs. Furthermore, I couldn’t figure out where exactly the tokenizer is called by following the code.

Any help is appreciated.
Thanks.
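
One way to narrow down an error like this without involving the worker processes is to scan the raw texts serially before handing them to fastai. The sketch below is plain Python and does not use the fastai API: `toy_rule` is a hypothetical stand-in for a string pre-rule such as fix_html, `find_failures` is an illustrative helper, and the sample texts are made up.

```python
def toy_rule(x):
    # Hypothetical stand-in for a string pre-rule such as fix_html:
    # it assumes its input is a str and calls .replace on it.
    return x.replace("amp;", "&")


def find_failures(texts, rule):
    """Apply `rule` to each text in a single process and collect failures."""
    failures = []
    for i, t in enumerate(texts):
        try:
            rule(t)
        except AttributeError as e:
            # Record the position, the offending type, and the error message.
            failures.append((i, type(t).__name__, str(e)))
    return failures


# Made-up sample: the second entry is a float NaN, the kind of value
# pandas uses for a missing cell.
texts = ["Rave Outfit Bundle", float("nan"), "lip gloss"]

failures = find_failures(texts, toy_rule)
for idx, type_name, msg in failures:
    print(f"row {idx}: {type_name}: {msg}")
```

Running the rule in a single process keeps the traceback short and points directly at the offending row index.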

shaun1 · December 13, 2018, 2:48pm

Earlier, I was running this piece of code which gave no errors:

data_lm = (TextList.from_df(texts_df, path, col=['name', 'item_description'], processor=[tok_proc, num_proc])
         .random_split_by_pct(0.1)
         .label_for_lm()
         .databunch())

and now this which gives the error in the previous post:

data_lm = (TextList.from_df(texts_df, path, cols=['name', 'item_description'], processor=[tok_proc, num_proc])
         .random_split_by_pct(0.1)
         .label_for_lm()
         .databunch())

with the processors:

tok_proc = TokenizeProcessor(mark_fields=True)
num_proc = NumericalizeProcessor(max_vocab=60_091, min_freq=2)

The key part is the argument to TextList.from_df. Initially I passed the columns as col, which worked. But when I switched to cols it gave the error. I made the switch because I wanted both fields (name and item_description) to show up in the dataset. With col, it only produced xxfld 1.

When I used a smaller dataset with cols it didn’t give an error and I was able to confirm that both xxfld 1 and xxfld 2 showed up. But when I run it with the full dataset I get the error.

I’m still trying to debug these errors and I’m hoping @sgugger could clarify these questions:

Why are both col and cols accepted as args, and what are the differences between them?
Where exactly is the tokenizer called during .label_for_lm? The error occurs when fix_html is called, which happens as part of default_pre_rules in the tokenizer.
How do I force the tokenizer to use only 1 CPU so that it’s easier to debug?

Thanks.

shaun1 · December 13, 2018, 4:01pm

Just an update:

Following my previous post, if I pass in col as the argument (in which case I don’t get the error on the entire dataset) and print a random db.train_ds element, I get the following:

(Text xxbos xxfld 1 xxmaj rave xxmaj outfit xxmaj bundle, Category 0)

Now, if I try cols as the argument but with a small dataset (using the entire dataset gives the error I mentioned) and print a random db.train_ds element, I get the following:

(Text xxbos xxfld 1 3 lip gloss for xxunk xxfld 2 lip gloss xxunk 2- c - thru xxunk space,
 Category 0)

As can be clearly seen, using the multiple columns cols argument, facilitates having xxfld 1 and xxfld 2 for name and item_description (which is what I want) in the batches, but that is not the case when using the singular col argument where I think only the name part seems to show up.

I’m still not sure what difference col and cols makes when creating the TextList. Will investigate further.

sgugger · December 13, 2018, 4:11pm

To answer your first questions:

col is just ignored. I believe you can probably pass any kwargs. In the first case, cols will default to the first column of your dataframe. Are you sure the second column contains string elements up until the end? It seems weird it would bug with that.
The tokenizer is called after the labeling in the process call (this line exactly).
To force your own tokenizer, you must pass a custom preprocessor as you did.

shaun1 · December 13, 2018, 4:18pm

Thank you so much!

Here is a sample of the source dataframe:

	name	item_description
0	MLB Cincinnati Reds T Shirt Size XL	No description yet
1	Razer BlackWidow Chroma Keyboard	This keyboard is in great condition and works ...
2	AVA-VIV Blouse	Adorable top with a hint of lace and a key hol...
3	Leather Horse Statues	New with tags. Leather horses. Retail for [rm]...
4	24K GOLD plated rose	Complete with certificate of authenticity

As you can see, the first column is name and the second column is item_description, both of which are strings:

texts_df.dtypes

name                object
item_description    object
dtype: object

Currently, I’m reading a CSV file containing the texts into a dataframe and using the .from_df class method to create my databunch. Perhaps I could try .from_csv and see if it works.

I don’t like that with Python

shaun1 · December 14, 2018, 7:42pm

Update. The problem was with my dataset. There were 10 (just 10!) entries in item_description that were NaNs that I didn’t take care of. One of them was what was causing the problem. Once I fixed that, I was able to create the databunch without any problems with all the fields marked correctly. Sorry about that!
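
A check like the following would surface such rows before tokenization. It is a minimal sketch using pandas only: the dataframe is a made-up three-row stand-in for texts_df, and replacing missing text with an empty string is an assumption (dropping the rows is shown as an alternative).

```python
import numpy as np
import pandas as pd

cols = ["name", "item_description"]

# Made-up stand-in for texts_df with one missing description.
texts_df = pd.DataFrame({
    "name": ["AVA-VIV Blouse", "Leather Horse Statues", "24K GOLD plated rose"],
    "item_description": [
        "Adorable top with a hint of lace",
        np.nan,  # missing cell: a float, not a string
        "Complete with certificate of authenticity",
    ],
})

# Count missing values per text column before building the databunch.
n_missing = texts_df[cols].isna().sum()
print(n_missing)

# Option 1: replace missing text with an empty string.
clean_df = texts_df.copy()
clean_df[cols] = clean_df[cols].fillna("")

# Option 2: drop rows where any text column is missing.
dropped_df = texts_df.dropna(subset=cols)
```

Either cleaned frame contains only strings in the text columns, so string rules such as fix_html no longer receive a float.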