Oh yeah, looks very inefficient. Can you change that line to
train_idx = np.setdiff1d(arange_of(self.items), valid_idx)
and report if it solves your problem?
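For anyone following along, here is a minimal sketch of what that line computes. It assumes plain `np.arange` in place of fastai's `arange_of` helper and a made-up dataset size, so treat it as an illustration rather than the library's code.

```python
import numpy as np

n_items = 10                      # hypothetical dataset size
rng = np.random.default_rng(0)

# Pick 10% of the indices (here: 1 item) for validation, without repeats.
valid_idx = rng.permutation(n_items)[: int(0.1 * n_items)]

# Every index that is not in valid_idx, returned sorted and unique.
train_idx = np.setdiff1d(np.arange(n_items), valid_idx)
```

The set difference works on whole arrays at once, instead of testing each index for membership in a Python loop.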
Yup, that worked. Took 4.11 s:
%%time
data_lm = data_lm.random_split_by_pct(0.1)
CPU times: user 19.5 s, sys: 851 ms, total: 20.4 s
Wall time: 4.11 s
Just to confirm I didn’t run the sample:
len(data_lm.train)
5072171
Thank you for that! I’m guessing a future git pull will have this modified code in there.
After the fix, everything went smoothly and I was able to process and create a databunch. However, when I tried to save the result of the processing I got the following error:
data_lm.save('lm-tokens')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-8-fe083dabc18d> in <module>
----> 1 data_lm.save('lm-tokens')
~/fastai/fastai/text/data.py in save(self, cache_name)
106 cache_path = self.path/cache_name
107 pickle.dump(self.train_ds.vocab.itos, open(cache_path/f'itos.pkl', 'wb'))
--> 108 np.save(cache_path/f'train_ids.npy', self.train_ds.x.items)
109 np.save(cache_path/f'train_lbl.npy', self.train_ds.y.items)
110 np.save(cache_path/f'valid_ids.npy', self.valid_ds.x.items)
/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer-su0/lib/python3.7/site-packages/numpy/lib/npyio.py in save(file, arr, allow_pickle, fix_imports)
517
518 try:
--> 519 arr = np.asanyarray(arr)
520 format.write_array(fid, arr, allow_pickle=allow_pickle,
521 pickle_kwargs=pickle_kwargs)
/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer-su0/lib/python3.7/site-packages/numpy/core/numeric.py in asanyarray(a, dtype, order)
551
552 """
--> 553 return array(a, dtype, copy=False, order=order, subok=True)
554
555
ValueError: only one element tensors can be converted to Python scalars
Upon investigating the issue further, I came across this question on Stack Overflow which talks about the same issue. The basic problem seems to be that self.train_ds.x.items is a list of PyTorch tensors, which causes a problem when it is passed to np.save. The solution was to explicitly convert the tensors into numpy arrays and then call np.save on the new list of numpy arrays.
I did this in debug mode (i.e., not by changing the actual code) and am not yet sure how this will affect loading. But if the devs have a different solution I’d be glad to hear it!
Also, I’m noticing that in a Jupyter notebook, if I type data_lm.save( and then press shift+tab, the notebook just hangs for several seconds.
I believe the code needs to be updated to cast the PyTorch tensors -> numpy (and then from numpy -> PyTorch tensor on load).
Something like:
np.save(cache_path/f'train_ids.npy', [i.numpy() for i in self.train_ds.x.items])
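Since the open question is how this affects loading, here is a rough sketch of the save/load round trip using numpy only. The small arrays are stand-ins for the result of calling `.numpy()` on each tensor, and the file name and temporary directory are just for illustration; this is not the fastai implementation.

```python
import tempfile
from pathlib import Path

import numpy as np

# Stand-ins for `[t.numpy() for t in self.train_ds.x.items]`:
# token-id sequences of different lengths.
ids = [np.array([2, 15, 7]), np.array([2, 9]), np.array([2, 4, 4, 8])]

# Build an explicit object array so ragged lengths are stored as-is.
arr = np.empty(len(ids), dtype=object)
for i, a in enumerate(ids):
    arr[i] = a

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train_ids.npy"
    np.save(path, arr)                          # object arrays are pickled on save
    loaded = np.load(path, allow_pickle=True)   # needed to read object arrays back
```

On load, each element comes back as a numpy array, so it could be turned back into a tensor with `torch.from_numpy` if needed.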
Yes, I ended up doing that. I’m just wondering whether that is in fact what the devs intend, and I’m also trying to bring it to their attention so that they can fix it in the code.
FYI: Just pulled the latest from master and this looks to be fixed. Not sure when it will be rolled into one of the release branches though.
Yup I did the same thing and it is fixed!
Yes load and save were broken, and I fixed them this afternoon.
I was able to successfully fine-tune a LM using the pre-trained model with the datablock API on a custom dataset. I highlight the (small number of) steps here for documentation:
Assuming our data is in a pandas dataframe with several text fields that all need to go into the text:
# my dataset consists of name and item_description
data_lm = (TextList.from_df(texts, PATH, cols=['name', 'item_description'])
.random_split_by_pct(0.1)
.label_for_lm() # this does the tokenization and numericalization
.databunch())
data_lm.save('lm-tokens')
# load the data (can be used in the future as well to prevent reprocessing)
data_lm = TextLMDataBunch.load(PATH, 'lm-tokens')
data_lm.show_batch() # take a look at the batch fed into the GPU
learn = language_model_learner(data_lm, pretrained_model=URLs.WT103, drop_mult=0.5, callback_fns=ShowGraph)
learn.lr_find()
learn.recorder.plot()
learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))
learn.recorder.plot_losses()
learn.save('fit-head')
learn.load('fit-head')
learn.unfreeze()
learn.lr_find()
learn.recorder.plot()
learn.fit_one_cycle(11, 1e-3, moms=(0.8,0.7))
With a dataset size of 5,635,745, it took me 21 hours, 22 minutes, and 6 seconds to run this on a V100 with a final training loss of 2.697805, valid loss of 2.571279, and accuracy of 0.524987.
Did you wind up getting regression to work with a pre-trained LM?
Not yet. Although I haven’t worked on it for a while. The devs mentioned that the latest data block API made working on a regression problem easier. But I haven’t yet looked at it. I will in the next couple of days.
Gotcha. I’m trying to figure out how to do regression on the indices of particular tokens, kind of like bounding boxes with category labels in a CNN but haven’t yet worked out what the input/head needs to look like for that to work.
I’m working on a similar task (trying to turn the text classifier into a regressor) and thanks to the posts above, have gotten my databunch set up. I’ve changed the loss function like so:
def rmse(preds, targs):
    """Compute root mean squared error"""
    # torch.sqrt keeps the result a tensor (np.sqrt is not meant for torch tensors)
    return torch.sqrt(torch.mean((targs - preds).pow(2)))
learn.loss_func = F.mse_loss
learn.metrics = [rmse]
The final layer of the classifier is a Linear layer with input size 50 to output size 0, and I’ve changed that to output size 1 by altering the init method of the PoolingLinearClassifier: (left out the rest for brevity)
class PoolingLinearRegressor(nn.Module):
    "Create a linear regressor with pooling."
    def __init__(self, layers:Collection[int], drops:Collection[float]):
        super().__init__()
        mod_layers = []
        activs = [nn.ReLU(inplace=True)] * (len(layers) - 2) + [None]
        for n_in,n_out,p,actn in zip(layers[:-1],layers[1:], drops, activs):
            mod_layers += bn_drop_lin(n_in, n_out, p=p, actn=actn)
        mod_layers[-1] = nn.Linear(in_features=50, out_features=1, bias=True)
        self.layers = nn.Sequential(*mod_layers)
When I try training the learner, I hit an embedding index error:
~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
1410 # remove once script supports set_grad_enabled
1411 torch.no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1412 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
1413
1414
RuntimeError: index out of range at /opt/conda/conda-bld/pytorch-nightly_1543482224190/work/aten/src/TH/generic/THTensorEvenMoreMath.cpp:191
I don’t understand yet how my model changes have affected the embeddings.
So, I’ve started working on this problem again. I decided to build a new LM due to lots of API changes and more importantly, I like to have updated code. I ran the same code as above and I get the following error now:
data_lm = (TextList.from_df(texts_df, path, cols=['name', 'item_description'], processor=[tok_proc, num_proc])
.random_split_by_pct(0.1)
.label_for_lm(cols=['name', 'item_description'])
.databunch())
---------------------------------------------------------------------------
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/process.py", line 232, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/process.py", line 191, in _process_chunk
return [fn(*args) for args in chunk]
File "/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/process.py", line 191, in <listcomp>
return [fn(*args) for args in chunk]
File "/home/su0/fastai/fastai/text/transform.py", line 111, in _process_all_1
return [self.process_text(t, tok) for t in texts]
File "/home/su0/fastai/fastai/text/transform.py", line 111, in <listcomp>
return [self.process_text(t, tok) for t in texts]
File "/home/su0/fastai/fastai/text/transform.py", line 102, in process_text
for rule in self.pre_rules: t = rule(t)
File "/home/su0/fastai/fastai/text/transform.py", line 58, in fix_html
x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
AttributeError: 'float' object has no attribute 'replace'
"""
The above exception was the direct cause of the following exception:
AttributeError Traceback (most recent call last)
<timed exec> in <module>
~/fastai/fastai/data_block.py in _inner(*args, **kwargs)
391 self.valid = fv(*args, **kwargs)
392 self.__class__ = LabelLists
--> 393 self.process()
394 return self
395 return _inner
~/fastai/fastai/data_block.py in process(self)
438 "Process the inner datasets."
439 xp,yp = self.get_processors()
--> 440 for i,ds in enumerate(self.lists): ds.process(xp, yp, filter_missing_y=i==0)
441 return self
442
~/fastai/fastai/data_block.py in process(self, xp, yp, filter_missing_y)
565 filt = array([o is None for o in self.y])
566 if filt.sum()>0: self.x,self.y = self.x[~filt],self.y[~filt]
--> 567 self.x.process(xp)
568 return self
569
~/fastai/fastai/data_block.py in process(self, processor)
66 if processor is not None: self.processor = processor
67 self.processor = listify(self.processor)
---> 68 for p in self.processor: p.process(self)
69 return self
70
~/fastai/fastai/text/data.py in process(self, ds)
241 tokens = []
242 for i in progress_bar(range(0,len(ds),self.chunksize), leave=False):
--> 243 tokens += self.tokenizer.process_all(ds.items[i:i+self.chunksize])
244 ds.items = tokens
245
~/fastai/fastai/text/transform.py in process_all(self, texts)
115 if self.n_cpus <= 1: return self._process_all_1(texts)
116 with ProcessPoolExecutor(self.n_cpus) as e:
--> 117 return sum(e.map(self._process_all_1, partition_by_cores(texts, self.n_cpus)), [])
118
119 class Vocab():
/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/process.py in _chain_from_iterable_of_lists(iterable)
474 careful not to keep references to yielded objects.
475 """
--> 476 for element in iterable:
477 element.reverse()
478 while element:
/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/_base.py in result_iterator()
584 # Careful not to keep a reference to the popped future
585 if timeout is None:
--> 586 yield fs.pop().result()
587 else:
588 yield fs.pop().result(end_time - time.monotonic())
/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
430 raise CancelledError()
431 elif self._state == FINISHED:
--> 432 return self.__get_result()
433 else:
434 raise TimeoutError()
/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
AttributeError: 'float' object has no attribute 'replace'
More specifically, the error is at this step:
.label_for_lm()
I’m trying to debug this, but by default FastAI uses multiple CPUs and it’s hard to figure out errors when multiple CPUs are involved. I tried passing .label_for_lm(cols=['name', 'item_description'], n_cpus=1) but it kept using multiple CPUs. Furthermore, I couldn’t figure out where exactly the tokenizer is called by following the code.
Any help is appreciated.
Thanks.
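One way to narrow down an error like this without touching the multi-CPU path is to run the string rules over the raw items in a single process and record which items fail. The sketch below is an assumption-laden illustration: `fix_html_like` is a simplified stand-in for a pre-rule, and `find_bad_items` is a hypothetical helper, not part of fastai.

```python
def fix_html_like(x):
    # Simplified stand-in for a string pre-rule; fails on non-string input.
    return x.replace('#39;', "'").replace('amp;', '&')

def find_bad_items(texts, rules):
    """Apply every rule to every item in one process and collect
    (index, original value, error message) for items that fail."""
    bad = []
    for i, t in enumerate(texts):
        try:
            for rule in rules:
                t = rule(t)
        except AttributeError as e:
            bad.append((i, texts[i], str(e)))
    return bad

# Toy data with one non-string entry in the middle.
texts = ["Leather Horse Statues", float("nan"), "New with tags &amp; box"]
bad = find_bad_items(texts, [fix_html_like])
```

Each entry in `bad` gives the position of an offending item, which can then be looked up in the original dataframe.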
Earlier, I was running this piece of code which gave no errors:
data_lm = (TextList.from_df(texts_df, path, col=['name', 'item_description'], processor=[tok_proc, num_proc])
.random_split_by_pct(0.1)
.label_for_lm()
.databunch())
and now this which gives the error in the previous post:
data_lm = (TextList.from_df(texts_df, path, cols=['name', 'item_description'], processor=[tok_proc, num_proc])
.random_split_by_pct(0.1)
.label_for_lm()
.databunch())
with the processors:
tok_proc = TokenizeProcessor(mark_fields=True)
num_proc = NumericalizeProcessor(max_vocab=60_091, min_freq=2)
The key part is the argument to TextList.from_df. Initially I had the column as just col, which worked. But when I switched to cols, it gave the error. I did this because I wanted both fields (name and item_description) to show up in the dataset. With just col, it only produced xxfld 1.
When I used a smaller dataset with cols, it didn’t give an error and I was able to confirm that both xxfld 1 and xxfld 2 showed up. But when I run it with the full dataset, I get the error.
I’m still trying to debug these errors and I’m hoping @sgugger could clarify these questions:
Are both col and cols accepted as args, and what are the differences between them?
Where is the tokenizer called relative to .label_for_lm? This is where the error occurs, when .fix_html is called, which is called as part of default_pre_rules in the tokenizer.
Thanks.
Just an update:
Following my previous post, if I pass in col as the argument (in which case I don’t get the error on the entire dataset) and print a random db.train_ds element, I get the following:
(Text xxbos xxfld 1 xxmaj rave xxmaj outfit xxmaj bundle, Category 0)
Now, if I try with cols as the argument but with a small dataset (using the entire dataset gives the error I mentioned) and print a random db.train_ds element, I get the following:
(Text xxbos xxfld 1 3 lip gloss for xxunk xxfld 2 lip gloss xxunk 2- c - thru xxunk space,
Category 0)
As can be clearly seen, using the plural cols argument gives xxfld 1 and xxfld 2 for name and item_description (which is what I want) in the batches, but that is not the case with the singular col argument, where I think only the name part shows up.
I’m still not sure what difference col and cols make when creating the TextList. Will investigate further.
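To make the xxfld markers concrete, here is a small sketch of how several fields could be joined into one marked string. The token names match the output shown above, but `join_fields` is a hypothetical helper written for illustration and is not claimed to be fastai's own implementation.

```python
BOS, FLD = "xxbos", "xxfld"   # marker tokens as they appear in the output above

def join_fields(fields, mark_fields=True):
    """Join several text fields into one string, numbering fields from 1."""
    parts = [BOS]
    for i, text in enumerate(fields, start=1):
        if mark_fields:
            parts.append(f"{FLD} {i}")
        parts.append(text)
    return " ".join(parts)

row = ["3 lip gloss", "lip gloss 2-c-thru"]   # made-up name / item_description
joined = join_fields(row)          # two fields -> xxfld 1 and xxfld 2
single = join_fields(row[:1])      # one field  -> only xxfld 1
```

With only one column selected, there is only one field to mark, which matches seeing just xxfld 1.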
To answer your questions:
col is just ignored. I believe you can probably pass any kwargs. In the first case, cols will default to the first column of your dataframe. Are you sure the second column contains string elements up until the end? It seems weird it would bug with that.
The tokenizer is called after the labeling in the process call (this line exactly).
To force your own tokenizer, you must pass a custom preprocessor as you did.
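The point about col being silently ignored can be shown with a toy function. `from_df_like` below is a made-up stand-in, assuming a signature with a `cols` parameter defaulting to the first column plus `**kwargs`; it only illustrates the Python behaviour, not the actual fastai code.

```python
def from_df_like(df_columns, cols=0, **kwargs):
    """Toy constructor: `cols` selects columns by position or name.
    Unknown keywords (such as a misspelled `col`) land in kwargs and are never read."""
    if isinstance(cols, (int, str)):
        cols = [cols]
    return [df_columns[c] if isinstance(c, int) else c for c in cols]

columns = ["name", "item_description"]

# Misspelled keyword: swallowed by **kwargs, so the default (first column) is used.
with_col = from_df_like(columns, col=["name", "item_description"])

# Correct keyword: both columns are selected.
with_cols = from_df_like(columns, cols=["name", "item_description"])
```

No error is raised in the first call, which is why the typo is easy to miss.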
Thank you so much!
Here is a sample of the source dataframe:
name item_description
0 MLB Cincinnati Reds T Shirt Size XL No description yet
1 Razer BlackWidow Chroma Keyboard This keyboard is in great condition and works ...
2 AVA-VIV Blouse Adorable top with a hint of lace and a key hol...
3 Leather Horse Statues New with tags. Leather horses. Retail for [rm]...
4 24K GOLD plated rose Complete with certificate of authenticity
As you can see, the first column is name and the second column is item_description, both of which are strings:
texts_df.dtypes
name object
item_description object
dtype: object
Currently, I’m reading a csv file containing the texts into a dataframe and using the .from_df class method to create my databunch. Perhaps I could try .from_csv and see if it works.
I don’t like that with Python
Update. The problem was with my dataset. There were 10 (just 10!) entries in item_description that were NaNs that I didn’t take care of. One of them was what was causing the problem. Once I fixed that, I was able to create the databunch without any problems with all the fields marked correctly. Sorry about that!
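For reference, a short sketch of how missing entries like these can be counted and handled in pandas before building the databunch. The tiny dataframe is made up, and filling with an empty string is just one option (dropping the rows is another). Note that an object dtype does not guarantee every value is a string, since NaN is a float.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing description; dtype=object mirrors the thread.
texts_df = pd.DataFrame({
    "name": ["AVA-VIV Blouse", "Leather Horse Statues", "24K GOLD plated rose"],
    "item_description": ["Adorable top", np.nan, "Complete with certificate"],
}, dtype=object)

cols = ["name", "item_description"]

# Count missing values per column.
nan_counts = texts_df[cols].isna().sum()

# Option 1: replace missing text with an empty string.
clean_df = texts_df.copy()
clean_df[cols] = clean_df[cols].fillna("")

# Option 2: drop rows with a missing value in any text column.
dropped_df = texts_df.dropna(subset=cols)

# Verify that every remaining value really is a string.
all_str = all(isinstance(v, str) for c in cols for v in clean_df[c])
```

Running the count first shows how many rows are affected, which helps decide between filling and dropping.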