I believe the code needs to be updated to cast the PyTorch tensors to NumPy arrays on save (and then from NumPy arrays back to PyTorch tensors on load).
Something like:
np.save(cache_path/f'train_ids.npy', [i.numpy() for i in self.train_ds.x.items])
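A slightly fuller sketch of that round trip, in case it helps. The helper names (save_ids, load_ids) and the temp-file path are made up for illustration and are not part of fastai. It assumes the items are 1-D id tensors of varying length, which is why an object array and allow_pickle=True are used:

```python
import os
import tempfile

import numpy as np
import torch

def save_ids(path, tensors):
    """Save a list of 1-D id tensors (possibly of different lengths) to one .npy file."""
    arr = np.empty(len(tensors), dtype=object)  # object array copes with ragged lengths
    for i, t in enumerate(tensors):
        arr[i] = t.cpu().numpy()                # tensor -> numpy
    np.save(path, arr, allow_pickle=True)

def load_ids(path):
    """Load the arrays back and convert each one to a torch tensor."""
    arr = np.load(path, allow_pickle=True)      # object arrays need allow_pickle on load
    return [torch.from_numpy(a) for a in arr]   # numpy -> tensor

# Hypothetical token ids standing in for self.train_ds.x.items
ids = [torch.tensor([2, 5, 7]), torch.tensor([2, 9])]
cache_file = os.path.join(tempfile.mkdtemp(), "train_ids.npy")
save_ids(cache_file, ids)
restored = load_ids(cache_file)
```

Building the object array element by element avoids NumPy trying to stack sequences that happen to have equal length into a 2-D array.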
Yes, I ended up doing that. I’m just wondering whether that is in fact what the devs intended, and I’m also trying to bring it to their attention so that they can fix it in the code.
FYI: Just pulled the latest from master and this looks to be fixed. Not sure when it will be rolled into one of the release branches though.
Yup I did the same thing and it is fixed!
Yes load and save were broken, and I fixed them this afternoon.
I was able to successfully fine-tune a LM using the pre-trained model with the datablock API on a custom dataset. I highlight the (small number of) steps here for documentation:
Assuming our data is in a pandas dataframe with several fields (columns) that all need to be included in the text:
# my dataset consists of name and item_description
data_lm = (TextList.from_df(texts, PATH, cols=['name', 'item_description'])
.random_split_by_pct(0.1)
.label_for_lm() # this does the tokenization and numericalization
.databunch())
data_lm.save('lm-tokens')
# load the data (can be used in the future as well to prevent reprocessing)
data_lm = TextLMDataBunch.load(PATH, 'lm-tokens')
data_lm.show_batch() # take a look at the batch fed into the GPU
learn = language_model_learner(data_lm, pretrained_model=URLs.WT103, drop_mult=0.5, callback_fns=ShowGraph)
learn.lr_find()
learn.recorder.plot()
learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))
learn.recorder.plot_losses()
learn.save('fit-head')
learn.load('fit-head')
learn.unfreeze()
learn.lr_find()
learn.recorder.plot()
learn.fit_one_cycle(11, 1e-3, moms=(0.8,0.7))
With a dataset size of 5,635,745, it took me 21 hours, 22 minutes, and 6 seconds to run this on a V100 with a final training loss of 2.697805, valid loss of 2.571279, and accuracy of 0.524987.
Did you wind up getting regression to work with a pre-trained LM?
Not yet. Although I haven’t worked on it for a while. The devs mentioned that the latest data block API made working on a regression problem easier. But I haven’t yet looked at it. I will in the next couple of days.
Gotcha. I’m trying to figure out how to do regression on the indices of particular tokens, kind of like bounding boxes with category labels in a CNN but haven’t yet worked out what the input/head needs to look like for that to work.
I’m working on a similar task (trying to turn the text classifier into a regressor) and thanks to the posts above, have gotten my databunch set up. I’ve changed the loss function like so:
def rmse(preds, targs):
    """Compute root mean squared error"""
    # torch.sqrt keeps the result a tensor; np.sqrt on a torch tensor can fail (e.g. on GPU)
    return torch.sqrt(torch.mean((targs - preds).pow(2)))
learn.loss_func = F.mse_loss
learn.metrics = [rmse]
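One thing worth checking with a metric like this is the shape of the predictions. A head ending in a Linear layer with one output typically produces shape (N, 1), while the targets are often (N,); subtracting them broadcasts to (N, N) and silently gives a wrong value. A minimal sketch, assuming those shapes (the tensors below are made-up values):

```python
import torch
import torch.nn.functional as F

def rmse(preds, targs):
    """Root mean squared error computed with torch ops only."""
    preds = preds.reshape(-1)           # (N, 1) -> (N,)
    targs = targs.reshape(-1).float()   # make sure shapes and dtypes line up
    return torch.sqrt(F.mse_loss(preds, targs))

preds = torch.tensor([[1.0], [2.0], [3.0]])   # shape (3, 1), as from a 1-output linear head
targs = torch.tensor([1.0, 2.0, 5.0])         # shape (3,)
value = rmse(preds, targs)

# Without the reshape, (3,) - (3, 1) broadcasts to a (3, 3) matrix.
broadcast_shape = (targs - preds).shape
```

The same shape concern applies to F.mse_loss when it is used directly as the loss function.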
The final layer of the classifier is a Linear layer with input size 50 and output size 0, and I’ve changed that to output size 1 by altering the __init__ method of the PoolingLinearClassifier (the rest is left out for brevity):
class PoolingLinearRegressor(nn.Module):
    "Create a linear regressor with pooling."
    def __init__(self, layers:Collection[int], drops:Collection[float]):
        super().__init__()
        mod_layers = []
        activs = [nn.ReLU(inplace=True)] * (len(layers) - 2) + [None]
        for n_in,n_out,p,actn in zip(layers[:-1],layers[1:], drops, activs):
            mod_layers += bn_drop_lin(n_in, n_out, p=p, actn=actn)
        # replace the last layer with a single-output linear layer for regression
        mod_layers[-1] = nn.Linear(in_features=50, out_features=1, bias=True)
        self.layers = nn.Sequential(*mod_layers)
When I try training the learner, I hit an embedding index error:
~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/functional.py in
embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq,
sparse)
1410 # remove once script supports set_grad_enabled
1411 torch.no_grad_embedding_renorm_(weight, input, max_norm,
norm_type)
-> 1412 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq,
sparse)
1413
1414
RuntimeError: index out of range at /opt/conda/conda-bld/pytorch-
nightly_1543482224190/work/aten/src/TH/generic/THTensorEvenMoreMath.cpp:191
I don’t understand yet how my model changes have affected the embeddings.
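For what it’s worth, "index out of range" inside embedding means that some token id in the batch is negative or at least as large as the number of rows in the embedding matrix, so it may have nothing to do with the new head. One possible cause (an assumption, not confirmed in this thread) is data numericalized with a different or larger vocab than the one the encoder was built with. A small check, with a hypothetical helper name and toy numbers:

```python
def out_of_range_ids(batches, n_embeddings):
    """Return the sorted token ids that an embedding with n_embeddings rows cannot look up."""
    bad = set()
    for batch in batches:
        for idx in batch:
            if idx < 0 or idx >= n_embeddings:
                bad.add(idx)
    return sorted(bad)

# Toy batches of token ids; valid ids for 60_093 rows are 0 .. 60_092
batches = [[2, 5, 7], [3, 60_093, 11]]
bad_ids = out_of_range_ids(batches, n_embeddings=60_093)
```

Running something like this over the numericalized items and comparing against the encoder’s embedding size would show whether the vocab sizes disagree.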
So, I’ve started working on this problem again. I decided to build a new LM due to lots of API changes and more importantly, I like to have updated code. I ran the same code as above and I get the following error now:
data_lm = (TextList.from_df(texts_df, path, cols=['name', 'item_description'], processor=[tok_proc, num_proc])
.random_split_by_pct(0.1)
.label_for_lm(cols=['name', 'item_description'])
.databunch())
---------------------------------------------------------------------------
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/process.py", line 232, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/process.py", line 191, in _process_chunk
return [fn(*args) for args in chunk]
File "/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/process.py", line 191, in <listcomp>
return [fn(*args) for args in chunk]
File "/home/su0/fastai/fastai/text/transform.py", line 111, in _process_all_1
return [self.process_text(t, tok) for t in texts]
File "/home/su0/fastai/fastai/text/transform.py", line 111, in <listcomp>
return [self.process_text(t, tok) for t in texts]
File "/home/su0/fastai/fastai/text/transform.py", line 102, in process_text
for rule in self.pre_rules: t = rule(t)
File "/home/su0/fastai/fastai/text/transform.py", line 58, in fix_html
x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
AttributeError: 'float' object has no attribute 'replace'
"""
The above exception was the direct cause of the following exception:
AttributeError Traceback (most recent call last)
<timed exec> in <module>
~/fastai/fastai/data_block.py in _inner(*args, **kwargs)
391 self.valid = fv(*args, **kwargs)
392 self.__class__ = LabelLists
--> 393 self.process()
394 return self
395 return _inner
~/fastai/fastai/data_block.py in process(self)
438 "Process the inner datasets."
439 xp,yp = self.get_processors()
--> 440 for i,ds in enumerate(self.lists): ds.process(xp, yp, filter_missing_y=i==0)
441 return self
442
~/fastai/fastai/data_block.py in process(self, xp, yp, filter_missing_y)
565 filt = array([o is None for o in self.y])
566 if filt.sum()>0: self.x,self.y = self.x[~filt],self.y[~filt]
--> 567 self.x.process(xp)
568 return self
569
~/fastai/fastai/data_block.py in process(self, processor)
66 if processor is not None: self.processor = processor
67 self.processor = listify(self.processor)
---> 68 for p in self.processor: p.process(self)
69 return self
70
~/fastai/fastai/text/data.py in process(self, ds)
241 tokens = []
242 for i in progress_bar(range(0,len(ds),self.chunksize), leave=False):
--> 243 tokens += self.tokenizer.process_all(ds.items[i:i+self.chunksize])
244 ds.items = tokens
245
~/fastai/fastai/text/transform.py in process_all(self, texts)
115 if self.n_cpus <= 1: return self._process_all_1(texts)
116 with ProcessPoolExecutor(self.n_cpus) as e:
--> 117 return sum(e.map(self._process_all_1, partition_by_cores(texts, self.n_cpus)), [])
118
119 class Vocab():
/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/process.py in _chain_from_iterable_of_lists(iterable)
474 careful not to keep references to yielded objects.
475 """
--> 476 for element in iterable:
477 element.reverse()
478 while element:
/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/_base.py in result_iterator()
584 # Careful not to keep a reference to the popped future
585 if timeout is None:
--> 586 yield fs.pop().result()
587 else:
588 yield fs.pop().result(end_time - time.monotonic())
/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
430 raise CancelledError()
431 elif self._state == FINISHED:
--> 432 return self.__get_result()
433 else:
434 raise TimeoutError()
/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
AttributeError: 'float' object has no attribute 'replace'
More specifically, the error is at this step:
.label_for_lm()
I’m trying to debug this, but by default FastAI uses multiple CPUs and it’s hard to track down errors when multiple processes are involved. I tried passing .label_for_lm(cols=['name', 'item_description'], n_cpus=1) but it kept using multiple CPUs. Furthermore, I couldn’t figure out where exactly the tokenizer is called by following the code.
Any help is appreciated.
Thanks.
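One way to sidestep the multiprocessing problem when debugging is to apply a string rule to the raw texts yourself, in a single process, and collect the items that raise. A minimal sketch: fix_html_like is only a stand-in for a pre-rule that calls .replace, not fastai’s actual fix_html, and the texts are made up:

```python
def fix_html_like(x):
    """Stand-in for a string pre-rule; not fastai's actual fix_html."""
    return x.replace("#39;", "'").replace("amp;", "&")

def find_failing_items(texts, rule):
    """Apply `rule` to every item in this process and collect the ones that raise."""
    failures = []
    for i, t in enumerate(texts):
        try:
            rule(t)
        except Exception as exc:
            failures.append((i, t, repr(exc)))
    return failures

# A float NaN hiding among strings, as pandas produces for missing values
texts = ["No description yet", float("nan"), "New with tags"]
failures = find_failing_items(texts, fix_html_like)
```

Each entry in failures gives the position, the offending value, and the error, which makes it easy to look the row up in the dataframe.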
Earlier, I was running this piece of code which gave no errors:
data_lm = (TextList.from_df(texts_df, path, col=['name', 'item_description'], processor=[tok_proc, num_proc])
.random_split_by_pct(0.1)
.label_for_lm()
.databunch())
and now this which gives the error in the previous post:
data_lm = (TextList.from_df(texts_df, path, cols=['name', 'item_description'], processor=[tok_proc, num_proc])
.random_split_by_pct(0.1)
.label_for_lm()
.databunch())
with the processors:
tok_proc = TokenizeProcessor(mark_fields=True)
num_proc = NumericalizeProcessor(max_vocab=60_091, min_freq=2)
The key part is the argument to TextList.from_df. Initially I had the column as just col, which worked. But when I switched to cols, it gave the error. I did this because I wanted both fields (name and item_description) to show up in the dataset. With just col, it only produced xxfld 1.
When I used a smaller dataset with cols, it didn’t give an error and I was able to confirm that both xxfld 1 and xxfld 2 showed up. But when I run it with the full dataset I get the error.
I’m still trying to debug these errors and I’m hoping @sgugger could clarify these questions:
col and cols accepted as args and what are the differences between them?
.label_for_lm? This is where the error occurs when .fix_html is called, which is called as part of default_pre_rules in the tokenizer.
Thanks.
Just an update:
Following my previous post, if I pass in col as the argument (in which case I don’t get the error on the entire dataset) and print a random db.train_ds element, I get the following:
(Text xxbos xxfld 1 xxmaj rave xxmaj outfit xxmaj bundle, Category 0)
Now, if I try cols as the argument but with a small dataset (using the entire dataset gives the error I mentioned) and print a random db.train_ds element, I get the following:
(Text xxbos xxfld 1 3 lip gloss for xxunk xxfld 2 lip gloss xxunk 2- c - thru xxunk space, Category 0)
As can be clearly seen, using the multiple-column cols argument gives xxfld 1 and xxfld 2 for name and item_description (which is what I want) in the batches, but that is not the case with the singular col argument, where I think only the name part shows up.
I’m still not sure what difference col and cols make when creating the TextList. Will investigate further.
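To make the marked-fields output above concrete, here is a toy sketch of how several text columns could be joined into one string with field markers. It only mimics the xxbos / xxfld tokens seen in the output; join_fields is a hypothetical helper, not fastai’s implementation, and real output would also go through tokenization rules such as xxmaj:

```python
BOS, FLD = "xxbos", "xxfld"

def join_fields(values, mark_fields=True):
    """Join one row's text columns into a single string, optionally tagging each field."""
    parts = [BOS]
    for i, value in enumerate(values, start=1):
        if mark_fields:
            parts += [FLD, str(i)]   # e.g. "xxfld 1", "xxfld 2"
        parts.append(str(value))
    return " ".join(parts)

row = ["AVA-VIV Blouse", "Adorable top with a hint of lace"]
marked = join_fields(row)
plain = join_fields(row, mark_fields=False)
```

With a single column only xxfld 1 can ever appear, which is consistent with what the col run showed.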
To answer your first questions:
col is just ignored. I believe you can probably pass any kwargs. In the first case, cols will default to the first column of your dataframe. Are you sure the second column contains string elements all the way to the end? It seems weird that it would break on that.
The tokenizer is called after the labeling, in the process call (this line exactly).
To force your own tokenizer, you must pass a custom preprocessor as you did.
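The "col is just ignored" behaviour is what happens in plain Python whenever a function accepts **kwargs: a misspelled keyword is swallowed without an error and the real parameter keeps its default. A toy illustration, where from_df_like is a made-up stand-in and not the signature of TextList.from_df:

```python
def from_df_like(columns, cols=0, **kwargs):
    """Toy stand-in: pick text columns by index; unknown keyword arguments are silently accepted."""
    selected = [cols] if isinstance(cols, int) else list(cols)
    return [columns[c] for c in selected], sorted(kwargs)

columns = ["name", "item_description"]

# Misspelled keyword: `col` lands in kwargs, `cols` stays at its default (first column)
picked_typo, ignored_typo = from_df_like(columns, col=[0, 1])

# Correct keyword: both columns are selected
picked_ok, ignored_ok = from_df_like(columns, cols=[0, 1])
```

That would explain why the col version ran without errors on the full dataset: the second column, and whatever bad values it contained, was never read.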
Thank you so much!
Here is a sample of the source dataframe:
name item_description
0 MLB Cincinnati Reds T Shirt Size XL No description yet
1 Razer BlackWidow Chroma Keyboard This keyboard is in great condition and works ...
2 AVA-VIV Blouse Adorable top with a hint of lace and a key hol...
3 Leather Horse Statues New with tags. Leather horses. Retail for [rm]...
4 24K GOLD plated rose Complete with certificate of authenticity
As you can see, the first column is name and the second column is item_description, both of which are strings:
texts_df.dtypes
name object
item_description object
dtype: object
Currently, I’m reading a CSV file containing the texts into a dataframe and using the .from_df class method to create my databunch. Perhaps I could try .from_csv and see if it works.
I don’t like that with Python
Update: the problem was with my dataset. There were 10 (just 10!) entries in item_description that were NaNs that I hadn’t taken care of. One of them was what was causing the problem. Once I fixed that, I was able to create the databunch without any problems with all the fields marked correctly. Sorry about that!
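For anyone hitting the same error: an object dtype column can still hold float NaN values, so dtypes alone won’t reveal them. A quick pandas check before building the databunch could look like this (the toy dataframe is made up, and filling with an empty string is just one option; dropping the rows or using a placeholder text would also work):

```python
import numpy as np
import pandas as pd

# Toy dataframe with one missing description
df = pd.DataFrame({
    "name": ["Blouse", "Keyboard", "Statue"],
    "item_description": ["Adorable top", np.nan, "New with tags"],
})
text_cols = ["name", "item_description"]

nan_counts = df[text_cols].isna().sum()              # missing values per text column

cleaned = df.copy()
cleaned[text_cols] = cleaned[text_cols].fillna("")   # replace NaN with an empty string

# Confirm every remaining value really is a Python string
all_strings = all(isinstance(v, str) for col in text_cols for v in cleaned[col])
```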
After building the LM, now I’ve started working on the regression problem. Here is a sample of my training data:
train_df.head()
train_id name price item_description
0 0 MLB Cincinnati Reds T Shirt Size XL 10.0 No description yet
1 1 Razer BlackWidow Chroma Keyboard 52.0 This keyboard is in great condition and works ...
2 2 AVA-VIV Blouse 10.0 Adorable top with a hint of lace and a key hol...
3 3 Leather Horse Statues 35.0 New with tags. Leather horses. Retail for [rm]...
4 4 24K GOLD plated rose 44.0 Complete with certificate of authenticity
Testing data has similar structure.
Using the data block API, I think I was able to create the databunch I want but I have a few questions about what I got and where to go from here. These are the things I did:
tok_proc = TokenizeProcessor(mark_fields=True)
num_proc = NumericalizeProcessor(max_vocab=60_091, min_freq=2)
data_lm = TextLMDataBunch.load(path, 'lm-toknum', processor=[tok_proc, num_proc])
I called show_batch on this databunch and everything looked good.
Using the vocab from this databunch (data_lm.vocab), I was able to create the regression databunch. I’m showing individual steps here to specify what’s going on. First I created a TextList:
d = TextList.from_df(train_df, path, cols=['name', 'item_description'], vocab=data_lm.vocab)
Question 1: Do I need to pass the custom tokenizer/processor that I used for the LM here? It works even without it, but I don’t see marked fields.
Next I split and labeled the data, using label_from_df from the tabular databunch creation:
d = d.split_by_idx(valid_idx)
d = d.label_from_df(cols=[dep_var], label_cls=FloatList, log=True)
Question 2: This takes some time, as I think this is where the tokenization and numericalization of the training and validation sets happen. Is that right?
Question 3: Does passing the dependent variable in the cols argument along with FloatList set this up for a regression problem, as I think it does?
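On the log=True part: the sketch below assumes it means the targets are replaced by their natural log (an assumption, not confirmed in this thread). In that case the model predicts log prices, and predictions have to be exponentiated to get back to the original scale. The prices are taken from the sample rows above:

```python
import math

prices = [10.0, 52.0, 35.0]

# What the model would be trained on under the natural-log assumption
log_targets = [math.log(p) for p in prices]

# Mapping (perfect) predictions back to the price scale
recovered = [math.exp(t) for t in log_targets]
```

Note that math.log(0) is undefined, so any zero-valued targets would need to be handled first, for example by filtering them out or using log(1 + price) instead.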
d = d.add_test(TextList.from_df(test_df, path, cols=['name', 'item_description'], vocab=data_lm.vocab))
Question 4: Again, do I have to pass my custom tokenize/numericalize processors here?
d = d.databunch()
When I call show_batch on this databunch, I see one column of text and another column of floats (i.e., log values of the price variable).
Question 5: There are two columns of text in the original dataframe (name and item_description) representing two fields. Have these two been merged to get one full text field?
Question 6: The fields are not marked (i.e., I don’t see xxfld 1 and xxfld 2 as I do in the LM databunch). I’m guessing I need a custom tokenizer for that. Will that be the one I created for the LM databunch?
Thanks.
So, I decided to go with my intuition and create a databunch for my regression problem. It got created without any errors, but I’m still not 100% sure whether what I have is correct and going to work. Here is the code (pretty much the same as in the previous post, simplified):
data_reg = (TextList.from_df(train_df, path, cols=['name', 'item_description'], vocab=data_lm.vocab, processor=[tok_proc, num_proc])
.split_by_idx(get_rdm_idx(train_df))
.label_from_df(cols=['price'], label_cls=FloatList, log=True)
.add_test(TextList.from_df(test_df, path, cols=['name', 'item_description'], vocab=data_lm.vocab, processor=[tok_proc, num_proc]))
.databunch())
There is one problem. Since data_reg is using the same vocab as data_lm, I would think that the vocabulary size would also be the same. But I get different lengths for the stoi's (though the same for the itos's):
len(data_lm.vocab.itos)
60093
len(data_lm.vocab.stoi)
60093
len(data_reg.vocab.itos)
60093
len(data_reg.vocab.stoi)
295127
I don’t know why data_reg.vocab.stoi is so much larger than data_reg.vocab.itos. Should they actually be the same size, since stoi is created from itos?
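A possible explanation, assuming stoi is a collections.defaultdict(int) built from itos (an assumption about the Vocab implementation, not something confirmed in this thread): looking up a token that is not in the vocab inserts it into the dict with value 0, so stoi grows every time unseen words are numericalized, while itos stays fixed. A toy reproduction:

```python
from collections import defaultdict

itos = ["xxunk", "xxpad", "the", "keyboard"]
stoi = defaultdict(int, {tok: i for i, tok in enumerate(itos)})
size_before = len(stoi)

# Numericalizing text that contains words outside the vocab
tokens = ["the", "razer", "keyboard", "blackwidow"]
ids = [stoi[t] for t in tokens]   # indexing a missing key inserts it with value 0

size_after = len(stoi)
```

If that is what is happening, the extra entries all map to index 0 (the unknown token here), so the numericalized ids would still be consistent with the LM vocab even though len(stoi) is larger than len(itos).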
mark_fields is set to False by default, so you should pass a processor that sets it to True. I think this is a bug, since I believe we decided to default mark_fields to False when there is only one column and to True when there are several; let me check.
mark_fields is True.
Thank you for your replies. They help me a lot in using the library to do what I want to do.
I’m still not exactly sure why the stoi and itos lengths are different for the regression databunch vocab. My concern is that the LM vocab is not being utilized correctly for the regression task (even though I’m passing it in during creation).
Also, fastai.text has a language_model_learner and a text_classifier_learner. What would I need to do to get a custom learner for the regression problem now that my data is ready? Do I create a custom learner from the base class RNNLearner?
Thanks.