Wiki topic to discuss fastai v2 text module.
[Apologies in advance if this is covered in one of the walkthroughs – I wanted to jump in and haven’t had time to watch everything yet!]
I’m going through the wikitext tutorial for v2 (35_tutorial_wikitext and http://dev.fast.ai/tutorial.wikitext.html, which are almost but not exactly the same), and have hit a couple of snags.
First off, it seems that untar_data(URLs.WIKITEXT_TINY) used to create train.txt, valid.txt, and test.txt files, but now it creates only train.csv and test.csv. That's fine, but I'm not sure I adapted all of the later preprocessing correctly to account for the missing validation set.
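For reference, here is roughly how I'm reading the new CSV layout now. This is only a sketch: the fastai2.text.all import and the assumption of a single header-less text column are mine, so adjust to whatever the files actually contain.
import pandas as pd
from fastai2.text.all import *

path = untar_data(URLs.WIKITEXT_TINY)
# assumption: one unnamed text column, no header row
df_train = pd.read_csv(path/'train.csv', header=None)
df_test  = pd.read_csv(path/'test.csv',  header=None)
# with no valid.txt, I either treat test.csv as the validation split or
# carve a validation set out of train.csv (e.g. with RandomSplitter)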
Creating a databunch seems to work, but then when I try to look at a batch I get this error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-87-c8230bb1e75f> in <module>
----> 1 dbch.one_batch()
~/fastai_dev/fastai2/data/load.py in one_batch(self)
100 def create_item(self, s): return next(self.it) if s is None else self.dataset[s]
101 def create_batch(self, b): return (fa_collate,fa_convert)[self.bs is None](b)
--> 102 def one_batch(self): return next(iter(self))
103 def do_item(self, s): return self.after_item(self.create_item(s))
104 def do_batch(self, b): return self.retain(self.create_batch(self.before_batch(b)), b)
~/fastai_dev/fastai2/data/load.py in __iter__(self)
65 self.rng = random.Random(self.rng.randint(0,2**32-1))
66 self.before_iter()
---> 67 for b in _loaders[self.fake_l.num_workers==0](self.fake_l): yield self.after_batch(b)
68 self.after_iter()
69
~/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __init__(self, loader)
335 class _SingleProcessDataLoaderIter(_BaseDataLoaderIter):
336 def __init__(self, loader):
--> 337 super(_SingleProcessDataLoaderIter, self).__init__(loader)
338 assert self._timeout == 0
339 assert self._num_workers == 0
~/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __init__(self, loader)
301 def __init__(self, loader):
302 self._dataset = loader.dataset
--> 303 self._dataset_kind = loader._dataset_kind
304 self._auto_collation = loader._auto_collation
305 self._drop_last = loader.drop_last
~/fastai_dev/fastai2/core.py in __getattr__(self, k)
207 def __getattr__(self,k):
208 if k not in ('_xtra',self._default) and (self._xtra is None or k in self._xtra): return getattr(getattr(self,self._default), k)
--> 209 raise AttributeError(k)
210 def __dir__(self): return custom_dir(self, self._xtra)
211 def __setstate__(self,data): self.__dict__.update(data)
AttributeError: _dataset_kind
Any ideas what to make of this?
This is because you don’t have PyTorch 1.3
Also, I have fixed the tutorial to use the dataset in the format we have it (with untar_data) and not the original format.
Actually I think it’s because you do have v1.3, but you haven’t got the latest fastai v2.
Oops, sorry.
Aha, thank you! Yeah, now that I think back on it, I pulled the repo but forgot to reinstall, so I wasn't on the most bleeding-edge version.
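(For anyone else hitting this, a quick way to check both pieces. The fastai2 __version__ attribute and the editable-install command are assumptions on my part and depend on how you installed the repo:)
import torch
print(torch.__version__)     # the new DataLoader needs PyTorch 1.3+

import fastai2               # assumes the package is importable as `fastai2`
print(fastai2.__version__)   # should reflect your latest git pull

# with a git checkout, an editable install picks up pulls automatically:
#   pip install -e .   (run from the repo root)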
In 38_tutorial_ulmfit.ipynb, when I run:
learn = language_model_learner(dbunch, AWD_LSTM, vocab, metrics=[accuracy, Perplexity()], path=path, opt_func = partial(Adam, wd=0.1)).to_fp16()
I get a "not a gzip file" error. Any suggestions from the experts?
Details
OSError                                   Traceback (most recent call last)
~/.conda/envs/fastai_env/lib/python3.7/tarfile.py in gzopen(cls, name, mode, fileobj, compresslevel, **kwargs)
   1644         try:
-> 1645             t = cls.taropen(name, mode, fileobj, **kwargs)
   1646         except OSError:
~/.conda/envs/fastai_env/lib/python3.7/tarfile.py in taropen(cls, name, mode, fileobj, **kwargs)
   1620             raise ValueError("mode must be 'r', 'a', 'w' or 'x'")
-> 1621         return cls(name, mode, fileobj, **kwargs)
   1622
~/.conda/envs/fastai_env/lib/python3.7/tarfile.py in __init__(self, name, mode, fileobj, format, tarinfo, dereference, ignore_zeros, encoding, errors, pax_headers, debug, errorlevel, copybufsize)
   1483         self.firstmember = None
-> 1484         self.firstmember = self.next()
   1485
~/.conda/envs/fastai_env/lib/python3.7/tarfile.py in next(self)
   2288             try:
-> 2289                 tarinfo = self.tarinfo.fromtarfile(self)
   2290             except EOFHeaderError as e:
~/.conda/envs/fastai_env/lib/python3.7/tarfile.py in fromtarfile(cls, tarfile)
   1093         """
-> 1094         buf = tarfile.fileobj.read(BLOCKSIZE)
   1095         obj = cls.frombuf(buf, tarfile.encoding, tarfile.errors)
~/.conda/envs/fastai_env/lib/python3.7/gzip.py in read(self, size)
    275             raise OSError(errno.EBADF, "read() on write-only GzipFile object")
--> 276         return self._buffer.read(size)
    277
~/.conda/envs/fastai_env/lib/python3.7/_compression.py in readinto(self, b)
     67         with memoryview(b) as view, view.cast("B") as byte_view:
---> 68             data = self.read(len(byte_view))
     69             byte_view[:len(data)] = data
~/.conda/envs/fastai_env/lib/python3.7/gzip.py in read(self, size)
    462             self._init_read()
--> 463         if not self._read_gzip_header():
    464             self._size = self._pos
~/.conda/envs/fastai_env/lib/python3.7/gzip.py in _read_gzip_header(self)
    410         if magic != b'\037\213':
--> 411             raise OSError('Not a gzipped file (%r)' % magic)
    412
OSError: Not a gzipped file (b'...')

During handling of the above exception, another exception occurred:

ReadError                                 Traceback (most recent call last)
<ipython-input-...> in <module>
----> 1 learn = language_model_learner(dbunch, AWD_LSTM, vocab, metrics=[accuracy, Perplexity()], path=path, opt_func = partial(Adam, wd=0.1))
~/environments/fastai_dev/dev/local/text/learner.py in language_model_learner(dbunch, arch, vocab, config, drop_mult, pretrained, pretrained_fnames, **kwargs)
     90         warn("There are no pretrained weights for that architecture yet!")
     91         return learn
---> 92     model_path = untar_data(meta['url'], c_key='model')
     93     fnames = [list(model_path.glob(f'*.{ext}'))[0] for ext in ['pth', 'pkl']]
     94     learn = learn.load_pretrained(*fnames, vocab)
~/environments/fastai_dev/dev/local/data/external.py in untar_data(url, fname, dest, c_key, force_download, extract_func)
    207     if _get_check(url) and _check_file(fname) != _get_check(url):
    208         print(f"File downloaded is broken. Remove {fname} and try again.")
--> 209     extract_func(fname, dest.parent)
    210     return dest
~/environments/fastai_dev/dev/local/data/external.py in tar_extract(fname, dest)
    189 def tar_extract(fname, dest):
    190     "Extract `fname` to `dest` using `tarfile`"
--> 191     tarfile.open(fname, 'r:gz').extractall(dest)
    192
    193 #Cell
~/.conda/envs/fastai_env/lib/python3.7/tarfile.py in open(cls, name, mode, fileobj, bufsize, **kwargs)
   1589             else:
   1590                 raise CompressionError("unknown compression type %r" % comptype)
-> 1591             return func(name, filemode, fileobj, **kwargs)
   1592
   1593     elif "|" in mode:
~/.conda/envs/fastai_env/lib/python3.7/tarfile.py in gzopen(cls, name, mode, fileobj, compresslevel, **kwargs)
   1647             fileobj.close()
   1648             if mode == 'r':
-> 1649                 raise ReadError("not a gzip file")
   1650             raise
   1651         except:
ReadError: not a gzip file
Thanks.
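(For anyone who hits this later: a real gzip archive starts with the magic bytes b'\x1f\x8b', which is exactly the check failing in the traceback above, so a corrupted or partial download is a likely cause. The helper below is a hypothetical sketch; where untar_data caches the archive depends on your config.)
# hypothetical sanity check: is the downloaded archive really gzip-compressed?
def looks_like_gzip(fname):
    "True if `fname` starts with the gzip magic number."
    with open(fname, 'rb') as f:
        return f.read(2) == b'\x1f\x8b'

# point this at whatever archive untar_data downloaded (path depends on your setup);
# if it returns False, delete the file so it gets re-downloaded on the next call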
Just have a quick question here to make sure I understand properly: we can define a sequence length, and that is how far back (in words/sequences) our data is kept in memory for the model to use? (e.g. the previous 7 words if sl=7)
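To illustrate what I mean, a toy sketch (plain Python, not fastai internals):
# chop a token stream into windows of `sl` tokens, the way I picture seq_len working
tokens = "the quick brown fox jumps over the lazy dog today".split()
sl = 7
windows = [tokens[i:i+sl] for i in range(0, len(tokens), sl)]
print(windows)
# [['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the'], ['lazy', 'dog', 'today']]
# as I understand it, each forward pass sees one window of sl tokens, and the RNN's
# hidden state carried between windows is what lets it remember further back than sl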
Also, I know that for the IMDB_SAMPLE we can use attrgetter and pass in our column to grab our data. When I try replacing this with ColReader, my databunch is messed up: it pops out an array for my text instead of the values, @sgugger? It is a databunch:
imdb_clas = DataBlock(blocks=(TextBlock(vocab), CategoryBlock),
                      get_x=ColReader('text'),
                      get_y=ColReader('label'),
                      splitter=RandomSplitter())
dbunch = imdb_clas.databunch(df_tok, bs=64, seq_len=72)
Example output:
It is specifically on the x’s
Your dataframe should have listy elements, not strings. I suspect that’s why you have the problem.
Ah yes! Good catch! It looks to be doing so after tokenize_df:
df = pd.read_csv(path/'texts.csv')
df_tok, count = tokenize_df(df, text_cols='text')
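A quick sanity check on the tokenized frame (the 'text' column name is taken from the tokenize_df call above):
# after tokenize_df the text column should hold token lists, not raw strings
print(type(df_tok['text'].iloc[0]))   # expecting something list-like
print(df_tok['text'].iloc[0][:10])    # e.g. special tokens like 'xxbos', 'xxmaj' plus words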
Interesting conundrum actually, as this worked just fine for my language model:
imbd_lm = DataBlock(blocks=(TextBlock(vocab, is_lm=True), ),
                    get_x=ColReader('text'),
                    splitter=RandomSplitter(0.1))
Are you saying it's due to my labels? That was exactly it! Thank you @sgugger. I'm surprised that the x's have to go through attrgetter though.
I tried a getter function that worked too:
def _imdb_items(x): return (x['text'], x['label'])
@sgugger sorry to bug you again. I'm trying to fit a classifier (just in case it was the databunch, I kept it as an attrgetter) and I get:
RuntimeError: `lengths` array must be sorted in decreasing order when `enforce_sorted` is True. You can pass `enforce_sorted=False` to pack_padded_sequence and/or pack_sequence to sidestep this requirement if you do not need ONNX exportability.
I create the databunch like so:
dbunch = imdb_clas.databunch(df_tok, bs=64, seq_len=72)
Learner:
learn = text_classifier_learner(dbunch, AWD_LSTM, vocab, metrics=[accuracy],
                                drop_mult=0.5, opt_func=opt_func, path=path)
learn = learn.load_encoder('finetuned');
learn = learn.to_fp16(clip=0.1)
learn.fit_one_cycle(1)
Does this hint at my DataBunch’s sequences not aligning correctly?
Full stack trace:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-166-4dfb24161c57> in <module>()
----> 1 learn.fit_one_cycle(1)
11 frames
/usr/local/lib/python3.6/dist-packages/fastai2/callback/schedule.py in fit_one_cycle(self, n_epoch, lr_max, div, div_final, pct_start, wd, moms, cbs, reset_opt)
96 scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
97 'mom': combined_cos(pct_start, *moms)}
---> 98 self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
99
100 #Cell
/usr/local/lib/python3.6/dist-packages/fastai2/learner.py in fit(self, n_epoch, lr, wd, cbs, reset_opt)
259 try:
260 self.epoch=epoch; self('begin_epoch')
--> 261 self._do_epoch_train()
262 self._do_epoch_validate()
263 except CancelEpochException: self('after_cancel_epoch')
/usr/local/lib/python3.6/dist-packages/fastai2/learner.py in _do_epoch_train(self)
237 try:
238 self.dl = self.dbunch.train_dl; self('begin_train')
--> 239 self.all_batches()
240 except CancelTrainException: self('after_cancel_train')
241 finally: self('after_train')
/usr/local/lib/python3.6/dist-packages/fastai2/learner.py in all_batches(self)
215 def all_batches(self):
216 self.n_iter = len(self.dl)
--> 217 for o in enumerate(self.dl): self.one_batch(*o)
218
219 def one_batch(self, i, b):
/usr/local/lib/python3.6/dist-packages/fastai2/learner.py in one_batch(self, i, b)
221 try:
222 self._split(b); self('begin_batch')
--> 223 self.pred = self.model(*self.xb); self('after_pred')
224 if len(self.yb) == 0: return
225 self.loss = self.loss_func(self.pred, *self.yb); self('after_loss')
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
539 result = self._slow_forward(*input, **kwargs)
540 else:
--> 541 result = self.forward(*input, **kwargs)
542 for hook in self._forward_hooks.values():
543 hook_result = hook(self, input, result)
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py in forward(self, input)
90 def forward(self, input):
91 for module in self._modules.values():
---> 92 input = module(input)
93 return input
94
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
539 result = self._slow_forward(*input, **kwargs)
540 else:
--> 541 result = self.forward(*input, **kwargs)
542 for hook in self._forward_hooks.values():
543 hook_result = hook(self, input, result)
/usr/local/lib/python3.6/dist-packages/fastai2/text/models/core.py in forward(self, input)
82 raw_outputs,outputs,masks = [],[],[]
83 for i in range(0, sl, self.bptt):
---> 84 r,o = self.module(input[:,i: min(i+self.bptt, sl)])
85 masks.append(input[:,i: min(i+self.bptt, sl)] == self.pad_idx)
86 raw_outputs.append(r)
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
539 result = self._slow_forward(*input, **kwargs)
540 else:
--> 541 result = self.forward(*input, **kwargs)
542 for hook in self._forward_hooks.values():
543 hook_result = hook(self, input, result)
/usr/local/lib/python3.6/dist-packages/fastai2/text/models/awdlstm.py in forward(self, inp, from_embeds)
106 new_hidden,raw_outputs,outputs = [],[],[]
107 for l, (rnn,hid_dp) in enumerate(zip(self.rnns, self.hidden_dps)):
--> 108 if self.packed: raw_output = pack_padded_sequence(raw_output, lens, batch_first=True)
109 raw_output, new_h = rnn(raw_output, self.hidden[l])
110 if self.packed: raw_output = pad_packed_sequence(raw_output, batch_first=True)[0]
/usr/local/lib/python3.6/dist-packages/torch/nn/utils/rnn.py in pack_padded_sequence(input, lengths, batch_first, enforce_sorted)
280
281 data, batch_sizes = \
--> 282 _VF._pack_padded_sequence(input, lengths, batch_first)
283 return PackedSequence(data, batch_sizes, sorted_indices, None)
284
RuntimeError: `lengths` array must be sorted in decreasing order when `enforce_sorted` is True. You can pass `enforce_sorted=False` to pack_padded_sequence and/or pack_sequence to sidestep this requirement if you do not need ONNX exportability.
You should double-check that your DataLoaders are SortedDL, and if not, pass this as dl_type.
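For context, the underlying PyTorch requirement is easy to reproduce in isolation (a minimal sketch, nothing fastai-specific):
import torch
from torch.nn.utils.rnn import pack_padded_sequence

x = torch.zeros(3, 5, 2)   # a padded batch: 3 sequences, max length 5, 2 features
lens = [3, 5, 4]           # per-sequence lengths, NOT in decreasing order

try:
    pack_padded_sequence(x, lens, batch_first=True)   # default enforce_sorted=True
except RuntimeError as e:
    print(e)               # the same "must be sorted in decreasing order" error

# either arrange the batch so lengths decrease (which is what SortedDL helps with,
# as I understand it) or pass enforce_sorted=False
packed = pack_padded_sequence(x, lens, batch_first=True, enforce_sorted=False)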
They were not, they were TransformDL. Passing
dbunch = imdb_clas.databunch(df_tok, bs=64, seq_len=72, dl_type=SortedDL)
solved the issue.
Is this something that should be assumed automatically for text classification (i.e. made deterministic/automatic)? Just realized we can pass this into our DataBlock call as dl_type. Easy enough!
imdb_clas = DataBlock(blocks=(TextBlock(vocab), CategoryBlock),
                      get_items=_imdb_items,
                      splitter=RandomSplitter(),
                      dl_type=SortedDL)
Thanks for the help!!!
(Just noticed I missed that in your ULMFiT tutorial… oopsie!)
@sgugger, just looked at the source code for TextBlock:
def TextBlock(vocab=None, is_lm=False):
    return TransformBlock(type_tfms=Numericalize(vocab), dl_type=LMDataLoader if is_lm else SortedDL,
                          dbunch_kwargs={} if is_lm else {'before_batch': pad_input})
It should already be doing this, so any thoughts on why it defaults to a TransformDL instead? (I'll look into this later if I can; just passing on what I found.)
There's probably a bug in DataBlock then. Since it only happens with a target and not for the LM, I think the dl_type may be erased by the CategoryBlock; will investigate and fix tomorrow.
@muellerzr, it seems you had the same problem as I did when creating the learner for the text model. It seems the pretrained model for AWD_LSTM does not exist. Any idea? Thanks.
Yi Fan
It does, I was able to get it. Which version are you using? (I had to put a PR in because it wasn't there; I think it was merged last week or so.)
It works now. I did not check before asking you the question. Sorry.
Hi,
I'm trying to understand the AWD_LSTM, and I came across the following two lines (link below):
raw_output = self.input_dp(inp if from_embeds else self.encoder_dp(inp))
new_hidden,raw_outputs,outputs = [],[],[]
I do not fully understand what's going on in the first line, but it seems like it does nothing because of the line below it. Bug or feature?
raw_output != raw_outputs
You see later, at line 112, that raw_output is appended to raw_outputs:
raw_outputs.append(raw_output)
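In other words, schematically (a paraphrase of the loop, not the actual source):
# raw_output (singular) is the tensor flowing through the stack: it starts as the
# dropped-out embeddings and is overwritten by each LSTM layer in turn;
# raw_outputs (plural) just collects a copy of it after every layer
raw_output = input_dp(inp if from_embeds else encoder_dp(inp))
raw_outputs = []
for l, rnn in enumerate(rnns):
    raw_output, new_h = rnn(raw_output, hidden[l])
    raw_outputs.append(raw_output)   # the append at line 112 mentioned above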