Fastai v2 text

Wiki topic to discuss fastai v2 text module.


[Apologies in advance if this is covered in one of the walkthroughs – I wanted to jump in and haven’t had time to watch everything yet!]

I’m going through the wikitext tutorial for v2 (35_tutorial_wikitext and http://dev.fast.ai/tutorial.wikitext.html, which are almost but not exactly the same), and have hit a couple of snags.

First off, it seems that untar_data(URLs.WIKITEXT_TINY) used to create train.txt, valid.txt, and test.txt files, but now it creates only train.csv and test.csv. That’s fine, but I’m not sure I adapted all the later preprocessing correctly to accommodate the missing validation set.
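For reference, here is roughly the workaround I tried for the split (a sketch only; the header-less csv layout and the exact import path are assumptions on my part and may not match the fixed tutorial):

from fastai2.text.all import *   # adjust to however your dev install is laid out (fastai2 vs local)
import pandas as pd

path = untar_data(URLs.WIKITEXT_TINY)
df_train = pd.read_csv(path/'train.csv', header=None)   # assuming no header row
df_test  = pd.read_csv(path/'test.csv',  header=None)

# With no valid.txt shipped any more, hold out 10% of train.csv as a validation set
df_valid = df_train.sample(frac=0.1, random_state=42)
df_train = df_train.drop(df_valid.index)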

Creating a databunch seems to work, but then when I try to look at a batch I get this error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-87-c8230bb1e75f> in <module>
----> 1 dbch.one_batch()

~/fastai_dev/fastai2/data/load.py in one_batch(self)
    100     def create_item(self, s):  return next(self.it) if s is None else self.dataset[s]
    101     def create_batch(self, b): return (fa_collate,fa_convert)[self.bs is None](b)
--> 102     def one_batch(self):   return next(iter(self))
    103     def do_item(self, s):  return self.after_item(self.create_item(s))
    104     def do_batch(self, b): return self.retain(self.create_batch(self.before_batch(b)), b)

~/fastai_dev/fastai2/data/load.py in __iter__(self)
     65         self.rng = random.Random(self.rng.randint(0,2**32-1))
     66         self.before_iter()
---> 67         for b in _loaders[self.fake_l.num_workers==0](self.fake_l): yield self.after_batch(b)
     68         self.after_iter()
     69 

~/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __init__(self, loader)
    335 class _SingleProcessDataLoaderIter(_BaseDataLoaderIter):
    336     def __init__(self, loader):
--> 337         super(_SingleProcessDataLoaderIter, self).__init__(loader)
    338         assert self._timeout == 0
    339         assert self._num_workers == 0

~/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __init__(self, loader)
    301     def __init__(self, loader):
    302         self._dataset = loader.dataset
--> 303         self._dataset_kind = loader._dataset_kind
    304         self._auto_collation = loader._auto_collation
    305         self._drop_last = loader.drop_last

~/fastai_dev/fastai2/core.py in __getattr__(self, k)
    207     def __getattr__(self,k):
    208         if k not in ('_xtra',self._default) and (self._xtra is None or k in self._xtra): return getattr(getattr(self,self._default), k)
--> 209         raise AttributeError(k)
    210     def __dir__(self): return custom_dir(self, self._xtra)
    211     def __setstate__(self,data): self.__dict__.update(data)

AttributeError: _dataset_kind

Any ideas what to make of this?

This is because you don’t have PyTorch 1.3

Also, I have fixed the tutorial to use the dataset in the format we have it (with untar_data) and not the original format.

Actually I think it’s because you do have v1.3, but you haven’t got the latest fastai v2.
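A quick way to double-check what is actually being picked up (assuming your fastai2 install exposes __version__):

import torch, fastai2

print(torch.__version__)    # the DataLoader internals above need 1.3 or later
print(fastai2.__version__)  # and make sure your editable install picked up the latest git pull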


Ooops, sorry :slight_smile:

Aha, thank you! Yeah, now that I think back on it, I believe I pulled but forgot to install the most bleeding-edge version. :stuck_out_tongue_winking_eye:

In 38_tutorial_ulmfit.ipynb, when I run:
learn = language_model_learner(dbunch, AWD_LSTM, vocab, metrics=[accuracy, Perplexity()], path=path, opt_func = partial(Adam, wd=0.1)).to_fp16()

I get a “not a gzip file” error; any suggestions from you experts?


Details

OSError                                   Traceback (most recent call last)
~/.conda/envs/fastai_env/lib/python3.7/tarfile.py in gzopen(cls, name, mode, fileobj, compresslevel, **kwargs)
   1644         try:
-> 1645             t = cls.taropen(name, mode, fileobj, **kwargs)
   1646         except OSError:

~/.conda/envs/fastai_env/lib/python3.7/tarfile.py in taropen(cls, name, mode, fileobj, **kwargs)
   1620             raise ValueError("mode must be 'r', 'a', 'w' or 'x'")
-> 1621         return cls(name, mode, fileobj, **kwargs)
   1622

~/.conda/envs/fastai_env/lib/python3.7/tarfile.py in __init__(self, name, mode, fileobj, format, tarinfo, dereference, ignore_zeros, encoding, errors, pax_headers, debug, errorlevel, copybufsize)
   1483         self.firstmember = None
-> 1484         self.firstmember = self.next()
   1485

~/.conda/envs/fastai_env/lib/python3.7/tarfile.py in next(self)
   2288             try:
-> 2289                 tarinfo = self.tarinfo.fromtarfile(self)
   2290             except EOFHeaderError as e:

~/.conda/envs/fastai_env/lib/python3.7/tarfile.py in fromtarfile(cls, tarfile)
   1093         """
-> 1094         buf = tarfile.fileobj.read(BLOCKSIZE)
   1095         obj = cls.frombuf(buf, tarfile.encoding, tarfile.errors)

~/.conda/envs/fastai_env/lib/python3.7/gzip.py in read(self, size)
    275             raise OSError(errno.EBADF, "read() on write-only GzipFile object")
--> 276         return self._buffer.read(size)
    277

~/.conda/envs/fastai_env/lib/python3.7/_compression.py in readinto(self, b)
     67         with memoryview(b) as view, view.cast("B") as byte_view:
---> 68             data = self.read(len(byte_view))
     69             byte_view[:len(data)] = data

~/.conda/envs/fastai_env/lib/python3.7/gzip.py in read(self, size)
    462                 self._init_read()
--> 463             if not self._read_gzip_header():
    464                 self._size = self._pos

~/.conda/envs/fastai_env/lib/python3.7/gzip.py in _read_gzip_header(self)
    410         if magic != b'\037\213':
--> 411             raise OSError('Not a gzipped file (%r)' % magic)
    412

OSError: Not a gzipped file (b'...')

During handling of the above exception, another exception occurred:

ReadError                                 Traceback (most recent call last)
<ipython-input-...> in <module>
----> 1 learn = language_model_learner(dbunch, AWD_LSTM, vocab, metrics=[accuracy, Perplexity()], path=path, opt_func = partial(Adam, wd=0.1))

~/environments/fastai_dev/dev/local/text/learner.py in language_model_learner(dbunch, arch, vocab, config, drop_mult, pretrained, pretrained_fnames, **kwargs)
     90         warn("There are no pretrained weights for that architecture yet!")
     91         return learn
---> 92     model_path = untar_data(meta['url'], c_key='model')
     93     fnames = [list(model_path.glob(f'*.{ext}'))[0] for ext in ['pth', 'pkl']]
     94     learn = learn.load_pretrained(*fnames, vocab)

~/environments/fastai_dev/dev/local/data/external.py in untar_data(url, fname, dest, c_key, force_download, extract_func)
    207     if _get_check(url) and _check_file(fname) != _get_check(url):
    208         print(f"File downloaded is broken. Remove {fname} and try again.")
--> 209     extract_func(fname, dest.parent)
    210     return dest

~/environments/fastai_dev/dev/local/data/external.py in tar_extract(fname, dest)
    189 def tar_extract(fname, dest):
    190     "Extract fname to dest using tarfile"
--> 191     tarfile.open(fname, 'r:gz').extractall(dest)
    192
    193 #Cell

~/.conda/envs/fastai_env/lib/python3.7/tarfile.py in open(cls, name, mode, fileobj, bufsize, **kwargs)
   1589             else:
   1590                 raise CompressionError("unknown compression type %r" % comptype)
-> 1591             return func(name, filemode, fileobj, **kwargs)
   1592
   1593         elif "|" in mode:

~/.conda/envs/fastai_env/lib/python3.7/tarfile.py in gzopen(cls, name, mode, fileobj, compresslevel, **kwargs)
   1647             fileobj.close()
   1648         if mode == 'r':
-> 1649             raise ReadError("not a gzip file")
   1650         raise
   1651     except:

ReadError: not a gzip file

Thanks.
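One thing that might narrow it down (a guess on my part, not a confirmed fix): the traceback fails on the gzip magic-byte check (b'\037\213'), so you can peek at the first two bytes of the downloaded archive to see whether it is actually gzipped or a truncated/HTML download. The cache location and file name below are assumptions; point it at wherever untar_data put the archive on your machine:

from pathlib import Path

fname = Path.home()/'.fastai'/'models'/'wt103-fwd.tgz'   # hypothetical path, adjust as needed
with open(fname, 'rb') as f:
    magic = f.read(2)
print(magic, magic == b'\x1f\x8b')   # gzip files start with 0x1f 0x8b (octal \037\213)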

Just have a quick question here to make sure I understand properly. We can define a sequence length: is that how far back in words/tokens the model gets to see at each step (e.g. the previous 7 words if sl=7)?
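In other words, to make sure I have the picture right, something like this toy chunking (a sketch of the idea, not the fastai implementation), where the target is the input shifted by one token:

tokens = list(range(22))   # stand-in for numericalized token ids
sl = 7
xs = [tokens[i:i+sl]     for i in range(0, len(tokens)-1, sl)]
ys = [tokens[i+1:i+sl+1] for i in range(0, len(tokens)-1, sl)]
for x, y in zip(xs, ys):
    print(x, '->', y)   # each target is the input window shifted by one position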

Also, I know that for IMDB_SAMPLE we can use attrgetter and pass in our column to grab the data. When I try replacing this with ColReader, my databunch is messed up: it pops my text out as an array instead of the values, @sgugger? The databunch itself does build:

imdb_clas = DataBlock(blocks=(TextBlock(vocab), CategoryBlock),
                      get_x=ColReader('text'),
                      get_y=ColReader('label'),
                      splitter=RandomSplitter())

dbunch = imdb_clas.databunch(df_tok, bs=64, seq_len=72)

Example output:

It is specifically on the x’s

Your dataframe should have listy elements, not strings. I suspect that’s why you have the problem.
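To illustrate the difference with a toy example (a sketch, not your actual data): after tokenization the text column holds token lists, so ColReader returns a list per row rather than a raw string.

import pandas as pd

raw = pd.DataFrame({'text': ['this movie was great'], 'label': ['pos']})                   # strings: will break
tok = pd.DataFrame({'text': [['xxbos', 'this', 'movie', 'was', 'great']], 'label': ['pos']})  # listy: ok
print(type(raw.loc[0, 'text']), type(tok.loc[0, 'text']))   # <class 'str'> vs <class 'list'>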


Ah yes! Good catch! It does look like it has listy elements after tokenize_df:

df = pd.read_csv(path/'texts.csv')
df_tok, count = tokenize_df(df, text_cols='text')
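A quick sanity check on that output (assuming tokenize_df returns the tokenized dataframe plus a Counter of token frequencies, which is my understanding of the current API):

print(type(df_tok['text'].iloc[0]))   # expect something list-like, not str
print(count.most_common(5))           # the five most frequent tokens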

Interesting conundrum actually, since this worked just fine for my language model:

imdb_lm = DataBlock(blocks=(TextBlock(vocab, is_lm=True), ),
                    get_x=ColReader('text'),
                    splitter=RandomSplitter(0.1))

Are you saying it’s due to my labels? That was exactly it! Thank you @sgugger :slight_smile: I’m surprised that the x’s must be passed via an attrgetter, though.

I tried a getter function, and that worked too:

def _imdb_items(x): return (x['text'], x['label'])

@sgugger sorry to bug you again :slight_smile: I’m trying to fit a classifier (just in case it was the databunch, I kept it as an attrgetter) and I get:

RuntimeError: `lengths` array must be sorted in decreasing order when `enforce_sorted` is True. You can pass `enforce_sorted=False` to pack_padded_sequence and/or pack_sequence to sidestep this requirement if you do not need ONNX exportability.

I build the databunch like so:

dbunch = imdb_clas.databunch(df_tok, bs=64, seq_len=72)

And the learner:

learn = text_classifier_learner(dbunch, AWD_LSTM, vocab, metrics=[accuracy],
                                drop_mult=0.5, opt_func=opt_func, path=path)
learn = learn.load_encoder('finetuned');
learn = learn.to_fp16(clip=0.1)

learn.fit_one_cycle(1)

Does this hint at my DataBunch’s sequences not aligning correctly?

Full stack trace:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-166-4dfb24161c57> in <module>()
----> 1 learn.fit_one_cycle(1)

11 frames
/usr/local/lib/python3.6/dist-packages/fastai2/callback/schedule.py in fit_one_cycle(self, n_epoch, lr_max, div, div_final, pct_start, wd, moms, cbs, reset_opt)
     96     scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
     97               'mom': combined_cos(pct_start, *moms)}
---> 98     self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
     99 
    100 #Cell

/usr/local/lib/python3.6/dist-packages/fastai2/learner.py in fit(self, n_epoch, lr, wd, cbs, reset_opt)
    259                     try:
    260                         self.epoch=epoch;          self('begin_epoch')
--> 261                         self._do_epoch_train()
    262                         self._do_epoch_validate()
    263                     except CancelEpochException:   self('after_cancel_epoch')

/usr/local/lib/python3.6/dist-packages/fastai2/learner.py in _do_epoch_train(self)
    237         try:
    238             self.dl = self.dbunch.train_dl;                  self('begin_train')
--> 239             self.all_batches()
    240         except CancelTrainException:                         self('after_cancel_train')
    241         finally:                                             self('after_train')

/usr/local/lib/python3.6/dist-packages/fastai2/learner.py in all_batches(self)
    215     def all_batches(self):
    216         self.n_iter = len(self.dl)
--> 217         for o in enumerate(self.dl): self.one_batch(*o)
    218 
    219     def one_batch(self, i, b):

/usr/local/lib/python3.6/dist-packages/fastai2/learner.py in one_batch(self, i, b)
    221         try:
    222             self._split(b);                                  self('begin_batch')
--> 223             self.pred = self.model(*self.xb);                self('after_pred')
    224             if len(self.yb) == 0: return
    225             self.loss = self.loss_func(self.pred, *self.yb); self('after_loss')

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    539             result = self._slow_forward(*input, **kwargs)
    540         else:
--> 541             result = self.forward(*input, **kwargs)
    542         for hook in self._forward_hooks.values():
    543             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py in forward(self, input)
     90     def forward(self, input):
     91         for module in self._modules.values():
---> 92             input = module(input)
     93         return input
     94 

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    539             result = self._slow_forward(*input, **kwargs)
    540         else:
--> 541             result = self.forward(*input, **kwargs)
    542         for hook in self._forward_hooks.values():
    543             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/dist-packages/fastai2/text/models/core.py in forward(self, input)
     82         raw_outputs,outputs,masks = [],[],[]
     83         for i in range(0, sl, self.bptt):
---> 84             r,o = self.module(input[:,i: min(i+self.bptt, sl)])
     85             masks.append(input[:,i: min(i+self.bptt, sl)] == self.pad_idx)
     86             raw_outputs.append(r)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    539             result = self._slow_forward(*input, **kwargs)
    540         else:
--> 541             result = self.forward(*input, **kwargs)
    542         for hook in self._forward_hooks.values():
    543             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/dist-packages/fastai2/text/models/awdlstm.py in forward(self, inp, from_embeds)
    106         new_hidden,raw_outputs,outputs = [],[],[]
    107         for l, (rnn,hid_dp) in enumerate(zip(self.rnns, self.hidden_dps)):
--> 108             if self.packed: raw_output = pack_padded_sequence(raw_output, lens, batch_first=True)
    109             raw_output, new_h = rnn(raw_output, self.hidden[l])
    110             if self.packed: raw_output = pad_packed_sequence(raw_output, batch_first=True)[0]

/usr/local/lib/python3.6/dist-packages/torch/nn/utils/rnn.py in pack_padded_sequence(input, lengths, batch_first, enforce_sorted)
    280 
    281     data, batch_sizes = \
--> 282         _VF._pack_padded_sequence(input, lengths, batch_first)
    283     return PackedSequence(data, batch_sizes, sorted_indices, None)
    284 

RuntimeError: `lengths` array must be sorted in decreasing order when `enforce_sorted` is True. You can pass `enforce_sorted=False` to pack_padded_sequence and/or pack_sequence to sidestep this requirement if you do not need ONNX exportability.

You should double-check your DataLoaders are SortedDL, and if not, pass this as dl_type.
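For context on why the ordering matters (a minimal standalone sketch of the PyTorch call from the traceback, not fastai code): pack_padded_sequence with the default enforce_sorted=True requires the per-sample lengths to be in decreasing order, which is what drawing batches from length-sorted samples gives you.

import torch
from torch.nn.utils.rnn import pack_padded_sequence

batch = torch.zeros(3, 5, 10)                               # 3 padded sequences, max length 5, 10 features
pack_padded_sequence(batch, [5, 3, 2], batch_first=True)    # ok: lengths are decreasing
# pack_padded_sequence(batch, [2, 5, 3], batch_first=True)  # raises the RuntimeError above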


They were not; they were TransformDL.

dbunch = imdb_clas.databunch(df_tok, bs=64, seq_len=72, dl_type=SortedDL) solved the issue :slight_smile:

Is this something that should be automatically assumed for text classification (i.e. made deterministic/automatic)? Just realized we can pass this in to our DataBlock call as dl_type. Easy enough! :slight_smile:

imdb_clas = DataBlock(blocks=(TextBlock(vocab), CategoryBlock),
                      get_items = _imdb_items,
                      splitter=RandomSplitter(),
                      dl_type=SortedDL)

Thanks for the help!!!

(Just noticed I missed that in your ULMFiT tutorial… oopsie!)


@sgugger, just looked at the source code for TextBlock:

def TextBlock(vocab=None, is_lm=False):
    return TransformBlock(type_tfms=Numericalize(vocab), dl_type=LMDataLoader if is_lm else SortedDL,
                          dbunch_kwargs={} if is_lm else {'before_batch': pad_input})

It should already be doing this, so any thoughts as to why it defaults to a TransformDL instead? (I’ll look into this later if I can; just passing on what I found.)

There’s probably a bug in DataBlock then. Since it only happens with a target and not for the LM, I think the dl_type may be erased by the CategoryBlock; I will investigate and fix tomorrow.


@muellerzr, it seems you had the same problem as me when creating the learner for a text model: it seems the pretrained AWD_LSTM model does not exist. Any idea? Thanks.

Yi Fan

It does; I was able to get it. Which version are you using? (I had to put a PR in because it wasn’t there; I think it was merged last week or so.)

It works now. I did not check before asking you the question. Sorry. :wink:

Hi,

I’m trying to understand the AWD_LSTM, and I came across the following two lines (link below):

raw_output = self.input_dp(inp if from_embeds else self.encoder_dp(inp))
new_hidden,raw_outputs,outputs = [],[],[]

I do not fully understand what’s going on in the first line, but it seems like it is for nothing because of the line right below it. Bug or feature :wink:?

raw_output != raw_outputs

You see later at line 112 that the raw_output is appended to raw_outputs:
raw_outputs.append(raw_output)
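In other words (a toy sketch of the control flow, not the real model): the first line produces the dropped-out embedded input in raw_output, which is then fed to the first RNN layer and overwritten by each layer’s output in turn, while raw_outputs collects one entry per layer for later use.

raw_output = 'embedded input'          # what the first line produces
raw_outputs = []                       # reset by the second line
for layer in ['rnn0', 'rnn1', 'rnn2']:
    raw_output = f'{layer}({raw_output})'   # each layer consumes the previous layer's output
    raw_outputs.append(raw_output)          # and its result is recorded per layer
print(raw_outputs)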
