Fitting resnet101 fails after installing fastai 1.0.27

Hi there,

I am trying to fit a resnet101. It worked perfectly before I upgraded to fastai 1.0.27:

np.random.seed(42)
data = ImageDataBunch.from_folder(trainingDataDestinationRoot, train=".", valid_pct=0.1,
                                  ds_tfms=get_transforms(), size=224, num_workers=4, bs=32).normalize(imagenet_stats)
learn = create_cnn(data, models.resnet101, metrics=error_rate)
learn.fit_one_cycle(10, max_lr=slice(2e-3))

fit_one_cycle fails with the following error message after the first progress bar finishes:


TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
----> 1 learn.fit_one_cycle(10, max_lr=slice(2e-3))

~/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, wd, callbacks, **kwargs)
     18     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor,
     19                                        pct_start=pct_start, **kwargs))
---> 20     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
     21
     22 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, **kwargs:Any):

~/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    160         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    161         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 162             callbacks=self.callbacks+callbacks)
    163
    164     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     92     except Exception as e:
     93         exception = e
---> 94         raise e
     95     finally: cb_handler.on_train_end(exception)
     96

~/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     87             if hasattr(data,'valid_dl') and data.valid_dl is not None:
     88                 val_loss = validate(model, data.valid_dl, loss_func=loss_func,
---> 89                                     cb_handler=cb_handler, pbar=pbar)
     90             else: val_loss=None
     91             if cb_handler.on_epoch_end(val_loss): break

~/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/fastai/basic_train.py in validate(model, dl, loss_func, cb_handler, pbar, average, n_batch)
     47     with torch.no_grad():
     48         val_losses,nums = [],[]
---> 49         for xb,yb in progress_bar(dl, parent=pbar, leave=(pbar is not None)):
     50             if cb_handler: xb, yb = cb_handler.on_batch_begin(xb, yb, train=False)
     51             val_losses.append(loss_batch(model, xb, yb, loss_func, cb_handler=cb_handler))

~/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/fastprogress/fastprogress.py in __iter__(self)
     63         self.update(0)
     64         try:
---> 65             for i,o in enumerate(self._gen):
     66                 yield o
     67                 if self.auto_update: self.update(i+1)

~/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/fastai/basic_data.py in __iter__(self)
     45     def __iter__(self):
     46         "Process and returns items from DataLoader."
---> 47         for b in self.dl:
     48             y = b[1][0] if is_listy(b[1]) else b[1]
     49             if not self.skip_size1 or y.size(0) != 1: yield self.proc_batch(b)

~/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
    635                 self.reorder_dict[idx] = batch
    636                 continue
--> 637             return self._process_next_batch(batch)
    638
    639     next = __next__  # Python 2 compatibility

~/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _process_next_batch(self, batch)
    656         self._put_indices()
    657         if isinstance(batch, ExceptionWrapper):
--> 658             raise batch.exc_type(batch.exc_msg)
    659         return batch
    660

TypeError: Traceback (most recent call last):
  File "/home/farbgeist/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/farbgeist/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/fastai/torch_core.py", line 94, in data_collate
    return torch.utils.data.dataloader.default_collate(to_data(batch))
  File "/home/farbgeist/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/home/farbgeist/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/home/farbgeist/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 223, in default_collate
    return torch.LongTensor(batch)
TypeError: an integer is required (got type NoneType)
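
For what it's worth, the last frame shows default_collate calling torch.LongTensor on the batch of labels, which fails this way when one of the labels comes back as None. A minimal standalone sketch (hypothetical labels, same old-style torch import path as in the trace) that reproduces the message:

import torch
from torch.utils.data.dataloader import default_collate

# One item in the batch ended up with no label at all.
batch = [(torch.zeros(3, 8, 8), 1),
         (torch.zeros(3, 8, 8), None)]
default_collate(batch)  # TypeError: an integer is required (got type NoneType)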

Any idea how to fix this?

Best regards,
Carsten


I’m getting this too. Did you manage to fix it?

Notebook with full stack trace:

The lesson 1 pets notebook works fine with resnet34 and resnet101, but if I modify your gist to use the pets data, it hits the same error. The error happens after the first training epoch, but before the validation run.

I’m on 1.0.28.

Comparing the two notebooks to see what is different should help locate the issue.

Got it working again. I'm not sure what exactly did the trick, but here is what I did to make it work:

  1. Tried the Cats vs. Dogs example once again to make sure my fastai setup was not completely broken.
  2. Removed the "model" directory that fastai created in my training data folder, and moved the training data one folder level deeper, so the model directory gets created one level above.
  3. Cut my training data down to a small fraction to make sure the problem was not due to corrupted training data (this is what actually made training work again; see the sketch after this list).
  4. Removed unbalanced classes with very few training images (some classes had only 1-10 training images) and added them to a class called "background".
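
One way to do the corrupted-image check from step 3 is fastai's verify_images helper (assuming your fastai 1.0.x version includes it; the per-class loop below is just a sketch, reusing trainingDataDestinationRoot from my code above):

from pathlib import Path
from fastai.vision import verify_images

# Report, but do not delete, images that cannot be opened, one class folder at a time.
for cls_dir in Path(trainingDataDestinationRoot).iterdir():
    if cls_dir.is_dir():
        verify_images(cls_dir, delete=False)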

@tamlyn:
Maybe try some of those steps and let us know what helped.

Thanks for the tips. I’ve narrowed it down to .split_by_idx causing the error during the validation phase. If I replace that with .random_split_by_pct it works, but I don’t want a random split. I also tried .split_by_files but that doesn’t work either.
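
For reference, this is roughly the pipeline I mean (fastai 1.0.x data block API; df, path and valid_idx are placeholders from my notebook):

src = (ImageItemList.from_df(df, path, folder='train', suffix='.jpg')
       .split_by_idx(valid_idx)       # fails during the validation phase
       # .random_split_by_pct(0.2)    # works, but the split is random
       .label_from_df())
data = src.databunch(bs=32).normalize(imagenet_stats)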

Smells like a bug in fastai to me but I need to narrow it down further.

It looks like a bug to me.

split_by_files creates func = lambda o: o.name in valid_names and passes it into split_by_valid_func, which iterates through self.items. By default self.items is a np.array of str, and str does not have a .name attribute (quick illustration below).
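
The mismatch in two lines (plain Python, hypothetical filename):

from pathlib import Path

Path('train/cat.1.jpg').name   # 'cat.1.jpg'
'train/cat.1.jpg'.name         # AttributeError: 'str' object has no attribute 'name'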

Need some help here: @Sylvain

No, self.items should be an array of Path objects at this stage.

Here is what happens, checked against fastai __version__ = '1.0.30':

  1. src = ImageItemList.from_df
  2. In ImageItemList.from_df we have res = super().from_df(df, path=path, cols=cols, **kwargs)
  3. In the parent ItemList.from_df we have:

  • inputs = df.iloc[:,df_names_to_idx(cols, df)]
  • res = cls(items=_maybe_squeeze(inputs.values), path=path, xtra=df, **kwargs)

  4. These are our items; df.iloc[:,df_names_to_idx(cols, df)] returns an array of str, as those are image filenames like some_random.image.jpg
  5. And def _maybe_squeeze(arr): return (arr if is1d(arr) else np.squeeze(arr))
  6. After step 5, when res gets its value back in ImageItemList.from_df, res.items gets rewritten:

  • res.items = np.char.add(np.char.add(f'{folder}/', res.items.astype(str)), suffix)
  • res.items = np.char.add(f'{res.path}/', res.items)

So self.items is an np.array of str, not of Path objects (standalone illustration below).
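
A standalone illustration (plain numpy, hypothetical paths) of how those np.char.add calls leave the items as str rather than Path:

import numpy as np

items = np.array(['cat.1', 'dog.2'])   # filename stems from the df column
items = np.char.add(np.char.add('train/', items.astype(str)), '.jpg')
items = np.char.add('/data/pets/', items)
print(items)           # ['/data/pets/train/cat.1.jpg' '/data/pets/train/dog.2.jpg']
print(type(items[0]))  # <class 'numpy.str_'> -- a str subclass with no .name, unlike Path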


Good catch, pushed a fix.
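
A hypothetical sketch of what such a fix can look like (not necessarily the actual commit): coerce each item to Path before reading .name, so str and Path items are both handled.

from pathlib import Path

valid_names = {'cat.1.jpg'}  # hypothetical validation filenames

# Hypothetical variant of the split_by_files predicate:
func = lambda o: Path(o).name in valid_names
func('train/cat.1.jpg')  # True, even when o is a plain str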
