Fitting resnet101 fails after installing fastai 1.0.27

Hi there,

I am trying to fit a resnet101. It worked perfectly before I upgraded to fastai 1.0.27:

np.random.seed(42)
data = ImageDataBunch.from_folder(trainingDataDestinationRoot, train=".", valid_pct=0.1,
                                  ds_tfms=get_transforms(), size=224, num_workers=4, bs=32).normalize(imagenet_stats)
learn = create_cnn(data, models.resnet101, metrics=error_rate)
learn.fit_one_cycle(10, max_lr=slice(2e-3))

fit_one_cycle fails with the following error message after the first progress bar finishes:


TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
----> 1 learn.fit_one_cycle(10, max_lr=slice(2e-3))

~/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, wd, callbacks, **kwargs)
     18     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor,
     19                                        pct_start=pct_start, **kwargs))
---> 20     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
     21
     22 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, **kwargs:Any):

~/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    160         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    161         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 162             callbacks=self.callbacks+callbacks)
    163
    164     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     92     except Exception as e:
     93         exception = e
---> 94         raise e
     95     finally: cb_handler.on_train_end(exception)
     96

~/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     87             if hasattr(data,'valid_dl') and data.valid_dl is not None:
     88                 val_loss = validate(model, data.valid_dl, loss_func=loss_func,
---> 89                                     cb_handler=cb_handler, pbar=pbar)
     90             else: val_loss=None
     91             if cb_handler.on_epoch_end(val_loss): break

~/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/fastai/basic_train.py in validate(model, dl, loss_func, cb_handler, pbar, average, n_batch)
     47     with torch.no_grad():
     48         val_losses,nums = [],[]
---> 49         for xb,yb in progress_bar(dl, parent=pbar, leave=(pbar is not None)):
     50             if cb_handler: xb, yb = cb_handler.on_batch_begin(xb, yb, train=False)
     51             val_losses.append(loss_batch(model, xb, yb, loss_func, cb_handler=cb_handler))

~/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/fastprogress/fastprogress.py in __iter__(self)
     63         self.update(0)
     64         try:
---> 65             for i,o in enumerate(self._gen):
     66                 yield o
     67                 if self.auto_update: self.update(i+1)

~/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/fastai/basic_data.py in __iter__(self)
     45     def __iter__(self):
     46         "Process and returns items from DataLoader."
---> 47         for b in self.dl:
     48             y = b[1][0] if is_listy(b[1]) else b[1]
     49             if not self.skip_size1 or y.size(0) != 1: yield self.proc_batch(b)

~/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
    635                 self.reorder_dict[idx] = batch
    636                 continue
--> 637             return self._process_next_batch(batch)
    638
    639     next = __next__  # Python 2 compatibility

~/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _process_next_batch(self, batch)
    656         self._put_indices()
    657         if isinstance(batch, ExceptionWrapper):
--> 658             raise batch.exc_type(batch.exc_msg)
    659         return batch
    660

TypeError: Traceback (most recent call last):
  File "/home/farbgeist/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/farbgeist/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/fastai/torch_core.py", line 94, in data_collate
    return torch.utils.data.dataloader.default_collate(to_data(batch))
  File "/home/farbgeist/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/home/farbgeist/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/home/farbgeist/miniconda3/envs/fastai-3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 223, in default_collate
    return torch.LongTensor(batch)
TypeError: an integer is required (got type NoneType)
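
For what it's worth, the last frame shows default_collate calling torch.LongTensor on the batch of labels, which fails this way when one of the labels comes back as None. A minimal standalone sketch (hypothetical labels, same old-style torch import path as in the trace) that reproduces the message:

import torch
from torch.utils.data.dataloader import default_collate

# One item in the batch ended up with no label at all.
batch = [(torch.zeros(3, 8, 8), 1),
         (torch.zeros(3, 8, 8), None)]
default_collate(batch)  # TypeError: an integer is required (got type NoneType)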

Any idea how to fix this?

Best regards,
Carsten


I’m getting this too. Did you manage to fix it?

Notebook with full stack trace:

The lesson 1 pets notebook works fine with resnet34 and resnet101, but if I modify your gist to use the pets data, it hits the same error. The error happens after the first training epoch, but before the validation run.

I’m on 1.0.28.

Comparing the two notebooks to see what is different should help locate the issue.

Got it working again. I'm not sure what exactly did the trick, but here is what I did to make it work:

  1. Tried the Cats vs. Dogs example once again to make sure my fastai setup was not completely broken.
  2. Removed the "model" directory that fastai created in my training data folder, and moved the training data one folder level deeper, so the model directory gets created one level above.
  3. Cut my training data down to a small fraction to make sure the problem was not due to corrupted training data (this is what actually made training work again; see the sketch after this list).
  4. Removed unbalanced classes with very few training images (some classes had only 1-10 training images) and added them to a class called "background".
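
One way to do the corrupted-image check from step 3 is fastai's verify_images helper (assuming your fastai 1.0.x version includes it; the per-class loop below is just a sketch, reusing trainingDataDestinationRoot from my code above):

from pathlib import Path
from fastai.vision import verify_images

# Report, but do not delete, images that cannot be opened, one class folder at a time.
for cls_dir in Path(trainingDataDestinationRoot).iterdir():
    if cls_dir.is_dir():
        verify_images(cls_dir, delete=False)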

@tamlyn:
Maybe try some of those steps and let us know what helped.

Thanks for the tips. I’ve narrowed it down to .split_by_idx causing the error during the validation phase. If I replace that with .random_split_by_pct it works, but I don’t want a random split. I also tried .split_by_files but that doesn’t work either.
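
For reference, this is roughly the pipeline I mean (fastai 1.0.x data block API; df, path and valid_idx are placeholders from my notebook):

src = (ImageItemList.from_df(df, path, folder='train', suffix='.jpg')
       .split_by_idx(valid_idx)       # fails during the validation phase
       # .random_split_by_pct(0.2)    # works, but the split is random
       .label_from_df())
data = src.databunch(bs=32).normalize(imagenet_stats)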

Smells like a bug in fastai to me but I need to narrow it down further.

It looks like a bug to me.

split_by_files creates func = lambda o: o.name in valid_names and passes it into split_by_valid_func, which iterates through self.items. By default self.items is a np.array of str, and str does not have a .name attribute (quick illustration below).
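
The mismatch in two lines (plain Python, hypothetical filename):

from pathlib import Path

Path('train/cat.1.jpg').name   # 'cat.1.jpg'
'train/cat.1.jpg'.name         # AttributeError: 'str' object has no attribute 'name'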

Need some help here: @Sylvain

No, self.items should be an array of Path objects at this stage.

Here is what happens, checked against fastai __version__ = '1.0.30':

  1. src = ImageItemList.from_df
  2. In ImageItemList.from_df we have res = super().from_df(df, path=path, cols=cols, **kwargs)
  3. In the parent ItemList.from_df we have:

  • inputs = df.iloc[:,df_names_to_idx(cols, df)]
  • res = cls(items=_maybe_squeeze(inputs.values), path=path, xtra=df, **kwargs)

  4. These are our items; df.iloc[:,df_names_to_idx(cols, df)] returns an array of str, as those are image filenames like some_random.image.jpg
  5. And def _maybe_squeeze(arr): return (arr if is1d(arr) else np.squeeze(arr))
  6. After step 5, when res gets its value back in ImageItemList.from_df, res.items gets rewritten:

  • res.items = np.char.add(np.char.add(f'{folder}/', res.items.astype(str)), suffix)
  • res.items = np.char.add(f'{res.path}/', res.items)

So self.items is an np.array of str, not of Path objects (standalone illustration below).
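
A standalone illustration (plain numpy, hypothetical paths) of how those np.char.add calls leave the items as str rather than Path:

import numpy as np

items = np.array(['cat.1', 'dog.2'])   # filename stems from the df column
items = np.char.add(np.char.add('train/', items.astype(str)), '.jpg')
items = np.char.add('/data/pets/', items)
print(items)           # ['/data/pets/train/cat.1.jpg' '/data/pets/train/dog.2.jpg']
print(type(items[0]))  # <class 'numpy.str_'> -- a str subclass with no .name, unlike Path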


Good catch, pushed a fix.
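
A hypothetical sketch of what such a fix can look like (not necessarily the actual commit): coerce each item to Path before reading .name, so str and Path items are both handled.

from pathlib import Path

valid_names = {'cat.1.jpg'}  # hypothetical validation filenames

# Hypothetical variant of the split_by_files predicate:
func = lambda o: Path(o).name in valid_names
func('train/cat.1.jpg')  # True, even when o is a plain str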
