.no_split() not working

ChristophNeuner · February 10, 2019, 11:26am

Hello,

I am trying to set up a pipeline with the data block API where the validation set is empty, because I want to train the model on the whole dataset.
I am therefore using the .no_split() method, but an error is raised when I call .databunch().
A workaround for now would be to make the validation set very small, but I would like this to work with .no_split().

Thanks a lot in advance!

Here is my code and the stacktrace:

data = (ImageItemList.from_csv(path=PATH, csv_name=CSV_NAME, folder=TRAIN_NAME, suffix='.tif')
        #.random_split_by_pct(0.2)
        .no_split()
        .label_from_df()
        .add_test_folder(test_folder=TEST_NAME, label=None)
        .transform(tfms)
        .databunch(bs=bs, num_workers=nw)
        #.normalize(imagenet_stats))
        .normalize())

IndexError Traceback (most recent call last)
in
4 .add_test_folder(test_folder=TEST_NAME, label=None)
5 .transform(tfms)
----> 6 .databunch(bs=bs, num_workers=nw)
7 #.normalize(imagenet_stats))
8 .normalize())

~/work/network/fastai/fastai/vision/data.py in normalize(self, stats, do_x, do_y)
181 "Add normalize transform using stats (defaults to DataBunch.batch_stats)"
182 if getattr(self,'norm',False): raise Exception('Can not call normalize twice')
--> 183 if stats is None: self.stats = self.batch_stats()
184 else: self.stats = stats
185 self.norm,self.denorm = normalize_funcs(*self.stats, do_x=do_x, do_y=do_y)

~/work/network/fastai/fastai/vision/data.py in batch_stats(self, funcs)
175 "Grab a batch of data and call reduction function func per channel"
176 funcs = ifnone(funcs, [torch.mean,torch.std])
--> 177 x = self.one_batch(ds_type=DatasetType.Valid, denorm=False)[0].cpu()
178 return [func(channel_view(x), 1) for func in funcs]
179

~/work/network/fastai/fastai/basic_data.py in one_batch(self, ds_type, detach, denorm, cpu)
140 w = self.num_workers
141 self.num_workers = 0
--> 142 try: x,y = next(iter(dl))
143 finally: self.num_workers = w
144 if detach: x,y = to_detach(x,cpu=cpu),to_detach(y,cpu=cpu)

~/work/network/fastai/fastai/basic_data.py in __iter__(self)
69 def __iter__(self):
70 "Process and returns items from DataLoader."
---> 71 for b in self.dl: yield self.proc_batch(b)
72
73 @classmethod

/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
635 self.reorder_dict[idx] = batch
636 continue
--> 637 return self._process_next_batch(batch)
638
639 next = __next__ # Python 2 compatibility

/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _process_next_batch(self, batch)
656 self._put_indices()
657 if isinstance(batch, ExceptionWrapper):
--> 658 raise batch.exc_type(batch.exc_msg)
659 return batch
660

IndexError: Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
samples = collate_fn([dataset[i] for i in batch_indices])
File "…/fastai/fastai/data_block.py", line 587, in __getitem__
if self.item is None: x,y = self.x[idxs],self.y[idxs]
File "…/fastai/fastai/data_block.py", line 102, in __getitem__
if isinstance(idxs, numbers.Integral): return self.get(idxs)
File "…/fastai/fastai/vision/data.py", line 276, in get
fn = super().get(i)
File "…/fastai/fastai/data_block.py", line 62, in get
return self.items[i]
IndexError: index 0 is out of bounds for axis 0 with size 0

sgugger · February 10, 2019, 1:46pm

It doesn’t come from no_split but from normalize. Since you’re not passing any stats, it tries to compute them on a batch of the validation set, which fails because that set is empty.
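
To make the failing branch concrete, here is a minimal pure-Python sketch. It is not fastai's code: `one_batch` and `resolve_stats` are hypothetical stand-ins that only mirror the `if stats is None` check visible in the traceback, and plain lists stand in for the data splits. The point is that the empty split is only ever indexed when no stats are passed.

```python
from statistics import mean, pstdev

# ImageNet per-channel mean and std, the values fastai exposes as imagenet_stats.
IMAGENET_STATS = ([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])

def one_batch(items):
    """Hypothetical stand-in for fetching a batch: it reads the first item."""
    return [items[0]]  # raises IndexError when the split is empty

def resolve_stats(valid_items, stats=None):
    """Mirror the `if stats is None` branch shown in the traceback."""
    if stats is None:
        # Only reached when no stats are passed, as with a bare .normalize().
        values = [v for item in one_batch(valid_items) for v in item]
        return ([mean(values)], [pstdev(values)])
    return stats

empty_valid = []  # stands in for the empty validation set left by no_split()

# Explicit stats are returned as-is, so the empty split is never indexed.
print(resolve_stats(empty_valid, stats=IMAGENET_STATS) is IMAGENET_STATS)

try:
    resolve_stats(empty_valid)  # no stats, so the empty split is indexed
except IndexError as err:
    print("IndexError:", err)
```

In the pipeline from the first post, this corresponds to using the commented-out `.normalize(imagenet_stats)` line instead of the bare `.normalize()`, assuming ImageNet statistics are a reasonable fit for the data.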

ChristophNeuner · February 10, 2019, 2:15pm

Ah thanks a lot, that makes sense!
Is there a way to tell fastai that it should compute the stats from the training set?
Or would this be nonsense?

sgugger · February 10, 2019, 7:12pm

Note that you can do that yourself by adjusting the source code here.
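
As a rough idea of what such an adjustment amounts to, here is a framework-free sketch that falls back to the training split when the validation split is empty. The names `channel_stats` and `batch_stats_with_fallback` and the nested-list layout are assumptions made for illustration, not fastai's implementation. In fastai itself, the relevant spot would presumably be the `ds_type=DatasetType.Valid` argument visible in the traceback.

```python
from statistics import mean, stdev

def channel_stats(batch):
    """Per-channel mean and sample std for a batch laid out as [image][channel][pixel]."""
    n_channels = len(batch[0])
    means, stds = [], []
    for c in range(n_channels):
        # Pool the pixels of channel c across every image in the batch.
        pixels = [p for image in batch for p in image[c]]
        means.append(mean(pixels))
        stds.append(stdev(pixels))
    return means, stds

def batch_stats_with_fallback(train_items, valid_items, bs=64):
    """Use the validation split when it has items, otherwise the training split."""
    source = valid_items if len(valid_items) > 0 else train_items
    return channel_stats(source[:bs])

# Two tiny 2-channel "images" with two pixels per channel.
train = [
    [[0.0, 1.0], [2.0, 2.0]],
    [[1.0, 2.0], [4.0, 4.0]],
]

means, stds = batch_stats_with_fallback(train, valid_items=[])
print(means, stds)
```

The resulting lists could then be passed to `.normalize(...)` explicitly, which skips the automatic lookup altogether.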

ChristophNeuner · February 12, 2019, 7:05pm

Thanks a lot!

YounessELMARHRAOUI · July 9, 2020, 2:11pm

Instead of no_split() try using split_none().

Here is my code:

data = (TabularList.from_df(df_train, path='.', cat_names=cat_names, cont_names=cont_names, procs=procs)
                    .split_none()
                    .label_from_df(cols = dep_var)
                    .add_test(test, label=0)
                    .databunch())
