Hi!
First-time poster, so please forgive me if this is in the wrong place! (I did search the forums first to try to find an answer before posting.)
I’m trying to follow along with Lesson 2, in the context of wanting to use some PyTorch datasets we already have set up at work with fastai’s DataBlock API. In particular, things like .show_batch() and the cleaner seem immediately useful for inspecting our data, and the DataBlock API overall sounds great for combining data in different ways to reframe problems, so I’d love to be able to use it!
I’ve been approaching this by trying to follow along with the DataBlock tutorial (notebook 50_tutorial.datablock.ipynb), adapting it to use an existing dataset rather than a path / list of filenames. The existing dataset is basically a collection of images and their corresponding targets: each item in existing_dataset is a dict with keys ['images', 'target'], where 'images' maps to a tensor of shape [3, 256, 256] and 'target' maps to a tensor of shape [1]. So in theory it’s structurally super similar to the example in Lesson 2 and in the DataBlock tutorial notebook.
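To make that structure concrete, here’s a tiny synthetic stand-in (the real class is internal, so the class name and sizes here are invented purely for illustration):

```python
# Hypothetical stand-in for our internal dataset class; only the item
# structure (a dict with 'images' and 'target' tensors) matches the real thing.
import torch
from torch.utils.data import Dataset

class FakeExistingDataset(Dataset):
    def __len__(self):
        return 100

    def __getitem__(self, i):
        return {
            'images': torch.rand(3, 256, 256),  # CHW float image tensor
            'target': torch.rand(1),            # scalar regression target
        }

existing_dataset = FakeExistingDataset()
```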
Before getting to dblock.dataloaders(), I figured I would start by trying to get the simpler dblock.datasets() going. I would think that something like this would work:
```python
from fastai.vision.all import *  # DataBlock, ImageBlock, RegressionBlock, RandomSplitter

def get_items(existing_dataset):
    # just a passthrough
    return existing_dataset

def get_target_from_item(item):
    import pdb
    pdb.set_trace()  # for debug only, but isn't executed?
    return item['target']  # shape: torch.Size([1])

def get_image_from_item(item):
    import pdb
    pdb.set_trace()  # for debug only, but isn't executed?
    return item['images']  # shape: torch.Size([3, 256, 256])

dblock = DataBlock(
    blocks=(ImageBlock, RegressionBlock),
    get_items=get_items,
    get_y=get_target_from_item,
    get_x=get_image_from_item,
    splitter=RandomSplitter(),
)

print(f"{len(existing_dataset)=}")
print(existing_dataset[0])

dsets = dblock.datasets(existing_dataset, verbose=True)
dsets.train[0]
```
But unfortunately this doesn’t work: I get an IndexError deep inside fastcore.foundation. (Full traceback below, since it’s long.) My first instinct was to check that I’m at least getting the image and target correctly with get_x and get_y, so I put in those pdb.set_trace() statements so I could inspect the item and see what I was returning… but it doesn’t seem like the pdb breakpoints there are ever even executed! Judging from the verbose output, the failure happens while setting up the get_image_from_item -> PILBase.create pipeline, i.e. during transform setup and apparently before either getter is ever called on a real item, which would explain why the breakpoints never fire. Is there something structurally broken about this approach even before we get to using the actual data?
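For what it’s worth, calling the getters by hand on a single item (with the pdb lines commented out) works fine, at least on the synthetic stand-in above, so the functions themselves seem OK:

```python
# Sanity check against the synthetic stand-in defined earlier.
item = existing_dataset[0]
print(get_image_from_item(item).shape)   # torch.Size([3, 256, 256])
print(get_target_from_item(item).shape)  # torch.Size([1])
```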
I’m feeling stuck. I’ve been reading around a bunch, and I see lots of people using the provided example datasets or creating their own in a similar format (files in the filesystem), but I haven’t yet found a case where someone uses an existing (PyTorch) dataset with the DataBlock API. Could someone point me in the right direction, please? Thank you so much!
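In case it helps to know what else I’ve considered: one fallback might be to skip DataBlock entirely and wrap the PyTorch dataset with the mid-level DataLoaders.from_dsets, sketched below (the bs=64 and 80/20 split are arbitrary assumptions on my part). The catch, as far as I can tell, is that batches would come out as dicts rather than (x, y) tuples, so I assume conveniences like show_batch() wouldn’t work, and those are exactly what I’m hoping to get from the DataBlock API.

```python
# Fallback sketch, not what I actually want: wrap the existing PyTorch
# dataset directly in fastai DataLoaders (bs and split fraction arbitrary).
from torch.utils.data import random_split
from fastai.data.core import DataLoaders

n_valid = int(len(existing_dataset) * 0.2)
train_ds, valid_ds = random_split(
    existing_dataset, [len(existing_dataset) - n_valid, n_valid])

dls = DataLoaders.from_dsets(train_ds, valid_ds, bs=64)
batch = next(iter(dls.train))  # a dict of batched tensors, not an (x, y) tuple
```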
Full output with traceback:

```
len(existing_dataset)=3513681
{'targets': tensor([0.5000]), 'images': tensor([[[0.8549, 0.8549, 0.8549, ..., 0.8706, 0.8706, 0.8706],
[0.8549, 0.8549, 0.8549, ..., 0.8706, 0.8706, 0.8706],
[0.8549, 0.8549, 0.8549, ..., 0.8706, 0.8706, 0.8706],
...,
[0.8588, 0.8588, 0.8588, ..., 0.8902, 0.8902, 0.8902],
[0.8588, 0.8588, 0.8588, ..., 0.8902, 0.8902, 0.8902],
[0.8588, 0.8588, 0.8588, ..., 0.8902, 0.8902, 0.8902]],
[[0.8588, 0.8588, 0.8588, ..., 0.8745, 0.8745, 0.8745],
[0.8588, 0.8588, 0.8588, ..., 0.8745, 0.8745, 0.8745],
[0.8588, 0.8588, 0.8588, ..., 0.8745, 0.8745, 0.8745],
...,
[0.8549, 0.8549, 0.8549, ..., 0.8902, 0.8902, 0.8902],
[0.8549, 0.8549, 0.8549, ..., 0.8902, 0.8902, 0.8902],
[0.8549, 0.8549, 0.8549, ..., 0.8902, 0.8902, 0.8902]],
[[0.8392, 0.8353, 0.8392, ..., 0.8824, 0.8824, 0.8824],
[0.8392, 0.8353, 0.8392, ..., 0.8824, 0.8824, 0.8824],
[0.8392, 0.8353, 0.8392, ..., 0.8824, 0.8824, 0.8824],
...,
[0.8392, 0.8392, 0.8392, ..., 0.8824, 0.8824, 0.8824],
[0.8392, 0.8392, 0.8392, ..., 0.8824, 0.8824, 0.8824],
[0.8392, 0.8392, 0.8392, ..., 0.8824, 0.8824, 0.8824]]])}
Collecting items from <existing_dataset_class[redacted] object at 0x7f04a8f414c0>
Found 3513681 items
2 datasets of sizes 2810945,702736
Setting up Pipeline: get_image_from_item -> PILBase.create
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Input In [95], in <cell line: 34>()
32 print(f"{len(existing_dataset)=}")
33 print(existing_dataset[0])
---> 34 dsets = dblock.datasets(existing_dataset, verbose=True)
36 dsets.train[0]
File /opt/conda/lib/python3.8/site-packages/fastai/data/block.py:147, in DataBlock.datasets(self, source, verbose)
145 splits = (self.splitter or RandomSplitter())(items)
146 pv(f"{len(splits)} datasets of sizes {','.join([str(len(s)) for s in splits])}", verbose)
--> 147 return Datasets(items, tfms=self._combine_type_tfms(), splits=splits, dl_type=self.dl_type, n_inp=self.n_inp, verbose=verbose)
File /opt/conda/lib/python3.8/site-packages/fastai/data/core.py:451, in Datasets.__init__(self, items, tfms, tls, n_inp, dl_type, **kwargs)
442 def __init__(self,
443 items:list=None, # List of items to create `Datasets`
444 tfms:list|Pipeline=None, # List of `Transform`(s) or `Pipeline` to apply
(...)
448 **kwargs
449 ):
450 super().__init__(dl_type=dl_type)
--> 451 self.tls = L(tls if tls else [TfmdLists(items, t, **kwargs) for t in L(ifnone(tfms,[None]))])
452 self.n_inp = ifnone(n_inp, max(1, len(self.tls)-1))
File /opt/conda/lib/python3.8/site-packages/fastai/data/core.py:451, in <listcomp>(.0)
442 def __init__(self,
443 items:list=None, # List of items to create `Datasets`
444 tfms:list|Pipeline=None, # List of `Transform`(s) or `Pipeline` to apply
(...)
448 **kwargs
449 ):
450 super().__init__(dl_type=dl_type)
--> 451 self.tls = L(tls if tls else [TfmdLists(items, t, **kwargs) for t in L(ifnone(tfms,[None]))])
452 self.n_inp = ifnone(n_inp, max(1, len(self.tls)-1))
File /opt/conda/lib/python3.8/site-packages/fastcore/foundation.py:98, in _L_Meta.__call__(cls, x, *args, **kwargs)
96 def __call__(cls, x=None, *args, **kwargs):
97 if not args and not kwargs and x is not None and isinstance(x,cls): return x
---> 98 return super().__call__(x, *args, **kwargs)
File /opt/conda/lib/python3.8/site-packages/fastai/data/core.py:365, in TfmdLists.__init__(self, items, tfms, use_list, do_setup, split_idx, train_setup, splits, types, verbose, dl_type)
363 if do_setup:
364 pv(f"Setting up {self.tfms}", verbose)
--> 365 self.setup(train_setup=train_setup)
File /opt/conda/lib/python3.8/site-packages/fastai/data/core.py:386, in TfmdLists.setup(self, train_setup)
383 def setup(self,
384 train_setup:bool=True # Apply `Transform`(s) only on training `DataLoader`
385 ):
--> 386 self.tfms.setup(self, train_setup)
387 if len(self) != 0:
388 x = super().__getitem__(0) if self.splits is None else super().__getitem__(self.splits[0])[0]
File /opt/conda/lib/python3.8/site-packages/fastcore/transform.py:200, in Pipeline.setup(self, items, train_setup)
198 tfms = self.fs[:]
199 self.fs.clear()
--> 200 for t in tfms: self.add(t,items, train_setup)
File /opt/conda/lib/python3.8/site-packages/fastcore/transform.py:204, in Pipeline.add(self, ts, items, train_setup)
202 def add(self,ts, items=None, train_setup=False):
203 if not is_listy(ts): ts=[ts]
--> 204 for t in ts: t.setup(items, train_setup)
205 self.fs+=ts
206 self.fs = self.fs.sorted(key='order')
File /opt/conda/lib/python3.8/site-packages/fastcore/transform.py:87, in Transform.setup(self, items, train_setup)
85 def setup(self, items=None, train_setup=False):
86 train_setup = train_setup if self.train_setup is None else self.train_setup
---> 87 return self.setups(getattr(items, 'train', items) if train_setup else items)
File /opt/conda/lib/python3.8/site-packages/fastai/data/core.py:338, in <lambda>(i, x)
334 dls = [dl] + [dl.new(self.subset(i), **merge(kwargs,def_kwargs,val_kwargs,dl_kwargs[i]))
335 for i in range(1, self.n_subsets)]
336 return self._dbunch_type(*dls, path=path, device=device)
--> 338 FilteredBase.train,FilteredBase.valid = add_props(lambda i,x: x.subset(i))
340 # %% ../../nbs/03_data.core.ipynb 52
341 class TfmdLists(FilteredBase, L, GetAttr):
File /opt/conda/lib/python3.8/site-packages/fastai/data/core.py:373, in TfmdLists.subset(self, i)
--> 373 def subset(self, i): return self._new(self._get(self.splits[i]), split_idx=i)
File /opt/conda/lib/python3.8/site-packages/fastcore/foundation.py:120, in L._get(self, i)
116 if is_indexer(i) or isinstance(i,slice): return getattr(self.items,'iloc',self.items)[i]
117 i = mask2idxs(i)
118 return (self.items.iloc[list(i)] if hasattr(self.items,'iloc')
119 else self.items.__array__()[(i,)] if hasattr(self.items,'__array__')
--> 120 else [self.items[i_] for i_ in i])
File /opt/conda/lib/python3.8/site-packages/fastcore/foundation.py:120, in <listcomp>(.0)
116 if is_indexer(i) or isinstance(i,slice): return getattr(self.items,'iloc',self.items)[i]
117 i = mask2idxs(i)
118 return (self.items.iloc[list(i)] if hasattr(self.items,'iloc')
119 else self.items.__array__()[(i,)] if hasattr(self.items,'__array__')
--> 120 else [self.items[i_] for i_ in i])
IndexError: list index out of range
```