Generate a partial subset from validation.ds

I am running out of memory when using interp = ClassificationInterpretation.from_learner(learn)

As suggested by Sylvain, I would like to generate a partial subset of the validation set in order to run the previous command. I tried something like:

interp = ClassificationInterpretation.from_learner(learn, slice(dls.valid_ds[int(1):int(1000)]))

But I got an error:

TypeError                                 Traceback (most recent call last)
<ipython-input-49-e87882ae4fd1> in <module>
----> 1 interp = ClassificationInterpretation.from_learner(learn, slice(dls.valid_ds[int(1):int(1000)]))

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/fastai2/ in from_learner(cls, learn, ds_idx, dl, act)
     23     def from_learner(cls, learn, ds_idx=1, dl=None, act=None):
     24         "Construct interpretatio object from a learner"
---> 25         if dl is None: dl = learn.dls[ds_idx]
     26         return cls(dl, *learn.get_preds(dl=dl, with_input=True, with_loss=True, with_decoded=True, act=None))

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/fastai2/data/ in __getitem__(self, i)
    121         self.device = device
--> 123     def __getitem__(self, i): return self.loaders[i]
    124     def new_empty(self):
    125         loaders = [ for dl in self.loaders]

TypeError: slice indices must be integers or None or have an __index__ method

Any idea how to generate this subset? Thanks!
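
For reference, the TypeError can be reproduced without fastai. from_learner takes ds_idx as its second positional argument, so the slice object ends up in learn.dls[ds_idx], and list indexing rejects a slice whose bound is a list rather than an integer. A minimal plain-Python sketch, where loaders and subset are stand-ins for illustration and not fastai objects:

```python
# Stand-ins for illustration only: plain lists instead of fastai objects.
loaders = ["train_dl", "valid_dl"]
subset = [10, 11, 12]          # plays the role of dls.valid_ds[1:1000]

# slice(x) with one argument means slice(None, x, None): the list becomes the stop bound.
bad_index = slice(subset)

try:
    loaders[bad_index]
    message = None
except TypeError as err:
    message = str(err)

print(message)
```

So the problem in this call is the argument being passed as ds_idx, not the size of the subset.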

Partial answer:

dls.valid =[:1000])

This generates the subset. However, when running:

interp = ClassificationInterpretation.from_learner(learn)

I got another error:

AttributeError                            Traceback (most recent call last)
<ipython-input-13-aa7f7b70a42b> in <module>
----> 1 interp = ClassificationInterpretation.from_learner(learn)

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/fastai2/ in from_learner(cls, learn, ds_idx, dl, act)
     24         "Construct interpretatio object from a learner"
     25         if dl is None: dl = learn.dls[ds_idx]
---> 26         return cls(dl, *learn.get_preds(dl=dl, with_input=True, with_loss=True, with_decoded=True, act=None))
     28     def top_losses(self, k=None, largest=True):

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/fastai2/ in __init__(self, dl, inputs, preds, targs, decoded, losses)
     51     def __init__(self, dl, inputs, preds, targs, decoded, losses):
     52         super().__init__(dl, inputs, preds, targs, decoded, losses)
---> 53         self.vocab = self.dl.vocab
     54         if is_listy(self.vocab): self.vocab = self.vocab[-1]

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/fastcore/ in __getattr__(self, k)
    220         if self._component_attr_filter(k):
    221             attr = getattr(self,self._default,None)
--> 222             if attr is not None: return getattr(attr,k)
    223         raise AttributeError(k)
    224     def __dir__(self): return custom_dir(self,self._dir())

AttributeError: 'list' object has no attribute 'vocab'

Which is weird because dls.vocab gives me the correct labels.

FINAL answer:

dls.valid =[:1000])
dls[1].vocab = dls.vocab
interp = ClassificationInterpretation.from_learner(learn)

The fact you are losing the vocab is weird though. I’ll try to look at why it happens because it’s not a good sign.


Ok, here is what is happening: dls.valid_ds[:1000] is a list. It’s not a Datasets anymore, so it loses all information about the inner transforms.
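
A rough plain-Python sketch of that effect, using a hypothetical TinyDataset class as a stand-in for Datasets (this is not fastai's implementation): slicing hands back a bare list, so attributes such as vocab are no longer reachable.

```python
class TinyDataset:
    """Hypothetical stand-in for a dataset that carries extra metadata."""

    def __init__(self, items, vocab):
        self.items = items
        self.vocab = vocab

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        # A slice returns a plain list of elements, not another TinyDataset.
        if isinstance(idx, slice):
            return [self.items[i] for i in range(*idx.indices(len(self.items)))]
        return self.items[idx]


ds = TinyDataset(items=list(range(5000)), vocab=["cat", "dog"])
subset = ds[:1000]

print(type(subset).__name__)     # list
print(hasattr(ds, "vocab"))      # True
print(hasattr(subset, "vocab"))  # False
```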

The fastest way to get a proper dl that works is to use DataLoaders.test_dl, which allows you to keep the labels if you pass along with_labels=True:

new_dl = dls.test_dl(dls.valid_ds.items[:1000], with_labels=True)

(note the .items to pass the filenames and not the actual elements of the dataset). You can even store it as a new validation dataloader (and keep the one you have) with:

dls.loaders.append(dls.test_dl(dls.valid_ds.items[:1000], with_labels=True))

and then use

interp = ClassificationInterpretation.from_learner(learn, ds_idx=2)

Hi! I want to use the hack above, to subset my training set:

Eg: MNIST, downloaded using untar_data:

mnist_block = DataBlock(blocks=(ImageBlock(cls=PILImageBW), CategoryBlock),
                     get_y = parent_label,
                     splitter=GrandparentSplitter(train_name='training', valid_name='testing'))
mnist_dls = mnist_block.dataloaders(source=data_path)

I got, as expected, Training dataset: 60000 Validation dataset: 10000

For training, I want to take a much smaller sample (say, 180). Adapting the command above gives:

mnist_dls.train = mnist_dls.test_dl(mnist_dls.train_ds.items[:180], with_labels=True)

Well, it works: show_batch() displays plausible results and the learner learns. (I can’t really train yet, though: I need some shuffling so that the subset contains more than one digit.)

Training dataset: 180 Validation dataset: 10000

Note that test_dl is some sort of monkeypatch to the DataLoaders class, so there is no corresponding train_dl() function.


  1. How idiomatic, in fastai2 terms, is the above method? How would any future augmentation transforms be affected by the “subsetting”? test_dl is intended for testing/validation…

  2. I tried to add shuffle=True, as in mnist_block.dataloaders(source=data_path, shuffle=True), but it crashes with TypeError: type object got multiple values for keyword argument 'shuffle'

  3. Is there any other way to do the subsetting and shuffling? [I found a way by writing a custom splitter that relies on GrandparentSplitter and then shuffles and subsets the training set.] Should (2.) be another question?

Thank you!

Later edit: quick shuffling:

selected_items = np.random.choice(mnist_dls.train_ds.items, 180, replace=False)
mnist_dls.train = mnist_dls.test_dl(selected_items, with_labels=True)

This gives the expected 180:10000 result. Also, training works fine with the above hacks.
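
For anyone curious about the custom-splitter route mentioned in point 3, here is a minimal sketch of the idea in plain Python. The function name grandparent_subset_split and its parameters are made up for illustration, and the paths are synthetic. It assumes a splitter is simply a callable that receives the items and returns two lists of indices, which is how I understand the DataBlock splitters to work.

```python
import random
from pathlib import Path


def grandparent_subset_split(items, n_train, train_name="training",
                             valid_name="testing", seed=None):
    """Hypothetical helper: split on the grandparent folder name, then keep
    a random sample of n_train training indices. Returns (train, valid)."""
    train_idxs = [i for i, p in enumerate(items)
                  if Path(p).parent.parent.name == train_name]
    valid_idxs = [i for i, p in enumerate(items)
                  if Path(p).parent.parent.name == valid_name]
    rng = random.Random(seed)
    n_train = min(n_train, len(train_idxs))
    # sample() draws without replacement, so the result is shuffled and unique.
    return rng.sample(train_idxs, n_train), valid_idxs


# Synthetic paths laid out like MNIST: <split>/<digit>/<file>.png
items = [Path(f"mnist/{split}/{digit}/{k}.png")
         for split in ("training", "testing")
         for digit in range(10)
         for k in range(30)]

train_idxs, valid_idxs = grandparent_subset_split(items, n_train=180, seed=0)
print(len(train_idxs), len(valid_idxs))  # 180 300
```

To use something like this as a splitter, the extra arguments would need to be bound first, for example with functools.partial.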

The easiest is probably to write your custom get_items, using get_image_files but returning fewer elements for each folder (while still getting elements of all classes). Your code probably only gets 180 images of the same class. One way could be:

def get_items(source):
    # Collect every image file under source, then keep 400 picked at random
    items = get_image_files(source)
    return items.shuffle()[:400]

This would return 400 random elements, so probably from all classes and from both the training and validation sets.

Then you pass that get_items in your DataBlock.
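
If you want the "fewer elements for each folder" variant rather than a global random sample, a rough sketch could look like the following. sample_per_folder is a hypothetical helper (not part of fastai), written in plain Python and shown on synthetic paths so it runs on its own.

```python
import random
from collections import defaultdict
from pathlib import Path


def sample_per_folder(paths, n_per_folder, seed=None):
    """Hypothetical helper: keep at most n_per_folder random paths
    from each parent folder, then shuffle the combined result."""
    by_folder = defaultdict(list)
    for p in paths:
        by_folder[Path(p).parent].append(p)

    rng = random.Random(seed)
    picked = []
    for folder in sorted(by_folder):
        group = by_folder[folder]
        picked.extend(rng.sample(group, min(n_per_folder, len(group))))
    rng.shuffle(picked)
    return picked


# Synthetic paths laid out like MNIST: <split>/<digit>/<file>.png
paths = [Path(f"mnist/{split}/{digit}/{k}.png")
         for split in ("training", "testing")
         for digit in range(10)
         for k in range(30)]

picked = sample_per_folder(paths, n_per_folder=18, seed=0)
print(len(picked))  # 360: 18 files from each of the 20 folders
```

In a real get_items you would presumably call it on the result of get_image_files(source). Note that it subsamples the validation folders as well, since get_items runs before the splitter.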


Don’t we also have a RandomSubsetSplitter that takes a train_sz and a valid_sz (as fractions) and does exactly that?

And then, if you want only one of them to be a subset, set the other’s size to 1.

Edit: It seems you can’t do that. Sadly, it only allows proper subsets for both (each must be less than 100%). However, you could very easily adapt the code.


@muellerzr, thank you, I missed it while reading the docs.
I get the first two lines, but the 3rd?

 assert 0 < train_sz < 1
 assert 0 < valid_sz < 1
 assert train_sz + valid_sz <= 1.

Also, can I chain the splitters, in a pipeline or something? Somehow I doubt it... But it’s nice to know that it’s there!

@sgugger Thank you too, it is not exactly what I wanted, afaik get_items is called before the splitter. However, I will look into it and post another question. Don’t want to hijack the thread anymore.

So this is essentially making sure that the training and validation sizes together are less than or equal to 1.0. That means a 70%/30% split is an option, but 80% and 80% is not. However, if you choose to remove those three assert statements, you should have the freedom you want.
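
To make the third assert concrete, here is a small plain-Python sketch of the idea. It is a hypothetical re-implementation, not fastai's code, and the name random_subset_split is made up. It assumes both subsets are cut from one shuffled list of indices, so the two fractions cannot add up to more than 1 without the validation slice running past the end.

```python
import random


def random_subset_split(n_items, train_sz, valid_sz, seed=None):
    """Hypothetical sketch: take two disjoint random subsets of range(n_items)."""
    assert 0 < train_sz < 1
    assert 0 < valid_sz < 1
    assert train_sz + valid_sz <= 1.0

    idxs = list(range(n_items))
    random.Random(seed).shuffle(idxs)

    train_len = int(n_items * train_sz)
    valid_len = int(n_items * valid_sz)
    # Both slices come from the same permutation, so they never overlap.
    return idxs[:train_len], idxs[train_len:train_len + valid_len]


train_idxs, valid_idxs = random_subset_split(1000, train_sz=0.5, valid_sz=0.25, seed=42)
print(len(train_idxs), len(valid_idxs))  # 500 250

try:
    random_subset_split(1000, train_sz=0.8, valid_sz=0.8)
    rejected = False
except AssertionError:
    rejected = True
print(rejected)  # True
```

Without the third assert, the 80%/80% call in this sketch would silently return a validation set of only 200 items, because the second slice is truncated at the end of the list.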
