Generate a partial subset from valid_ds

I am running out of memory when using interp = ClassificationInterpretation.from_learner(learn)

As suggested by Sylvain, I would like to generate a partial subset of the validation set in order to run the previous command. I tried something like:

interp = ClassificationInterpretation.from_learner(learn, slice(dls.valid_ds[int(1):int(1000)]))

But I got an error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-49-e87882ae4fd1> in <module>
----> 1 interp = ClassificationInterpretation.from_learner(learn, slice(dls.valid_ds[int(1):int(1000)]))

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/fastai2/interpret.py in from_learner(cls, learn, ds_idx, dl, act)
     23     def from_learner(cls, learn, ds_idx=1, dl=None, act=None):
     24         "Construct interpretatio object from a learner"
---> 25         if dl is None: dl = learn.dls[ds_idx]
     26         return cls(dl, *learn.get_preds(dl=dl, with_input=True, with_loss=True, with_decoded=True, act=None))
     27 

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/fastai2/data/core.py in __getitem__(self, i)
    121         self.device = device
    122 
--> 123     def __getitem__(self, i): return self.loaders[i]
    124     def new_empty(self):
    125         loaders = [dl.new(dl.dataset.new_empty()) for dl in self.loaders]

TypeError: slice indices must be integers or None or have an __index__ method

Any idea how to generate this subset? Thanks!

Partial answer:

dls.valid = dls.valid.new(dls.valid_ds[:1000])

This generates the subset. However, when running:

interp = ClassificationInterpretation.from_learner(learn)

I got another error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-aa7f7b70a42b> in <module>
----> 1 interp = ClassificationInterpretation.from_learner(learn)

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/fastai2/interpret.py in from_learner(cls, learn, ds_idx, dl, act)
     24         "Construct interpretatio object from a learner"
     25         if dl is None: dl = learn.dls[ds_idx]
---> 26         return cls(dl, *learn.get_preds(dl=dl, with_input=True, with_loss=True, with_decoded=True, act=None))
     27 
     28     def top_losses(self, k=None, largest=True):

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/fastai2/interpret.py in __init__(self, dl, inputs, preds, targs, decoded, losses)
     51     def __init__(self, dl, inputs, preds, targs, decoded, losses):
     52         super().__init__(dl, inputs, preds, targs, decoded, losses)
---> 53         self.vocab = self.dl.vocab
     54         if is_listy(self.vocab): self.vocab = self.vocab[-1]
     55 

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/fastcore/foundation.py in __getattr__(self, k)
    220         if self._component_attr_filter(k):
    221             attr = getattr(self,self._default,None)
--> 222             if attr is not None: return getattr(attr,k)
    223         raise AttributeError(k)
    224     def __dir__(self): return custom_dir(self,self._dir())

AttributeError: 'list' object has no attribute 'vocab'

This is weird because dls.vocab gives me the correct labels.

FINAL answer:

dls.valid = dls.valid.new(dls.valid_ds[:1000])
dls[1].vocab = dls.vocab
interp = ClassificationInterpretation.from_learner(learn)

The fact that you are losing the vocab is weird though. I’ll try to look at why it happens because it’s not a good sign.


Ok, here is what is happening: dls.valid_ds[:1000] is a list. It’s not a Datasets anymore, so it loses all information about the inner transforms.
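
A quick way to see this, if you want to check it in your own notebook (just a sketch using the dls from above):

print(type(dls.valid_ds))         # a Datasets object, carrying the transforms and vocab
print(type(dls.valid_ds[:1000]))  # a plain Python list, with none of that information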

The fastest way to get a proper dl that works is to use DataLoaders.test_dl, which allows you to keep the labels if you pass along with_labels=True:

new_dl = dls.test_dl(dls.valid_ds.items[:1000], with_labels=True)

(note the .items to pass the filenames and not the actual elements of the dataset). You can even store it as a new validation dataloader (and keep the one you have) with:

dls.loaders.append(dls.test_dl(dls.valid_ds.items[:1000], with_labels=True))

and then use

interp = ClassificationInterpretation.from_learner(learn, ds_idx=2)
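
Alternatively, since from_learner also accepts a dl argument directly (you can see it in the signature in the traceback above), a sketch that skips appending to dls.loaders:

new_dl = dls.test_dl(dls.valid_ds.items[:1000], with_labels=True)
interp = ClassificationInterpretation.from_learner(learn, dl=new_dl)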

Hi! I want to use the hack above to subset my training set.

E.g. MNIST, downloaded using untar_data:

mnist_block = DataBlock(blocks=(ImageBlock(cls=PILImageBW), CategoryBlock),
                        get_items=get_image_files,
                        get_y=parent_label,
                        splitter=GrandparentSplitter(train_name='training', valid_name='testing'))
mnist_dls = mnist_block.dataloaders(source=data_path)

I got, as expected: Training dataset: 60000, Validation dataset: 10000

I want to use a much smaller training sample (say, 180). Using the command exemplified above should give me:

mnist_dls.train = mnist_dls.test_dl(mnist_dls.train_ds.items[:180], with_labels=True)

Well, it works: show_batch() displays plausible results and the learner learns. (I can’t really train yet; I need some shuffling, not only one digit.)

Training dataset: 180 Validation dataset: 10000

Note that test_dl is some sort of monkey-patched method on the DataLoaders class, so there is no corresponding train_dl() function.

Questions:

  1. How idiomatic is the above method in fastai2? How would possible future augmentation transforms be affected by the “subsetting”? test_dl is intended for testing/validation…

  2. I tried to add shuffle=True, as in mnist_block.dataloaders(source=data_path, shuffle=True), but it crashes with TypeError: type object got multiple values for keyword argument 'shuffle'.

  3. Is there any other way to do the subsetting and shuffling? (I found a way by writing a custom splitter that relies on GrandparentSplitter and then shuffles and subsets the training set.) Should (2) be posted as a separate question?

Thank you!

Later edit: Quick shuffling:

selected_items = np.random.choice(mnist_dls.train_ds.items, 180, replace=False)
mnist_dls.train = mnist_dls.test_dl(selected_items, with_labels=True)

gives the expected 180:10000 result. Also, the training works fine with the above hacks.
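
If you prefer to stay inside fastcore for the shuffling, the same thing can be done with the .shuffle() method of the L returned by .items (the same method used in the answer below); just a sketch:

selected_items = mnist_dls.train_ds.items.shuffle()[:180]  # shuffle the L of file paths, keep 180
mnist_dls.train = mnist_dls.test_dl(selected_items, with_labels=True)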

The easiest is probably to write your own custom get_items, using get_image_files but returning fewer elements for each folder (while getting elements from all classes). Your code probably only gets 180 images of the same class. One way could be:

def get_items(source):
    # grab every image file, shuffle the returned L, and keep 400 random ones
    items = get_image_files(source)
    return items.shuffle()[:400]

This would return 400 random elements, so probably from all classes and from both the training and validation sets.

Then you pass that get_items to your DataBlock.
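
For example, putting it together with the MNIST DataBlock from earlier in the thread (a sketch; the name get_subset_items and the count of 400 are just illustrative):

def get_subset_items(source):
    # reuse get_image_files, shuffle the returned L, and keep a random subset
    items = get_image_files(source)
    return items.shuffle()[:400]

mnist_block = DataBlock(blocks=(ImageBlock(cls=PILImageBW), CategoryBlock),
                        get_items=get_subset_items,
                        get_y=parent_label,
                        splitter=GrandparentSplitter(train_name='training', valid_name='testing'))
mnist_dls = mnist_block.dataloaders(source=data_path)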


Don’t we also have a RandomSubsetSplitter that takes a train_pct and val_pct and does exactly that?

https://dev.fast.ai/data.transforms#RandomSubsetSplitter

And then if you want only one of them to be a subset, set the other’s pct to 1.

Edit: Seems you can’t do that; it only allows proper (partial) subsets on both sides, sadly (both must be less than 100%). However, you could very easily adapt the code too :slight_smile:


@muellerzr, thank you, I missed it while reading the docs.
I get the first two lines, but what about the 3rd?

 assert 0 < train_sz < 1
 assert 0 < valid_sz < 1
 assert train_sz + valid_sz <= 1.

Also, can I chain the splitters? In a pipeline or something? Somehow, I doubt it… But it’s nice to know that it’s there!

@sgugger Thank you too; it is not exactly what I wanted, since afaik get_items is called before the splitter. However, I will look into it and post another question. I don’t want to hijack the thread any further.

So this is essentially making sure that the train and validation sizes together are less than or equal to 1.0. This means that, say, a 70%/30% split is an option but 80% and 80% is not. However, if you choose to remove (or relax) those three assert statements, you should be good to go with the freedom you want :slight_smile:
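
For example, here is one way such an adapted splitter could look. This is my own sketch, not the library’s code: the name random_subset_splitter and the numpy-based shuffling are my choices, and the third assert is relaxed so one side can cover 100% of the items (at the cost of possible train/valid overlap):

import numpy as np
from fastcore.foundation import L

def random_subset_splitter(train_sz, valid_sz, seed=None):
    "Like RandomSubsetSplitter, but allows train_sz + valid_sz to exceed 1 (e.g. a full valid set)."
    assert 0 < train_sz <= 1
    assert 0 < valid_sz <= 1
    def _inner(o):
        rng = np.random.default_rng(seed)
        idxs = L(rng.permutation(len(o)).tolist())
        train_len, valid_len = int(len(o) * train_sz), int(len(o) * valid_sz)
        # train indices come from the front of the permutation, valid from the back,
        # so the two only overlap when train_sz + valid_sz > 1
        return idxs[:train_len], idxs[-valid_len:]
    return _inner

# e.g. a 1% training subset while keeping every item in validation
splitter = random_subset_splitter(train_sz=0.01, valid_sz=1.0)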
