So, PyTorch 1.3 now raises an error when you replace the sampler on an existing DataLoader, which is how OverSamplingCallback works. So:
from fastai.vision import *
DATA = untar_data(URLs.MNIST)
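src = (ImageList.from_folder(DATA).filter_by_rand(0.1)
       .split_by_folder('training', 'testing').label_from_folder())  # split/label steps assumed; the pasted line was cut off here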
data = src.databunch(bs=4, num_workers=0)
lrn = cnn_learner(data, models.resnet18, callback_fns=[callbacks.OverSamplingCallback])
gives you: ValueError: batch_sampler attribute should not be set after DataLoader is initialized
So, I’ve updated it to use DeviceDataLoader.new(), in line with the distributed stuff (which is therefore not affected), and cleaned it up a little. I’ve pushed a fix, but I’ll also add a test, as OverSamplingCallback is currently untested, and test manually a little more. I might not get to it until tomorrow, but it’s probably not going to affect that many people.
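For anyone wanting to see the underlying behaviour without fastai, here is a minimal sketch in plain PyTorch. The toy TensorDataset, labels and batch size are made up for illustration, and this is not the actual DeviceDataLoader.new() code, just the same idea of building a new loader with a sampler instead of mutating the old one.

```python
import torch
from torch.utils.data import (BatchSampler, DataLoader, TensorDataset,
                              WeightedRandomSampler)

# Toy dataset: 8 items with imbalanced labels (illustrative only).
xs = torch.arange(8, dtype=torch.float32).unsqueeze(1)
ys = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1])
ds = TensorDataset(xs, ys)
dl = DataLoader(ds, batch_size=4)

counts = torch.bincount(ys)                # items per class
weights = (1.0 / counts.double())[ys]      # per-item inverse-frequency weight
sampler = WeightedRandomSampler(weights, num_samples=len(ds), replacement=True)

# Old approach: mutate the existing loader. This is what now raises.
try:
    dl.batch_sampler = BatchSampler(sampler, batch_size=4, drop_last=False)
    error = None
except ValueError as e:
    error = str(e)

# Alternative: build a new loader and let it create its own batch_sampler.
new_dl = DataLoader(ds, batch_size=4, sampler=sampler)
n_batches = len(list(new_dl))
```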
One thing I was wondering while looking at it (this being partly cleaned):
ds,dl = self.data.train_ds,self.data.train_dl
self.labels = ds.y.items
_, counts = np.unique(self.labels,return_counts=True)
if self.weights is None: self.weights = torch.DoubleTensor((1/counts)[self.labels])
self.label_counts = np.bincount([ds.y[i].data for i in range(len(ds))])
Why is it calculating both label_counts and counts? How can np.unique(self.labels,return_counts=True) be different from np.bincount([ds.y[i].data for i in range(len(ds))])? Am I missing something, or can the latter just be counts?
Or is it intending to actually iterate the dataloader, and so include any sampler on it (but then you can’t get the original y)?
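A quick check with toy labels (made up here, not taken from the callback) suggests the two counts agree whenever the labels are integers covering 0..n-1 with no gaps, and only differ when a class id has no items:

```python
import numpy as np

# Toy integer labels; purely illustrative.
labels = np.array([0, 0, 1, 2, 2, 2])
_, counts = np.unique(labels, return_counts=True)
bincounts = np.bincount(labels)
same = np.array_equal(counts, bincounts)  # labels cover 0..2 with no gaps

# With a gap (no items of class 1) bincount keeps a zero slot, unique does not.
gappy = np.array([0, 0, 2, 2, 2])
_, gappy_counts = np.unique(gappy, return_counts=True)
gappy_bincounts = np.bincount(gappy)
```

In the gappy case, (1/counts)[labels] would index a length-2 array with label 2 and raise an IndexError, so the np.unique counts only work as a lookup table for gap-free integer labels.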
I wrote the OverSamplingCallback. Yes, you are right, that line of code is unnecessary. I accidentally added it when I was fixing the length of the oversampled dataset. It should be fine just to rename counts to self.label_counts and update the references to counts accordingly.
I didn’t know that PyTorch 1.3 will not let us change the batch sampler after initialization. Thanks for fixing this and pointing out my error!
OK, testing was more effort than I thought as the basic fake data is labelled with floats but oversampling only works on int labels. I added an assert to clearly fail there (and not try and do np.unique on filenames or anything). I’ve tested it against the various label_from functions and it works.
Best test I came up with was:
Not really sure it’s worth it. It pulls in a lot of stuff, so it could fail incidentally. It could also fail given that the sampling is random, but it only checks for > 20 when the value should be ~50 and would be 1 without oversampling.
So I’ll PR without but can add that if you want it @sgugger.
When changing the sampler, you directly change the sampler with WeightedRandomSampler, but not the batch_sampler? Is there a reason for that? I remember it only working if I used the batch_sampler at least with PyTorch 1.2. Is there a way to change the batch_sampler with the dl.new command?
If passing batch_sampler then PyTorch would complain as fastai was appending whatever args are passed to the saved args. So it got both a batch_sampler and a batch_size. So I’d have to get rid of any batch options to pass a batch_sampler (passing None would likely work).
Things were different before when directly modifying the internal sampler parameters as then you’d end up with both a sampler and a batch_sampler which would likely cause an error. This method of just specifying a new sampler and letting the batch_sampler get created is what the Distributed stuff does.
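The batch_sampler/batch_size conflict can be reproduced in plain PyTorch. This is a minimal sketch with a made-up toy dataset; it only illustrates the DataLoader constructor check, not how fastai stores and re-applies its saved arguments.

```python
import torch
from torch.utils.data import (BatchSampler, DataLoader, SequentialSampler,
                              TensorDataset)

ds = TensorDataset(torch.arange(6).unsqueeze(1))  # toy data, illustrative only
batch_sampler = BatchSampler(SequentialSampler(ds), batch_size=2, drop_last=False)

# Passing a batch_sampler together with an explicit batch_size is rejected.
try:
    DataLoader(ds, batch_size=4, batch_sampler=batch_sampler)
    conflict = None
except ValueError as e:
    conflict = str(e)

# Leaving the other batch options at their defaults works.
dl = DataLoader(ds, batch_sampler=batch_sampler)
batch_sizes = [len(batch[0]) for batch in dl]
```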
In terms of enhancements:
I think if you take the items and counts from np.unique then you could map self.labels to indexes into the unique counts and it should work for arbitrary items. You might then want to check you weren’t getting something silly like a list of filenames (or RLE masks as I once tried without thinking). It would be pretty slow with lots of unique values (you might also want to sort the labels/counts first to be able to use np.searchsorted).
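As a rough sketch of that idea, np.unique can also return the item-to-index mapping directly via return_inverse, which stands in for the sort plus np.searchsorted step mentioned above. The labels and the uniqueness guard below are hypothetical, just to show the shape of it:

```python
import numpy as np

# Toy string labels; illustrative only.
labels = np.array(["cat", "dog", "cat", "bird", "cat"])

# `inverse` maps each item to the position of its label in `uniques`.
uniques, inverse, counts = np.unique(labels, return_inverse=True,
                                     return_counts=True)

# One weight per item: inverse frequency of that item's label.
weights = (1.0 / counts)[inverse]

# Hypothetical guard against inputs where every label is unique
# (e.g. filenames or RLE masks passed by mistake).
labels_look_sane = len(uniques) < len(labels)
```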
Another possible option would be allowing a function from item to a label ID. For instance, if doing multi-class segmentation you could provide the class of the mask. That would also allow using arbitrary columns in the inner_df. Again, possibly a little slow on big datasets, but this only happens in on_train_begin so should generally be fine.
That also reminds me that I once tried to use it with weights, in a multi-label segmentation task, and it had issues because it was still doing some of the calculations on labels, which were masks. It should probably skip those calculations if weights are provided.
Another enhancement would be allowing the caller to specify the total_len_oversample, then you can use it for under-sampling as well (for when over-sampling results in overly long epochs or over-representation of a class). I use code for this which I think is basically the same as yours except that I set the epoch length to less than the original size.
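A minimal sketch of the under-sampling case, assuming a toy 90/10 label split and a made-up epoch_len: WeightedRandomSampler takes num_samples independently of the dataset size, so setting it below the dataset length gives a shorter epoch.

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Toy 90/10 class imbalance over 100 items (illustrative only).
labels = torch.tensor([0] * 90 + [1] * 10)
weights = (1.0 / torch.bincount(labels).double())[labels]

# Hypothetical user-chosen epoch length, shorter than the dataset.
epoch_len = 40
sampler = WeightedRandomSampler(weights, num_samples=epoch_len, replacement=True)
indices = list(sampler)  # dataset indices drawn for one epoch
```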
Thanks for your suggestions. I do plan to allow for undersampling in the future too, and convert it from OverSamplingCallback to just SamplingCallback. However, I think it might be best to leave some of these changes till fastai v2. I anyway plan to prioritize TPU callback (thanks for all your help so far) and a couple Kaggle competitions for the next couple months.
Yeah, obviously some limitations with backward compatibility so a rewrite for v2 may be best for some. Also not major things.
Better support for passing in weights is probably the biggest, as you likely quickly get into cases where it’d be better just calculating the weights yourself, and there aren’t necessarily sensible defaults. E.g. how long should an epoch be when undersampling? I just have a config param fixing it to a value that gives me regular enough validations (which depends on model size). So just taking care of the fastai side and providing a couple of samples for doing your own weights might be just as good.
Yeah, when you pass in weights it still does (and did, no change) self.label_counts = np.unique(ds.y.items) and still sets the epoch length to int(self.data.c*np.max(self.label_counts)). So you get silly things if say every label is unique (e.g. mask filename or RLE)…
So it’s a problem mainly for segmentation problems? I will look into it but I have not worked much with segmentation problems. I will see if I can come up with a quick fix. If not, I will probably have to fix it for fastai v2.
Well, anything where you want to weight based on label group, not the actual label, but yeah segmentation would be the common one. The main thing is just not doing any label calculations if weights are provided (which likely means the user also needs to pass the epoch length). Actually calculating the weights depends on the problem.
Alternatively you could allow the user to provide one-hot encoded group membership. My code is:
# One-hot encoded group on axis 1 (i.e. 0/1 column for each group, so n_items x n_groups)
groups = np.stack([ds.inner_df.Group == g for g in groups], axis=1)
# Group counts on axis 0
group_weights = 1/groups.sum(axis=0)
# Broadcast and multiply so each row has the weight of the group(s) it's a member of, then sum for n_items weights
weights = tensor(groups * group_weights).sum(dim=1)
ws = WeightedRandomSampler(weights, epoch_len, replacement=True) # where user specifies epoch_len
So if user provided groups and epoch_len the rest is general. I can look to add it in when I have some time for such things.
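To make the arithmetic concrete, here is a self-contained NumPy version of the same calculation on toy data. The item_groups array is a hypothetical stand-in for ds.inner_df.Group, and the final sampler step is left out.

```python
import numpy as np

# Hypothetical group column standing in for ds.inner_df.Group.
item_groups = np.array(["a", "a", "a", "b"])
group_names = np.unique(item_groups)

# One-hot membership, shape (n_items, n_groups).
onehot = np.stack([item_groups == g for g in group_names], axis=1)

# Inverse group size, one value per group.
group_weights = 1.0 / onehot.sum(axis=0)

# Broadcast over rows, then sum per item to get n_items weights.
weights = (onehot * group_weights).sum(axis=1)
```

Each group's weights sum to 1 here, so a weighted sampler would draw from groups "a" and "b" about equally often despite the 3:1 imbalance.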
Might just be multi-class segmentation. Though it’s not that necessary if you avoid the extra weight calculations, which is where you hit the current limitation where I force labels to be integers. If you had:
if self.weights is None:
    self.weights = ... # Current calc
if self.epoch_len is None: # epoch_len being a new parameter to set what is now total_len_oversample
    self.epoch_len = ... # current total_len_oversample calc
sampler = ...
Then it should allow any label type, avoid unnecessary recalculations if weights are provided (and errors when they don’t work) and allow undersampling.
Then I could have just done the above calcs and provided weights/epoch length.
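A rough standalone sketch of that control flow, outside the callback. The resolve_sampling helper is hypothetical, and its defaults just mirror the calculations quoted earlier in the thread (inverse class frequency, and number of classes times the largest class count).

```python
import numpy as np

def resolve_sampling(labels, weights=None, epoch_len=None):
    """Return (weights, epoch_len), computing defaults only when not supplied.

    Hypothetical helper, not part of fastai.
    """
    if weights is None or epoch_len is None:
        # Labels are only inspected when a default is actually needed.
        _, inverse, counts = np.unique(labels, return_inverse=True,
                                       return_counts=True)
        if weights is None:
            weights = (1.0 / counts)[inverse]
        if epoch_len is None:
            epoch_len = int(len(counts) * counts.max())
    return np.asarray(weights, dtype=float), epoch_len

# Defaults computed from integer labels.
w, n = resolve_sampling([0, 0, 0, 1])

# User-supplied weights and epoch length: labels (e.g. mask filenames)
# are never touched, so any label type is accepted.
w2, n2 = resolve_sampling(["m1.png", "m2.png"], weights=[0.5, 0.5], epoch_len=1)
```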