PyTorch 1.3 breaks OverSamplingCallback

Yeah, it currently only works with integer labels. If you have any suggestions for other types of labels, please let me know. Thanks for providing these tests!

When changing the sampler, you directly replace the sampler with a WeightedRandomSampler, but not the batch_sampler? Is there a reason for that? I remember it only working if I used the batch_sampler, at least with PyTorch 1.2. Is there a way to change the batch_sampler with the dl.new command?

If you pass a batch_sampler then PyTorch complains, because fastai appends whatever args you pass to the saved args, so the DataLoader ends up with both a batch_sampler and a batch_size. I’d have to get rid of any batch options to be able to pass a batch_sampler (passing None would likely work).
Things were different before, when directly modifying the internal sampler parameters, as then you’d end up with both a sampler and a batch_sampler, which would likely cause an error. This method of just specifying a new sampler and letting the batch_sampler get created is what the distributed training code does.
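
A rough sketch of the difference in plain PyTorch (toy dataset and weights, not the fastai code itself): passing a sampler lets the DataLoader build its own batch_sampler from batch_size, while passing a batch_sampler alongside batch_size is rejected.

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler, BatchSampler

# Toy stand-in for the real dataset: 8 items of class 0, 2 of class 1
ds = TensorDataset(torch.randn(10, 3), torch.tensor([0]*8 + [1]*2))
weights = [0.125]*8 + [0.5]*2  # per-item weights (inverse class frequency, illustrative)

sampler = WeightedRandomSampler(weights, num_samples=len(ds), replacement=True)

# Works: the DataLoader wraps the sampler in its own BatchSampler using batch_size
dl = DataLoader(ds, batch_size=4, sampler=sampler)

# Raises ValueError: batch_sampler is mutually exclusive with batch_size, shuffle,
# sampler and drop_last, which is why the saved batch_size made PyTorch complain
# dl = DataLoader(ds, batch_size=4, batch_sampler=BatchSampler(sampler, 4, False))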

As long as it works the same way with the correct distribution, it’s fine.

You can check that it’s oversampling correctly using a snippet like this:

import torch
import matplotlib.pyplot as plt

# Collect the targets from every training batch and plot their distribution
labels = []
for img, target in data.train_dl:
    labels.append(target.cpu())
labels = torch.cat(labels)
plt.hist(labels.numpy())

That’s nice.

In terms of enhancements:
I think if you take the items and counts from np.unique then you could map self.labels to indexes into the unique counts, and it should work for arbitrary items. You might then want to check you weren’t getting something silly like a list of filenames (or RLE masks, as I once tried without thinking). It would be pretty slow with lots of unique values (you might also want to sort the labels/counts first to be able to use np.searchsorted).
Another possible option would be allowing a function from item to a label ID. For instance, if doing multi-class segmentation you could provide the class of the mask. That would also allow using arbitrary columns in the inner_df. Again, possibly a little slow on big datasets, but this only happens in on_train_begin so should generally be fine.
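
A minimal sketch of the first idea (the labels here are made up, and in practice they’d come from something like ds.y.items, or from a user-supplied item-to-label function): np.unique with return_inverse maps arbitrary labels back to their counts without assuming integers.

import numpy as np
import torch

# Arbitrary (hashable) labels, e.g. strings or IDs produced by a user-supplied function
labels = np.array(['cat', 'dog', 'cat', 'bird', 'cat'])

# return_inverse gives each item's index into the sorted unique array,
# so counts[inverse] is the class count for every item
uniques, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
weights = torch.DoubleTensor(1.0 / counts[inverse])  # inverse-frequency weight per item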

That also reminds me I tried to use it once with weights, in a multi-label segmentation task, and it had issues because it was still doing some of the calculations on labels which were masks. It should probably skip some of those calculations when weights are provided.

Another enhancement would be allowing the caller to specify total_len_oversample; then you could use it for under-sampling as well (for when over-sampling results in overly long epochs or over-representation of a class). I use code for this which I think is basically the same as yours, except that I set the epoch length to less than the original size.
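
For reference, a tiny sketch of that: under-sampling is just a matter of handing the sampler a caller-chosen number of samples smaller than the dataset (the names and numbers below are illustrative, epoch_len standing in for total_len_oversample).

import torch
from torch.utils.data import WeightedRandomSampler

n_items = 50000                   # size of the training set (illustrative)
weights = torch.ones(n_items)     # per-item weights, however you compute them
epoch_len = 5000                  # caller-chosen; < n_items gives under-sampling
sampler = WeightedRandomSampler(weights, num_samples=epoch_len, replacement=True)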


Thanks for your suggestions. I do plan to allow for undersampling in the future too, and to convert it from OverSamplingCallback to just a SamplingCallback. However, I think it might be best to leave some of these changes till fastai v2. I plan to prioritize the TPU callback (thanks for all your help so far) and a couple of Kaggle competitions for the next couple of months anyway.


Yeah, obviously there are some limitations with backward compatibility, so a rewrite for v2 may be best for some of these. They’re not major things either.
Better support for passing in weights is probably the biggest, as you quickly get into cases where it’d be better to just calculate the weights yourself, and there aren’t necessarily sensible defaults. E.g. how long should an epoch be when undersampling? I just have a config param fixing it to a value that gives me regular enough validations (which depends on model size). So just taking care of the fastai side, plus a couple of examples for doing your own weights, might be just as good.

Isn’t passing of weights already supported? Are there problems with it?

Yeah, when you pass in weights it still does (and did, no change) self.label_counts = np.unique(ds.y.items) and still sets the epoch length to int(self.data.c*np.max(self.label_counts)). So you get silly results if, say, every label is unique (e.g. a mask filename or RLE)…

So it’s mainly a problem for segmentation tasks? I will look into it, but I have not worked much with segmentation. I will see if I can come up with a quick fix. If not, I will probably have to fix it in fastai v2.

Well, anything where you want to weight based on label group, not the actual label, but yeah segmentation would be the common one. The main thing is just not doing any label calculations if weights are provided (which likely means the user also needs to pass the epoch length). Actually calculating the weights depends on the problem.

Alternatively you could allow the user to provide one-hot encoded group membership. My code is:

# One-hot encoded group on axis 1 (i.e. 0/1 column for each group, so n_items x n_groups)
# `groups` starts as the list of group names, then becomes the membership matrix
groups = np.stack([ds.inner_df.Group == g for g in groups], axis=1)
# Group counts on axis 0
group_weights = 1/groups.sum(axis=0)
# Broadcast and multiply so each row has the weight of the group(s) it's a member of, then sum for n_items weights
weights = tensor(groups * group_weights).sum(dim=1)
ws = WeightedRandomSampler(weights, epoch_len, replacement=True)  # where the user specifies epoch_len

So if the user provides groups and epoch_len, the rest is general. I can look to add it in when I have some time for such things.


Thanks! I’ll look into it. I will ask some more questions regarding the kinds of problems where such a solution is needed.

Might just be multi-class segmentation. Though it’s not that necessary if you avoid the extra weight calculations, which is where you hit the current limitation of forcing labels to be integers. If you had:

if self.weights is None:
    self.weights = ... # Current calc
if self.epoch_len is None: # epoch_len being a new parameter to set what is now total_len_oversample
    self.epoch_len = ... # current total_len_oversample calc
sampler = ...

Then it should allow any label type, avoid unnecessary recalculations if weights are provided (and errors when they don’t work), and allow undersampling.
In my case I could then have just done the above calcs myself and provided the weights/epoch length.
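
For illustration, here is that flow as a standalone helper (make_sampler and its defaults are my own stand-ins, not the existing callback code):

import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

def make_sampler(labels, weights=None, epoch_len=None):
    """Sketch of the proposed logic: only touch the labels when no weights are given."""
    if weights is None:
        # Stand-in for the current weight calculation; works for any hashable labels
        _, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
        weights = torch.DoubleTensor(1.0 / counts[inverse])
    if epoch_len is None:
        # Stand-in for the current total_len_oversample calculation
        epoch_len = len(labels)
    return WeightedRandomSampler(weights, num_samples=epoch_len, replacement=True)

With something like that, passing both weights and epoch_len (as in the group-membership example above) would skip the label handling entirely.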


This is something to look into. I guess the only thing left might be to have some default weights for oversampling and undersampling, which could be easily done in a small function?

Here is the code for weighted_databunch:

Hi, is it fixed now? I’m still facing this issue when creating a learner with OverSamplingCallback and calling lr_find.

It’s fixed in master, but no release yet. Not sure if there’s something holding it back, but hopefully a new 1.3 compatible version will be pushed soon given this and the more general GridSampler warning…
@sgugger anything holding it up that people can help with?


Hi @TomB, for now can I make it work by pulling fastai master?

Yeah, pulling master. Or you could also pip install --upgrade git+https://github.com/fastai/fastai (or if you use conda probably safer to pip install --no-deps --upgrade ... as mixing pip and conda packages can cause issues). Or given the limited scope of the fix you could just replace the altered file with the new version.

Though fastai seems to do a generally good job of keeping master usable (especially now, when most dev effort is in v2), so it should be fine to use that. I switch between release and master environments quite a bit without issue.

EDIT: Sorry, I put some bad URLs in there; the current ones should be right.


Thank you @TomB, I will definitely check it out.


We’ll make a new release soon-ish, just wanted to catch every bug that appeared with PyTorch 1.3.
