Stratified labels sampling

devforfu · October 25, 2018, 3:06pm

Sorry guys if this question has already been raised (and answered). Does the library contain stratified sampling methods? As far as I can see, ImageDataBunch automatically splits the data into training and validation sets using the random_split function, as in this method:

class ImageDataBunch(DataBunch):

    ...

    @classmethod
    def from_lists(cls, path:PathOrStr, fnames:FilePathList, labels:Collection[str], valid_pct:int=0.2, test:str=None, **kwargs):
        classes = uniqueify(labels)
        train,valid = random_split(valid_pct, fnames, labels)
        datasets = [ImageClassificationDataset(*train, classes),
                    ImageClassificationDataset(*valid, classes)]
        if test: datasets.append(ImageClassificationDataset.from_single_folder(Path(path)/test, classes=classes))
        return cls.create(*datasets, path=path, **kwargs)

Also, the random_split function uses a uniform distribution to separate observations. Is there a method similar to the StratifiedShuffleSplit class from scikit-learn, for example to split an imbalanced dataset?

Or is it better to use scikit-learn itself to prepare data before feeding samples into data bunch?
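
For reference, a minimal sketch of what the scikit-learn route could look like, assuming the file names and labels are already in memory. The toy `fnames` and `labels` below are made up for illustration; only `StratifiedShuffleSplit` itself comes from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical imbalanced toy data: 90 "cat" images and 10 "dog" images.
fnames = np.array([f"img_{i}.jpg" for i in range(100)])
labels = np.array(["cat"] * 90 + ["dog"] * 10)

# One shuffled split that holds out 20% for validation while keeping
# the class ratio the same in both parts.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, valid_idx = next(splitter.split(fnames, labels))

train_fnames, train_labels = fnames[train_idx], labels[train_idx]
valid_fnames, valid_labels = fnames[valid_idx], labels[valid_idx]
```

The resulting file name and label arrays could then be wrapped in datasets and passed on, rather than letting random_split choose the validation set.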

Hope the question makes sense.

jeremy · October 25, 2018, 3:08pm

There isn’t anything like that. But you can prepare your datasets elsewhere, then call ImageDataBunch.create.

jamesp · October 25, 2018, 3:32pm

I can offer a worked example. I use a custom dataset to accomplish this, but everything else is vanilla fastai. I have a pandas dataframe with all of my data. I then split that into train, validation, and test data (for my current use case, this is much easier than shuffling things around into different folders). I use a custom loader to make this really easy. For example, here is a classifier dataset derived from my scalar dataset.

merged is my pandas dataframe, which contains (at least) a PosixPath column named ‘file_path’, and a column identified by the trait variable which has my categorical outcomes of interest.

class ImageCategoricalDataset(ImageDataset):
    def __init__(self, df:DataFrame, path_column:str='file_path', dependent_variable:str=None):
        # sorted(set(x)) de-duplicates the labels and puts them in a fixed
        # order, so the label-to-index mapping does not depend on set
        # iteration order
        self.classes = sorted(set(df[dependent_variable]))
        self.class2idx = {v:k for k,v in enumerate(self.classes)}
        y = np.array([self.class2idx[o] for o in df[dependent_variable]], dtype=np.int64)

        # The superclass does nice things for us like tensorizing the numpy
        # input
        super().__init__(df[path_column], y)
        
        self.loss_func = F.cross_entropy
        self.loss_fn = self.loss_func
        
    def __getitem__(self, i:int):
        return open_image(self.x[i]), self.y[i]

    def __len__(self)->int:
        return len(self.y)

I then assign a random value to each row:

np.random.seed(31337)
merged['rand'] = np.random.uniform(low=0.0, high=1.0, size=len(merged))

I then generate datasets based on cutpoints of that random value:

dat_train = ImageCategoricalDataset(merged[merged['rand'] < 0.7], 'file_path', trait)
dat_valid = ImageCategoricalDataset(merged[(merged['rand'] >= 0.7) & (merged['rand'] < 0.9)], 'file_path', trait)
dat_test = ImageCategoricalDataset(merged[merged['rand'] >= 0.9], 'file_path', trait)

And finally I create my data bunch from these datasets and feed them into a learner:

data = ImageDataBunch.create(dat_train, dat_valid, dat_test,
    ds_tfms=get_transforms(),
    bs=128,
    size=128)

learn = ConvLearner(data, 
                    models.resnet50, 
                    metrics=[accuracy, dice], 
                    ps=0.5,
                    callback_fns=ShowGraph)

My background is in Go, not Python, so there are likely more efficient ways to accomplish this.

devforfu · October 25, 2018, 3:52pm

Right, makes sense.

Sorry, I am not sure how this version differs from the standard image bunches. I mean, you still split your samples using a uniform distribution, right?

jamesp · October 25, 2018, 4:02pm

Sure, but the point is that rather than relying on fastai to randomly split for you, you can now choose any splitting function you want. I chose a uniform split, but you can use a stratified split. You can ignore the ImageCategoricalDataset and just use whatever tool you want to create standard ImageDatasets and then feed them into ImageDataBunch.create(). My main point was to show one way to accomplish that, once you have come up with a way to split up your data set (whether randomly or with scikit-learn or whatever you like).
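
As a rough sketch of the stratified variant: instead of thresholding a uniform random column, two calls to scikit-learn's `train_test_split` with `stratify=` can produce roughly the same 70/20/10 split while keeping class ratios similar in each part. The `merged` frame here is a made-up stand-in with a `file_path` column and a `label` outcome column, not my real data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the `merged` dataframe: 200 rows, 10% positives.
merged = pd.DataFrame({
    "file_path": [f"img_{i}.png" for i in range(200)],
    "label": ["pos"] * 20 + ["neg"] * 180,
})
trait = "label"

# First carve off 70% for training, stratified on the outcome column.
df_train, df_rest = train_test_split(
    merged, train_size=0.7, stratify=merged[trait], random_state=31337)

# Then split the remaining 30% into validation (2/3) and test (1/3),
# which corresponds to about 20% and 10% of the full data.
df_valid, df_test = train_test_split(
    df_rest, train_size=2/3, stratify=df_rest[trait], random_state=31337)
```

The three frames can then be handed to whatever dataset class you use, in place of the `merged['rand']` cutpoints.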

devforfu · October 25, 2018, 4:14pm

Yes, right, that was my fallback option. I just wanted to make sure that I am not doing extra work if I split the data myself. Previously I was writing a lot of things like datasets, iterators, and loaders manually, so now I am trying to be more careful.

jeremy · October 25, 2018, 6:08pm

Good plan! Sometimes you’ll find a part of ML (like this) that we haven’t integrated into the lib yet. If you come up with a solution, it’s a great exercise to try to factor it out into something simple and reusable. That’s how we build fastai ourselves!

henrique · July 6, 2019, 10:15pm

from sklearn.model_selection import StratifiedKFold
tr_ids, val_ids = next(StratifiedKFold(n_splits=10).split(X, y))
ImageList.from_df({...}).split_by_idx(val_ids)
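
A self-contained sketch of the scikit-learn half of the snippet, on made-up labels. With `n_splits=10`, taking the first fold holds out roughly 10% of the samples as validation with class ratios preserved, and `val_ids` is the index array that goes to `split_by_idx`. The `X` and `y` here are toy placeholders, and `shuffle=True` is an addition that the snippet above does not use.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy placeholders: 100 samples, 80 of class 0 and 20 of class 1.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # feature values are not used for stratification

# shuffle=True is an addition here so the held-out fold is not simply
# the first block of each class; the original snippet uses the default.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
tr_ids, val_ids = next(skf.split(X, y))

print(len(tr_ids), len(val_ids))   # sizes of the two index arrays
print(np.bincount(y[val_ids]))     # class counts in the validation fold
```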