Data block split_subsets - define train and val size independently

I tend to work with a small subset of the data first, then increase the dataset size. The data block API offers use_partial_data to select a subset of the data (applied to train and valid alike). However, I want to use the full valid ds and only train on a smaller train ds. With the current API you can achieve this by combining use_partial_data, some manual math, and random_split_by_pct, but this is a bit cumbersome.
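To make the "manual math" concrete, here is a rough sketch of one way to read it, assuming use_partial_data first keeps a fraction of all items and random_split_by_pct then splits that fraction. The helper name `partial_and_valid_pct` and the variable names are made up for illustration and are not part of the library:

```python
def partial_and_valid_pct(train_size, valid_size):
    """Convert independent train/valid fractions of the full dataset into
    (fraction to keep overall, valid fraction of the kept part).

    Hypothetical helper, only meant to illustrate the arithmetic.
    """
    sample_pct = train_size + valid_size      # fraction of all items to keep
    if not 0 < sample_pct <= 1:
        raise ValueError("train_size + valid_size must be in (0, 1]")
    valid_pct = valid_size / sample_pct       # share of the kept items used for validation
    return sample_pct, valid_pct


# a tenth of the usual 80% train split, with the usual 20% valid split
sample_pct, valid_pct = partial_and_valid_pct(0.08, 0.2)
print(f"{sample_pct:.2f} {valid_pct:.3f}")  # 0.28 0.714
```

So to train on 8% of the data while validating on 20%, you would keep 28% of the items and then put about 71.4% of those into the valid set, which is exactly the kind of bookkeeping that gets cumbersome.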

I wrote a little function split_subsets which allows you to define the size of the train and valid ds independently. It looks something like this:

# do the normal split, like random_split_by_pct
ImageList.from_csv(...).split_subsets(train_size=.8, valid_size=.2)
# only work with a tenth of the train ds but keep the valid ds size
ImageList.from_csv(...).split_subsets(train_size=.08, valid_size=.2)
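For anyone wondering what such a split could do under the hood, here is a minimal, library-independent sketch that works on plain indices. It is not the implementation from this post; the function name `split_subsets_idxs` and the `seed` argument are assumptions made for illustration. The valid indices are drawn first, so with a fixed seed the valid set stays the same while the train size varies:

```python
import random


def split_subsets_idxs(n, train_size, valid_size, seed=None):
    """Return (train_idxs, valid_idxs): two disjoint random subsets of range(n).

    train_size and valid_size are fractions of the full dataset.
    Hypothetical helper, not a fastai function.
    """
    if not (0 < train_size <= 1 and 0 < valid_size <= 1):
        raise ValueError("sizes must be fractions in (0, 1]")
    if train_size + valid_size > 1 + 1e-9:
        raise ValueError("train_size + valid_size must not exceed 1")

    rng = random.Random(seed)
    idxs = list(range(n))
    rng.shuffle(idxs)

    n_valid = int(n * valid_size)
    n_train = int(n * train_size)

    # take valid first so it does not depend on train_size for a given seed
    valid_idxs = idxs[:n_valid]
    train_idxs = idxs[n_valid:n_valid + n_train]
    return train_idxs, valid_idxs


# full split vs. a tenth of the train set, same valid set
train_full, valid_full = split_subsets_idxs(1000, 0.8, 0.2, seed=42)
train_small, valid_small = split_subsets_idxs(1000, 0.08, 0.2, seed=42)
print(len(train_full), len(train_small), valid_full == valid_small)
```

Because the shuffle depends only on the seed, the small train set is also a prefix of the full one, which makes runs at different train sizes easier to compare.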

If you’re interested I can put together a PR. I’m also open to suggestions (especially regarding the name of the function :slight_smile:).

Here is the current implementation:

Oh that seems interesting, yes please do suggest a PR!

Here we go
