Data block split_subsets - define train and val size independently

sotte · February 27, 2019, 4:02pm

I tend to work with a small subset of the data first, then increase the dataset size. The data block API offers use_partial_data to select a subset of the data (independent of train and valid). However, I want to use the full valid ds and train on only a smaller train ds. With the current API you can achieve this by combining use_partial_data, some manual math, and random_split_by_pct, but this is a bit cumbersome.

I wrote a little function split_subsets which allows you to define the size of the train and valid ds independently. It looks something like this:

# do the normal split like random_split_by_pct
ImageList.from_csv(...).split_subsets(train_size=.8, valid_size=.2)
# only work with a tenth of the train ds but keep the valid ds size
ImageList.from_csv(...).split_subsets(train_size=.08, valid_size=.2)
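To illustrate the intended semantics (both sizes are fractions of the full dataset, and the two subsets never overlap), here is a minimal sketch in plain NumPy. It is not the implementation from this post and does not use the fastai API: the helper name split_subset_indices, the seed parameter, and the choice to take the validation indices from the end of the permutation are assumptions made for illustration.

```python
import numpy as np

def split_subset_indices(n_items, train_size, valid_size, seed=None):
    """Hypothetical helper: return (train_idx, valid_idx) as disjoint index arrays.

    train_size and valid_size are fractions of the full dataset,
    so they must be non-negative and sum to at most 1.
    """
    if train_size < 0 or valid_size < 0 or train_size + valid_size > 1:
        raise ValueError("train_size and valid_size must be >= 0 and sum to <= 1")

    # Shuffle all indices once; a fixed seed gives a reproducible split.
    perm = np.random.RandomState(seed).permutation(n_items)

    # Round down so the two subsets can never overlap.
    n_train = int(train_size * n_items)
    n_valid = int(valid_size * n_items)

    # Train indices come from the front, valid indices from the back.
    # With the same seed, the valid set is identical whatever train_size is.
    train_idx = perm[:n_train]
    valid_idx = perm[n_items - n_valid:]
    return train_idx, valid_idx

# Full-size split, then a smaller train set with the same valid set.
train_idx, valid_idx = split_subset_indices(100, 0.5, 0.25, seed=0)
small_train_idx, same_valid_idx = split_subset_indices(100, 0.125, 0.25, seed=0)
```

Taking the two subsets from opposite ends of one permutation means that shrinking train_size leaves the validation items unchanged, so metrics from small and large training runs stay comparable.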

If you’re interested, I can put together a PR. I’m also open to suggestions (especially regarding the name of the function).

Here is the current implementation:

sgugger · February 27, 2019, 4:19pm

Oh that seems interesting, yes please do suggest a PR!

sotte · February 27, 2019, 4:48pm

Here we go