Hi!
I have N training images in a folder and I want only x% of them in my DataLoaders' training set. One use case is plotting learning curves; the one I actually need is to re-implement self-supervised learning from Epoching's Blog using fastai2. At one point it shows how an SSL-pretrained classifier can learn digits with only 180 labeled samples. So I need a DataLoaders object with 180 training samples and 1000 validation samples.
I have found 2 solutions, but I would like to know which is the most fastai2-ic. That is, the one I can best build upon, with data augmentation and whatnot.
Motivation: I need all the “magic” defaults for image loading, conversion to normalized float tensors, label categorization, batching, etc., so I would rather stick to the high-level API (DataBlock).
For the next part, data_path points to the MNIST dataset downloaded with fastai2.
Solution 1. Write a custom data splitter
I need to take a percentage of the training set while keeping the validation set intact. The first place where the train and validation sets exist as distinct entities is after splitting, so my custom splitter handles both the actual splitting and the training-data subsetting:
def custom_splitter(train_name, valid_name, train_pct):
    def fn(name_list):
        # Split by grandparent folder, exactly as GrandparentSplitter alone would
        train_idx, valid_idx = GrandparentSplitter(train_name=train_name, valid_name=valid_name)(name_list)
        # Shuffle the training indices and keep only the first train_pct of them
        np.random.shuffle(train_idx)
        train_len = int(len(train_idx) * train_pct)
        return train_idx[:train_len], valid_idx
    return fn
mn_db = DataBlock(blocks=(ImageBlock(cls=PILImageBW), CategoryBlock),
                  get_items=get_image_files,
                  get_y=parent_label,
                  splitter=custom_splitter(train_name='training', valid_name='testing', train_pct=0.003))
mnist_small_dls = mn_db.dataloaders(data_path)
print(f"Training dataset: {len(mnist_small_dls.train_ds)} Validation dataset: {len(mnist_small_dls.valid_ds)}")
mnist_small_dls.show_batch()
The custom splitter calls GrandparentSplitter, shuffles the training indices, and slices off the first train_pct of them, returning two lists of indices as the DataBlock expects. With MNIST's 60,000 training images, train_pct=0.003 yields exactly the 180 samples I need.
Solution 2. Use a trick intended for testing on new samples
As pointed out in this thread, one can build a new test DataLoader with test_dl (labeling the new items via with_labels=True) and swap it in as the train DataLoader:
mnist_block = DataBlock(blocks=(ImageBlock(cls=PILImageBW), CategoryBlock),
                        get_items=get_image_files,
                        get_y=parent_label,
                        splitter=GrandparentSplitter(train_name='training', valid_name='testing'))
mnist_dls = mnist_block.dataloaders(source=data_path)
# Pick 180 training items at random, without replacement
selected_items = np.random.choice(mnist_dls.train_ds.items, 180, replace=False)
# Create a new dataloader and replace the existing train dataloader
mnist_dls.train = mnist_dls.test_dl(selected_items, with_labels=True)
print(f"Training dataset: {len(mnist_dls.train_ds)} Validation dataset: {len(mnist_dls.valid_ds)}")
mnist_dls.show_batch()
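As an aside (my own tweak, not from the thread), seeding the selection makes the 180-sample subset reproducible across runs, assuming rng.choice accepts the items list the same way np.random.choice does:

rng = np.random.default_rng(42)  # arbitrary example seed
selected_items = rng.choice(mnist_dls.train_ds.items, 180, replace=False)
mnist_dls.train = mnist_dls.test_dl(selected_items, with_labels=True)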
Also, writing my own get_items, as sgugger suggested in the same thread, won't do the trick: I want my validation set fully intact.
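To make that concrete, here is a minimal sketch (my own illustration, not sgugger's code) of why a subsetting get_items cannot work: it runs before the split, so it thins out the testing folder too:

def get_image_files_subset(path, pct=0.003):
    files = get_image_files(path)
    # The sample is drawn from the training AND testing folders alike,
    # so the validation set would shrink as well
    return L(np.random.choice(files, int(len(files) * pct), replace=False))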
Discussion
IMHO the 1st way is more fastai2-ic because it plugs a custom splitter callback into the DataBlock; the 2nd approach is a bit hacky. Both require shuffling the items myself, since passing shuffle=True to mnist_block.dataloaders throws an error.
It would probably feel more natural with the mid-level API, but there, afaik, you have to write all the transforms yourself (e.g. loading, converting to float32, normalization, categorization), so no magic. Also, for the 2nd solution: will the eventual data augmentation carry over to the replaced train DataLoader?
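For the 1st solution at least, augmentation should just be a matter of adding batch_tfms to the DataBlock, since the custom splitter changes nothing downstream. A sketch, assuming standard aug_transforms arguments (the specific values here are arbitrary examples):

mn_db_aug = DataBlock(blocks=(ImageBlock(cls=PILImageBW), CategoryBlock),
                      get_items=get_image_files,
                      get_y=parent_label,
                      splitter=custom_splitter(train_name='training', valid_name='testing', train_pct=0.003),
                      batch_tfms=aug_transforms(do_flip=False, max_rotate=10.))
mnist_aug_dls = mn_db_aug.dataloaders(data_path)
mnist_aug_dls.show_batch()  # the augmented digits should be visible here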
Are there other ways, more in line with the fastai2 philosophy?
Thank you!