Hello,
You are correct, and I apologize for my incorrect comment about the default in dataset.py. Thank you for looking into this.
Looking at ‘from_csv’ from dataset.py (pasted below), when val_idxs is not specified it falls back to get_cv_idxs, which holds out 20% of the images for validation. So the default validation set is 20% of the images, not 1 image.
    def from_csv(cls, path, folder, csv_fname, bs=64, tfms=(None,None),
                 val_idxs=None, suffix='', test_name=None, continuous=False, skip_header=True, num_workers=8):
        """ Read in images and their labels given as a CSV file.

        This method should be used when training image labels are given in an CSV file as opposed to
        sub-directories with label names.

        Arguments:
            path: a root path of the data (used for storing trained models, precomputed values, etc)
            folder: a name of the folder in which training images are contained.
            csv_fname: a name of the CSV file which contains target labels.
            bs: batch size
            tfms: transformations (for data augmentations). e.g. output of `tfms_from_model`
            val_idxs: index of images to be used for validation. e.g. output of `get_cv_idxs`.
                If None, default arguments to get_cv_idxs are used.
            suffix: suffix to add to image names in CSV file (sometimes CSV only contains the file name without file
                    extension e.g. '.jpg' - in which case, you can set suffix as '.jpg')
            test_name: a name of the folder which contains test images.
            continuous: TODO
            skip_header: skip the first row of the CSV file.
            num_workers: number of workers

        Returns:
            ImageClassifierData
        """
        fnames,y,classes = csv_source(folder, csv_fname, skip_header, suffix, continuous=continuous)

        # This is the line in question: the default split is used when val_idxs is None.
        val_idxs = get_cv_idxs(len(fnames)) if val_idxs is None else val_idxs
        ((val_fnames,trn_fnames),(val_y,trn_y)) = split_by_idx(val_idxs, np.array(fnames), y)

        test_fnames = read_dir(path, test_name) if test_name else None
        if continuous:
            f = FilesIndexArrayRegressionDataset
        else:
            f = FilesIndexArrayDataset if len(trn_y.shape)==1 else FilesNhotArrayDataset
        datasets = cls.get_ds(f, (trn_fnames,trn_y), (val_fnames,val_y), tfms,
                              path=path, test=test_fnames)
        return cls(path, datasets, bs, num_workers, classes=classes)
In my case, my Kaggle log loss improved when I trained on the whole data set minus one image (val_idxs = [0]), so that only a single image is held out for validation.
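To make the difference between the two settings concrete, here is a rough sketch in plain Python. Note that `sketch_cv_idxs` and `sketch_split_by_idx` are my own simplified stand-ins written for illustration, not the fastai functions, and the file names are made up. I am assuming a 20% hold-out for the default case, as described above.

```python
import random

def sketch_cv_idxs(n, val_pct=0.2, seed=42):
    """Hypothetical stand-in for the default split: pick val_pct of n indices at random."""
    rng = random.Random(seed)
    idxs = list(range(n))
    rng.shuffle(idxs)
    return idxs[:int(val_pct * n)]

def sketch_split_by_idx(val_idxs, items):
    """Hypothetical stand-in: return (validation items, training items)."""
    val_set = set(val_idxs)
    val = [x for i, x in enumerate(items) if i in val_set]
    trn = [x for i, x in enumerate(items) if i not in val_set]
    return val, trn

# Made-up file names, only to show the sizes of each split.
fnames = [f"img_{i}.jpg" for i in range(100)]

# val_idxs=None: fall back to the default 20% hold-out.
default_val, default_trn = sketch_split_by_idx(sketch_cv_idxs(len(fnames)), fnames)

# val_idxs=[0]: hold out only the first image and train on everything else.
single_val, single_trn = sketch_split_by_idx([0], fnames)

print(len(default_val), len(default_trn))  # 20 80
print(len(single_val), len(single_trn))    # 1 99
```

With val_idxs = [0] the model sees almost all of the labelled images, which may be why the leaderboard score improved, but the validation metrics reported during training are then based on a single image and are not meaningful.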
As for unfreezing on the Dog Breed classification data, apparently it is best to use the pre-trained model as-is for this particular case: