Cross Validation with

(sergii makarevych) #1

Hey All.

No intro, just sharing how to do CV predictions with fastai for the train and test sets. Why not with ImageClassifierData.from_csv? Because it will precompute activations every time you change folds, both for train and for test. It should be enough to precompute activations once and then just change the train/validation sets based on the CV indexes.

So our steps:

  1. Precompute activations once:
    data = ImageClassifierData.from_csv(val_idxs =[0], test_name='test')
    learn = ConvLearner.pretrained(model, data, precompute=True)

  2. Create a function to update fc_data with new activations for the train/validation sets in your ConvLearner:

    def change_fc_data(learn, train_index, val_index):

     tmpl = f'_{}_{}.bc'
     names = [os.path.join(learn.tmp_path, p+tmpl) for p in ('x_act', 'x_act_val', 'x_act_test')]
     act, val_act, test_act = [ for p in names]
     data_x = np.vstack([val_act, act])
     data_y = np.array(list( + list(
     train_x = data_x[train_index] 
     valid_x = data_x[val_index]
     train_y = data_y[train_index] 
     valid_y = data_y[val_index]
     learn.fc_data = ImageClassifierData.from_arrays(learn.data_.path,
                     (train_x, train_y), 
                     (valid_x, valid_y),, classes=learn.data_.classes,
                     test = test_act if learn.data_.test_dl else None, num_workers=8)
     return learn
  3. Create CV iterator
    ind = pd.read_csv(f'{PATH}/labels.csv', index_col='id')
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    for train_index, val_index in skf.split(ind.index, ind['breed']):

  4. Inside the CV cycle, create learn and update its fc_data object:
    data = ImageClassifierData.from_csv(val_idxs =[0], test_name='test')
    learn = ConvLearner.pretrained(model, data, precompute=True)
    learn = change_fc_data(learn, train_index, val_index)

  5. Predict and save train/test
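
The overall flow of the steps above can be sketched without fastai. This is only a minimal illustration under stated assumptions: the "activations" are synthetic NumPy arrays, `stratified_folds` is a hypothetical hand-rolled stand-in for `StratifiedKFold`, and `centroid_probs` is a toy nearest-centroid classifier standing in for training the fully connected head. None of these names come from fastai. The point is just that the cached features are computed once and only re-sliced per fold, giving out-of-fold predictions for train and fold-averaged predictions for test.

```python
import numpy as np

def stratified_folds(labels, n_splits, seed=1):
    """Hypothetical stand-in for StratifiedKFold: yield (train_index, val_index) pairs."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    fold_of = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        fold_of[idx] = np.arange(len(idx)) % n_splits  # deal each class round-robin
    for k in range(n_splits):
        yield np.flatnonzero(fold_of != k), np.flatnonzero(fold_of == k)

def centroid_probs(train_x, train_y, x, n_classes):
    """Toy stand-in for the trained head: softmax over negative squared
    distances to the per-class centroids of the training features."""
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in range(n_classes)])
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -dist
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def cv_predict(data_x, data_y, test_x, n_classes, n_splits=5):
    """Return out-of-fold train predictions and fold-averaged test predictions."""
    oof = np.zeros((len(data_x), n_classes))
    test_pred = np.zeros((len(test_x), n_classes))
    for train_index, val_index in stratified_folds(data_y, n_splits):
        train_x, train_y = data_x[train_index], data_y[train_index]
        # Only the slices change per fold; the features themselves are reused.
        oof[val_index] = centroid_probs(train_x, train_y, data_x[val_index], n_classes)
        test_pred += centroid_probs(train_x, train_y, test_x, n_classes) / n_splits
    return oof, test_pred

# Synthetic "precomputed activations": 60 train rows, 9 test rows, 8 features.
rng = np.random.default_rng(0)
n_classes = 3
all_y = np.tile(np.arange(n_classes), 20)
all_x = rng.normal(size=(60, 8)) + 4.0 * all_y[:, None]
test_y = np.repeat(np.arange(n_classes), 3)
test_x = rng.normal(size=(9, 8)) + 4.0 * test_y[:, None]

# Mimic val_idxs=[0]: row 0 was cached as "validation", the rest as "train".
val_act, act = all_x[:1], all_x[1:]
data_x = np.vstack([val_act, act])  # stacking val first restores the original row order
data_y = all_y

oof, test_pred = cv_predict(data_x, data_y, test_x, n_classes)
```

Here `oof` holds one prediction per training row, each made by a model that never saw that row, and `test_pred` is the mean over the five fold models.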


(Jeremy Howard (Admin)) #2

That’s clever! Thanks for sharing.

(Another approach is to just leave precompute=False when doing CV.)


(WG) #3

So to clarify …

Inside the CV cycle is where you train, and predict/test, correct? And then after you’ve trained your 5 models, you average the predictions together to get a final prediction?

I like @jeremy’s recommendation to just set precompute=False. If I understand things right, it would eliminate the need for the change_fc_data function.


(sergii makarevych) #4

It’s just time consuming, right? With precompute=False we don’t have the advantage of not passing data through the whole network any more.


(Jeremy Howard (Admin)) #5

Right, but you don’t have the benefit of data augmentation either…


(sergii makarevych) #6

That’s it! This makes sense only for the Dog Breed challenge, where data augmentation does not help (at least for me).
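
The trade-off in the exchange above (precomputed activations are fast but see no augmentation) can be illustrated with a toy sketch. Everything here is hypothetical and is not fastai code: `backbone` stands in for the frozen convolutional layers, `augment` is a random horizontal flip, and `FeatureSource` mimics the two modes. With caching, the backbone runs once per image and every epoch sees identical features; without it, the backbone runs every epoch on a freshly augmented image.

```python
import random

def backbone(image):
    """Stand-in for the frozen conv layers: a deterministic,
    position-sensitive feature per image row."""
    return tuple(sum((i + 1) * v for i, v in enumerate(row)) for row in image)

def augment(image, rng):
    """Hypothetical augmentation: flip the image horizontally half the time."""
    return [row[::-1] for row in image] if rng.random() < 0.5 else image

class FeatureSource:
    """Serves features per epoch, either from a cache or by recomputing."""

    def __init__(self, images, precompute, seed=0):
        self.images = images
        self.precompute = precompute
        self.rng = random.Random(seed)
        self.backbone_calls = 0
        # With precompute, run the backbone once on the un-augmented images.
        self.cache = [self._features(img) for img in images] if precompute else None

    def _features(self, image):
        self.backbone_calls += 1
        return backbone(image)

    def epoch(self):
        if self.precompute:
            return list(self.cache)  # augmentation cannot influence cached features
        return [self._features(augment(img, self.rng)) for img in self.images]

images = [[[1, 2, 3], [4, 5, 6]], [[0, 1, 0], [2, 0, 0]]]
cached = FeatureSource(images, precompute=True)
live = FeatureSource(images, precompute=False)
cached_epochs = [cached.epoch() for _ in range(6)]
live_epochs = [live.epoch() for _ in range(6)]
```

After six epochs `cached.backbone_calls` equals the number of images, while `live.backbone_calls` equals images times epochs, which is the time cost mentioned above.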


Kaggle Comp: Plant Seedlings Classification

I’m trying to do cross validation also and I’m getting weird behavior. That is, every cross validation set except the first cross validation set gets a huge boost in validation accuracy when I switch to precompute=False for one epoch after many epochs with precompute=True. I’m talking like 85% to 97%. The first set stays essentially at the same accuracy when I switch to precompute=False (85%-> 86%).

My question is: did you implement this solution to make training faster, because you don’t need to precompute activations again? Or is there something inherently cheating about doing cross validation after you’ve already precomputed the activations with one set?

For the record, I’m doing something similar to the dog breeds prediction, but I’ve made csv files containing indexes of stratified validation sets that refer to the csv file containing class+path.


def get_val_idx_fromfile(validx_csv):
    validx_df =pd.read_csv(validx_csv, header=None)
    return validx_df[0].tolist()

def get_data(sz, bs, val_idxs, label_csv): # sz: image size, bs: batch size
    tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.0)
    data = ImageClassifierData.from_csv(PATH, 'train', label_csv,
                                   val_idxs=val_idxs, suffix='.png', tfms=tfms, bs=bs, num_workers=6)

    return data if sz > 300 else data.resize(340, 'tmp') 

label_csv = f'{PATH}3labels.csv'
vacc =[]

valididx_base = '3cls_val_ids'

for rep in range(reps):
    val_idxs = get_val_idx_fromfile(f'{PATH}'+valididx_base+str(rep+start)+'.csv')
    data = get_data(sz, 200, val_idxs, label_csv)
    learn = ConvLearner.pretrained(arch, data, precompute=False, ps = 0.2), 100)
    learn.precompute = False
    val_loss, val_acc =, 1)
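
For reference, the per-fold index files that get_val_idx_fromfile reads could be produced along these lines. This is a sketch under assumptions, not the poster's actual script: `write_val_idx_files` and `read_val_idx_file` are hypothetical helpers, the stratification is a simple round-robin deal within each class, and the file layout (one row index per line, no header, name built from a base plus the fold number) is inferred from the loop above.

```python
import csv
import os
import random
import tempfile
from collections import defaultdict

def write_val_idx_files(labels, n_splits, out_dir, base='3cls_val_ids', start=0, seed=1):
    """Write one CSV per fold containing the row indexes of its validation set."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        for pos, i in enumerate(idxs):
            folds[pos % n_splits].append(i)  # deal each class evenly across folds

    paths = []
    for k, fold in enumerate(folds):
        path = os.path.join(out_dir, f'{base}{k + start}.csv')
        with open(path, 'w', newline='') as f:
            csv.writer(f).writerows([i] for i in sorted(fold))  # no header row
        paths.append(path)
    return paths

def read_val_idx_file(path):
    """Plain-csv equivalent of reading column 0 with header=None."""
    with open(path, newline='') as f:
        return [int(row[0]) for row in csv.reader(f)]

# Toy labels: 6 of class 'a', 9 of 'b', 3 of 'c', in labels-file row order.
labels = ['a'] * 6 + ['b'] * 9 + ['c'] * 3
out_dir = tempfile.mkdtemp()
paths = write_val_idx_files(labels, n_splits=3, out_dir=out_dir)
folds = [read_val_idx_file(p) for p in paths]
```

Each index appears in exactly one file, so every row is used for validation once across the repetitions.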

(Fernando A.) #9

is it ok?