How do you integrate sklearn StratifiedShuffleSplit with fastai?

Curious how you would do some kind of k-fold cross-validation with the fastai library, using either the from_path or from_csv methods?

Also, is there a wrapper around Learner that integrates with the sklearn classifier interface, for example to use it with sklearn’s ensemble features or with the sklearn ecosystem in general, similar to the Keras wrapper?



I am not aware of any wrappers, but this thread discusses doing K-Fold: Dog Breed Identification challenge



Here is the full code snippet showing how I did it, but if someone knows a better way, I’m happy to hear it.

import numpy as np
from sklearn.model_selection import StratifiedKFold

# PATH, label_csv, f_model, transforms, sz, seed and lr are assumed to be defined earlier,
# and the fastai names (tfms_from_model, ImageClassifierData, ConvLearner) already imported
def get_data(sz, val_idxs, bs=64):
  tfms = tfms_from_model(f_model, sz, aug_tfms=transforms, max_zoom=1.1)
  return ImageClassifierData.from_csv(PATH, 'newtrain', label_csv, val_idxs=val_idxs,
                                      test_name='test', tfms=tfms, bs=bs)

# get the full dataset first and then use its labels to build the stratified splits
data = get_data(sz, [0])
skf = StratifiedKFold(n_splits=4, random_state=seed, shuffle=True)
splits = skf.split(np.zeros(len(data.trn_y)), data.trn_y)
datas = []
for train_index, val_index in splits:
  datas.append(get_data(sz, val_idxs=val_index))

learn = ConvLearner.pretrained(f_model, datas[0], precompute=False)

# loop through each fold and train for a bit (the same learner is reused across folds)
for data in datas:
  learn.set_data(data)
  learn.fit(lr, 3, cycle_len=1, cycle_mult=2)
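
For anyone wondering what a stratified split hands back, here is a minimal pure-Python sketch of the idea. It is not sklearn's actual algorithm, just an illustration: indices are grouped by class, shuffled, and dealt round-robin into folds so each validation fold keeps roughly the overall class ratio. The function name `stratified_folds` and the toy labels are made up for this example.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits, seed=0):
    """Split indices into n_splits validation folds, keeping class ratios similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)                      # shuffle within each class
        for pos, idx in enumerate(idxs):
            folds[pos % n_splits].append(idx)  # deal round-robin into folds
    return [sorted(fold) for fold in folds]

labels = ['cat'] * 8 + ['dog'] * 4             # toy labels with a 2:1 class ratio
val_folds = stratified_folds(labels, n_splits=4, seed=42)

for val_idx in val_folds:
    held_out = set(val_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    # every validation fold here ends up with 2 'cat' and 1 'dog'
    print(len(train_idx), Counter(labels[i] for i in val_idx))
```

Each `val_idx` list plays the same role as the `val_idxs` argument passed to `get_data` in the snippet above.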

I’m trying something similar, but starting with 100 epochs of precompute=True and then switching to precompute=False. When I do this, the first k-fold group always has about 85% validation accuracy before setting precompute=False, and about the same accuracy afterwards. However, on the second k-fold set, it jumps from about 85% val_acc with precompute=True to about 97% val_acc after I set precompute=False and train for one epoch.

Is this cheating in some way, because the activations or answers were saved somewhere after the first round of training with precompute=True? Or is it maybe swapping the validation/test sets in subsequent CV groups?

Nope, I think it’s a result of how K-fold works. On the first iteration the model trains on 2/3 of the data (call that fold A) and is evaluated on the remaining 1/3 (fold B). On the second iteration it trains on a different 2/3 (fold C) and is evaluated on another 1/3 (fold D). The problem is that fold D by definition includes a large share of fold A, if not all of it, so the model has already seen that data in the first pass and does much better on the second validation set. That score is very misleading, which is why K-fold used in that context doesn’t yield great numbers.

If you truly have a lot of data you can do non-overlapping splits, but then you might be better off training a full model on all the data and regularizing better, then training a couple more slight perturbations of that model on the full data and averaging all their results, either naively or with some kind of classifier at the end (i.e. stacking). Hope that helps.
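
The overlap described above can be checked with a few lines of plain Python. This is only a sketch using simple contiguous folds on toy indices (the helper `kfold_indices` is made up for the example, not a fastai or sklearn function). It measures how much of the second validation fold was already in the first training set.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) lists for plain, unshuffled k-fold."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size

folds = list(kfold_indices(n_samples=12, n_splits=3))
train_1, val_1 = folds[0]   # first iteration: train on 2/3, validate on 1/3
train_2, val_2 = folds[1]   # second iteration: a different 1/3 is held out

# samples validated on in iteration 2 that were trained on in iteration 1
overlap = set(val_2) & set(train_1)
leak_fraction = len(overlap) / len(val_2)
print(sorted(overlap), leak_fraction)
```

Because k-fold validation folds are disjoint, the whole second validation fold sits inside the first training set. If one model is carried from fold to fold, it has already trained on every sample it is later validated on. Starting from a fresh model for each fold would avoid that particular overlap.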

Hello, do you know how to implement cross-validation in fastai v1?