Cross-Validation gives funky results

Hello all, let me first say that I am new to the forums and to fast.ai/Python in general, so if this topic belongs in another part of the forum, please feel free to tell me.

I am trying to make cross-validation work, but the results are odd. The final predictions (i.e. the probability values for target class == 1) look fine for the first fold only. For all the other folds, however, the probabilities get close to 100% for every item in the test set, which cannot possibly be right, as the probabilities should be very small.

My CV-loop looks like this:

And the result is this:

I know I am making a stupid mistake somewhere in the loop but can’t seem to figure it out. I have experimented with it for almost an entire day but can’t get it working.

I would really appreciate it if someone could help me out and point out my mistake.

Regards,
Michael.


A little late, but one explanation could be that the learner is not “purged” between loop iterations, so you keep training the same model across all the folds?

You could try calling .destroy() on your learner between iterations and see if that changes anything.
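
For example, something roughly like this (an untested sketch, assuming skf is a StratifiedKFold and that train_df, cat_names, cont_names, procs, path and dep_var are set up as in the k-fold loop in the reply below):

from fastai.tabular import *

for train_index, val_index in skf.split(train_df.index, train_df[dep_var]):
    data_fold = (TabularList.from_df(train_df, path=path, cat_names=cat_names.copy(),
                                     cont_names=cont_names.copy(), procs=procs)
                 .split_by_idxs(train_index, val_index)
                 .label_from_df(cols=dep_var)
                 .databunch())

    # build a brand-new learner for every fold instead of reusing the previous one
    learn = tabular_learner(data_fold, layers=[200, 100], metrics=accuracy)
    learn.fit(1)

    # collect whatever you need from this fold, e.g. the validation predictions
    preds, _ = learn.get_preds(ds_type=DatasetType.Valid)

    # release the model and its memory before the next fold
    learn.destroy()
    del learn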

That is quite interesting indeed. I posted this in another thread, but here is my implementation of k-fold cross-validation:

from fastai.tabular import *
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)  # e.g. 5 stratified folds

val_pct = []   # per-fold validation accuracy
test_pct = []  # per-fold test accuracy

for train_index, val_index in skf.split(train_df.index, train_df[dep_var]):
    # databunch for this fold, reusing the processor fitted on all the data
    data_fold = (TabularList.from_df(train_df, path=path, cat_names=cat_names.copy(),
                                     cont_names=cont_names.copy(), procs=procs,
                                     processor=data_init.processor) # Very important
                 .split_by_idxs(train_index, val_index)
                 .label_from_df(cols=dep_var)
                 .databunch())

    # the (labeled) test set, packaged so it can be evaluated like a validation set
    data_test = (TabularList.from_df(test_df, path=path, cat_names=cat_names.copy(),
                                     cont_names=cont_names.copy(), procs=procs,
                                     processor=data_init.processor) # Very important
                 .split_none()
                 .label_from_df(cols=dep_var))
    data_test.valid = data_test.train
    data_test = data_test.databunch()

    learn = tabular_learner(data_fold, layers=[200, 100], metrics=accuracy)
    learn.fit(1)

    # accuracy on this fold's validation split
    _, val = learn.validate()

    # swap in the test set as the validation DataLoader and evaluate again
    learn.data.valid_dl = data_test.valid_dl
    _, test = learn.validate()

    val_pct.append(val.numpy())
    test_pct.append(test.numpy())
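
Afterwards you can summarize the per-fold numbers however you like, for example (just an illustrative snippet):

import numpy as np

# mean and spread of the k-fold accuracies collected above
print(f'Valid accuracy: {np.mean(val_pct):.4f} +/- {np.std(val_pct):.4f}')
print(f'Test accuracy:  {np.mean(test_pct):.4f} +/- {np.std(test_pct):.4f}')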

That version is for a labeled test set. If your test set is unlabeled, change data_fold to:

    data_fold = (TabularList.from_df(train_df, path=path, cat_names=cat_names.copy(),
                                     cont_names=cont_names.copy(), procs=procs,
                                     processor=data_init.processor) # Very important
                 .split_by_idxs(train_index, val_index)
                 .label_from_df(cols=dep_var)
                 .add_test(TabularList.from_df(test_df, ...))
                 .databunch())
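
With add_test the test set is unlabeled, so instead of swapping valid_dl and calling validate() you would pull the predictions directly with get_preds; for example (a sketch; the preds[:, 1] part assumes a binary target where class 1 is the positive class):

    # after learn.fit(1) in the loop: predictions for the test set attached via add_test
    preds, _ = learn.get_preds(ds_type=DatasetType.Test)
    test_probs = preds[:, 1].numpy()  # predicted probability of class 1 for each test row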

The bit that says “Very important” relates back to building an initial databunch with all the data: some of the processes (the categorical mapping, for one) were not lining up between folds, so I built a normal databunch first and pulled its processor from it. I hope this helps solve your problem to some degree 🙂
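
For reference, data_init above comes from building that initial databunch on all of the data first; a rough sketch of how it could look (not the exact code):

# the initial list built on all the training data; its fitted processor
# (e.g. the Categorify / FillMissing / Normalize state) is reused per fold
data_init = (TabularList.from_df(train_df, path=path, cat_names=cat_names.copy(),
                                 cont_names=cont_names.copy(), procs=procs)
             .split_none()
             .label_from_df(cols=dep_var))
# data_init.processor is what gets passed as processor= in each fold above,
# so the category codes and normalization statistics line up across folds
# (calling .databunch() on this gives the "normal" databunch mentioned above)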
