After training my model, I'm using the following code to export all the predictions, so I can check which features contribute most to the errors:
```python
from fastai.tabular import *  # fastai v1

data = (TabularList.from_df(db, path='.', cat_names=catn, cont_names=contn,
                            procs=[Categorify, Normalize])
        .split_by_rand_pct(valid_pct=0.99, seed=666)
        # .split_by_idx(list(range(len(db) - 100)))
        .label_from_df(cols=['res'])
        .databunch())
learn = tabular_learner(data, layers=[2000, 2000, 500, 200, 50], metrics=rmse)
learn.load('/content/gdrive/My Drive/mark88')
res = learn.get_preds()
```
With `.split_by_rand_pct`, the result looks OK:
|       | 0         | 1         |
|-------|-----------|-----------|
| 0     | 0.179529  | 0.357674  |
| 1     | 0.304910  | -0.051293 |
| 2     | -0.104908 | -0.562119 |
| 3     | 0.483046  | 0.772420  |
| 4     | 0.865559  | 0.982078  |
| …     | …         | …         |
| 63799 | 0.432284  | 0.530628  |
| 63800 | 1.148532  | 1.201470  |
| 63801 | 0.777352  | 1.193923  |
| 63802 | 0.591223  | 1.042042  |
| 63803 | 0.086806  | 0.190620  |

63804 rows × 2 columns
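Incidentally, once the predictions are exported, ranking validation rows by absolute error is enough to see where the model goes wrong. A plain-Python sketch, assuming column 0 is the prediction and column 1 the target (the pairs below are copied from the first rows of the table above):

```python
# First few (prediction, target) pairs from the table above; treating
# column 0 as the prediction and column 1 as the target is my assumption.
pairs = [
    (0.179529, 0.357674),
    (0.304910, -0.051293),
    (-0.104908, -0.562119),
    (0.483046, 0.772420),
    (0.865559, 0.982078),
]

# Rank rows by absolute error, worst first, keeping the row index so the
# offending rows can be looked up in the original DataFrame.
ranked = sorted(
    ((abs(p - t), i) for i, (p, t) in enumerate(pairs)),
    reverse=True,
)
for err, i in ranked:
    print(f"row {i}: abs error {err:.6f}")
```

On these five rows, row 2 comes out worst; on the full export, the top of this ranking is where I'd start looking at feature values.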
But with `.split_by_idx`, the result becomes weird:
|       | 0           | 1         |
|-------|-------------|-----------|
| 0     | -260.968781 | -1.309333 |
| 1     | -261.455231 | -1.171183 |
| 2     | -261.944733 | -0.891598 |
| 3     | -262.437225 | -0.820981 |
| 4     | -262.932770 | -0.510826 |
| …     | …           | …         |
| 64344 | -0.031256   | -0.281038 |
| 64345 | 0.406735    | -0.180324 |
| 64346 | -1.687541   | -0.798508 |
| 64347 | -3.944418   | -0.072571 |
| 64348 | -1.901659   | -0.210721 |

64349 rows × 2 columns
What is the right way to use `split_by_idx` here?
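For context, my understanding from the fastai v1 data block docs is that `split_by_idx` takes the indices of the *validation* rows, so the commented-out line makes the first `len(db) - 100` rows the validation set and leaves only the last 100 rows for training, which also changes the statistics `Normalize` computes from the train split. A plain-Python sketch of the two index sets, assuming `len(db)` is 64449 (consistent with both tables: 64349 = 64449 − 100, and 63804 ≈ 0.99 × 64449):

```python
import random

n = 64449  # assumed len(db): consistent with both tables shown above

# split_by_rand_pct(valid_pct=0.99, seed=666): a random 99% of rows become
# the validation set (a sketch of the idea, not fastai's exact shuffling).
random.seed(666)
rand_valid_idx = random.sample(range(n), k=int(n * 0.99))

# split_by_idx(list(range(len(db) - 100))): the FIRST n - 100 rows, in their
# original order, become the validation set; only the last 100 rows are
# left for training.
contig_valid_idx = list(range(n - 100))

print(len(rand_valid_idx))    # 63804, matching the first table
print(len(contig_valid_idx))  # 64349, matching the second table
```

So the two splits select different rows and, more importantly, leave very different training sets behind, which is my guess at why the predictions diverge.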