After training my model, I'm using the following code to export all the predictions, so I can check which features contribute most to the errors:
```python
data = (tabular.TabularList.from_df(db, path='.', cat_names=catn, cont_names=contn,
                                    procs=[Categorify, Normalize])
        .split_by_rand_pct(valid_pct=0.99, seed=666)
        # .split_by_idx(list(range(len(db) - 100)))
        .label_from_df(cols=['res'])
        .databunch())
learn = tabular_learner(data, layers=[2000, 2000, 500, 200, 50], metrics=rmse)
learn.load('/content/gdrive/My Drive/mark88')
res = learn.get_preds()
```
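As a side note on sizes: if `split_by_idx` takes the *validation* indices (my reading of the fastai v1 data block API), the commented-out call marks all but the last 100 rows as validation. A quick sanity check, with the total row count inferred from the 64349 validation predictions shown below plus the 100 held-out rows:

```python
# Assumption: split_by_idx(valid_idx) puts the given row indices in the
# validation set and leaves the remaining rows for training.
n = 64349 + 100  # total rows, inferred from the get_preds() output below

valid_idx = list(range(n - 100))   # the commented-out split in the code above
train_size = n - len(valid_idx)    # rows left over for training

print(train_size)  # only 100 rows end up in the training set
```

So in that configuration the model (and the `Normalize` statistics) would be fit on just the last 100 rows.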
With `.split_by_rand_pct` the result looks OK:
|       | 0 | 1 |
|-------|---|---|
| 0 | 0.179529 | 0.357674 |
| 1 | 0.304910 | -0.051293 |
| 2 | -0.104908 | -0.562119 |
| 3 | 0.483046 | 0.772420 |
| 4 | 0.865559 | 0.982078 |
| … | … | … |
| 63799 | 0.432284 | 0.530628 |
| 63800 | 1.148532 | 1.201470 |
| 63801 | 0.777352 | 1.193923 |
| 63802 | 0.591223 | 1.042042 |
| 63803 | 0.086806 | 0.190620 |
63804 rows × 2 columns
But with `.split_by_idx` the result becomes weird:
|       | 0 | 1 |
|-------|---|---|
| 0 | -260.968781 | -1.309333 |
| 1 | -261.455231 | -1.171183 |
| 2 | -261.944733 | -0.891598 |
| 3 | -262.437225 | -0.820981 |
| 4 | -262.932770 | -0.510826 |
| … | … | … |
| 64344 | -0.031256 | -0.281038 |
| 64345 | 0.406735 | -0.180324 |
| 64346 | -1.687541 | -0.798508 |
| 64347 | -3.944418 | -0.072571 |
| 64348 | -1.901659 | -0.210721 |
64349 rows × 2 columns
What is the right way to use `split_by_idx` here?