After training my model, I'm using the following code to export all the predictions, so I can check which features contribute most to the errors:
```python
data = (tabular.TabularList.from_df(db, path='.', cat_names=catn, cont_names=contn,
                                    procs=[Categorify, Normalize])
        .split_by_rand_pct(valid_pct=0.99, seed=666)
        # .split_by_idx(list(range(len(db) - 100)))
        .label_from_df(cols=['res'])
        .databunch())
learn = tabular_learner(data, layers=[2000, 2000, 500, 200, 50], metrics=rmse)
learn.load('/content/gdrive/My Drive/mark88')
res = learn.get_preds()
```
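As a side note on sizes: if `split_by_idx` takes the *validation* indices (my reading of the fastai v1 data block API), the commented-out call marks all but the last 100 rows as validation. A quick sanity check, with the total row count inferred from the 64349 validation predictions shown below plus the 100 held-out rows:

```python
# Assumption: split_by_idx(valid_idx) puts the given row indices in the
# validation set and leaves the remaining rows for training.
n = 64349 + 100  # total rows, inferred from the get_preds() output below

valid_idx = list(range(n - 100))   # the commented-out split in the code above
train_size = n - len(valid_idx)    # rows left over for training

print(train_size)  # only 100 rows end up in the training set
```

So in that configuration the model (and the `Normalize` statistics) would be fit on just the last 100 rows.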
With `.split_by_rand_pct` the result looks OK:
|       | 0 | 1 |
|-------|---|---|
| 0 | 0.179529 | 0.357674 |
| 1 | 0.304910 | -0.051293 |
| 2 | -0.104908 | -0.562119 |
| 3 | 0.483046 | 0.772420 |
| 4 | 0.865559 | 0.982078 |
| … | … | … |
| 63799 | 0.432284 | 0.530628 |
| 63800 | 1.148532 | 1.201470 |
| 63801 | 0.777352 | 1.193923 |
| 63802 | 0.591223 | 1.042042 |
| 63803 | 0.086806 | 0.190620 |
63804 rows × 2 columns
But with `.split_by_idx` the result becomes weird:
|       | 0 | 1 |
|-------|---|---|
| 0 | -260.968781 | -1.309333 |
| 1 | -261.455231 | -1.171183 |
| 2 | -261.944733 | -0.891598 |
| 3 | -262.437225 | -0.820981 |
| 4 | -262.932770 | -0.510826 |
| … | … | … |
| 64344 | -0.031256 | -0.281038 |
| 64345 | 0.406735 | -0.180324 |
| 64346 | -1.687541 | -0.798508 |
| 64347 | -3.944418 | -0.072571 |
| 64348 | -1.901659 | -0.210721 |
64349 rows × 2 columns
What is the right way to use `split_by_idx` here?