How is train_loss calculated and how to reproduce it?

I trained my model as follows:

from fastai.tabular import *   # fastai v1
import pandas as pd

df = pd.read_csv('/content/gdrive/My Drive/atm.csv')
data = (TabularList.from_df(df, path='.', cat_names=cat_names, cont_names=cont_names,
                            procs=[Categorify, Normalize])
        .split_by_rand_pct(valid_pct=0.1, seed=88)
        .label_from_df(cols=[dep_var])
        .databunch())
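
(To sanity-check the processed data before training, fastai v1's show_batch can be used:)

# Peek at a few categorified/normalized rows from the training set:
data.show_batch(rows=5)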

from fastai.callbacks import SaveModelCallback

learn = tabular_learner(data, layers=[2000, 2000, 500, 200, 50], metrics=exp_rmspe)
learn.lr_find()
learn.recorder.plot()
learn.fit_one_cycle(50, max_lr=1e-1,
                    callbacks=[SaveModelCallback(learn,
                                                 monitor='valid_loss',
                                                 mode='min',
                                                 name='/content/gdrive/My Drive/atm88')])
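
(For reference, the smoothed per-batch training losses behind the printed train_loss can be inspected through the recorder; a minimal sketch, assuming fastai v1's Recorder API:)

import matplotlib.pyplot as plt

# learn.recorder keeps one smoothed loss value per training batch; the
# train_loss printed for an epoch is the value at that epoch's last batch.
losses = [l.item() for l in learn.recorder.losses]
plt.plot(losses)
plt.xlabel('batch')
plt.ylabel('smoothed train_loss')
plt.show()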

I got a fairly acceptable result, with train_loss = 0.0245 and valid_loss = 0.0199 at epoch 47:

epoch  train_loss  valid_loss            exp_rmspe  time
41     0.027972    0.020318              0.161073   00:01
42     0.027237    0.022235              0.154306   00:01
43     0.026908    31822.730469          inf        00:01
44     0.024176    47407947776.000000    inf        00:01
45     0.024114    49439031296.000000    0.157786   00:01
46     0.023868    10416965.000000       inf        00:01
47     0.024527    0.019987              0.148315   00:01
48     0.023762    4771567.000000        0.148559   00:01
49     0.021999    121208012800.000000   0.144663   00:01

Better model found at epoch 0 with valid_loss value: 1.5367414425564742e+19.
Better model found at epoch 1 with valid_loss value: 0.09597836434841156.
Better model found at epoch 3 with valid_loss value: 0.026210423558950424.
Better model found at epoch 4 with valid_loss value: 0.021548109129071236.
Better model found at epoch 41 with valid_loss value: 0.0203182864934206.
Better model found at epoch 47 with valid_loss value: 0.0199868306517601.

Then I tried to validate:

learn.load('/content/gdrive/My Drive/atm88')
print("Validation:", learn.validate(learn.data.train_dl), learn.validate(learn.data.valid_dl))

I got the same valid_loss, but not the same train_loss:

Validation: [146669540000000.0, tensor(inf)] [0.01998683, tensor(0.1483)]

Why is the number for train_dl so high, and how can I reproduce the train_loss of 0.0245 from epoch 47?
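
(From skimming the fastai v1 source, the printed train_loss appears to be a debiased exponentially weighted moving average of the per-batch losses (the SmoothenValue logic, beta = 0.98), computed in train mode while the weights are still updating, rather than a fresh pass over the training set. A minimal sketch of that calculation:)

# Sketch of fastai v1's smoothed training loss (SmoothenValue, beta=0.98).
# The train_loss printed for an epoch is this debiased running average at
# the epoch's last batch, not an eval-mode average over the whole set.
def smoothed_losses(raw_batch_losses, beta=0.98):
    avg, out = 0.0, []
    for n, loss in enumerate(raw_batch_losses, start=1):
        avg = beta * avg + (1 - beta) * loss
        out.append(avg / (1 - beta ** n))  # debias the running average
    return out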

I also tried to see what's going on with the training data:

subdata = df.iloc[:20]
preds = [learn.predict(row)[1].tolist()[0] for row in subdata.itertuples()]
df2 = pd.DataFrame(list(zip(preds, subdata[dep_var].tolist())))
print("predict\n", df2)

I found that some predictions (column 0) can be very far from the target (column 1):

predict
0 1
0 -3.650368 -3.873408
1 -3.780042 -3.888676
2 -3.916492 -4.236845
3 -4.213390 -3.720628
4 -3.999758 -3.998456
5 -4.114305 -4.055614
6 -4.004932 -4.070062
7 -4.070571 -4.085330
8 -4.116131 -4.389104
9 -4.159841 -4.344652
10 -4.041779 -3.166078
11 -3.984901 -3.725693
12 -3.089322 -3.084433
13 2459.849609 -3.005506
14 -3.880156 -3.530475
15 -3.999149 -3.743783
16 -3.651942 -3.664363
17 -3.712008 -3.664363
18 -3.615193 -3.684181
19 -3.978493 -3.613800

But during training, both the train_loss and valid_loss were much lower.
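
(A single extreme prediction like row 13 above is already enough to dominate a mean-squared-style loss over the whole set; a quick back-of-the-envelope check:)

# One outlier row contributes a huge squared error on its own:
outlier_sq_err = (2459.849609 - (-3.005506)) ** 2
print(outlier_sq_err)  # ~6.07e6, versus ~0.02 for a typical row
# Averaged into the training set, a handful of rows like this is enough to
# push validate(train_dl) into the astronomical range seen above.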

Any comments? I thought what I did was very basic practice, so this problem should be fairly common.

Maybe try training it with a lower lr (e.g. 1e-4) and see how the loss behaves. The validation loss is jumping around, and there are a few inf values in your metric, I think.
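
(For example, something like this: the same setup as above with a smaller max_lr; the save name 'atm88_lowlr' is just a placeholder:)

# Same one-cycle training as before, with a much smaller learning rate:
learn.fit_one_cycle(50, max_lr=1e-4,
                    callbacks=[SaveModelCallback(learn, monitor='valid_loss',
                                                 mode='min', name='atm88_lowlr')])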

But no matter what parameters it uses, train_loss should be the number the model already calculated during training, right? Why did it change later when I called validate()?
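
(For reference, my rough mental model of what validate() does: a fresh pass over the dataloader in eval mode with the final weights, unlike train_loss, which is a running average taken while the weights were still changing. A sketch, assuming fastai v1 tabular batches arrive as (x_cat, x_cont) pairs:)

import torch

# Approximation of learn.validate(dl): average the batch losses over one
# full pass in eval mode (dropout off, batchnorm in inference mode).
def eval_loss(learn, dl):
    model = learn.model.eval()
    total, count = 0.0, 0
    with torch.no_grad():
        for xb, yb in dl:
            preds = model(*xb)               # tabular batch: (x_cat, x_cont)
            loss = learn.loss_func(preds, yb)
            total += loss.item() * yb.size(0)
            count += yb.size(0)
    return total / count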