The output does, yes, but we want the input to it centered at zero, so the output of the sigmoid is centered at 0.5.

I think I finally understand this. We use sigmoid without the +0.5 offset because we want the model’s raw outputs to be centered around 0, which makes it easier to train. In the first example (before the output is passed through the sigmoid function), the outputs are centered around 0.5, with some results less than 0 and some greater than 1. It is actually easier for the model to find good weights if we drop the extra shift and let a sigmoid handle it instead. If my intuition is correct, that also means the model can feed a wider range of values into the sigmoid, so it can push things that are definitely close to 0% very far into the negative, and the sigmoid will convert that to a value close to 0%.

Yes exactly. It’s easier for a model to come up with a set of weights that just spits out some big number to mean “survived” than to have to make it equal exactly `1.0`.
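A minimal sketch of that point, in plain Python. The raw outputs below are made-up numbers, not values from the Titanic model; the idea is only to compare the absolute error against a target of `1.0` with and without a sigmoid on the output.

```python
import math

def sigmoid(x: float) -> float:
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

target = 1.0  # "survived"
rows = []
for raw in [1.0, 3.0, 10.0]:  # hypothetical raw model outputs
    err_without = abs(raw - target)          # output must hit 1.0 exactly
    err_with = abs(sigmoid(raw) - target)    # any large output is good enough
    rows.append((raw, err_without, err_with))
    print(f"raw={raw:5.1f}  error without sigmoid={err_without:.5f}  with sigmoid={err_with:.5f}")
```

Without the sigmoid, a bigger raw output moves further from the target; with it, a bigger raw output only moves closer.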

Here’s another way to look visually at the intended effect of centering input data.

I’ve made a tiny Desmos demo that

- plots the sigmoid function curve
- plots the sigmoid results of two different data groups
  - one centered at zero
  - the other starting from zero
- offsets them a tiny bit up and down so they don’t overlap (this is just for visual clarity)

You can see that

- the data group centered at zero gets distributed by the sigmoid function across the 0…1 range (on the y axis), centered at 0.5
- while the data group starting at zero only gets distributed from 0.5 upwards.

Of course, this is overly simplified because I’ve omitted the in-between operations, but hopefully this demo provides some intuition about that form of initialization.
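For anyone without access to the demo, here is a rough numeric version of the same comparison. It is only a sketch: the two data groups are evenly spaced made-up values, not the ones plotted in Desmos.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Two hypothetical data groups with the same spread.
centered = [x / 2 for x in range(-8, 9)]   # -4.0 ... 4.0, centered at zero
from_zero = [x / 2 for x in range(0, 17)]  #  0.0 ... 8.0, starting from zero

centered_out = [sigmoid(x) for x in centered]
from_zero_out = [sigmoid(x) for x in from_zero]

def summarize(name, values):
    mean = sum(values) / len(values)
    print(f"{name:10s} min={min(values):.3f} mean={mean:.3f} max={max(values):.3f}")
    return mean

centered_mean = summarize("centered", centered_out)
from_zero_mean = summarize("from zero", from_zero_out)
```

The centered group covers most of the 0…1 range with its mean at 0.5, while the group starting from zero never goes below 0.5.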

Is there a link available for the recorded session on Terminal from Thursday, May 26? Thanks.

Stefan Josef! It has been a while since those FSDL study sessions, good times.

### learn.get_preds() and learn.predict() appear to give different results

This is a problem I’ve been trying to resolve for over a week. I’m working on a Text Classification project using Twitter. My results have just not been making sense.

Finally, I tracked it down to something I must not understand about the different ways of running inference: `learn.get_preds` and `learn.predict`. Here is sample code showing the apparently different results using the same trained model and dataset:

```
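import pandas as pd

model = "ulmfit.pkl"
c = congress/"RepWalorski.csv"  # `congress` is assumed to be a Path defined elsewhere
df = pd.read_csv(c)
# Strip carriage returns; regex=True is required on pandas >= 2.0, where the default is a literal match
df.text = df.text.str.replace('[\\r]', '', regex=True)
df = df.reset_index()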
```

Here, I used `learn.get_preds()`:

```
dl_test = TextDataLoaders.from_df(df, valid_pct = 0, text_col='text', shuffle=False)[0]
predicts, actuals = learn.get_preds(dl = dl_test)
predicts
Out[307]:
TensorText([[0.8213, 0.1787],
[0.8429, 0.1571],
[0.8596, 0.1404],
[0.8174, 0.1826],
[0.8157, 0.1843],
[0.8967, 0.1033],
[0.8972, 0.1028]])
```

Here I expected to get the same results by iterating through the data using `learn.predict()`. But I didn’t.

```
learn = load_learner(model)
res = []
for t in df.text:
    party, cat, probs = learn.predict(t)
    res.append([probs[0].item(), probs[1].item()])
np.array(res)
Out[306]:
array([[3.17013037e-04, 9.99683022e-01],
[3.80592272e-02, 9.61940706e-01],
[5.17464876e-01, 4.82535094e-01],
[9.64576146e-04, 9.99035478e-01],
[1.05449706e-01, 8.94550323e-01],
[4.67417343e-03, 9.95325804e-01],
[3.93099725e-01, 6.06900275e-01]])
```

The second predictions are what I expected. I’m not sure what I did wrong using `learn.get_preds()`.

Moreover, when I look at this week’s lecture (~1:21), what @jeremy is doing with the Titanic `dl` is what I think I am doing with a text `dl`.

I’ve spent too much time on this and am stuck. Help!

Many thanks

Maybe try passing an activation that does nothing, since `learn.get_preds` is giving you the probability outputs but you seem to want the raw outputs (I think these are called raw outputs, z, or logits):

```
predicts, actuals = learn.get_preds(dl = dl_test, act=lambda x:x)
```
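As background on the distinction between raw outputs and probabilities, here is a small sketch in plain Python. It assumes a two-class classifier whose final activation is a softmax (an assumption about the model, not something shown in the thread), and the logits are made-up numbers.

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.5, -0.2]     # hypothetical raw outputs for one tweet
probs = softmax(logits)  # what a probability-style prediction would look like
print("logits:", logits)
print("probs: ", [round(p, 4) for p in probs])
```

The ranking of the classes is the same either way; only the scale changes, so an identity activation should change the numbers but not which class wins.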

@wyquek thanks for offering to help. Unfortunately, passing `act=lambda x:x` has no effect on the results.

I am stumped as to why these predictions don’t just work out-of-the-box.

### Solved!!

I went through the fastai source code for `learn.predict()` and realized that this code was **wrong**:

```
dl_test = TextDataLoaders.from_df(df, valid_pct = 0, text_col='text', shuffle=False)[0]
predicts, actuals = learn.get_preds(dl = dl_test)
predicts
```

This code does work. If there is a more idiomatic way to do this, please let me know.

```
model = "ulmfit.pkl"
learn = load_learner(model)
dl_test = learn.dls.test_dl(df.text.to_list())
predicts = learn.get_preds(dl = dl_test)
predicts
Out[93]:
(TensorText([[3.1701e-04, 9.9968e-01],
[3.8059e-02, 9.6194e-01],
[5.1746e-01, 4.8254e-01],
[9.6458e-04, 9.9904e-01],
[1.0545e-01, 8.9455e-01],
[4.6742e-03, 9.9533e-01],
[3.9310e-01, 6.0690e-01]]),
None)
```

Can you explain more? What code was wrong, and how was it wrong?

Sorry. To be clear, I meant to say **my** code was wrong. The source code is fine. In fact, it was the source code that helped me figure out why `learn.predict()` was working for me, but `learn.get_preds` was not. I looked through the fastai documentation, tutorials and the forums and just wasn’t able to figure it out. I should have looked at the source code sooner...

This code is **wrong**. Unfortunately, it didn’t fail with an error, so I didn’t recognize that I had made a mistake as I was getting back answers. Only after really looking at the results did I conclude that something was wrong.

```
dl_test = TextDataLoaders.from_df(df, valid_pct = 0, text_col='text', shuffle=False)[0]
predicts, actuals = learn.get_preds(dl = dl_test)
predicts
Out[307]:
TensorText([[0.8213, 0.1787],
[0.8429, 0.1571],
[0.8596, 0.1404],
[0.8174, 0.1826],
[0.8157, 0.1843],
[0.8967, 0.1033],
[0.8972, 0.1028]])
```

This code **works**, though I don’t know if it is idiomatic:

```
dl_test = learn.dls.test_dl(df.text.to_list())
predicts = learn.get_preds(dl = dl_test)
predicts
Out[93]:
(TensorText([[3.1701e-04, 9.9968e-01],
[3.8059e-02, 9.6194e-01],
[5.1746e-01, 4.8254e-01],
[9.6458e-04, 9.9904e-01],
[1.0545e-01, 8.9455e-01],
[4.6742e-03, 9.9533e-01],
[3.9310e-01, 6.0690e-01]]),
None)
```

From what I see it looks exactly like how to use the inference API!
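One likely explanation for the difference (my reading, not something confirmed in the thread) is that building a fresh `TextDataLoaders` from the new dataframe also builds a fresh vocabulary, so the same words get different token ids than the ones the model was trained on, whereas `learn.dls.test_dl` reuses the learner’s existing preprocessing. The toy sketch below illustrates only that mismatch. `build_vocab` and `numericalize` are hypothetical helpers, not fastai functions, and the texts are made up.

```python
def build_vocab(texts):
    """Toy vocabulary: assign ids to tokens in order of first appearance."""
    vocab = {}
    for text in texts:
        for tok in text.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def numericalize(text, vocab, unk=-1):
    """Map each token to its id, or to `unk` if the vocabulary has never seen it."""
    return [vocab.get(tok, unk) for tok in text.lower().split()]

train_texts = ["the bill passed the house", "vote on the budget today"]
new_texts = ["budget vote today", "the house passed it"]

train_vocab = build_vocab(train_texts)  # what the trained model expects
fresh_vocab = build_vocab(new_texts)    # what a brand-new data pipeline would build

sample = "budget vote today"
ids_reused = numericalize(sample, train_vocab)
ids_fresh = numericalize(sample, fresh_vocab)
print("ids with the training vocab:", ids_reused)
print("ids with a fresh vocab:     ", ids_fresh)
```

A model fed the second set of ids would be looking up the wrong embeddings without raising any error, which would be consistent with getting plausible-looking but meaningless probabilities.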

Does anyone know why we are not zeroing out the gradients after each `epoch` in this course notebook (Linear model and neural net from scratch | Kaggle)? Zeroing is normally done during each mini-batch/step; in this notebook there is only a single batch, so one step/mini-batch is one training epoch. It seems like a bug, although it does not seem to hurt the loss and accuracy much when training for 30 epochs. I have noticed some other issues, though. Maybe it does not matter in this case because of the relatively small dataset size and relatively small number of epochs. I’m trying to figure out if this was done on purpose or is a bug.

I ran a series of experiments on the linear model both with and without `coeffs.grad.zero_()`. When training for 30 epochs, I was able to achieve a slightly higher accuracy of `0.831460` with `coeffs.grad.zero_()` at epoch `19`, vs `0.825842` without it (matching the accuracy in the course notebook), but the losses were consistently worse and accuracy was worse at the original final epoch (30) as well.

I then started tracking the losses, accuracy, gradients and coefficients and created plots of each one of them to see what was happening as training progressed. I ran both with and without `coeffs.grad.zero_()` for 30, 100 and 1500 epochs.

**without coeffs.grad.zero_()**

**with coeffs.grad.zero_()**

When running without `zero_grad_` I noticed that the coefficients were growing linearly with the number of epochs, which is not ideal. When running with `zero_grad_` the coefficients were still continuously growing, but the graph of the coefficients looked like a log graph instead of a linear graph and the overall values were much lower, which was much better. I then added a simple l2 regularization (ish) of the coefficients to the loss and ran some more experiments. This helped to prevent continuous growth of the coefficients.

I then tried keeping the coefficient l2 reg and turning off `zero_grad_`, and that caused training to become unstable, even with a much lower learning rate. This happened consistently across a number of different runs with different hyperparameters, but I have only included the screenshot from the final run.

Finally, I recorded the coefficients across multiple runs with an increasing number of epochs, all without l2 reg. I observed that while the coefficients were continually growing, their values relative to one another were converging.

**Code**

```
import numpy as np
import torch
import matplotlib.pyplot as plt
from fastai.data.transforms import RandomSplitter

# Assumes t_indep, t_dep, indep_cols, SPLITTER_SEED and TORCH_SEED are already defined (not shown in this post).
trn_split, val_split = RandomSplitter(seed=SPLITTER_SEED)(t_indep)
len(trn_split), len(val_split)
trn_indeps, val_indeps, trn_deps, val_deps = t_indep[trn_split], t_indep[val_split], t_dep[trn_split], t_dep[val_split]
n_coeffs = t_indep.shape[1]
print('n_coeffs',n_coeffs)

def lin_init_coeffs():
    torch.manual_seed(TORCH_SEED)
    return (torch.rand(n_coeffs)-0.5).requires_grad_(True)

coeffs = lin_init_coeffs()

def lin_show_coeffs(coeffs): return dict(zip(indep_cols, coeffs.requires_grad_(False)))

def lin_update_coeffs(coeffs, lr=.01,zero_grad=True):
    with torch.no_grad():
        coeffs.sub_(coeffs.grad*lr)
        if zero_grad: coeffs.grad.zero_()

def lin_calc_preds(coeffs, indeps):
    return torch.sigmoid((indeps*coeffs).sum(axis=1))

def lin_calc_loss(preds, deps):
    return (preds - deps).abs().mean()

def lin_calc_acc(coeffs):
    with torch.no_grad():
        preds = lin_calc_preds(coeffs, val_indeps)
        ret = ((preds > 0.5) == val_deps.bool()).float().mean()
    return ret

def lin_calc_epoch(coeffs, indeps, deps, lr=2.,zero_grad=True,l2_reg=True):
    preds = lin_calc_preds(coeffs, indeps)
    loss = lin_calc_loss(preds, deps)
    if l2_reg: loss += (coeffs.square().sum()) * .001 #prevent unbounded growth of coeffs
    # print(f'loss {loss:.3f};',end='')
    loss.backward()
    grads = coeffs.grad.data.clone().detach().abs().mean()
    lin_update_coeffs(coeffs,lr,zero_grad=zero_grad)
    with torch.no_grad():
        val_preds = lin_calc_preds(coeffs, val_indeps)
        val_loss = lin_calc_loss(val_preds, val_deps)
        acc = lin_calc_acc(coeffs)
    # print(f'acc: {acc:.4f};',end='')
    return loss.detach(), val_loss.detach(), acc, grads, coeffs.clone().detach().abs().mean()

def lin_train_model(epochs=30, lr=2.,zero_grad=True,l2_reg=True):
    trn_loss,val_loss,acc, grads, coeffs_track = [],[],[],[],[]
    coeffs = lin_init_coeffs()
    for e in range(epochs):
        tl,vl,a,gr,c = lin_calc_epoch(coeffs,trn_indeps,trn_deps,lr,zero_grad=zero_grad,l2_reg=l2_reg)
        trn_loss.append(tl);val_loss.append(vl);acc.append(a);grads.append(gr);coeffs_track.append(c)
    # Printed once after training: lin_show_coeffs turns off requires_grad on coeffs
    print('\n',lin_show_coeffs(coeffs).__str__(),'\n')
    return trn_loss, val_loss, acc, grads, coeffs_track

def make_plot(tl,vl,acc,grads,coeffs_track,zero_grad,l2_reg):
    best_acc_epoch = np.argmax(acc)
    print(f'best acc epoch: {best_acc_epoch} | trn_loss: {tl[best_acc_epoch]} | val_loss: {vl[best_acc_epoch]} | acc: {acc[best_acc_epoch]}')
    print(f'last epoch: {len(acc)} | trn_loss: {tl[-1]} | val_loss: {vl[-1]} | acc: {acc[-1]}')
    xs = list(range(len(tl)))
    # Left panel: losses and accuracy; right panel: gradients and coefficients
    fig, (ax1,ax3) = plt.subplots(ncols=2,figsize=(16,6))
    ax2 = ax1.twinx()
    ax4 = ax3.twinx()
    ax1.set_ylabel('Loss')
    ax2.set_ylabel('Accuracy')
    ax1.plot(xs,tl,label='tloss')
    ax1.plot(xs,vl,label='vloss')
    ax4.plot(xs,coeffs_track,label='coeffs_abs_mean')
    ax2.plot(xs,acc,label='acc',color='g')
    ax3.plot(xs, grads, label='grads_abs_mean',color='red')
    fig.legend()
    ax3.set_ylabel('grads')
    ax4.set_ylabel('coeffs')
    title = f"{'With' if zero_grad else 'Without'} coeffs.grad.zero_() | "
    title += f"{'With' if l2_reg else 'Without'} coeffs l2 regularization (0.001)"
    _ = plt.title(title)

#EXAMPLE RUN:
make_plot(*lin_train_model(epochs=30, lr=2., zero_grad=(zero_grad:=False),l2_reg=(l2_reg:=False)),zero_grad,l2_reg)
```

Overall this was an interesting set of experiments, and using a simple linear model made it easier to wrap my head around everything that was going on. The graphs really helped me visualize what was happening during training.

Sorry for the super-long post!

EDITS:

- Added a reference to the course notebook I am referring to.
- Clarified question - Is omitting zeroing grads after epochs done on purpose or is it a bug.
- Clarified zero_grad is typically done at the end of one mini-batch/step, not epoch, but in this case since there is only one batch, a batch and epoch are the same thing.
- Jeremy confirmed gradients should be zeroed after each epoch in the referenced course notebook and it has been updated to add in that functionality so this question will no longer make sense if you view the notebook I was referring to. Here is a link to the version before it was added so you see what I was referring to originally: Linear model and neural net from scratch | Kaggle . It’s pretty cool that Kaggle keeps notebook revisions. Thank you Jeremy for the confirmation!

To get the weights zero-centered, i.e. zero mean.

As I understand the gradients get accumulated. So they need to be zeroed after the weights are updated. Not sure why it does not matter here.
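The accumulation described above can be sketched on a toy problem. This is an illustration in plain Python under loud assumptions, not the notebook’s Titanic model: there is a single weight `w`, the loss is `(w - 3)**2`, and a hypothetical `grad_buffer` variable stands in for `coeffs.grad`, which PyTorch adds to on every `backward()` call rather than overwriting.

```python
def train(steps: int, lr: float, zero_grad: bool):
    """Gradient descent on (w - 3)**2 with an accumulating gradient buffer."""
    w = 0.0
    grad_buffer = 0.0  # stand-in for coeffs.grad
    history = []
    for _ in range(steps):
        grad_buffer += 2.0 * (w - 3.0)  # like backward(): adds to the stored gradient
        w -= lr * grad_buffer           # the update uses whatever is in the buffer
        if zero_grad:
            grad_buffer = 0.0           # like coeffs.grad.zero_()
        history.append(w)
    return history

with_zero = train(steps=200, lr=0.1, zero_grad=True)
without_zero = train(steps=200, lr=0.1, zero_grad=False)

print(f"with zeroing:    final w = {with_zero[-1]:.6f}")
print("without zeroing: last 5 w =", [round(w, 3) for w in without_zero[-5:]])
```

On this toy loss, the run that zeroes the buffer settles at `w = 3`, while the run that never zeroes it keeps stepping with the sum of all past gradients and overshoots back and forth. The real model in the thread behaves differently in detail, so treat this only as intuition for why stale gradients change the update.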

Where is it normally done during training? And it “seems like a bug” in what?

I must confess almost all of this is totally over my head. I guess I need to go back to that notebook and try to understand it better.

I have code in a Python file which I `import` into Jupyter. The initial `import` works as expected. The problem is that changes to the Python file don’t appear to be reloaded after re-executing the cell with the `import`. In the past, the following magic Jupyter commands worked. Thanks in advance if anyone knows how to fix this.

```
%load_ext autoreload
%autoreload 2
```
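If the magics are not picking up changes, one manual fallback is the standard library’s `importlib.reload`. The sketch below is self-contained so it can run anywhere: it writes a throwaway module (the name `my_helpers_demo` is hypothetical) to a temporary folder, imports it, edits the file, and reloads it.

```python
import importlib
import pathlib
import sys
import tempfile

def demo_reload():
    """Import a throwaway module, edit it on disk, and reload it."""
    old_flag = sys.dont_write_bytecode
    sys.dont_write_bytecode = True  # keep a cached .pyc from masking the quick edit in this demo
    with tempfile.TemporaryDirectory() as tmp:
        module_path = pathlib.Path(tmp) / "my_helpers_demo.py"
        module_path.write_text("VALUE = 1\n")
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module("my_helpers_demo")
            before = module.VALUE
            module_path.write_text("VALUE = 22\n")  # simulate editing the file
            importlib.reload(module)                # re-executes the module's source
            after = module.VALUE
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("my_helpers_demo", None)
            sys.dont_write_bytecode = old_flag
    return before, after

print(demo_reload())
```

In a notebook the equivalent is to `import` the module itself and call `importlib.reload(module)` in a cell after each edit. Names pulled in with `from module import name` keep pointing at the old objects until that import statement is run again.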

Regarding tabular data: what is your experience with `fastai`? Is it on par with things like XGBoost or LightGBM? Or is it a similar case as with HF Transformers, where a third-party library is a better choice? Or would you advise starting with `fastai` and migrating to something more specific later on? As far as I remember, boosted trees showed better results on tables compared to NNs. (Though I also remember things like TabNet.) The last time I approached this question, gradient boosted ensembles showed better performance.