Lesson 5 official topic

The output does, yes, but we want the input to it to be centered at zero, so that the output of the sigmoid is centered at 0.5.

3 Likes

I think I finally understand this. We use sigmoid without the +0.5 offset because we want the model's raw outputs to be centered around 0, which makes it easier for the model to train. Even though in the first example (before passing the output through the sigmoid function) the outputs were centered around 0.5, with some results below 0 and some above 1, it's actually easier for the model to learn sensible weights if we skip the extra shifting and handle it with a sigmoid instead. If my intuition is correct, that also means the model can feed a wider range of values into the sigmoid: it can push things that are definitely close to 0% far into the negatives, and the sigmoid will convert that to a value close to 0%.

3 Likes

Yes exactly. It’s easier for a model to come up with a set of weights that just spits out some big number to mean “survived”, than to have to make it equal exactly 1.0.
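
To make that concrete, here's a quick numeric check (a minimal PyTorch sketch, not from the lesson): any sufficiently large raw output saturates to roughly the same probability, so the model only has to get the sign and rough magnitude right.

import torch

# raw model outputs (logits), from very negative to very positive
z = torch.tensor([-10., -2., 0., 2., 10.])
print(torch.sigmoid(z))
# tensor([4.5398e-05, 1.1920e-01, 5.0000e-01, 8.8080e-01, 9.9995e-01])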

4 Likes

Here’s another way to visualize the intended effect of centering the input data.

I’ve made a tiny desmos demo that

  • plots the sigmoid function curve
  • plots the sigmoid results for two different data groups
    • one centered at zero
    • the other starting from zero
  • and offsets them a tiny bit up and down so they don’t overlap (this is just for visual clarity)

You can see that

  • the data group centered at zero gets distributed by the sigmoid function across the 0…1 range (on the y axis), centered at 0.5
  • while the data group starting at zero only gets distributed from 0.5 upwards.

Of course, this is overly simplified because I’ve omitted the in-between operations, but hopefully this demo provides some intuition around that form of initialization.
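
For anyone who prefers code to Desmos, here's a rough matplotlib equivalent (a minimal sketch; the data groups and offsets are made up for illustration):

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x): return 1 / (1 + np.exp(-x))

xs = np.linspace(-6, 6, 200)
centered = np.linspace(-3, 3, 15)  # data group centered at zero
positive = np.linspace(0, 6, 15)   # data group starting from zero

plt.plot(xs, sigmoid(xs), label='sigmoid')
# offset the two groups a tiny bit up/down so they don't overlap
plt.scatter(centered, sigmoid(centered) + 0.02, label='centered at 0')
plt.scatter(positive, sigmoid(positive) - 0.02, label='starting at 0')
plt.legend()
plt.show()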

6 Likes

Is there a link available for the recorded session on Terminal from Thursday, May 26? Thanks.

1 Like

See the “YouTube Live” link in the first post of this thread

3 Likes

Stefan Josef! It has been a while since those FSDL study sessions, good times.

1 Like

learn.get_preds() and learn.predict() appear to give different results

This is a problem I’ve been trying to resolve for over a week. I’m working on a text classification project using Twitter data, and my results just haven’t been making sense.

Finally, I tracked it down to something I must not understand about the two inference methods: learn.get_preds and learn.predict. Here is sample code showing the apparently different results from the same trained model and dataset:

model = "ulmfit.pkl"
c   = congress/"RepWalorski.csv"
df  = pd.read_csv(c)
df.text = df.text.str.replace('[\\r]','') 
df = df.reset_index()

Here, I used learn.get_preds()

dl_test = TextDataLoaders.from_df(df, valid_pct = 0, text_col='text', shuffle=False)[0]
predicts, actuals = learn.get_preds(dl = dl_test)
predicts

Out[307]:
TensorText([[0.8213, 0.1787],
        [0.8429, 0.1571],
        [0.8596, 0.1404],
        [0.8174, 0.1826],
        [0.8157, 0.1843],
        [0.8967, 0.1033],
        [0.8972, 0.1028]])

Here I would have expected to get the same results by iterating through the data with learn.predict(), but I didn’t.

learn = load_learner(model)
res = []
for t in df.text:
    party, cat, probs = learn.predict(t)
    res.append([probs[0].item(), probs[1].item()])
np.array(res)

Out[306]:
array([[3.17013037e-04, 9.99683022e-01],
       [3.80592272e-02, 9.61940706e-01],
       [5.17464876e-01, 4.82535094e-01],
       [9.64576146e-04, 9.99035478e-01],
       [1.05449706e-01, 8.94550323e-01],
       [4.67417343e-03, 9.95325804e-01],
       [3.93099725e-01, 6.06900275e-01]])

The second predictions are what I expected. I’m not sure what I did wrong using learn.get_preds()

Moreover, when I look at this week’s lecture (at ~1:21), @jeremy is doing with the Titanic dl what I think I am doing with a text dl.

I’ve spent too much time on this and am stuck. Help!

Many thanks

Maybe try passing an activation that does nothing, since learn.get_preds is giving you the probability outputs but you seem to want the raw outputs (I think these raw outputs are called z, or logits):

predicts, actuals = learn.get_preds(dl = dl_test, act=lambda x:x)
1 Like

@wyquek thanks for offering to help. Unfortunately, passing act=lambda x:x has no effect on the results.

I am stumped as to why these predictions don’t just work out-of-the-box.

1 Like

Solved!!

I went through the fastai source code for learn.predict() and realized that this code was wrong

dl_test = TextDataLoaders.from_df(df, valid_pct = 0, text_col='text', shuffle=False)[0]
predicts, actuals = learn.get_preds(dl = dl_test)
predicts

This code does work. If there is a more idiomatic way to do this, please let me know.

model = "ulmfit.pkl"
learn = load_learner(model)
dl_test = learn.dls.test_dl(df.text.to_list())
predicts = learn.get_preds(dl = dl_test)
predicts

Out[93]:
(TensorText([[3.1701e-04, 9.9968e-01],
         [3.8059e-02, 9.6194e-01],
         [5.1746e-01, 4.8254e-01],
         [9.6458e-04, 9.9904e-01],
         [1.0545e-01, 8.9455e-01],
         [4.6742e-03, 9.9533e-01],
         [3.9310e-01, 6.0690e-01]]),
 None)

Can you explain more? What code was wrong, and how was it wrong?

1 Like

Sorry. To be clear, I meant to say my code was wrong. The source code is fine. In fact, it was the source code that helped me figure out why learn.predict() was working for me but learn.get_preds was not. I looked through the fastai documentation, tutorials and the forums and just wasn’t able to figure it out. I should have looked at the source code sooner…

This code is wrong. Unfortunately, it didn’t fail with an error, so I didn’t recognize that I had made a mistake, since I was getting answers back. Only after really looking at the results did I conclude that something was wrong.

dl_test = TextDataLoaders.from_df(df, valid_pct = 0, text_col='text', shuffle=False)[0]
predicts, actuals = learn.get_preds(dl = dl_test)
predicts

Out[307]:
TensorText([[0.8213, 0.1787],
        [0.8429, 0.1571],
        [0.8596, 0.1404],
        [0.8174, 0.1826],
        [0.8157, 0.1843],
        [0.8967, 0.1033],
        [0.8972, 0.1028]])

This code works, though I don’t know if it is idiomatic:


dl_test = learn.dls.test_dl(df.text.to_list())
predicts = learn.get_preds(dl = dl_test)
predicts

Out[93]:
(TensorText([[3.1701e-04, 9.9968e-01],
         [3.8059e-02, 9.6194e-01],
         [5.1746e-01, 4.8254e-01],
         [9.6458e-04, 9.9904e-01],
         [1.0545e-01, 8.9455e-01],
         [4.6742e-03, 9.9533e-01],
         [3.9310e-01, 6.0690e-01]]),
 None)

From what I see, that’s exactly how the inference API is meant to be used! :smile:

3 Likes

Does anyone know why we are not zeroing out the gradients after each epoch in this course notebook (Linear model and neural net from scratch | Kaggle)? Zeroing is normally done during each mini-batch/step; in this notebook one step/mini-batch is one training epoch, since there is only a single batch. It seems like a bug, although it does not appear to hurt the loss and accuracy much when training for 30 epochs, but I have noticed some other issues. Maybe it does not matter in this case because of the relatively small dataset size and relatively small number of epochs. I’m trying to figure out whether this was done on purpose or is a bug.

I ran a series of experiments on the linear model, both with and without coeffs.grad.zero_(). When training for 30 epochs, I was able to achieve a slightly higher accuracy with coeffs.grad.zero_() (0.831460, at epoch 19) vs 0.825842 without it (matching the accuracy in the course notebook), but the losses were consistently worse, and the accuracy at the original final epoch (30) was worse as well.

I then started tracking the losses, accuracy, gradients and coefficients and created plots of each one of them to see what was happening as training progressed. I ran both with and without coeffs.grad.zero_() for 30, 100 and 1500 epochs.

[plots: training curves without coeffs.grad.zero_()]

[plots: training curves with coeffs.grad.zero_()]
When running without coeffs.grad.zero_() I noticed that the coefficients grew linearly with the number of epochs, which is not ideal. When running with it, the coefficients still grew continuously, but the curve looked logarithmic rather than linear and the overall values were much lower, which was much better. I then added a simple l2 regularization (ish) of the coefficients to the loss and ran some more experiments. This helped prevent the unbounded growth of the coefficients.

[plot: with coeffs.grad.zero_() and coeffs l2 regularization]
I then tried keeping the coefficient l2 reg while turning off coeffs.grad.zero_(), and that caused training to become unstable, even with a much lower learning rate. This happened consistently across a number of runs with different hyperparameters, but I have only included the screenshot from the final run.

Finally, I recorded the coefficients across multiple runs with an increasing number of epochs, all without l2 reg. I observed that while the coefficients kept growing, their values relative to one another converged (see the sketch below).
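
One quick way to check that (a hypothetical snippet, not part of my experiments): compare the direction of the coefficient vectors from two runs; a cosine similarity near 1 means the relative values match even though the magnitudes differ.

import torch
import torch.nn.functional as F

# stand-ins for coefficients recorded after, say, 100 and 1500 epochs
c_100, c_1500 = torch.randn(12), torch.randn(12)
print(F.cosine_similarity(c_100, c_1500, dim=0))  # ~1.0 if only the scale differs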

Code

# assumes the notebook context: t_indep, t_dep, indep_cols, SPLITTER_SEED
# and TORCH_SEED are defined earlier
import numpy as np, torch, matplotlib.pyplot as plt
from fastai.data.transforms import RandomSplitter

trn_split, val_split = RandomSplitter(seed=SPLITTER_SEED)(t_indep)
len(trn_split), len(val_split)

trn_indeps, val_indeps, trn_deps, val_deps = t_indep[trn_split], t_indep[val_split], t_dep[trn_split], t_dep[val_split]

n_coeffs = t_indep.shape[1]
print('n_coeffs',n_coeffs)
def lin_init_coeffs():
    torch.manual_seed(TORCH_SEED)
    return (torch.rand(n_coeffs)-0.5).requires_grad_(True)
coeffs = lin_init_coeffs()

def lin_show_coeffs(coeffs): return dict(zip(indep_cols, coeffs.requires_grad_(False)))

def lin_update_coeffs(coeffs, lr=.01, zero_grad=True):
    # SGD step; optionally reset the accumulated gradients afterwards
    with torch.no_grad():
        coeffs.sub_(coeffs.grad*lr)
        if zero_grad: coeffs.grad.zero_()

def lin_calc_preds(coeffs, indeps):
    return torch.sigmoid((indeps*coeffs).sum(axis=1))

def lin_calc_loss(preds, deps):
    return (preds - deps).abs().mean()

def lin_calc_acc(coeffs):
    with torch.no_grad():
        preds = lin_calc_preds(coeffs, val_indeps)
        ret = ((preds > 0.5) == val_deps.bool()).float().mean()
    return ret

def lin_calc_epoch(coeffs, indeps, deps, lr=2.,zero_grad=True,l2_reg=True):
    preds = lin_calc_preds(coeffs, indeps)
    loss = lin_calc_loss(preds, deps)
    if l2_reg: loss += (coeffs.square().sum()) * .001 #prevent unbounded growth of coeffs
#     print(f'loss {loss:.3f};',end='')
    loss.backward()
    grads = coeffs.grad.data.clone().detach().abs().mean()
    lin_update_coeffs(coeffs,lr,zero_grad=zero_grad)
    with torch.no_grad():
        val_preds = lin_calc_preds(coeffs, val_indeps)
        val_loss = lin_calc_loss(val_preds, val_deps)
    acc = lin_calc_acc(coeffs)
#     print(f'acc: {acc:.4f};',end='')
    return loss.detach(), val_loss.detach(), acc, grads, coeffs.clone().detach().abs().mean()

def lin_train_model(epochs=30, lr=2.,zero_grad=True,l2_reg=True):
    trn_loss,val_loss,acc, grads, coeffs_track = [],[],[],[],[]
    coeffs = lin_init_coeffs()
    for e in range(epochs):
        tl,vl,a,gr,c = lin_calc_epoch(coeffs,trn_indeps,trn_deps,lr,zero_grad=zero_grad,l2_reg=l2_reg)
        trn_loss.append(tl);val_loss.append(vl);acc.append(a);grads.append(gr);coeffs_track.append(c)
    print('\n',lin_show_coeffs(coeffs).__str__(),'\n')
    return trn_loss, val_loss, acc, grads, coeffs_track

def make_plot(tl,vl,acc,grads,coeffs_track,zero_grad,l2_reg):
    best_acc_epoch = np.argmax(acc)
    print(f'best acc epoch: {best_acc_epoch} | trn_loss: {tl[best_acc_epoch]} | val_loss: {vl[best_acc_epoch]} | acc: {acc[best_acc_epoch]}')
    print(f'last epoch: {len(acc)} | trn_loss: {tl[-1]} | val_loss: {vl[-1]} | acc: {acc[-1]}');xs = list(range(len(tl)))
    fig, (ax1,ax3) = plt.subplots(ncols=2,figsize=(16,6));ax2 = ax1.twinx();ax4 = ax3.twinx(); ax1.set_ylabel('Loss'); 
    ax2.set_ylabel('Accuracy');ax1.plot(xs,tl,label='tloss'); ax1.plot(xs,vl,label='vloss');ax4.plot(xs,coeffs_track,label='coeffs_abs_mean')
    ax2.plot(xs,acc,label='acc',color='g');ax3.plot(xs, grads, label='grads_abs_mean',color='red');fig.legend();
    ax3.set_ylabel('grads');ax4.set_ylabel('coeffs')
    title = f"{'With' if zero_grad else 'Without'} coeffs.grad.zero_() | "
    title += f"{'With' if l2_reg else 'Without'} coeffs l2 regularization (0.001)"
    _ = plt.title(title)

#EXAMPLE RUN:
make_plot(*lin_train_model(epochs=30, lr=2., zero_grad=(zero_grad:=False),l2_reg=(l2_reg:=False)),zero_grad,l2_reg)

Overall this was an interesting set of experiments, and using a simple linear model made it easier to wrap my head around everything that was going on. The graphs really helped in visualizing what was happening during training.

Sorry for the super-long post!

EDITS:

  1. Added a reference to the course notebook I am referring to.
  2. Clarified question - Is omitting zeroing grads after epochs done on purpose or is it a bug.
  3. Clarified zero_grad is typically done at the end of one mini-batch/step, not epoch, but in this case since there is only one batch, a batch and epoch are the same thing.
  4. Jeremy confirmed gradients should be zeroed after each epoch in the referenced course notebook, and it has been updated to add that functionality, so this question will no longer make sense if you view the notebook I was referring to. Here is a link to the version before the change so you can see what I was referring to originally: Linear model and neural net from scratch | Kaggle. It’s pretty cool that Kaggle keeps notebook revisions. Thank you Jeremy for the confirmation!
11 Likes

To get the weights zero-centered, i.e. zero mean.

As I understand it, the gradients get accumulated, so they need to be zeroed after the weights are updated. I’m not sure why it doesn’t matter here.
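
A tiny PyTorch sketch of that accumulation behaviour — calling backward() twice without zeroing adds the gradients together:

import torch

w = torch.tensor([1.0], requires_grad=True)

(w * 3).sum().backward()
print(w.grad)   # tensor([3.])

(w * 3).sum().backward()
print(w.grad)   # tensor([6.]) -- gradients accumulated, not replaced

w.grad.zero_()  # reset before the next step
print(w.grad)   # tensor([0.])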

1 Like

Where is it normally done during training? And what exactly “seems like a bug”?

I must confess almost all of this is totally over my head. I guess I need to go back to that notebook and try to understand it better.

I have code in a Python file which I import into Jupyter. The initial import works as expected. The problem is that changes to the Python file don’t appear to be reloaded after re-executing the cell with the import. In the past, the following Jupyter magic commands worked. Thanks in advance if anyone knows how to fix this.

%load_ext autoreload
%autoreload 2
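
If the magics aren’t taking effect, restarting the kernel and running them in the very first cell usually helps. Failing that, importlib.reload is a manual fallback (mymodule here is a placeholder for your own file’s module name):

import importlib
import mymodule

importlib.reload(mymodule)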

Regarding tabular data: what is your experience with fastai? Is it on par with things like XGBoost or LightGBM? Or is it a similar case to HF Transformers, where a third-party library is a better choice? Or would you advise starting with fastai and migrating to something more specific later on? As far as I remember, boosted trees showed better results on tables compared to NNs (though I also remember things like TabNet). Last time I approached this question, gradient-boosted ensembles showed better performance.