Stability of LSTM training in lesson 4

When I run all the code in the lesson 4 notebook, the training in the last cell goes off track after a few cycles.

The line:

m3.fit(lrs, 7, metrics=[accuracy], cycle_len=2, cycle_save_name='imdb2')

Output:

epoch   trn_loss   val_loss   accuracy
    0   0.465021   0.354633   0.915533
    1   0.413693   0.325593   0.923896
    2   0.421768   0.329558   0.922895
    3   0.405373   0.31995    0.924976
    4   0.40406    0.332836   0.919894
    5   0.357218   0.329943   0.924056
    6   0.415895   0.317114   0.924176
    7   0.364562   0.315282   0.925656
    8   0.377893   0.327583   0.923616
    9   0.366968   0.327651   0.925096
   10   0.71977    0.759003   0.737836
   11   0.805769   0.806098   0.701224
   12   1.049581   1.036387   0.412572
// I stopped execution here

Epochs 0 to 9 are fine: the loss goes down and the accuracy goes up. Then at epoch 10 both training and validation loss suddenly roughly double and never recover.

This isn’t the first time I’ve seen this behavior when training recurrent neural networks. Why does it happen? How can I deal with it? Is it possible to tune some hyperparameters to make training more stable?

It means your LR is too high at that point in training. You could try more gradient clipping, or a lower learning rate.
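
In plain PyTorch, clipping amounts to rescaling the gradients before each optimizer step; here is a minimal sketch (the max_norm value is illustrative, not what the notebook uses):

import torch

def training_step(model, loss_fn, opt, xb, yb, max_norm=0.25):
    # One optimizer step with gradient clipping; max_norm is illustrative.
    opt.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()
    # Rescale all gradients so their global L2 norm is at most max_norm,
    # which stops a single bad batch from blowing up the weights.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    opt.step()
    return loss.item()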

Thank you, Jeremy. The learning rate was the reason.
I ran a few experiments with different learning rates when training the classifier (the “Sentiment” section of the lesson 4 notebook). For each LR I ran the training a few times and checked the mean accuracy of the best epoch in each experiment, and whether the accuracy dropped dramatically at some point during training. Most of the experiments ran for 20 epochs. Here are the results:

[Image: FastAI Lesson 4 Learning Rates]

In most of the experiments I changed only the third element of the lrs array, which controls the learning rate of the last layers.
It looks like 1e-2 is too large. The best values for this model seem to be around 2e-3 to 3e-3, which also boosts validation accuracy a little.
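
For context, lrs here is the array of differential learning rates passed to fit, one value per layer group, with the last element applying to the final layers. A sketch of what I ended up using (the ratios for the earlier groups are illustrative, not the exact notebook values):

import numpy as np

lr = 3e-3                              # stable value for the last layer group
lrs = np.array([lr / 10, lr / 3, lr])  # smaller LRs for earlier groups (illustrative ratios)

m3.fit(lrs, 7, metrics=[accuracy], cycle_len=2, cycle_save_name='imdb2')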

What else can I do to better understand the impact of LRs on LSTM training? Maybe something like visualizing how gradient values change over time?
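
In case it helps, here is a minimal sketch of how that tracking could look in plain PyTorch (grad_norms and record_grad_norm are names I made up; call the function right after each loss.backward()):

import torch
import matplotlib.pyplot as plt

grad_norms = []  # global L2 gradient norm, one entry per training step

def record_grad_norm(model):
    # Call right after loss.backward(): sum squared gradients over all
    # parameters, take the square root, and store the result.
    total = sum((p.grad.detach() ** 2).sum().item()
                for p in model.parameters() if p.grad is not None)
    grad_norms.append(total ** 0.5)

# After training, a spike in this plot should line up with the step
# where the loss blew up.
plt.plot(grad_norms)
plt.xlabel('training step')
plt.ylabel('global gradient L2 norm')
plt.show()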