Mysterious super convergence

I was playing around with different activation functions and saw something that surprised me.

If you use a simple polynomial like x*x*400 as the activation in a two-layer fully connected network, you can get 94 percent accuracy after only one epoch!

Have you seen such a thing? Do you have any explanation for it? It doesn’t make sense to me. With ReLU, after one epoch you get only 71 percent!

I would appreciate any insight. @jeremy

The notebook is attached: Supper.ipynb - Colaboratory.pdf (64.3 KB)

import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as dsets
import random as rand
import math

def act(x):
    return x*x*400

if torch.cuda.is_available():
    avDev = torch.device("cuda")
else:
    avDev = torch.device("cpu")

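# Assumed arguments: the standard MNIST train/test split, converted to tensors
train_dataset = dsets.MNIST(root='./data', train=True, transform=transforms.ToTensor(), download=True)
test_dataset = dsets.MNIST(root='./data', train=False, transform=transforms.ToTensor())

batch_size = 100
# Assumed loader settings: shuffle the training set only
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)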

class LogisticRegressionModel(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(LogisticRegressionModel, self).__init__()
        self.linear1 = nn.Linear(input_dim, 300)
        self.linear2 = nn.Linear(300, output_dim)
    def forward(self, x):
        out = act(self.linear1(x))
        out = self.linear2(out)
        return out
input_dim = 28*28
output_dim = 10
model = LogisticRegressionModel(input_dim, output_dim).to(avDev)
criterion = nn.CrossEntropyLoss().to(avDev)


learning_rate = 0.001
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

epochs = 1  # assumed: the results above are after one epoch
for epoch in range(epochs):
    for i, (images, labels) in enumerate(train_loader):
        images = images.view(-1, 28*28).to(avDev)
        labels = labels.to(avDev)
        optimizer.zero_grad()  # clear gradients from the previous step
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()  # backpropagate
        optimizer.step()  # update the weights
    # Evaluate on the test set at the end of each epoch
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in test_loader:
            images = images.view(-1, 28*28).to(avDev)
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted.cpu() == labels.cpu()).sum().item()
    accuracy = 100. * correct / total
    # Print loss and accuracy
    print('Epochs: {}. Loss: {}. Accuracy: {}'.format(epoch, loss.item(), accuracy))

I do like a good mystery! No one has answered so I’ll take a risk with a speculation, which may of course be wrong.

I suspect that the ideal learning rate differs when you change the activation function. It is possibly too small or too large in the ReLU case.
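One way to see why the learning rate might interact with the activation: the derivative of 400*x*x is 800*x, while the derivative of ReLU is at most 1. The sketch below is plain Python, not taken from the notebook, and compares the average gradient magnitude of the two. It assumes, purely for illustration, zero-mean pre-activations with a standard deviation of 0.2; that value is my guess, not something measured from the model.

```python
import random

def relu_grad(z):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise
    return 1.0 if z > 0 else 0.0

def poly_grad(z):
    # Derivative of act(z) = 400 * z * z
    return 800.0 * z

def mean_abs_grad(grad_fn, zs):
    # Average magnitude of the activation's derivative over the samples
    return sum(abs(grad_fn(z)) for z in zs) / len(zs)

random.seed(0)
# Assumption: pre-activations are roughly Gaussian with std 0.2
zs = [random.gauss(0.0, 0.2) for _ in range(10_000)]

relu_scale = mean_abs_grad(relu_grad, zs)
poly_scale = mean_abs_grad(poly_grad, zs)
ratio = poly_scale / relu_scale
print(f"ReLU: {relu_scale:.3f}  400*z*z: {poly_scale:.1f}  ratio: {ratio:.0f}x")
```

If the real pre-activations are anywhere near that scale, the same lr=0.001 would produce far larger weight updates with the polynomial than with ReLU, which would fit the possibility that the rate is simply too small in the ReLU case.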

You could proceed by moving your data, model, and training into a fastai Learner instead of the hand-written PyTorch training loop, and running lr_find() for the two cases. With fastai you also don’t need to hand-code the training loop and metrics (removing a source of errors), you can check whether the classifications make sense (verifying the model), and you can easily use a validation set (to see whether the model is memorizing or generalizing).
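To make the idea behind lr_find() concrete, here is a framework-free sketch of a learning-rate range test. It is not fastai's implementation: the helper names, the toy quadratic loss, and the two curvature values (1 and 250) are hypothetical stand-ins for a gently scaled and a steeply scaled model, and picking the rate with the lowest recorded loss is a simplification.

```python
import math

def lr_range_test(loss_fn, grad_fn, w0, lr_min=1e-4, lr_max=10.0, steps=100):
    """Take one gradient step per learning rate, raising the rate
    exponentially, and record (lr, loss) until the loss blows up."""
    w = w0
    best = loss_fn(w)
    history = []
    for i in range(steps):
        lr = lr_min * (lr_max / lr_min) ** (i / (steps - 1))
        w = w - lr * grad_fn(w)
        loss = loss_fn(w)
        if not math.isfinite(loss) or loss > 4 * best:
            break  # diverged: stop the sweep
        best = min(best, loss)
        history.append((lr, loss))
    return history

def lr_at_lowest_loss(history):
    # Simplified pick: the rate at which the recorded loss was lowest
    return min(history, key=lambda pair: pair[1])[0]

def make_problem(curvature):
    # Hypothetical toy loss 0.5 * curvature * (w - 3)^2 and its gradient
    loss_fn = lambda w: 0.5 * curvature * (w - 3.0) ** 2
    grad_fn = lambda w: curvature * (w - 3.0)
    return loss_fn, grad_fn

gentle = lr_at_lowest_loss(lr_range_test(*make_problem(1.0), w0=0.0))
steep = lr_at_lowest_loss(lr_range_test(*make_problem(250.0), w0=0.0))
print(f"gentle: {gentle:.4f}  steep: {steep:.5f}  ratio: {gentle / steep:.0f}")
```

In this toy setting the usable learning rate shrinks roughly in proportion to the curvature, which is the kind of shift I would look for when comparing the lr_find() plots for the two activations.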

Hope this helps you unpack the anomaly. Please let us know what you find out!

P.S. The name LogisticRegressionModel is misleading: with a hidden layer and a nonlinear activation, this is a small fully connected classifier rather than logistic regression. Also, uploading the notebook as a .pdf rather than as a Jupyter notebook is an obstacle to anyone who may want to investigate your question more deeply.