CUDA error: device-side assert triggered in PyTorch, and the model created by cnn_learner has a different number of parameters

Passos · May 10, 2021, 7:51pm

Hello all. When I train a model built with create_cnn_model, I get a CUDA error; when I run the same code using cnn_learner, it runs fine.

If I do:

model = create_cnn_model(resnet50, get_c(dls))
learn = Learner(dls, model)
learn.fit_one_cycle(1)

I get a CUDA error. However, if I do:

learn = cnn_learner(dls, resnet50)
learn.fit_one_cycle(1)

the model trains just fine.

What is even stranger is that when I inspect the two models, they have exactly identical modules, but when I run learn.summary() the last layers have different numbers of parameters.

The last module of the model created by create_cnn_model is: Linear 64 x 7700 (3942400 Params), which is exactly right, as the previous layer outputs 512, so 512*7700 = 3942400. However, the last module created by cnn_learner is: Linear 64 x 7700 (3978240 Params), which is 35840 more parameters than it should be, yet it works like that. Can anyone help me pinpoint this issue?
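The arithmetic above can be checked with a short sketch. It assumes, as the post's own calculation does, that the final Linear layer has 512 inputs and no bias term, so its parameter count is simply inputs times outputs. The helper name `out_features` is hypothetical and exists only for this illustration.

```python
# Parameter counts reported by learn.summary() in the post.
in_features = 512                      # outputs of the previous layer
params_create_cnn_model = 3_942_400    # model built with create_cnn_model
params_cnn_learner = 3_978_240         # model built with cnn_learner


def out_features(n_params: int, n_in: int) -> int:
    """Infer the output width of a bias-free linear layer (assumption: no bias)."""
    quotient, remainder = divmod(n_params, n_in)
    if remainder:
        raise ValueError("parameter count is not a multiple of the input width")
    return quotient


out_a = out_features(params_create_cnn_model, in_features)
out_b = out_features(params_cnn_learner, in_features)
extra_params = params_cnn_learner - params_create_cnn_model

print(out_a, out_b, out_b - out_a, extra_params)
```

Under that assumption, the 35840 extra parameters are exactly 512 * 70, which would correspond to 70 more output units in the cnn_learner model. One possible reading is that the two models were built with different output counts, so comparing get_c(dls) with dls.c and with the number of distinct labels may be worth a look; the thread does not confirm this.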

arampacha · May 10, 2021, 10:25pm

I don't know what the problem is; your code looks like it should work at first glance. But here is some general advice for debugging this kind of issue: CUDA errors are cryptic and not very informative, so it's better to put your dls on the CPU and run the code there. Most probably this will produce an interpretable error, which might lead you to the cause.
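As a minimal sketch of why running on the CPU helps, the plain PyTorch snippet below feeds a cross-entropy loss a target index that is larger than the number of model outputs. The sizes and labels are made up for illustration and are not taken from the original notebook. On the GPU this kind of mistake typically surfaces as a device-side assert, while on the CPU it raises an ordinary Python exception with a readable message. Whether an out-of-range label is the cause in this thread is not established.

```python
import torch
import torch.nn.functional as F

n_out = 3                               # hypothetical: the model has 3 outputs
logits = torch.randn(4, n_out)          # tensors live on the CPU by default
targets = torch.tensor([0, 1, 2, 3])    # label 3 is outside the range 0..n_out-1

try:
    F.cross_entropy(logits, targets)
    error_message = None
except (IndexError, RuntimeError) as exc:
    # On the CPU the failure is a normal exception that names the problem.
    error_message = str(exc)

print(error_message)
```

With fastai, the equivalent step would be to move the DataLoaders to the CPU (for example with dls.cpu()) before building the Learner and calling fit_one_cycle.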

VishnuSubramanian · May 11, 2021, 2:11am

Can you share a notebook that reproduces the error? It will help in debugging.