Lesson 1. Kernel dies during training resnet34


#1

I am getting an error that the kernel has died at the line `learn = create_cnn(data, models.resnet34, metrics=error_rate)` in the Lesson 1 notebook.

I have tried changing the batch size and the number of images, and tried upgrading to a bigger instance (from p2.xlarge to g3.4xlarge on AWS EC2). I also found a workaround for a memory leak in the dataloader, but it seems to be for an older version of fastai, while I am using fastai v1.0.
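For anyone trying the same two knobs: fastai v1's DataBunch wraps the standard PyTorch `DataLoader`, so the batch size and the dataloader workers can be illustrated in plain PyTorch. A minimal sketch with a dummy tensor dataset (not the actual fastai call, and the sizes here are made up for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for the pets images: 64 tiny fake "images".
ds = TensorDataset(torch.randn(64, 3, 8, 8), torch.randint(0, 2, (64,)))

# A smaller batch_size lowers peak memory use; num_workers=0 loads data
# in the main process, sidestepping worker-process memory leaks that can
# silently kill a Jupyter kernel.
dl = DataLoader(ds, batch_size=8, num_workers=0)

xb, yb = next(iter(dl))
print(xb.shape)  # torch.Size([8, 3, 8, 8])
```

In fastai v1.0 the equivalent is passing `bs=...` and `num_workers=0` to the `ImageDataBunch` factory method you use, before handing the data to `create_cnn`.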




So nothing above helped :frowning: Has anybody experienced the same?
Shahnoza


(R vd Horst) #2

Found that this is known problem, see https://forums.fast.ai/t/platform-aws-ec2-dlami/27340/30
Regards, Remco

Previous:
I seem to have run into the same problem. I instantiated a p2.xlarge AWS instance, installed fastai v1.0, and tried it first with another notebook; afterwards I used the lesson1-pet notebook as a kind of reference => still the same problem, the kernel died when calling create_cnn. Did you find the underlying issue, or a fix/workaround? Thanks, Remco


#3

Hello, I migrated to Amazon SageMaker and it worked there. I don’t have access to the link you provided; how did you solve the problem in the end? =)

Regards,
Shahnoza


(R vd Horst) #4

Sorry for the link access issue (I did not think about that). The problem is not solved as far as I know. The fallback I ended up using is described here: https://github.com/krishnakalyan3/course-v3/blob/aed64af19b34bcf0ddf1263bfd7d0e1744aac884/docs/start_aws.md
It uses plain Ubuntu 18 as a base instead of the Ubuntu Deep Learning AMI, so some extra install effort is needed.
I also needed some extra disk space (now set to 30 GB instead of the 8 GB default).
Regards, Remco