Very slow loading of the Convnet pretrained model on lesson 1

Problem still persists for me :slightly_frowning_face:
But I think I might have a different problem because setting num_workers=0 does not solve the issue.
To clarify, only loading the model is slow here (with or without precompute); .fit is working fine.
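
In case it helps, this is what I mean by "setting num_workers=0" -- passing it when the data object is built, so no background worker processes get spawned. A rough sketch of the lesson 1 code (PATH/sz/arch as defined in the notebook; the exact from_paths signature may differ in your fastai version):

```python
from fastai.conv_learner import *  # lesson 1 style imports (fastai 0.7)

arch = resnet34
# num_workers=0 keeps all data loading in the main process (the suggested workaround)
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz), num_workers=0)
learn = ConvLearner.pretrained(arch, data, precompute=True)  # this is the step that is slow for me
```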

I actually had this exact issue with it taking ~1 minute. The num_workers=0 solution worked.
However, I removed the num_workers option after that and it’s working great (runs in about 7s). I tested a kernel restart and it works now without any issues.

Edit: Mine’s a pretty powerful Ubuntu rig on Google Cloud

Edit2: This is without pulling in Jeremy’s latest fix

The first time you run it, it will always take about a minute, since it has to precompute the activations (we’ll discuss this in class). Then when you run again, it’ll be fast.
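
In terms of the lesson 1 code, the slow step is roughly this one (a sketch; variable names come from the notebook and the exact arguments may differ in your version):

```python
# First run: every training/validation image is pushed through the frozen
# pretrained network and the activations are cached to disk -> ~1 minute.
# Subsequent runs: the cached activations are loaded instead -> fast.
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(0.01, 3)
```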

Ohh that explains it.

@jeremy - Can you explain how you figured this out?

Not sure I can!.. The behavior and stack trace clearly showed a race condition or something similar, and it had to be in some other library, since if it were in PyTorch lots of people would have seen the problem. Since opencv is pretty complex, I figured it might be there, and googled a bit.
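
For anyone hitting similar hangs elsewhere: a common mitigation when OpenCV is used from forked worker processes is to disable its internal thread pool up front. This is just a general illustration of that idea, not necessarily the exact fix that went into the repo:

```python
import cv2

# OpenCV spins up its own thread pool; combined with fork-based multiprocessing
# this can deadlock. Turning it off before any workers are created avoids that.
cv2.setNumThreads(0)
```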

Thanks Jeremy! Worked for me this time around. Nice work!

@metachi I just pushed a different (and hackier!) approach which is much faster - can you check again and confirm it still works for you?

@jeremy - I can confirm this is working for me (with latest sleep changes).

The hackier (sleep) fix does not appear to be working for me. It goes fine with num_workers=0, but with 4 it hangs when calling fit again. Whatever you had before was working for me.

Thanks @jeremy It works great for me.

Aah sorry checking messages now. Good that you could replicate it in AWS.

Actually, it seems the code in the fastai lib runs serially; it’s the DataLoader class that does all the multiprocessing behind the scenes. I doubt that putting a lock in the fastai lib would help, i.e. fastai seems to be doing the right thing (I haven’t seen the entire code, though).
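
To illustrate what I mean -- the worker processes come from the PyTorch DataLoader itself, not from anything fastai does (toy dataset below just to make the snippet self-contained):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the real image dataset
ds = TensorDataset(torch.randn(256, 3, 224, 224), torch.randint(0, 2, (256,)))

# num_workers=4 -> DataLoader forks four worker processes to load batches;
# num_workers=0 -> everything stays in the main process (hence the workaround).
dl = DataLoader(ds, batch_size=64, shuffle=True, num_workers=4)

for xb, yb in dl:
    pass  # a training step would consume the batch here
```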

I also see you are suspecting opencv.

I will check on my end by increasing shm and trying out your latest changes.


Hi @jeremy,

After pulling the latest changes from the fast.ai repo, the problem disappears for me. Thanks!

edit: I also wanted to say thanks to @apaszke - your offer to help out was very generous.

I have also not been able to reproduce the issue after pulling latest. Thanks @jeremy!

Yes, but it’s calling fastai.dataset, which is where the opencv issue is.

What’s your OS, GPU, python/lib versions, shared mem amount, etc.? Seems that you’re the only one still reporting this problem ATM - still need to fix it, but need more details…
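
Something like this pasted into a notebook cell should grab most of it (the /dev/shm check is Linux-specific; treat the exact fields as suggestions):

```python
import platform, shutil
import torch

print("OS:", platform.platform())
print("Python:", platform.python_version())
print("PyTorch:", torch.__version__)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

total, _, free = shutil.disk_usage("/dev/shm")  # shared memory size on Linux
print(f"/dev/shm: {total / 2**30:.1f} GiB total, {free / 2**30:.1f} GiB free")
```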

@jeremy
The issue is solved for me too. Thanks!

Works for me!

Now if I could get both GPUs running it…

OK this might be crazy, but I just wrote my own DataLoader class to use instead of the PyTorch one, and it’s running much faster and without any problems on my university server or my P2 instance. Try git pull and let me know how it goes! Fingers crossed…
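
To give a rough idea of the shape of the approach -- replace process-based workers with a thread pool, so the opencv/multiprocessing interaction never comes into play -- here is a minimal sketch. It is illustrative only, not the code that actually went into the repo:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


class ThreadedDataLoader:
    """Yields mini-batches; assumes dataset[i] returns an (x, y) pair of numpy arrays."""

    def __init__(self, dataset, batch_size=64, shuffle=False, num_workers=4):
        self.dataset, self.batch_size = dataset, batch_size
        self.shuffle, self.num_workers = shuffle, num_workers

    def __len__(self):
        return (len(self.dataset) + self.batch_size - 1) // self.batch_size

    def __iter__(self):
        n = len(self.dataset)
        idxs = np.random.permutation(n) if self.shuffle else np.arange(n)
        with ThreadPoolExecutor(max_workers=self.num_workers) as pool:
            for start in range(0, n, self.batch_size):
                # fetch the items for this batch concurrently using threads
                batch = list(pool.map(self.dataset.__getitem__, idxs[start:start + self.batch_size]))
                xs, ys = zip(*batch)
                yield np.stack(xs), np.stack(ys)
```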

Okay, I pulled yesterday’s repo changes and am able to get it all to run, which is good.

But my accuracy actually went down slightly after retraining with the differential learning rate (i.e., running: learn.fit(lr, 3, cycle_len=1, cycle_mult=2), then rerunning the TTA accuracy).

The notebook indicated the accuracy was supposed to go up – is it unusual to have it go down? Is this an overfitting issue?
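
For reference, the step I’m talking about looks roughly like this (my actual lr values may differ; the point is that lr is an array of three rates, one per layer group):

```python
import numpy as np

lr = np.array([1e-4, 1e-3, 1e-2])            # smaller rates for earlier layer groups, larger for the new head
learn.unfreeze()                             # let the earlier layers train too
learn.fit(lr, 3, cycle_len=1, cycle_mult=2)  # the call quoted above
# ...then re-run the TTA accuracy cell (exact call depends on the fastai version)
```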

Confused :confused:
Maureen