The first time you run it, it will always take about a minute, since it has to precompute the activations (we’ll discuss this in class). Then when you run again, it’ll be fast.
Ohh that explains it.
@jeremy - Can you explain how you figured this out?
Not sure I can! The behavior and stack trace clearly showed a race condition or similar, and it had to be in some other library, since if it were in PyTorch lots of people would have seen the problem. Since OpenCV is pretty complex, I figured it might be there, and googled a bit.
Thanks Jeremy! Worked for me this time around. Nice work!
@metachi I just pushed a different (and hackier!) approach which is much faster - can you check again and confirm it still works for you?
@jeremy - I can confirm this is working for me (with latest sleep changes).
The hackier (sleep) fix does not appear to be working for me. It goes fine with num_workers=0, but with 4 it hangs when calling fit again. Whatever you had before was working for me.
Thanks @jeremy It works great for me.
Aah sorry, checking messages now. Good that you could replicate it on AWS.
Actually it seems the code in the fastai lib runs serially; it’s the DataLoader class that does all the multiprocessing behind the scenes. I doubt putting a lock in the fastai lib would help, i.e. fastai seems to be doing the right thing (I haven’t seen the entire code though).
I also see you suspect OpenCV.
I will check at my end by increasing shm and with your latest changes.
–
Hi @jeremy,
After pulling the latest changes from the fast.ai repo, the problem disappears for me. Thanks!
edit: I also wanted to say thanks to @apaszke - your offer to help out was very generous.
I have also not been able to reproduce the issue after pulling latest. Thanks @jeremy!
–
Yes, but it’s calling fastai.dataset, which is where the opencv issue is.
What are your OS, GPU, Python/lib versions, shared memory amount, etc.? It seems you’re the only one still reporting this problem at the moment. I still need to fix it, but I need more details…
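In case it helps anyone gather those details, here is a rough sketch of a script that collects them. The helper name env_report is made up for this post, the /dev/shm path assumes a Linux box, and torch and cv2 are only reported if they happen to be installed:

```python
import os
import platform
import shutil
import sys


def env_report():
    """Collect basic environment details worth pasting into a bug report."""
    info = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    # Optional libraries: report the version only if the import succeeds.
    try:
        import torch
        info["torch"] = torch.__version__
        info["cuda_available"] = torch.cuda.is_available()
    except ImportError:
        info["torch"] = "not installed"
    try:
        import cv2
        info["opencv"] = cv2.__version__
    except ImportError:
        info["opencv"] = "not installed"
    # Shared memory size (assumes the Linux convention of a /dev/shm mount).
    if os.path.exists("/dev/shm"):
        info["shm_total_mb"] = shutil.disk_usage("/dev/shm").total // (1024 * 1024)
    return info


if __name__ == "__main__":
    for key, value in env_report().items():
        print(f"{key}: {value}")
```

The GPU model is not covered here; running nvidia-smi separately is the usual way to get it.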
@jeremy
The issue is solved for me too. Thanks!
Works for me!
Now if I could get both GPUs running it…
OK this might be crazy, but I just wrote my own DataLoader class to use instead of the PyTorch one, and it’s running much faster and without any problems on my university server or my P2 instance. Try git pull and let me know how it goes! Fingers crossed…
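For anyone curious what a stripped-down replacement loader might look like, here is a minimal sketch. It is not the code that was pushed to the repo; it assumes worker threads rather than worker processes (one way to sidestep multiprocessing trouble), and SimpleLoader and its parameters are hypothetical names used only for illustration:

```python
from concurrent.futures import ThreadPoolExecutor


class SimpleLoader:
    """Hypothetical minimal loader: splits indices into batches and
    fetches each batch on a worker thread."""

    def __init__(self, dataset, batch_size=4, num_workers=2):
        self.dataset = dataset
        self.batch_size = batch_size
        self.num_workers = num_workers

    def __len__(self):
        # Number of batches, counting a final partial batch.
        return (len(self.dataset) + self.batch_size - 1) // self.batch_size

    def _get_batch(self, idxs):
        return [self.dataset[i] for i in idxs]

    def __iter__(self):
        n = len(self.dataset)
        chunks = [range(i, min(i + self.batch_size, n))
                  for i in range(0, n, self.batch_size)]
        with ThreadPoolExecutor(max_workers=self.num_workers) as ex:
            # Executor.map yields results in the order the chunks were submitted.
            yield from ex.map(self._get_batch, chunks)


batches = list(SimpleLoader(list(range(10)), batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

A real loader would also shuffle indices and collate items into tensors; both are left out here to keep the sketch short.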
Okay, I pulled yesterday’s repo changes and am able to get it all to run, which is good.
But my accuracy actually went down slightly after retraining with the differential learning rate (i.e., running: learn.fit(lr, 3, cycle_len=1, cycle_mult=2), then rerunning the TTA accuracy).
The notebook indicated the accuracy was supposed to go up – is it unusual to have it go down? Is this an overfitting issue?
Confused
Maureen
It went down compared to yesterday? Compared to some other cell? Can you show us the details of what you ran exactly, and what you saw?
Here are the details:
So, the first time I ran TTA accuracy was after running this:
tfms = tfms_from_model(resnet34, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
data = ImageClassifierData.from_paths(PATH, tfms=tfms)
learn = ConvLearner.pretrained(resnet34, data, precompute=True)
learn.fit(1e-2, 1)
learn.precompute=False
learn.fit(1e-2, 3, cycle_len=1)
accuracy(*learn.TTA())
my accuracy was: 0.99450000000000005
I saved the model as “224_lastlayer”, reloaded it, then ran
learn.unfreeze()
lr=np.array([1e-4,1e-3,1e-2])
learn.fit(lr, 3, cycle_len=1, cycle_mult=2)
I saved the model as ‘224_all’ and reloaded it,
then ran log_preds,y = learn.TTA()
Running accuracy(log_preds,y) gave me:
ACCURACY: 0.99350000000000005
So, my accuracy goes down from 0.9945 to 0.9935!
I noticed that learn.fit(lr, 3, cycle_len=1, cycle_mult=2) resulted in 7 runs. Is that an issue?
The output was:
[ 0. 0.04682 0.02693 0.99121]
[ 1. 0.03924 0.02231 0.99268]
[ 2. 0.03393 0.02198 0.99219]
[ 3. 0.03127 0.02144 0.99219]
[ 4. 0.02265 0.02303 0.9917 ]
[ 5. 0.0208 0.02034 0.99219]
[ 6. 0.01848 0.01948 0.99121]
The accuracy started stalling after the 3rd round, which is why I wondered if overfitting was a problem.
Should I be seeing 7 runs for learn.fit(lr, 3, cycle_len=1, cycle_mult=2)? I would have expected 3, like the previous time this was called with a single lr and without cycle_mult.
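On the 7 runs: that count is what the arguments imply. The second argument to fit is the number of cycles, not epochs, and cycle_mult=2 makes each cycle twice as long as the one before, so three cycles come to 1 + 2 + 4 = 7 epochs. A quick sketch of the arithmetic (total_epochs is a made-up helper for illustration, not a fastai function):

```python
def total_epochs(n_cycle, cycle_len=1, cycle_mult=1):
    """Total epochs when each cycle is cycle_mult times longer than the last."""
    return sum(cycle_len * cycle_mult ** i for i in range(n_cycle))


# learn.fit(lr, 3, cycle_len=1, cycle_mult=2): cycles of 1, 2 and 4 epochs
print(total_epochs(3, cycle_len=1, cycle_mult=2))  # 7

# learn.fit(1e-2, 3, cycle_len=1): three cycles of 1 epoch each
print(total_epochs(3, cycle_len=1))  # 3
```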