Lr_find & fit getting stuck

What might cause lr_find to hang up without completing?
(screenshot: the lr_find progress bar stuck partway through a run)

It gets stuck in this state and won’t go any further. Fit does the same thing. I’ve run the model successfully numerous times on the Kaggle Freesound competition data. Now, I’m trying to develop pre-trained weights using Audioset data, but can’t figure out what’s going wrong.

There’s no stack trace, so I can’t run the debugger. I’ve tried smaller batch sizes, thinking it might be a memory issue. It didn’t help. Perhaps there’s a problem with the data, but if so, why would I not get an error message?

Most of the Audioset files are in mono format, but some are 6-channel. That caused me some problems initially, but I was able to load all the files (all into arrays of the same shape) and compute the mean and standard deviation. I had to develop my own versions of dataset.py and transforms.py because this is audio data, but as I said, I’ve run them successfully many times on the Freesound data.
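(For reference, the stats step was nothing fancy; a rough sketch, where `clips` is a placeholder for the list of equal-length arrays after loading:)

```python
import numpy as np

# `clips` is a placeholder: one 1-D numpy array per audio file,
# all padded/trimmed to the same length before this point.
clips = [np.random.randn(22050 * 4).astype(np.float32) for _ in range(8)]  # dummy data

stacked = np.stack(clips)              # shape: (n_files, n_samples)
mean, std = stacked.mean(), stacked.std()
print(f"mean={mean:.5f}  std={std:.5f}")
```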

Any ideas about what might be going on?

When that happens in the LR finder, I think it means it stopped because the loss was getting too high. In my case, even when it stopped before 100%, I could still plot it and fit afterwards.

Notice the * next to the notebook cell. lr_find never stops. I can’t run another cell. I tried running fit without doing lr_find, and it does the same thing. It will train to a point and then just go into infinite-loop-type behavior, never completing a full cycle.

Maybe try to interrupt the kernel and look at the stack trace, so you can at least see where it gets stuck.
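If the normal interrupt doesn’t work, you can also have Python dump its own stack traces periodically; a minimal sketch using the standard-library faulthandler module (the 60-second interval and the file name are arbitrary):

```python
import faulthandler

# Write periodic stack dumps to a plain file; the notebook's sys.stderr
# may not expose a usable file descriptor, so a real file is safer.
log = open("stack_dump.log", "w")
faulthandler.dump_traceback_later(60, repeat=True, file=log)

# ... now run the cell with lr_find / fit ...
# While it appears hung, `tail stack_dump.log` in a terminal shows
# which line every thread is currently sitting on.

# When finished debugging:
# faulthandler.cancel_dump_traceback_later(); log.close()
```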

Sorry, I didn’t notice that. Perhaps some of the debugging tips Jeremy shares in one of the courses could come in handy here?

You mention that you’ve made your own implementation of the dataset. If the standard fast.ai learning rate finder works, then I would bet the problem stems from the custom dataset implementation.
The Dataset -> DataLoader -> model.py interaction is very interdependent and sometimes difficult to follow.
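One way to narrow that down is to take the model out of the picture and iterate the custom dataset with a plain DataLoader; a sketch, where `train_ds` stands in for your custom audio Dataset instance:

```python
import time
from torch.utils.data import DataLoader

# `train_ds` is a placeholder for your custom audio Dataset.
dl = DataLoader(train_ds, batch_size=8, shuffle=True, num_workers=0)

start = time.time()
for i, (x, y) in enumerate(dl):
    print(i, x.shape, y.shape, f"{time.time() - start:.1f}s")
    if i >= 10:   # a handful of batches is enough to see whether it stalls
        break
```

If this loop also hangs, the problem is in the dataset/dataloader rather than the model; if it runs cleanly, look at the training loop instead. num_workers=0 also rules out worker-process deadlocks, which are a classic cause of silent hangs.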

How many observations do you have in the training / validation sets? If there are very few (compared to batch_size), the observed behavior might occur. Did you try a batch_size of 1?
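A quick way to check that (names are placeholders for your own objects and settings):

```python
# `train_ds` / `valid_ds` are your Dataset objects, `bs` your batch size.
bs = 64
n_trn, n_val = len(train_ds), len(valid_ds)
print(f"train: {n_trn} items -> {n_trn // bs} full batches")
print(f"valid: {n_val} items -> {n_val // bs} full batches")
# If either count is 0 (dataset smaller than bs), a loader that drops the
# last partial batch can yield nothing, and the progress bar may look stuck.
```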

The total dataset is very large (36,000 files). I’ve been trying to run on a sample of about 3,600.

Perhaps, but as I mentioned, I’ve run the custom implementation successfully numerous times on a different dataset. The only differences between the datasets are that one contains all files in 1-channel format at a 44100 Hz sample rate, while the other has some files in 6-channel format (most are 1-channel) at a 22050 Hz sample rate. Also, the second dataset is multi-label, while the first is not.

Maybe I’ve not handled the different sample rates correctly in my custom implementation, or maybe the inconsistent channel format is causing the problem. It’s just that I’ve never seen lr_find or fit hang the way they are here. I was hoping someone else had dealt with a similar issue and could provide some insight.
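One thing I’m considering is forcing every file to mono at a single rate during preprocessing, something like this (just a sketch; the folder, extension, and the 22050 target are placeholders):

```python
import glob
import librosa
import soundfile as sf

TARGET_SR = 22050  # placeholder target sample rate

# Hypothetical folder of Audioset clips; adjust to your own layout.
for path in sorted(glob.glob("audioset_sample/*.wav")):
    # mono=True downmixes the 6-channel files; sr resamples everything
    # to one rate, so every array downstream has the same format.
    y, sr = librosa.load(path, sr=TARGET_SR, mono=True)
    sf.write(path, y, TARGET_SR)  # overwrite in place (or write to a new folder)
```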

Thanks!

Good idea. Unfortunately, it’s not letting me interrupt the kernel. I can “restart and clear all output”, but I can’t interrupt.

I’ve had similar issues, but I don’t remember exactly what happened. I’ve had problems when a loop break or exit doesn’t behave properly, because in fast.ai most looping-related code is wrapped in tqdm. Sometimes an improper exit/break keeps tqdm from returning.

Is your RAM usage under control? I’ve read that some people have problems with bigger datasets where RAM usage creeps up and hangs the system.
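A quick way to watch that from inside the notebook is psutil (a sketch; call it before and after the suspect cell):

```python
import psutil

proc = psutil.Process()

def report_memory(tag=""):
    # Resident set size of this kernel process plus overall system usage.
    rss_gb = proc.memory_info().rss / 1024**3
    sys_pct = psutil.virtual_memory().percent
    print(f"{tag} kernel RSS: {rss_gb:.2f} GB, system RAM used: {sys_pct:.0f}%")

report_memory("before lr_find:")
# ... run the suspect cell, then call report_memory("after:") again ...
```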

If you are not able to interrupt the kernel, that probably means it is waiting in some syscall. If you are using Linux, you can try attaching strace to the Python kernel process and see what it is currently doing. Another option is to set a pdb breakpoint inside lr_find and trace it line by line.
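For the pdb route, something like this works from a notebook cell (a sketch; `learn` stands in for whatever your learner object is called):

```python
import pdb

# Runs lr_find under the debugger: "s"/"n" to step, "c" to continue;
# interrupting while it runs should drop back into pdb wherever it is looping.
pdb.runcall(learn.lr_find)
```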