PyTorch internal error while doing the imdb notebook


(Thomas) #1

Hi,

I’m working through the imdb notebook from Lesson 10 and get a PyTorch error.
It happens at the first fine-tuning step under “Language model” (learner.fit(lrs/2, 1, wds=wd, use_clr=(32,2), cycle_len=1)):

RuntimeError: range.second - range.first == t.size() ASSERT FAILED at torch/csrc/autograd/generated/Functions.cpp:47, please report a bug to PyTorch. inconsistent range for TensorList output

While that is clear advice, and I’ll try to drill down to the root cause, has anyone else seen this and already solved it?
(I’m using PyTorch/master, so some breakage is OK.)

Best regards

Thomas


(urmas pitsi) #2

I have PyTorch 0.3.1 and it works fine. I guess you are trying to run it with the latest PyTorch version? I think fast.ai can break on anything above PyTorch 0.3.1. I see there have been recent commits toward 0.4 compatibility, though.


(Thomas) #3

Thanks for the hint! I’m relatively attached to running PyTorch / master. :slight_smile:
I have dug a bit further: apparently the cuDNN RNN backward behaves strangely depending on which grads are enabled and which are not.
So if anyone else runs into this: setting torch.backends.cudnn.enabled = False gets you around it.
Apparently there is still a bug in PyTorch to be fixed, though.
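
For anyone who wants to limit the workaround to the affected training calls, here is a minimal sketch. The helper name cudnn_disabled is made up for illustration (it is not part of PyTorch or fastai), and the tiny LSTM only stands in for the notebook's learner.fit call.

```python
from contextlib import contextmanager

import torch


@contextmanager
def cudnn_disabled():
    """Hypothetical helper: turn cuDNN off inside the block, then restore it."""
    previous = torch.backends.cudnn.enabled
    torch.backends.cudnn.enabled = False
    try:
        yield
    finally:
        torch.backends.cudnn.enabled = previous


# Stand-in for the notebook's learner.fit(...): a small LSTM forward/backward
# that runs on PyTorch's native RNN kernels while the block is active.
with cudnn_disabled():
    rnn = torch.nn.LSTM(input_size=4, hidden_size=8)
    out, _ = rnn(torch.randn(5, 2, 4))
    out.sum().backward()
```

The native kernels are usually slower than cuDNN on a GPU, so it seems reasonable to keep the flag off only where the error shows up.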


(Thomas) #4

Sure enough @sgugger found it first:


#5

Yes, someone else mentioned this error on the lesson wiki so I was trying to fix it, but it turns out the problem isn’t on our side :wink:
It seems you have found a fix though, congrats!


(Thomas) #6

Heh, I guess I’m on the PyTorch side, too… Thanks for all your work! I was just checking out how you proceeded with French in the Language Zoo, so I could do the same for German, when this blew up on me.


#7

I’ve learned a lot more on training an LM and making super-convergence work on them, so I definitely have to share a notebook on what I found worked best.


(Thomas) #8

I look forward to that! My plan was to use sentencepiece as I hope to benefit from the subwords with German compound nouns - I did have lots of UNK in some previous experiments and it also seems to work well for OpenNMT.

In the meantime: Did I miss something about more compatibility issues?
I’m now at learner.lr_find(start_lr=lrs/10, end_lr=lrs*10, linear=True) and get

TypeError: cannot assign 'torch.cuda.FloatTensor' as parameter 'weight_hh_l0' (torch.nn.Parameter or None expected)

I think I should know how to fix it if no one else has yet.


#9

Weird, this ran fine for me (pytorch 0.4.0).


(Thomas) #10

I’m pretty sure the nn.Module code in PyTorch wants to keep you from overwriting a parameter with a variable, which is what the regularizer does. (There is a PyTorch issue where I have opinions on how to deal with “calculated parameters”, which would be a neat solution here, too.) Whether that works might depend on other factors, such as the Python version. Interestingly, the code has been there for a long time.

I made it work (though not without some _raw vs. not-so-_raw hacking in load_module). I’m not sure that I want to impose it on you if the problem doesn’t exist for anyone else.
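
To make the mechanism concrete, here is a small sketch of the check in nn.Module and of one common way around it. It is an illustration under assumptions, not necessarily what the fix does: nn.Linear stands in for the LSTM, and the weight_raw name just mirrors the _raw convention mentioned above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

layer = nn.Linear(4, 3)  # small stand-in for the LSTM in the notebook

# 1) Assigning a plain tensor to a registered parameter name is rejected.
try:
    layer.weight = F.dropout(layer.weight, p=0.5, training=True)
    error = None
except TypeError as exc:
    error = str(exc)
print(error)

# 2) Possible workaround: keep the trainable values under a "_raw" name and
#    free up the original name so it can hold a computed tensor.
raw = nn.Parameter(layer.weight.data.clone())
del layer._parameters["weight"]
layer.register_parameter("weight_raw", raw)

# Recompute the derived weight before each forward pass.
layer.weight = F.dropout(layer.weight_raw, p=0.5, training=True)
out = layer(torch.randn(2, 4))
out.sum().backward()  # gradients reach weight_raw through the dropout
```

After the rename, the module's state_dict holds weight_raw instead of weight, so checkpoints saved before and after the change use different keys and loading code has to map between them.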


(Eric Roland) #11

I am getting the same error:

TypeError: cannot assign 'torch.cuda.FloatTensor' as parameter 'weight_hh_l0' (torch.nn.Parameter or None expected)

How did you get around it? Also, running PyTorch 0.4.0.

Thank you!


(Thomas) #12

I submitted my fix as a PR and a bit more description here:

Out of curiosity: What python version are you on?

Best regards

Thomas


(Eric Roland) #13

3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0]

Running on SageMaker. Your PR fixes my issue.

I appreciate the help.


(Thomas) #14

Hi Eric,

Cool! Thank you for reporting back.

Best regards

Thomas


(Todd Doucet) #15

Thomas, your pull request patch also fixed the problem for me.

I am running Python 3.6.5 and PyTorch 0.3.1, and my fastai repo was current as of today.

Running a 1080 Ti here, locally on my Linux Mint machine.

Thanks much.


#16

I’m still getting the “inconsistent range for TensorList output” error. Has it been fixed? I’m on PyTorch master (0.4). Plus, I think there is some sort of memory leak happening: the code doesn’t run, but all of the RAM is occupied.


(Thomas) #17

PyTorch master is what you get when you check out the git repo and recompile. 0.4 does have the bug; it’ll be fixed in 0.5.

Best regards

Thomas


#18

For now, I am not freezing the last embedding layer (due to the error), so I have to train the entire model in one go.


(Kyle Nesgood) #19

Running into the same problem. Is there a big performance hit when not freezing the last layer?


(Thomas) #20

Which PyTorch version are you on? 0.4.1 should have the fix needed…