Fit failing with RuntimeError: Could not infer dtype of numpy.int64

zhye · October 21, 2018, 3:40am

I really don’t know where this thread should be placed - hopefully in the right place. Basically, I’m trying to reproduce lesson 10 using fastai v1 code, and I got an error when running a fit_one_cycle on the classification model that reads:

RuntimeError                              Traceback (most recent call last)
<ipython-input-8-fd76aff41bb9> in <module>()
----> 1 learnClass.fit_one_cycle(1, 1e-2)

C:\ProgramData\Anaconda3\lib\site-packages\fastai\train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, wd, **kwargs)
     16     cbs = [OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor,
     17                              pct_start=pct_start, **kwargs)]
---> 18     learn.fit(cyc_len, max_lr, wd=wd, callbacks=cbs)
     19 
     20 def lr_find(learn:Learner, start_lr:Floats=1e-5, end_lr:Floats=10, num_it:int=100, **kwargs:Any):

C:\ProgramData\Anaconda3\lib\site-packages\fastai\basic_train.py in fit(self, epochs, lr, wd, callbacks)
    136         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    137         fit(epochs, self.model, self.loss_fn, opt=self.opt, data=self.data, metrics=self.metrics,
--> 138             callbacks=self.callbacks+callbacks)
    139 
    140     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

C:\ProgramData\Anaconda3\lib\site-packages\fastai\basic_train.py in fit(epochs, model, loss_fn, opt, data, callbacks, metrics)
     89     except Exception as e:
     90         exception = e
---> 91         raise e
     92     finally: cb_handler.on_train_end(exception)
     93 

C:\ProgramData\Anaconda3\lib\site-packages\fastai\basic_train.py in fit(epochs, model, loss_fn, opt, data, callbacks, metrics)
     77             cb_handler.on_epoch_begin()
     78 
---> 79             for xb,yb in progress_bar(data.train_dl, parent=pbar):
     80                 xb, yb = cb_handler.on_batch_begin(xb, yb)
     81                 loss = loss_batch(model, xb, yb, loss_fn, opt, cb_handler)[0]

C:\ProgramData\Anaconda3\lib\site-packages\fastprogress\fastprogress.py in __iter__(self)
     59         self.update(0)
     60         try:
---> 61             for i,o in enumerate(self._gen):
     62                 yield o
     63                 if self.auto_update: self.update(i+1)

C:\ProgramData\Anaconda3\lib\site-packages\fastai\data.py in __iter__(self)
     50     def __iter__(self):
     51         "Process and returns items from `DataLoader`."
---> 52         for b in self.dl: yield self.proc_batch(b)
     53 
     54     def one_batch(self)->Collection[Tensor]:

C:\ProgramData\Anaconda3\lib\site-packages\fastai\data.py in __iter__(self)
     50     def __iter__(self):
     51         "Process and returns items from `DataLoader`."
---> 52         for b in self.dl: yield self.proc_batch(b)
     53 
     54     def one_batch(self)->Collection[Tensor]:

C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __next__(self)
    334                 self.reorder_dict[idx] = batch
    335                 continue
--> 336             return self._process_next_batch(batch)
    337 
    338     next = __next__  # Python 2 compatibility

C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _process_next_batch(self, batch)
    355         self._put_indices()
    356         if isinstance(batch, ExceptionWrapper):
--> 357             raise batch.exc_type(batch.exc_msg)
    358         return batch
    359

RuntimeError: Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 106, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "C:\ProgramData\Anaconda3\lib\site-packages\fastai\text\data.py", line 278, in pad_collate
    return res, torch.tensor([s[1] for s in samples]).squeeze()
RuntimeError: Could not infer dtype of numpy.int64

Context:
I want to run a classification model on complaints text using fastai. I have watched all the part 1 videos and a few from part 2, namely lesson 10. I am attempting to reproduce the IMDb classification model from lesson 10 using fastai v1, and have tried to follow the documentation as well as read through the Python code (I’m a beginner in Python, unfortunately). I deliberately chose slightly different variable and file names so that a successful reproduction would not be a pure fluke. I have managed to build my language model, create a data bunch for the classification model, and load my encoder from the language model (side note: I think the documentation could be much better; it was quite painful for me to get this far). The next step was to start training, but I got stuck.

Code:
The point where I hit the error was here:

learnClass.fit_one_cycle(1, 1e-2)

The learner is created by
learnClass = RNNLearner.classifier(data_clas, drop_mult=0.5)
learnClass.load_encoder('lm_enc')

data_clas was created by the following code:
data_clas = TextClasDataBunch.from_df(path=ClassPath,train_df=TrainDF,valid_df=TestDF,vocab=data_lm.train_ds.vocab,bs=8,classes=Classes,n_labels=1)

The source data was a dataframe where the first column had the labels and the second column had the review text. The labels are all 0 and 1 representing ‘neg’ and ‘pos’ respectively.
ClassPath = Path('C:/Jupyter/Class')
TrainDF = pd.read_csv(ClassPath/'train.csv', header=None)
TestDF = pd.read_csv(ClassPath/'test.csv', header=None)
Classes = ['neg','pos']

Please help me understand why this is happening and how to move forward.

jeremy · October 21, 2018, 11:16pm

Please don’t at-mention fast.ai people unless you have a specific issue that can only be addressed by them.

You’ll find that injecting yourself into our mentions unnecessarily will likely cause your requests to be ignored by us, and other students may feel the same way.

zhye · October 21, 2018, 11:35pm

Thanks for the heads up Jeremy. I will certainly keep this one in mind and not tag people in the future.
From my perspective, I am really stuck because I don’t know Python. I’m guessing the issue is either a result of me not using one of the functions correctly (there is a lack of documentation on the detailed usage of input parameters, so if anyone can give me a pointer here, hopefully it will be a very quick answer), or a bug in the source code, which I am certainly not qualified to fix.
Right now, I’m trying to google my way through every single line of code, and if I do find something I will certainly share it. At the same time, I thought this is a question that fastai developers can answer much more easily, so I would love any pointers.
My progress right now: I’ve found in the source code that if I have more than one column of labels, the labels are stored as float, and when there is only one, the column is stored as int64. Given that this is explicitly called out in the code, I’m assuming it is fully intentional, so the source of the issue really sits in the callback function where the error was raised. I’m painfully tracing through the code relating to callbacks at the moment.
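
The dtype behaviour described above can be illustrated with a small sketch. It is not fastai's code: `make_labels` is a hypothetical helper that only mimics the rule as observed (a single label column kept as `int64`, several label columns stored as floats), and the exact float width is an assumption.

```python
import numpy as np

def make_labels(rows, n_labels):
    """Hypothetical helper mimicking the observed rule, not fastai's implementation."""
    arr = np.asarray(rows)
    if n_labels > 1:
        # Several label columns: one float per class (assumed float32 here).
        return arr.astype(np.float32)
    # A single label column: one integer class index per row.
    return arr.reshape(-1).astype(np.int64)

single = make_labels([0, 1, 1, 0], n_labels=1)
multi = make_labels([[0, 1], [1, 1]], n_labels=2)

# Indexing a NumPy array yields a NumPy scalar, not a built-in Python int.
print(type(single[0]).__name__)  # int64
print(multi.dtype)               # float32
```

The point of the sketch is the last comment: each label pulled out of an `int64` array is a `numpy.int64` object, which is the type named in the error message.
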
If anyone can give a hand with this, it will be much appreciated.

PS I have deleted my @ mentions

sgugger · October 22, 2018, 10:24am

Hi there, from your error message I think there is something wrong with your labels. Could you:

try TrainDF[0] = TrainDF[0].astype(np.int64) (and the same for TestDF) and see if it fixes your issue?
show me the result of TrainDF.head()
tell me what data_clas.train_ds[0] returns (and, if there is no bug, the types of the two things it returns).
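
The third check, looking at what one dataset item contains and the types of its parts, can be sketched as below. `describe_sample` is a hypothetical helper, and the `sample` tuple is only a stand-in for what `data_clas.train_ds[0]` might return; the real item layout depends on the fastai version.

```python
import numpy as np

def describe_sample(sample):
    """Return (type name, dtype or None) for each part of an (x, y) item."""
    report = []
    for part in sample:
        dtype = getattr(part, "dtype", None)
        report.append((type(part).__name__, str(dtype) if dtype is not None else None))
    return report

# Stand-in for data_clas.train_ds[0]: token ids plus a single label.
sample = (np.array([2, 45, 7, 9]), np.int64(1))
for name, dtype in describe_sample(sample):
    print(name, dtype)
```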

The library is in its early stage and there’ll be more documentation as the new version of the course progresses. In the meantime, we’re always happy to help, but as Jeremy explained, at-mentions can be really distracting and you’ll get answered more quickly without them.
Also, I understand it can be frustrating to be blocked by an error, but please leave us a full 24 hours to reply.

zhye · October 22, 2018, 12:42pm

Thanks for the reply!

Unfortunately, that didn’t help - got the exact same error.
Please see below - this looks identical to the Text Overview documentation page, as far as I recall (the doc page is down at the moment).

See the output below.

This is what I discovered today:
I looked into learnClass.fit_one_cycle(1, 1e-2) and copied out the code from inside the method, and executed them line by line.
The point where the aforementioned error occurs is line 79 of basic_train.py - the loop over progress_bar. I tried the following to confirm exactly where the issue is:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-16-c3e0c69ffc3a> in <module>()
----> 1 for xb,yb in test_pb: print(xb)

C:\ProgramData\Anaconda3\lib\site-packages\fastprogress\fastprogress.py in __iter__(self)
     59         self.update(0)
     60         try:
---> 61             for i,o in enumerate(self._gen):
     62                 yield o
     63                 if self.auto_update: self.update(i+1)

C:\ProgramData\Anaconda3\lib\site-packages\fastai\data.py in __iter__(self)
     50     def __iter__(self):
     51         "Process and returns items from `DataLoader`."
---> 52         for b in self.dl: yield self.proc_batch(b)
     53 
     54     def one_batch(self)->Collection[Tensor]:

C:\ProgramData\Anaconda3\lib\site-packages\fastai\data.py in __iter__(self)
     50     def __iter__(self):
     51         "Process and returns items from `DataLoader`."
---> 52         for b in self.dl: yield self.proc_batch(b)
     53 
     54     def one_batch(self)->Collection[Tensor]:

C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __next__(self)
    334                 self.reorder_dict[idx] = batch
    335                 continue
--> 336             return self._process_next_batch(batch)
    337 
    338     next = __next__  # Python 2 compatibility

C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _process_next_batch(self, batch)
    355         self._put_indices()
    356         if isinstance(batch, ExceptionWrapper):
--> 357             raise batch.exc_type(batch.exc_msg)
    358         return batch
    359 

RuntimeError: Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 106, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "C:\ProgramData\Anaconda3\lib\site-packages\fastai\text\data.py", line 278, in pad_collate
    return res, torch.tensor([s[1] for s in samples]).squeeze()
RuntimeError: Could not infer dtype of numpy.int64

I then tracked the code down to line 52 of data.py, which is a generator that yields proc_batch. I attempted to reproduce the loop, and below are the results:

Got the numpy.int64 error again.
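
The debugging approach above, stepping through the loader by hand until it raises, can be written as a small generic helper. This is only a sketch: `first_failure` and `broken_loader` are hypothetical names, and the toy generator stands in for `data_clas.train_dl`, which is not reproduced here.

```python
def first_failure(batches):
    """Iterate over `batches`; return (index, exception) for the first error, or None."""
    iterator = iter(batches)
    index = 0
    while True:
        try:
            next(iterator)
        except StopIteration:
            return None  # every batch was produced without error
        except Exception as exc:  # the loader raised while building this batch
            return index, exc
        index += 1

def broken_loader():
    """Toy stand-in for a data loader whose second batch cannot be built."""
    yield ([1, 2], [0, 1])
    raise RuntimeError("Could not infer dtype of numpy.int64")

result = first_failure(broken_loader())
print(result)
```

Knowing that the exception comes from producing a batch, rather than from the model or the callbacks, narrows the search to the dataset and its collate function.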

I am now going back to understand the content of train_dl by using a similar line by line approach to:

data_clas = TextClasDataBunch.from_df(path=ClassPath,train_df=TrainDF,valid_df=TestDF,vocab=data_lm.train_ds.vocab
                                      ,bs=8,classes=Classes,n_labels=1)

PS I’m new to the forum and not familiar with the rules. Even though I was getting frustrated, I wasn’t getting impatient, I just didn’t know if @ mentions are required to get attention. I’m a fan of Fastai, and want to eventually become a contributor, so the exercise of reproducing the code line by line is helping me learn python very rapidly. Having said that, right now I still need help to get through the technical road block.

sgugger · October 22, 2018, 1:23pm

Yes, the problem being in pad_collate, it’s logical you can’t iterate through batches. I can’t reproduce the issue, and as @tblock suggested, maybe it’s an installation problem?
Maybe create a new environment and reinstall the library (it’s super quick) then see if the problem persists?

Otherwise, we’ll try to get to the bottom of this together.

zhye · October 23, 2018, 1:53am

So I created a new Python 3.6 environment, had everything set up with CUDA 9.2, and installed everything I could with pip instead of conda.
Unfortunately I get the same issue, but this time I have a warning that I didn’t see before:

Is this possibly related? How would I fix the problem?

Update: the warning appears to be an issue with spaCy, so I’m inclined to think it is not really related.
Below is the code I ran in anaconda prompt for the latest installation:

conda create -n py36 python=3.6 anaconda
conda activate py36
python -m pip install --upgrade pip
pip3 install http://download.pytorch.org/whl/cu92/torch-0.4.1-cp36-cp36m-win_amd64.whl
pip3 install torchvision
conda install -c conda-forge spacy
python -m spacy download en
pip install fastai

zhye · October 23, 2018, 6:16am

Ok I finally managed to get the devil running. I did have to go into line 281 of text\data.py and make some changes.
I found this discussion, about halfway down the page, of how a PyTorch tensor doesn’t accept integers: https://github.com/pytorch/pytorch/issues/8365
So I went and changed pad_collate() from
return res, torch.tensor([s[1] for s in samples]).squeeze()
to
return res, torch.tensor([np.long(s[1]) for s in samples]).squeeze()
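
The same idea can be expressed without `np.long`, which is not available in every NumPy release. A minimal sketch, assuming each sample is a `(tokens, label)` pair as in `pad_collate`; `labels_as_python_ints` is a hypothetical helper, not part of fastai, and the resulting list is what would then be handed to `torch.tensor(...)`.

```python
import numpy as np

def labels_as_python_ints(samples):
    """Extract the label (second element) of each sample as a built-in int."""
    # int() accepts NumPy integer scalars and returns a plain Python int.
    return [int(s[1]) for s in samples]

# Toy samples shaped like (token_ids, label) pairs.
samples = [([2, 45, 7], np.int64(0)), ([2, 9], np.int64(1))]
labels = labels_as_python_ints(samples)
print(labels, type(labels[0]).__name__)  # [0, 1] int
```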

See image below:

sgugger · October 23, 2018, 12:22pm

This is weird, as when I tried to do the same thing with converting your column to np.int64, it didn’t work… Do you mind sending me a sample of your dataframe so I can see if I can replicate the bug?

zhye · October 23, 2018, 12:27pm

I’m happy to give you my whole jupyter notebook, I just can’t attach that to the reply. Is there a way for me to send it to you?

sgugger · October 23, 2018, 2:23pm

Can you put it in a small repo on GitHub? Include a small subset of the files you use to create your dataframes (but one that still manages to reproduce the bug).

zhye · October 24, 2018, 7:51am

I see why you are looking for a sample of my dataframe now. I have truncated some of the data files so I can upload them to GitHub. In the parent directory there is a Jupyter notebook that has all the details. The LM folder contains all the data used for the language model, and the Class folder contains the data used for the classification (which is where the error happened).

sgugger · October 24, 2018, 2:10pm

Even with your repo, I can’t reproduce the bug. This is really weird, maybe it’s a problem with your version of pytorch then?

zhye · October 24, 2018, 10:12pm

Yeah strange, this is quite literally the version I installed:

pip3 install http://download.pytorch.org/whl/cu92/torch-0.4.1-cp36-cp36m-win_amd64.whl

Note that I never installed the nightly build, because it tells me the package cannot be found. From the information so far, I can only guess that this is either a difference in the nightly version, or that I’m potentially using a newer version than you, in which they have stopped accepting integers in the tensor() method.

Is it worth applying my suggested code change anyway such that it caters for all versions of pytorch?

I also noticed that a separate post also made mention of a potentially similar situation, but the user fixed it in a different way. See Fastai V1 and multilabel (Ulmfit)

sgugger · October 25, 2018, 1:58am

Ah! You have pytorch 0.4 and not 1.0. That explains it all.
The fastai library only supports pytorch v1 and was designed this way, so you should expect a lot more things to go wrong if you stick to pytorch 0.4.

I know v1 isn’t available on Windows yet, but don’t expect everything to work properly if you don’t switch to Linux instances for now.
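
A check like the one below would have surfaced the version mismatch early. It is a sketch, not something fastai provides: `meets_minimum` is a hypothetical helper, and in practice the string would come from `torch.__version__`; the strings used here are just examples.

```python
import re

def meets_minimum(version, minimum=(1, 0)):
    """Return True if a version string such as '0.4.1' is at least `minimum`."""
    # Keep only the leading numeric components, so suffixes such as
    # '.dev20181022' or '+cu92' are ignored.
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    parts = tuple(int(p) for p in match.group(0).split("."))
    # Pad short versions ('1' -> (1, 0)) so the tuple comparison is fair.
    parts = parts + (0,) * (len(minimum) - len(parts))
    return parts[:len(minimum)] >= tuple(minimum)

print(meets_minimum("0.4.1"))              # False: the version installed in this thread
print(meets_minimum("1.0.0.dev20181022"))  # True: an example pre-release string
```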

zhye · October 25, 2018, 2:12am

Ah windows… fml

danielhunter · October 27, 2018, 11:34pm

@sgugger based on this resolution and other posts I’ve seen around the forum, would it be valuable if I wrote a standard “bug report / request for help” template that people can use (for example: a minimum code snippet to reproduce the error if possible, versions of major libraries, system OS, etc.)?

Obviously, keeping it as short as possible to avoid boilerplate and overhead is ideal, so that it stays as easy as possible for beginners to ask questions, but it might be nice for solving issues like these.

jeremy · October 28, 2018, 2:10am

That sounds great! There’s some ideas you could steal from here:

https://docs-dev.fast.ai/troubleshoot.html#support

danielhunter · October 31, 2018, 6:09am

Nice, thanks. First pass (feedback welcome, feel free to use some / all / none of it wherever appropriate or valuable).

Request for Help

Please see troubleshooting docs here: https://docs-dev.fast.ai/troubleshoot.html

In order to allow us to help you most effectively, please keep the following in mind:

Please be as specific as possible. Include relevant code, error outputs, etc. “fastai isn’t working” is a lot worse than “I can’t figure out how to use a different sampler in my dataloader”, which is a lot worse than "Here’s a code snippet where I’m trying to use a WeightedRandomSampler, and it’s giving the following error: ..."
Please search your question here on the forums, as well as on Google (bonus points if you include links you found in your search that were helpful but didn’t fully answer your question!)
Please include a (the shortest possible) code snippet to demonstrate what you’re trying to do (for more info, please see https://stackoverflow.com/help/mcve). In the case of Deep Learning problems, consider what’s really needed for someone to reproduce your problem. Does the person helping you need to download your entire dataset, or does a single piece of data work? Or even better, does a torch.ones tensor of the same shape as your data produce the same error?
Please include your system version and setup: are you developing locally on Windows? Remotely on AWS? Please especially include the output of the following command (run from the command line):

python -c 'import fastai; fastai.show_install(1)'
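
If `show_install` is unavailable (for example because fastai itself fails to import), similar information can be gathered with the standard library. A rough sketch, assuming Python 3.8 or later; `collect_env` is a hypothetical helper and the default package list is only a suggestion.

```python
import platform
import sys
from importlib import metadata

def collect_env(packages=("fastai", "torch", "numpy")):
    """Gather OS, Python, and installed package versions for a help request."""
    info = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info

for key, value in collect_env().items():
    print(f"{key}: {value}")
```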

Bug Report

Bug reports are welcome at https://github.com/fastai/fastai/issues/new – but if for some reason you’d prefer to post it here, that’s okay. We do ask that you follow the previous instructions as appropriate – giving us the code needed to reproduce your bug is especially helpful.

And remember, we’re all here to help you, but we’re not magic (and also unpaid). We’ll try our best to get you to a solution, and showing that you’re also putting in effort is appreciated.

jeremy · October 31, 2018, 6:15pm

Thanks! Updated guide here:

The biggest issue now is that the ‘troubleshooting’ page only covers local installation - not troubleshooting for gradient/gcp/etc. So a really helpful PR would be something at the top that points people to the setup guides on course-v3.fast.ai.