Getting an error while running the fastai sample code for training the model

faysalhossain2007 · August 15, 2018, 5:46am

I am getting the following error while trying to train the model. I am running the code on the GPU. I am also facing another problem during fine-tuning: it takes a long time to fine-tune the pre-trained model. Any suggestions for making the process faster? P.S. I loaded CUDA before running the process and checked the GPU with nvidia-smi.

: /localtmp/fs5ve/fastai/courses/dl2/imdb_scripts ; python train_clas.py data/wiki/en 0 --lm-id pretrain_wt103 --clas-id pretrain_wt103 --cl 5
dir_path data/wiki/en; cuda_id 0; lm_id pretrain_wt103; clas_id pretrain_wt103; bs 64; cl 5; backwards False; dropmult 1.0 unfreeze True startat 0; bpe False; use_clr True;use_regular_schedule False; use_discriminative True; last False;chain_thaw False; from_scratch False; train_file_id
Traceback (most recent call last):
  File "train_clas.py", line 148, in <module>
    if __name__ == '__main__': fire.Fire(train_clas)
  File "/zf18/fs5ve/.conda/envs/fastai/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/zf18/fs5ve/.conda/envs/fastai/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/zf18/fs5ve/.conda/envs/fastai/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "train_clas.py", line 51, in train_clas
    assert trn_lbls.shape[1] == 1 and val_lbls.shape[1] == 1, 'This classifier uses cross entropy loss and only support single label samples'
IndexError: tuple index out of range
(fastai)

Hadus · August 15, 2018, 9:49am

Do both of these print True?

print(torch.cuda.is_available())

print(torch.backends.cudnn.enabled)
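
If it helps, the two checks above can be bundled into a small diagnostic that also reports whether PyTorch is importable from the active environment at all. This is only a sketch: the function name cuda_report and the dictionary keys are made up for illustration, and it uses nothing beyond torch.cuda.is_available(), torch.backends.cudnn.enabled and torch.cuda.get_device_name().

```python
def cuda_report():
    """Collect basic facts about the PyTorch/CUDA setup in this environment."""
    report = {"torch_installed": False}
    try:
        import torch
    except ImportError:
        return report  # PyTorch cannot be imported from this environment

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    report["cudnn_enabled"] = torch.backends.cudnn.enabled
    if report["cuda_available"]:
        # Name of the first visible GPU (index 0)
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


report = cuda_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

If cuda_available comes back False even though nvidia-smi shows the card, the problem is probably in the environment rather than in the training script.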

Are you using .cuda() on your model?

Could you include some code and what the data looks like?

I believe the problem is with your data. What are you using to create your data loader? Is it ModelData(... or ImageClassifierData. ... or something else?

MicPie · August 15, 2018, 9:53am

Your data is generating a dimension mismatch.

To find the error, step through your code with the Python debugger, pdb:
• pdb.set_trace() to set a breakpoint
• %debug magic to trace an error

Commands you need to know:
• s - step: execute and step into function
• n - next: execute the current line, stepping over function calls
• c - continue: continue execution until next breakpoint
• u - up: move one level up in the stack trace
• d - down: move one level down in the stack trace
• p - print: print variable, example: “p x” prints variable x
• l - list: lists 11 lines of code around the current line

Jeremy shows how to use it in lesson 8.
Learning pdb will pay off big time in the long run.

pdb cheatsheet: https://nblock.org/2011/11/15/pdb-cheatsheet/
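
As a rough illustration of the breakpoint approach described above, here is a minimal sketch. The function mean_length and the DEBUG switch are made up for this example and have nothing to do with the fastai scripts; the switch only exists so the snippet runs without stopping unless you ask it to.

```python
import pdb

DEBUG = False  # hypothetical switch; set to True to stop at the breakpoint


def mean_length(sequences):
    """Average length of a list of token sequences."""
    lengths = [len(seq) for seq in sequences]
    if DEBUG:
        # At the (Pdb) prompt try: "p lengths", then "n" or "c"
        pdb.set_trace()
    return sum(lengths) / len(lengths)


result = mean_length([[1, 2, 3], [4, 5], [6]])
print(result)
```

With DEBUG set to True, execution pauses inside the function and the commands from the list above can be used to inspect lengths before the return.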

If you run into problems with code that has been moved to the GPU, run it on the CPU, or use the following to get more meaningful information when debugging CUDA errors (from http://lernapparat.de/debug-device-assert/):

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
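
A note on where to put the snippet above: with CUDA_LAUNCH_BLOCKING set to 1, CUDA kernel launches run synchronously, so the Python traceback points at the call that actually failed instead of a later one. As far as I know the variable has to be in place before CUDA is initialised, so the safest spot is at the very top of the script, before torch is imported. A minimal sketch (the try/except is only there so it also runs where PyTorch is missing):

```python
import os

# Set this before anything initialises CUDA; before importing torch is safest.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

try:
    import torch  # imported only after the variable is in place
    print("CUDA available:", torch.cuda.is_available())
except ImportError:
    print("PyTorch is not installed in this environment")

print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

Setting the variable in the shell before launching the script should have the same effect. Expect the run to be slower while it is enabled, so remove it again once the error is found.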

Best regards
Michael

faysalhossain2007 · August 15, 2018, 9:53pm

I am using the code as uploaded on GitHub. I haven’t made any modifications to it.

faysalhossain2007 · August 15, 2018, 9:54pm

Is there any chance that the code uploaded on GitHub is not using the GPU?
What does the following code do? As far as I know, it will give me more details while debugging.

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

Is there any way to make my fine-tuning process faster?

Hadus · August 16, 2018, 10:54am

In the train_clas.py:

There is a problem with your y labels.

What are they?

You can check by editing train_clas.py so that, just before the assert (line 51), there is:

print(trn_lbls.shape) # DELETE AFTER!

Then re-run it and it should print out the shape of your labels.
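
For context on why the assert itself crashes rather than failing with its own message: a shape is a plain tuple, and a 1-D array has a shape with a single entry, so indexing shape[1] raises the IndexError before the comparison is even evaluated. Here is a minimal sketch using plain tuples in place of trn_lbls.shape. The sizes and the helper name describe_labels are made up for illustration, and I am only assuming the labels are 1-D; the print will tell.

```python
def describe_labels(shape):
    """Say how a label-array shape fares against the single-label assert."""
    if len(shape) == 1:
        return "1-D: shape[1] does not exist, so the assert raises IndexError"
    if shape[1] == 1:
        return "2-D with one column: passes the single-label assert"
    return "2-D with several columns: fails the single-label assert"


shape_1d = (25000,)    # hypothetical: labels saved as a flat array
shape_2d = (25000, 1)  # hypothetical: labels saved as a single column

try:
    shape_1d[1]  # same indexing the assert performs on trn_lbls.shape
except IndexError as err:
    error_message = str(err)

print(error_message)
print(describe_labels(shape_1d))
print(describe_labels(shape_2d))
```

If the printed shape has only one number in it, the labels are 1-D. Whether reshaping them into a column is the right fix depends on how the label file was produced, which I can't tell from here.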

Oh, and is training also slow, not just fine-tuning?

MicPie · August 16, 2018, 10:56am

Hello,

The explanation of that code, including how to move things to the CPU (with .cpu()), is covered in the last link in my post above.

What exactly do you mean here: the training process?

Best regards
Michael

MicPie · August 16, 2018, 11:06am

It is better to use pdb.set_trace(), which is much more flexible.
Then you can easily print all the variables with “p var” including their shape with “p var.shape”, etc.

Trust me, a few print statements are usually not enough, and in the end you have more print lines than actual code.

Hadus · August 16, 2018, 2:33pm

I think from the shape we should be able to tell what is wrong… and for @faysalhossain2007 it is easier to edit the file than to learn to navigate there with the debugger.

But I agree that it is indeed better to use the debugger most of the time.

faysalhossain2007 · August 19, 2018, 6:47pm

Thanks for the reply. Can you please tell me whether the pretrained wiki103 model (the one fastai uses as its pretrained model) is a sentiment classifier or not?

pricel3ss · August 22, 2018, 3:25pm

Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

I am trying to fit a model from the lesson 9 object detection tutorial and am getting this error. I am struggling HEAVILY! Does anyone know how to fix this?

Hadus · August 29, 2018, 10:44am

You should have opened a new thread… that way people would have answered faster. The problem is that your model is not on the gpu. When you make the model you should call .cuda() on it.
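
To make that concrete, here is a minimal sketch of keeping the model weights and the input on the same device. The Conv2d layer is just a stand-in, not the lesson 9 model, and the try/except only keeps the snippet runnable where PyTorch is absent. It uses .to(device), which has the same effect as .cuda() when a GPU is present and falls back to the CPU otherwise.

```python
try:
    import torch
    import torch.nn as nn
except ImportError:  # keeps the sketch runnable where PyTorch is absent
    torch = None

if torch is not None:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Stand-in model; its weights start out on the CPU
    model = nn.Conv2d(3, 8, kernel_size=3)
    model = model.to(device)  # move the weights to the chosen device

    # Create the input on the same device as the weights
    batch = torch.randn(2, 3, 16, 16, device=device)
    output = model(batch)

    weight_device = next(model.parameters()).device
    print(weight_device, batch.device, tuple(output.shape))
```

The error in the post above is what you get when the two devices printed here differ: the input is a CUDA tensor while the weights are still CPU tensors.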

pricel3ss · August 29, 2018, 12:09pm

This site won’t let me open a new thread. I thought of your solution, but sadly it’s not that simple.

Also, I’m running the notebook out of the box with no changes, so the error seems to be in one of the fastai class declarations itself.

Hadus · August 31, 2018, 6:32pm

How does it not allow you to open a new thread?

Another thing I would try is downloading a different version of the notebook from GitHub.

The error is almost surely not with the FastAI library.

You should keep trying to open a new thread, it would be solved a lot faster there… but if you can’t, could you post the error here?