"CUDA error" after 12 hrs and 38% training on large Language model

Hi all,

While training a clinical-notes language model on a large (7 GB) dataset, I got the error 'CUDA error: device-side assert triggered' after at least 12 hours of training, at least 38% of the way through the first epoch. The traceback is below.

This is my code:

path = datapath4file('/media/DataHD2/Notes_PHI_20190121/notes_dana_hp')
data_lm = load_data(path, fname='data_lm_hp_80pct_export.pkl', bs=32)
learn = language_model_learner(data_lm, AWD_LSTM)
learn.unfreeze()
learn.lr_find()
learn.recorder.plot(skip_end=15)

The data file 'data_lm_hp_80pct_export.pkl' was 20,209,788,745 bytes.

[lr_find plot]

moms = (0.8, 0.7)
print(datetime.datetime.now(), 'Starting training 10 epochs', flush=True)
learn.fit_one_cycle(10, slice(1e-2), moms=moms)
print(datetime.datetime.now(), 'Finished training 10 epoch', flush=True)
learn.save('fit_10_hp_80pct')
learn.save_encoder('fit_10_hp_80pct_enc')
print(datetime.datetime.now(), 'Finished saving models')

# Start 2019-03-31; 12:03AM; estimated time 32 hrs / epoch = 320 hours; ETA 2019-04-13; 8AM
# GPU memory 4178 / 8114; util 81%; progress gauge 2209/572361
# 2019-03-31; 9:02AM; 26.83%; 153623 / 572361; est. done Epoch 1 in 24 hr 30 min; GPU same
# 2019-03-31; 12:45PM; 37.93%; 217108 / "; est. done Epoch 1 in 20 hr 48 min; same GPU mem.
# 2019-04-01; 9:41AM; CRASHED (sometime before now; same error message)

Here is the traceback:

RuntimeError Traceback (most recent call last)
in
2 moms = (0.8,0.7)
3 print(datetime.datetime.now(), ‘Starting training 10 epochs’,flush=True)
----> 4 learn.fit_one_cycle(10, slice(1e-2), moms=moms)
5 #learn.fit_one_cycle(1, 1e-2)
6 print(datetime.datetime.now(), ‘Finished training 10 epoch’,flush=True)

~/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, final_div, wd, callbacks, tot_epochs, start_epoch)
20 callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor, pct_start=pct_start,
21 final_div=final_div, tot_epochs=tot_epochs, start_epoch=start_epoch))
---> 22 learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
23
24 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, wd:float=None):

~/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
194 callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
195 if defaults.extra_callbacks is not None: callbacks += defaults.extra_callbacks
--> 196 fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
197
198 def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, learn, callbacks, metrics)
98 for xb,yb in progress_bar(learn.data.train_dl, parent=pbar):
99 xb, yb = cb_handler.on_batch_begin(xb, yb)
--> 100 loss = loss_batch(learn.model, xb, yb, learn.loss_func, learn.opt, cb_handler)
101 if cb_handler.on_batch_end(loss): break
102

~/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
23 if not is_listy(xb): xb = [xb]
24 if not is_listy(yb): yb = [yb]
---> 25 out = model(*xb)
26 out = cb_handler.on_loss_begin(out)
27

~/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
487 result = self._slow_forward(*input, **kwargs)
488 else:
--> 489 result = self.forward(*input, **kwargs)
490 for hook in self._forward_hooks.values():
491 hook_result = hook(self, input, result)

~/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/torch/nn/modules/container.py in forward(self, input)
90 def forward(self, input):
91 for module in self._modules.values():
---> 92 input = module(input)
93 return input
94

~/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
487 result = self._slow_forward(*input, **kwargs)
488 else:
--> 489 result = self.forward(*input, **kwargs)
490 for hook in self._forward_hooks.values():
491 hook_result = hook(self, input, result)

~/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/fastai/text/models/awd_lstm.py in forward(self, input, from_embeddings)
109 self.bs=bs
110 self.reset()
--> 111 raw_output = self.input_dp(input if from_embeddings else self.encoder_dp(input))
112 new_hidden,raw_outputs,outputs = [],[],[]
113 for l, (rnn,hid_dp) in enumerate(zip(self.rnns, self.hidden_dps)):

~/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
487 result = self._slow_forward(*input, **kwargs)
488 else:
--> 489 result = self.forward(*input, **kwargs)
490 for hook in self._forward_hooks.values():
491 hook_result = hook(self, input, result)

~/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/fastai/text/models/awd_lstm.py in forward(self, x)
21 def forward(self, x:Tensor)->Tensor:
22 if not self.training or self.p == 0.: return x
---> 23 m = dropout_mask(x.data, (x.size(0), 1, x.size(2)), self.p)
24 return x * m
25

~/anaconda3/envs/fastaiv1/lib/python3.6/site-packages/fastai/text/models/awd_lstm.py in dropout_mask(x, sz, p)
10 def dropout_mask(x:Tensor, sz:Collection[int], p:float):
11 "Return a dropout mask of the same type as x, size sz, with probability p to cancel an element."
---> 12 return x.new(*sz).bernoulli_(1-p).div_(1-p)
13
14 class RNNDropout(nn.Module):

RuntimeError: CUDA error: device-side assert triggered

During the first few hours of training, memory usage looked low and stable: 4178 / 8114 MiB of GPU memory on one NVIDIA GTX 1080, and 34.3 GB / 62.8 GB of system RAM. Neither changed over the course of training, so I can think of two possible reasons for the crash:

  • Something in my data was unexpected. This was still the first epoch, so I don’t yet know for sure that all of my data is OK.
  • Maybe training hit a bug at the end of epoch 1; I don’t know whether it got that far.

So I figure I need to identify which data sample it crashed on. I used %debug and the debugger seemed to work fine, but when I tried to look at the current data at any level of the traceback stack, I got another CUDA error:

*** RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1544174967633/work/aten/src/THC/THCCachingHostAllocator.cpp:265

Any thoughts on how I can approach debugging this mess?

I know I should first try training the model on smaller data sets, and I did, but they did not reproduce the error.

CUDA errors are very tricky to debug. A 'device-side assert triggered' error usually comes from a bad index, but not always. To make things harder:

  • if it’s thrown during some asynchronous computation, the error trace may show you a line of code that has nothing to do with the actual problem
  • it breaks your CUDA context entirely, so you can’t inspect anything that is on the GPU

One thing that is helpful is to set the following and re-run your code:

 import os
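 # make CUDA kernel launches synchronous, so the traceback points at the op that actually failed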
 os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

That may give you a better error message.

Thank you Sylvain; I did that

os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

last time, but without the "import os"; I just added the extra line now. Any thoughts about callbacks I could insert to record the index of the input x[] that was seen at the crash, or just before it? I don’t mind creating a huge temporary log file. The progress bar didn’t help; it was just a solid red bar that said "interrupted" to its right. I’m not too proud to put print statements in my callback :)


No idea, since the bug could come from anywhere in your code. Make sure you have the right vocab size, since the error could come from trying to access an embedding that doesn’t exist.
The error message is super unhelpful in the sense that there is nothing in that final line of code that could produce the device-side assert error.

Maybe do your first epoch without any shuffling (data_lm.train_dl.shuffle = False), so that you see the samples in order, and write the indexes to a log file as you go; then you’ll know exactly which batch is problematic.
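
For what it’s worth, here is a minimal sketch of such an index-logging callback; the BatchLogger name and the log filename are mine, not from the thread, and it assumes the fastai v1 callback API used elsewhere in this post:

import datetime

class BatchLogger(LearnerCallback):
    "Append the current batch number to a log file before every forward pass."
    def __init__(self, learn:Learner, fname='batch_log.txt'):
        super().__init__(learn)
        self.fname = fname

    def on_epoch_begin(self, **kwargs):
        self.batch_num = 0

    def on_batch_begin(self, last_input, last_target, train, **kwargs):
        self.batch_num += 1
        # flush immediately, so the last line in the file is the batch that crashed
        with open(self.fname, 'a') as f:
            f.write(f'{datetime.datetime.now()} batch {self.batch_num} shape {tuple(last_input.shape)}\n')

data_lm.train_dl.shuffle = False   # see the samples in order, as suggested above
learn.fit_one_cycle(10, slice(1e-2), moms=(0.8, 0.7), callbacks=[BatchLogger(learn)])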


Wow, that hint about the vocab is intriguing! I assumed the vocab in the lines below would come from my data, in data_lm_hp_80pct_export.pkl, and not from AWD_LSTM.

data_lm = load_data(path, fname='data_lm_hp_80pct_export.pkl', bs=32)
learn = language_model_learner(data_lm, AWD_LSTM)

Here is the code I used to build the pkl file:

__all__ = ['BaseTokenizer', 'SpacyTokenizer', 'Tokenizer', 'Vocab', 'fix_html', 'replace_all_caps', 'replace_rep', 'replace_wrep',
           'rm_useless_spaces', 'med_spec_add_spaces', 'spec_add_spaces', 'BOS', 'EOS', 'FLD', 'UNK', 'PAD', 'TK_MAJ', 'TK_UP', 'TK_REP', 'TK_WREP',
           'deal_caps']

class MedTokenizer(BaseTokenizer):
    "Modification of a spacy tokenizer to make it a BaseTokenizer."
    def __init__(self, lang:str):
        self.tok = spacy.blank(lang)

    def tokenizer(self, t:str) -> List[str]:
        return [t.text for t in self.tok.tokenizer(t)]

    def add_special_cases(self, toks:Collection[str]):
        for w in toks:
            self.tok.tokenizer.add_special_case(w, [{ORTH: w}])

def med_spec_add_spaces(t:str) -> str:
    "For clinical notes, add spaces around '<', '>', ':', '-' (and newline), in addition to '/' and '#', in t."
    return re.sub(r'([/#<>:\-\n])', r' \1 ', t)

defaults.text_pre_rules = [fix_html, replace_rep, replace_wrep, med_spec_add_spaces, rm_useless_spaces]
defaults.text_post_rules = [replace_all_caps, deal_caps]
mytokenizer = Tokenizer(MedTokenizer, lang='en')
np.random.seed(42)
path = datapath4file('/media/DataHD2/Notes_PHI_20190121/notes_dana_hp')
data_lm = (TextList.from_csv(path, 'notes_hp_80pct.csv', cols='note_text',
                processor=[TokenizeProcessor(tokenizer=mytokenizer), NumericalizeProcessor(max_vocab=60000)])
           .split_by_rand_pct()
           .label_for_lm()
           .databunch(bs=48, num_workers=4))

I hacked the spacy tokenizer because I wanted it to tokenize each of the embedded characters '><:-' as its own token.
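
For what it’s worth, here is roughly how that pre-rule behaves on a made-up note fragment, using the corrected regex above:

# quick check of the custom pre-rule on a made-up clinical-note fragment
print(med_spec_add_spaces('BP:120/80 <stable> T-98.6#1'))
# roughly: 'BP : 120 / 80  < stable >  T - 98.6 # 1' (rm_useless_spaces later collapses the extra spaces)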

I wonder if this step somehow corrupted my vocab, that is, whether one of the four characters above is magic or reserved in some way.
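
As a quick sanity check, something along these lines (a rough sketch, assuming the fastai v1 data block API; the variable names are mine) should show whether any numericalized token id exceeds the vocab size:

# compare the vocab size with the largest token id present in the numericalized training items
vocab_sz = len(data_lm.vocab.itos)
print('vocab size:', vocab_sz)             # expect roughly 60000 plus a few special tokens
max_id = max(int(ids.max()) for ids in data_lm.train_ds.x.items)
print('largest token id:', max_id)         # anything >= vocab_sz will crash the embedding lookup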

I am curious whether something like this might help. I am using the IMDB data set just to show how it might work. In this case, I deliberately generate an error by putting in an index that is greater than the vocab size; that raises an error and drops me into a debugger, so I can figure out where I am in the data.

You can inspect the state of the training loop at that point and see the input/output data (I copy it over to the CPU before trying to pass it through the model; if I hit a CUDA error, I can’t recover the input data afterwards). However, I am not sure how you are going to tie a particular batch back to your original data to see whether something further up the pipeline is causing the problem.

As Sylvain said, these can be hard errors to debug. Come back with any questions on the code.

from fastai.text import *
path = untar_data(URLs.IMDB_SAMPLE)
data_lm = TextLMDataBunch.from_csv(path, 'texts.csv')
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)

class find_bad_data(LearnerCallback):
    "Find the data that breaks my forward pass!"
    def __init__(self, learn:Learner,**kwargs):
        super().__init__(learn)
        
    def on_batch_begin(self,last_input,last_target,train,**kwargs):
        xb = last_input if is_listy(last_input) else [last_input]

        ## create an error to test the callback code
        if np.random.rand() > 0.9: xb[0][5,3] = 70000   #set a bad value 10% of time

        ## copy over to CPU, or we lose them in CUDA error
        cpu_xb = to_cpu(xb)
        cpu_yb = to_cpu(last_target)
        
        try:
            out = self.model(*xb)
        except:
            print("Model does not work with this batch!")
            set_trace()
            
        pass
        
    ## skip all the update steps, just iterate over the data
    def on_backward_begin(self, last_loss:Rank0Tensor, **kwargs):
        return {'last_loss': last_loss, 'skip_bwd': True}
    
    def on_backward_end(self,**kwargs):        
        return {'skip_step':True, 'skip_zero':True} 

learn.fit(2,callbacks=find_bad_data(learn))

Running in a Notebook, I get this output and then can inspect the data:

Model does not work with this batch!
> <ipython-input-5-7315ad7862ba>(22)on_batch_begin()
-> pass
(Pdb) cpu_xb
[tensor([[  11, 1444,  235,  ...,   10,    5,  826],
        [  11,   18,   12,  ...,    9,  267,   11],
        [  17,   85,  105,  ...,   56,   41,  313],
        ...,
        [  46,   14,  115,  ...,   11,  415,   45],
        [ 811,   85,   43,  ...,  100,   61,   61],
        [  20,   27,   11,  ...,  250,  484,   28]])]
(Pdb) cpu_yb
tensor([[1444,  235,    9,  ...,    5,  826,   11],
        [  18,   12,   59,  ...,  267,   11,   22],
        [  85,  105,  235,  ...,   41,  313,  602],
        ...,
        [  14,  115,   42,  ...,  415,   45,   14],
        [  85,   43, 1159,  ...,   61,   61,  197],
        [  27,   11,  106,  ...,  484,   28,   60]])
(Pdb) q

Wow, Bobak, that’s awesome! If I can get this to run, I will definitely try it on the big batch.

Does the 'try' clause only test the forward pass?

If the forward pass succeeds in the 'try' clause but the training loop still fails on that batch outside of it, it looks like I will still have the input data (xb) in main memory, so I’ll have a chance to track it down anyway, right?

I will give it a try as soon as I can, and let you know what happens. - Dana

If it fails outside of the try clause, you won’t have the info you need. But from your error before, it appears your CUDA error came up while you were in the forward pass.

FWIW, my suspicion is that you are trying to pull a word_id index that is outside the size of the vocab on that pass. I don’t know how, but that is my guess.

Absolutely correct, it was the forward pass! I ran your callback while training on the 1000-row dataset and it worked fine. I removed the random crash, and I’ve just reloaded the full-volume model and am training it with your callback. It should be done about 22 hours from now, or whenever the exception triggers. I’ll let you know what I find. Thank you, Dana


To debug an exception you can use the %debug magic, which will debug the last exception. The only issue is that if the initial exception is caught and re-raised, you will end up at the re-raise point, not the original cause. Though in your case you won’t end up at the actual cause anyway, so this is fine. I haven’t tried this in a notebook, just IPython, but it is reported to work fine in notebooks. After an exception, just add a new cell containing %debug and run it to get a debugger session for the last exception.
Or in IPython you can also run the %pdb magic before running your code and it will start the debugger on any unhandled exception (run %pdb again to turn this off), though I couldn’t find confirmation that this works in a notebook.
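
A minimal usage sketch of both magics in a notebook or IPython session:

learn.fit_one_cycle(10, slice(1e-2))   # this cell raises the RuntimeError
%debug                                 # run in the next cell: opens pdb on the last exception

%pdb on                                # or: enable automatic pdb before running the training cell
learn.fit_one_cycle(10, slice(1e-2))
%pdb off                               # turn it off again afterwards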

Tom.


Thank you, Tom!

Bobak, the find_bad_data callback worked perfectly. It triggered on sample 455238 / 572361, 79.54% through the 9 gigabyte dataset. I can display cpu_xb and cpu_yb. Stupid question: what next? How do I save them to a CSV or SOMETHING on the file system?


Skip the last question about how to save cpu_xb and cpu_yb. I just pickled them.

No stupid questions. We all learn by asking questions!

To output it, this should work: pd.DataFrame(to_np(cpu_xb[0])).to_csv('error_xb.csv')
That will write the CSV to the current directory. You can do the same for cpu_yb. I would check in another terminal that the file is there before you exit the debugger; I figured that out just by playing around in the debugger. If you can afford to keep the GPU and other resources tied up, I would stay in the debugger until you figure out what you think is wrong with the data, in case you want to re-export it or look at something else (if you can) and don’t want to wait for a re-run to get back to this point.
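
Put together, something like this at the (Pdb) prompt (the error_yb.csv name is mine):

# write both the inputs and targets of the failing batch to CSV from inside the debugger
pd.DataFrame(to_np(cpu_xb[0])).to_csv('error_xb.csv')
pd.DataFrame(to_np(cpu_yb)).to_csv('error_yb.csv')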

You can quickly see if the max index in your inputs is > len(vocab) by doing the below commands. If your min is <0 or your max is > embedding input size, that is your problem. If not, then something else more subtle is going on.

The really good news is that you can use the CSV to generate a new one-off batch that you can debug!
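
As a rough sketch of that idea (assuming a fresh session where learn has been rebuilt, and the error_xb.csv written above), you could reload the saved batch and push it through a CPU copy of the model, where a bad token id raises a readable IndexError instead of a device-side assert:

import pandas as pd, torch
xb = torch.tensor(pd.read_csv('error_xb.csv', index_col=0).values)   # (bs, seq_len) of token ids
cpu_model = learn.model.cpu().eval()   # move the model to the CPU for a debuggable forward pass
out = cpu_model(xb)                    # an out-of-range id fails here with a usable traceback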

(Pdb) cpu_xb[0].min()
tensor(0)
(Pdb) cpu_xb[0].max()
tensor(70000)
(Pdb) self.model
SequentialRNN(
  (0): AWD_LSTM(
    (encoder): Embedding(8947, 400, padding_idx=1)
    (encoder_dp): EmbeddingDropout(
      (emb): Embedding(8947, 400, padding_idx=1)
    )
    (rnns): ModuleList(
      (0): WeightDropout(
        (module): LSTM(400, 1150, batch_first=True)
      )
      (1): WeightDropout(
        (module): LSTM(1150, 1150, batch_first=True)
      )
      (2): WeightDropout(
        (module): LSTM(1150, 400, batch_first=True)
      )
    )
    (input_dp): RNNDropout()
    (hidden_dps): ModuleList(
      (0): RNNDropout()
      (1): RNNDropout()
      (2): RNNDropout()
    )
  )
  (1): LinearDecoder(
    (decoder): Linear(in_features=400, out_features=8947, bias=True)
    (output_dp): RNNDropout()
  )
)

Great! Here is what I’m seeing:
(Pdb) cpu_xb[0].max()
tensor(1048596)
(Pdb) cpu_xb[0].min()
tensor(0)

Yet my vocab should have been max 60000:
data_lm = (TextList.from_csv(path, 'notes_hp_80pct.csv', cols='note_text',
                processor=[TokenizeProcessor(tokenizer=mytokenizer), NumericalizeProcessor(max_vocab=60000)])
           .split_by_rand_pct()
           .label_for_lm()
           .databunch(bs=48, num_workers=4))

That’s a big difference! Could the max() of 1048596 actually be my sample index?

.max() is the biggest value in that data. So, somehow, you are putting this huge index into the top of the network, and it is bigger than the embedding.
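
A rough sketch of how to confirm that against the model, either at the (Pdb) prompt or after reloading the pickled tensors (variable names are mine):

emb_sz = self.model[0].encoder.num_embeddings    # embedding rows; roughly 60000 + special tokens expected
bad = (cpu_xb[0] >= emb_sz).nonzero()            # (row, position) of every out-of-range token id
print(bad.shape[0], 'out-of-range ids; first few:', cpu_xb[0][bad[:5, 0], bad[:5, 1]])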

There is something about your data_lm build that is not working right, and that is what is causing this problem. I don’t quite know what it is, but I am guessing your processor call is not doing what you think it is doing. I will have a look in a little bit and see if I can figure out what is going on when building this kind of databunch with a custom processor. I will post back with some ideas soon.
