Wow - that hint about “vocab” is intriguing! I assumed the vocab in the lines below would come from my data, in data_lm_hp_80pct_export.pkl, and not from AWD_LSTM.
data_lm = load_data(path, fname='data_lm_hp_80pct_export.pkl', bs=32)
learn = language_model_learner(data_lm, AWD_LSTM)
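For what it's worth, here is the sanity check I would run on the loaded databunch to see which vocab is actually in play (just a sketch; as far as I understand, the pretrained AWD_LSTM weights get remapped onto whatever vocab data_lm carries, so the itos below should be my tokens, not wikitext's):

# Quick look at the vocab that came back from load_data
print(len(data_lm.vocab.itos))    # should be <= 60000, the max_vocab I set when building the pkl
print(data_lm.vocab.itos[:20])    # the 'xx...' special tokens first, then my most frequent tokens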
Here is the code I used to build the pkl file:
from fastai.text import *
import re, spacy
from spacy.symbols import ORTH

__all__ = ['BaseTokenizer', 'SpacyTokenizer', 'Tokenizer', 'Vocab', 'fix_html', 'replace_all_caps', 'replace_rep', 'replace_wrep',
           'rm_useless_spaces', 'med_spec_add_spaces', 'spec_add_spaces', 'BOS', 'EOS', 'FLD', 'UNK', 'PAD', 'TK_MAJ', 'TK_UP', 'TK_REP', 'TK_WREP',
           'deal_caps']

class MedTokenizer(BaseTokenizer):
    "Modification of a spacy tokenizer to make it a `BaseTokenizer`."
    def __init__(self, lang:str):
        self.tok = spacy.blank(lang)
    def tokenizer(self, t:str) -> List[str]:
        return [t.text for t in self.tok.tokenizer(t)]
    def add_special_cases(self, toks:Collection[str]):
        for w in toks:
            self.tok.tokenizer.add_special_case(w, [{ORTH: w}])
def med_spec_add_spaces(t:str) -> str:
    "For clinical notes, add spaces around '<', '>', ':' and '-' in addition to '/' and '#' in `t`."
    # Hyphen is escaped so it is read as a literal character inside the class, not a range.
    return re.sub(r'([/#<>:\-\n])', r' \1 ', t)

defaults.text_pre_rules = [fix_html, replace_rep, replace_wrep, med_spec_add_spaces, rm_useless_spaces]
defaults.text_post_rules = [replace_all_caps, deal_caps]
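To show what that extra pre-rule does on a typical snippet (made-up example text, just to illustrate the spacing):

sample = 'BP:120/80 temp<99.1 pt is a 45-year-old'
print(med_spec_add_spaces(sample))
# -> 'BP : 120 / 80 temp < 99.1 pt is a 45 - year - old'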
mytokenizer = Tokenizer(MedTokenizer, lang='en')
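And a quick check that the combined tokenizer really splits on those characters (again a made-up snippet; process_all is the standard fastai v1 Tokenizer method that applies the rules and then the spacy tokenization):

toks = mytokenizer.process_all(['Temp>101 HR:88 s/p CABG'])[0]
print(toks)   # '>', ':' and '/' should each show up as their own tokens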
np.random.seed(42)
path = datapath4file('/media/DataHD2/Notes_PHI_20190121/notes_dana_hp')
data_lm = (TextList.from_csv(path, 'notes_hp_80pct.csv', cols='note_text',
                             processor=[TokenizeProcessor(tokenizer=mytokenizer), NumericalizeProcessor(max_vocab=60000)])
           .split_by_rand_pct()
           .label_for_lm()
           .databunch(bs=48, num_workers=4))
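The export itself is just the standard fastai v1 DataBunch.save, with the same fname I pass to load_data above:

data_lm.save('data_lm_hp_80pct_export.pkl')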
I hacked spacy because I wanted it to tokenize each of the embedded characters '><:-' as its own token.
I wonder if that step corrupted my vocab in some way, that is, whether one of those four characters is magic or reserved somewhere.
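If it helps to diagnose this, here is the check I plan to run against the saved vocab (a sketch; as far as I know the fastai special markers are all 'xx...' strings like xxbos and xxunk, so the raw characters themselves shouldn't be reserved):

# Did each of the four characters get its own entry in the vocab?
for c in ['<', '>', ':', '-']:
    print(c, data_lm.vocab.stoi.get(c, 'not in vocab'))
# The fastai special markers, for comparison -- none of them are raw punctuation
print([BOS, EOS, FLD, UNK, PAD, TK_MAJ, TK_UP, TK_REP, TK_WREP])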