Wow - that hint about “vocab” is intriguing! I assumed the vocab in the lines below would come from my data, in data_lm_hp_80pct_export.pkl, and not from AWD_LSTM.
data_lm = load_data(path, fname='data_lm_hp_80pct_export.pkl', bs=32)
learn = language_model_learner(data_lm, AWD_LSTM)
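For what it's worth, here is the sanity check I would run on the loaded databunch to see which vocab is actually in play (just a sketch; as far as I understand, the pretrained AWD_LSTM weights get remapped onto whatever vocab data_lm carries, so the itos below should be my tokens, not wikitext's):

# Quick look at the vocab that came back from load_data
print(len(data_lm.vocab.itos))    # should be <= 60000, the max_vocab I set when building the pkl
print(data_lm.vocab.itos[:20])    # the 'xx...' special tokens first, then my most frequent tokens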
Here is the code I used to build the pkl file:
from fastai.text import *
import re, spacy
from spacy.symbols import ORTH

__all__ = ['BaseTokenizer', 'SpacyTokenizer', 'Tokenizer', 'Vocab', 'fix_html', 'replace_all_caps', 'replace_rep', 'replace_wrep',
           'rm_useless_spaces', 'med_spec_add_spaces', 'spec_add_spaces', 'BOS', 'EOS', 'FLD', 'UNK', 'PAD', 'TK_MAJ', 'TK_UP', 'TK_REP', 'TK_WREP',
           'deal_caps']

class MedTokenizer(BaseTokenizer):
    "Modification of a spacy tokenizer to make it a `BaseTokenizer`."
    def __init__(self, lang:str):
        self.tok = spacy.blank(lang)
    def tokenizer(self, t:str) -> List[str]:
        return [t.text for t in self.tok.tokenizer(t)]
    def add_special_cases(self, toks:Collection[str]):
        for w in toks:
            self.tok.tokenizer.add_special_case(w, [{ORTH: w}])
def med_spec_add_spaces(t:str) -> str:
    "For clinical notes, add spaces around '<', '>', ':' and '-' in addition to '/' and '#' in `t`."
    # Hyphen is escaped so it is read as a literal character inside the class, not a range.
    return re.sub(r'([/#<>:\-\n])', r' \1 ', t)

defaults.text_pre_rules = [fix_html, replace_rep, replace_wrep, med_spec_add_spaces, rm_useless_spaces]
defaults.text_post_rules = [replace_all_caps, deal_caps]
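To show what that extra pre-rule does on a typical snippet (made-up example text, just to illustrate the spacing):

sample = 'BP:120/80 temp<99.1 pt is a 45-year-old'
print(med_spec_add_spaces(sample))
# -> 'BP : 120 / 80 temp < 99.1 pt is a 45 - year - old'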
mytokenizer = Tokenizer(MedTokenizer, lang='en')
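And a quick check that the combined tokenizer really splits on those characters (again a made-up snippet; process_all is the standard fastai v1 Tokenizer method that applies the rules and then the spacy tokenization):

toks = mytokenizer.process_all(['Temp>101 HR:88 s/p CABG'])[0]
print(toks)   # '>', ':' and '/' should each show up as their own tokens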
np.random.seed(42)
path = datapath4file('/media/DataHD2/Notes_PHI_20190121/notes_dana_hp')
data_lm = (TextList.from_csv(path, 'notes_hp_80pct.csv', cols='note_text',
                             processor=[TokenizeProcessor(tokenizer=mytokenizer), NumericalizeProcessor(max_vocab=60000)])
           .split_by_rand_pct()
           .label_for_lm()
           .databunch(bs=48, num_workers=4))
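The export itself is just the standard fastai v1 DataBunch.save, with the same fname I pass to load_data above:

data_lm.save('data_lm_hp_80pct_export.pkl')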
I hacked spacy because I wanted it to tokenize each of the embedded characters '><:-' as its own token.
I wonder if that step corrupted my vocab in some way, that is, whether one of those four characters is magic or reserved somewhere.
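If it helps to diagnose this, here is the check I plan to run against the saved vocab (a sketch; as far as I know the fastai special markers are all 'xx...' strings like xxbos and xxunk, so the raw characters themselves shouldn't be reserved):

# Did each of the four characters get its own entry in the vocab?
for c in ['<', '>', ':', '-']:
    print(c, data_lm.vocab.stoi.get(c, 'not in vocab'))
# The fastai special markers, for comparison -- none of them are raw punctuation
print([BOS, EOS, FLD, UNK, PAD, TK_MAJ, TK_UP, TK_REP, TK_WREP])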