Hi, when I review lesson 4, I find that the following code takes one or two minutes to run. Jeremy also said, "That takes a few minutes to tokenize and numericalize." I tried to search the source code, but I found nothing about tokenization and numericalization.
data_lm = (TextList.from_folder(path)
           .filter_by_folder(include=['train', 'test'])
           .split_by_rand_pct(0.1)
           .label_for_lm()
           .databunch(bs=bs))
Thanks, I just want to know where the code accomplishes tokenization and numericalization.
The Tokenization and Numericalization docs are here. Jeremy uses spaCy by default; SentencePiece was only added recently, after the new NLP course.
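In case it helps: the tokenization and numericalization happen inside the data block's processors, which run as part of the labelling/databunch step (that's the slow part Jeremy mentions). A minimal sketch, assuming fastai v1's defaults, that spells out what the one-liner does implicitly:

from fastai.text import *

# TextList attaches these two processors by default: spaCy tokenization,
# then numericalization capped at a 60,000-token vocab (min_freq=2).
processor = [TokenizeProcessor(tokenizer=Tokenizer(SpacyTokenizer, 'en')),
             NumericalizeProcessor(max_vocab=60000, min_freq=2)]

data_lm = (TextList.from_folder(path, processor=processor)
           .filter_by_folder(include=['train', 'test'])
           .split_by_rand_pct(0.1)
           .label_for_lm()      # tokenize + numericalize run around here
           .databunch(bs=bs))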
Thanks! Has the new NLP course come out yet?
Can I use FastText as a choice?
I defined my language model while completing the lesson3-imdb notebook. I then saved the DataBunch by doing:
data_lm = (TextList.from_folder(path)
           # Inputs: all the text files in path
           .filter_by_folder(include=['train', 'test', 'unsup'])
           # We may have other temp folders that contain text files, so we only keep what's in train and test
           .split_by_rand_pct(0.1)
           # We randomly split and keep 10% (10,000 reviews) for validation
           .label_for_lm()
           # We want to do a language model, so we label accordingly
           .databunch(bs=bs))
data_lm.save('data_lm.pkl')
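Once saved, the DataBunch (vocab included) can be reloaded without redoing the tokenization; a minimal sketch with fastai v1's load_data:

# Reload the saved DataBunch; tokenization/numericalization are not redone.
data_lm = load_data(path, 'data_lm.pkl', bs=bs)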
I then fine-tuned the model and ran learn.fit_one_cycle(10, 1e-3, moms=(0.8,0.7)) on the GPU (it took around 3 hours to complete the fine-tuning). Finally, I did learn.save('fine_tuned') and learn.save('fine_tuned_enc').
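For context, my fine-tuning setup followed lesson3-imdb, roughly like this (a minimal sketch; drop_mult and the schedule are the notebook's values, and note the notebook saves the encoder with learn.save_encoder rather than learn.save):

learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn.fit_one_cycle(1, 1e-2, moms=(0.8, 0.7))    # train the new head first
learn.unfreeze()
learn.fit_one_cycle(10, 1e-3, moms=(0.8, 0.7))   # then fine-tune the whole model
learn.save('fine_tuned')
learn.save_encoder('fine_tuned_enc')             # the classifier reloads this encoder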
I then started on the classifier part, and after some iterations I got a GPU error stating:
RuntimeError: CUDA out of memory. Tried to allocate 272.00 MiB (GPU 0; 15.90 GiB total capacity; 13.66 GiB already allocated; 139.88 MiB free; 1.45 GiB cached)
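(In case others hit the same error: the usual workaround is to restart the kernel or free the cache and rebuild the data with a smaller batch size. A minimal sketch, assuming fastai v1 and a previously saved data_clas.pkl; bs=16 is just a value to tune for your GPU:)

import torch
torch.cuda.empty_cache()    # release cached blocks left over from the failed run

bs = 16                     # smaller batch so the classifier's activations fit
data_clas = load_data(path, 'data_clas.pkl', bs=bs)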
After this, I mistakenly ran the above code block again, which saved a newer data_lm.pkl.
I created my classifier DataBunch by doing this:
data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)
             # grab all the text files in path
             .split_by_folder(valid='test')
             # split by train and valid folder (that only keeps 'train' and 'test', so no need to filter)
             .label_from_folder(classes=['neg', 'pos'])
             # label them all with their folders
             .databunch(bs=bs))
data_clas.save('data_clas.pkl')
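The classifier itself is then built on top of the saved encoder, roughly like this (a minimal sketch following lesson3-imdb; it assumes the encoder was saved with learn.save_encoder('fine_tuned_enc')):

learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('fine_tuned_enc')    # reuse the fine-tuned LM encoder
learn.fit_one_cycle(1, 2e-2, moms=(0.8, 0.7))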
While training, I am observing my model accuracy to be around 80%. Is it because data_lm.vocab changed when I executed the block a second time (in which case all my training time is wasted)? I am slightly confused about why this has happened. And is there any way to get back my previous data_lm.vocab, the one I used before the fine-tuning?
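To make the mismatch concrete: the vocab is built from the (randomly split) training tokens, so re-running split_by_rand_pct without a fixed seed can yield a different itos ordering, and the fine-tuned weights then point at the wrong embedding rows. A minimal sketch of how I'd compare two vocabs ('old_data_lm.pkl' is a hypothetical copy of the earlier file, which I no longer have since it was overwritten):

old = load_data(path, 'old_data_lm.pkl', bs=bs)   # hypothetical earlier copy
new = load_data(path, 'data_lm.pkl', bs=bs)
print(old.vocab.itos == new.vocab.itos)           # False means the token-to-id mapping changed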