Lesson 4 - building the language model part

Hi, when I was reviewing lesson 4, I found that the following code takes one or two minutes to run, and Jeremy also said “That takes a few minutes to tokenize and numericalize.” I tried searching the source code but found nothing about Tokenization and Numericalization.
data_lm = (TextList.from_folder(path)
           .filter_by_folder(include=['train', 'test'])
           .split_by_rand_pct(0.1)
           .label_for_lm()
           .databunch(bs=bs))
Thanks, I just want to know where the code accomplishes Tokenization and Numericalization. :grinning:

The Tokenization and Numericalization docs are here. Jeremy uses spaCy by default, and SentencePiece was just added recently after the new NLP course.
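
If you want to see it explicitly, the data block pipeline applies two default processors behind the scenes, and they actually run when the lists are labelled and bunched, which is the slow part. Here is a minimal sketch with those defaults spelled out (assuming path and bs are defined as in the notebook; the settings shown are fastai v1's defaults):

from fastai.text import *

# The two processors TextList uses by default, written out explicitly
processors = [
    TokenizeProcessor(tokenizer=Tokenizer(tok_func=SpacyTokenizer, lang='en')),  # tokenization (spaCy)
    NumericalizeProcessor(max_vocab=60000, min_freq=2),                          # numericalization
]

data_lm = (TextList.from_folder(path, processor=processors)
           .filter_by_folder(include=['train', 'test'])
           .split_by_rand_pct(0.1)
           .label_for_lm()        # the processors run during this labelling step
           .databunch(bs=bs))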


Thanks! Has the new NLP course come out yet? :laughing:

Can I use fastText as a choice?

I defined my language model while completing the lesson3-imdb notebook. I then saved the data by doing

data_lm = (TextList.from_folder(path)
           #Inputs: all the text files in path
            .filter_by_folder(include=['train', 'test', 'unsup']) 
           #We may have other temp folders that contain text files so we only keep what's in train and test
            .split_by_rand_pct(0.1)
           #We randomly split and keep 10% (10,000 reviews) for validation
            .label_for_lm()           
           #We want to do a language model so we label accordingly
            .databunch(bs=bs))
data_lm.save('data_lm.pkl')

I then fine-tuned the model, running learn.fit_one_cycle(10, 1e-3, moms=(0.8,0.7)) on the GPU (it took around 3 hours to complete the fine-tuning). Finally I did learn.save('fine_tuned') and learn.save_encoder('fine_tuned_enc').
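
For reference, the fine-tuning step is essentially the standard recipe from the notebook; a sketch for context (the drop_mult and one-cycle schedules shown are the notebook's values, not necessarily the exact ones used here):

learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)

learn.fit_one_cycle(1, 1e-2, moms=(0.8, 0.7))    # train the new head first
learn.unfreeze()
learn.fit_one_cycle(10, 1e-3, moms=(0.8, 0.7))   # then fine-tune the whole model

learn.save('fine_tuned')                         # full model weights
learn.save_encoder('fine_tuned_enc')             # the encoder is what the classifier reuses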

I then started on the classifier part, and after some iterations I got a GPU error stating:

RuntimeError: CUDA out of memory. Tried to allocate 272.00 MiB (GPU 0; 15.90 GiB total capacity; 13.66 GiB already allocated; 139.88 MiB free; 1.45 GiB cached).

After this I mistakenly ran the above code block again, which saved a newer data_lm.pkl.
I then created my data classifier by doing this:

data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)
             #grab all the text files in path
             .split_by_folder(valid='test')
             #split by train and valid folder (that only keeps 'train' and 'test' so no need to filter)
             .label_from_folder(classes=['neg', 'pos'])
             #label them all with their folders
             .databunch(bs=bs))

data_clas.save('data_clas.pkl')

While training I observed my model accuracy to be around 80%. Is this because data_lm.vocab changed when I executed the block a second time? (In that case all my training time is wasted.) I am slightly confused about why this has happened. And is there any way to get back the previous data_lm.vocab that I used before the fine-tuning?
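
The usual way to avoid this mismatch going forward is to build and save the databunch once, then always reload the saved pickle, so the classifier reuses exactly the same vocab instead of rebuilding it from a new random split. A sketch, assuming fastai v1's load_data and the data_lm.pkl saved above:

data_lm = load_data(path, 'data_lm.pkl', bs=bs)               # reload instead of re-running the block

data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)  # reuse the exact same vocab
             .split_by_folder(valid='test')
             .label_from_folder(classes=['neg', 'pos'])
             .databunch(bs=bs))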