I was trying to build an Italian LM. I downloaded a corpus that came as one really big file, so as a test I extracted a pair of txt files into train and validation folders; I now have a train/aa.txt file (with multiple lines of text) and a valid/ab.txt.
I used this code:

data_lm = (TextList.from_folder(path)
           .filter_by_folder(include=['train', 'valid'])
           .random_split_by_pct(0.1)
           .label_for_lm()
           .databunch(bs=bs))
but it says that my validation set is empty.
The next step was:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=False)
then fit_one_cycle on it, but there is no valid_loss (I think this is related to the warning above).
Can anyone help me get these operations right?
Thank you!
I think you would need the split_by_folder function instead of the random_split_by_pct you’re using?
You don’t want to randomly split into a validation & training set, as you have already chosen those sets (by providing the two folders “train” and “valid”).
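Something like this, perhaps (an untested sketch, assuming path contains your “train” and “valid” folders):

data_lm = (TextList.from_folder(path)
           .split_by_folder(train='train', valid='valid')
           .label_for_lm()
           .databunch(bs=bs))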
I see. Then the real problem might be that (by using random_split_by_pct) you were trying to select 10% of the items as the validation set, but you had just 2 items (one text file in “valid”, one in “train”), so 10% of them rounded down to an empty validation set.
Ok, this makes sense.
So what is the correct practice if you have downloaded a corpus as a single big text file and want to train a language model from it? Do you have to manually split off a part and put it in the validation folder, or is there a way to let it auto-split a validation set?
… and get your data in shape by using the “from_csv” function (instead of the “from_folder” function, which seems like a superfluous step in your case), like so:
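e.g. (a sketch; I’m assuming the file is called texts.csv and sits in path):

data_lm = TextLMDataBunch.from_csv(path, 'texts.csv')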
(Edit: actually, by using the “from_csv” function you don’t even need the pandas dataframe step from before. Just rename/convert your single huge text file into a “.csv” file and read it into data_lm via the line above. That also answers the auto-split question: from_csv holds out a random valid_pct of the rows for validation by default.)
BTW: if you have just lines of text in your “text.txt” without any other columns, you may need to specify the text column (by default, fastai assumes the text column is 1) to prevent a “positional indexers are out-of-bounds” error, like so:
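e.g. (same sketch as above; with a single-column csv the text is column 0):

data_lm = TextLMDataBunch.from_csv(path, 'texts.csv', text_cols=0)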
I trained a pretty good (I think) LM from Wikipedia. I used Colab, so I couldn’t save my data_lm with data_lm.save('data_lm.pkl') because I had no memory for it, but I could train an LM learner and save it after every epoch with learn.save('epoch_n').
So every time I wanted to train a new epoch (I couldn’t afford more than 1 on Colab due to the 12-hour limitation) I could (see the sketch after this list):
- regenerate the DataBunch from the Wikipedia csv (I used a fixed random seed)
- learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=False)
- learn = learn.load('epoch_x')
- train again
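In code, my loop was roughly this (a sketch; wiki_it.csv, the learning rate and the checkpoint names are placeholders):

import random
import numpy as np
import torch
from fastai.text import *

# fixed seed so the random split (and therefore the vocab) comes out
# the same in every Colab session
random.seed(42); np.random.seed(42); torch.manual_seed(42)

# rebuild the databunch exactly as before
data_lm = TextLMDataBunch.from_csv(path, 'wiki_it.csv', text_cols=0)

learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=False)
learn = learn.load('epoch_x')   # resume from the last saved checkpoint
learn.fit_one_cycle(1, 1e-3)    # one more epoch within the 12-hour limit
learn.save('epoch_x_plus_1')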
This is, in my understanding, my first step, because there is no LM with pretrained=True for Italian, so I made my own.
Now my next step (according to lesson 3) would be a new LM on my real domain (let’s say Italian movie reviews, to use the same example as the lesson), and finally the last step would be a classifier.
Now I’m not sure I have understood this part well.
- I created a DataBunch with Italian reviews to train the new domain-specific LM
- again I created a new learner with pretrained=False, using that reviews databunch
- now if I try to load the weights with learn.load('my_last_epoch'), it fails because of a parameter size mismatch: [60003, 400] against [20533, 400]
I think this is due to the different vocabulary sizes, so where am I going wrong?
How can I use the Wikipedia LM I trained with the new reviews databunch (and with the new vocabulary)?
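(For reference, those two numbers are just the two vocab lengths, with 400 being the embedding size; a quick check, where data_reviews stands for the reviews databunch:)

len(data_lm.vocab.itos)       # 60003 for the Wikipedia databunch
len(data_reviews.vocab.itos)  # 20533 for the reviews one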
Thank you
Edit: it could be that I have to use the pretrained_fnames= parameter, but I have to figure out how.
Ok, after a little investigation in the source code, I think I have to use pretrained_fnames=('my_last_epoch', 'data_lm') (the weights file first, then the itos pickle),
so I desperately need that data_lm.pkl.
I tried with a small databunch that I could save on Colab, but it seems data_lm.save saves a file that pretrained_fnames (which calls load_pretrained) can’t load.
What helped was using pickle.dump on data_lm.vocab.itos.
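Something along these lines (a sketch; the exact path is an assumption, it just has to match what pretrained_fnames will look for in the models folder):

import pickle

# dump just the vocab's itos list, which is what load_pretrained expects
pickle.dump(data_lm.vocab.itos, open(path/'models'/'data_lm.pkl', 'wb'))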
Now it seems it can load that saved file together with the weights file (saved with learn.save).
But I guess there is an easier way to do it.
EDIT: the Vocab class has a save method that does exactly this, so I just need to call data_lm.vocab.save(path) with the destination file path,
and, by the way, I could do this on Colab too.
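Putting the whole hand-off together, roughly (a sketch based on the above; data_reviews and the learning rate are placeholders, and both files must sit in the learner’s models folder):

# Wikipedia side: save the checkpoint and the vocab
learn.save('my_last_epoch')                      # -> path/models/my_last_epoch.pth
data_lm.vocab.save(path/'models'/'data_lm.pkl')  # pickles vocab.itos

# reviews side: point pretrained_fnames at those two files;
# load_pretrained remaps the embedding rows from the old vocab to the new one
learn = language_model_learner(data_reviews, AWD_LSTM, drop_mult=0.3,
                               pretrained_fnames=('my_last_epoch', 'data_lm'))
learn.fit_one_cycle(1, 1e-3)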