Training a language model (not English)

I am trying to build an Italian LM. I downloaded a corpus that came as one really big file, so as a test I extracted a pair of .txt files into train and validation folders: I have a train/aa.txt file (with multiple lines of text) and a valid/ab.txt.
I used this code:
data_lm = (TextList.from_folder(path)
           .filter_by_folder(include=['train', 'valid'])
           .random_split_by_pct(0.1)
           .label_for_lm()
           .databunch(bs=bs))

but it warns that my validation set is empty.
The next step was
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=False)
and then fit_one_cycle on it, but there is no valid_loss (I think this is related to the warning above).
Can anyone help me get these steps right?
Thank you!

I think you need the split_by_folder function instead of the random_split_by_pct you’re using.
You don’t want to randomly split into a validation & training set, as you have already chosen those sets (by providing the two folders “train” and “valid”).
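
For example, a minimal sketch keeping your variable names (path and bs as in your snippet):

data_lm = (TextList.from_folder(path)
           .filter_by_folder(include=['train', 'valid'])
           .split_by_folder(train='train', valid='valid')
           .label_for_lm()
           .databunch(bs=bs))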

OK, that worked, thank you.

Then maybe I didn’t understand the code I pasted (it comes from here: https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson3-imdb.ipynb, section “Language Model”).

I thought it was somehow merging all the text from both folders, then splitting it and keeping 10% for validation.

I see. Then the real problem might be that (by using random_split_by_pct) you were trying to select 10% of the items as the validation set, but you had just 2 items (one text file in “valid”, one text file in “train”), and 10% of 2 items rounds down to 0, so it returned an empty validation set.
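
The rounding is easy to see in a quick sketch (assuming fastai truncates the fraction with int, which matches the warning you got):

n_items = 2                   # one file in train/, one in valid/
n_valid = int(0.1 * n_items)  # int(0.2) -> 0 items in the validation set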

OK, this makes sense.
So what is the correct practice if you have downloaded a single big text file as a corpus and want to train a language model from it? Do you have to manually strip off a part and put it in the validation folder, or is there a way to let it auto-split a validation set?

Hmm, maybe I just need something like:

data_lm = TextLMDataBunch.from_folder(path)

Try just renaming your original huge text file “text.txt” to “text.csv”, then read it into a pandas DataFrame like:

df = pd.read_csv(path/'text.csv')
df.head()

(straight from the “preparing the data” section of your lesson 3 notebook).
Then try following the steps in this fastai tutorial starting at:
https://docs.fast.ai/text.html#Getting-your-data-ready-for-modeling

… and get your data in shape by using the from_csv function (instead of the from_folder function, which seems like a superfluous step in your case), like so:

data_lm = TextLMDataBunch.from_csv(path, 'text.csv')

(edit: actually, by using the from_csv function, you don’t even need the pandas DataFrame step before. Just rename/convert your single huge text file into a .csv file and read it into data_lm via the line above.)
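
If a plain rename is not enough (raw text can contain commas and quotes that confuse the CSV parser), a minimal conversion sketch could look like this; the file names and the one-passage-per-line assumption are mine:

import pandas as pd

# wrap each non-empty line of the raw corpus into a one-column CSV
with open(path/'text.txt', encoding='utf-8') as f:
    lines = [line.strip() for line in f if line.strip()]
pd.DataFrame({'text': lines}).to_csv(path/'text.csv', index=False)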

BTW: if you have just lines of text in your “text.txt” without any other columns, you may need to specify the text column (by default, fastai assumes the text column is column 1) to prevent a “positional indexers are out-of-bounds” error, like so:

data_lm = TextLMDataBunch.from_csv(path, 'text.csv', text_cols=0)

Thank you very much!

I also noticed that there is a from_tokens method. Is that supposed to load text that has already been tokenized?

I think so, but I have never used it. Generally, the fastai docs are a good source of information:

https://docs.fast.ai/text.data.html#TextDataBunch.from_tokens

Good luck and courage!

Finally, I used this script to make a CSV using the Italian Wikipedia as the corpus; then I was able to use it with TextLMDataBunch.from_csv.
Thank you again for the text_cols=0 tip!

I trained a pretty good (I think) LM from Wikipedia. I used Colab, so I couldn’t save my data_lm with data_lm.save('data_lm.pkl') because I had no memory for this, but I could train an LM learner and save it after every epoch with learn.save('epoch_n').
So every time I wanted to train a new epoch (I couldn’t afford more than 1 on Colab due to the 12-hour limitation) I could:
-generate the DataBunch again from the Wikipedia CSV (I used a fixed random seed)
-learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=False)
-learn = learn.load('epoch_x')
-train again (roughly as in the sketch below)
This is, in my understanding, my first step, because there is no LM with pretrained=True in Italian, so I made my personal one.
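
In code, one resume round looks roughly like this (the CSV name, checkpoint names, seed, and learning rate are just placeholders):

import numpy as np

# fix the seed before the split so the train/valid split is reproducible
np.random.seed(42)
data_lm = TextLMDataBunch.from_csv(path, 'wiki_it.csv', text_cols=0)

# recreate the learner, reload the last checkpoint, and train one more epoch
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=False)
learn = learn.load('epoch_3')   # placeholder name of the last saved epoch
learn.fit_one_cycle(1, 1e-3)    # placeholder learning rate
learn.save('epoch_4')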

Now my next step (according to lesson 3) would be a new LM for my real domain (let’s say Italian movie reviews, to use the same example as the lesson), and the final step would be a classifier.

Now I’m not sure I have understood this part well.
-I created a DataBunch with Italian reviews to train the new domain-specific LM.
-Again, I created a new learner with pretrained=False using that reviews DataBunch.
-Now, if I try to load the weights with learn.load('my_last_epoch'), it says it can’t because of a parameter size mismatch: [60003, 400] against [20533, 400].
I think this is due to the different vocabulary sizes, so where am I wrong?

How can I use the Wikipedia LM I trained with the new reviews DataBunch (and with the new vocabulary)?
Thank you

Edit: it could be that I have to use the pretrained_fnames= parameter, but I still have to figure out how.

OK, after a little investigation in the source code, I think I have to use pretrained_fnames=('data_lm', 'my_last_epoch'),
so I desperately need that data_lm.pkl.

Is my guess correct?

I tried with a little DataBunch I could save on Colab, but it seems like data_lm.save saves a file that pretrained_fnames (which calls load_pretrained) can’t load.

What helped was using pickle.dump on data_lm.vocab.itos;
now it seems it can load that saved file along with the weights file (saved with learn.save).

But I guess there is an easier way to do it.

EDIT: the Vocab class has a save method that does exactly this, so I just need to call data_lm.vocab.save().
And by the way, I could do this on Colab as well.
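
Putting it together, roughly (the file names here are placeholders; if I read the source right, pretrained_fnames takes the weights file name first and the itos file name second, both without extensions, looked up in the models directory):

# save the Wikipedia LM vocab next to the saved weights
data_lm.vocab.save(path/'models'/'itos_wiki.pkl')

# later: build the reviews learner on top of the Wikipedia weights
# (data_reviews is my Italian-reviews DataBunch)
learn = language_model_learner(data_reviews, AWD_LSTM, drop_mult=0.3,
                               pretrained=False,
                               pretrained_fnames=('epoch_4', 'itos_wiki'))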
