Is there any special pre-processing or tokenization step I should apply to make sure I’m consistent with how the forward pretrained model was created?
In any case, I’d still be interested to know what pre-processing, if any, you apply to the text before training (e.g., are you using the default tokenization and pre-processing strategies, or something different)?
I’m also curious what a good approach would be when using SentencePiece, since it seems we have to decide ahead of time which custom tokens (e.g., BOS, EOS, FLD, UNK, PAD, TK_MAJ, TK_UP, TK_REP, TK_WREP, etc.) to include in the .txt/.csv files SentencePiece will train on.
Just loading with backwards=True in the call to databunch, otherwise the same as forward.
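For intuition, here’s a minimal sketch (illustrative names only, not fastai’s actual implementation) of how a `backwards` flag can be applied when batches are cut from the token stream, so the stored data looks identical in both directions but the model sees reversed text:

```python
# Toy sketch of forward vs. backward LM batching (not fastai code).

def make_lm_batch(token_stream, seq_len, backwards=False):
    """Cut a flat token stream into an (input, target) pair for LM training.
    With backwards=True the stream is read right-to-left, so the stored
    data is unchanged but the model predicts the *previous* token."""
    stream = token_stream[::-1] if backwards else token_stream
    x = stream[:seq_len]
    y = stream[1:seq_len + 1]   # targets are the inputs shifted by one
    return x, y

tokens = list(range(10))                           # stand-in for numericalized text
fx, fy = make_lm_batch(tokens, 4)                  # forward:  x=[0,1,2,3]
bx, by = make_lm_batch(tokens, 4, backwards=True)  # backward: x=[9,8,7,6]
```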
I’m using the default tokenizer and rules; otherwise the pretrained model wouldn’t be compatible with fastai. The only preprocessing I do to WT103 is to split the big text file into articles so they can be shuffled at each epoch, then I retokenize it with our defaults.
Unknown tokens are processed to UNK and the @-@ are left as is (there are also other weird @ @ artifacts). Titles are left in, as are subtitles and everything else.
Note that the best prompts for text generation will use the same title format as in WT103.
Backward model has been added and pushed to master. There will be a release today to make it easily accessible.
Note that this comes with a breaking change: to get all the speed-up we can while training in mixed precision, we had to change the default embedding size to 1152 (multiple of 8). The vocab size will also always be a multiple of 8 unless you pass a max_vocab that isn’t a multiple of 8.
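The multiple-of-8 constraint exists because mixed-precision (fp16) matrix kernels on tensor-core GPUs run fastest when tensor dimensions are divisible by 8. A sketch of the rounding involved (illustrative only; fastai’s exact code may differ):

```python
# Round a dimension up to the next multiple of `base` (8 for fp16 kernels).
def round_up_to_multiple(n, base=8):
    return ((n + base - 1) // base) * base

round_up_to_multiple(1150)   # old default embedding size -> 1152
round_up_to_multiple(60000)  # already a multiple of 8, unchanged
round_up_to_multiple(60001)  # rounded up to 60008
```

Passing a `max_vocab` that isn’t a multiple of 8 opts out of this rounding for the vocab, at the cost of the fp16 speed-up.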
Edit: I also added the script used to create the pretrained models in the examples folder: here it is.
Hello, Sylvain
I’m trying to train a base LM for Russian, and while using your wonderful example I’ve encountered some memory errors. The problem was with np.array(articles) in the read_file() function. My data takes around 2 GB of disk space, but turning it into a NumPy array of strings exceeds my 16 GB of memory (plus swap) and halts with a MemoryError.
I’ve managed to work around this with some minor changes to the script:
I’ve removed np.array(articles) from read_file(), so it now appends current_article to articles and returns the plain list.
And in create_data(path) I changed the corresponding lines to work with the plain list.
This change helped a lot: memory consumption dropped to around 4–5 GB.
So my suggestion is to get rid of the NumPy string array and turn the list of articles into a DataFrame straight away. That would allow the script to run on a home desktop.
Or maybe I’m overlooking something and my version will break in some cases?
In fact, that is not what I was trying to say. As I understand it, tokenization takes place in data = TextList.from_df(... (and it worked for me). I was referring to the earlier lines where the articles are separated and turned into a DataFrame (and, supposedly unnecessarily, turned into a NumPy string array in the process, which consumes all the memory).
I’d like to check my understanding with regards to building a backward model.
I’ve tried building a databunch for both forward and backward text (as below). However, upon inspection, the content in the databunch doesn’t appear to be flipped for data_clas_bwd. If that’s expected, where does the flipping actually happen? During training itself?
Once we’ve defined the databunch using the backwards=True option, is there any way to verify, by looking at the databunch, that it is indeed backwards? I’ve looked at the data and it seems identical to backwards=False.
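This matches the behavior where the reversal is applied when batches are assembled, not when the data is stored, so inspecting the stored items shows no difference. A toy sketch of that design (illustrative classes, not fastai’s actual API) showing where you would and wouldn’t see the flip:

```python
# Toy model of a loader that stores data forward but reverses at batch time.
class ToyLoader:
    def __init__(self, items, backwards=False):
        self.items = items            # stored identically either way
        self.backwards = backwards
    def one_batch(self):
        batch = list(self.items)
        return batch[::-1] if self.backwards else batch

fwd = ToyLoader([1, 2, 3])
bwd = ToyLoader([1, 2, 3], backwards=True)
fwd.items == bwd.items        # dataset inspection: looks identical
bwd.one_batch()               # batch inspection: reversed
```

The practical upshot: to verify a backward databunch, pull an actual batch and compare it to a forward batch, rather than comparing the underlying datasets.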