Two questions on creating a backwards pre-trained AWD LSTM model

Assuming one does not exist …

  1. Should I use the wiki-103 dataset download available here: https://course.fast.ai/datasets?

  2. Is there anything special I should do during pre-processing and/or tokenization to ensure I’m following an approach consistent with how the forward pre-trained model was created?

I’m training one today :wink:

I knew this would happen if I asked :slight_smile:

Anyhow, I would still be interested to know what pre-processing, if any, you are doing with the text before training (e.g., are you using the default tokenization and pre-processing strategies, or are you doing something different?).

I’m also curious what a good approach would be in the case of using SentencePiece, since it seems to me we have to decide ahead of time what custom tokens (e.g., BOS, EOS, FLD, UNK, PAD, TK_MAJ, TK_UP, TK_REP, TK_WREP, etc.) to include in the .txt/.csv files SP will train on.
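For concreteness, here is roughly what I have in mind: reserving fastai’s special tokens as user-defined symbols when training SentencePiece (the token spellings, path and vocab size below are my assumptions, not anything official):

import sentencepiece as spm

# Reserve the fastai rule tokens as whole pieces so SentencePiece never splits them.
# The spellings (xxunk, xxpad, ...) are fastai v1 defaults; paths/sizes are placeholders.
special_tokens = "xxunk,xxpad,xxbos,xxeos,xxfld,xxmaj,xxup,xxrep,xxwrep"

spm.SentencePieceTrainer.Train(
    f"--input=corpus_rules_applied.txt "
    f"--model_prefix=sp_lm --vocab_size=30000 "
    f"--user_defined_symbols={special_tokens} "
    f"--character_coverage=0.99995"
)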

Just loading with backwards=True in the call to databunch, otherwise the same as forward.
I’m using the default tokenizer and rules, otherwise the pretrained model wouldn’t be compatible with fastai. The only preprocessing I do to WT103 is to split the big text file into articles so they can be shuffled at each epoch, then I retokenize with our defaults.
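For reference, a minimal sketch of what that looks like (the path, split and batch size are placeholders, and the exact ItemList constructor depends on how you stored the articles):

from fastai.text import *

# path points at the pre-split WT103 articles (placeholder)
data = (TextList.from_folder(path)
        .split_by_folder(valid='valid')
        .label_for_lm()
        .databunch(bs=128, backwards=True))  # backwards=True is the only change vs. the forward model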

Makes sense. Two questions:

  1. What are you doing with those <unk> tokens and the @-@ token?

  2. Are you including the title or just the article text?

The <unk> tokens are processed to UNK and the @-@ tokens are left as is (there are also other weird @ @ things). Titles are left in, as are subtitles and everything else.
Note that the best prompt for text generation will use the same title format as what’s in WT103.

Do you mean that as a prompt it should look like, '= = = The worst thing about The Last Jedi is = = ='?

Not really: present a real title (which uses only one =, by the way) when you want to generate a fake Wikipedia article.
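For example, something like this (a sketch; learn is the language model learner, and the sampling parameters are arbitrary):

# Use a real-looking WT103 title (single '=' markers) as the generation prompt.
prompt = "= The Last Jedi ="
print(learn.predict(prompt, n_words=200, temperature=0.75))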

Haven’t forgotten, I just hit some bugs while pretraining the model with the latest fastai that I’m trying to solve.

Are you using the latest release (1.0.53) or the upcoming codebase?

I’m using master (not that different from v1.0.53). We aren’t in the applications yet in v2.

Backward model has been added and pushed to master. There will be a release today to make it easily accessible.

Note that this comes with a breaking change: to get all the speed-up we can while training in mixed precision, we had to change the default embedding size to 1152 (a multiple of 8). The vocab size will also always be a multiple of 8, unless you pass a max_vocab that isn’t a multiple of 8.
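Concretely, spelling out the new default looks something like this (a sketch; the learner call and drop_mult are just illustrative, and awd_lstm_lm_config is fastai v1’s default AWD LSTM config dict):

from fastai.text import *

# Make the new default explicit: embedding size 1152 instead of the old 400.
config = awd_lstm_lm_config.copy()
config['emb_sz'] = 1152

learn = language_model_learner(data, AWD_LSTM, config=config, drop_mult=0.1, pretrained=False)
learn = learn.to_fp16()  # mixed precision is where the multiple-of-8 sizes pay off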

Edit: I also added the script used to create the pretrained models in the examples folder: here it is.

Probably a dumb question, but why a multiple of 8?

Because NVIDIA GPUs need dimensions that are multiples of 8 to be fast in mixed precision (that’s what the Tensor Cores require).

Hello Sylvain,
I’m trying to train a base LM for Russian, and while using your wonderful example I’ve encountered some memory errors. The problem was with np.array(articles) in the read_file() function. My data takes around 2 GB of disk space, but turning it into a numpy array of strings exceeds my 16 GB (+ swap) of memory and halts with a MemoryError :frowning:
I’ve managed to overcome this with some minor changes to the script:
I removed np.array(articles) from read_file(), so it now just appends current_article and returns the plain articles list.
And in create_data(path) I changed

all_texts = np.concatenate([valid, train, test])
df = pd.DataFrame({'texts': all_texts})

to

df = pd.concat([pd.DataFrame({'texts': valid}), pd.DataFrame({'texts': train}), pd.DataFrame({'texts': test})])

This change helped a lot: memory consumption dropped to something like 4–5 GB.
So my suggestion is to get rid of the numpy string array and turn the list of articles into a DataFrame straight away (rough sketch below). That would make it possible to run the script on a home desktop.
Or maybe I’m overlooking something and my version will break in some cases?
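Putting both changes together, roughly (a sketch on my side; the article-splitting heuristic and file names are assumptions, adapt to whatever the actual script does):

import pandas as pd

def read_file(filename):
    "Split the raw WikiText dump into articles and return a plain Python list."
    articles, current = [], []
    with open(filename, encoding='utf8') as f:
        for line in f:
            # Assumed heuristic: a top-level title line ' = Title = ' starts a new
            # article, while section headings start with ' = = '.
            if line.startswith(' = ') and not line.startswith(' = = ') and current:
                articles.append(''.join(current))
                current = []
            current.append(line)
    if current: articles.append(''.join(current))
    return articles  # no np.array(...) wrapper any more

def create_data(path):
    # path is a pathlib.Path; file names follow the standard WT103 split names.
    valid, train, test = [read_file(path/f'wiki.{split}.tokens')
                          for split in ('valid', 'train', 'test')]
    # Build the DataFrame directly from the lists instead of going through one
    # big numpy string array first.
    return pd.concat([pd.DataFrame({'texts': part}) for part in (valid, train, test)],
                     ignore_index=True)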

We know tokenization is memory inefficient and that will be addressed in v2 (working on it as we speak).

In fact that is not what I was trying to say. As I understand it, tokenization takes place in data = TextList.from_df(... (and that part worked for me). I was referring to the earlier lines where the articles are separated and turned into a dataframe (and, supposedly unnecessarily, turned into a numpy string array in the process, which consumes all the memory).

Yes, that is the reason it is memory inefficient: we load everything into one big array.

I’d like to check my understanding with regard to building a backward model.

I’ve tried building a databunch for both forward and backward text (as below). However, upon inspection, the content in the databunch doesn’t appear to be flipped for data_clas_bwd. If so, where does the flip actually happen? During training itself?

Once we’ve defined the databunch using the backwards=True option, is there any way to verify by looking at the databunch that it is indeed backwards? I’ve tried looking at the data and it seems identical to backwards=False.
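For what it’s worth, a quick check to try (a sketch, assuming fastai v1 and the data_clas_fwd / data_clas_bwd names above): if I read the v1 code correctly, the reversal is applied when batches are assembled (in the collate/preloader step), which is why the stored items look identical, so compare a decoded batch instead.

# Decode one batch from each databunch and compare the token order.
xb_fwd, _ = data_clas_fwd.one_batch()
xb_bwd, _ = data_clas_bwd.one_batch()
print(data_clas_fwd.vocab.textify(xb_fwd[0].tolist()))  # normal order
print(data_clas_bwd.vocab.textify(xb_bwd[0].tolist()))  # should come out reversed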