Two questions on creating a backwards pre-trained AWD LSTM model

(WG) #1

Assuming one does not exist …

  1. Should I use the wiki-103 dataset download available here: https://course.fast.ai/datasets?

  2. Is there any special pre-processing or tokenization step I should apply to make sure I'm following an approach consistent with how the forward pre-trained model was created?

1 Like

#2

I’m training one today :wink:

3 Likes

(WG) #3

I knew this would happen if I asked :slight_smile:

Anyhow, I'd still be interested to know what pre-processing, if any, you are doing with the text before training (e.g., are you using the default tokenization and pre-processing strategies, or something different?).

I'm also curious what a good approach would be when using SentencePiece, since it seems we have to decide ahead of time which custom tokens (e.g., BOS, EOS, FLD, UNK, PAD, TK_MAJ, TK_UP, TK_REP, TK_WREP, etc.) to include in the .txt/.csv files SP will train on.
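For concreteness, I imagine reserving those tokens as user-defined symbols when training SP, something like this (just a rough sketch on my end: the input file name and vocab size are made up, and the token names mirror fastai's defaults):

import sentencepiece as spm

# Rough sketch: reserve fastai-style special tokens as user-defined symbols so
# SentencePiece never splits them; drop any that SP already handles natively
# (e.g. its built-in <unk>). File name and vocab size are placeholders.
spm.SentencePieceTrainer.Train(
    '--input=all_texts.txt --model_prefix=spm_wt103 --vocab_size=30000 '
    '--user_defined_symbols=xxbos,xxeos,xxfld,xxpad,xxmaj,xxup,xxrep,xxwrep'
)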

0 Likes

#4

Just loading with backwards=True in the call to databunch, otherwise the same as forward.
I'm using the default tokenizer and rules, otherwise the pretrained model wouldn't be compatible with fastai. The only preprocessing I do to WT103 is to split the big text file into articles so they can be shuffled at each epoch, then I retokenize with our defaults.
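In code, the only difference from the forward setup is that one flag. A minimal sketch with the v1 data block API (df, path and bs are placeholders; df is assumed to have a 'texts' column):

from fastai.text import *

# Same pipeline as for the forward model; the only change is backwards=True
# in the databunch call.
data = (TextList.from_df(df, path, cols='texts')
        .split_by_rand_pct(0.1)
        .label_for_lm()
        .databunch(bs=128, backwards=True))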

2 Likes

(WG) #5

Makes sense. Two questions:

  1. What are you doing with those <unk> tokens and the @-@ token?

  2. Are you including the title or just the article text?

0 Likes

#6

The <unk> tokens are processed to UNK and the @-@ tokens are left as is (there are also other weird @ @ things). Titles are left in, as are subtitles and everything else.
Note that the best prompt for text generation will use the same title format as what's in WT103.

0 Likes

(WG) #7

Do you mean that as a prompt it should look like '= = = The worst thing about The Last Jedi is = = ='?

0 Likes

#8

Not really. You'd present it with a real title (which uses only one =, btw) when you want to generate a fake Wikipedia article.
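So, reusing your example, a WT103-style prompt would just be the title line:

= The Last Jedi =

and the model is then left to continue with the body of a fake article (the = = … = = forms are section headings inside an article).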

1 Like

#9

Haven't forgotten; I'm just running into some bugs while pretraining the model with the latest fastai that I'm trying to solve.

1 Like

(WG) #10

Are you using the latest release (1.0.53) or the upcoming codebase?

0 Likes

#11

I'm using master (not that different from v1.0.53). We haven't gotten to the applications yet in v2.

0 Likes

#12

The backward model has been added and pushed to master. There will be a release today to make it easily accessible.

Note that this comes with a breaking change: to get all the speed-up we can while training in mixed precision, we had to change the default embedding size to 1152 (multiple of 8). The vocab size will also always be a multiple of 8 unless you pass a max_vocab that isn’t a multiple of 8.
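In practice that looks something like this (a rough sketch, not the exact pretraining script; it assumes the learner picks the backward weights when the databunch is backwards):

from fastai.text import *

# Sketch: keep max_vocab a multiple of 8 (60000 already is) and train in mixed
# precision; backwards=True builds the backward language model data.
data = TextLMDataBunch.from_df(path, train_df, valid_df, text_cols='texts',
                               max_vocab=60000, backwards=True)
learn = language_model_learner(data, AWD_LSTM, pretrained=True).to_fp16()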

Edit: I also added the script used to create the pretrained models in the examples folder: here it is.

3 Likes

(WG) #13

Probably a dumb question, but why a multiple of 8?

0 Likes

#14

Because NVIDIA GPUs need dimensions that are multiples of 8 to run fast in mixed precision.

1 Like

(Pavel) #15

Hello, Sylvain
I'm trying to train a base LM for Russian, and while using your wonderful example I've encountered some memory errors. The problem was with np.array(articles) in the read_file() function. My data takes around 2 GB of disk space, but turning it into a numpy array of strings exceeds my 16 GB (+ swap) of memory (and halts with a MemoryError) :frowning:
I’ve managed to overcome this with some minor changes in script:
I've removed np.array(articles) from read_file() (now it just does articles.append(current_article) and returns the plain list articles).
And in create_data(path) changed

all_texts = np.concatenate([valid, train, test])
df = pd.DataFrame({'texts': all_texts})

into

df = pd.concat([pd.DataFrame({'texts': valid}), pd.DataFrame({'texts': train}), pd.DataFrame({'texts': test})])

This change helped a lot: memory consumption dropped to something like 4-5 GB.
So my suggestion is to get rid of the numpy string array and turn the list of articles into a DataFrame straight away. That would make it possible to run this script on a home desktop.
Or maybe I'm overlooking something and my version will break in some cases?
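Putting it together, the changed part of create_data looks roughly like this (read_file is the helper from your example script, now returning a plain list of articles; the file names are the standard WT103 ones, and the rest of the function is unchanged):

import pandas as pd

# Each split goes straight into a DataFrame and the frames are concatenated,
# skipping the big intermediate numpy string array.
valid = read_file(path/'wiki.valid.tokens')
train = read_file(path/'wiki.train.tokens')
test  = read_file(path/'wiki.test.tokens')
df = pd.concat([pd.DataFrame({'texts': valid}),
                pd.DataFrame({'texts': train}),
                pd.DataFrame({'texts': test})], ignore_index=True)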

0 Likes

#16

We know tokenization is memory inefficient and that will be addressed in v2 (working on it as we speak).

0 Likes

(Pavel) #17

In fact, that is not what I was trying to say. As I understand it, tokenization happens in data = TextList.from_df(... (and that worked for me). I was referring to the earlier lines where the articles are separated and turned into a dataframe (and supposedly unnecessarily turned into a numpy string array in the process, which is what consumes all the memory).

0 Likes

#18

Yes, that is the reason it is memory inefficient: we load everything into a big array.

0 Likes