NLP with v1

balnazzar · November 14, 2018, 12:53pm

Hi mates. I want to do the following:

1. Load a large set of texts stored in a folder. The texts are .csv files whose columns provide the labels.
2. Split them into train and valid sets.
3. (More importantly) work with them using a custom pretrained language model.

While I can more or less cope with points 1 and 2, I’m a bit lost with point 3: I could not find any reference to it in the documentation.

Thanks!

sgugger · November 14, 2018, 2:25pm

You have to use pretrained_fnames in language_model_learner to load your custom model.

balnazzar · November 14, 2018, 2:48pm

Thanks!
I understand you are busy developing the library, so even quick links to the relevant sections of the docs would be helpful. I find them a bit difficult to navigate.

Let me explain what confuses me.

In the docs, one reads:

You can specify pretrained_model if you want to use the weights of a pretrained model. If you have your own set of weights and the corresponding dictionary, you can pass them in pretrained_fnames. This should be a list of the name of the weight file and the name of the corresponding dictionary. The dictionary is needed because the function will internally convert the embeddings of the pretrained models to match the dictionary of the data passed (a word may have a different id for the pretrained model). Those two files should be in the models directory of data.path.
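To make the "convert the embeddings to match the dictionary" step concrete, here is a minimal sketch in plain Python. It is not the library's code: the function name remap_embeddings, the toy vocabularies, and the choice to fill unknown tokens with the mean embedding are all assumptions made for illustration. The dictionary is assumed to be an index-to-token list.

```python
from typing import List

def remap_embeddings(old_rows: List[List[float]],
                     old_itos: List[str],
                     new_itos: List[str]) -> List[List[float]]:
    """Reorder embedding rows so that row i belongs to new_itos[i].

    Hypothetical helper: tokens absent from the pretrained vocabulary
    receive the mean of all pretrained rows (an assumed fallback).
    """
    # Token -> row index in the pretrained embedding matrix.
    old_stoi = {tok: i for i, tok in enumerate(old_itos)}
    dim = len(old_rows[0])
    # Column-wise mean of the pretrained rows, used for unseen tokens.
    mean_row = [sum(r[d] for r in old_rows) / len(old_rows) for d in range(dim)]

    new_rows = []
    for tok in new_itos:
        idx = old_stoi.get(tok)
        new_rows.append(list(old_rows[idx]) if idx is not None else list(mean_row))
    return new_rows

# Toy data: the same words have different ids in the two vocabularies.
old_itos = ["the", "cat", "sat"]
old_rows = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
new_itos = ["cat", "the", "dog"]  # "dog" is unknown to the pretrained model

new_rows = remap_embeddings(old_rows, old_itos, new_itos)
```

In this toy case "cat" and "the" keep their pretrained vectors under their new ids, while "dog" falls back to the mean row. This is why the weights alone are not enough: without the matching dictionary there is no way to tell which row belongs to which word.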

Now, let’s look at the example:

learn = language_model_learner(data, pretrained_model=URLs.WT103, drop_mult=0.5)

If I’m getting you (the docs) right, I will not pass pretrained_model (it’s None by default).
Rather, I’ll pass pretrained_fnames=['my_weights_file', 'my_dictionary_file']
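As a sanity check on where the two files would have to live, the sketch below builds the expected locations from the quoted sentence "Those two files should be in the models directory of data.path". The helper expected_pretrained_paths is hypothetical, and the .pth and .pkl extensions are an assumption about how the names are completed; check them against the installed version.

```python
from pathlib import Path
from typing import Sequence, Tuple

def expected_pretrained_paths(data_path: str,
                              pretrained_fnames: Sequence[str],
                              model_dir: str = "models") -> Tuple[Path, Path]:
    """Return the assumed locations of the weights and dictionary files.

    Hypothetical helper: the extensions below are an assumption,
    not something stated in the quoted documentation.
    """
    weights_name, dict_name = pretrained_fnames
    base = Path(data_path) / model_dir
    return base / f"{weights_name}.pth", base / f"{dict_name}.pkl"

# "data/my_corpus" stands in for data.path; the file names come from the post.
wgts_path, itos_path = expected_pretrained_paths(
    "data/my_corpus", ["my_weights_file", "my_dictionary_file"]
)
print(wgts_path)
print(itos_path)
```

Under these assumptions the names are passed without extensions, and both files sit side by side in the models folder.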

Allow me to ask the following:

1. Is what I wrote above correct?
2. As my_dictionary_file, I’ll pass one of the vector files (preferably in bin format) listed at the bottom of this page. Am I right in doing so?
3. What about the pretrained weights? Since I’m using a different language, should I use something pretrained that is already present in fastai, or should I train my model from scratch? And if so, how?

Again, thanks!

sgugger · November 14, 2018, 3:03pm

I think I misunderstood what you asked for. You can’t use pretrained embeddings in the AWD-LSTM, so if that is what you want, you will need to write your own model.
I was telling you how to use a model you have trained from scratch on a new language.

balnazzar · November 14, 2018, 3:56pm

Understood. Unfortunate as this is, I think your reply will be useful for others too: lately, I have talked with a lot of users who wanted to use pretrained embeddings with AWD-LSTM.

Thanks.

balnazzar · November 14, 2018, 6:44pm

Now that you mention it, it would be interesting to train a language model from scratch and then use it with any language, given an appropriate dictionary file.

Could you provide some hints about how to do that with v1?

Thanks!!