Beginning of NLP

I’m working through the IMDB doc and will eventually be adapting it to a custom dataset in another project. I have a question about the pretrained models. The pretrained “model” from fastai 0.7 includes these files:

bwd_wt103_enc.h5  bwd_wt103.h5  fwd_wt103_enc.h5  fwd_wt103.h5  itos_wt103.pkl

The fastai 1.0 model has these files:

itos_wt103.pkl  lstm_wt103.pth

  1. There are separate backward and forward models in the 0.7 version of the library but only one lstm_wt103.pth in 1.0. Is this forward only (or backward only)?
  2. Do the vocabularies of the two match? In other words, are both itos_wt103.pkl files the same?

Finally, I have an off-topic question. It is my understanding that this section of the forums is for discussing the development of the new fastai library. Can I post questions about usage of the library (for example, how to do something, or whether something is possible) in this section as well? If not, where is the best place to ask them? Sorry for the off-topic question; I didn’t want to create a new thread just for it and add noise.

Thank you.


The models from fastai v1 aren’t backward compatible, which is why you have new ones. I haven’t trained a backward model yet but I’ll launch that sometime next week. The vocabularies are different, as we cut the vocab size to 60k for this new model (rare words will be fine-tuned on the custom dataset).

For questions, I think the best is to create a new topic with a clear title, so that it benefits all the others with the same problem.

The TextDataset has a txt_cols variable initialized to None in its constructor. Does this variable represent the number of fields in our data source? If so, do we need to specify it when constructing the dataset?

For example, I have a dataframe (converted to a csv) with two columns named name and item_description (along with a dummy labels column), and I would like both of them included in the dataset for the language model. So when I create my dataset, would I specify txt_cols=2 since I have two fields, or do that variable and the number of fields have no bearing on each other?

You can either initialize from a csv/dataframe with n_labels (which gives the number of label columns at the beginning; one column of text is assumed) or by specifying the names of the text columns and label columns in txt_cols and label_cols. Both should work properly at this stage, but the latter is new and not tested a lot, so report if you find any bug.
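For example, using the factory function that appears later in this thread (argument names as of this dev snapshot, so treat this as a sketch rather than a stable API):

data_lm = text_data_from_df(PATH, train_df=train_df, valid_df=valid_df, data_func=lm_data,
                            n_labels=1)  # option 1: label column(s) first, no header

data_lm = text_data_from_df(PATH, train_df=train_df, valid_df=valid_df, data_func=lm_data,
                            txt_cols=['name', 'item_description'],
                            label_cols=['labels'])  # option 2: columns referenced by name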

I’m confused. Doesn’t n_labels refer to the number of labeled classes (in the case of a classification problem)? So for a LM this will always be 1, where all the label values are 0. Am I not understanding this correctly?

n_labels could be greater than 1 if you have multiple labels (you’d need to build your custom model, of course). It’s not the number of classes.

So if I had a dataframe such as this (after processing/extracting the texts from the main data source for building a LM):

   labels    name  item_description
0       0  name_1            desc_1
1       0  name_2            desc_2
2       0  name_3            desc_3

What would n_labels refer to here? The way I’m thinking, n_labels=1 since there’s only one label (with values of 0), and the number of fields is 2 (name and item_description).

If n_labels is not the number of classes, what would be an example where n_labels > 1? Sorry for so many questions!

You can either pass your csv in one of two ways (examples below):

  • n_labels=1 and no header, or
  • txt_cols = ['name', 'item_description'] and label_cols=['labels'] with the header.
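Concretely, the two layouts would look something like this (made-up rows; option 1 follows the one-text-column assumption from the earlier post):

Option 1, no header, label first:

0,"name_1 desc_1"
0,"name_2 desc_2"

Option 2, with header, columns referenced by name:

labels,name,item_description
0,name_1,desc_1
0,name_2,desc_2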

One case where n_labels > 1 is multi-label classification in NLP (where each text can have several different labels). I think the toxic comment competition is a good example.
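For instance, a made-up sample in the spirit of that competition, with three label columns (toxic, obscene, insult) followed by the text, loaded with n_labels=3 and no header:

1,0,1,"what a rude comment"
0,0,0,"thanks for the write-up"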

Passing chunksize as a parameter to text_data_from_df results in the following error (not that we would ever need to do that):

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-056e51589e97> in <module>
----> 1 data_lm = text_data_from_df(PATH, train_df=train, valid_df=test, data_func=lm_data, max_vocab=60_000, chunksize=24_000, min_freq=2, txt_cols=['name', 'item_description'], label_cols=['label'])

~/fastai/fastai/text/data.py in text_data_from_df(path, train_df, valid_df, test_df, tokenizer, data_func, vocab, **kwargs)
    324     path=Path(path)
    325     txt_kwargs, kwargs = extract_kwargs(['max_vocab', 'chunksize', 'min_freq', 'n_labels', 'txt_cols', 'label_cols'], kwargs)
--> 326     train_ds = TextDataset.from_df(path, train_df, tokenizer, 'train', vocab=vocab, **txt_kwargs)
    327     datasets = [train_ds, TextDataset.from_df(path, valid_df, tokenizer, 'valid', vocab=train_ds.vocab, **txt_kwargs)]
    328     if test_df: datasets.append(TextDataset.from_df(path, test_df, tokenizer, 'test', vocab=train_ds.vocab, **txt_kwargs))

~/fastai/fastai/text/data.py in from_df(cls, folder, df, tokenizer, name, **kwargs)
    142         tokenizer = ifnone(tokenizer, Tokenizer())
    143         chunksize = 1 if (type(df) == DataFrame) else df.chunksize
--> 144         return cls(folder, tokenizer, df=df, create_mtd=TextMtd.DF, name=name, chunksize=chunksize, **kwargs)
    145 
    146     @classmethod

TypeError: type object got multiple values for keyword argument 'chunksize'

A little research indicated that the error “can happen if you pass a keyword argument for which one of the keys is similar (has the same string name) to a positional argument,” as given in the 2nd answer in this stackoverflow question. The suggested solution is to “remove the keyword argument from the kwargs before passing it to the method.” I’m not sure how to do that. I also found this. Just wanted to bring it to your attention.
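For reference, a minimal illustration of the failure mode in plain Python (not fastai code):

def f(a, chunksize=1, **kwargs):
    print(a, chunksize, kwargs)

kwargs = {'chunksize': 10, 'min_freq': 2}
f(1, chunksize=5, **kwargs)  # TypeError: f() got multiple values for keyword argument 'chunksize'

# the usual fix: pop the duplicate key out of kwargs before the call
chunksize = kwargs.pop('chunksize', 1)
f(1, chunksize=chunksize, **kwargs)  # prints: 1 10 {'min_freq': 2}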

Thanks.

Are you passing in chunksize as an argument as well as passing in a DataFrame with chunksize specified?

If so, don’t pass in chunksize separately, as the DataFrame is already “chunked”. The only time passing in chunksize as an argument is needed is when you are creating your dataset from a .csv on the filesystem.

In that case, wouldn’t it be better if chunksize were not extracted into kwargs? Because in the from_df method we have this line:

chunksize = 1 if (type(df) == DataFrame) else df.chunksize

For a DataFrame, wouldn’t the chunksize always be 1, since the whole thing is already loaded into memory?

I removed this argument because it makes no sense to pass a chunksize in the from_df method. It won’t always be 1; it depends on how you loaded that dataframe.
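For instance (file name just for illustration):

import pandas as pd

df = pd.read_csv('train.csv')                        # a plain DataFrame -> treated as chunksize 1
reader = pd.read_csv('train.csv', chunksize=24_000)  # a TextFileReader, which carries a .chunksize attribute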

Yeah, I suppose so.

I’ll refactor this today and push something to the repo. (Never mind, I see @sgugger already did it.)

-wg

Currently, pretrained LM weights are assumed to be located in DATA_PATH/models. The problem with this is that I need to make copies (or softlink the weights and the itos vocab) for each project I work on, in that project’s directory. How about specifying a separate pre_trained_path just for loading the pretrained weights? Would that be something that could be considered? I can submit a PR that does that.

For now, keep the symlinks. We’ll be adding more pretrained models, and as we do, we’ll come up with a solution to have them centralized somewhere (and automatically downloaded if needed, like pytorch does). This should come in v1.1.
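Something like this works per project in the meantime (the paths below are just an example):

from pathlib import Path

pretrained = Path('~/data/models').expanduser()  # wherever the downloaded weights live
models_dir = Path('data/my_project/models')      # the project's DATA_PATH/models
models_dir.mkdir(parents=True, exist_ok=True)
for f in ('lstm_wt103.pth', 'itos_wt103.pkl'):
    (models_dir/f).symlink_to(pretrained/f)      # link instead of copying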

Any ETA on that?

I’m beginning to port my NLP work from the old fastai to the new framework, but not until we have both forward and backward pre-trained wiki103 models at the very least.

If y’all need any help, let me know.

-wg

I trained a new LM on a custom dataset by fine-tuning the pre-trained LM. Happy to say that everything worked without any problems and it was extremely easy! Kudos to the fastai team.

I have one question. Before training the model, I split my texts 80/10/10 into train, val, and test, and passed all three of them to the factory function to create my data_lm. I can see from the progress output (training loss, val loss, accuracy) that the training and val datasets have been used. But I’m not sure how to use the test dataset now that I’ve created, trained, and saved the model.

In other words, how would I “test” this LM? Please note that I intend to use it on another task (regression instead of classification), so that would serve as a good test. But I was wondering how to use the test set that I passed to the factory method.

Thank you

You should use learn.get_preds(is_valid=False) to get your predictions on the test set.
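e.g. (the exact return format may still change at this stage):

preds = learn.get_preds(is_valid=False)  # run the model over the test set instead of the valid set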
I’m glad you like it so far :wink:


That method throws an error saying get_preds does not exist:

learn.get_preds


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>
----> 1 learn.get_preds

AttributeError: 'RNNLearner' object has no attribute 'get_preds'

Ah yes, that’s because we removed the tta import from there, I guess. Try from fastai.tta import * and tell me if that works; I’ll fix it based on what you report.
