Beginning of NLP

@sgugger Re: 007b_imdb_classifier “the pretrained model and the corresponding itos dictionary here and put them in the MODEL_PATH folder.” Where can we download the itos4.pkl and lstm4.pth files from?

Are these from wt103? e.g., is the itos4.pkl what was previously known as models/wt103/itos_wt103.pkl? Or are these coming from another pre-trained model?

Thanks!

Will put them on files.fast.ai sometime today, thanks for reminding me I have to do this!

All done, you can find the model and the vocabulary here. Will add a note in the notebook.


Thanks! Is itos4.pkl == models/wt103/itos_wt103.pkl, etc.? Both names are used in 007b_imdb_classifier.

Yes, I haven’t been very consistent with names, but itos4 and itos_wt103 are the same thing, lstm4 and lstm_wt103 as well.
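For anyone loading these files by hand: the itos pickle is just a list mapping integer ids to token strings, so something along these lines should work (the path is wherever you put the downloaded file):

    import pickle

    # itos = "index to string": a list mapping integer ids to tokens.
    with open('models/wt103/itos_wt103.pkl', 'rb') as f:
        itos = pickle.load(f)

    # Reverse mapping ("string to index") for numericalizing new text.
    stoi = {s: i for i, s in enumerate(itos)}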


There might be a potential bug in the tokenizer that may have carried over from the fastai v0 code, in particular in the inner for loop of the tokenizer method in TextDataset:

        for i in range(self.n_labels+1, len(df.columns)):
            texts += f' {FLD} {i-n_lbls} ' + df[i].astype(str)

Wouldn’t the code need to be texts += f' {FLD} {i-n_lbls+1} ' + df[i].astype(str)? Otherwise we will end up with 2 fields that have xfld=1. This was referenced here as well.
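Concretely, the corrected loop would look like this (keeping the snippet’s variable names):

        for i in range(self.n_labels+1, len(df.columns)):
            # Offset by 1 so the extra text columns get xfld 2, 3, ...,
            # leaving xfld 1 for the first text column only.
            texts += f' {FLD} {i-n_lbls+1} ' + df[i].astype(str)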


Ah yes, it was corrected in the imdb scripts, but not the notebook, so I didn’t change it. Thanks for catching this!

This might be another bug. I get an error saying n_lbls is not defined at the same spot in the code. Sure enough, it doesn’t show up anywhere in text/data.py. Interestingly, it didn’t throw that error earlier, even though the only change was the increment by 1. Am I missing something?

Oops, must be a bad copy-paste. Fixed it in this commit.

I don’t know if this is the right place or whether it is too soon, but I was wondering if there could be an option to give a custom path for the tmp directory in the text dataset. My understanding from reading the docs and perusing the code is that if I point to train.csv in a PATH, the code will create a tmp directory within that path, copy over the csv, and create all the intermediate files for tokenization. I have two questions regarding this:

  1. If my valid.csv is in the same path as train.csv, and that path now already has a tmp directory, what will happen? Will the tmp directory get overwritten, or will the code just use the stuff in it?
  2. If I rename the tmp directory to something else, will the code still be able to load it in?

Thanks.

The tmp directory is for all the internal files that the TextDataset uses to remember its computations. If you change its name, the code will recreate it and redo all the computation for tokenization and numericalization.
If you add a valid.csv file, it will just add more files to the tmp directory (named valid*).
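Conceptually, the caching behaves like this sketch (the function and file names here are illustrative only, not the library’s actual ones):

    from pathlib import Path
    import numpy as np

    def cached_tokens(path: Path, name: str, tokenize) -> np.ndarray:
        # Illustrative sketch of the caching behaviour described above;
        # the real TextDataset uses its own internal file names.
        cache = path / 'tmp' / f'{name}_tok.npy'
        if cache.exists():                 # tmp intact -> reuse the result
            return np.load(cache, allow_pickle=True)
        cache.parent.mkdir(exist_ok=True)  # tmp renamed/missing -> redo it all
        toks = np.array(tokenize((path / f'{name}.csv').read_text()), dtype=object)
        np.save(cache, toks)
        return toks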

It seems that the new codebase presumes the underlying data format is a .csv file, whereas it would be infinitely more reusable if, instead, it assumed a DataFrame (which could be created via a .csv file, .tsv file, SQL query, an HTTP GET, etc.).

I’d be curious to hear your thoughts on either adding from_df() type helper methods and/or just assuming DataFrames instead of CSV files.
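Something along these lines, where the csv helper is just a thin wrapper over a DataFrame-based constructor (a rough sketch of the API shape, not a final implementation):

    import pandas as pd

    class TextDataset:
        def __init__(self, df: pd.DataFrame, n_labels: int = 1):
            self.df, self.n_labels = df, n_labels

        @classmethod
        def from_df(cls, df: pd.DataFrame, n_labels: int = 1):
            # The DataFrame can come from a .csv, .tsv, SQL query, HTTP GET, etc.
            return cls(df, n_labels=n_labels)

        @classmethod
        def from_csv(cls, path, n_labels: int = 1):
            # The csv version simply delegates to the df version.
            return cls.from_df(pd.read_csv(path, header=None), n_labels=n_labels)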

That would most probably be better. In other parts of the library that’s what we’ve done, and had the csv version of functions simply call the df version. Happy to take a PR that does this (please test that the documentation code and examples still work if you do this).

Will work up something today for you all to consider.

In terms of testing, I assume that by testing the documentation code and the examples you mean the /examples/text.ipynb file, correct? If there is any other code I should be testing, just lmk.

@wgpubs this is the full set of notebooks: https://github.com/fastai/fastai_docs/tree/master/docs_src . The ones starting with “text.” are the ones that you might be interested in. Especially “text.data”.

Ok thanks.

Is there a particular process for submitting PRs I should follow? I know you guys were doing things a bit differently with using notebooks for initial development, but I was wondering if I can work in the more formal way of:

  1. Forking the repo
  2. Symlinking to the /fastai/fastai folder
  3. Submitting PRs from my forked repo.

The notebook dev stuff was just for the initial work - we’re not using it any more. It’s just a regular library now. And you don’t need to symlink - see the fastai readme for how to do an ‘editable install’. Also, you may want to look at hub for creating PRs - it’s really convenient.

The only bit that’s different to usual processes is our docs.

http://docs.fast.ai/gen_doc.html

Oh wow … cool: pip install -e .[dev]

You mentioned hub before so I’ll check it out (I’m so old school when it comes to submitting PRs).


Done.

I have another recommendation I’m hoping you all might be amenable to as well. It has to do with being able to specify the TEXT columns and the LABEL columns in the .csv or DataFrame, instead of using the n_labels parameter to delineate between text and label columns.

The reason for this is to make running multiple experiments and/or ablations easier by not requiring the user to construct a DataFrame or .csv file for every possible configuration they want to test. For example, I’m working with a dataset with about 5 potential TEXT columns and multiple LABEL columns (some I’d use for multilabel problems and others individually for multiclass problems). Being able to specify the TEXT and LABEL columns to use would allow me to create a single DataFrame that could be used to create multiple TextDatasets and/or learners.

In addition to the configuration flexibility, it seems more intuitive, as the user specifically declares which columns they want to use for both TEXT and LABELs, without having to worry about whether their columns are specified in the right order.

I’m recommending this as an option for TextDataset or as a replacement for the n_labels approach.
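As a rough sketch of what I have in mind (the select_cols helper and the text_cols/label_cols parameter names are purely illustrative):

    import pandas as pd

    def select_cols(df: pd.DataFrame, text_cols, label_cols):
        # Pull the requested TEXT and LABEL columns out of one shared DataFrame.
        return df[text_cols].astype(str), df[label_cols]

    # One DataFrame, several experiments (column names are hypothetical):
    df = pd.read_csv('data.csv')
    texts, labels = select_cols(df, ['title', 'body'], ['sentiment'])
    texts2, multi = select_cols(df, ['body'], ['tag_a', 'tag_b'])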

I’m working through the IMDB doc and will eventually be adapting it to a custom dataset in another project. I have a question about the pretrained models. The pretrained “model” from fastai 0.7 includes these files:

bwd_wt103_enc.h5  bwd_wt103.h5  fwd_wt103_enc.h5  fwd_wt103.h5  itos_wt103.pkl

The fastai 1.0 model has these files:

itos_wt103.pkl  lstm_wt103.pth

  1. There is a separate backward and forward model in the 0.7 version of the library, but only one lstm_wt103.pth in 1.0. Is this forward only (or backward)?
  2. Does the vocabulary between the two match? In other words, are both the itos_wt103.pkl the same?

Finally, I have an off-topic question. It is my understanding that this section of the forums is for discussing the development of the new fastai library. Can I also post questions about usage of the library (for example, how to do something or whether something is possible) in this section? If not, where is the best place to ask those questions? Sorry for the off-topic question; I didn’t want to create a new thread for it, to avoid noise.

Thank you.
