Beginning of NLP

It seems that the new codebase presumes the underlying data format is a .csv file, whereas it would be infinitely more reusable if, instead, it assumed a DataFrame (which could be created from a .csv file, a .tsv file, a SQL query, an HTTP GET, etc.).

I’d be curious to hear your thoughts on either adding from_df() type helper methods and/or just assuming DataFrames instead of CSV files.
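To illustrate the reusability argument, here's a small sketch (the `from_df()` helper named above is hypothetical; these are just standard pandas loaders) showing how the same DataFrame can come from several different sources:

```python
import io
import pandas as pd

# The same DataFrame can be built from many sources; a hypothetical
# from_df() helper would accept any of them interchangeably.
csv_src = io.StringIO("label,text\n0,hello world\n1,goodbye\n")
df_from_csv = pd.read_csv(csv_src)

tsv_src = io.StringIO("label\ttext\n0\thello world\n1\tgoodbye\n")
df_from_tsv = pd.read_csv(tsv_src, sep="\t")

# Other sources would work the same way, e.g.:
# df_from_sql  = pd.read_sql(query, connection)
# df_from_http = pd.read_csv("https://example.com/data.csv")
```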

That would most probably be better. In other parts of the library that’s what we’ve done, with the csv version of functions simply calling the df version. Happy to take a PR that does this (please test that the documentation code and examples still work if you do).

Will work up something today for you all to consider.

In terms of testing, I assume by testing the documentation code and the examples you mean the /examples/text.ipynb file, correct? If there is any other code I should be testing just lmk.

@wgpubs this is the full set of notebooks: https://github.com/fastai/fastai_docs/tree/master/docs_src . The ones starting with “text.” are the ones that you might be interested in. Especially “text.data”.

Ok thanks.

Is there a particular process for submitting PRs I should follow? I know you guys were doing things a bit differently with using notebooks for initial development but was wondering if I can work in the more formal way of:

  1. Forking the repo
  2. Symlink to /fastai/fastai folder
  3. Submit PRs from my forked repo.

The notebook dev stuff was just for the initial work - we’re not using it any more. It’s just a regular library now. And you don’t need to symlink - see the fastai readme for how to do an ‘editable install’. Also, you may want to look at hub for creating PRs - it’s really convenient.

The only bit that’s different to usual processes is our docs.

http://docs.fast.ai/gen_doc.html

Oh wow … cool: pip install -e .[dev]

You mentioned hub before so I’ll check it out (I’m so old school when it comes to submitting PRs).


Done.

I have another recommendation I’m hoping you all might be amenable to as well. It has to do with being able to specify the TEXT columns and the LABEL columns in the .csv or DataFrame instead of using the n_labels parameter to delineate between text and label columns.

The reason for this is to make running multiple experiments and/or ablations easier by not requiring the user to construct a DataFrame or .csv file for every possible configuration they want to test. For example, I’m working with a dataset with about 5 potential TEXT columns and multiple LABEL columns (some I’d use for multilabel problems and others individually for multiclass problems). Being able to specify the TEXT and LABEL columns to use would allow me to create a single DataFrame that could be used to create multiple TextDatasets and/or learners.

In addition to the configuration flexibility, it seems more intuitive as the user specifically declares what columns they want to use for both TEXT and LABELs without having to worry about whether their columns are specified in the right order.

I’m recommending this as an option for TextDataset or as a replacement for the n_labels method.
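To make the idea concrete, here's a rough sketch (column names and the `select_experiment` helper are made up for illustration, not part of fastai) of how one master DataFrame could feed several experiments:

```python
import pandas as pd

# One master DataFrame with several candidate TEXT and LABEL columns
# (column names here are invented for illustration).
df = pd.DataFrame({
    "title":    ["t1", "t2"],
    "body":     ["b1", "b2"],
    "summary":  ["s1", "s2"],
    "topic":    [0, 1],      # target for a multiclass experiment
    "is_toxic": [0, 1],      # targets for a multilabel experiment
    "is_spam":  [1, 0],
})

def select_experiment(df, txt_cols, lbl_cols):
    """Hypothetical helper: slice out just the columns one run needs."""
    return df[txt_cols + lbl_cols]

exp_multiclass = select_experiment(df, ["title", "body"], ["topic"])
exp_multilabel = select_experiment(df, ["summary"], ["is_toxic", "is_spam"])
```

With named columns, neither experiment needs its own .csv file, and column order never matters.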

I’m working through the IMDB doc and will eventually be adapting it to a custom dataset in another project. I have a question about the pretrained models. The pretrained “model” from fastai 0.7 includes these files:

bwd_wt103_enc.h5  bwd_wt103.h5  fwd_wt103_enc.h5  fwd_wt103.h5  itos_wt103.pkl

The fastai 1.0 model has these files:

itos_wt103.pkl  lstm_wt103.pth

  1. There is a separate backward and forward model in the 0.7 version of the library but only one lstm_wt103.pth in 1.0. Is this only forward (or backward)?
  2. Does the vocabulary between the two match? In other words, are both the itos_wt103.pkl the same?

Finally I have an off-topic question. It is my understanding that this section of the forums is for discussing the development of the new fastai library. Can I put questions about usage of the library (for example how to do one thing or whether something is possible, etc.) in this section as well? If not, where is the best place to ask those questions? Sorry for the off-topic question; I didn’t want to create a new thread for it, to avoid noise.

Thank you.


The models from fastai v1 aren’t backward compatible, which is why you have new ones. I haven’t trained a backward model yet but I’ll launch this sometime next week. The vocabularies are different, as we cut the vocab size to 60k for this new model (rare words will be fine-tuned on the custom dataset).

For questions, I think the best is to create a new topic with a clear title, so that it benefits all the others with the same problem.

The TextDataset has a txt_cols variable initialized to None in its constructor. Does the variable represent the number of fields in our data source? If so, do we need to specify them when constructing the dataset?

For example, I have a dataframe (converted to a csv) which has two columns named name and item_description (along with a dummy labels column), and I would like both of them included in the dataset for the language model. So when I create my dataset, would I specify txt_cols=2 since I have two fields, or do that variable and the number of fields have no bearing on each other?

You can either initialize from a csv/data frame with n_labels (which gives the number of label columns at the beginning; one column of text is assumed) or by specifying the names of the text columns and label columns in txt_cols and lbl_cols. Both should work properly at this stage, but the latter is new and not tested a lot, so report any bug you find.

I’m confused. Doesn’t n_labels refer to the number of labeled classes (in case of a classification problem)? So for LM this will always be 1 where all the label values are 0. Am I not understanding this correctly?

n_labels could be greater than 1 if you have multiple labels (you’d need to build your custom model, of course). It’s not the number of classes.

So if I had a dataframe such as this (after processing/extracting the texts from the main data source for building a LM):

   labels  name    item_description
0  0       name_1  desc_1
1  0       name_2  desc_2
2  0       name_3  desc_3

What would n_labels refer to here? The way I’m thinking, n_labels=1 since there’s only 1 label with values of 0, and the number of fields is 2 (name and item_description).

If n_labels is not the number of classes, what would be an example where n_labels > 1? Sorry for so many questions!

You can either pass your csv with

  • n_labels=1 and no header
  • txt_cols = ['name', 'item_description'] and lbl_cols=['labels'] with the header.

A problem with n_labels > 1 is when you have multi-label classification in NLP (where each text can have several different labels). I think the toxic comment competition is a good example.
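A rough sketch of that n_labels > 1 layout (made-up data, plain pandas slicing rather than the actual TextDataset internals):

```python
import pandas as pd

# Toxic-comment-style layout: with n_labels=6, the first six columns
# are labels and the last is the text (header-less, positional).
rows = [
    [0, 0, 1, 0, 0, 0, "some comment"],
    [1, 1, 0, 0, 1, 0, "another comment"],
]
df = pd.DataFrame(rows)

n_labels = 6
labels = df.iloc[:, :n_labels]   # one row per example, six labels each
texts  = df.iloc[:, n_labels]    # the single text column
```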

Passing chunksize as a parameter to text_data_from_df results in the following error (not that we would ever need to do that):

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-056e51589e97> in <module>
----> 1 data_lm = text_data_from_df(PATH, train_df=train, valid_df=test, data_func=lm_data, max_vocab=60_000, chunksize=24_000, min_freq=2, txt_cols=['name', 'item_description'], label_cols=['label'])

~/fastai/fastai/text/data.py in text_data_from_df(path, train_df, valid_df, test_df, tokenizer, data_func, vocab, **kwargs)
    324     path=Path(path)
    325     txt_kwargs, kwargs = extract_kwargs(['max_vocab', 'chunksize', 'min_freq', 'n_labels', 'txt_cols', 'label_cols'], kwargs)
--> 326     train_ds = TextDataset.from_df(path, train_df, tokenizer, 'train', vocab=vocab, **txt_kwargs)
    327     datasets = [train_ds, TextDataset.from_df(path, valid_df, tokenizer, 'valid', vocab=train_ds.vocab, **txt_kwargs)]
    328     if test_df: datasets.append(TextDataset.from_df(path, test_df, tokenizer, 'test', vocab=train_ds.vocab, **txt_kwargs))

~/fastai/fastai/text/data.py in from_df(cls, folder, df, tokenizer, name, **kwargs)
    142         tokenizer = ifnone(tokenizer, Tokenizer())
    143         chunksize = 1 if (type(df) == DataFrame) else df.chunksize
--> 144         return cls(folder, tokenizer, df=df, create_mtd=TextMtd.DF, name=name, chunksize=chunksize, **kwargs)
    145 
    146     @classmethod

TypeError: type object got multiple values for keyword argument 'chunksize'

A little research indicated that the error “can happen if you pass a keyword argument for which one of the keys is similar (has the same string name) to a positional argument,” as given in the 2nd answer in this stackoverflow question. The suggested solution is: “You would have to remove the keyword argument from the kwargs before passing it to the method.” I’m not sure how to do that. I also found this. Just wanted to bring this to your attention.
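For what it’s worth, the Stack Overflow suggestion boils down to popping the key out of kwargs before forwarding them. A minimal sketch of the clash and the fix (simplified stand-ins, not the actual fastai classes):

```python
class Dataset:
    """Simplified stand-in for TextDataset."""
    def __init__(self, folder, chunksize=1, **kwargs):
        self.chunksize = chunksize
        self.kwargs = kwargs

def from_df(folder, df, **kwargs):
    # A DataFrame is already fully in memory, so chunksize is fixed here.
    # Popping any user-supplied 'chunksize' out of kwargs first avoids
    # "got multiple values for keyword argument 'chunksize'".
    kwargs.pop("chunksize", None)
    return Dataset(folder, chunksize=1, **kwargs)

# Without the pop, this call would raise the TypeError shown above.
ds = from_df(".", df=None, chunksize=24_000, max_vocab=60_000)
```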

Thanks.

Are you passing in chunksize as an argument as well as passing in a DataFrame with chunksize specified?

If so, don’t pass in chunksize separately, as the DataFrame is already “chunked”. The only time passing in chunksize as an argument is needed is when you are creating your dataset from a .csv on the filesystem.

In that case, wouldn’t it be better if chunksize were not extracted into kwargs? Because in the from_df method we have this line:

chunksize = 1 if (type(df) == DataFrame) else df.chunksize

For a DataFrame, wouldn’t the chunksize always be 1, since the whole thing is already loaded into memory?

I removed this argument because it makes no sense to pass a chunksize in the from_df method. It won’t always be 1; it depends on how you loaded that dataframe.
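To see why it isn’t always 1: pandas can hand you either a fully-loaded DataFrame or a chunked reader, depending on how you call read_csv. A small inline example:

```python
import io
import pandas as pd

src = "text\n" + "\n".join(f"row {i}" for i in range(10))

# Loaded all at once: a plain DataFrame, everything in memory.
df = pd.read_csv(io.StringIO(src))

# Loaded lazily: a TextFileReader that yields DataFrames of
# `chunksize` rows each instead of one big frame.
reader = pd.read_csv(io.StringIO(src), chunksize=4)
chunks = list(reader)  # 4 + 4 + 2 rows
```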