Unsupervised Text Data in fastaiV1

Hi,
One of the interesting aspects of ULMFiT is that you can use target-specific unsupervised data to fine-tune your language model. Currently, if you want to do so in fastaiV1, you have to manually tokenize your unsupervised set, concatenate it with your training set, create fake labels, and feed all that to a TextLMDataBunch. It would be really great if this could be done automatically.

At the moment, TextLMDataBunch wants labels for its training and validation sets. I’m not sure if this is intended or a consequence of TextClasDataBunch also being based on TextDataBunch; feel free to enlighten me.

You could add an unsup argument to the creation of TextLMDataBunch such that, if it is not None, it would take the unsupervised data and concatenate it with the training data by itself (and tokenize and numericalize both if not done already).

What do you think would be the best way to add unsupervised Text Data support in fastaiV1?

The labels TextLMDataBunch wants aren’t used. You can create one without passing labels if you use from_tok or from_ids. It’s just the way datasets are built in fastai_v1: we need an x and a y. If you encounter an error at any point because of the lack of labels, just give a bunch of 0s.
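For instance, a minimal sketch of that dummy-labels trick (train_ids here is assumed to be your array of numericalized texts):

import numpy as np

# the LM builds its targets by shifting the token ids, so these labels
# are never actually read; zeros are enough to satisfy the API
fake_labels = np.zeros(len(train_ids), dtype=np.int64)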

So I’m not sure what you mean by adding unsupervised data, as all the data used for language model training is unsupervised.

Imagine you have 3 files: one called train.csv containing labelled data (n_labels label columns and one column of text), one called valid.csv with the same format, and one called unsup.csv with only one column of text. I can open those files and tokenize each by creating the appropriate TextDataSets. Then I would load my ids, concatenate the ids of the train and unsup files into a train_lm.ids file, and create a TextLMDataBunch.from_ids from this concatenated set.
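A rough sketch of that manual route (the file locations and the from_ids signature are assumptions on my part, as both have shifted across fastai_v1 releases):

import numpy as np
from fastai.text import TextLMDataBunch, Vocab

# assumed locations for the ids and itos saved by the tokenization step
train_ids = np.load('tmp/train_ids.npy', allow_pickle=True)
unsup_ids = np.load('tmp/unsup_ids.npy', allow_pickle=True)
valid_ids = np.load('tmp/valid_ids.npy', allow_pickle=True)
vocab = Vocab(np.load('tmp/itos.npy', allow_pickle=True).tolist())

# the concatenation described above: supervised + unsupervised ids for the LM
train_lm_ids = np.concatenate([train_ids, unsup_ids])
data_lm = TextLMDataBunch.from_ids('.', vocab, train_ids=train_lm_ids, valid_ids=valid_ids)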

What I would like is to be able to pass an unsup argument to TextLMDataBunch.from_csv (and other compatible methods), so that if unsup is not None, my unsup.csv and train.csv are tokenized, numericalized, and then concatenated to create the training set of the language model.
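Concretely, the call could look something like this (a hypothetical signature; the unsup argument does not exist, it is what I’m proposing):

# proposed, not existing, fastai_v1 API: unsup.csv would be tokenized,
# numericalized and concatenated into the LM training set automatically
data_lm = TextLMDataBunch.from_csv('.', train='train.csv', valid='valid.csv', unsup='unsup.csv')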

Why don’t you load the two of them into dataframes, concatenate them (putting 0 in the unsup labels), and use the from_df factory method?

Of course this works. But in that case, when you want to train your classifier, you have to split your tokens again to select only the supervised entries. My point is not that it’s impossible to do in fastaiV1: it is possible with several methods, but each of them comes with the need to concatenate by yourself beforehand or to split again after tokenization. Using unsupervised data to fine-tune the language model feels really natural, and I think it would be great to have it integrated into TextLMDataBunch, so that one call with the right method would get your data ready to go.
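To make the extra bookkeeping concrete, here is a sketch with a hypothetical is_unsup flag column:

import pandas as pd

# tag the rows so the supervised subset can be recovered for the classifier
train_df['is_unsup'] = False
unsup_df['is_unsup'] = True
lm_df = pd.concat([train_df, unsup_df], ignore_index=True)

# ...build and fine-tune the language model on lm_df...

# the manual split step I would rather avoid:
clas_df = lm_df[~lm_df['is_unsup']].drop(columns='is_unsup')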

Note that having the same data for the language model and the classifier is only a convenience for documentation or end-to-end examples. In practical cases, it’s expected you’d have different data objects.

In the new API we’re designing, I’ll see how to specify a list of csv files instead of just one. That may be helpful for this case.

Indeed, doing that would allow us to use unsupervised data very easily.

In practical cases, I would be tempted to put my data into this format (unsup, train, valid). Since tokenization takes time, it is a legitimate concern that it be done only once: if I had two not-yet-preprocessed datasets, train_lm and train_clas, I wouldn’t want to tokenize each of them in full when their intersection was non-empty. In that regard, it seems natural to me to split the data like this.

Do you think making a PR that adds an "unsup" argument to TextLMDataBunch (data that would be concatenated with the train set) would be a good idea, or do you think it goes against fastaiV1's design and I should handle this kind of situation "by hand" locally?

No PRs for this yet. The API will change when the course gets to NLP, as Jeremy and I refactor the library for the topics of the course.
Also, this will be handled by the new data block API, not the general factory methods (those will stay very general).

I have the same scenario: I have training, validation and test dataframes, and at the same time my domain-related corpus sits in a different dataframe (domain_text_df). How do I pass in this domain corpus? Right now, I’m using the calls below to create my DataBunches:

Language model data

data_lm = TextLMDataBunch.from_df(path=".", train_df=train_df, test_df=test_df, valid_df=valid_df,
                                  label_cols='multi_label_column', label_delim=' ',
                                  text_cols='my_text_col')

Classifier model data

data_clas = TextClasDataBunch.from_df(path=".", train_df=train_df, test_df=test_df, valid_df=valid_df,
                                      label_cols='multi_label_column', label_delim=' ',
                                      text_cols='my_text_col', vocab=data_lm.train_ds.vocab, bs=32)

How do I pass in the unsupervised data (domain-based text without labels)?
Is there any provision for passing that in the API?

Let’s say your unsupervised data is also in a dataframe called unsup_df; all you have to do is

lm_train_df = pd.concat([train_df, unsup_df]) (note that pd.concat takes a list, and make sure unsup_df and train_df have the same format)

then you pass lm_train_df instead of train_df to your TextLMDataBunch constructor.
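For example, reusing the column arguments from the earlier call:

data_lm = TextLMDataBunch.from_df(path=".", train_df=lm_train_df, valid_df=valid_df,
                                  label_cols='multi_label_column', label_delim=' ',
                                  text_cols='my_text_col')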


In my scenario I have some articles related to my domain. I can put them in a dataframe with a column named ‘text’, but it won’t have the same columns as train_df. Can I fill the missing label columns with null values to match the train_df format?

Absolutely, the language model will only use the text so you can fill the other columns with dummy values.
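For example, using the column names from the posts above (the dummy value itself is arbitrary, and articles stands for your assumed list of domain texts):

import pandas as pd

domain_text_df = pd.DataFrame({'my_text_col': articles})  # articles: your domain texts
domain_text_df['multi_label_column'] = '0'                # dummy label, never read by the LM
lm_train_df = pd.concat([train_df, domain_text_df], ignore_index=True)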

Is it possible (or even desirable) to construct the TextLMDataBunch without holding any data for a validation set? Let’s say for example you’ve tuned the LM hyperparameters through a k-fold cross validation process, and you want to retrain the LM from scratch with your chosen hyperparameters using all your data (both labeled training data meant for the classifier and unsupervised domain-specific data). How could you accomplish this using the from_df constructor, as it requires designation of a valid_df argument? If you’re doing from_csv you can simply set the valid_pct parameter to 0. For from_df, is it as simple as passing an empty data frame into the valid_df argument?
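For what it’s worth, the from_csv route mentioned above would look roughly like this (the file name is a placeholder, and this assumes your fastai_v1 version exposes valid_pct on from_csv; whether from_df accepts an empty valid_df is best checked empirically):

# every row goes to the training set; no validation split is held out
data_lm = TextLMDataBunch.from_csv('.', 'all_data.csv', valid_pct=0.0, text_cols='my_text_col')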