Beginning of NLP

Note that there has been a big change in the API to get your data (the arguments are more or less the same; it’s just the name of the call you have to adapt).
See here for all the details; the examples and docs are updated.

1 Like

I have a CSV with a single column which contains an article on each row. I do not have any labels associated with them since I would like to create a language model. When I run

data_lm = TextLMDataBunch.from_csv(PATH, bs=bs)

I get the following error:

~/.local/lib/python3.6/site-packages/fastai/text/data.py in tokenize(self)
     87             df = next(dfs) if (type(dfs) == pd.io.parsers.TextFileReader) else self.df
     88             lbl_type = np.float32 if len(self.label_cols) > 1 else np.int64
---> 89             lbls = df[self.label_cols].values.astype(lbl_type) if (len(self.label_cols) > 0) else []
     90             self.txt_cols = ifnone(self.txt_cols, list(range(len(self.label_cols),len(df.columns))))
     91             texts = f'{FLD} {1} ' + df[self.txt_cols[0]].astype(str)

ValueError: invalid literal for int() with base 10: '...'

Do we still need label columns when instantiating TextLMDataBunch? According to the documentation, it sounds like we do, but it’s a bit confusing.

I ended up setting n_labels=0. This let the DataBunch be created successfully, but when I run

learn.fit_one_cycle(4, 1e-2)

I get the following error:

ValueError: Target size (torch.Size([1024])) must be the same as input size (torch.Size([1024, 60002]))

Seems like I’m missing something. Could you help me understand the API? Should I be creating a dummy label column in my input CSV?

I guess this is the necessary step!
Let us know if it worked out.

Best regards
Michael

Just add a column of zeros. You can do it many ways: either load the CSV into a pandas dataframe and add a column of zeros, or do it in the terminal. If you google how to add a column to a CSV you should get some useful results. It’s just a dummy column that serves no purpose for the language model other than to maintain a consistent API.
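
For instance, a quick pandas sketch (the file and column names here are just placeholders):

import pandas as pd

df = pd.read_csv('articles.csv', header=None, names=['text'])  # placeholder: CSV with one article per row
df.insert(0, 'label', 0)  # dummy label column of zeros
df.to_csv('articles_lm.csv', index=False, header=False)

Then point TextLMDataBunch.from_csv at the new file.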

I see. If this is the necessary step to keep the API consistent, I think it should be spelled out more explicitly in the documentation. My expectation was that the TextLMDataBunch.from_csv function would do it for me automatically.

Important change! I’ve just updated the API to be more torchvision-like so when you want to use the fastai pretrained model, you should now use:

learn = RNNLearner.language_model(data_lm, pretrained_model=URLs.WT103)

It will download the model for you into the .fastai/models/ folder the first time you use it (so it’s in one place once and for all and you don’t have to download it in every project) and then load it. You can still use the old pretrained_fnames=[{weights_fname},{itos_fname}] if you train your own model locally.
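
For instance, with locally trained weights it would look something like this (the two file names are placeholders for your saved weights and itos files):

learn = RNNLearner.language_model(data_lm, pretrained_fnames=['my_wgts', 'my_itos'])  # placeholder file names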

The example is properly updated; docs will follow soon.

3 Likes

Is .fastai automatically created in the home directory? If it is, is there an option to relocate it? In the system where I work, we have very limited home space and most of the data/models are stored in an external mount.

There’s a config file where you can redirect those folders, fear not :wink:

Thank you very much! :slight_smile: I’m guessing details of how to set these up will be in docs (now or soon).

Yes it is indeed documented:

http://docs.fast.ai/datasets.html#download_data-1
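
In short, there’s a small config file (it should live at ~/.fastai/config.yml; the exact key names may vary by version, so double-check the docs above) where you can point the data and model folders at your external mount, e.g.:

data_path: /mnt/external/fastai/data
model_path: /mnt/external/fastai/models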

2 Likes

Is there a backwards pretrained URLs.WT103 model available yet?

I don’t think so. @sgugger said he’ll work on it in the near future.

Does it mean that we do Multitask Learning when we use n_labels > 1?

Let’s say, my dataset has the following columns (no headers):

  • 10 columns which correspond to 10 different classes to which we can classify each document (each label is 0 or 1 so that we could use Sigmoid to predict each), ex.: sad, funny, touching, offensive, well_written, disapproving, interesting, insightful, entertaining, provocative
  • 1 column with the full text of the document at the end.

The problem is that each document can incorporate many labels at the same time, which is why Softmax would be a bad choice. I could, of course, train 10 classifiers with Sigmoid output layer separately, but Multitask Learning would be a better option. How could I adapt RNNLearner to reflect this Multitask Problem?

Also, is the following definition of TextDataset and TextClasDataBunch correct for my case?

  • train_ds = TextDataset.from_csv(folder=path, name="train", n_labels = 10)

  • data_clas = TextClasDataBunch.from_csv(path=path, train = "train", valid="valid", test="test", vocab = data_lm.train_ds.vocab, bs=32, n_labels = 10)

Thanks a lot in advance!

If you pass multiple labels, the RNNLearner will be adjusted accordingly. As you pointed out, it will use sigmoid instead of softmax, but you normally don’t have to do anything.

1 Like

Thanks a lot! So it is Multitask learning then, correct? Because for each label (column) defined by n_labels, RNNLearner.classifier(data_clas) learns one sigmoid classifier, and they are not trained separately but all 10 mini-classifiers at once. I am asking because you said earlier here that we would need to build a custom model for it.

I am also a little confused about the classes argument. If I always define my labels (per column) in a binary fashion, what should I pass to this argument?

  • For example: classes = ['no', 'yes']?
  • or rather: classes = [sad, funny, touching, offensive, well_written, disapproving, interesting, insightful, entertaining, provocative]
  • or can I skip this classes argument entirely?

Sorry for so many questions at once.

I run through the docs periodically as I git pull commits from the repo. Most recently, I ran the IMDB example from the docs. The LM part ran very smoothly, but I got an error in the classifier part. In particular, when I called the fit_one_cycle method after creating the RNNLearner.classifier and loading the encoder via load_encoder, I got the following error:

learn.fit_one_cycle(1, 1e-2)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-25-3ea49add0339> in <module>
----> 1 learn.fit_one_cycle(1, 1e-2)

~/fastai/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, wd, callbacks, **kwargs)
     17     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor,
     18                                         pct_start=pct_start, **kwargs))
---> 19     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
     20 
     21 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, **kwargs:Any):

~/fastai/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    160         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    161         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 162             callbacks=self.callbacks+callbacks)
    163 
    164     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/fastai/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     92     except Exception as e:
     93         exception = e
---> 94         raise e
     95     finally: cb_handler.on_train_end(exception)
     96 

~/fastai/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     82             for xb,yb in progress_bar(data.train_dl, parent=pbar):
     83                 xb, yb = cb_handler.on_batch_begin(xb, yb)
---> 84                 loss = loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     85                 if cb_handler.on_batch_end(loss): break
     86 

~/fastai/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     16     if not is_listy(xb): xb = [xb]
     17     if not is_listy(yb): yb = [yb]
---> 18     out = model(*xb)
     19     out = cb_handler.on_loss_begin(out)
     20 

/net/vaosl01/opt/NFS/sw/anaconda3/envs/mer-su0/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    475             result = self._slow_forward(*input, **kwargs)
    476         else:
--> 477             result = self.forward(*input, **kwargs)
    478         for hook in self._forward_hooks.values():
    479             hook_result = hook(self, input, result)

TypeError: forward() takes 2 positional arguments but 927 were given

Please note that I just followed the docs directly without any modifications.

If you pass n_labels = 10, the model should have a last linear layer with 10 outputs and a loss function that is binary cross entropy with logits (so sigmoid + cross entropy). I say should because it’s untested.
The classes should be the names of each of your labels (so the second solution you offered), but you can skip the argument entirely.
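
To make that concrete, the whole setup would look roughly like this (paths and the encoder name are placeholders, and data_lm is the language-model DataBunch built earlier):

data_clas = TextClasDataBunch.from_csv(path=path, train='train', valid='valid', test='test',
                                       vocab=data_lm.train_ds.vocab, n_labels=10, bs=32)
learn = RNNLearner.classifier(data_clas)
learn.load_encoder('lm_encoder')  # placeholder name of the saved LM encoder
learn.fit_one_cycle(1, 1e-2)  # loss should be binary cross entropy with logits (one sigmoid per label)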

If AWD-LSTM is the basic Language Model in fastai_v1, is there an equivalent for sequence tagging/entity recognition in fastai_v1?

Thank you very much sgugger! I tested this on my dataset (in fact, I have 20 labels, not 10).

I was able to train LM without problems, but when I train RNNLearner.classifier I get the following error:

  • ValueError: Target size (torch.Size([8, 20])) must be the same as input size (torch.Size([8, 2])) when I set the batch size to 16.
  • ValueError: Target size (torch.Size([16, 20])) must be the same as input size (torch.Size([16, 2])) when I set the batch size to 32.

I ran this twice:

  1. With dummy label column set to [0]*len(df): link to my GitHub file
  2. Without any dummy column: link to my GitHub file

Am I doing something wrong? And which version is better, 1 or 2? It looks like it only works when there are just 2 outputs in the last linear layer of the classifier, because right now it cannot match the input of size 2 with the classifier output (target) of size 20.

Btw, I use fastai version 1.0.14 (in the new 1.0.15, TextDataset doesn’t even have the from_csv() method).

I would appreciate any advice!

1 Like

It’s fixed now. Someone had changed the way DataBunch handles collate_fn, which broke this part.