Beginning of NLP

Yes, the argument is is_test, sorry about that. And it won't work for an RNN because the output is a tuple and not a tensor, and the callbacks that handle that aren't called there, so you kind of have to write your own loop (a sketch follows below).
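For what it's worth, such a loop can be quite short. A minimal sketch, assuming a fastai v1 learn object and that the first element of the returned tuple is the prediction tensor (both assumptions, adapt to your model):

import torch

learn.model.eval()
all_preds = []
with torch.no_grad():
    for xb, yb in learn.data.valid_dl:      # iterate over the validation DataLoader
        out = learn.model(xb)
        if isinstance(out, tuple):          # RNN models return a tuple, not a tensor
            out = out[0]
        all_preds.append(out.cpu())         # collect raw predictions batch by batch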

As for the recorder, it isn’t saved with the model, so it stays as long as you keep your learn object around, but if you restart your notebook, it will have disappeared, yes.

Hi, since I am a beginner here, I would also like to know how to do prediction. In scikit-learn or Keras, there are very simple and intuitive functions to train and make predictions:

  • classifier.fit(x, y, epochs,…)
  • classifier.predict(x)

How about here? Is there any similar way to do it? From my experience trying ULMFiT, this is not as clear and easy as what I had with other frameworks. Thanks.

1 Like
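For orientation, the rough fastai v1 equivalents would look something like the sketch below. It assumes the get_preds / is_test API mentioned at the top of the thread; your version may differ:

learn.fit(1, 1e-2)                            # roughly classifier.fit(x, y, epochs=...)
preds, targets = learn.get_preds()            # predictions (and targets) on the validation set
test_preds, _ = learn.get_preds(is_test=True) # predictions on the test set, if one was attached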

I have something like this. When I run

learn = get_tabular_learner(data, layers=[200,100], metrics=exp_rmspe)
learn.fit(1, 1e-2)

I get:

BrokenPipeError                           Traceback (most recent call last)
<ipython-input> in <module>()
      1 learn = get_tabular_learner(data, layers=[200,100], metrics=exp_rmspe)
----> 2 learn.fit(1, 1e-2)

c:\users\gerar\fastai\fastai\basic_train.py in fit(self, epochs, lr, wd, callbacks)
    132         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    133         fit(epochs, self.model, self.loss_fn, opt=self.opt, data=self.data, metrics=self.metrics,
--> 134             callbacks=self.callbacks+callbacks)
    135
    136     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

c:\users\gerar\fastai\fastai\basic_train.py in fit(epochs, model, loss_fn, opt, data, callbacks, metrics)
     85     except Exception as e:
     86         exception = e
---> 87         raise e
     88     finally: cb_handler.on_train_end(exception)
     89

c:\users\gerar\fastai\fastai\basic_train.py in fit(epochs, model, loss_fn, opt, data, callbacks, metrics)
     69         cb_handler.on_epoch_begin()
     70
---> 71         for xb,yb in progress_bar(data.train_dl, parent=pbar):
     72             xb, yb = cb_handler.on_batch_begin(xb, yb)
     73             loss,_ = loss_batch(model, xb, yb, loss_fn, opt, cb_handler)

~\Anaconda3\lib\site-packages\fastprogress\fastprogress.py in __iter__(self)
     59         self.update(0)
     60         try:
---> 61             for i,o in enumerate(self._gen):
     62                 yield o
     63                 if self.auto_update: self.update(i+1)

c:\users\gerar\fastai\fastai\data.py in __iter__(self)
     45     def __iter__(self):
     46         "Process and returns items from DataLoader."
---> 47         self.gen = map(self.proc_batch, self.dl)
     48         return iter(self.gen)
     49

~\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __iter__(self)
    499
    500     def __iter__(self):
--> 501         return _DataLoaderIter(self)
    502
    503     def __len__(self):

~\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __init__(self, loader)
    287         for w in self.workers:
    288             w.daemon = True  # ensure that the worker exits on process exit
--> 289             w.start()
    290
    291         _update_worker_pids(id(self), tuple(w.pid for w in self.workers))

~\Anaconda3\lib\multiprocessing\process.py in start(self)
    110                'daemonic processes are not allowed to have children'
    111         _cleanup()
--> 112         self._popen = self._Popen(self)
    113         self._sentinel = self._popen.sentinel
    114         # Avoid a refcycle if the target function holds an indirect

~\Anaconda3\lib\multiprocessing\context.py in _Popen(process_obj)
    221     @staticmethod
    222     def _Popen(process_obj):
--> 223         return _default_context.get_context().Process._Popen(process_obj)
    224
    225 class DefaultContext(BaseContext):

~\Anaconda3\lib\multiprocessing\context.py in _Popen(process_obj)
    320     def _Popen(process_obj):
    321         from .popen_spawn_win32 import Popen
--> 322         return Popen(process_obj)
    323
    324 class SpawnContext(BaseContext):

~\Anaconda3\lib\multiprocessing\popen_spawn_win32.py in __init__(self, process_obj)
     63         try:
     64             reduction.dump(prep_data, to_child)
---> 65             reduction.dump(process_obj, to_child)
     66         finally:
     67             set_spawning_popen(None)

~\Anaconda3\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
     58 def dump(obj, file, protocol=None):
     59     '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60     ForkingPickler(file, protocol).dump(obj)
     61
     62 #

BrokenPipeError: [Errno 32] Broken pipe

pytorch 0.4.1

Windows 10 Pro

Neither of these is supported. Please see the README.

I will fix both tonight
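In the meantime, a common workaround for this kind of BrokenPipeError on Windows (it comes from the spawn-based DataLoader worker processes) is to guard the entry point and/or disable workers. A sketch; whether num_workers is accepted where your DataBunch is built is an assumption:

# Workaround 1: guard the entry point so spawned worker processes can safely re-import the script
if __name__ == '__main__':
    learn = get_tabular_learner(data, layers=[200,100], metrics=exp_rmspe)
    learn.fit(1, 1e-2)

# Workaround 2: avoid worker processes entirely (assumes the DataBunch factory accepts num_workers)
# data = tabular_data_from_df(..., num_workers=0)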

If I have a JSON with fields (categorical, continuous) and text, and I manage to load them into a dataframe,
can I still use tabular_data_from_df?

What is the best way to include the text in the embedding?
Should I just include it as an additional category?

I’m talking about a hybrid model (images, text, columns)
Maybe fields and a photo with description.

Note that there has been a big change in the API to get your data (the arguments are more or less the same; it's just the function names you have to adapt).
See here for all the details; examples and docs are updated.

1 Like

I have a CSV with a single column which contains an article on each row. I do not have any labels associated with them since I would like to create a language model. When I run

data_lm = TextLMDataBunch.from_csv(PATH, bs=bs)

I get the following error:

~/.local/lib/python3.6/site-packages/fastai/text/data.py in tokenize(self)
     87             df = next(dfs) if (type(dfs) == pd.io.parsers.TextFileReader) else self.df
     88             lbl_type = np.float32 if len(self.label_cols) > 1 else np.int64
---> 89             lbls = df[self.label_cols].values.astype(lbl_type) if (len(self.label_cols) > 0) else []
     90             self.txt_cols = ifnone(self.txt_cols, list(range(len(self.label_cols),len(df.columns))))
     91             texts = f'{FLD} {1} ' + df[self.txt_cols[0]].astype(str)

ValueError: invalid literal for int() with base 10: '...'

Do we still need label columns when instantiating TextLMDataBunch? According to the documentation it sounds like we do, but it's a bit confusing.

I ended up setting n_labels=0. This let the DataBunch be created successfully, but when I run

learn.fit_one_cycle(4, 1e-2)

I get the following error:

ValueError: Target size (torch.Size([1024])) must be the same as input size (torch.Size([1024, 60002]))

Seems like I’m missing something. Could you help me understand the API? Should I be creating a dummy label column in my input CSV?

I guess this is the necessary step!
Let us know if it worked out.

Best regards
Michael

Just add a column of zeros. You can do it many ways: either load the CSV into a pandas dataframe and add a column of zeros (a sketch follows below), or do it in the terminal. If you google how to add a column to a CSV you'll find useful results. It's just a dummy column that serves no purpose for the language model other than to maintain a consistent API.
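For instance, a minimal pandas version (file and column names are made up; the label column goes first, before the text):

import pandas as pd

df = pd.read_csv('articles.csv', header=None, names=['text'])
df.insert(0, 'label', 0)                                      # dummy label column of zeros, in first position
df.to_csv('articles_labeled.csv', index=False, header=False)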

I see. If this is the necessary step to keep the API consistent, I think it should be spelled out more explicitly in the documentation. My expectation was that the TextLMDataBunch.from_csv function would do that for me automatically.

Important change! I’ve just updated the API to be more torchvision-like so when you want to use the fastai pretrained model, you should now use:

learn = RNNLearner.language_model(data_lm, pretrained_model=URLs.WT103)

It will download the model for you into the .fastai/models/ folder the first time you use it (so it's in one place once and for all, and you don't have to download it again in every project), then load it. You can still use the old pretrained_fnames=[{weights_fname},{itos_fname}] if you train your own model locally.

Example is properly updated, docs will follow soon.

3 Likes

Is .fastai automatically created in the home directory? If it is, is there an option to relocate it? In the system where I work we have very limited home space, and most of the data/models are stored on an external mount.

There's a config file where you can redirect those folders, fear not :wink:

Thank you very much! :slight_smile: I’m guessing details of how to set these up will be in docs (now or soon).

Yes it is indeed documented:

http://docs.fast.ai/datasets.html#download_data-1

2 Likes
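If it helps, one way is to write the config file directly. The path and key names below are assumptions on my part; double-check them against the page above:

from pathlib import Path

# Hypothetical sketch: point fastai's data and model folders at an external mount.
config = Path.home()/'.fastai'/'config.yml'
config.parent.mkdir(exist_ok=True)
config.write_text('data_path: /mnt/external/fastai/data\n'
                  'model_path: /mnt/external/fastai/models\n')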

Is there a backwards pretrained URLs.WT103 model available yet?

I don’t think so. @sgugger said he’ll work on it in the near future.

Does it mean that we do Multitask Learning when we use n_labels > 1?

Let’s say, my dataset has the following columns (no headers):

  • 10 columns which correspond to 10 different classes a document can belong to (each label is 0 or 1, so we could use a sigmoid to predict each), e.g. sad, funny, touching, offensive, well_written, disapproving, interesting, insightful, entertaining, provocative
  • 1 column with the full text of the document at the end.

The problem is that each document can carry many labels at the same time, which is why softmax would be a bad choice. I could, of course, train 10 separate classifiers with a sigmoid output layer, but multitask learning would be a better option. How could I adapt RNNLearner to reflect this multitask problem?

Also, is the following definition of TextDataset and TextClasDataBunch correct for my case?

  • train_ds = TextDataset.from_csv(folder=path, name="train", n_labels = 10)

  • data_clas = TextClasDataBunch.from_csv(path=path, train = "train", valid="valid", test="test", vocab = data_lm.train_ds.vocab, bs=32, n_labels = 10)

Thanks a lot in advance!

If you pass multiple labels, the RNNLearner will be adjusted accordingly. As you pointed out, it will use sigmoid instead of softmax, but you normally don't have to do anything.

1 Like
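To make the "adjusted accordingly" part concrete: with n_labels > 1 the classifier ends in a single linear layer with one output per label, trained jointly against a binary cross-entropy loss, rather than 10 separate models. A plain PyTorch sketch of the idea (illustrative only, not fastai's actual head):

import torch
import torch.nn as nn

n_features, n_labels = 400, 10              # hypothetical encoder size and label count
head = nn.Linear(n_features, n_labels)      # one shared layer, one logit per label
loss_fn = nn.BCEWithLogitsLoss()            # sigmoid + binary cross-entropy over all labels

x = torch.randn(32, n_features)             # a batch of encoded documents
y = torch.randint(0, 2, (32, n_labels)).float()
loss = loss_fn(head(x), y)                  # one backward pass updates all 10 "mini-classifiers"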

Thanks a lot! So it is multitask learning then, correct? Because for each label (column) defined by n_labels, RNNLearner.classifier(data_clas) learns one sigmoid mini-classifier, and they are trained not separately but all 10 at once. I am asking because you said earlier here that we would need to build a custom model for it.

I am also a little confused about the classes argument. If I always define my labels (per column) in a binary fashion, what should I pass to this argument?

  • For example: classes = ['no', 'yes']?
  • or rather: classes = [sad, funny, touching, offensive, well_written, disapproving, interesting, insightful, entertaining, provocative]
  • or can I skip this classes argument entirely?

Sorry for so many questions at once.