Fastai v2 text

Hi,

I’m running through the quick start code with a different dataset and databunches built with from_df. I’m having an issue where I run the exact same code at different times, but the text classifier at the end outputs differently formatted results.

One of them is as it appears in the quick start page:
(Category positive, tensor(1), tensor([0.0049, 0.9951]))

While the other one that I’m getting often is like this:
(FloatItem [0.177895], tensor([0.1779]), tensor([0.1779]))

Can anyone help me figure out why the model is outputting predictions in the second form and not the first?

Edit: OK, I figured out that it was because I had my labels as a list of 0’s and 1’s, which seemed to make the classifier treat it as a regression problem. Changing the 0’s to ‘negative’ and 1’s to ‘positive’ makes it work consistently as expected now.
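
For reference, a minimal sketch of that relabeling (assuming the labels live in a pandas column named label, which is a placeholder):

# Hypothetical sketch: map the integer labels to strings so fastai
# treats the task as classification rather than regression.
df['label'] = df['label'].map({0: 'negative', 1: 'positive'})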

Pushed a few changes under the hood for the text classifier to limit memory usage (I was unable to fit ULMFiT with less than 10GB of GPU RAM). The main thing to know is that you should use pad_input_chunk instead of pad_input now, otherwise your data won’t arrive at the model in the format it expects.
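
A minimal sketch of the change (the from_df call and column names are placeholders modelled on examples later in this thread):

# Pass pad_input_chunk (not pad_input) so batches arrive in the chunked
# format the classifier now expects.
dbch = TextDataBunch.from_df(df, text_col='text', label_col='label',
                             before_batch=pad_input_chunk)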

I am interested in building a language model at work as the primary objective, using fastai2.

However, I just have a couple of questions:

  1. The data has personal contact details and numbers. Is there any way in fastai2 to remove such information? One way I imagined was to keep only the top 60k vocab, which should remove phone numbers and personal information, and replace everything else with <UNK> (see the sketch after this list).
  2. How can I be sure that ULMFiT won’t predict special tokens like XXMAJ or XXTOP when I run inference to complete the sentence?
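
A minimal sketch of the idea in question 1, assuming fastai2’s Numericalize, which caps the vocabulary at max_vocab tokens and maps the rest to the unknown token:

# Keep only the ~60k most frequent tokens; rarer ones (phone numbers,
# names, etc.) fall back to the unknown token.
num = Numericalize(max_vocab=60000, min_freq=3)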

Thanks in advance :slight_smile:

How do I properly construct a DataSource for multi-label regression on text?

I tried factory from_df

TextDataBunch.from_df(raw_train, seed=42, text_col=["question_title", "question_body", "answer"],
                             text_vocab=lm_vocab, label_col=labels, bs=128, device=default_device(),
                             before_batch=pad_input)

AttributeError: 'Series' object has no attribute 'question_title'

I also tried DataSource, but I don’t know what to pass so it works with MultiCategorize instead of attrgetter:

splits = RandomSplitter(0.1, seed=42)(raw_train)
tfms = [attrgetter("text"), Tokenizer.from_df(["question_title", "question_body", "answer"]), Numericalize(vocab=lm_vocab)]
dsrc = DataSource(raw_train, [tfms, [labels, MultiCategorize()]], splits=splits, dl_type=SortedDL)

TypeError: 'list' object is not callable

I believe you’d want the DataBlock instead (mid-level). For a hint, here is IMDB in DataBlock form:


imdb_clas = DataBlock(blocks=(TextBlock(vocab), CategoryBlock),
                      get_x=attrgetter('text'),
                      get_y=ColReader('label'),
                      splitter=RandomSplitter(),
                      dl_type=SortedDL)

And here we can replace CategoryBlock with MultiCategoryBlock.
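
A sketch of that swap, mirroring the block above (the 'labels' column name is a placeholder):

imdb_multi = DataBlock(blocks=(TextBlock(vocab), MultiCategoryBlock),
                       get_x=attrgetter('text'),
                       get_y=ColReader('labels'),  # one-hot multi-label targets
                       splitter=RandomSplitter(),
                       dl_type=SortedDL)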

(Here is my notebook if it would help: notebook)

Almost there, what’s tok_tfm?

dbch = DataBlock(blocks=(TextBlock(vocab=lm_vocab), MultiCategoryBlock),
                      get_x=ColReader('text'),
                      get_y=ColReader(labels),
                      splitter=RandomSplitter(0.1, seed=42),
                      dl_type=SortedDL).databunch(df_tok, home, bs=128)

I am also following the source code, but that’s not quite right:

dbch = DataBlock(blocks=(TextBlock.from_df(vocab=lm_vocab, text_cols="text"), MultiCategoryBlock),
                      get_x=attrgetter('text'),
                      get_y=ColReader(labels),
                      splitter=RandomSplitter(0.1, seed=42),
                      dl_type=SortedDL).databunch(df_tok, bs=128)

dbch.show_batch()

    222             if attr is not None: return getattr(attr, k)
--> 223         raise AttributeError(k)
    224     def __dir__(self): return custom_dir(self, self._dir() if self._xtra is None else self._dir())
    225 #     def __getstate__(self): return self.__dict__

AttributeError: _IterableDataset_len_called

tok_tfm?

Yep, TextBlock’s __init__ requires tok_tfm as the first parameter: https://github.com/fastai/fastai2/blob/3c8eec601e09b35b0d400768ffa04a492be7bb47/fastai2/text/data.py#L156
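
A minimal sketch of passing it explicitly (assuming fastai2’s Tokenizer.from_df, as used earlier in this thread; the column name is a placeholder):

# tok_tfm is the first positional argument, then the vocab.
tb = TextBlock(Tokenizer.from_df('text'), vocab=lm_vocab)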

Ah yes :slight_smile: What issue is it giving you? You shouldn’t need to pass anything in; it should default to SpaCy.

TypeError: __init__() missing 1 required positional argument: 'tok_tfm'

@much_learner sorry! That method is outdated (I haven’t looked at it in a while). From what I can see, you should try TextBlock.from_df. There you should be able to do what we were trying above and replace CategoryBlock with MultiCategoryBlock.

I already did that; as I mentioned above, dbch.show_batch() fails with:

AttributeError: _IterableDataset_len_called

Ah! That’s because torchvision did an update last night and broke everything. See the install directions for Colab in “Fastai-v2 - read this before posting please! 😊”, as I’m constantly updating them. You want version 0.4.2. Apologies for glossing over that, @much_learner :slight_smile:
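
A minimal sketch of the pin, as a Colab cell:

# Pin torchvision to the known-good release mentioned above.
!pip install torchvision==0.4.2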

I think I am creating labels in the wrong way https://colab.research.google.com/drive/1qYybvyqbQXHb820JIkOnHKqX_UDWcfQU#scrollTo=f6cKsenN7JEf

I see they are one-hot encoded (tensor([1., 0., 0., 0., 1., 0., …). How do I change them to regression? The y vocab also contains categories, but these numbers should be floats.

df_tok, count = tokenize_df(raw_train, text_cols=text_cols)
dbch = DataBlock(blocks=(TextBlock.from_df(vocab=lm_vocab, text_cols="text"), MultiCategoryBlock),
                      get_x=ColReader('text'),
                      get_y=ColReader(labels),
                      splitter=RandomSplitter(0.2, seed=42),
                      dl_type=SortedDL).databunch(df_tok, home, bs=128)

learn = text_classifier_learner(dbch, AWD_LSTM, metrics=[accuracy], path=home,
                                cbs=[EarlyStoppingCallback(monitor='accuracy', min_delta=0.01, comp=np.greater, patience=5),
                                     SaveModelCallback(monitor="accuracy", comp=np.greater, fname="best_model")]
                               ).to_fp16()
learn.load_encoder("ft_enc_v2")

learn.lr_find()

   1837     if dim == 2:
   1838         ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)

ValueError: Expected input batch_size (128) to match target batch_size (3200).
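
Not from this thread, but one possible direction is fastai2’s RegressionBlock, which keeps the targets as floats; the n_out value and everything else below are assumptions based on the code above:

# Hypothetical sketch: swap MultiCategoryBlock for RegressionBlock so
# targets stay floats instead of being one-hot encoded. The accuracy
# metric would also need replacing with a regression metric.
dbch = DataBlock(blocks=(TextBlock.from_df(vocab=lm_vocab, text_cols="text"),
                         RegressionBlock(n_out=len(labels))),
                 get_x=ColReader('text'),
                 get_y=ColReader(labels),
                 splitter=RandomSplitter(0.2, seed=42),
                 dl_type=SortedDL).databunch(df_tok, home, bs=128)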

Hello!

I’m attempting to recreate the IMDB example from Fastai v1:

With the provided IMDB notebook for Fastai v2:

It appears any databunch saving and loading is missing from the Fastai v2 notebook:

data_lm = load_data(path, 'data_lm.pkl', bs=bs)

Indeed, when I try to run this in my Fastai v2 notebook, I get:

NameError: name 'load_data' is not defined

I’m assuming this means this functionality hasn’t made it yet to Fastai v2. I’ve checked the documentation and don’t see a way to load/save a databunch. Can someone confirm if this is indeed the case so I know I’m not missing anything? :slight_smile:

Thanks!

Yes. This is mostly due (I think) to how the new language model data works: there’s no need to, since the data is grabbed as it’s needed and made on the fly instead of all at once.

Thanks for the quick answer!

On my system, it takes about an hour to open the databunch again in Fastai v2:

imdb_lm = DataBlock(blocks=(TextBlock.from_df('text', is_lm=True),),
                    get_x=attrgetter('text'),
                    splitter=RandomSplitter())
dbunch = imdb_lm.databunch(df_orig, bs=64, seq_len=72)
dbunch.show_batch(max_n=6)

But only about 10 seconds to load the old databunch using Fastai v1 code:

data_lm = load_data(path, 'data_lm.pkl', bs=bs)

I hope that’s helpful feedback!

Edit from below: Update: It doesn’t actually take the full hour to load the databunch. After it runs for a few minutes, the estimate goes down significantly, then it finishes well ahead of schedule. I didn’t time it, but it probably took 5-10 minutes or so.

It is, thank you! Most likely it’ll be implemented a bit later (I know they restructured text a little while ago), as I know they just got to load_learner :slight_smile:

Actually, no. This is mostly because all our objects pickle now, so you can directly save the DataBunch that takes time to create with torch.save and load it back with torch.load.
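
A minimal sketch of that round trip (the filename is illustrative):

import torch

torch.save(dbunch, 'data_lm.pkl')   # pickle the DataBunch once it's built
dbunch = torch.load('data_lm.pkl')  # restore it in seconds on the next run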

Also note that for a problem in a folder like IMDB, fastai2 caches all the tokenized texts, so you don’t need to tokenize twice. It still takes time to load the second time because it needs to read the lengths of all the files; caching those is on my TODO list.

Hi @sgugger. Does this replicate the behaviour we had in fastai v1 with collate? For example, if I have a document with 1000 tokens, will the databunch and the model use bptt to break the input into bptt-sized chunks and feed the model?