Fastai v2 text

Yep, TextBlock's __init__ requires tok_tfm as the first param: https://github.com/fastai/fastai2/blob/3c8eec601e09b35b0d400768ffa04a492be7bb47/fastai2/text/data.py#L156

Ah yes :slight_smile: What issue is it giving you? You shouldn't need to pass anything in; it should default to SpaCy.

TypeError: __init__() missing 1 required positional argument: 'tok_tfm'

@much_learner sorry! That method is outdated (I haven't looked at it in a while). From what I can see, you should try TextBlock.from_df. There you should be able to do what we were trying above and replace CategoryBlock with MultiCategoryBlock.
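A minimal sketch of the difference, assuming a DataFrame with a 'text' column (the column name is just a placeholder, and this isn't tested against that exact revision):

from fastai2.text.all import *

# TextBlock() with no arguments raises:
#   TypeError: __init__() missing 1 required positional argument: 'tok_tfm'
# The from_df constructor builds the tokenizer transform (SpaCy by default) for you:
tb = TextBlock.from_df('text')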

I already did that, as I mentioned above. dbch.show_batch() fails with:

AttributeError: _IterableDataset_len_called

Ah! That's because torchvision did an update last night and broke everything. See the install directions for Colab in the "Fastai-v2 - read this before posting please! 😊" thread, as I'm constantly updating them. You want version 0.4.2. Apologies for glossing over that, @much_learner :slight_smile:
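In a Colab cell that would be something like (the version number comes from the post above):

# Pin torchvision before installing/importing fastai2
!pip install torchvision==0.4.2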


I think I am creating labels in the wrong way https://colab.research.google.com/drive/1qYybvyqbQXHb820JIkOnHKqX_UDWcfQU#scrollTo=f6cKsenN7JEf

I see they are one-hot encoded, tensor([1., 0., 0., 0., 1., 0., …]). How do I change them to regression? The y vocab also contains categories, but these numbers should be floats.

df_tok, count = tokenize_df(raw_train, text_cols=text_cols)
dbch = DataBlock(blocks=(TextBlock.from_df(vocab=lm_vocab, text_cols="text"), MultiCategoryBlock),
                 get_x=ColReader('text'),
                 get_y=ColReader(labels),
                 splitter=RandomSplitter(0.2, seed=42),
                 dl_type=SortedDL).databunch(df_tok, home, bs=128)

learn = text_classifier_learner(dbch, AWD_LSTM, metrics=[accuracy], path=home,
                                cbs=[EarlyStoppingCallback(monitor='accuracy', min_delta=0.01, comp=np.greater, patience=5),
                                     SaveModelCallback(monitor="accuracy", comp=np.greater, fname="best_model")]
                               ).to_fp16()
learn.load_encoder("ft_enc_v2")

learn.lr_find()

   1837     if dim == 2:
   1838         ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)

ValueError: Expected input batch_size (128) to match target batch_size (3200).
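
Not a confirmed diagnosis, but the numbers are suggestive: 3200 / 128 = 25, so the loss function is seeing 25 target values per item. A small PyTorch sketch (the 25 labels are an assumption inferred from that arithmetic) reproduces the same error when a multi-hot target matrix is flattened and handed to a cross-entropy loss that expects one class index per item:

import torch
import torch.nn.functional as F

bs, n_labels = 128, 25                   # 128 * 25 = 3200, matching the error message
logits = torch.randn(bs, n_labels)       # one prediction row per item
targets = torch.zeros(bs, n_labels)      # multi-hot labels, one row per item

# Flattening the targets (as a "flat" cross-entropy loss does) turns 128 rows
# into 3200 individual targets for only 128 predictions:
F.cross_entropy(logits, targets.view(-1).long())
# ValueError: Expected input batch_size (128) to match target batch_size (3200).

If that is what is happening, the fix is on the label/loss side: multi-label or regression targets need a BCE- or MSE-style loss rather than plain cross-entropy.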

Hello!

I'm attempting to recreate the IMDB example from Fastai v1 with the provided IMDB notebook for Fastai v2. It appears the databunch saving and loading is missing from the Fastai v2 notebook:

data_lm = load_data(path, 'data_lm.pkl', bs=bs)

Indeed, when I try to run this in my Fastai v2 notebook, I get:

NameError: name 'load_data' is not defined

I’m assuming this means this functionality hasn’t made it yet to Fastai v2. I’ve checked the documentation and don’t see a way to load/save a databunch. Can someone confirm if this is indeed the case so I know I’m not missing anything? :slight_smile:

Thanks!

Yes. This is mostly due to (I think) how the new language model data works: there's no need to save it, since it's grabbed as it's needed and made on the fly instead of all at once.


Thanks for the quick answer!

On my system, it takes about an hour to open the databunch again in Fastai v2:

imdb_lm = DataBlock(blocks=(TextBlock.from_df('text', is_lm=True),),
                    get_x=attrgetter('text'),
                    splitter=RandomSplitter())
dbunch = imdb_lm.databunch(df_orig, bs=64, seq_len=72)
dbunch.show_batch(max_n=6)

But only about 10 seconds to load the old databunch using Fastai v1 code:

data_lm = load_data(path, 'data_lm.pkl', bs=bs)

I hope that’s helpful feedback!

Edit from below: It doesn't actually take the full hour to load the databunch. After it runs for a few minutes, the estimate goes down significantly, and it finishes well ahead of schedule. I didn't time it, but it probably took 5-10 minutes or so.

It is, thank you! Most likely it'll be implemented a bit later (I know they restructured text a little while ago), as I know they just got to load_learner :slight_smile:


Actually no. This is mostly because all our objects pickle now, so you can directly save with torch.save, and load with torch.load, any DataBunch that takes time to create.
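
So something along these lines should work (the filename is just an example):

import torch

# Save the DataBunch that is slow to build, then reload it directly later:
torch.save(dbunch, 'imdb_lm_dbunch.pkl')
dbunch = torch.load('imdb_lm_dbunch.pkl')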

Also note that for a problem with texts in folders, like IMDB, fastai2 caches all the tokenized texts so you don't need to tokenize twice. It still takes time to load the second time because it needs to read the lengths of all the files; caching those is on my TODO list.


Hi @sgugger. Does this replicate the behaviour we had in fastai v1 with collate? For example, if I have a document with 1000 tokens, will the databunch and the model use bptt to break the input into bptt chunks and feed the model?

Yes. It will also use max_len to not backpropagate past that point, like in v1. The main difference from v1 is that it completely ignores the padding at the beginning and starts with a clean hidden state for each sequence in the batch.
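
In plain Python terms, the chunking described above looks roughly like this (the numbers are illustrative, not fastai internals):

bptt = 72
doc = list(range(1000))   # stand-in for a 1000-token document

# The classifier dataloader feeds the document to the model in bptt-sized
# pieces, carrying the hidden state across pieces; gradients are only kept
# for roughly the last max_len tokens.
chunks = [doc[i:i + bptt] for i in range(0, len(doc), bptt)]
print(len(chunks), len(chunks[-1]))   # 14 chunks, the last one with 64 tokens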

Thanks!

Glad to be wrong :slight_smile:

Hi!

Did you find a solution? I agree that regression seems to be omitted. In theory it could be handled by applying a simple ToTensor() transformation, but that did not work for me, so I am curious whether there is an easy fix.

There is something different, maybe the new way of dealing with sequences. I tested 3 models I had running with fastai v1 text, and all of them, using the same parameters, perform 1.5 to 2 percentage points lower in fastai v2 in terms of accuracy (even on IMDB), using the 0.0.6 version. Inspecting the dataloader, I noticed some texts end up with padding not only at the beginning of the sequence but also with some pad tokens at the end of the sequence. Is that expected?

No, I haven't figured it out. I went through the TransformBlock API and tried an actual TransformBlock too.

@sgugger Please confirm whether multi-label regression is available with the current fastai2 API (low- or high-level).

TransformBlock should work, as it should leave the targets as floats.
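
For anyone trying this, a rough sketch adapted from the DataBlock earlier in the thread (untested against this exact revision; score_cols stands for the float label column(s)):

dbch = DataBlock(blocks=(TextBlock.from_df(vocab=lm_vocab, text_cols='text'),
                         TransformBlock()),   # leaves the float targets as they are
                 get_x=ColReader('text'),
                 get_y=ColReader(score_cols),
                 splitter=RandomSplitter(0.2, seed=42),
                 dl_type=SortedDL).databunch(df_tok, bs=128)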