Fastai v2 text

Yep, TextBlock's __init__ requires tok_tfm as the first param: https://github.com/fastai/fastai2/blob/3c8eec601e09b35b0d400768ffa04a492be7bb47/fastai2/text/data.py#L156

Ah yes :slight_smile: What issue is it giving you? You shouldn't need to pass anything in; it should default to SpaCy.

TypeError: __init__() missing 1 required positional argument: 'tok_tfm'

@much_learner sorry! That method is outdated (I haven't looked at it in a while). From what I can see, you should try TextBlock.from_df. There you should be able to do what we were trying above and replace CategoryBlock with MultiCategoryBlock.
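A minimal sketch of the difference, assuming a DataFrame with a 'text' column (the column name is just a placeholder, and this isn't tested against that exact revision):

from fastai2.text.all import *

# TextBlock() with no arguments raises:
#   TypeError: __init__() missing 1 required positional argument: 'tok_tfm'
# The from_df constructor builds the tokenizer transform (SpaCy by default) for you:
tb = TextBlock.from_df('text')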

I already did that, as I mentioned above. dbch.show_batch() fails with:

AttributeError: _IterableDataset_len_called

Ah! That's because torchvision did an update last night and broke everything. See the install directions for Colab in the "Fastai-v2 - read this before posting please! 😊" thread, as I'm constantly updating them. You want version 0.4.2. Apologies for glossing over that, @much_learner :slight_smile:
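In a Colab cell that would be something like (the version number comes from the post above):

# Pin torchvision before installing/importing fastai2
!pip install torchvision==0.4.2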


I think I am creating labels in the wrong way https://colab.research.google.com/drive/1qYybvyqbQXHb820JIkOnHKqX_UDWcfQU#scrollTo=f6cKsenN7JEf

I see they are one-hot encoded, tensor([1., 0., 0., 0., 1., 0., …]). How do I change them to regression? The y vocab also contains categories, but these numbers should be floats.

df_tok, count = tokenize_df(raw_train, text_cols=text_cols)
dbch = DataBlock(blocks=(TextBlock.from_df(vocab=lm_vocab, text_cols="text"), MultiCategoryBlock),
                 get_x=ColReader('text'),
                 get_y=ColReader(labels),
                 splitter=RandomSplitter(0.2, seed=42),
                 dl_type=SortedDL).databunch(df_tok, home, bs=128)

learn = text_classifier_learner(dbch, AWD_LSTM, metrics=[accuracy], path=home,
                                cbs=[EarlyStoppingCallback(monitor='accuracy', min_delta=0.01, comp=np.greater, patience=5),
                                     SaveModelCallback(monitor="accuracy", comp=np.greater, fname="best_model")]
                               ).to_fp16()
learn.load_encoder("ft_enc_v2")

learn.lr_find()

   1837     if dim == 2:
   1838         ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)

ValueError: Expected input batch_size (128) to match target batch_size (3200).
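
Not a confirmed diagnosis, but the numbers are suggestive: 3200 / 128 = 25, so the loss function is seeing 25 target values per item. A small PyTorch sketch (the 25 labels are an assumption inferred from that arithmetic) reproduces the same error when a multi-hot target matrix is flattened and handed to a cross-entropy loss that expects one class index per item:

import torch
import torch.nn.functional as F

bs, n_labels = 128, 25                   # 128 * 25 = 3200, matching the error message
logits = torch.randn(bs, n_labels)       # one prediction row per item
targets = torch.zeros(bs, n_labels)      # multi-hot labels, one row per item

# Flattening the targets (as a "flat" cross-entropy loss does) turns 128 rows
# into 3200 individual targets for only 128 predictions:
F.cross_entropy(logits, targets.view(-1).long())
# ValueError: Expected input batch_size (128) to match target batch_size (3200).

If that is what is happening, the fix is on the label/loss side: multi-label or regression targets need a BCE- or MSE-style loss rather than plain cross-entropy.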

Hello!

I'm attempting to recreate the IMDB example from Fastai v1 with the provided IMDB notebook for Fastai v2. It appears the databunch saving and loading is missing from the Fastai v2 notebook:

data_lm = load_data(path, 'data_lm.pkl', bs=bs)

Indeed, when I try to run this in my Fastai v2 notebook, I get:

NameError: name 'load_data' is not defined

I’m assuming this means this functionality hasn’t made it yet to Fastai v2. I’ve checked the documentation and don’t see a way to load/save a databunch. Can someone confirm if this is indeed the case so I know I’m not missing anything? :slight_smile:

Thanks!

Yes. This is mostly due to (I think) how the new language model data works: there's no need to save it, since it's grabbed as it's needed and made on the fly instead of all at once.


Thanks for the quick answer!

On my system, it takes about an hour to open the databunch again in Fastai v2:

imdb_lm = DataBlock(blocks=(TextBlock.from_df('text', is_lm=True),),
                    get_x=attrgetter('text'),
                    splitter=RandomSplitter())
dbunch = imdb_lm.databunch(df_orig, bs=64, seq_len=72)
dbunch.show_batch(max_n=6)

But only about 10 seconds to load the old databunch using Fastai v1 code:

data_lm = load_data(path, 'data_lm.pkl', bs=bs)

I hope that’s helpful feedback!

Edit from below: It doesn't actually take the full hour to load the databunch. After it runs for a few minutes, the estimate goes down significantly, and it finishes well ahead of schedule. I didn't time it, but it probably took 5-10 minutes or so.

It is, thank you! Most likely it'll be implemented a bit later (I know they restructured text a little while ago), as I know they just got to load_learner :slight_smile:


Actually no. This is mostly because all our objects pickle now, so you can directly save with torch.save, and load with torch.load, any DataBunch that takes time to create.
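
So something along these lines should work (the filename is just an example):

import torch

# Save the DataBunch that is slow to build, then reload it directly later:
torch.save(dbunch, 'imdb_lm_dbunch.pkl')
dbunch = torch.load('imdb_lm_dbunch.pkl')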

Also note that for a problem with texts in folders, like IMDB, fastai2 caches all the tokenized texts so you don't need to tokenize twice. It still takes time to load the second time because it needs to read the lengths of all the files; caching those is on my TODO list.


Hi @sgugger. Does this replicate the behaviour we had in fastai v1 with collate? For example, if I have a document with 1000 tokens, will the databunch and the model use bptt to break the input into bptt chunks and feed the model?

Yes. It will also use max_len to not backpropagate past that point, like in v1. The main difference from v1 is that it completely ignores the padding at the beginning and starts with a clean hidden state for each sequence in the batch.
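
In plain Python terms, the chunking described above looks roughly like this (the numbers are illustrative, not fastai internals):

bptt = 72
doc = list(range(1000))   # stand-in for a 1000-token document

# The classifier dataloader feeds the document to the model in bptt-sized
# pieces, carrying the hidden state across pieces; gradients are only kept
# for roughly the last max_len tokens.
chunks = [doc[i:i + bptt] for i in range(0, len(doc), bptt)]
print(len(chunks), len(chunks[-1]))   # 14 chunks, the last one with 64 tokens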

Thanks!

Glad to be wrong :slight_smile:

Hi!

Did you find a solution? I agree that regression seems to be omitted. In theory it could be handled by applying a simple ToTensor() transformation, but that did not work for me, so I am curious whether there is an easy fix.

There is something different, maybe the new way of dealing with sequences. I tested 3 models I had running with fastai v1 text, and all of them, using the same parameters, perform 1.5 to 2 percentage points lower in fastai v2 in terms of accuracy (even on IMDB), using the 0.0.6 version. Inspecting the dataloader, I noticed some texts end up with padding not only at the beginning of the sequence but also with some pad tokens at the end of the sequence. Is that expected?

No, I haven't figured it out. I went through the TransformBlock API and tried an actual TransformBlock too.

@sgugger Please confirm whether multi-label regression is available with the current fastai2 API (low- or high-level).

TransformBlock should work, as it should leave the targets as floats.
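
For anyone trying this, a rough sketch adapted from the DataBlock earlier in the thread (untested against this exact revision; score_cols stands for the float label column(s)):

dbch = DataBlock(blocks=(TextBlock.from_df(vocab=lm_vocab, text_cols='text'),
                         TransformBlock()),   # leaves the float targets as they are
                 get_x=ColReader('text'),
                 get_y=ColReader(score_cols),
                 splitter=RandomSplitter(0.2, seed=42),
                 dl_type=SortedDL).databunch(df_tok, bs=128)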