Fastai v2 text

Yes. Like in v1, it will also use max_len so that it does not backpropagate past that point. The main new thing compared to v1 is that it completely ignores the padding at the beginning and starts with a clean hidden state for each sequence in the batch.
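
For reference, a minimal sketch of where that cap lives (hedged: max_len is a keyword argument of text_classifier_learner, 72*20 is just fastai's default, and dbunch is assumed to be an existing text DataLoaders):

from fastai2.text.all import *

# max_len caps how much of each sequence gradients flow back through;
# earlier tokens are still read, but without backpropagation.
learn = text_classifier_learner(dbunch, AWD_LSTM, max_len=72*20)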

Thanks!

Glad to be wrong :slight_smile:

Hi!

Did you find a solution? I agree that regression seems to be omitted. In theory it could be overcome by applying a simple ToTensor() transformation, but that did not work for me. So I am curious if there is an easy fix.

There is something different, maybe the new way of dealing with sequences. I tested three models I had running with fastai v1 text, and all of them, using the same parameters, score 1.5 to 2 percentage points lower in accuracy on fastai v2 (even on IMDB) with the 0.0.6 version. Inspecting the dataloader, I noticed that some texts end up with padding not only at the beginning of the sequence but also with some pad tokens at the end of the sequence. Is that expected?

No, I haven’t figured it out. I went through the TransformBlock API and tried an actual TransformBlock too.

@sgugger Please confirm whether multi-label regression is available with the current fastai2 API (low- or high-level).

TransformBlock should work, as it should leave the targets as floats.
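
As a rough sketch of that idea (an illustration, not from the thread): a TransformBlock whose type transform casts whatever get_y returns into a float tensor, so the targets stay floats all the way to the loss:

from fastai2.text.all import *

# Hypothetical helper block: cast the labels read by get_y into a float tensor.
# A named function is used instead of a lambda so the block stays picklable.
def to_float_tensor(o): return tensor(o).float()
float_block = TransformBlock(type_tfms=to_float_tensor)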

@sgugger Inspecting the dataloader (iter), I noticed that pad_input_chunk adds some pad tokens not only at the beginning of the sequence but also at the end of the sequence. Is that expected?

Yes. A sequence needs to begin at a round multiple of seq_len (otherwise the RNN is going to see some pad tokens that make no sense to it), so to achieve this, there is a little bit of padding at the end (which is then ignored in the masked concat pool).
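
To illustrate the chunked padding described here (a sketch of the behavior, not fastai's actual pad_input_chunk source; pad_chunked is a hypothetical name):

# Pad `tokens` up to `max_len`: whole chunks of seq_len pad tokens go in
# front, so the real text starts at a round multiple of seq_len, and the
# remainder goes at the end, where the masked concat pool ignores it.
def pad_chunked(tokens, max_len, seq_len=72, pad_idx=1):
    n_pad = max_len - len(tokens)
    front = (n_pad // seq_len) * seq_len
    back  = n_pad % seq_len
    return [pad_idx] * front + tokens + [pad_idx] * back

pad_chunked(list(range(6)), max_len=13, seq_len=4)
# -> [1, 1, 1, 1, 0, 1, 2, 3, 4, 5, 1, 1, 1]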

Thanks!

But how do I initialize it properly? When I do the following, targ is a tuple:

dbch = DataBlock(blocks=(TextBlock.from_df(vocab=lm_vocab, text_cols="text"), TransformBlock),
                 get_x=ColReader('text'),
                 get_y=ColReader(labels),
                 splitter=RandomSplitter(0.2, seed=42),
                 dl_type=SortedDL).databunch(df_tok, home, bs=128)

    241     def __call__(self, inp, targ, **kwargs):
    242         inp  = inp .transpose(self.axis,-1).contiguous()
--> 243         targ = targ.transpose(self.axis,-1).contiguous()
    244         if self.floatify and targ.dtype!=torch.float16: targ = targ.float()
    245         if targ.dtype in [torch.int8, torch.int16, torch.int32]: targ = targ.long()

AttributeError: 'tuple' object has no attribute 'transpose'
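
One plausible cause (an observation, not confirmed in the thread): ColReader called with a list of columns returns one value per column, so the target arrives as a tuple rather than a single tensor. A sketch of a workaround, assuming labels is a list of float columns in df_tok (get_float_targets is a hypothetical helper):

from fastai2.text.all import *

# Read all label columns as one float array, then let the TransformBlock
# turn that array into a single tensor target.
def get_float_targets(row): return row[labels].values.astype('float32')

dbch = DataBlock(blocks=(TextBlock.from_df(vocab=lm_vocab, text_cols="text"),
                         TransformBlock(type_tfms=tensor)),
                 get_x=ColReader('text'),
                 get_y=get_float_targets,
                 splitter=RandomSplitter(0.2, seed=42),
                 dl_type=SortedDL).databunch(df_tok, home, bs=128)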

Thanks for the information!

I loaded up a databunch:

bs = 64
imdb_lm = DataBlock(blocks=(TextBlock.from_df('text', is_lm=True),),
                    get_x=attrgetter('text'),
                    splitter=RandomSplitter())
dbunch = imdb_lm.databunch(df, bs=bs, seq_len=72)

This consumed about 45GB of RAM.

Showing the batch worked as expected:

dbunch.show_batch(max_n=6)

Then I tried torch.save:

torch.save(dbunch, 'dbunch.pkl')

RAM usage jumped up another 30GB, then it failed with this:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-12-02e721dd8bbd> in <module>
----> 1 torch.save(dbunch, 'dbunch.pkl')

~/environments/fastai2/lib/python3.6/site-packages/torch/serialization.py in save(obj, f, pickle_module, pickle_protocol)
    258         >>> torch.save(x, buffer)
    259     """
--> 260     return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
    261 
    262 

~/environments/fastai2/lib/python3.6/site-packages/torch/serialization.py in _with_file_like(f, mode, body)
    183         f = open(f, mode)
    184     try:
--> 185         return body(f)
    186     finally:
    187         if new_fd:

~/environments/fastai2/lib/python3.6/site-packages/torch/serialization.py in <lambda>(f)
    258         >>> torch.save(x, buffer)
    259     """
--> 260     return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
    261 
    262 

~/environments/fastai2/lib/python3.6/site-packages/torch/serialization.py in _save(obj, f, pickle_module, pickle_protocol)
    330     pickler = pickle_module.Pickler(f, protocol=pickle_protocol)
    331     pickler.persistent_id = persistent_id
--> 332     pickler.dump(obj)
    333 
    334     serialized_storage_keys = sorted(serialized_storages.keys())

AttributeError: Can't pickle local object 'ReindexCollection.__init__.<locals>._get'

This is with a 2.5GB CSV file. I get the same error when I truncate that file to only 64 rows, but I wanted to flag the RAM usage as well. Not sure if it's expected to use 75GB of RAM to process a 2.5GB CSV.

Thanks!

Update: It doesn’t actually take the full hour to load the databunch. After it runs for a few minutes, the estimate drops significantly, and it finishes well ahead of schedule. I didn’t time it, but it probably took 5-10 minutes or so.

Will look at that. The aim is to have those objects pickle, so this is a bug.

Thanks! Regarding memory usage, in case it helps, here's a comparison with the same data on fastai and fastai2:

Creating Databunch:
Fastai 1.0.59: 30GB
Fastai2 0.0.6: 45GB

Additional memory used to save Databunch:
Fastai 1.0.59: 21GB
Fastai2 0.0.6: 30GB

Yes, got the same error.

@sgugger Another note: the classification learner hard-codes cross-entropy loss. So far I have been working around it by modifying the method. Would it make sense to adjust it? I do not think it is worth a pull request to the repo…

@Cl_78_v you should use TextLearner instead, I believe.

Yes, it has the loss function in its signature, thanks. The initialization did not work for me; I may look into this later (I am getting a side error).

Ok, the bug for pickling has been fixed. LMDataLoader should pickle now (use fastcore master for the fix).
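
With that fix, round-tripping the databunch should be along the lines of (a sketch; the file name is arbitrary):

torch.save(dbunch, 'dbunch.pkl')   # pickles the LMDataLoader with the fastcore fix
dbunch = torch.load('dbunch.pkl')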

I am still trying to understand how to set up a dataloader for multi-label regression, and I am stuck.

Here’s what I do, but it seems the API doesn’t handle the targets properly. Do I need to add a block that handles all the float targets?

dbch = DataBlock(blocks=(TextBlock.from_df(vocab=lm_vocab, text_cols="text"), TransformBlock),
                 get_x=ColReader('text'),
                 get_y=ColReader(labels),
                 splitter=RandomSplitter(0.2, seed=42),
                 dl_type=SortedDL).dataloaders(df_tok, home, bs=128)

learn = text_classifier_learner(dbch, AWD_LSTM, metrics=[accuracy], path=home,
                                loss_func=CrossEntropyLossFlat(),
                                cbs=[EarlyStoppingCallback(monitor='accuracy', min_delta=0.01, comp=np.greater, patience=5),
                                     SaveModelCallback(monitor="accuracy", comp=np.greater, fname="best_model")]
                               ).to_fp16()
learn.load_encoder("ft_enc_v2");
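
One further observation on this snippet (not from the thread): CrossEntropyLossFlat and accuracy are classification tools, so a multi-label regression setup would presumably swap them out. A sketch under that assumption (MSELossFlat and rmse are fastai's regression loss and metric; passing n_out=len(labels) is an assumption, since the number of outputs cannot be inferred from a plain TransformBlock target):

learn = text_classifier_learner(dbch, AWD_LSTM, path=home, n_out=len(labels),
                                loss_func=MSELossFlat(), metrics=[rmse]).to_fp16()
learn.load_encoder("ft_enc_v2")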