Fastai v2 text

Yes. Like in v1, it will also use max_len so that it does not backpropagate past that point. The main new thing compared to v1 is that it completely ignores the padding at the beginning and starts with a clean hidden state for each sequence in the batch.
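
For reference, a minimal sketch of where that cap lives (hedged: max_len is a keyword argument of text_classifier_learner, 72*20 is just fastai's default, and dbunch is assumed to be an existing text DataLoaders):

from fastai2.text.all import *

# max_len caps how much of each sequence gradients flow back through;
# earlier tokens are still read, but without backpropagation.
learn = text_classifier_learner(dbunch, AWD_LSTM, max_len=72*20)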

Thanks!

Glad to be wrong :slight_smile:

Hi!

Did you find a solution? I agree that regression seems to be omitted. In theory it could be overcome by applying a simple ToTensor() transformation, but that did not work for me. So I am curious if there is an easy fix.

There is something different, maybe the new way of dealing with sequences. I tested three models I had running with fastai v1 text, and all of them, using the same parameters, score 1.5 to 2 percentage points lower in accuracy on fastai v2 (even on IMDB) with the 0.0.6 version. Inspecting the dataloader, I noticed that some texts end up with padding not only at the beginning of the sequence but also with some pad tokens at the end of the sequence. Is that expected?

No, I haven’t figured it out. I went through the TransformBlock API and tried an actual TransformBlock too.

@sgugger Please confirm whether multi-label regression is available with the current fastai2 API (low- or high-level).

TransformBlock should work, as it should leave the targets as floats.
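
As a rough sketch of that idea (an illustration, not from the thread): a TransformBlock whose type transform casts whatever get_y returns into a float tensor, so the targets stay floats all the way to the loss:

from fastai2.text.all import *

# Hypothetical helper block: cast the labels read by get_y into a float tensor.
# A named function is used instead of a lambda so the block stays picklable.
def to_float_tensor(o): return tensor(o).float()
float_block = TransformBlock(type_tfms=to_float_tensor)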

@sgugger Inspecting the dataloader (iter), I noticed that pad_input_chunk adds some pad tokens not only at the beginning of the sequence but also at the end of the sequence. Is that expected?

Yes. A sequence needs to begin at a round multiple of seq_len (otherwise the RNN is going to see some pad tokens that make no sense to it), so to achieve this, there is a little bit of padding at the end (which is then ignored in the masked concat pool).
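
To illustrate the chunked padding described here (a sketch of the behavior, not fastai's actual pad_input_chunk source; pad_chunked is a hypothetical name):

# Pad `tokens` up to `max_len`: whole chunks of seq_len pad tokens go in
# front, so the real text starts at a round multiple of seq_len, and the
# remainder goes at the end, where the masked concat pool ignores it.
def pad_chunked(tokens, max_len, seq_len=72, pad_idx=1):
    n_pad = max_len - len(tokens)
    front = (n_pad // seq_len) * seq_len
    back  = n_pad % seq_len
    return [pad_idx] * front + tokens + [pad_idx] * back

pad_chunked(list(range(6)), max_len=13, seq_len=4)
# -> [1, 1, 1, 1, 0, 1, 2, 3, 4, 5, 1, 1, 1]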

Thanks!

But how do I initialize it properly? When I do the following, targ is a tuple:

dbch = DataBlock(blocks=(TextBlock.from_df(vocab=lm_vocab, text_cols="text"), TransformBlock),
                 get_x=ColReader('text'),
                 get_y=ColReader(labels),
                 splitter=RandomSplitter(0.2, seed=42),
                 dl_type=SortedDL).databunch(df_tok, home, bs=128)

    241     def __call__(self, inp, targ, **kwargs):
    242         inp  = inp .transpose(self.axis,-1).contiguous()
--> 243         targ = targ.transpose(self.axis,-1).contiguous()
    244         if self.floatify and targ.dtype!=torch.float16: targ = targ.float()
    245         if targ.dtype in [torch.int8, torch.int16, torch.int32]: targ = targ.long()

AttributeError: 'tuple' object has no attribute 'transpose'
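
One plausible cause (an observation, not confirmed in the thread): ColReader called with a list of columns returns one value per column, so the target arrives as a tuple rather than a single tensor. A sketch of a workaround, assuming labels is a list of float columns in df_tok (get_float_targets is a hypothetical helper):

from fastai2.text.all import *

# Read all label columns as one float array, then let the TransformBlock
# turn that array into a single tensor target.
def get_float_targets(row): return row[labels].values.astype('float32')

dbch = DataBlock(blocks=(TextBlock.from_df(vocab=lm_vocab, text_cols="text"),
                         TransformBlock(type_tfms=tensor)),
                 get_x=ColReader('text'),
                 get_y=get_float_targets,
                 splitter=RandomSplitter(0.2, seed=42),
                 dl_type=SortedDL).databunch(df_tok, home, bs=128)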

Thanks for the information!

I loaded up a databunch:

bs = 64
imdb_lm = DataBlock(blocks=(TextBlock.from_df('text', is_lm=True),),
                    get_x=attrgetter('text'),
                    splitter=RandomSplitter())
dbunch = imdb_lm.databunch(df, bs=bs, seq_len=72)

This consumed about 45GB of RAM.

Showing the batch worked as expected:

dbunch.show_batch(max_n=6)

Then I tried torch.save:

torch.save(dbunch, 'dbunch.pkl')

RAM usage jumped up another 30GB, then it failed with this:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-12-02e721dd8bbd> in <module>
----> 1 torch.save(dbunch, 'dbunch.pkl')

~/environments/fastai2/lib/python3.6/site-packages/torch/serialization.py in save(obj, f, pickle_module, pickle_protocol)
    258         >>> torch.save(x, buffer)
    259     """
--> 260     return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
    261 
    262 

~/environments/fastai2/lib/python3.6/site-packages/torch/serialization.py in _with_file_like(f, mode, body)
    183         f = open(f, mode)
    184     try:
--> 185         return body(f)
    186     finally:
    187         if new_fd:

~/environments/fastai2/lib/python3.6/site-packages/torch/serialization.py in <lambda>(f)
    258         >>> torch.save(x, buffer)
    259     """
--> 260     return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
    261 
    262 

~/environments/fastai2/lib/python3.6/site-packages/torch/serialization.py in _save(obj, f, pickle_module, pickle_protocol)
    330     pickler = pickle_module.Pickler(f, protocol=pickle_protocol)
    331     pickler.persistent_id = persistent_id
--> 332     pickler.dump(obj)
    333 
    334     serialized_storage_keys = sorted(serialized_storages.keys())

AttributeError: Can't pickle local object 'ReindexCollection.__init__.<locals>._get'

This is with a 2.5GB CSV file. I get the same error when I truncate that file to only 64 rows, but I wanted to flag the RAM usage as well. Not sure if it's expected to use 75GB of RAM to process a 2.5GB CSV.

Thanks!

Update: It doesn’t actually take the full hour to load the databunch. After it runs for a few minutes, the estimate drops significantly, and it finishes well ahead of schedule. I didn’t time it, but it probably took 5-10 minutes or so.

Will look at that. The aim is to have those objects pickle, so this is a bug.

Thanks! Regarding memory usage, in case it helps, here's a comparison with the same data on fastai and fastai2:

Creating Databunch:
Fastai 1.0.59: 30GB
Fastai2 0.0.6: 45GB

Additional memory used to save Databunch:
Fastai 1.0.59: 21GB
Fastai2 0.0.6: 30GB

Yes, got the same error.

@sgugger Another note: the classification learner hard-codes cross-entropy loss. So far I have been working around it by modifying the method. Would it make sense to adjust it? I do not think it is worth a pull request to the repo…

@Cl_78_v you should use TextLearner instead, I believe.

Yes, it has the loss function in its signature, thanks. The initialization did not work for me; I may look into this later (I am getting a side error).

Ok, the bug for pickling has been fixed. LMDataLoader should pickle now (use fastcore master for the fix).
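
With that fix, round-tripping the databunch should be along the lines of (a sketch; the file name is arbitrary):

torch.save(dbunch, 'dbunch.pkl')   # pickles the LMDataLoader with the fastcore fix
dbunch = torch.load('dbunch.pkl')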

I am still trying to understand how to set up a dataloader for multi-label regression, and I am stuck.

Here’s what I do, but it seems the API doesn’t handle the targets properly. Do I need to add a block that handles all the float targets?

dbch = DataBlock(blocks=(TextBlock.from_df(vocab=lm_vocab, text_cols="text"), TransformBlock),
                 get_x=ColReader('text'),
                 get_y=ColReader(labels),
                 splitter=RandomSplitter(0.2, seed=42),
                 dl_type=SortedDL).dataloaders(df_tok, home, bs=128)

learn = text_classifier_learner(dbch, AWD_LSTM, metrics=[accuracy], path=home,
                                loss_func=CrossEntropyLossFlat(),
                                cbs=[EarlyStoppingCallback(monitor='accuracy', min_delta=0.01, comp=np.greater, patience=5),
                                     SaveModelCallback(monitor="accuracy", comp=np.greater, fname="best_model")]
                               ).to_fp16()
learn.load_encoder("ft_enc_v2");
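
One further observation on this snippet (not from the thread): CrossEntropyLossFlat and accuracy are classification tools, so a multi-label regression setup would presumably swap them out. A sketch under that assumption (MSELossFlat and rmse are fastai's regression loss and metric; passing n_out=len(labels) is an assumption, since the number of outputs cannot be inferred from a plain TransformBlock target):

learn = text_classifier_learner(dbch, AWD_LSTM, path=home, n_out=len(labels),
                                loss_func=MSELossFlat(), metrics=[rmse]).to_fp16()
learn.load_encoder("ft_enc_v2")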