Fastai v2 chat

No, all affine/coord/lighting transforms are done on the GPU

1 Like

@sgugger question on this new naming change. Is there still a train_dl, valid_dl, and test_dl (like dbunch.train_dl)? Or is it index-based, e.g. dbunch[0] (to more easily allow unlimited sets)? I was noticing this in the commits and wanted to be sure this is the behavior I’m reading

Edit: it seems like there still is, but I also see behaviors like dbunch[x]. Could you enlighten me a little? :slight_smile:

There is no test_dl attribute, but there are train_dl and valid_dl attributes for the first two. You can also index directly (for your second validation dl, for instance).
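
For example (a rough sketch; dbunch is any DataLoaders object, and the dl at index 2 is hypothetical):

dbunch.train_dl    # first dataloader, by name
dbunch.valid_dl    # second dataloader, by name
dbunch[0]          # same dataloader as dbunch.train_dl, by index
dbunch[2]          # e.g. a second validation dl, reachable only by index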

1 Like

Thanks for the clarification :slight_smile:

The release with the renamed DataBunch/DataSource is now on PyPI.

2 Likes

Second breaking change, in DataBlock this time: we moved the transforms to the init (they are no longer passed to .dataloaders), which means you need to move those bits of code. All notebooks in the fastai2 repo, the examples, and the course notebooks are up to date with this change, if you need some examples.

This will allow us to have a better representation/summary for the DataBlock class, and some useful debug methods are in the oven (all of which need to know the transforms).

To easily change the item_tfms or batch_tfms of a given DataBlock, use DataBlock.new.
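
For example, the new pattern looks roughly like this (a minimal image-classification sketch; path, the blocks and the specific transforms are illustrative, and .new is assumed to accept the same transform kwargs):

dblock = DataBlock(blocks=(ImageBlock, CategoryBlock),
                   get_items=get_image_files,
                   splitter=RandomSplitter(),
                   get_y=parent_label,
                   item_tfms=Resize(224),        # previously passed to .dataloaders
                   batch_tfms=aug_transforms())  # now given in the init
dls = dblock.dataloaders(path)

dblock2 = dblock.new(item_tfms=RandomResizedCrop(128))  # swap transforms on an existing DataBlock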

2 Likes

Follow-up on the rename DataBunch -> DataLoaders, I’ve pushed the renaming for the subclasses. To update existing code, run the following:

find . -type f -exec perl -pi -e 's/\bTextDataBunch\b/TextDataLoaders/g' {} +
find . -type f -exec perl -pi -e 's/\bImageDataBunch\b/ImageDataLoaders/g' {} +
find . -type f -exec perl -pi -e 's/\bSegmentationDataBunch\b/SegmentationDataLoaders/g' {} +
find . -type f -exec perl -pi -e 's/\bTabularDataBunch\b/TabularDataLoaders/g' {} +
find . -type f -exec perl -pi -e 's/\bCollabDataBunch\b/CollabDataLoaders/g' {} +

@sgugger, are you also planning to change the DataBunch subclasses like ImageDataBunch?

Just did actually :wink:

1 Like

That was quick: I was looking at the course examples :slight_smile:

Following up on the naming: DataLoaders.train_dl and DataLoaders.valid_dl were becoming redundant and inconsistent with the simpler Datasets.train/Datasets.valid and TfmdLists.train/TfmdLists.valid.

So now it’s simply DataLoaders.train and DataLoaders.valid. Again, all course notebooks and examples have been updated for this change.
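
In code, the change is just (sketch):

xb, yb = dls.train.one_batch()   # was dls.train_dl.one_batch()
len(dls.valid)                   # was dls.valid_dl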

1 Like

For text in the new DataLoaders format, do we have a backwards option?

This has not been implemented yet.

Edit: Sorry I meant not fully implemented. You have the option at the dataloader level, but I think it’s not properly propagated. In any case it’s not tested, so there might be bugs.
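
If you want to experiment anyway, it should look roughly like this (untested per the above; backwards on LMDataLoader is the dataloader-level option mentioned, so treat the exact flag name as an assumption):

dl = LMDataLoader(dset, bs=64, seq_len=72, backwards=True)  # assumed flag name, not fully propagated yet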

1 Like

Thanks. I will try it.

Thank you very much.

@sgugger I updated to the latest version (of fastai2 and fastcore) to use Datasets and dataloaders, and started getting an error when using SentencePieceTokenizer (it was working with 0.0.6). It seems to be something with the defaults and the special tokens:

sp=SentencePieceTokenizer(sp_model='spm15kpt.model')

The error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-35-02509167f33d> in <module>
----> 1 sp=SentencePieceTokenizer(sp_model='spm15kpt.model')

/media/hdd3tb/data/fastai2/fastai2/text/core.py in __init__(self, lang, special_toks, sp_model, vocab_sz, max_vocab_sz, model_type, char_coverage, cache_dir)
    306         self.vocab_sz,self.max_vocab_sz,self.model_type = vocab_sz,max_vocab_sz,model_type
    307         self.char_coverage = ifnone(char_coverage, 0.99999 if lang in eu_langs else 0.9998)
--> 308         self.special_toks = ifnone(special_toks, defaults.text_spec_todsrc.tokenizer[1].lengthsk)
    309         if sp_model is None: self.tok = None
    310         else:

AttributeError: 'types.SimpleNamespace' object has no attribute 'text_spec_todsrc'

That was some bad text introduced by mistake. Fixed now.

1 Like

I’m trying to understand the logic behind all the data classes. This is what I understood so far:

  • Transform applies a transformation from an input to an output
  • Pipeline is used to apply a sequence of Transforms, sorted by their order attribute
  • DataLoaders is what we need to train a model; it will typically have at least a train and a validation data loader, and is created from Datasets, TfmdLists, or DataBlock
  • DataBlock helps in building DataLoaders by specifying the types of inputs/outputs we will have, and internally creates the relevant Pipeline for each input/output
  • Datasets are created by applying sets of Transforms to each input/output
  • TfmdLists is similar to Datasets and lets us create DataLoaders, but is more versatile since we define our own custom Pipeline

Is that correct?

2 Likes

Thanks!

Almost.

  • DataBlock is a general blueprint where you use blocks for inputs/targets, specify how to get, label and split your samples. See notebook 50 for many examples of use.
  • TfmdLists is a set of items with one Pipeline of transforms, possibly with a training and validation set if you specified splits
  • Datasets groups TfmdLists together, because you usually have one pipeline of transforms for your inputs and another for your targets. This is what helps us have the block API in DataBlock.

You would use TfmdLists instead of Datasets only if you have a specific pipeline of transforms that creates the final tuple itself, instead of grouping pipelines together.
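
A minimal, hypothetical sketch of the difference (the items and label_func are purely illustrative):

items = ['cat_1.jpg', 'dog_2.jpg', 'cat_3.jpg']
def label_func(o): return o.split('_')[0]

# Datasets: one Pipeline per tuple element; the grouping into (x, y) is done for you
dsets = Datasets(items, tfms=[[noop], [label_func, Categorize()]])

# TfmdLists: a single Pipeline whose transforms must build the final tuple themselves
tls = TfmdLists(items, tfms=[lambda o: (o, label_func(o))])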

8 Likes

Another small issue I am facing. Trying to use LabelSmoothingCrossEntropy and receiving the following message:

‘LabelSmoothingCrossEntropy’ object has no attribute ‘mean’