No, all affine/coord/lighting transforms are done on the GPU.
@sgugger question on this new naming change. Is there still a train, valid, and test_dl (like `dbunch.train_dl`)? Or is it index based, `dbunch[0]` (to more easily allow unlimited sets)? I was noticing this in the commits and wanted to be sure this is the behavior I'm reading.
Edit: it seems like there still is, but I also see behaviors like `dbunch[x]`. Could you enlighten me a little?
There is no `test_dl` attribute, but there are `train_dl` and `valid_dl` attributes for the first two. You can also index directly (for your second validation dl, for instance).
Thanks for the clarification
The release with renamed DataBunch/DataSource is now on pypi.
Second breaking change, in `DataBlock` this time. We moved the transforms to the init (and not `.dataloaders` anymore), which requires you to move those bits of code. All notebooks in the fastai2 repo, examples and the course notebooks are up to date with that change, if you need some examples.
This will allow us to have a better representation/summary for the `DataBlock` class, and some useful debug methods are in the oven (all of that needs to know the transforms).
To easily change the `item_tfms` or `batch_tfms` of a given `DataBlock`, use `DataBlock.new`.
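The idea behind `DataBlock.new` can be sketched in plain Python (this is not fastai's implementation; only the `item_tfms`/`batch_tfms` names come from the post, the rest is illustrative): the transforms live on the blueprint itself, and `new` returns a copy with some of them swapped out.

```python
from dataclasses import dataclass, replace
from typing import Callable, Tuple

@dataclass(frozen=True)
class BlockSketch:
    """Blueprint holding its transforms; `new` copies it with changes."""
    item_tfms: Tuple[Callable, ...] = ()
    batch_tfms: Tuple[Callable, ...] = ()

    def new(self, item_tfms=None, batch_tfms=None):
        # Keep the existing transforms unless a replacement is given
        return replace(
            self,
            item_tfms=self.item_tfms if item_tfms is None else item_tfms,
            batch_tfms=self.batch_tfms if batch_tfms is None else batch_tfms,
        )

resize = lambda x: x  # stand-in for a real transform
block = BlockSketch(item_tfms=(resize,))
block2 = block.new(batch_tfms=(resize,))
assert block2.item_tfms == (resize,)   # carried over unchanged
assert block2.batch_tfms == (resize,)  # swapped in
```

Since the blueprint now knows its transforms up front, a summary or debug method can inspect them before any dataloader is built, which is the motivation given above.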
Follow-up on the rename `DataBunch` -> `DataLoaders`: I've pushed the renaming for the subclasses. To update existing code, run the following:
find . -type f -exec perl -pi -e 's/\bTextDataBunch\b/TextDataLoaders/g' {} +
find . -type f -exec perl -pi -e 's/\bImageDataBunch\b/ImageDataLoaders/g' {} +
find . -type f -exec perl -pi -e 's/\bSegmentationDataBunch\b/SegmentationDataLoaders/g' {} +
find . -type f -exec perl -pi -e 's/\bTabularDataBunch\b/TabularDataLoaders/g' {} +
find . -type f -exec perl -pi -e 's/\bCollabDataBunch\b/CollabDataLoaders/g' {} +
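If perl is not available, the same word-boundary substitution can be done in Python. This is a hedged sketch (the `update_names` helper is mine, not from fastai); it operates on a string, so you would apply it per file yourself.

```python
import re

# Same \b-bounded renames as the perl one-liners above
renames = {
    "TextDataBunch": "TextDataLoaders",
    "ImageDataBunch": "ImageDataLoaders",
    "SegmentationDataBunch": "SegmentationDataLoaders",
    "TabularDataBunch": "TabularDataLoaders",
    "CollabDataBunch": "CollabDataLoaders",
}

def update_names(source: str) -> str:
    for old, new in renames.items():
        source = re.sub(rf"\b{old}\b", new, source)
    return source

print(update_names("dls = ImageDataBunch.from_folder(path)"))
# → dls = ImageDataLoaders.from_folder(path)
```

The `\b` word boundaries matter: they keep the substitution from touching identifiers that merely contain one of the old names.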
Just did actually
That was quick: I was looking at course examples.
Following up on the naming, `DataLoaders.train_dl` and `DataLoaders.valid_dl` were becoming redundant and inconsistent with the simpler `Datasets.train`/`Datasets.valid` and `TfmdLists.train`/`TfmdLists.valid`.
So now it's simply `DataLoaders.train` and `DataLoaders.valid`. Again, all course notebooks and examples have been updated for that change.
For text in the new DataLoaders format, do we have a backwards option?
This has not been implemented yet.
Edit: Sorry, I meant not fully implemented. You have the option at the dataloader level, but I think it's not properly propagated. In any case it's not tested, so there might be bugs.
Thanks. I will try it.
Thank you very much.
@sgugger I updated to the latest version (fastai2 and fastcore) to use Datasets and dataloaders, and started to get an error when using SentencePieceTokenizer (it was working with 0.0.6). It seems to be something with the defaults and the special tokens:
sp=SentencePieceTokenizer(sp_model='spm15kpt.model')
The error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-35-02509167f33d> in <module>
----> 1 sp=SentencePieceTokenizer(sp_model='spm15kpt.model')
/media/hdd3tb/data/fastai2/fastai2/text/core.py in __init__(self, lang, special_toks, sp_model, vocab_sz, max_vocab_sz, model_type, char_coverage, cache_dir)
306 self.vocab_sz,self.max_vocab_sz,self.model_type = vocab_sz,max_vocab_sz,model_type
307 self.char_coverage = ifnone(char_coverage, 0.99999 if lang in eu_langs else 0.9998)
--> 308 self.special_toks = ifnone(special_toks, defaults.text_spec_todsrc.tokenizer[1].lengthsk)
309 if sp_model is None: self.tok = None
310 else:
AttributeError: 'types.SimpleNamespace' object has no attribute 'text_spec_todsrc'
That was some bad text introduced by mistake. Fixed now.
I'm trying to understand the logic behind all the data classes. This is what I understood so far:
- `Transform` applies a transformation from an input to an output
- `Pipeline` is used to apply a sequence of `Transform`s, reordered based on `order`
- `DataLoaders` are what we need to train a model; they typically have at least a `train` and `validation` data loader and are created from `Datasets`, `TfmdLists` or `DataBlock`
- `DataBlock` helps in building `DataLoaders` by specifying the type of inputs and outputs we will have, and internally creates the relevant `Pipeline` for each input/output
- `Datasets` are created by applying sets of `Transform`s to each input/output
- `TfmdLists` is similar to `Datasets` and lets us create `DataLoaders`, but is more versatile as we define our custom `Pipeline`

Is that correct?
Thanks!
Almost.

- `DataBlock` is a general blueprint where you use blocks for inputs/targets and specify how to get, label and split your samples. See notebook 50 for many examples of use.
- `TfmdLists` is a set of items with one `Pipeline` of transforms, maybe with a training and validation set if you specified splits.
- `Datasets` groups `TfmdLists` together, because you usually have one pipeline of transforms for your inputs and another for your targets. This is what helps us have the block API in `DataBlock`.
You would use `TfmdLists` instead of `Datasets` only if you have a specific pipeline of transforms that creates the final tuple itself, instead of grouping pipelines together.
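The distinction above can be sketched in plain Python (an illustration of the idea, not fastai's code): a pipeline is just composed functions, `TfmdLists` applies one pipeline per item, and `Datasets` zips several `TfmdLists` over the same items into tuples.

```python
def pipeline(*fns):
    """Compose functions left to right into one callable."""
    def run(x):
        for f in fns:
            x = f(x)
        return x
    return run

class TfmdListsSketch:
    """One set of items, ONE pipeline of transforms."""
    def __init__(self, items, *fns):
        self.items, self.tfm = items, pipeline(*fns)
    def __getitem__(self, i): return self.tfm(self.items[i])

class DatasetsSketch:
    """Groups one TfmdLists per element of the sample tuple."""
    def __init__(self, items, tfms_per_part):
        self.tls = [TfmdListsSketch(items, *fns) for fns in tfms_per_part]
    def __getitem__(self, i): return tuple(tl[i] for tl in self.tls)

files = ["cat_1.jpg", "dog_2.jpg"]
load  = lambda f: f.upper()        # stand-in for "open the image"
label = lambda f: f.split("_")[0]  # stand-in for "extract the label"

# One pipeline for inputs, another for targets, over the same items
dsets = DatasetsSketch(files, [[load], [label]])
assert dsets[0] == ("CAT_1.JPG", "cat")
```

So you reach for `TfmdLists` alone only when a single pipeline already produces the full sample (the tuple) by itself; otherwise `Datasets` does the grouping for you, which is what the block API builds on.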
Another small issue I am facing. Trying to use LabelSmoothingCrossEntropy and receiving the following message:
'LabelSmoothingCrossEntropy' object has no attribute 'mean'