Fastai v2 chat

MadeUpMasters · January 22, 2020, 1:49pm

Thanks for the suggestion, I did this and ran nbdev_test_nbs on fastai2 and fastcore and nothing broke (procedure I followed included below). Would you guys prefer to delete the delwrap lines or would it be easier if I submit a PR? I’ll note your preference going forward as well so I don’t ask each time

Force-pull fastai2 and reinstall with pip install -e .
Edit fastcore/foundations.py to remove the hasattr line
pip uninstall fastcore, pip install -e . inside fastcore root

sgugger · January 22, 2020, 3:28pm

You can directly use pickle to save/load your DataBunch object. It seems like you have a multilabel problem so you probably need to change the loss function.

Ah, I see the problem. text_classifier_learner hard-codes the loss function instead of picking it in the data. Will fix.

Pablo · January 22, 2020, 3:53pm

Yes, that fixed the problem! Now I have a new one, but things are moving forward

Edit: it seems like basic training is finally working! Now I just have to iron out some details.

Pablo · January 22, 2020, 4:18pm

I have some trouble with my data loading for the classifier. It works, but it seems to repeat tokenization (already done for the language model). Currently I do:

data_label_blocks = (TextBlock.from_folder(path=unsupervised_folder, vocab=self.vocab), 
                     MultiCategoryBlock(vocab=my_classes))
dsrc = DataBlock(blocks=data_label_blocks,
                 splitter=RandomSplitter(),
                 get_x=lambda x: unsupervised_folder / f'{x[0]}.txt',
                 get_y=lambda x: x[1].split(' '),
                 dl_type=SortedDL)

But I already have the data tokenized in another folder! Very naively I tried to pass that folder but then Fastai created a tokenized version of the tokenized version (which is mostly right, but not quite, and still repeating work). I suppose there is some option to say “skip tokenization”? I tried some silly ideas like TextBlock(None, vocab=self.vocab) but no luck so far.

Update: We have been exploring the code and perhaps this is already fixed by design?

From the Tokenizer class in fastai2/text/core.py we have:

@delegates(tokenize_folder, keep=True)
def from_folder(cls, path, tok_func=SpacyTokenizer, **kwargs):
    path = Path(path)
    output_dir = Path(ifnone(kwargs.get('output_dir'), path.parent/f'{path.name}_tok'))
    if not output_dir.exists(): tokenize_folder(path, **kwargs)
    res = cls(get_tokenizer(tok_func, **kwargs), counter=(output_dir/fn_counter_pkl).load(),
              lengths=(output_dir/fn_lengths_pkl).load(), mode='folder')
    res.path,res.output_dir = path,output_dir
    return res

if not output_dir.exists(): tokenize_folder(path, kwargs)
So it seems like no double work is being done

muellerzr · January 23, 2020, 3:25pm

What would be thoughts on including another dataset? This would be the one in question:

It needs a bit of cleaning for it to work (I’ve done this), the goal would be for a keypoint/pose detection based dataset

(The heatmap tutorial will be using this dataset)

Pablo · January 23, 2020, 4:47pm

I see there are some interesting changes in callbacks.

A good change is that we don’t need to pass the learner as a parameter (rather the callback is a parameter of the learner, which makes more sense) and that we don’t need to pass the callback at train.

The method (learn.show_training_loop()) to show all active callbacks at different parts of the loop is AMAZING. (Credit to David Cato.)

It’s a bit confusing that method names for events in callbacks have changed, and the base class is much harder to understand now. But there is a helpful (not trivial to find) list of events at fastai2/learner.py, called _loop

Event methods at the callbacks now don’t receive any parameters. The information is already available, although with new names as well. For instance, for the former parameter train, now we have to use self.learn.training. And other information is directly an attribute. For example, before we had the parameters last_target and last_output, and now we have self.pred (this is before sigmoid/softmax, I assume?) and self.yb (why yb?).

Do you have a quick explanation for why some info is stored at the learner and other is directly at the callback?

Update: I have been trying out combinations and it seems all attributes are accessible either through the learner or as attributes of the callback. I think this is the expected behaviour when inheriting from GetAttr.

The documentation is lacking, in my opinion, a small part explaining how to do your own callback (not much more info than in this mini-post would help a lot, I think).

sgugger · January 23, 2020, 5:17pm

No one said v2 was ready yet. The documentation is not done, and we will add tutorials, but we are still in development for now.

Pablo · January 24, 2020, 8:45am

Of course! I didn’t mean it like that. The documentation on callbacks seems rather advanced otherwise, so I was just pointing this out.

sgugger · January 24, 2020, 7:28pm

Attention, we have made some renaming that breaks everything:

DataBunch is now DataLoaders
DataSource is now Datasets
TfmdList is now TfmdLists

To automatically change your code, run this in the folder it lays

find . -type f -exec perl -pi -e 's/\bDataSource\b/Datasets/g' {} +
find . -type f -exec perl -pi -e 's/\bdatasource\b/datasets/g' {} +
find . -type f -exec perl -pi -e 's/\bDataBunch\b/DataLoaders/g' {} +
find . -type f -exec perl -pi -e 's/\bdatabunch\b/dataloaders/g' {} +
find . -type f -exec perl -pi -e 's/\bTfmdList\b/TfmdLists/g' {} +

baz · January 24, 2020, 8:11pm

Is RandomResizeCropGPU currently the only example of GPU Transform?

sgugger · January 24, 2020, 8:16pm

No, all afiine/coord/lighting transforms are done on the GPU

muellerzr · January 25, 2020, 3:16am

@sgugger question on this new naming change. Is there still train, Val, and test_dl? (Like dbunch.train_dl) Or is it index based dbunch[0] (to more easily allow unlimited sets). I was noticing this on the commits and wanted to be sure this is the behavior I’m reading

Edit: it seems like there still is, but I also see behaviors like dbunch[x]. Could you enlighten me a little?

sgugger · January 25, 2020, 2:14pm

There is no test_dl attribute, but there is a train_dl and valid_dl attribute for the first two. You can also index directly (for your second validation dl for instance).

muellerzr · January 25, 2020, 2:50pm

Thanks for the clarification

jeremy · January 25, 2020, 3:26pm

The release with renamed DataBunch/DataSource is now on pypi.

sgugger · January 25, 2020, 9:00pm

Second breaking change in DataBlock this time. We moved the transforms in the init (and not .dataloaders anymore) which needs you to move those bits of codes. All notebooks in the fastai2 repo, examples and the course notebooks are up to date with that change, if you need some examples.

This will allow us to have a better representation/summary for the DataBlock class and some useful debug methods are in the oven (and all of that needs to know the transforms).

To easily change the item_tfms or batch_tfms of a given DataBlock, use DataBlock.new.

sgugger · January 25, 2020, 9:10pm

Follow-up on the rename DataBunch -> DataLoaders, I’ve pushed the renaming for the subclasses. To update existing code, run the following:

find . -type f -exec perl -pi -e 's/\bTextDataBunch\b/TextDataLoaders/g' {} +
find . -type f -exec perl -pi -e 's/\bImageDataBunch\b/ImageDataLoaders/g' {} +
find . -type f -exec perl -pi -e 's/\bSegmentationDataBunch\b/SegmentationDataLoaders/g' {} +
find . -type f -exec perl -pi -e 's/\bTabularDataBunch\b/TabularDataLoaders/g' {} +
find . -type f -exec perl -pi -e 's/\bCollabDataBunch\b/CollabDataLoaders/g' {} +

s.s.o · January 25, 2020, 9:10pm

@sgugger, Are you also planing to change DataLoaders subclasses like ImageDataBunch ?

sgugger · January 25, 2020, 9:11pm

Just did actually

s.s.o · January 25, 2020, 9:12pm

That was quick : I was looking at couse examples