Where is the todo list for v2? I’d love to help knock out a few small things.
Also, any thoughts on a feature that auto-adds a Table of Contents at the top of the notebook based on headers? It would be nested based on header size. I’d be happy to do it.
There’s also a JupyterLab extension that generates a sidebar TOC dynamically.
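The nesting logic could be sketched roughly like this (a minimal sketch assuming markdown-style headers in the notebook cells; `build_toc` and its anchor scheme are made up for illustration, not from any existing extension):

```python
import re

def build_toc(markdown_lines):
    "Return a nested markdown list linking to each header, indented by header level."
    toc = []
    for line in markdown_lines:
        m = re.match(r'^(#{1,6})\s+(.*)', line)
        if m:
            level, title = len(m.group(1)), m.group(2).strip()
            anchor = title.lower().replace(' ', '-')
            toc.append('  ' * (level - 1) + f'- [{title}](#{anchor})')
    return '\n'.join(toc)
```

The real version would need to collect markdown cells from the notebook JSON and handle duplicate anchors, but the header-level-to-indent mapping is the core idea.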
I was able to log data similar to the image below on this W&B run.
For simplicity of the test, I logged training loss at every epoch and validation/accuracy/prediction samples every 4 epochs.
I used an old project with v1.
Let me know if this is what you want and I can adapt it and submit a PR to fastai_dev.
What would be the process for preparing a quick script to install libraries? Reasoning: one specifically for Google Colab (as things like a PIL update, among others, need to happen). Are there any recommendations, besides just a “do x before doing anything” note or a warning on the repo?
It needs a few more, such as a Pillow upgrade and possibly one more after a fix is implemented in PyTorch 1.3.1, along with a notice to restart the kernel after running it.
There’s an issue in Colab (see the v2 vision thread) that’s being fixed soon (tonight). I was looking for a quick one-liner, and your extended pip command should do the trick, thanks. Once PyTorch has pushed the update, I’ll verify it works as intended and then post on GitHub (as pretty much all the vision transforms won’t work until then).
I thought I would post this video I watched a couple of weeks ago:
It goes over concepts like metaclasses, context managers, etc. which seem to be important parts of the lower-level API of fastai v2. I hope this is a helpful video.
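For anyone who hasn’t met these concepts, here’s a toy sketch of both (purely illustrative; the names `temporary`, `Registered`, and `MyTransform` are made up here and are not fastai’s actual implementation):

```python
from contextlib import contextmanager

@contextmanager
def temporary(lst, item):
    "Context manager: add item on enter, remove it again on exit."
    lst.append(item)
    try:
        yield lst
    finally:
        lst.remove(item)

class Registered(type):
    "Metaclass: keep a registry of every class defined with it."
    registry = []
    def __new__(mcls, name, bases, ns):
        cls = super().__new__(mcls, name, bases, ns)
        mcls.registry.append(cls)
        return cls

class MyTransform(metaclass=Registered):
    pass
```

Context managers guarantee cleanup even on exceptions, and metaclasses let a library hook into class creation itself, which is the kind of machinery the video covers.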
I’m trying to load a seq2seq dataset that is larger than what can fit in memory.
I can load samples with this class derived from a torch Dataset:
import linecache
from torch.utils.data import Dataset

class SeqDataset(Dataset):
    # ...
    def get_line(self, idx):
        # linecache reads a single line lazily; note it is 1-indexed
        line = linecache.getline(self.path, idx + 1)
        return line.rstrip("\n").split("\t")

    def __getitem__(self, idx):
        return self.get_line(idx)[:2]
I’m trying to see if I can make this work with the tools from fastai v2. The first thing I need to do is set up a Pipeline for tokenization and numericalization. However, peeking into the code, I realized transforms are set up by going through the whole training set! In my case, not only can I not afford to do that, it’s also unnecessary because my data has a very limited vocabulary. Also, TfmdList expects the input items to be listy, but I can’t load my whole dataset into a list.
Do you think it would be a good inclusion to make a version of DataSource or TfmdList specific to large datasets, for example by using only a small subset to set up the transforms?
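As a rough illustration of the subset idea, the vocab could be built from just the first N lines of the file (a sketch; `build_vocab_from_subset` is a hypothetical helper for this thread, not a fastai API):

```python
import itertools
from collections import Counter

def build_vocab_from_subset(path, n_lines=10_000, min_freq=1):
    "Build a token->id vocab from only the first n_lines of a large tsv."
    counts = Counter()
    with open(path) as f:
        for line in itertools.islice(f, n_lines):
            # assumes tab-separated (source, target) pairs, whitespace tokens
            for field in line.rstrip('\n').split('\t')[:2]:
                counts.update(field.split())
    itos = ['<unk>'] + [tok for tok, c in counts.most_common() if c >= min_freq]
    return {tok: i for i, tok in enumerate(itos)}
```

With a limited vocabulary like mine, a few thousand lines should already see every token, so the transforms never need to scan the full file.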
Question: is it possible to include our own custom dicts in a Pipeline? Or must we use a DataBlock object? For example, I’m trying to do ImageWoof at the moment with a custom dict for my y’s. I can do:
As far as I understand, the text nbs pass the tokenized dataset to DataSource. Following the chain of inheritance, the dataset eventually gets converted to an L, which excludes the possibility of using a cached dataset that is not fully loaded in memory. It can be made to sort of work by defining the ds like this:
But range_of is fully unrolled, and over 2 GB are being used just for storing the list of indexes (in my case, I don’t need shuffling, so it’s just an unrolled range).
I’ll try to look for fixes that could alleviate this.
You asked about tokenization and numericalization, neither of which uses a DataSource, and both of which can operate on folders of files which are only loaded one chunk at a time. I’m not quite following your issue…
I apologize, clarity is not one of my strengths. I realized I don’t need a DataSource to set up numericalization, and I can tokenize single chunks of text.
Now I’m trying to use my data with DataBunch and Learner. In my case, the whole dataset is saved as a huge (~10 GB) tsv file. Since I can’t load the file in memory, I can’t think of a good solution to make the dataset indexable (to implement __getitem__). I wonder if there’s a way to use fastai’s DataLoader without making the assumption that the dataset object implements __getitem__.
For example, in PyTorch you can subclass IterableDataset and then pass it to a DataLoader. This is really practical, because all I’d need to do is yield each line in the dataset’s __iter__ method.
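To make the idea concrete, here’s a minimal sketch of the PyTorch pattern I mean (assuming one tab-separated (source, target) pair per line; `TsvStream` and the file name are illustrative):

```python
import csv
from torch.utils.data import IterableDataset, DataLoader

class TsvStream(IterableDataset):
    "Yield (source, target) pairs line by line, never loading the whole file."
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path, newline='') as f:
            for row in csv.reader(f, delimiter='\t'):
                yield row[0], row[1]

# dl = DataLoader(TsvStream('big_dataset.tsv'), batch_size=64)
```

No `__getitem__` and no `__len__` needed: the DataLoader just pulls from the iterator, which is exactly what a huge on-disk tsv wants.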
Sorry if there’s already a way and I missed it. fastai is a real gem and every day I realize it’s even more flexible than I thought the day before. To everyone who’s involved in the development, thanks for all your work!
EDIT: just found out about bcolz. Let’s see what it can do