Fastai v2 chat

(Zachary Mueller) #383

Question: is it possible to include our own custom dicts in a Pipeline? Or must we use a DataBlock object? For example, I’m trying to do ImageWoof at the moment with a custom dict for my ys. I can do:

woof = DataBlock(blocks=(ImageBlock, CategoryBlock),
                 get_items=get_image_files,
                 splitter=GrandparentSplitter(valid_name='val'),
                 get_y=parent_label)

dbunch = woof.databunch(untar_data(URLs.IMAGEWOOF), bs=32, item_tfms=RandomResizedCrop(128),
                        batch_tfms=Normalize(*imagenet_stats))

To databunch it. Would I pass the dict into my item transforms?

Solved it :slight_smile: we pass it into our get_y like so:
get_y=[parent_label, lbl_dict.__getitem__] Cool!
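For reference, passing a list to get_y applies the functions in order, i.e. composition. A minimal pure-Python sketch of that (parent_label is reimplemented here, and the lbl_dict entries are illustrative, not the full ImageWoof mapping):

```python
from pathlib import Path

# Illustrative entries only -- the real ImageWoof lbl_dict maps all ten
# WordNet folder IDs to readable breed names
lbl_dict = {"n02086240": "Shih-Tzu", "n02087394": "Rhodesian ridgeback"}

def parent_label(path):
    # Same idea as fastai's parent_label: the parent folder name is the label
    return Path(path).parent.name

# get_y=[parent_label, lbl_dict.__getitem__] amounts to applying each in turn:
def get_y(path):
    return lbl_dict.__getitem__(parent_label(path))

print(get_y("train/n02086240/img_001.jpg"))  # → Shih-Tzu
```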

2 Likes

(Jeremy Howard (Admin)) #384

Check the text nbs - they show how to do this already.

0 Likes

(Alex Amadori) #385

As far as I understand, the text nbs pass the tokenized dataset to DataSource. Following the chain of inheritance, the dataset eventually gets converted to an L, which excludes the possibility of using a cached dataset that is not fully loaded in memory. It can be made to sort of work by defining the ds like this:

ds = DataSource(range_of(large_dataset), tfms=[[lambda idx: large_dataset[idx], itemgetter(0), ...]])

But range_of is fully unrolled, and over 2 GB are being used just for storing the list of indexes (in my case, I don’t need shuffling, so it’s just an unrolled range).
I’ll try to look for fixes that could alleviate this.

0 Likes

(Jeremy Howard (Admin)) #386

You asked about tokenization and numericalization, neither of which uses a DataSource, and both of which can operate on folders of files which are only loaded one chunk at a time. I’m not quite following your issue…

1 Like

(Alex Amadori) #387

I apologize, clarity is not one of my strengths. I realized I don’t need a DataSource to set up numericalization, and I can tokenize single chunks of text.

Now I’m trying to use my data with DataBunch and Learner. In my case, the whole dataset is saved as a huge (~10 GB) tsv file. Since I can’t load the file in memory, I can’t think of a good solution to make the dataset indexable (to implement __getitem__). I wonder if there’s a way to use fastai’s DataLoader without making the assumption that the dataset object implements __getitem__.

For example, in PyTorch you can subclass IterableDataset and then pass it to a DataLoader. This is really practical, because all I’d need to do is yield each line in the dataset’s __iter__ method.
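A rough sketch of that pattern in plain Python (the class name is hypothetical; with PyTorch you would subclass torch.utils.data.IterableDataset and hand the instance to a DataLoader):

```python
import csv

class TsvStream:
    """Hypothetical iterable dataset: streams rows of a large TSV one at
    a time instead of loading the whole file in memory. With PyTorch, the
    same __iter__ on a torch.utils.data.IterableDataset subclass can be
    passed straight to a DataLoader."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, newline="") as f:
            # csv.reader with a tab delimiter yields one row per line, lazily
            yield from csv.reader(f, delimiter="\t")

# Usage sketch:
# for row in TsvStream("data.tsv"):
#     process(row)
```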

Sorry if there’s already a way and I missed it. fastai is a real gem and every day I realize it’s even more flexible than I thought the day before. To everyone who’s involved in the development, thanks for all your work!

EDIT: just found out about bcolz. Let’s see what it can do :smiley:

0 Likes

(Zachary Mueller) #388

PyTorch 1.3.1 has been released; I can confirm the vision issue has been fixed on Colab.

2 Likes

(Aman Arora) #389

Question:

I was looking into TfmdList and above is something that I did not expect to happen.

From my understanding, calling setup in TfmdList calls setup on self.tfms passing in self.train, which is what I have tried to replicate in the img.

I am not sure why I get different vocabs, what am I doing wrong please?

1 Like

(Aman Arora) #390

Never mind for now :slight_smile:

Different dsrc objects passed to Categorize lead to this inconsistency.

I will keep looking and see why that happens.

# tl dsrc
TfmdList: [PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/basset_hound_111.jpg'), PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Siamese_178.jpg'), PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/keeshond_34.jpg'), PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/german_shorthaired_94.jpg')]
tfms - (#1) [Transform: True (object,object) -> RegexLabeller ]


# pipe dsrc
TfmdList: [PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/basset_hound_111.jpg'), PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Siamese_178.jpg'), PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/keeshond_34.jpg'), PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/german_shorthaired_94.jpg')]
tfms - (#2) [Transform: True (object,object) -> RegexLabeller ,Categorize: True (object,object) -> encodes (object,object) -> decodes]
1 Like

(Jeremy Howard (Admin)) #391

bcolz is a great choice. You can also create iterable datasets in fastai v2. Have a look at the notebook where DataLoader is defined to see various approaches to that. Since I haven’t actually had a need to use them myself yet, they’re not tested in a practical setting - so do let us know if you try them out and have any issues.

It’s a pleasure! :slight_smile:

0 Likes

(Aman Arora) #392

Somehow, I haven’t been able to replicate the setup method of TfmdList; could I please double-check my understanding of the code?

Referring to the image:
The first step is getattr(self, 'train', self), which means we get the train subset where i=0; therefore, in step-2, we get splits[0] and return the items in that split.

Then, in step-2 itself, we call _new, which takes us to step-4 (skipping step-3, as that is only the _get that fetches items, defined in L), where we call super()._new(items, tfms=self.tfms, do_setup=False, **kwargs).

This takes us to step-5, because _new is defined inside L; all it does is return a new object of type TfmdList with the items whose idx is in splits[0], with do_setup=False. So we get a new TfmdList with the same tfms.

This takes us back to step-6, because now we have self.train, which is a TfmdList with the items in splits[0].
In step-6, we call self.tfms.setup(), passing in this very self.train object.

Since self.tfms is a Pipeline, now we get to step-7 which finally performs setup on the individual Transforms which for pets tutorial for y variable are [RegexLabeller(pat), Categorize].

This Pipeline object self is nothing but TfmdList.tfms or TfmdList.train.tfms, right? The items arg that we have passed to step-7, is the train TfmdList with the same tfms as the original TfmdList.

Therefore, this items obj (which is nothing but the original TfmdList’s train subset) passed to Pipeline.setup looks something like:

TfmdList: [PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/basset_hound_111.jpg'), PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Siamese_178.jpg'), PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/keeshond_34.jpg'), PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/german_shorthaired_94.jpg')]
tfms - (#2) [Transform: True (object,object) -> RegexLabeller ,Categorize: True (object,object) -> encodes (object,object) -> decodes]

If originally,

items
>> (#5) [/home/ubuntu/.fastai/data/oxford-iiit-pet/images/keeshond_34.jpg,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Siamese_178.jpg,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/german_shorthaired_94.jpg,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Abyssinian_92.jpg,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/basset_hound_111.jpg]

splits
>> ((#4) [4,1,0,2], (#1) [3])

tfms
>> [<local.data.transforms.RegexLabeller at 0x7f49c026fd30>,
 local.data.transforms.Categorize]

This is where it gets a little confusing.

When we do self.fs.clear(), the items are now:

ipdb> items
TfmdList: [PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/basset_hound_111.jpg'), PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Siamese_178.jpg'), PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/keeshond_34.jpg'), PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/german_shorthaired_94.jpg')]
tfms - (#0) []

The tfms are cleared on items, which was the train subset of the original TfmdList!

Why would this be?

My understanding:

  1. A TfmdList and a TfmdList.subset(0) share the same tfms, stored in both objects’ self.tfms.
  2. This self.tfms is a Pipeline that we call setup on, so when we do self.fs.clear(), it clears the Transforms on both TfmdList and TfmdList.subset(0).
  3. Since this TfmdList.subset(0) is what gets passed as items, the tfms get cleared when we do self.fs.clear().

Following on the above understanding, therefore, when I do:

_tl = TfmdList(items[splits[0]], tfms=[RegexLabeller(pat)], do_setup=False)
pipe = Pipeline(tfms, as_item=True)
pipe.setup(_tl)
pipe.vocab

>> (#4) [Siamese,basset_hound,german_shorthaired,keeshond]

It works :slight_smile:
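A toy stand-in (not fastai’s actual classes, just the aliasing pattern described in points 1–3 above):

```python
class MiniTfmdList:
    """Toy stand-in for TfmdList: a subset receives the SAME tfms list,
    mirroring how _new passes tfms=self.tfms to the new object."""
    def __init__(self, items, tfms):
        self.items, self.tfms = items, tfms

    def subset(self, idxs):
        # Parent and subset now alias one and the same tfms list
        return MiniTfmdList([self.items[i] for i in idxs], self.tfms)

parent = MiniTfmdList(["a", "b", "c"], tfms=["labeller", "categorize"])
train = parent.subset([0, 1])

train.tfms.clear()   # like self.fs.clear() inside Pipeline.setup
print(parent.tfms)   # → [] -- cleared on the parent as well
```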

1 Like

#394

can I get some advice with installing v2? On my windows 10 machine I’m doing :

git clone git@github.com:fastai/fastai_dev.git
cd .\fastai_dev\
conda env create -f environment.yml
conda activate fastai_dev
pip install git+https://github.com/fastai/fastai_dev

then running python from the shell where I get the following error

>>> from fastai2.basics import *
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'fastai2'

Does anyone have any advice?

0 Likes

(Zachary Mueller) #395

I had a thought and I want your opinion before I try to do anything large like this. We have ClassificationInterpretation, but why not just an Interpreter, not limited to classification? For example, for regression we could plot the worst guesses (for image points), or for tabular we could cluster how close the guesses were to the correct point. We could also implement permutation importance for tabular here, to make it easier to analyze tabular models. In terms of NLP, I need to revisit the NLP course again, but I imagine something along the lines of word clustering and how close we came with our Language Models. I can try to come up with something for the tabular or vision models if you want a visualization (which I know always helps).

Let me know your thoughts on this and if you know of any other regression-based analysis techniques that should be included in the above. :slight_smile:

0 Likes

#396

Thought I might mention that SegmentationInterpretation and TextClassificationInterpretation classes already exist. I am not sure it makes sense to have a general Interpreter class, as each task will have different interpretation methods; I think that’s why the current approach has been to have separate classes for separate tasks.

1 Like

(Zachary Mueller) #397

I suppose that makes a lot more sense and sounds better :slight_smile: I was thinking of something along the lines of (in terms of code) a generic class that can pick up what type is being passed in and apply particular functions to it, hence the Interpreter class.

1 Like

#398

Sure, I guess it wouldn’t be hard to do something like that with type dispatching (IIRC that’s the correct term). However, it’s still hard for a user, because there aren’t any set functions: there are different interpretation functions for each task.

0 Likes

(Zachary Mueller) #399

Thanks for the check! (We just learned about type dispatching in my Intermediate class a few weeks ago; I forgot the proper term for a moment.)

That makes sense. So perhaps then TextInterpretation, VisionInterpretation, and TabularInterpretation? (If these regression/language interpretations were added.)

0 Likes

#400

Again, there could be various NLP, vision, or tabular tasks, each with their own interpretation methods. However, I don’t want to be too negative, and I will let Jeremy and Sylvain decide what the best approach is here.

1 Like

(Zachary Mueller) #401

You’re certainly not being too negative :slight_smile: Constructive criticism is always welcome :slight_smile: For now, I’ll wait for Jeremy or Sylvain, and mull over which specific implementation ideas could convert over to one of the existing interps easily (a feature importance or feature visualization could easily be done IMO for tabular, at least).

1 Like

#402

There is already an Interpretation class, and a plot_top_losses method that has a type-dispatched counterpart (like show_batch and show_results). For now it only handles image and text classification, but more visualization cases can be added with type dispatch.
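For the single-argument case, the standard library’s functools.singledispatch illustrates the idea (fastai v2’s own typedispatch is more general and can dispatch on multiple arguments; the function below is a hypothetical stand-in, not fastai’s API):

```python
from functools import singledispatch

@singledispatch
def interpret(pred):
    # Fallback when no more specific type is registered
    return "generic interpretation"

@interpret.register
def _(pred: str):     # stand-in for a text prediction
    return "text interpretation"

@interpret.register
def _(pred: list):    # stand-in for an image batch
    return "image interpretation"

print(interpret("hello"))  # → text interpretation
print(interpret(3.14))     # → generic interpretation
```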

3 Likes

(Zachary Mueller) #403

Got it! I’ll look into that more. Thanks @sgugger :slight_smile:

0 Likes