Fastai_v1, adding features

You may already create a custom dataloader; see my post about the sparse dataset above. The way fastai v1.0 handles data is about as flexible as any library built on PyTorch can get. If you have a particular use case, someone may be able to help on the forums if you create a thread about it.

Sorry for the late response :sweat_smile:

I was probably a bit unclear in my message. Yes, I agree, fastai includes a great list of standard datasets and makes working with them really simple. However, here is what I am talking about. Consider the following snippet:

from torchvision.datasets import MNIST
from fastai.vision import *

path = Path.home()/'data'/'MNIST'
train_ds = MNIST(path, train=True)
valid_ds = MNIST(path, train=False)
bunch = ImageDataBunch.create(train_ds, valid_ds)
learn = create_cnn(bunch, models.resnet18)
learn.fit_one_cycle(1)

The snippet cannot be used as-is, because a plain torchvision MNIST dataset doesn’t work with the library directly:

AttributeError: 'MNIST' object has no attribute 'c'

I mean that the fastai library is not directly compatible with the Dataset interface used by PyTorch. So I was thinking it would be great to be able to construct a DataBunch instance from “native” datasets, because right now implementing __getitem__ and __len__ is definitely not enough to build a custom fastai-ready class.

Also, the most recent version of the library (from master) seems to be very focused on the data block API, which makes it a bit difficult to construct a data bunch “manually”, I would say.


I hope my thoughts are clear :slight_smile:

Of course, I understand that the library builds a lot of additional abstractions on top of “plain” PyTorch capabilities. I would only like to note that making the library more “friendly” to PyTorch classes would be really helpful for someone who builds a lot of things manually.

Hi,

You may check my previous post just above yours :slight_smile: If you look at the screenshot of my Jupyter notebook, ds is a native PyTorch torch.utils.data.Dataset with the methods you mentioned, and dl is a native torch.utils.data.DataLoader.

All that DataBunch cares about is a DataLoader instance, which has __iter__ and __len__. Check this page out: https://pytorch.org/docs/stable/_modules/torch/utils/data/dataloader.html#DataLoader

Hope this helps.

class DataBunch():
    "Bind `train_dl`,`valid_dl` and`test_dl` to `device`. tfms are DL tfms (normalize). `path` is for models."
    def __init__(self, train_dl:DataLoader, valid_dl:DataLoader, test_dl:Optional[DataLoader]=None,
                 device:torch.device=None, tfms:Optional[Collection[Callable]]=None, path:PathOrStr='.',
                 collate_fn:Callable=data_collate):

As you see, it takes DataLoaders during construction.
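For instance, here is a minimal sketch of that route, using plain PyTorch objects end to end (the fastai.basic_data import path is an assumption on my part, since the module name moved around in early v1 releases):

import torch
from torch.utils.data import TensorDataset, DataLoader
from fastai.basic_data import DataBunch

# toy tensors standing in for a real dataset
x_train, y_train = torch.randn(512, 10), torch.randint(0, 2, (512,))
x_valid, y_valid = torch.randn(128, 10), torch.randint(0, 2, (128,))

# native PyTorch Datasets wrapped in native PyTorch DataLoaders
train_dl = DataLoader(TensorDataset(x_train, y_train), batch_size=64, shuffle=True)
valid_dl = DataLoader(TensorDataset(x_valid, y_valid), batch_size=64)

# constructed directly through the __init__ quoted above
data = DataBunch(train_dl, valid_dl)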

Best

Hi Kerem! Yes, you’re right, good example! I’m writing custom datasets as well :smile: So my point was to introduce this behaviour into the library to make interoperation with “standard” PyTorch classes simpler, i.e., to make it possible to drop in plain PyTorch loaders and datasets (maybe models as well?) in place of the fastai ones.

Hi Ilia,

Maybe I am not able to understand, but what is the exact use case that you are not able to find a way of doing with v1.0, for example?

That way it might make more sense, I guess, at least to me :slight_smile:

Best

Yes, sure, please check the snippet I’ve shared above. As far as I can understand, the DataBunch class declares (via type annotations) that it accepts torch.utils.data.Dataset instances:

class DataBunch():
    ...

    @classmethod
    def create(cls, train_ds:Dataset, valid_ds:Dataset, test_ds:Dataset=None, path:PathOrStr='.', bs:int=64,
               num_workers:int=defaults.cpus, tfms:Optional[Collection[Callable]]=None, device:torch.device=None,
               collate_fn:Callable=data_collate)->'DataBunch':

Following OOP principles, I read that signature as: I can pass any object that complies with the Dataset interface, including its subclasses. However, that is not really the case with the library. For example, I can’t do the following:

from torchvision.datasets import MNIST
from fastai.vision import *

path = Path.home()/'data'/'MNIST'
train_ds = MNIST(path, train=True)
valid_ds = MNIST(path, train=False)

# the objects don't have property `c` and cannot be directly passed into `create`
bunch = ImageDataBunch.create(train_ds, valid_ds)
learn = create_cnn(bunch, models.resnet18)
learn.fit_one_cycle(1)

I would say it expects a FastaiDataset interface that extends the original definition with additional properties. I mean, here is what we effectively have now:

# mock up interface to illustrate my idea
class FastaiDataset(Dataset):

    ... # some other properties

    @property
    def c(self):
        return len(self.classes)

# and then it should be more like this
class DataBunch():
    ...

    @classmethod
    def create(cls, train_ds:FastaiDataset, valid_ds:FastaiDataset, test_ds:FastaiDataset=None, path:PathOrStr='.', bs:int=64,
               num_workers:int=defaults.cpus, tfms:Optional[Collection[Callable]]=None, device:torch.device=None,
               collate_fn:Callable=data_collate)->'DataBunch':
        ...

Probably I am overcomplicating things :smile: The idea is that it would be great to be able to take any class compatible with the torch.utils.data.Dataset interface and pass it into DataBunch without additional manual wrappers and decorators. I mean that the type annotation is a bit misleading :sweat_smile: At least, from my point of view. Since the library strictly annotates every argument and return value, it is probably important for those annotations to reflect the interfaces that are actually expected.

Of course, it is only my personal opinion.

I’m unsure why this is an issue: why would you load MNIST like this when you have a convenience function to download the original dataset in a form fastai can open, which not only gets you create_cnn but plenty of other convenience functions like show_batch, Learner.predict, etc.?
The idea, when we say DataBunch can take any PyTorch Dataset, is that it will give you a data object you can pass into a Learner with your model, and then benefit from all the default training parameters the library offers.

As you saw, if you want to use a custom Dataset with create_cnn, you have to add a c property to it. That’s never going to change. If you load MNIST this way, the library’s data augmentation can’t work on it, and nothing we add can change that.
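In case it helps, adding that property is a one-liner; a minimal sketch with a hypothetical subclass name, not anything fastai provides:

from torchvision.datasets import MNIST

class MNISTWithC(MNIST):
    "A plain torchvision MNIST that also exposes the `c` attribute fastai looks for."
    @property
    def c(self):
        # number of classes, which fastai uses to size the model head
        return len(self.classes)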

The good answer is to adapt your Dataset to the fastai pipeline, because that’s always going to give you better results with the library.

OK, understood! Yes, I agree. I just very often find myself running into small issues when trying to make something “on the edge” between fastai and PyTorch. MNIST is just an example :smiley:

As an unrelated example, you can easily pass pandas DataFrames into scikit-learn classes even though scikit-learn works with NumPy arrays. All transformations are applied seamlessly thanks to interface compatibility.

But I agree, it is always possible to patch custom stuff to fit into fastai. I only wanted to clarify how the library is going to evolve, so that I don’t end up reinventing the wheel :smile:

You’re looking at the signature of DataBunch, which I believe can take arbitrary Datasets. However, create_cnn needs an ImageDataBunch, which needs c defined.

Yeah, that’s correct. I was thinking that if ImageDataBunch is OK with accepting arbitrary Dataset instances, then the result would work with the other library classes all the way down to the training loop. It was a bit unexpected to me that this is not the case :smile:

Ok, sorry for the misinterpretation.

Hi,
I’m wondering if the tabular data class could be extended to provide the ability to use sample weights for each entry when computing losses?
For the data I work with (high energy physics) these weights are necessary in order to allow the simulated data we train on to match reality.
Currently I use Keras for my work, which has such a feature, but having followed the DL courses I’m looking to move to using the Fast.AI library.

Having looked on the forums, there only seem to be a few topics on balancing classes via hard-coded weights and a custom loss function; with sample weights, though, the weighting depends on the particular batch being passed to the loss function, so it’s a bit more tricky.
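For reference, here is a rough sketch of the kind of thing I mean, in plain PyTorch (the column convention for the targets is made up for illustration and is not a fastai API):

import torch
import torch.nn.functional as F

def weighted_ce_loss(preds, target_and_weight):
    # hypothetical convention: column 0 holds the class index,
    # column 1 holds the per-sample weight for that entry
    target = target_and_weight[:, 0].long()
    weight = target_and_weight[:, 1]
    per_sample = F.cross_entropy(preds, target, reduction='none')
    # weight each entry's loss before averaging over the batch
    return (per_sample * weight).mean()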

Cheers


Adding an option to auto-save model state after each fit_one_cycle() cycle

Hi,
I am wondering whether an option to fit_one_cycle that autosaves the model state after each cycle would be useful? On several cloud providers I had the problem that my notebook died while training a domain language model, so I reran the 10-cycle step about 4 times…
Maybe one could specify a flag autosave=True or autosave=“modelname” and the function would dump the state automatically into modelname-01.pth, modelname-02.pth, …

If you think this sounds worthwhile, I might take a look at it. Hints on where to start (hooks?) are welcome…

Hi Christian,
Have you taken a look at the SaveModelCallback? It might already be what you need.
https://docs.fast.ai/callbacks.tracker.html#SaveModelCallback
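Roughly, the usage would be something like this (a sketch based on the linked docs page; argument defaults and the exact checkpoint naming may differ between v1 releases):

from fastai.callbacks.tracker import SaveModelCallback

# save a checkpoint at the end of every epoch of the cycle
learn.fit_one_cycle(10, callbacks=[
    SaveModelCallback(learn, every='epoch', name='modelname')
])
# checkpoints end up as modelname_0.pth, modelname_1.pth, ... under learn.path/models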

Did not know that one!

And yes, seems like it does just that!

Hi, and sorry if this request has already been discussed somewhere else!
I think that fastai is fantastic, but as an AllenNLP user I find it quite convenient to instantiate objects from Jsonnet blobs. In ablation studies such a declarative syntax allows an entire experiment to be specified using JSON; moreover, it allows architectures to be changed without changing code. It would be great to have experiment configuration files in fastai too. Thanks!!
For the unfamiliar reader, check this out to get an idea:

I’m using ImageDataBunch.from_df, which randomly splits the data frame into train/validation sets, with no option to pass a random seed.

I would like to add a seed=None argument, which would be passed on to random_split_by_pct to allow a reproducible split. Do you think it’s necessary?

You can do it with the data block API (which you should learn since it’s more flexible than the factory methods).
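For example, something along these lines; a sketch only, since class and method names shifted between v1 releases, and the path and the 'name'/'label' columns are placeholders for whatever your data frame actually contains:

import numpy as np
import pandas as pd
from fastai.vision import *

path = Path.home()/'data'/'my_images'   # placeholder image folder
df = pd.read_csv(path/'labels.csv')     # placeholder df with 'name' and 'label' columns

np.random.seed(42)  # fixes the random split below, making it reproducible
data = (ImageItemList.from_df(df, path, cols='name')
        .random_split_by_pct(0.2)
        .label_from_df(cols='label')
        .databunch(bs=64))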


How do I build a custom layer? Is there any tutorial?
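(For context, by a custom layer I mean a plain PyTorch nn.Module along these lines; the Swish activation is just an arbitrary illustration.)

import torch
import torch.nn as nn

class Swish(nn.Module):
    "x * sigmoid(x), usable anywhere an nn.Module is expected"
    def forward(self, x):
        return x * torch.sigmoid(x)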

I’m currently working on Object Detection and I’m thinking about developing an object similar to fastai’s ClassificationInterpretation to interpret the model.

Would that be a welcome addition to fastai? If so, do you have any particular requests with respect to the features?

(originally posted on Developer chat, removed to post here)

It would be welcome, yes.
