Fastai_v1, adding features

I’m using a dataset with valid_pct, which uses random_split, but after reopening the notebook, np.random.uniform returns a different split, which spoils further training.

Could we add an additional seed argument to random_split, defaulting to len(arrs[0])?

def random_split(valid_pct:float, *arrs:NPArrayableList, seed:int=None)->SplitArrayList:
    "Randomly split `arrs` with `valid_pct` ratio. Good for creating a validation set."
    np.random.seed(len(arrs[0]) if seed is None else seed)
    is_train = np.random.uniform(size=(len(arrs[0]),)) > valid_pct
    return arrays_split(is_train, *arrs)

And enable it via kwargs from the constructor.

No, we don’t want to remove randomness by default. Just set the seed in your notebook.
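
A minimal sketch of that workaround, assuming the split happens when the DataBunch is created:

import numpy as np

np.random.seed(42)   # any fixed value; pins numpy's global RNG used by random_split
# ... create the DataBunch / call random_split as usual; the train/validation
# split is now identical across notebook restarts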

Hi,

Not a huge deal, but I keep having to pull up the docs to remind myself of the order of what is presented when I call plot_top_losses (is it actual/predicted or predicted/actual?). It might be more verbose than you would want, but I would change it so it appears like this:

Code change here.

Notebook where I played with this (under “Results” section) here.

Let me know if you would like me to open a PR. Otherwise, I’m sure I’ll remember the order eventually :slight_smile:

If you added a title containing the list in order, I’d merge that PR :slight_smile:
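
For reference, the kind of change being discussed is roughly one global title spelling out the ordering, so nobody has to remember it. This is only a matplotlib sketch, not the actual plot_top_losses code, and the label string must match whatever order the per-image titles actually use:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(3, 3, figsize=(9, 9))
# ... per-image titles as plot_top_losses already draws them ...
fig.suptitle('predicted / actual / loss / probability')  # spell out the order once, at the top
plt.show()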

@jeremy could fastai add adversarial training? I know this is mostly a research focus right now, but in production it would be better if models were robust to noise as well.

I’d be happy to contribute or help out.
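
For context, a minimal sketch of what adversarial training boils down to, here an FGSM-style perturbation in plain PyTorch (this is not fastai API; model, loss_func, x, y are assumed to already exist, and in fastai it would most naturally live in a Callback):

import torch

def fgsm_loss(model, loss_func, x, y, epsilon=0.03):
    "Loss on an FGSM-perturbed copy of the batch; add it to the clean loss during training."
    x_adv = x.clone().detach().requires_grad_(True)
    grad = torch.autograd.grad(loss_func(model(x_adv), y), x_adv)[0]
    x_adv = (x_adv + epsilon * grad.sign()).detach()   # step the input along the gradient sign
    return loss_func(model(x_adv), y)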

Minor feature request: allow the use of a custom data loader, i.e. for image classification this would mean adding an optional dataloader parameter to the ImageDataBunch.create static method.

Not sure how it is relevant to the library’s evolution roadmap, but probably it would be interesting to have a facade (or maybe build this logic into ImageDataBunch) to convert a “plain” PyTorch dataset class/instance into a dataset supported by the library. Also, it would be nice to have a way to pass nn.Module subclasses directly into the learner. Then you could do something like:

class FancyCustomModel(nn.Module):
    # ... some stuff to create and set up the model

    def split(self):
        return (self[1], )

train_ds, valid_ds = MNIST(train=True), MNIST(train=False)
bunch = ImageDataBunch.create(train_ds, valid_ds)
learn = ClassificationLearner(bunch, FancyCustomModel())
learn.fit_one_cycle(1)

The main reason for this proposal is to make the PyTorch <-> fastai integration even more seamless than it already is.

Agree, I often follow this pattern using a “ModelModifier”:

learn = create_cnn(data, ModelModifier(models.resnet50), metrics=error_rate)

class ModelModifier:
    def __init__(self, arch):
        self.arch = arch
    def __call__(self, pretrained):
        module = self.arch(pretrained)
        # do something with the model before passing it back to create_cnn
        return module

Any PyTorch dataset should already work fine with the library. Did you find any problems?

I also played with your weight decay finder, but with my two datasets the results almost always overlaid each other.
What is your experience with it? Does this depend on the data/network, or did you encounter the same behavior for different datasets?

Thank you for sharing your notebook!

Kind regards
Michael

You can use a lambda here as a shortcut: create_cnn(data, lambda pre: FancyCustomModel())

Hi,

I have a working prototype of a possible SparseDataset and LinearSparse for creating datasets (mostly tabular) from scipy.sparse using a custom sparse_collate_fn. Would this be something desired or worth adding as a feature?

I read a lot of issues on the PyTorch forums and came up with the ideas below for handling sparse datasets.

Some use cases (why would someone want a sparse dataset when we have embeddings?):

1) Extracting features from CNN models like ResNet: if you have n different models, this produces an n*2048-sized input vector, which has a lot of zeros since it comes after a ReLU.

2) Text features which are non-sequential and variable length. Here we can use BOW/TF-IDF; of course you can work your way around this with more sophisticated models like LSTMs or Transformers, but BOW/TF-IDF will most of the time give you a strong and fast baseline (see the sketch after this list).

3) Will think of more… :slight_smile:
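
As an illustration of use case 2, a small sketch with scikit-learn (an assumption on my side that scikit-learn is available; texts is just a toy corpus):

from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["first document", "a second, longer document", "and a third one"]
X = TfidfVectorizer().fit_transform(texts)   # scipy.sparse CSR matrix
print(X.shape, X.nnz)                        # vocabulary-sized columns, only a few non-zeros per row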

Why a custom collate_fn?

Because default_collate will fail. The way I do it is to store the scipy.sparse data inside the dataset:

import numpy as np
import scipy.sparse
import torch

def sparse_collate_fn(dataset):
    """
    dataset: [self.dataset[i] for i in indices]
    Each item is a (scipy.sparse row, target) pair.
    """
    x, y = list(zip(*dataset))
    sparse_stacked = scipy.sparse.vstack(x)   # stack the sparse rows into one batch matrix
    torch_sparse_stacked = torch_from_scipysparse(sparse_stacked, size=sparse_stacked.shape, device=0)
    torch_y = torch.FloatTensor(np.concatenate([y]))
    return torch_sparse_stacked, torch_y

def torch_from_scipysparse(sparse_matrix, *args, **kwargs):
    """sparse_matrix: scipy sparse matrix"""
    sparse_matrix = sparse_matrix.tocoo(copy=False)   # COO gives direct access to row/col/data
    row, col, values = sparse_matrix.row, sparse_matrix.col, sparse_matrix.data
    i = torch.LongTensor(np.vstack([row, col]))
    v = torch.FloatTensor(values)
    return torch.sparse_coo_tensor(i, v, *args, **kwargs)
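
A usage sketch, assuming sparse_ds is a dataset whose __getitem__ returns a (scipy.sparse row, target) pair:

from torch.utils.data import DataLoader

dl = DataLoader(sparse_ds, batch_size=64, collate_fn=sparse_collate_fn)
xb, yb = next(iter(dl))   # xb: torch sparse COO tensor (on GPU 0, given device=0 above), yb: dense FloatTensor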

Why a custom LinearSparse?

Because at the moment autograd supports only some sparse ops, which can be found here: https://github.com/pytorch/pytorch/issues/9674. This is an open issue and I bet in the near future they will support all modules. But if you create an nn.Linear layer, move it to, say, device 0, and call backward(), it will fail. The reason, at least as I remember from the forums, is that when you move the module, intermediate variables are created and backward fails (maybe someone can explain further). So LinearSparse allows us to move the module to the GPU as we create it; it uses torch.mm and is pretty similar to nn.Linear:

import math
import torch
from torch import nn
from torch.nn import Parameter

class LinearSparse(nn.Module):
    def __init__(self, in_features, out_features, bias=True, **kwargs):
        """Extra kwargs (e.g. device) are passed to the weight and bias construction."""
        super(LinearSparse, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = Parameter(torch.rand(in_features, out_features, **kwargs))
        if bias:
            self.bias = Parameter(torch.rand(out_features, **kwargs))
        else:
            self.register_parameter('bias', None)
        self.reset_parameters()

    def reset_parameters(self):
        # uniform init scaled by the layer width
        stdv = 1. / math.sqrt(self.weight.size(1))
        self.weight.data.uniform_(-stdv, stdv)
        if self.bias is not None:
            self.bias.data.uniform_(-stdv, stdv)

    def forward(self, input):
        """input: sparse (S), weight: dense (D)"""
        out = torch.mm(input, self.weight)
        return out + self.bias if self.bias is not None else out

    def extra_repr(self):
        return 'in_features={}, out_features={}, bias={}'.format(
            self.in_features, self.out_features, self.bias is not None
        )
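
And a usage sketch, assuming a CUDA device is available (the device kwarg goes straight into the weight/bias construction above):

layer = LinearSparse(1000, 8, device='cuda')
x = torch.rand(16, 1000, device='cuda').to_sparse()   # sparse 16x1000 batch
out = layer(x)                                        # dense 16x8 output
out.sum().backward()                                  # gradients flow to the dense weight/bias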

You can already create a custom dataloader; see my post about the sparse dataset above. The way fastai v1.0 handles data is as flexible as any library can get with PyTorch. If you have a particular use case, someone in the forums can probably help if you create a thread about it.

Sorry for a late response :sweat_smile:

Probably I was a bit unclear in my message. Yes, I agree, fastai includes a great list of standard datasets and makes working with them really simple. However, here is what I am talking about. Consider the following snippet:

path = Path.home()/'data'/'MNIST'
train_ds = MNIST(path, train=True)
valid_ds = MNIST(path, train=False)
bunch = ImageDataBunch.create(train_ds, valid_ds)
learn = create_cnn(bunch, models.resnet18)
learn.fit_one_cycle(1)

The snippet cannot be used as is, because MNIST doesn’t work with the library out of the box:

AttributeError: 'MNIST' object has no attribute 'c'

I mean, the fastai library is not directly compatible with the Dataset interface used by PyTorch. So I was thinking it would be great to have a way to construct a DataBunch instance from “native” datasets, because right now it is definitely not enough to implement __getitem__ and __len__ to build a custom fastai-ready class.

Also, the most recent version of the library (from master) seems to be very focused on the data block API, which makes it a bit difficult to construct a data bunch “manually”, I would say.


I hope my thoughts are clear :slight_smile:

Of course, I understand that the library builds a lot of additional abstractions on top of “plain” PyTorch capabilities. I would only like to note that making the library more “friendly” to PyTorch classes would be really helpful for anyone who builds a lot of things manually.

Hi,

You may check my previous post just above yours :slight_smile: If you look at the screenshot of my Jupyter notebook, ds is a native PyTorch torch.utils.data Dataset with the methods you mentioned, and dl is a native torch.utils.data DataLoader.

All that DataBunch cares about is a DataLoader instance, which will have __iter__ and __len__. Check this page out: https://pytorch.org/docs/stable/_modules/torch/utils/data/dataloader.html#DataLoader

Hope this helps.

class DataBunch():
    "Bind `train_dl`,`valid_dl` and`test_dl` to `device`. tfms are DL tfms (normalize). `path` is for models."
    def __init__(self, train_dl:DataLoader, valid_dl:DataLoader, test_dl:Optional[DataLoader]=None,
                 device:torch.device=None, tfms:Optional[Collection[Callable]]=None, path:PathOrStr='.',
                 collate_fn:Callable=data_collate):

As you see, it takes DataLoaders during construction.
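
So building a DataBunch from plain PyTorch loaders can look roughly like this (a sketch; train_ds/valid_ds are any map-style datasets returning (x, y) pairs, and the exact import path may differ between fastai versions):

from torch.utils.data import DataLoader
from fastai.basic_data import DataBunch

train_dl = DataLoader(train_ds, batch_size=64, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=64)
data = DataBunch(train_dl, valid_dl)   # no fastai-specific dataset class needed at this level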

Best

Hi Kerem! Yes, you’re right, good example! I’m writing custom datasets as well :smile: So my point was to introduce this behaviour into the library to make interoperation with “standard” PyTorch classes simpler. Like, to make it possible to drop in plain PyTorch objects as replacements for fastai loaders and datasets (maybe models too?).

Hi Ilia,

Maybe I am not able to understand, but what is the exact use case that you cannot find a way of doing with v1.0, for example?

That way it might make more sense, I guess, at least to me :slight_smile:

Best

Yes, sure, please check the snippet I’ve shared above. As far as I can understand, the DataBunch class declares (via type annotations) that it accepts torch.utils.data.Dataset instances:

class DataBunch():
    ...

    @classmethod
    def create(cls, train_ds:Dataset, valid_ds:Dataset, test_ds:Dataset=None, path:PathOrStr='.', bs:int=64,
               num_workers:int=defaults.cpus, tfms:Optional[Collection[Callable]]=None, device:torch.device=None,
               collate_fn:Callable=data_collate)->'DataBunch':

I was reading this signature, following OOP principles, as: it can take any object compliant with the Dataset interface, including its subclasses. However, that is not really the case with the library. For example, I can’t do the following:

from torchvision.datasets import MNIST
from fastai.vision import *

path = Path.home()/'data'/'MNIST'
train_ds = MNIST(path, train=True)
valid_ds = MNIST(path, train=False)

# the objects don't have property `c` and cannot be directly passed into `create`
bunch = ImageDataBunch.create(train_ds, valid_ds)
learn = create_cnn(bunch, models.resnet18)
learn.fit_one_cycle(1)

I would say it expects a FastaiDataset interface that extends the original definition with additional properties. I mean, here is what we effectively have now:

# mock up interface to illustrate my idea
class FastaiDataset(Dataset):

    ... # some other properties

    @property
    def c(self):
        return len(self.classes)

# and then it should be more like this
class DataBunch():
    ...

    @classmethod
    def create(cls, train_ds:FastaiDataset, valid_ds:FastaiDataset, test_ds:FastaiDataset=None, path:PathOrStr='.', bs:int=64,
               num_workers:int=defaults.cpus, tfms:Optional[Collection[Callable]]=None, device:torch.device=None,
               collate_fn:Callable=data_collate)->'DataBunch':
        ...

Probably I am complicating things too much :smile: The idea is that it would be great to be able to take any class compatible with the torch.utils.data.Dataset interface and pass it into DataBunch without additional manual wrappers and decorators. I mean that the type annotation is a bit misleading :sweat_smile: At least from my point of view. Since the library strictly annotates every argument and return value, it is probably important to have the appropriate interfaces.

Of course, it is only my personal opinion.

I’m unsure why this is an issue: why would you load MNIST like this when you have a convenience function to download the original dataset in a form fastai can open, which gets you not only create_cnn but plenty of other convenience functions like show_batch, Learner.predict, etc.?
The idea when we say DataBunch can take any PyTorch Dataset is that it will give you a data object you can pass into a Learner with your model, then benefit from all the default training parameters the library offers.

As you saw, if you want to use a custom Dataset with create_cnn, you have to add a c property to it. That’s never going to change. If you load MNIST this way, the library’s data augmentation can’t work on it, and nothing we add can change that.

The right answer is to customize your Dataset to fit the fastai pipeline, because that is always going to give you better results with the library.
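
To make that concrete, the minimal customization being discussed is roughly this (a sketch; c is the attribute create_cnn looks for, and depending on the version other attributes such as classes or the fastai transforms may still be needed for the rest of the pipeline):

from torchvision.datasets import MNIST

class FastaiMNIST(MNIST):
    "Thin wrapper exposing the number of classes that create_cnn expects."
    @property
    def c(self): return 10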

Ok, understood! Yes, I agree. I just find myself very often having small issues when trying to do something “on the edge” between fastai and PyTorch. MNIST is just an example :smiley:

As an unrelated example, you can easily pass pandas data frames into scikit-learn classes even though scikit-learn works with numpy arrays. All transformations are applied seamlessly thanks to interface compatibility.

But I agree, it is always possible to patch custom stuff to fit into fastai. I only wanted to clarify where the library is going so I don’t end up reinventing the wheel :smile: