Get value counts from an ImageDataBunch

jimypbr · February 20, 2019, 9:53am

Fast.ai library usage question here.

I have an ImageDataBunch object of labeled images.
This object has the attributes: data.classes, data.c, len(data.train_ds), len(data.valid_ds).

This gives me the class names, number of classes, and lengths of the training and validation sets.
How do I get the value counts of the classes in the training and validation sets?

I have tried experimenting with the labels of the training set, data.train_ds.y, which is of type CategoryList. Is there some method I can use on this object? I also tried using the value_counts function from pandas. The output of that is interesting:

pd.value_counts(data.train_ds.y)
chimp        1
gorilla      1
gorilla      1
chimp        1
gorilla      1
gorilla      1
gorilla      1
gorilla      1
gorilla      1
orangutan    1
chimp        1
...

Thanks

sgugger · February 20, 2019, 1:22pm

If you take data.train_ds.y.items, you’ll get the various indices corresponding to those classes. It might work better with pandas to get the counts, or with numpy.
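
A minimal sketch of the numpy route, assuming items holds integer class indices as described above. The classes list and items array below are made-up stand-ins for data.classes and data.train_ds.y.items, not real fastai objects:

```python
import numpy as np

# Hypothetical stand-ins for data.classes and data.train_ds.y.items
classes = ["chimp", "gorilla", "orangutan"]
items = np.array([0, 1, 1, 0, 1, 1, 1, 1, 1, 2, 0])

# Count occurrences of each class index; minlength keeps classes with zero samples
counts = np.bincount(items, minlength=len(classes))

# Pair each class name with its count
class_counts = dict(zip(classes, counts.tolist()))
print(class_counts)
```

Because np.bincount orders its output by index, the counts line up with the class names without any sorting step.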

jimypbr · February 22, 2019, 1:47pm

Thanks!
So with pandas that would be:

> vc = pd.value_counts(data.train_ds.y.items, sort=False)
> vc.index = data.classes; vc

chimp        173
gorilla      177
orangutan     56
dtype: int64

I dug a bit deeper and FYI the problem also occurs with the Counter from collections:

> Counter(data.train_ds.y)

Counter({
Category chimp        1
Category gorilla      1
Category gorilla      1
Category chimp        1
Category gorilla      1
Category gorilla      1
Category gorilla      1
Category gorilla      1
Category gorilla      1
...

I think however the problem could be fixed in the Fast.ai library if you made the Category class hashable.
This is easily achieved by overriding the __eq__ and __hash__ methods in the Category class.
I monkey-patched the Category object in a jupyter notebook as a PoC:

> Category.__eq__ = lambda self, that: self.data == that.data
> Category.__hash__ = lambda self: hash(self.obj)
> Counter(data.train_ds.y)

Counter({Category orangutan: 56, Category gorilla: 177, Category chimp: 173})

Could I submit a PR for this maybe??
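
For readers without a fastai install, the behaviour can be reproduced with a small stand-in class. This is only a sketch: Label is a hypothetical substitute for Category, assumed to carry the same data and obj attributes.

```python
from collections import Counter

class Label:
    "Hypothetical stand-in for fastai's Category: an index plus a display name."
    def __init__(self, data, obj):
        self.data, self.obj = data, obj

labels = [Label(0, "chimp"), Label(1, "gorilla"), Label(1, "gorilla")]

# Default __eq__/__hash__ are identity-based, so every instance is its own key
before = Counter(labels)

# Same patch as in the post, applied to the stand-in class
Label.__eq__ = lambda self, that: self.data == that.data
Label.__hash__ = lambda self: hash(self.obj)
after = Counter(labels)

print(len(before), len(after))
```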

sgugger · February 22, 2019, 1:54pm

Yes, but I think it would be even better at the ItemBase level, so that every type of item in fastai has it.

jimypbr · February 22, 2019, 1:57pm

Cool. I’ll give it a go!

jimypbr · February 25, 2019, 11:51pm

Here is my proposed fix (and first ever PR): https://github.com/fastai/fastai/pull/1717

sgugger · February 26, 2019, 2:59pm

I commented on it, but we can also continue the discussion here. It’s great, but I think we can make it even better by having only one method for eq and hash at the ItemBase level that would work in any subclass.
If you need to make some checks like np.all because the data is sometimes an array, we should test it in the base function and make sure it handles it properly.
And also, sometimes people (me for instance) are lazy and forget the obj attribute in a new ItemBase, so we should make sure there is a fallback to data in this case (for the hash function).

jimypbr · February 27, 2019, 8:54pm

Thanks for your feedback!

I have a couple of questions:

Can the data attribute be either a scalar, a numpy array, or a torch Tensor?
What is the reasoning behind the FloatItem class?

class FloatItem(ItemBase):
    "Basic class for float items."
    def __init__(self,obj): 
        self.data, self.obj = np.array(obj).astype(np.float32), obj
    def __str__(self): return str(self.obj)

Is obj here supposed to represent a single float number (my assumption so far)? Then why is self.data converted to a zero-dimensional array? This part is making the ‘is scalar?’ testing fiddly.

jimypbr · February 27, 2019, 9:27pm

Here is my proposed change. I managed to whittle it down and remove all if statements. Two methods for ItemBase only:

    def __eq__(self, other): 
        return np.all(np.atleast_1d(self.data == other.data))
    def __hash__(self): 
        return hash(str(self.data))

Using atleast_1d allows it to handle the cases where data is scalar, array, and also torch Tensor.
For the __hash__ would it be sufficient to just convert the data to a string and hash that instead? That takes advantage of the __str__ method in the subclasses and also would avoid problems where obj is set to null.
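
A quick sketch of how the atleast_1d comparison behaves for scalars and numpy arrays. The data_equal helper is a hypothetical name that pulls the expression out of __eq__, the bool() wrapper is an addition so the result is a plain Python bool, and torch Tensors are not exercised here.

```python
import numpy as np

def data_equal(a, b):
    "Equality check from the post, pulled out as a hypothetical helper."
    # a == b may be a bool, a numpy bool, or a boolean array;
    # atleast_1d turns all of them into an array that np.all can reduce
    return bool(np.all(np.atleast_1d(a == b)))

scalar_same = data_equal(3, 3)
array_same = data_equal(np.array([1.0, 2.0]), np.array([1.0, 2.0]))
array_diff = data_equal(np.array([1.0, 2.0]), np.array([1.0, 5.0]))
# Zero-dimensional arrays, like the one FloatItem stores for a single float
zero_dim = data_equal(np.array(0.5, dtype=np.float32), np.array(0.5, dtype=np.float32))

print(scalar_same, array_same, array_diff, zero_dim)
```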

sgugger · February 27, 2019, 9:29pm

FloatItem can contain one float or a list of floats. Problem is that torch isn’t always very gracious with things that aren’t proper numpy arrays, and I had some bugs here, which is why I’m converting the thing like this.

sgugger · February 27, 2019, 9:45pm

You’re looking at str(self.data) and not self, so you won’t use the __str__ methods in the subclasses. Why not just put self.data in the hash? Are there types not hashable we should expect in float, numpy arrays and torch tensors?

jimypbr · February 27, 2019, 10:06pm

You’re correct, my mistake.

Numpy arrays aren’t hashable and, from testing, torch tensors don’t produce a unique hash given the same value. Actually using str(self.data) is totally wrong here, sorry. hash(str(self)) should work except for the case where obj is null and you want to fall back on data. That’s where it gets tricky.
I’m a bit stuck there to satisfy the different potential types of data.
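
The numpy half of that claim is easy to check. The sketch below also shows hashing the raw bytes as one possible value-based alternative; that part is an assumption for illustration only, not something this thread settled on.

```python
import numpy as np

arr = np.array([1.0, 2.0])

# ndarray sets __hash__ to None, so hash() raises TypeError
try:
    hash(arr)
    array_hashable = True
except TypeError:
    array_hashable = False

# Illustrative alternative (an assumption): hash the underlying bytes.
# Caveat: arrays that compare equal but differ in dtype would hash differently.
same_bytes = hash(arr.tobytes()) == hash(np.array([1.0, 2.0]).tobytes())

print(array_hashable, same_bytes)
```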

jimypbr · March 5, 2019, 11:18am

I’ve thought a bit more about it.
I think this should work:

    def __eq__(self, other): 
        return np.all(np.atleast_1d(self.data == other.data))
    def __hash__(self): 
        return hash(str(self))

I use the string method of the ItemBase subclasses as the hash. This seems to be a good hash function because the __str__ implementations correspond directly to the underlying data attribute, so the result should be unique and unchanging over the object’s lifetime, and strings are hashable.

The only thing I’m not sure I’ve solved yet is your comment:

And also, sometimes people (me for instance) are lazy and forget the obj attribute in a new ItemBase, so we should make sure there is a fallback to data in this case (for the hash function).

Do you mean if you implement a new subclass of ItemBase? Or simply that you sometimes don’t bother to assign a value of obj? e.g.:

Category(0, '')
Category(1, '')

And so in this case you’d rather the hash be ‘0’ or ‘1’. If that is true then I think it would be simpler to just handle that case in the constructor of Category and set obj to be data if obj is ‘’.

sgugger · March 5, 2019, 2:45pm

Not always: for Image for instance, the string representation is the class and the size, so it will be the same for all Images. If we leave the default hash (which is the id of the object), does it impact your Counter thing?

jimypbr · March 5, 2019, 4:54pm

Counter is fundamentally a dictionary where the keys are hashable python objects and the values are the count.
For correctness in a dictionary a == b implies hash(a) == hash(b) and the hash value should be immutable. For efficiency in a dictionary, if a != b then ideally hash(a) != hash(b) (so avoiding hash collisions). So id wouldn’t work here because it would violate the correctness; they would be seen as different objects even though their ‘values’ were the same. Conversely if you just set hash to 0 all the time then it would work, but it would be very inefficient.

I see, so the image case makes it more fiddly. If it’s the same for all images then it would still work, but inefficiently. Whereabouts in the code is this? I only tested for the subclasses in ‘core.py’.
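
The two extremes described above can be sketched with toy classes. ConstHashItem and IdHashItem are hypothetical names, not fastai types; both compare by value and differ only in how they hash.

```python
from collections import Counter

class ConstHashItem:
    "Hypothetical item: value-based equality, but every instance hashes to 0."
    def __init__(self, data): self.data = data
    def __eq__(self, other): return self.data == other.data
    def __hash__(self): return 0

class IdHashItem:
    "Hypothetical item: value-based equality, but identity-based hash."
    def __init__(self, data): self.data = data
    def __eq__(self, other): return self.data == other.data
    def __hash__(self): return id(self)

values = [0, 1, 1, 0, 1]

# Constant hash: every key collides, but __eq__ still separates the values
const_counts = Counter([ConstHashItem(v) for v in values])

# Identity hash: equal items land under different hashes, so nothing is merged
id_counts = Counter([IdHashItem(v) for v in values])

print(len(const_counts), len(id_counts))
```

The constant hash gives correct counts at the cost of comparing against every existing key on each lookup, which is the inefficiency mentioned above.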

sgugger · March 5, 2019, 6:07pm

Image class is defined in fastai.vision.image

jimypbr · March 6, 2019, 4:34pm

Thanks! With some of the other subclasses of ItemBase - Image, Text, Tabular - it doesn’t make sense to me to try to make these hashable because they represent data. It seems that some subclasses of ItemBase are used to represent categorical labels and others are used to represent data.
My original objective was to just get the ones that represent labels to be hashable so they play nice with dictionaries and counters. I don’t see why you’d want to put an Image object as a key in a dictionary or Counter. So I don’t think it’s possible to satisfy all the cases by putting __eq__ and __hash__ into ItemBase. Maybe a solution would be to put them in a Mixin class and have classes like Category, MultiCategory etc inherit from that as well as from ItemBase:

class LabelMixin:
    def __eq__(self, other): 
        return np.all(np.atleast_1d(self.data == other.data))
    def __hash__(self): 
        return hash(str(self))

class Category(LabelMixin, ItemBase):
    ...
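
A self-contained version of the mixin idea that runs without fastai. The ItemBase and Category classes here are simplified stand-ins written for illustration, and the bool() wrapper in __eq__ is an addition; the real fastai classes have more to them.

```python
from collections import Counter
import numpy as np

class ItemBase:
    "Simplified stand-in for fastai's ItemBase, for illustration only."
    def __init__(self, data): self.data = self.obj = data
    def __str__(self): return str(self.obj)

class LabelMixin:
    "Adds value-based equality and a string-based hash to label classes."
    def __eq__(self, other):
        return bool(np.all(np.atleast_1d(self.data == other.data)))
    def __hash__(self):
        return hash(str(self))

class Category(LabelMixin, ItemBase):
    "Simplified stand-in for fastai's Category."
    def __init__(self, data, obj): self.data, self.obj = data, obj

ys = [Category(0, "chimp"), Category(1, "gorilla"), Category(1, "gorilla")]
counts = Counter(ys)

# Key the result by class name for readability
by_name = {str(k): v for k, v in counts.items()}
print(by_name)
```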

sgugger · March 6, 2019, 4:36pm

We can have the eq in ItemBase and the hash in Category and the other classes where it’s needed, then?

jimypbr · March 6, 2019, 4:42pm

Right. I think eq makes sense in all cases of ItemBase, and we just implement hash for the cases where it makes sense.
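
One Python detail is relevant to this split: a class that defines __eq__ without __hash__ gets __hash__ set to None, so subclasses stay unhashable unless they define their own. A sketch with simplified stand-in classes (not the real fastai definitions):

```python
class ItemBase:
    "Simplified stand-in: value-based equality only."
    def __init__(self, data): self.data = data
    def __eq__(self, other): return self.data == other.data

class Category(ItemBase):
    "Simplified stand-in: adds a hash so it can be a dict key."
    def __init__(self, data, obj): self.data, self.obj = data, obj
    def __hash__(self): return hash(self.obj)

class Image(ItemBase):
    "Simplified stand-in: inherits __eq__ and stays unhashable."

# Defining __eq__ alone makes Python set __hash__ to None on that class
base_hash_is_none = ItemBase.__hash__ is None

# Category opts back in with its own __hash__
category_hashable = hash(Category(1, "gorilla")) == hash(Category(1, "gorilla"))

try:
    hash(Image(0))
    image_hashable = True
except TypeError:
    image_hashable = False

print(base_hash_is_none, category_hashable, image_hashable)
```

With this layout, data-like items fail loudly if someone tries to use them as dictionary keys, while label-like items work with Counter.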

jimypbr · March 8, 2019, 9:15am

Thanks. That was fun!