Fast.ai v3 lesson 1 normalize

pokadan · January 15, 2020, 6:10pm

I have a question about the fast.ai v3 lesson 1 notebook. I wanted to see what normalize() does to images, so I displayed one batch with the normalize() call and one without it. I was expecting to see the same pictures, with some image processing applied to the first batch. Instead I got two different sets of images. Is normalize() supposed to change the order of the data, or am I doing something wrong?
Thank you.

I have added code and results below:

    data = ImageDataBunch.from_name_re(path_img, fnames, pat, ds_tfms=get_transforms(), size=224, bs=bs, num_workers=0).normalize(imagenet_stats)
    data2 = ImageDataBunch.from_name_re(path_img, fnames, pat, ds_tfms=get_transforms(), size=224, bs=bs, num_workers=0)

    data.show_batch(rows=3, figsize=(7,6))
    data2.show_batch(rows=3, figsize=(7,6))

Lim · January 16, 2020, 4:45am

Did you set the seed, e.g. np.random.seed(2)?
ImageDataBunch creates a training set and a validation set. The validation set takes 20% of the data (0.2) by default.
The training set and the validation set are chosen randomly.
So, if you did not set the seed, it is reasonable that data and data2 produced different training batches.
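
To illustrate why the seed matters for a random split, here is a minimal sketch. It is not fastai's code: `random_split` is a hypothetical helper, and it uses Python's built-in `random` module to stay dependency-free. The same idea should apply to `np.random.seed`.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    """Hypothetical helper: randomly split indices into (train, valid)."""
    if seed is not None:
        random.seed(seed)          # reset the global generator
    idxs = list(range(n_items))
    random.shuffle(idxs)           # order depends on the generator state
    cut = int(valid_pct * n_items)
    return idxs[cut:], idxs[:cut]

# Reseeding before each split reproduces the same validation set.
train_a, valid_a = random_split(100, seed=2)
train_b, valid_b = random_split(100, seed=2)
print(valid_a == valid_b)  # True

# Without reseeding, the split continues from wherever the generator is.
train_c, valid_c = random_split(100)
```

Note that the seed has to be set again before every split. Setting it once and then building two databunches would give the second one a different position in the random stream.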

pokadan · January 17, 2020, 11:36am

I will try this asap. Thanks for this!

pokadan · January 17, 2020, 11:43am

I checked, and the random seed is set.
I run:
np.random.seed(2)
pat = re.compile(r'/([^/]+)_\d+.jpg$')
just before the code.
What else could it be? Could it be a bug, or something that is not supposed to happen?
Thank you!

pokadan · January 17, 2020, 2:56pm

Actually, I should do:

data = ImageDataBunch.from_name_re(path_img, fnames, pat, ds_tfms=get_transforms(), size=224, bs=bs, num_workers=0)
data2 = data.normalize(imagenet_stats)

pokadan · January 17, 2020, 3:03pm

I still get different results for data.show_batch() vs data2.show_batch().

Is it possible that data and data2 are sharing the same counter and are showing “subsequent” data?

Lim · January 17, 2020, 4:02pm

I am not sure what you mean by “sharing the same counter”.

But, I think you need to do:
from numpy.random import seed
seed(3)

or try
import random
random.seed(3)

I ran the lesson 1 script. I do not know whether the fastai library imports numpy implicitly, but np.random.seed() did not work properly for me without importing the library explicitly.

Now, the two DataBunch objects, with and without normalization, should show you similar output from show_batch(). They might be a little different because of normalization, which just takes each image, subtracts a per-channel mean, and divides the result by a per-channel standard deviation (here the ImageNet statistics passed in as imagenet_stats).
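
As a rough sketch of that arithmetic, applied to a single RGB pixel: the mean and standard deviation below are the commonly published ImageNet values, and the two helper functions are hypothetical names for illustration, not fastai's API.

```python
# Per-channel ImageNet mean and standard deviation (RGB, inputs scaled to 0..1).
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb):
    """Subtract the channel mean, then divide by the channel std."""
    return [(v - m) / s for v, m, s in zip(rgb, MEAN, STD)]

def denormalize_pixel(rgb):
    """Invert normalize_pixel: multiply by the std, then add the mean back."""
    return [v * s + m for v, m, s in zip(rgb, MEAN, STD)]

pixel = [0.5, 0.5, 0.5]                  # a mid-grey pixel
normed = normalize_pixel(pixel)          # values are now centred around 0
restored = denormalize_pixel(normed)     # back to (approximately) the original
print(normed)
print(restored)
```

The operation only rescales pixel values and is invertible. Nothing in it selects or reorders images.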

pokadan · January 17, 2020, 4:15pm

Thank you so much. I will do that right away.

pokadan · January 17, 2020, 6:45pm

I tried both
import random
and
import numpy.random
Same behaviour. It still shows me different animals for data.show_batch() and data2.show_batch().

I should mention that I run it inside a Kaggle notebook. Maybe their code is not up to date, or maybe there is something wrong with their environment?
This is the notebook I based my own on:
https://www.kaggle.com/hortonhearsafoo/fast-ai-v3-lesson-1

Thanks for any hint

Lim · January 17, 2020, 7:59pm

The notebook on Kaggle seems pretty much the same.

Try:
pat = r'/([^/]+)_\d+.jpg$'

instead of pat = re.compile(r'/([^/]+)_\d+.jpg$')

I don’t think re.compile is necessary, because from_name_re calls re.compile internally.
Not sure if that would make any difference.
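
A quick standard-library check suggests the two forms behave the same. The file path below is made up for illustration.

```python
import re

raw = r'/([^/]+)_\d+.jpg$'
compiled = re.compile(raw)

fname = '/data/images/great_pyrenees_12.jpg'  # hypothetical file path

# Both the raw string and the compiled pattern extract the same label.
label_from_raw = re.search(raw, fname).group(1)
label_from_compiled = compiled.search(fname).group(1)
print(label_from_raw, label_from_compiled)

# Compiling an already compiled pattern hands back the same object.
print(re.compile(compiled) is compiled)
```

So passing either a string or a compiled pattern should not, by itself, change which images end up in a batch.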

Could you share the notebook you wrote? It would be easier if I could see your exact code.

pokadan · January 17, 2020, 8:12pm

https://www.kaggle.com/danpoka/fast-ai-v3-lesson-1

Lim · January 18, 2020, 12:00am

I cannot access the notebook. Perhaps it is not publicly available?

pokadan · January 18, 2020, 5:40am

Yes, sorry. I have now made it public and added you as a collaborator so you can edit. A million thanks for taking a look at it.

Lim · January 18, 2020, 5:12pm

I couldn’t figure out why show_batch() shows different pictures every time, but try these:

data.train_ds.x[0]

data2.train_ds.x[0]

These grab a specific element in the training dataset. You should see the same picture for both, confirming that the dataset order does not change after normalizing.

pokadan · January 20, 2020, 9:17am

Hi Lim, you are correct
data.train_ds.x[0]
and
data2.train_ds.x[0]
show the same picture slightly modified

pokadan · January 20, 2020, 10:30am

Do you think we should report this as a bug in fastai? Or do you think we should ignore it?

Thanks

Archaeologist · January 20, 2020, 2:55pm

It might not be a bug but a feature: isn’t a batch chosen randomly?

pokadan · January 21, 2020, 1:17pm

Hi Archaeologist, thank you for contributing. Could you point to documentation where it says so? Thanks so much.

Archaeologist · January 21, 2020, 3:12pm

I looked up the source code of show_batch():

https://github.com/fastai/fastai/blob/master/fastai/basic_data.py#L184

        x = self.denorm(x)
        if norm.keywords.get('do_y',False): y = self.denorm(y, do_x=True)
    return x,y


def one_item(self, item, detach:bool=False, denorm:bool=False, cpu:bool=False):
    "Get `item` into a batch. Optionally `detach` and `denorm`."
    ds = self.single_ds
    with ds.set_item(item):
        return self.one_batch(ds_type=DatasetType.Single, detach=detach, denorm=denorm, cpu=cpu)


def show_batch(self, rows:int=5, ds_type:DatasetType=DatasetType.Train, reverse:bool=False, **kwargs)->None:
    "Show a batch of data in `ds_type` on a few `rows`."
    x,y = self.one_batch(ds_type, True, True)
    if reverse: x,y = x.flip(0),y.flip(0)
    n_items = rows **2 if self.train_ds.x._square_show else rows
    if self.dl(ds_type).batch_size < n_items: n_items = self.dl(ds_type).batch_size
    xs = [self.train_ds.x.reconstruct(grab_idx(x, i)) for i in range(n_items)]
    #TODO: get rid of has_arg if possible
    if has_arg(self.train_ds.y.reconstruct, 'x'):
        ys = [self.train_ds.y.reconstruct(grab_idx(y, i), x=x) for i,x in enumerate(xs)]
    else : ys = [self.train_ds.y.reconstruct(grab_idx(y, i)) for i in range(n_items)]

What happens there is basically a call to one_batch(), which is in the same file. It appears to pull the next batch from an iterator over the data, so the batch is never the same. I assume that the entire dataset is in random order, but I cannot be sure.
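
The behaviour described above can be imitated with a toy loader. This is only a sketch under the assumption that the training loader reshuffles on every new pass; `ShuffledLoader` and this `one_batch` are hypothetical stand-ins, not fastai's implementation.

```python
import random

class ShuffledLoader:
    """Hypothetical stand-in for a data loader that shuffles on every pass."""
    def __init__(self, items, batch_size):
        self.items = list(items)
        self.batch_size = batch_size

    def __iter__(self):
        # A fresh random order is drawn each time iteration starts.
        order = random.sample(self.items, len(self.items))
        for i in range(0, len(order), self.batch_size):
            yield order[i:i + self.batch_size]

def one_batch(loader):
    """Grab the first batch of a new pass over the loader."""
    return next(iter(loader))

loader = ShuffledLoader(range(1000), batch_size=9)

random.seed(0)
first = one_batch(loader)
second = one_batch(loader)    # new pass, new shuffle, different items
print(first)
print(second)

# Reseeding immediately before each call repeats the same batch in this toy.
random.seed(0)
repeat_a = one_batch(loader)
random.seed(0)
repeat_b = one_batch(loader)
print(repeat_a == repeat_b)  # True
```

In this toy, two consecutive calls show different items even though the underlying dataset is unchanged, which matches the observation that data.train_ds.x[0] is the same image in both cases.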

vijayabhaskar · January 21, 2020, 5:26pm

I’m not 100% sure, but in my experience data and data2 are the same thing. For example, if you build two databunches from the same labeled list and name them data1 and data2, then data2 (the most recently executed line) shares its parameters with data1, and data1 becomes an exact copy of data2.
Here is an example:
Create a databunch with images of size 200 and name it data200.

Now create a databunch from the same src with size 100 and name it data100. If you print what data200 contains now, you will notice that data200 has become data100: it now has images of size 100x100.
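
The effect can be reproduced with plain Python objects. This is a toy model under two assumptions: that both databunches keep a reference to the same source object, and that normalize() returns the object it was called on. `Source` and `Bunch` are hypothetical classes, not fastai's.

```python
class Source:
    """Hypothetical shared source holding a mutable setting."""
    def __init__(self):
        self.size = None

class Bunch:
    """Hypothetical wrapper that stores a reference to its source, not a copy."""
    def __init__(self, src):
        self.src = src
        self.normalized = False

    @classmethod
    def from_source(cls, src, size):
        src.size = size          # mutates the shared source in place
        return cls(src)

    def normalize(self):
        self.normalized = True   # modifies this object in place
        return self              # returns the same object, not a copy

src = Source()
data200 = Bunch.from_source(src, 200)
data100 = Bunch.from_source(src, 100)
print(data200.src.size)      # 100: both wrappers see the latest setting

data2 = data100.normalize()
print(data2 is data100)      # True: two names, one object
```

Under those assumptions, writing data2 = data.normalize(imagenet_stats) would not give a normalized copy next to an untouched original. Both names would refer to the same normalized object.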