Help! DataBlock mysteriously altering data

I’m really new to all of this (including Python itself), and I am absolutely stumped about what is going on with my DataBlock/Transforms.

I’m trying to take in a bunch of tabular-ish data and turn it into images, from which I want to get two regression predictions. I managed to get everything working (images created, labels generated, things loaded into a DataBlock, etc.) all the way up to training and getting predictions from a model, but when I went back to try to make extra improvements I noticed that my data was behaving bizarrely.

Setup

Each of my “images” is a three-channel tensor that is normalized/feature-scaled so its values are in the range [0,1]. To double-check that this is the case, I’ve confirmed that the min and max of every input item are 0 and 1, respectively, before I feed the data to the DataBlock.
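
For what it’s worth, the per-item scaling and the spot-check look roughly like this (the function names are just placeholders for my prep code):

import torch

def minmax_scale(t: torch.Tensor) -> torch.Tensor:
    # Per-item feature scaling into [0, 1]
    return (t - t.amin()) / (t.amax() - t.amin())

def check_range(t: torch.Tensor):
    # After scaling, each item's min/max should come out as exactly 0 and 1
    lo, hi = t.amin().item(), t.amax().item()
    assert lo == 0.0 and hi == 1.0, (lo, hi)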

I pass the data into the DataBlock as a list of zipped tuples (i.e. (input, target)), where the input is a 3xHxW tensor/TensorImage and the target is a tensor with two values (essentially a high prediction and a low prediction).
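
Concretely, the source list is built roughly like this (inputs and targets are placeholders for my prepped data):

from fastai.vision.all import *  # TensorImage, tensor, ...

# inputs:  list of 3xHxW float tensors already scaled to [0, 1]
# targets: list of (high, low) float pairs
items = list(zip([TensorImage(x) for x in inputs],
                 [tensor([hi, lo]) for hi, lo in targets]))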

The issue

After I put my data into the DataBlock, some kind of transform or multiplication happens, because my input items are no longer in the range [0,1]. I’ve added print statements to my transforms to try to figure out where this happens, but I just can’t seem to find it.

Code

Here’s what my DataBlock looks like, along with some of the transforms/helper functions I’m using:

from fastai.vision.all import *  # DataBlock, ItemTransform, RegressionBlock, EndSplitter, etc.

def get_items(data):  # Should essentially just pass the data along. If there's a way to do that without
    return data       # having to use this kind of separate function, that'd be great to know!

class ItemGettr(ItemTransform):  # Just copy/pasted ItemGetter with a print statement added
    _retain = False
    def __init__(self, i): self.i = i
    def encodes(self, x):
        res = x[self.i]
        print(f'{res.amin()}/{res.amax()}')  # Print min/max. This is roughly where the problem shows up
        return res

class GetY(ItemTransform):
    def encodes(self, x):
        return x[1]

regblock = DataBlock(blocks=(TransformBlock, RegressionBlock(n_out=2)),
                     get_items=get_items,
                     get_x=ItemGettr(0),   # Slightly customized to print intermediate values
                     get_y=GetY(),         # Seems to work fine, but I could probably use ItemGetter(1)?
                     splitter=EndSplitter(),
                     # I've removed all batch and item transforms to try to single out the issue.
                    )
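
For context, here’s roughly how I turn that block into DataLoaders (items is the zipped list described above, and bs=64 is just a placeholder):

dls = regblock.dataloaders(items, bs=64)
dls.valid.show_batch()
# regblock.summary(items)  # also handy for watching each step get applied to a single item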

The most recent result of the ItemGettr(0) print statement is:

0.22984468936920166/1.0
0.22984468936920166/1.0

(It can vary, apparently depending on how many items are actually being passed into the DataBlock…)

Some other considerations

I’ve found that when my dataset is small (a few thousand items), I actually don’t have any issue: the min/max comes out as 0/1 as expected. It’s only with larger amounts of data (15,000+) that the mins and maxes start going wrong. I’m ideally trying to use around 100,000 items, and what tipped me off to the issue in the first place was that, with 100,000 items, valid.show_batch() was showing almost completely black images when they should cover much more of the full spectrum.

If you want the actual data to look at, or the functions I use to prep it, I can provide those too (it’s sourced from a free API, so anyone can grab it). But, as I mentioned earlier, I think I’ve been able to verify that everything is correct before I hand it off to the DataBlock.

Any ideas on what I’m doing wrong here? :sweat_smile:

Hi @nauzk,
as far as I understand, ItemGetter is a Transform that is intended to be part of the item transforms.

Try to use get_image_files as your get_x.

Does this make show_batch() show what you expect?


Thanks for the response, @JackByte!

Unfortunately, get_image_files doesn’t work for my use case because I don’t actually have any image files—all of the “images” are just series of numerical data that have been processed and squished into image-shaped tensors (kind of like the mindset behind some audio machine learning techniques). I could save my generated images to disk as image files and then provide the paths, but hopefully there’s a way to avoid that…
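
(If it helps to picture it, the “squishing” is roughly along these lines, with the actual sizes and feature layout being specific to my data:)

import torch
from fastai.vision.all import TensorImage

H, W = 64, 64                    # placeholder image size
series = torch.rand(3 * H * W)   # placeholder for one prepped numerical series
img = TensorImage(series.reshape(3, H, W))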

Is there a reason ItemGetter would be altering my data, anyway? It’s returning the right sort of thing, it’s just not returning the values I’m expecting.

So, I figured out the problem :man_facepalming:

I didn’t check the inputs quite as thoroughly as I should have, and upon further inspection I discovered that some adjustments I had made were subtly breaking the normalization. Everything is working fine now.

On the plus side: if anybody is trying to work with generated images instead of files, or wants to pass a ready-made source into a DataBlock, it turns out ItemGetter(i) does the trick perfectly for both get_x and get_y. And if you format your data as TensorImages, the DataBlock handles it like a charm.
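
Here’s a minimal sketch of what the final block looks like for me (items is the ready-made list of (TensorImage, target) tuples; noop is re-exported by fastai and just returns its input, which I think also answers my earlier aside about skipping the pass-through get_items helper):

from fastai.vision.all import *

regblock = DataBlock(blocks=(TransformBlock, RegressionBlock(n_out=2)),
                     get_items=noop,        # the source list already is the list of items
                     get_x=ItemGetter(0),   # the 3xHxW TensorImage
                     get_y=ItemGetter(1),   # the tensor with the two regression targets
                     splitter=EndSplitter())

dls = regblock.dataloaders(items)
dls.show_batch()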
