I’m really new to all of this (including Python itself), and I am absolutely stumped about what is going on with my DataBlock/Transforms.
I’m trying to take in a bunch of tabular-ish data and turn it into images, from which I want to get two regression predictions. I managed to get everything working (images created, labels generated, data loaded into a DataBlock, etc.) all the way up to training and getting predictions from a model, but when I went back to make further improvements I noticed that my data was behaving bizarrely.
Setup
Each of my “images” is a three-channel tensor that is normalized/feature-scaled so its values lie in the range [0, 1]. To double-check this, I’ve confirmed that the min and max of every input item are 0 and 1, respectively, before I feed the data to the DataBlock.
I pass the data into the DataBlock as a list of zipped tuples (i.e. (input, target)), where each input is a 3xHxW tensor/TensorImage and each target is a tensor with two values (essentially a high prediction and a low prediction).
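Concretely, the hand-off looks roughly like this (inputs and targets are placeholder names for my prepped lists):

# inputs: list of 3xHxW TensorImage tensors, each scaled to [0, 1]
# targets: list of 2-element tensors (a high value and a low value)
items = list(zip(inputs, targets))

# Sanity check right before anything touches the DataBlock:
for x, _ in items:
    assert x.amin().item() == 0.0 and x.amax().item() == 1.0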
The issue
After I put my data into the DataBlock, some kind of transform or multiplication happens, because my input items are no longer in the range [0, 1]. I’ve added print statements inside the transforms to try to figure out where this happens, but I just can’t seem to find it.
Code
Here’s what my DataBlock looks like, along with some of the transforms/helper functions I’m using:
from fastai.vision.all import *  # DataBlock, TransformBlock, RegressionBlock, ItemTransform, EndSplitter

def get_items(data): # Should essentially just pass the data along. If there's a way to do that
    return data      # without having to use this kind of separate function, that'd be great to know!

class ItemGettr(ItemTransform): # Just copy/pasted ItemGetter with a print statement added
    _retain = False
    def __init__(self, i): self.i = i
    def encodes(self, x):
        res = x[self.i]
        print(f'{res.amin()}/{res.amax()}') # Print min/max. This is about where the problem shows up
        return res

class GetY(ItemTransform):
    def encodes(self, x):
        return x[1]

regblock = DataBlock(blocks=(TransformBlock, RegressionBlock(n_out=2)),
                     get_items=get_items,
                     get_x=ItemGettr(0),  # Slightly customized to print intermediate values
                     get_y=GetY,          # Seems to work fine, but I could probably use ItemGetter(1)?
                     splitter=EndSplitter())
# I've removed all batch and item transforms to try to single out the issue.
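For completeness, this is roughly how I build the DataLoaders and trigger those prints (the batch size is arbitrary):

dls = regblock.dataloaders(items, bs=64)
xb, yb = dls.train.one_batch()  # ItemGettr's print fires while the items are fetched
print(xb.amin().item(), xb.amax().item())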
The most recent output of the ItemGettr(0) print statement is:
0.22984468936920166/1.0
0.22984468936920166/1.0
(It can vary, apparently depending on how many items are actually being passed into the DataBlock…)
Some other considerations
I’ve found that when my dataset is small (a few thousand items) I actually don’t have any issue: the min/max comes out as 0/1 as expected. It’s only with larger amounts of data (15,000+ items) that the mins and maxes start getting wonky. Ideally I’d like to use around 100,000 items, and what tipped me off to the issue in the first place was that, with 100,000 items, valid.show_batch() was printing out almost completely black images when they should show much more of a full spectrum.
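For reference, the call that tipped me off looks like this (run on the ~100,000-item dataset; max_n is arbitrary):

dls = regblock.dataloaders(items)
dls.valid.show_batch(max_n=9)  # renders almost completely black images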
If you want the actual data to look at, or the functions I use to prep it, I can provide that too (it’s sourced from a free API, so anyone can grab it). But, as I mentioned earlier, I think I’ve been able to verify that everything is correct before I hand it off to the DataBlock.
Any ideas on what I’m doing wrong here?