So I made some progress with this issue. I'm trying to process multiple columns from a DataFrame:
```python
class TensorContinuous(TensorBase): pass

class RegSetup(Transform):
    "Transform that floatifies targets"
    def encodes(self, o): return TensorContinuous(o).float()
    def decodes(self, o:TensorContinuous): return TitledStr(o.item())

pipe = Pipeline([RegSetup])
temp = df[['age', 'parity']]
p = pipe(temp); p
```
Output:
```
TensorContinuous([[43.,  1.],
                  [43.,  1.],
                  [43.,  1.],
                  ...,
                  [49.,  9.],
                  [49.,  9.],
                  [49.,  9.]])
```
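For context, what `RegSetup` produces here is equivalent to a plain float cast of the two-column block. A minimal sketch with made-up data (the real pipeline wraps the result in `TensorContinuous` rather than a bare array):

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for the real df: two continuous columns
df = pd.DataFrame({"age": [43, 43, 49], "parity": [1, 1, 9]})

# What RegSetup's encodes() effectively does: float-cast the whole block at once
vals = df[["age", "parity"]].to_numpy(dtype=np.float32)
print(vals.shape)  # one (n, 2) array, not two separate columns
```

The point is that both columns travel through the pipeline as a single `(n, 2)` object.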
Now, the next thing I want to do is normalize these columns with their respective stats:
```python
class Norm(Transform):
    "Normalize/denorm batch of `TensorImage`"
    order = 99
    def __init__(self, mean=None, std=None, axes=(0,2,3)): self.mean,self.std,self.axes = mean,std,axes

    @classmethod
    def from_stats(cls, mean, std, dim=1, ndim=4, cuda=True): return cls(*broadcast_vec(dim, ndim, mean, std, cuda=cuda))

    def setups(self, dl:DataLoader):
        if self.mean is None or self.std is None:
            x = dl.one_batch()
            self.mean,self.std = x.mean(self.axes, keepdim=True), x.std(self.axes, keepdim=True)+1e-7
            print(self.mean, self.std)

    def encodes(self, x:TensorContinuous): return (x-self.mean) / self.std
    def decodes(self, x):
        f = to_cpu if x.device.type=='cpu' else noop
        return (x*f(self.std) + f(self.mean))

tl = TfmdLists(temp, pipe)
dl = tl.dataloaders(bs=8, after_batch=[Norm(axes=0)])
```
Output:
```
TensorContinuous([[41.7500,  2.2500]], device='cuda:0') TensorContinuous([[12.7811,  1.2817]], device='cuda:0')
```
```python
dl.one_batch()
```
Output:
```
TensorContinuous([[-0.1369, -0.1950],
                  [-0.8411,  0.5851],
                  [-1.5452, -0.9752],
                  [-0.2152, -0.1950],
                  [ 0.8020, -0.9752],
                  [-0.6064, -0.1950],
                  [-0.6064,  1.3653],
                  [ 1.0367,  1.3653]], device='cuda:0')
```
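The key detail making this work is `axes=0` with `keepdim=True`: the stats come out with shape `(1, 2)`, so each column is normalized by its own mean/std via broadcasting, and `decodes` inverts it exactly. A NumPy sketch of the same arithmetic (fake data, not the batch above):

```python
import numpy as np

x = np.array([[43., 1.], [41., 2.], [49., 9.]], dtype=np.float32)

# Per-column stats with shape (1, 2), so they broadcast over the rows
# (ddof=1 matches torch's default unbiased std)
mean = x.mean(axis=0, keepdims=True)
std = x.std(axis=0, ddof=1, keepdims=True) + 1e-7  # epsilon guards against zero std

enc = (x - mean) / std   # what encodes() computes
dec = enc * std + mean   # what decodes() computes: inverts encode
```

Each encoded column then has (approximately) zero mean and unit std, independently of the other column.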
Now, this worked fine because I was dealing with only these DataFrame columns. But in my actual pipeline, I have an `ImageBlock` and these `RegressionBlock`s. For that, I'm using getters, and thanks to my SemanticTensors that too works fine for me (but only for a single block of data):
```python
def get_x(x): return f'{path}/{x.image_path}'
def get_age(x): return x.age
def get_parity(x): return x.parity
def get_y(x): return x.category

getters = [get_x, get_age, get_y]
```
Currently, I'm only working with `age` (I'll explain the reason), and the pipeline seems to work fine:
```python
def RegressionFBlock():
    return TransformBlock(type_tfms=[RegSetup()], batch_tfms=[NormalizeTfm(axes=0)])

dblock = DataBlock(blocks=(ImageBlock, RegressionFBlock, CategoryBlock),
                   getters=getters,
                   splitter=ColSplitter('is_val'),
                   item_tfms=Resize(size),
                   batch_tfms=[*aug_transforms(max_zoom=0, flip_vert=True)])
```
The normalize transform I'm using has to know which element of the batch tuple it operates on, which is a huge disadvantage for me:
```python
class NormalizeTfm(Transform):
    "Normalize/denorm batch of `TensorImage`"
    order = 99
    def __init__(self, mean=None, std=None, axes=(0,2,3)): self.mean,self.std,self.axes = mean,std,axes

    @classmethod
    def from_stats(cls, mean, std, dim=1, ndim=4, cuda=True): return cls(*broadcast_vec(dim, ndim, mean, std, cuda=cuda))

    def setups(self, dl:DataLoader):
        if self.mean is None or self.std is None:
            _,x,_ = dl.one_batch()  # hard-codes the position of the continuous tensor in the batch
            self.mean,self.std = x.mean(self.axes, keepdim=True), x.std(self.axes, keepdim=True)+1e-7

    def encodes(self, x:TensorContinuous): return (x-self.mean) / self.std
    def decodes(self, x:TensorContinuous):
        f = to_cpu if x.device.type=='cpu' else noop
        return (x*f(self.std) + f(self.mean))
```
Check the `setups` method of `NormalizeTfm`. Nevertheless, it's working, and the end of `dblock.summary(df)` is as follows:
```
Applying batch_tfms to the batch built
Pipeline: IntToFloatTensor -> AffineCoordTfm -> LightingTfm -> NormalizeTfm
starting from
  (TensorImage of size 4x3x224x224, TensorContinuous([43., 43., 43., 43.], device='cuda:0'), TensorCategory([2, 2, 2, 2], device='cuda:0'))
applying IntToFloatTensor gives
  (TensorImage of size 4x3x224x224, TensorContinuous([43., 43., 43., 43.], device='cuda:0'), TensorCategory([2, 2, 2, 2], device='cuda:0'))
applying AffineCoordTfm gives
  (TensorImage of size 4x3x224x224, TensorContinuous([43., 43., 43., 43.], device='cuda:0'), TensorCategory([2, 2, 2, 2], device='cuda:0'))
applying LightingTfm gives
  (TensorImage of size 4x3x224x224, TensorContinuous([43., 43., 43., 43.], device='cuda:0'), TensorCategory([2, 2, 2, 2], device='cuda:0'))
applying NormalizeTfm gives
  (TensorImage of size 4x3x224x224, TensorContinuous([0.0307, 0.0307, 0.0307, 0.0307], device='cuda:0'), TensorCategory([2, 2, 2, 2], device='cuda:0'))
```
Now the real problem is: how can I deal with multiple columns in the same pipeline, the way I could with `Pipeline` and `TfmdLists` above?
I tried modifying the getters to return a tuple/array:

```python
getters = [get_x, lambda x: (x.age, x.parity), get_y]
```

But this results in two `TensorContinuous` objects and sets up a dedicated pipeline for each, which is not what I want in my scenario.
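One direction that might work (an untested sketch, not confirmed against fastai's tuple-dispatch behavior): make the getter return a single array containing both columns, so the block receives one object per item rather than a tuple, and `RegSetup` would then produce a single two-element `TensorContinuous`:

```python
import numpy as np

# Hypothetical row object standing in for a DataFrame row with .age and .parity
class Row:
    def __init__(self, age, parity):
        self.age, self.parity = age, parity

# A single getter returning one array, so the TransformBlock sees one item,
# not a tuple that gets split into separate pipelines
def get_cont(x):
    return np.array([x.age, x.parity], dtype=np.float32)

sample = get_cont(Row(43, 1))
```

With this, `getters = [get_x, get_cont, get_y]` would keep a single `RegressionFBlock`, and `NormalizeTfm(axes=0)` would compute per-column stats over the batch dimension just like in the `TfmdLists` experiment.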
TL;DR

- How can I improve `NormalizeTfm` to work with my data?
- How can I pass multiple items from the `getters` to a single `TransformBlock`?
- Even if I need to create a separate block for each column, how can I normalize each one with its respective stats? Otherwise, my only option is to write yet another `NormalizeTfm` with a modified `setups` method for each column.