I am having trouble using my custom transform block more than once in a single DataBlock.
Here is my transform block:
def TextBlock(mode):
    if mode == 'bert':
        bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', max_length=20)
        tf_bert = TextTransform(bert_tokenizer, 20)
        return TransformBlock(type_tfms=tf_bert.encodes)
    elif mode == 'xlnet':
        xlnet_tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased', max_length=20)
        tf_xlnet = TextTransform(xlnet_tokenizer, 20)
        return TransformBlock(type_tfms=tf_xlnet.encodes)
Here is how I use it in my datablock:
dblock_full = DataBlock(
    blocks=(TextBlock(mode='bert'), TextBlock(mode='xlnet'), CategoryBlock),
    get_x=[ColReader('bert_text'), ColReader('xlnet_text')],
    get_y=ColReader('category'),
    splitter=RandomSplitter(valid_pct=0, seed=42),
)
The two ColReaders read from two different columns (pandas Series). I made sure these two columns are independent objects (i.e., not shallow copies of each other).
Issue:
Whichever transform block comes second in the blocks tuple gets applied twice, so both of my input columns end up transformed by the last block.
blocks=(TextBlock(mode='bert'), TextBlock(mode='xlnet')) produces the same effect as passing in
(TextBlock(mode='xlnet'), TextBlock(mode='xlnet')).
And if I reverse the order and use blocks=(TextBlock(mode='xlnet'), TextBlock(mode='bert')), the effect is the same as passing in (TextBlock(mode='bert'), TextBlock(mode='bert')).
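To make the symptom concrete, here is a small plain-Python sketch of what I expect versus what I seem to get. It does not use fastai or the real tokenizers: `fake_bert`, `fake_xlnet`, and `apply_blocks` are hypothetical stand-ins that only mimic where each model puts its special tokens, so treat it as an illustration of the behaviour, not of my actual pipeline.

```python
# Hypothetical stand-ins for the two encoders (NOT the real tokenizers).
# They only mimic the special-token layout of each model.
def fake_bert(text):
    return f"[CLS] {text} [SEP]"

def fake_xlnet(text):
    return f"{text} <sep> <cls>"

def apply_blocks(row, encoders):
    """Apply the i-th encoder to the i-th input column of a row."""
    return tuple(encode(value) for encode, value in zip(encoders, row))

row = ("some bert text", "some xlnet text")

# What I expect from blocks=(bert, xlnet): one encoder per column.
expected = apply_blocks(row, (fake_bert, fake_xlnet))

# What the output looks like instead: as if the last block
# had been used for both columns.
observed_like = apply_blocks(row, (fake_xlnet, fake_xlnet))

print(expected)
print(observed_like)
```

In my real DataBlock, the first column comes back looking like the second tuple above (XLNet-style) when I expected it to look like the first (BERT-style).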
What am I doing wrong here?