NLP Sentences Siamese Network

Hi all,

I am trying to build a sentence siamese network. I have done similar work in computer vision, but I have a very limited understanding of the TextItemList.

I have tried a quick hack solution by changing the data_loader, but I think the problem is the collate function… Since the data structure I have now is ([seq_1, seq_2], label), the collate function doesn’t know what to do with the list.

I am wondering if anyone has any input. I can simply override the pad_collate function for TextItemList, but the problem is I have no idea what kind of structure I should return…

A quick sketch of the code I have:

SiameseDataset just returns a data structure ([sentence_1, sentence_2], label).
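For illustration, a minimal version of such a wrapper might look like this (a simplified sketch only — the pairing logic here is just illustrative, and it assumes fastai v1 behaviour where indexing a LabelList returns an (item, label) pair and Text.data holds the numericalized ids):

import random
from torch.utils.data import Dataset

class SiameseDataset(Dataset):
    "Yield ([sentence_1, sentence_2], same_label) pairs from a fastai LabelList."
    def __init__(self, label_list):
        self.ll = label_list

    def __len__(self):
        return len(self.ll)

    def __getitem__(self, idx):
        x1, y1 = self.ll[idx]
        # draw a random partner; in practice you would balance positive
        # (same class) and negative (different class) pairs
        x2, y2 = self.ll[random.randrange(len(self.ll))]
        return [x1.data, x2.data], int(y1.obj == y2.obj)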

data_siamese = (TextList
    .from_df(df=df_1, path=path, cols='Body', vocab=data_lm.vocab)
    .split_by_rand_pct(0.3, seed=42)
    .label_from_df(cols='Label')
)

train_dl = DataLoader(
    dataset=SiameseDataset(data_siamese.train),
    batch_size=16,
    shuffle=True,
    num_workers=0)

valid_dl = DataLoader(
    dataset=SiameseDataset(data_siamese.valid),
    batch_size=16,
    num_workers=0)

data_siamese = TextClasDataBunch(train_dl, valid_dl)

When I call x,y = next(iter(data_siamese.train_dl)), the code crashes and gives me:

invalid argument 0: Sizes of tensors must match except in dimension 0

I think the problem is pad_collate… the sentences are all different lengths after numericalize…
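For what it’s worth, the kind of override I have in mind (an untested sketch — I am not sure this is the structure fastai expects downstream) would pad each side of the pair to its own max length:

import torch

def siamese_pad_collate(samples, pad_idx=1, pad_first=True):
    "Collate ([seq_1, seq_2], label) samples, padding each side separately."
    def pad_stack(seqs):
        max_len = max(len(s) for s in seqs)
        res = torch.full((len(seqs), max_len), pad_idx, dtype=torch.long)
        for i, s in enumerate(seqs):
            s = torch.as_tensor(s, dtype=torch.long)
            if pad_first: res[i, -len(s):] = s
            else:         res[i, :len(s)] = s
        return res
    x1 = pad_stack([s[0][0] for s in samples])  # first sentences of the batch
    x2 = pad_stack([s[0][1] for s in samples])  # second sentences of the batch
    ys = torch.tensor([s[1] for s in samples])
    return (x1, x2), ys

The idea would then be to pass collate_fn=siamese_pad_collate to both DataLoaders above.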

Any inputs are highly appreciated…

I am not sure if this will help you, but @brian made a siamese network using fast.ai v0.7: https://github.com/briandw/SiameseULMFiT. I haven’t been successful in getting it to work with v1.x, but I was going to try to recreate it in 2.0.

If you get it working I would love to see it!


Thanks for the input.

I will see if I can make it work… hopefully v2 will have an easier way to hack around…

I hope so too. If I get it working, I will let you know.

I’m looking to get back into Fastai and see about updating that Siamese model. @Daniel.R.Armstrong did you make some progress?

@brian I ran into a few roadblocks, so I decided to wait till V2 was ready to give it a try. I will let you know if I get it working.


@Daniel.R.Armstrong I’ve also started working on this problem using the Fastai V2 library by modeling off of the Siamese tutorial: https://github.com/fastai/fastai2/blob/master/nbs/24_tutorial.siamese.ipynb

Have you had any progress on this? This is what I’ve tried:

First, generating a TextTuple and a TextTupleBlock similar to the ones in the siamese tutorial:

class TextTuple(Tuple):
    @classmethod
    def create(cls, texts): return cls(tuple(text for text in texts))

def TextTupleBlock(tok_tfm):
    return TransformBlock(
        type_tfms=[TextTuple.create, tok_tfm]#, Numericalize(None)]
    )

And then preparing the data using the DataBlock API:

def get_df(t): return t

def get_x(t): return (t['premise'], t['hypothesis'])
def get_y(t): return t['label']

tok_tfm = SpacyTokenizer()
siamese = DataBlock(
    blocks = (TextTupleBlock(tok_tfm), CategoryBlock),
    get_items = get_df,
    get_x = get_x, get_y = get_y,
    splitter=RandomSplitter(),
)
dls = siamese.dataloaders(df_all.head(50))

But it seems that the tokenizer is not correctly tokenizing the two texts, because when I inspect the data it appears to just be split into individual characters.

Any help would be greatly appreciated!

@ncoop57 I haven’t had the chance to spend any time on it, so I don’t think I will be much help to you. I am assuming you have looked at @brian’s 0.7 notebook to see how he did tokenization; the only thing I can think of is doing the tokenization first. I ended up just comparing the second-to-last layer of my multi-label classification models: I find similar texts by using annoy to narrow down the potential results, then use a binary classifier to find my matches. I am sure a Siamese network would give me better vector representations, but my current method is working well enough at the moment. That being said, it is still on my list of things to do. I really hope you have success getting it working.
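For anyone curious, that retrieval pipeline looks roughly like this (a sketch only — corpus, get_embedding and pair_classifier are placeholders for my own data and models, not fastai APIs):

from annoy import AnnoyIndex

emb_dim = 400                               # size of the penultimate-layer vector (placeholder)
index = AnnoyIndex(emb_dim, 'angular')

# 1. Index every document by its classifier embedding.
for i, text in enumerate(corpus):
    index.add_item(i, get_embedding(text))  # penultimate-layer activations
index.build(50)                             # number of trees

# 2. For a query, use annoy to shortlist candidate neighbours...
query_vec = get_embedding(query_text)
candidate_ids = index.get_nns_by_vector(query_vec, 100)

# 3. ...then let the binary classifier decide which candidates are real matches.
matches = [i for i in candidate_ids
           if pair_classifier(query_text, corpus[i]) > 0.5]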
