I am building a “word2vec”-style model over products on an e-commerce store, similar to the model Airbnb used for listings. The dataset doesn’t have any labels per se, since there is a central input, context inputs, and global context inputs, all of which change as they are passed into an embedding.
The first 3 observations of my dataset look like this:
[ [24702, [0, 0, 11665, 24702], ], [11665, [0, 24702, 24702, 0], ], [24702, [24702, 11665, 0, 0], ] ]
The first item is the input, the second list holds the context items, and the last list (there can be multiple) holds the global context items.
I’ve been having problems with the DataLoader: either the fit function for a learner expects a y batch,

for xb,yb in progress_bar(data.train_dl, parent=pbar):

or the DataLoader attempts to collate a list of inputs that aren’t necessarily the same size:

batch = self.collate_fn([self.dataset[i] for i in indices])
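For the collation problem, the kind of thing I have in mind is a custom collate_fn that pads the variable-length global-context lists and returns a dummy y so the fit loop still gets an (xb, yb) pair. A rough sketch in plain PyTorch (assuming, on my part, that the global context for each observation can be flattened to a single variable-length list and that index 0 is safe to use as padding):

```python
import torch

PAD_ID = 0  # assumption: index 0 is a safe padding index for the embedding

def collate_triples(samples):
    """Collate (input, context, global_context) samples where the
    global-context lists can differ in length. Pads the global lists
    to the longest in the batch and returns a dummy y for fit()."""
    inputs = torch.tensor([s[0] for s in samples], dtype=torch.long)
    contexts = torch.tensor([s[1] for s in samples], dtype=torch.long)
    max_len = max(len(s[2]) for s in samples)
    globals_ = torch.tensor(
        [list(s[2]) + [PAD_ID] * (max_len - len(s[2])) for s in samples],
        dtype=torch.long)
    dummy_y = torch.zeros(len(samples), dtype=torch.long)  # throwaway label
    return (inputs, contexts, globals_), dummy_y
```

That at least gets batches of a consistent shape, but it feels like I’m working around the library rather than with it.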
Is there an example I could go off of for this “unsupervised” approach? I was able to get the model to work by removing the global context input and treating the input item as xb and the context items as yb, but I would like to test out this other version and find a generalizable solution for having many inputs without requiring a label.