Creating Dataloaders for classification task with tabular

teeth2i4 · November 29, 2020, 4:26pm

Lesson 8 (Collaborative filtering) asks us to:

create a model for MovieLens that works with cross-entropy loss

nn.CrossEntropyLoss needs targets with dtype torch.int64, shape [batch_size], and values in the range 0 to n-1.

Creating dataloaders with CollabDataLoaders.from_df() produces targets as a tensor of shape [batch_size, 1] with dtype torch.int8 (even though the column was int64 in the pandas DataFrame!), and leaves the range of values unchanged: 1 to 5.
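To make the mismatch concrete, here is a minimal sketch in plain PyTorch. The tensors are made-up stand-ins for a real batch (no fastai dataloader is involved), and the manual conversion is only one possible way to reconcile the two formats:

```python
import torch
import torch.nn as nn

batch_size, n_classes = 4, 5
logits = torch.randn(batch_size, n_classes)  # made-up model outputs

# Made-up targets shaped like the ones described above:
# shape [batch_size, 1], dtype int8, values in 1..5
raw_targets = torch.tensor([[1], [5], [3], [2]], dtype=torch.int8)

loss_fn = nn.CrossEntropyLoss()

# Passing the raw targets directly is expected to be rejected
try:
    loss_fn(logits, raw_targets)
    rejected = False
except Exception:
    rejected = True

# Drop the extra dimension, cast to int64, and shift 1..5 down to 0..4
targets = raw_targets.squeeze(1).long() - 1

loss = loss_fn(logits, targets)
print(rejected, targets.shape, targets.dtype)
```

The shift by one matters only when the ratings are stored as 1 to 5; targets that already run from 0 to n-1 need just the reshape and cast.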

How can I tell .from_df() to process my target values in the required fashion?

I also tried TabularPandas(df, [Categorify, FillMissing, Normalize], ['user', 'movie'], y_names="rating", y_block=CategoryBlock), but dls.one_batch() on the result produced 3 tensors with shapes [batch_size, 2], [batch_size, 0] (!?), and [batch_size, 1]. That didn’t help much, but at least the target values ran from 0 to n-1.

Thanks in advance.

hubwoy · December 6, 2020, 10:50am

Hey, have you found a solution yourself?

teeth2i4 · December 6, 2020, 1:03pm

@hubwoy I haven’t found a solution yet, but will try to tackle this problem again after I’m done with Lesson 9 exercises

muellerzr · December 6, 2020, 2:19pm

Try doing this:

        user_name   = ifnone(user_name,   ratings.columns[0])
        item_name   = ifnone(item_name,   ratings.columns[1])
        rating_name = ifnone(rating_name, ratings.columns[2])
        cat_names = [user_name,item_name]
        splits = RandomSplitter(valid_pct=valid_pct, seed=seed)(range_of(ratings))
        to = TabularCollab(ratings, [Categorify], cat_names, y_names=[rating_name], y_block=TransformBlock(), splits=splits, reduce_memory=False)
        dls = to.dataloaders(path=path)

(This is the entire body of from_df, so you can see how TabularPandas is used here. It goes through TabularCollab, a subclass that sets with_cont to False.)

teeth2i4 · December 11, 2020, 5:18pm

Thanks for answering, but the suggested code doesn’t change that. After

        _, y = dls.one_batch()

y.shape is still [batch_size, 1]. I need y.shape to be [batch_size] for nn.CrossEntropyLoss.

teeth2i4 · December 21, 2020, 11:04am

While I still don’t know why creating a TabularCollab for classification like this:

        cat_names = ['user', 'movie']
        splits = RandomSplitter()(range_of(ratings))
        to = TabularCollab(ratings, [Categorify], cat_names, y_names=['rating'], y_block=CategoryBlock, splits=splits, reduce_memory=False)
        dls = to.dataloaders(path=path)

produces a y.shape of [batch_size, 1] and not [batch_size],

I realized I could just reshape y inside a custom loss function, like this:

        def loss_function(inp, target):
            # squeeze [batch_size, 1] to [batch_size] and cast to int64
            return F.cross_entropy(inp, target.squeeze(1).long())

and then pass this function to the Learner:

        learn = Learner(dls, model, loss_func=loss_function)

That works!

Yorick · March 3, 2021, 9:18am

Hi!

Could you share the notebook with your solution?
Did it actually learn to distinguish the categories, or did it just get through the training cycles?
Could you take a look at my notebook here and maybe advise something?

teeth2i4 · March 3, 2021, 11:14am

You can find my notebook here:
https://github.com/saint-angels/fastbook/blob/master/08_exercises.ipynb

The results are trash, but I remember reading that this is a bad approach for this problem, so trash results are expected.

Yorick · March 3, 2021, 6:53pm

The very strange thing about this is that preds[0].sum() does not equal 1. I just uploaded your notebook to my VM and it shows the same.

In chapter 5 (Pet Breeds) the task was pretty much the same, and there we had preds[0].sum() == 1. I mean, the neural net doesn’t care whether its input is images or embedding vectors; it’s just numbers. I think I did something wrong in my implementation.
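One possible explanation, offered as an assumption and not confirmed anywhere in this thread: when the loss is a plain Python function, nothing tells the Learner which activation to apply, so the predictions come back as raw scores (logits) and not as probabilities. The sketch below uses made-up logits in plain PyTorch to show the difference a softmax makes:

```python
import torch

torch.manual_seed(0)

# Made-up raw model outputs: 3 samples, 5 rating classes
logits = torch.randn(3, 5)

# Softmax over the class dimension turns scores into probabilities
probs = torch.softmax(logits, dim=1)

print(logits[0].sum())  # an arbitrary number, in general not 1
print(probs[0].sum())   # 1, up to floating point error
```

If this is what happens in the notebook, applying torch.softmax to the predictions by hand should make each row sum to 1.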

@muellerzr perhaps you could comment on this, please?

Yorick · March 10, 2021, 10:55am

Turns out, the fastai library has loss functions that accept tensors of any shape by flattening them.

They’re called MSELossFlat, CrossEntropyLossFlat, etc.
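As a rough illustration of what "flattening" means here, the following plain PyTorch sketch reshapes the inputs and targets before calling the standard cross-entropy. The helper name flat_cross_entropy is hypothetical and this is not fastai's actual implementation, only an approximation of the idea:

```python
import torch
import torch.nn.functional as F

def flat_cross_entropy(inp, target):
    """Hypothetical helper: flatten, then apply cross-entropy."""
    inp = inp.reshape(-1, inp.shape[-1])  # [..., n_classes] -> [N, n_classes]
    target = target.reshape(-1).long()    # any shape -> [N], int64
    return F.cross_entropy(inp, target)

logits = torch.randn(4, 5)                        # made-up model outputs
targets_2d = torch.tensor([[0], [4], [2], [1]])   # shape [batch_size, 1]
targets_1d = targets_2d.squeeze(1)                # shape [batch_size]

loss_flat = flat_cross_entropy(logits, targets_2d)
loss_ref = F.cross_entropy(logits, targets_1d)
print(torch.isclose(loss_flat, loss_ref))
```

In fastai itself, the built-in class would be passed to the Learner directly, for example Learner(dls, model, loss_func=CrossEntropyLossFlat()), which removes the need for a hand-written wrapper.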