Purpose of zip and list in creating the dataset (Chapter 4, MNIST)

Hi there, how are you doing?

I’m going through chapter 4 at the moment. I’m a bit confused about the use of list and zip in creating the dataset.

dset = list(zip(train_x,train_y))
x,y = dset[0]
x.shape,y

I’m not sure what zip and list do here, and especially why x,y is set to dset[0].

I hope you can help me clear up this confusion.

Thank you

Hi @toannguyen,

The zip function aggregates the iterables you pass to it: the first item of train_x and the first item of train_y are paired together in a tuple, then the second items, and so on.
It gives a structure like [(train_x[0], train_y[0]), (train_x[1], train_y[1]), ...].
The list function then transforms this zip object (which is an iterator) into a list!

So we can then get the first item of this list, which will be the first pair, where x is the first item of train_x and y the first item of train_y.
I don’t have all the context here, but I guess it’s just to check their dimensions & values 🙂
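
To make the pairing concrete, here is a minimal sketch using plain Python lists as stand-ins for the tensors. The toy values are made up for illustration and are not the book's data; with real tensors the idea is the same, since zip walks over the first dimension of each.

```python
# Toy stand-ins for train_x (images) and train_y (labels); values are made up.
train_x = [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5]]
train_y = [[0], [0], [1]]

# zip returns a lazy iterator of (x, y) tuples; it cannot be indexed directly.
pairs = zip(train_x, train_y)
try:
    pairs[0]
    indexable = True
except TypeError:
    indexable = False

# list() consumes the iterator and stores every pair, so indexing works.
dset = list(zip(train_x, train_y))

# dset[0] is a 2-tuple, so it can be unpacked into two names at once.
x, y = dset[0]
print(indexable, len(dset), x, y)
```

So list is there to make the pairs indexable, and x,y = dset[0] is ordinary tuple unpacking of the first pair.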

Hi @dway8 ,

Thank you so much for your detailed answer. I really couldn’t have asked for a better one.

I have a follow-up question, if any of you see this: why are we checking x.shape, but then simply y?

I’m not sure if it’s an issue yet, but when I try to bring in the data for the full MNIST dataset, x.shape,y returns “(torch.Size([784]), tensor([0]))” (as opposed to tensor([1])), even though the train_x and train_y I pass into zip have shapes [60000, 784] and [60000, 1], which I thought was correct.

I’m not sure why they chose to check only the shape of x and not of y. I don’t think there’s any reason not to check both; it’s just a matter of choice.

Regarding your second question: if x.shape,y gives you (torch.Size([784]), tensor([0])), that looks fine. The second value of the tuple, tensor([0]), tells you that the y value is the digit 0.

For y.shape you should get torch.Size([1]).

Remember from @dway8’s explanation: x,y = dset[0] returns the first item of the dataset, which is the first (x, y) pair.

Querying dset[1] up to dset[5922] will still return 0 for y, because the 5923 images of the digit 0 are stacked first in the tensor.

After x,y = dset[5923], x.shape,y returns (torch.Size([784]), tensor([1])).
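
The same effect can be reproduced on a small scale. This is only a sketch with plain lists and hypothetical counts (3 zeros and 2 ones instead of the 5923 zeros mentioned above), assuming the labels are stacked digit by digit:

```python
# Hypothetical class counts for illustration; the real data has 5923 zeros.
n_zeros, n_ones = 3, 2

# Labels stacked class by class: all the 0s first, then the 1s.
train_y = [[0] for _ in range(n_zeros)] + [[1] for _ in range(n_ones)]
# Dummy 4-pixel "images", one per label.
train_x = [[0.0] * 4 for _ in train_y]

dset = list(zip(train_x, train_y))

# The first label equal to 1 sits right after the last 0,
# i.e. at index n_zeros (indices 0 .. n_zeros-1 are all 0).
labels = [y for _, y in dset]
first_one = labels.index([1])
print(first_one, dset[first_one - 1][1], dset[first_one][1])
```

With 5923 zeros stacked first, the same reasoning puts the first 1 at index 5923.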