# Purpose of zip and list in creating dataset Chapter 4 MNIST

Hi there, how are you doing?

I’m going through chapter 4 at the moment. I’m a bit confused about the use of `list` and `zip` in creating the dataset.

dset = list(zip(train_x,train_y))
x,y = dset[0]
x.shape,y

I’m not sure what `zip` and `list` do here, and especially why `x,y` is set to `dset[0]`.

I hope you could help me crack this confusion.

Thank you

Hi @toannguyen,

The `zip` function aggregates the iterables you pass to it: the first item of `train_x` and the first item of `train_y` are paired together in a tuple, and so on.
It gives a structure like `[(train_x[0], train_y[0]), (train_x[1], train_y[1]), ...]`.
The `list` function then transforms this zip object (which is an iterator) into a list!

So we can then get the first item of this list, which will be the first pair, where `x` is the first item of `train_x` and `y` the first item of `train_y`.
I don’t have all the context here, but I guess it’s just to check their dimensions and values.
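
To make the pairing concrete, here is a minimal sketch with plain Python lists standing in for `train_x` and `train_y` (the values are made up for illustration and are not the MNIST tensors):

```python
# Toy stand-ins for train_x and train_y (made-up values, not MNIST data)
train_x = ["img0", "img1", "img2"]
train_y = [0, 0, 1]

pairs = zip(train_x, train_y)   # a lazy iterator; it cannot be indexed
dset = list(pairs)              # materialise the pairs so dset[i] works

x, y = dset[0]                  # tuple unpacking of the first (x, y) pair
print(dset)   # [('img0', 0), ('img1', 0), ('img2', 1)]
print(x, y)   # img0 0
```

Without the `list(...)` call, `dset[0]` would raise a `TypeError`, because a zip object does not support indexing.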

Hi @dway8 ,

Thank you so much for your detailed answer. I really couldn’t have asked for a better one.

I have a follow-up question, if any of you see this: why are we checking `x.shape`, but then simply `y`?

I’m not sure if it’s an issue yet, but when I try to bring in the data for the full MNIST dataset, my x.shape,y is returning “(torch.Size([784]), tensor([0]))” (as opposed to tensor([1]) ) – even though my train_x and train_y passed into zip are of size [60,000,784] and [60,000,1], which I thought was correct.

I’m not sure why they chose to check only the shape of `x` and not of `y`. I don’t think there’s any reason not to check both; it’s just a matter of choice.

Regarding your second question, for `x.shape, y` if you are getting `(torch.Size([784]), tensor([0]))` that looks fine—the second value of the tuple `tensor([0])` is telling you that the `y` value is the digit `0`.

For `y.shape` you should get `torch.Size([1])`.
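
Here is a small sketch of that check, assuming PyTorch is installed. The tensors are made-up stand-ins (6 rows instead of 60,000) that only mimic the `[N, 784]` and `[N, 1]` layout described above:

```python
import torch

# Made-up stand-ins with the same layout as in the thread: one row per image
train_x = torch.zeros(6, 784)                              # 6 flattened 28x28 images
train_y = torch.tensor([0, 0, 0, 1, 1, 1]).unsqueeze(1)    # labels, shape [6, 1]

dset = list(zip(train_x, train_y))
x, y = dset[0]

print(x.shape, y.shape)  # torch.Size([784]) torch.Size([1])
print(y)                 # tensor([0])
```

Iterating over a 2D tensor yields its rows, so each `x` is one flattened image and each `y` is a one-element tensor holding the label.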

Remember from @dway8’s explanation: `x,y = dset[0]` returns the first item of the dataset, which is the first pair.

Querying `dset[1]` up to `dset[5922]` will still return a `y` of 0, because the 5923 zero labels are stacked first in the tensor.

After `x,y = dset[5923]`, `x.shape,y` returns `(torch.Size([784]), tensor([1]))`.
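
The stacking can be sketched with plain Python lists. The 5923 zeros come from the count mentioned above; the number of ones and the placeholder "images" are arbitrary choices made only for this illustration:

```python
# 5923 zeros first (the count mentioned above), followed by some ones.
# The number of ones here is arbitrary, chosen only for the sketch.
labels = [0] * 5923 + [1] * 10
items = list(range(len(labels)))   # placeholder "images"

dset = list(zip(items, labels))

print(dset[0][1], dset[5922][1])  # 0 0
print(dset[5923][1])              # 1
print(labels.index(1))            # 5923
```

Because indexing starts at 0, the 5923 zeros occupy indices 0 to 5922, and index 5923 is the first label of the next digit.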