Wrong dataset leading to "weird" result

toannguyen · October 28, 2022, 3:28am

Hi there, how are you doing?

I just finished Lesson 4 (MNIST) and I’m going through it again, doing things a bit differently. Instead of using MNIST_Sample, I used MNIST. I noticed that there is only training data in this set, so I decided to split it into training and validation sets.

However, I think my dataset is wrong: the result I got (using the optimized Learner module) starts at 100% accuracy and then degrades.

I hope you can take a look at my code and help me debug this problem.

Eight_path = (path/'training/8').ls().sorted()
Five_path = (path/'training/5').ls().sorted()

Eight_list = [tensor(Image.open(i)) for i in Eight_path]
Five_list = [tensor(Image.open(i)) for i in Five_path]

Eight_tensor = torch.stack(Eight_list).float()/255
Five_tensor = torch.stack(Five_list).float()/255

Eight_tensor = Eight_tensor.view(-1,28*28)
Five_tensor = Five_tensor.view(-1,28*28)

Eight_label = tensor([1]*len(Eight_tensor))
Five_label = tensor([0]*len(Five_tensor))

Data = torch.cat([Eight_tensor,Five_tensor])
Data_label = torch.cat([Eight_label, Five_label]).unsqueeze(1)

def Data_split(Dataset, ratio):
    Test_length = round(len(Dataset)*0.8)
    Train_length = len(Dataset) - Test_length
    return Dataset[0:Test_length-1], Dataset[Test_length-1:]

SplitRatio = 0.8
Data_train, Data_valid = Data_split(Data, 0.8)
Label_train, Label_valid = Data_split(Data_label,0.8)

Train_dset = list(zip(Data_train, Label_train))
Valid_dset = list(zip(Data_valid, Label_valid))

Train_MiniBatch = DataLoader(Train_dset, batch_size = 256)
Valid_MiniBatch = DataLoader(Valid_dset, batch_size = 256)

dls = DataLoaders(Train_MiniBatch, Valid_MiniBatch)

learn = Learner(dls, nn.Linear(28*28,1), opt_func=SGD,
                loss_func=loss_fx, metrics=batch_accuracy)

benkarr · October 28, 2022, 9:44am

Hey
Right now you are splitting your data by cutting the lists of images/labels at a certain point. That list is generated by concatenating the 8s and the 5s, so roughly the first half is 8s. If you do an 80/20 split, this means that none of your 8s end up in the validation set and the 5s are underrepresented in training.
You could zip images and labels first, shuffle that list, pass the result to Data_split (btw, the ratio is hard-coded there right now) and check if that produces a better result.
Hope that resolves the issue!
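
A minimal sketch of the shuffle-then-split idea, in plain Python so it runs without the dataset. The function name `shuffled_split`, the `seed` argument, and the toy data (strings standing in for image tensors, 60 "eights" followed by 40 "fives") are assumptions made for illustration and are not part of the original code.

```python
import random

def shuffled_split(samples, labels, ratio=0.8, seed=42):
    """Pair each sample with its label, shuffle the pairs, then cut at `ratio`."""
    pairs = list(zip(samples, labels))
    random.Random(seed).shuffle(pairs)  # local RNG keeps the split reproducible
    cut = round(len(pairs) * ratio)
    return pairs[:cut], pairs[cut:]

# Toy stand-in for the real data: 60 "eights" (label 1), then 40 "fives" (label 0).
samples = [f"img_{i}" for i in range(100)]
labels = [1] * 60 + [0] * 40

# Cutting the concatenated list without shuffling: the tail holds only fives.
pairs = list(zip(samples, labels))
cut = round(len(pairs) * 0.8)
seq_train, seq_valid = pairs[:cut], pairs[cut:]
print({label for _, label in seq_valid})  # only label 0 is left for validation

# Shuffling the (sample, label) pairs first mixes both classes into each part.
train, valid = shuffled_split(samples, labels, ratio=0.8)
print(len(train), len(valid))
```

Because images and labels are zipped before shuffling, each image keeps its own label. With real tensors the same effect could be had by indexing both tensors with one shared random permutation.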

Edit:

The ‘validation’ folder is called testing, e.g. path/'testing'/'8' contains the validation 8s. You don’t need to shuffle the data if you use that validation set, but it’s still not a bad idea.

toannguyen · October 28, 2022, 3:03pm

Dear @benkarr

Thank you so much for your help.

I did not realize that I was cutting the dataset after concatenating the two classes, which effectively left more “8”s and fewer “5”s in the training set, as you mentioned.

I fixed it and it works now.