Shuffling the DataLoader causes AUC-ROC to drop

Hi,

I'm playing with the Adult dataset (UCI Repository), and I'm running into something I can't explain.

This is my model

import torch
import torch.nn as nn
from torch.optim import Optimizer
from torch.utils.data import DataLoader

class UciAdultsClassifier(nn.Module):
    def __init__(self, q_continius_features:int, q_categorical_features:int, embedding_dims:list):
        super(UciAdultsClassifier, self).__init__()
        
        embedding_sizes = sum([embedding_size for _, embedding_size in embedding_dims])
        
        self.embeddings_layer=nn.ModuleList(
            [nn.Embedding(vocabulary_size, embedding_size) for vocabulary_size, embedding_size in embedding_dims]
        )
        
        self.embedding_dropout = nn.Dropout(0.6)
        
        self.layer1=nn.Sequential(
            nn.Linear(embedding_sizes + q_continius_features, 128),
            nn.ReLU(),
            nn.BatchNorm1d(128),            
        )
    
        self.layer2=nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.BatchNorm1d(64),            
        )
        
        self.layer3=nn.Sequential(
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.BatchNorm1d(32),            
        )
        
        self.output=nn.Sequential(
            nn.Linear(32, 1),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.BatchNorm1d(1),
            nn.Sigmoid()
        )
        
    def forward(self, continius_features, categorical_features):
        embeds = [emb_layer(categorical_features[:, i]) for i, emb_layer in enumerate(self.embeddings_layer)] 
        embeds = torch.cat(embeds, 1)
        
        x = self.embedding_dropout(embeds)
        x = torch.cat([x, continius_features], 1)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        
        return self.output(x)
    
    def fit(self, train_dl:DataLoader, epochs:int, opt:Optimizer, loss_fn:any) -> list:
        self.train()
        losses = []
        for i in range(epochs):
            for x_continius, x_categorical, y in train_dl:
                y_pred = self.forward(x_continius, x_categorical)
                loss = loss_fn(y_pred, y)
                losses.append(loss.item())

                loss.backward()
                opt.step()
                opt.zero_grad()
        return losses
    
    def predict(self, data_loader:DataLoader) -> torch.Tensor:
        self.eval()
        predictions = []
        with torch.no_grad():
            for x_continius, x_categorical, y in data_loader:
                preds = self.forward(x_continius, x_categorical)
                predictions.append(preds)
        return torch.cat(predictions)

My Dataloaders

train_dl = DataLoader(train_ds, batch_size=1000, shuffle=False)
test_dl = DataLoader(test_ds, batch_size=1000, shuffle=False)

My settings:

model = UciAdultsClassifier(q_continius_features=q_continius_columns, q_categorical_features=q_categorical_columns, embedding_dims=embedding_dims)
optimizer = optim.Adam(model.parameters(), lr=1e-2)
bceloss_fn = nn.BCELoss(reduction='mean')
epochs=100

losses = model.fit(train_dl=train_dl, epochs=epochs, loss_fn=bceloss_fn, opt=optimizer)

My metrics
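
Roughly, they are computed like this (a sketch, not the exact notebook cells; y_train and y_test are recovered from the datasets via reverse_transform):

from sklearn.metrics import roc_auc_score

# predictions come back in whatever order the DataLoader yields the samples
train_preds = model.predict(train_dl).numpy().squeeze()
test_preds = model.predict(test_dl).numpy().squeeze()

# labels in the datasets' stored (unshuffled) order
_, _, y_train = train_ds.reverse_transform()
_, _, y_test = test_ds.reverse_transform()

print("train AUC-ROC:", roc_auc_score(y_train, train_preds))
print("test AUC-ROC:", roc_auc_score(y_test, test_preds))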

When I change the DataLoaders to this (shuffle=True for training):

train_dl = DataLoader(train_ds, batch_size=1000, shuffle=True)
test_dl = DataLoader(test_ds, batch_size=1000, shuffle=False)

My training AUC-ROC drops to about 50%.

Why is this happening? Let me know if you want to check the notebook.

Best regards

Hey @jonmunm, this is interesting. Can you share how train_ds and test_ds are created?

Sure …

I created a custom Dataset to return the continuous features and the categorical features separately:

import numpy as np
import torch
from torch.utils.data import Dataset

class TabularDataset(Dataset):
    def __init__(self, continius_features:np.ndarray, categorical_features:np.ndarray, y:np.ndarray, normalize=True):
        if normalize:
            continius_features = (continius_features - continius_features.mean(axis=0)) / continius_features.std(axis=0)
        
        self.continius_features = torch.tensor(continius_features, dtype=torch.float)
        self.categorical_features = torch.tensor(categorical_features, dtype=torch.long)
        y = torch.tensor(y, dtype=torch.float)
        self.y = torch.reshape(y, (-1, 1))

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        sample = self.continius_features[idx], self.categorical_features[idx], self.y[idx]
        return sample
    
    def reverse_transform(self):
        return self.continius_features.numpy().squeeze(), self.categorical_features.numpy().squeeze(), self.y.numpy().squeeze()

The continuous features were scaled using scikit-learn, and the categorical variables were encoded with LabelEncoder.
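
Roughly like this (a sketch; adult_df, the column lists, and the target handling are placeholders, not the exact notebook code):

import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# z-score the continuous columns (placeholder column lists)
scaler = StandardScaler()
continius_ndarray = scaler.fit_transform(adult_df[continuous_columns])

# integer-encode each categorical column
categorical_ndarray = np.stack(
    [LabelEncoder().fit_transform(adult_df[col]) for col in categorical_columns],
    axis=1,
)

# binary target (the exact column name / label strings are assumptions)
y_ndarray = (adult_df["income"] == ">50K").astype(int).to_numpy()

# normalize=False because the continuous features were already scaled above
train_ds = TabularDataset(continius_ndarray, categorical_ndarray, y_ndarray, normalize=False)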

And then …

Scaling the whole dataset before the split does give some leakage, but that's not it. If you also scale the targets on the whole DataFrame, and it happens to be sorted in a specific way, that can matter. If you can share the notebook I would love to dig in. Also, is it only the training AUC that goes down?

mmmm … scaling the target is something I’ve never heard about …

Here is the notebook … thanks for your time @micstan

Thanks! Yes, it was just hard to follow where y_ndarray comes from. If it is continuous and lives in adult_train_df, I believe you scale it too. Is it only the training metric that goes down after the shuffle?

Sorry, cannot see the notebook

You have to log in to Paperspace Gradient and then you can run it. In the left-pane explorer you can see the folders and files; that way you can also view a static version of the notebook …

Changing to

train_dl = DataLoader(train_ds, batch_size=1000, shuffle=True)
test_dl = DataLoader(test_ds, batch_size=1000, shuffle=False)

Caused this

The training AUC-ROC is rather bad … but the test metrics aren't bad at all … Still, the training AUC-ROC is what catches my attention …

Let me know whether you can access it or not … so I can try to share it again …

It is still provisioning, so I'm not sure I will be able to check now. I think the model is trained correctly and the test results are good (in the last one you've put accuracy_train instead of test; judging by the AUC, I think it is fine). It is just that the prediction on the train set looks fishy, so my guess would be that the order/indexes of train_dl became misaligned with your y_train. Not sure how; I'll try later.
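
One quick way to check whether the loader order is the issue (a sketch; it just reuses your train_ds):

# with shuffle=True, every fresh iterator draws a new random permutation
dl = DataLoader(train_ds, batch_size=5, shuffle=True)

# grab the first batch twice, from two fresh iterators
x1, _, _ = next(iter(dl))
x2, _, _ = next(iter(dl))

# these will almost surely be different rows
print(x1[:, 0])
print(x2[:, 0])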

Yes … you're right … even though the metrics didn't change a bit

I'd appreciate it if you can check it later …

Best regards

Sorry @jonmunm, maybe I'm doing something wrong but I only get linear regression notebooks :confused: (also not sure if the y_train you compare against is somehow different from the Y_train created on the split)

Sorry my friend … I restarted the instance to delete Paperspace’s cache … Now you can see the correct files

Thanks, and best regards

I got it … yes … you’re right. In principle, it should be the same, since y_train comes from the dataset via a reverse transformation …

OK, so my understanding is that a DataLoader with shuffle=True will randomly reshuffle every time you iterate over it. During training it doesn't hurt you because y is shuffled together with x. Unfortunately it also reshuffles when you pass it to your predict, so the predictions no longer line up with y_train. I'm sure there should be an elegant way to prevent it on validation, similar to model.eval(), but I don't know it. What worked for me was to set a seed, torch.manual_seed(42), and recreate y_train from the loader in the same order (as below). I guess a more elegant way is to create two train loaders, one with shuffle for training and one without for prediction only :man_shrugging:

[image: code snippet setting torch.manual_seed(42) and rebuilding y_train from the loader]
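
In other words, something like this is what I mean by the two-loader approach (a sketch, reusing your existing names; train_eval_dl is just a placeholder name):

# shuffled loader used only for the optimisation steps
train_dl = DataLoader(train_ds, batch_size=1000, shuffle=True)
# fixed-order loader used only for scoring the train set
train_eval_dl = DataLoader(train_ds, batch_size=1000, shuffle=False)

losses = model.fit(train_dl=train_dl, epochs=epochs, loss_fn=bceloss_fn, opt=optimizer)

# predictions now come back in the dataset's original order, aligned with y_train
train_preds = model.predict(train_eval_dl)

That way the shuffle only affects the gradient steps, and the prediction order stays aligned with the labels.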

I see … I hadn't taken into account that I was using the shuffled DataLoader for prediction … yes, you're completely right, since it scrambles the prediction order w.r.t. the ground truth …

Yes, two DataLoaders could do it (one for training and another for prediction) …
It's a good design question you posed …

I'll do some tests in a couple more hours based on what you told me …

My friend, you don’t know how much I appreciate the time you have spent helping me … best regards

Yes, you're right. Creating a fresh DataLoader without shuffle makes everything work …

Thank you my friend