Facing a weird problem in a Kaggle competition

Hello everyone!

I am working on a Kaggle competition in which I have to classify the relationship between two images, so the input for each sample is a pair of images.
For that, I have built a model with nn.Sequential(…) that takes two inputs and produces the required output.
But the problem is that whenever it predicts for any input without updating the weights (as in learn.get_preds(), or when it predicts on the validation set while running learn.fit_one_cycle()), all the outputs are the same. For the training inputs, it gives the outputs properly!
I have checked this by printing the output of my model's forward method, and below is a screenshot of that.


The outputs at the top are from my training set; the others, which are all identical, are from the validation set.
It would be really great if someone could help me.
Thank you!

I don't think this gives us enough information to diagnose the issue you are facing. Could it be that your entire batch of validation data is entirely identical in that forward pass?

Hey @dreambeats!
Thanks for responding.

Could it be that your entire batch of validation data is entirely identical in that forward pass?

No, that is definitely not the case, because:

  1. I have checked my validation set (it is entirely different).
  2. When I run learn.get_preds(DatasetType.Train), it also gives the same output for every input, even though it was giving different outputs for those same inputs during training (see the sketch just below this list).
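
Here is roughly how the two cases can be compared on one batch (a minimal sketch; learn is assumed to be my fastai v1 Learner, and each batch is a pair of image tensors):

import torch

xb, yb = next(iter(learn.data.train_dl))   # one training batch; xb is [img1, img2]
x1, x2 = xb

learn.model.train()                        # BatchNorm layers use batch statistics
with torch.no_grad():
    out_train = learn.model(x1, x2)

learn.model.eval()                         # BatchNorm layers use running statistics
with torch.no_grad():
    out_eval = learn.model(x1, x2)

print(out_train[:5])
print(out_eval[:5])   # identical rows here point at eval-mode behaviour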

Edited: I don't know what the problem is, because sometimes it gives proper results, but then when I rerun the session it starts giving wrong results (for the same code).

Sometimes it starts giving NaN values as output, and then, for the same code, it starts giving numbers again.
Actually, this has happened to me multiple times on a Kaggle kernel. I tried copy-pasting my code into Google Colab, but it is not using the GPU for this code (I don't know why; I had enabled the GPU). The quick check I mean is sketched below.
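
This is the kind of sanity check I mean (a minimal sketch, assuming a fastai learn object already exists):

import torch

print(torch.cuda.is_available())              # should be True on a GPU runtime
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))      # which GPU the runtime sees
print(next(learn.model.parameters()).device)  # where the model weights actually live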


There could be a bug in your model or your collate function. In any case, looking at the output alone makes deducing the source of your problem pretty difficult. Perhaps sharing a gist of your code would help.

Hi @dreambeats!
I have made another account on Google Colab, and my code runs on the GPU with that account (maybe my original account's packages had a problem).
And now I have found the exact location of the problem:
whenever it runs on the GPU it gives the same output all the time, but while running on the CPU it gives the outputs properly (a sketch for comparing the two devices is below).
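
To compare the two devices directly, something like this works (a minimal sketch; model, x1 and x2 stand in for my network and one batch of image pairs):

import torch

model_cpu = model.cpu().eval()
with torch.no_grad():
    out_cpu = model_cpu(x1.cpu(), x2.cpu())

model_gpu = model.cuda().eval()
with torch.no_grad():
    out_gpu = model_gpu(x1.cuda(), x2.cuda())

print((out_cpu - out_gpu.cpu()).abs().max())  # a large gap means the devices disagree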

This is how I am creating the dataset. Df is a dataframe where p1 holds the paths to the first images, p2 the paths to the second images, and Relativity is the column of labels.

import torch
from torch.utils.data import Dataset
from PIL import Image
from fastai.vision import *   # for ifnone, channel_view, DatasetType, normalize_funcs, Tensor

class MyDataset(Dataset):
    def __init__(self, Df, istest=False):
        self.X1 = list(Df.p1)              # paths to the first images
        self.istest = istest
        self.X2 = list(Df.p2)              # paths to the second images
        if not self.istest:
            self.y = list(Df.Relativity)   # labels

    def __len__(self):
        return len(self.X1)

    def batch_stats(self, funcs:Collection[Callable]=None, ds_type:DatasetType=DatasetType.Train)->Tensor:
        funcs = ifnone(funcs, [torch.mean, torch.std])
        x = self.one_batch(ds_type=ds_type, denorm=False)[0].cpu()
        return [func(channel_view(x), 1) for func in funcs]

    def normalize(self, stats:Collection[Tensor]=None, do_x:bool=True, do_y:bool=False)->None:
        if getattr(self, 'norm', False): raise Exception('Can not call normalize twice')
        if stats is None: self.stats = self.batch_stats()
        else:             self.stats = stats
        self.norm, self.denorm = normalize_funcs(*self.stats, do_x=do_x, do_y=do_y)
        return self

    def __getitem__(self, idx):
        # trans and reshape_image are my transform/reshape helpers, defined elsewhere
        x = [reshape_image(trans(Image.open(self.X1[idx]))),
             reshape_image(trans(Image.open(self.X2[idx])))]
        if self.istest:
            return x
        return x, torch.tensor([float(self.y[idx]), pow(int(self.y[idx]) - 1, 2)], dtype=torch.float)
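
For context, this is roughly how the datasets get wired into fastai (a minimal sketch; train_df and valid_df are assumed splits of Df):

from fastai.basic_data import DataBunch

train_ds = MyDataset(train_df)
valid_ds = MyDataset(valid_df)
data = DataBunch.create(train_ds, valid_ds, bs=32)   # wraps them in device-aware DataLoaders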

And this is my model:

import torch
import torch.nn as nn

class MultInputNN(nn.Module):
    def __init__(self):
        super(MultInputNN, self).__init__()
        # Shared convolutional encoder, applied to each image of the pair
        self.model = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(8),
            #nn.ReLU(),
            nn.Tanh(),
            nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(16),
            #nn.ReLU(),
            nn.Tanh(),
            nn.Conv2d(16, 28, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(28),
            nn.ReLU(),
            nn.Conv2d(28, 48, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(48),
            nn.ReLU(),
            nn.Conv2d(48, 64, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, 72, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(72),
            nn.ReLU(),
            nn.Conv2d(72, 80, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # Head that scores the concatenated pair of encodings (80 + 80 = 160)
        self.model2 = nn.Sequential(
            nn.Linear(160, 80),
            nn.ReLU(),
            nn.Linear(80, 80),
            nn.ReLU(),
            nn.Linear(80, 64),
            nn.ReLU(),
            nn.Linear(64, 28),
            nn.ReLU(),
            nn.Linear(28, 28),
            nn.ReLU(),
            nn.Linear(28, 16),
            nn.ReLU(),
            nn.Linear(16, 2),
            nn.Sigmoid()
        )

        # requires_grad is True by default; if set explicitly, it belongs on the
        # parameters, not on the modules of the Sequential
        for param in self.model.parameters():
            param.requires_grad = True
        for param in self.model2.parameters():
            param.requires_grad = True

    def forward(self, input1, input2):
        c1 = self.model(input1)     # encode image 1
        c2 = self.model(input2)     # encode image 2 with the same weights
        combined = torch.cat((c1.view(c1.size(0), -1),
                              c2.view(c2.size(0), -1)), dim=1)
        return self.model2(combined)
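
As a quick shape check (assuming 256x256 inputs: eight stride-2 convolutions reduce 256 to 1, so each encoding flattens to 80 features and the concatenated pair to the 160 the head expects):

import torch

model = MultInputNN()
x1 = torch.randn(4, 3, 256, 256)
x2 = torch.randn(4, 3, 256, 256)
out = model(x1, x2)
print(out.shape)   # torch.Size([4, 2])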

Edited: The GPU is also not the problem.
My loss function is pushing the model toward outputs that are disconnected from the inputs: first it starts giving similar outputs, and after more training it gives the same output for everything.
nn.BCELoss gives me the error “RuntimeError: bool value of Tensor with more than one value is ambiguous”.
So I had used MSE, and now I use RMSE, which is a little better than MSE.
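
In case anyone hits the same nn.BCELoss error: it usually means the prediction and target tensors were passed to the loss constructor instead of to an instance of it. A minimal sketch of both patterns (pred and target are placeholder tensors), plus one way to define RMSE:

import torch
import torch.nn as nn
import torch.nn.functional as F

pred   = torch.rand(8, 2)   # placeholder: sigmoid outputs
target = torch.rand(8, 2)   # placeholder: float targets in [0, 1]

# Wrong: this raises "bool value of Tensor with more than one value is ambiguous",
# because the tensors land in the constructor's weight/size_average arguments
# loss = nn.BCELoss(pred, target)

# Right: instantiate the loss module first, then call it on the tensors
criterion = nn.BCELoss()
loss = criterion(pred, target)

# RMSE as a loss: the square root of MSE
def rmse_loss(pred, target):
    return torch.sqrt(F.mse_loss(pred, target))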