NN from scratch (in NumPy) on Titanic Data.... Need help

Hey guys, to understand better the things that happen under the hood of a Neural Network, I’m trying to build one from scratch without frameworks, using just NumPy.
I’ve followed multiple articles online to see what works, but I can’t get it working and it’s eating me up.
For some reason, my loss shoots up to infinity while my accuracy goes down to NaN even though I’m applying the formulas correctly.
Can someone please help me understand where I’m going wrong?

Here is the link to the notebook I’m editing: ANN in NumPy on Titanic Dataset [0.765] | Kaggle

  • I’ve checked all the matrix dimensions multiple times, they are all fine.
  • Tried to use Binary Cross Entropy, MSE, RMSE and Mean Absolute Error as the loss functions, they all shoot up.
  • Used sigmoid, softmax and relu as the final activation functions (relu was always the non-final activation function).
  • Even used a negative learning rate (both update directions, W - alpha * dW and W + alpha * dW)

Update V5: Noticed some issues in the backprop code I wrote, and checked whether the gradients are actually updating or not. Now the issue of the loss shooting up has been solved, but the loss remains stagnant and the accuracy does not go beyond zero. Somehow, the predictions the model returns happen to be numbers beyond 1 even though I’m using sigmoid.

Update V11: Built this model, tested out which numpy seed was giving best results and used that seed to get a result equivalent to the one I got on scikit-learn’s DecisionTreeClassifier. Notebook here. Thanks to @Timme for helping out!


I am unable to view your notebook. Can you confirm that it is public?

Yeah, sorry, I thought it was public.

Hi, sorry for the late reply.

The errors are fixed by updating the following functions:

def update_params(params, grads, lr):
    # params stores W1, b1, ..., Wn, bn, so there are len(params) // 2 layers
    n_layers = len(params) // 2
    for i in range(1, n_layers + 1):
        params[f'W{i}'] -= lr * grads[f'dW{i}']   # gradient descent step
        params[f'b{i}'] -= lr * grads[f'db{i}']
    return params

Updating a value number by subtracting anotherNumber is done with:

number -= anotherNumber

and not

number =- anotherNumber 

The latter is parsed as number = -anotherNumber, which sets number to the negative of anotherNumber.
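The difference is easy to demonstrate in plain Python:

```python
number = 10.0
another = 3.0

number -= another      # in-place subtraction: 10 - 3
print(number)          # 7.0

number = 10.0
number =- another      # parsed as: number = -another
print(number)          # -3.0
```

So with =- the update never accumulates; the weight is simply overwritten with the negated gradient step every iteration.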

def calc_AL_deriv(preds, actuals):
    # Derivative of the squared-error loss with respect to the predictions
    return -2*(actuals-preds)

This is the gradient of your loss function, defined as:

def calc_cost(Y, predictions):    # Predictions are the "A2" in our two layer neural network
    # Note: this is mean squared error (MSE), not binary cross entropy,
    # which is why its gradient is -2*(actuals-preds)
    return ((Y-predictions)**2).mean()
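A quick way to confirm the derivative matches the loss is a finite-difference check. This is a standalone sketch reusing the two functions above; note the analytical gradient omits the 1/N factor from .mean(), which effectively rescales the learning rate:

```python
import numpy as np

def calc_cost(Y, predictions):
    # Mean squared error over all entries
    return ((Y - predictions) ** 2).mean()

def calc_AL_deriv(preds, actuals):
    # Analytical gradient of the squared error per entry
    # (without the 1/N factor that .mean() introduces)
    return -2 * (actuals - preds)

rng = np.random.default_rng(0)
Y = rng.random((1, 5))
preds = rng.random((1, 5))

# Finite-difference estimate of d(cost)/d(preds) for one entry
eps = 1e-6
i = 2
bumped = preds.copy()
bumped[0, i] += eps
numeric = (calc_cost(Y, bumped) - calc_cost(Y, preds)) / eps

# Include the 1/N of .mean() so the two estimates are comparable
analytic = calc_AL_deriv(preds, Y)[0, i] / Y.size
assert abs(numeric - analytic) < 1e-4
```

If the sign is flipped to +2, this assertion fails, which is a fast way to catch a wrong gradient before training.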

I expect that everything runs fine with these changes. Kindly let me know. Thanks


Thanks, it does work to some extent now.
In the sense that it is training, but for some reason both the loss and the accuracy increase over time and stabilize at 0.383 (loss) and 0.616 (accuracy), no matter how many epochs I run or the architecture I use. I also get an overflow error in NumPy during the sigmoid calculation, even when I change the datatype of the array to float32, float64, or even float128.

This does work, and this is a good bare minimum model, but I was hoping the accuracy would be somewhere around 70%.

Also, is there any way to overcome the overflow error?
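On the overflow: np.exp(-z) overflows float64 once -z exceeds roughly 709, and switching dtypes only moves that threshold. A common fix (a sketch, not the notebook’s code) is the two-branch form that only ever exponentiates non-positive numbers:

```python
import numpy as np

def stable_sigmoid(z):
    # Evaluate sigmoid without exponentiating large positive numbers:
    # for z >= 0 use 1/(1+exp(-z)); for z < 0 use exp(z)/(1+exp(z)).
    out = np.empty_like(z, dtype=np.float64)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

z = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
print(stable_sigmoid(z))  # no overflow warning; values stay in [0, 1]
```

A simpler (slightly lossier) alternative is clipping the pre-activation first, e.g. np.clip(z, -500, 500), before the plain sigmoid. That said, inputs large enough to overflow usually also point to exploding activations, so standardizing the input features is worth checking too.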

Another issue I’m seeing: no matter how many sampling techniques I try for the weights and biases, and no matter the scale and loc, my final layer’s Z and activations always vary only very slightly. For example, if the first value in my A3 layer is 0.38918, the following 889 values all fall somewhere between 0.39118 and 0.38002, gradually decreasing with some exceptions, with the final value being 0.37992. I’ve tried about 30 different random samplings for my weights and biases, but it is always the same thing.

Update V8: Changed the entire data processing pipeline, because I realized the previous one produced very different training and test sets: since it was taking dummies for all columns, the test X’s dimensions were (97, 418) whereas the training X’s were (167, 891), i.e., the number of features per set differed, so I couldn’t get predictions for the test set.
In the new process, the features of the two sets are the same, and I have more visibility into what’s going on.

Hi, I just went through version 7 and realized that you did not make the second correction as stated above.
The gradient of the loss function is -2*(actuals-preds) and not 2*(actuals-preds).

Finally, I apologize for missing this one. Your code within the train function:

activations = forward_pass(X, params)
#... Some code in here
grads = backprop(params, cache, y)    # passes the stale cache instead of activations

Your gradients are not being calculated on the right activations; it should be:

activations = forward_pass(X, params)
#... Some code in here
grads = backprop(params, activations, y)

Everything should work fine now

Also, reduce your learning rate to 0.01.
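To see why the learning rate matters (a toy illustration, not from the notebook): gradient descent on f(w) = w², whose gradient is 2w, converges for small steps but diverges once the step is too large, which is exactly the loss-shooting-to-infinity symptom:

```python
def descend(lr, steps=50, w=1.0):
    # Gradient descent on f(w) = w**2, whose gradient is 2*w
    for _ in range(steps):
        w -= lr * (2 * w)
    return w

print(abs(descend(0.01)))   # shrinks toward 0: converging
print(abs(descend(1.1)))    # blows up: diverging
```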


Oh my god…
You have no idea just how embarrassed I am right now. I used to make these kinds of silly mistakes back in high school math, but still making them now is absolutely humiliating!!
I must have scrolled past that code about 150 times in the past week and turned my model upside down looking for the mistake, but this is going to be a major learning point in my DL career.

Thank you so very much for catching my mistake, and you shouldn’t be the one to apologize. I should be the one apologizing to you, so I am so sorry about this! And once again, thank you for catching my mistake!

(Side note: I tried the -2 in the equation as well, but that seemed to make my model even less accurate, so I changed it to +2 and it got better. I should have trusted the derivative calculation I did. Thanks for pointing that out!)

All good, I am glad to have been of help.
Take care
