XOR Problem - Compare MLP networks

I’m learning Keras by solving the toy example of an XOR gate with a simple feed-forward network (a multi-layer perceptron), and I’d like to compare different architectures, optimizers, activation functions, and anything else relevant to this problem that could help me learn more about neural networks in general.

Here is what I have done so far with questions mixed in:

  1. Created an MLP with five layers (four hidden layers and one output layer), 52 neurons in total if the three input nodes are counted. For the activation function I used softplus, but I don’t really know how it compares to “relu” or other activation functions. How can I compare architectures or activation functions? (I sketch the kind of comparison loop I have in mind right after this list.)
  2. For my default optimizer I used “adam”, because I have heard it is very efficient and uses momentum-like terms, but I am not sure how it works internally compared to stochastic gradient descent. Are there any other optimizers of interest? How can I compare each one? (The sketch below loops over optimizers as well.)
  3. As for performance measures, I am only interested in MSE for now.
  4. Is dropout worth trying? Again, if it can help me learn, then I am interested in trying it out here (the sketch below includes it as an optional switch).
  5. Any other visualizations worth looking at? I have looked at the MSE loss per epoch and overlaid the fitted values on top of the input values (to validate that it indeed learned the function); the plotting code I currently use is sketched at the end of this post.
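
To make points 1, 2 and 4 concrete, this is the kind of comparison loop I have in mind, written as a rough sketch against the current Keras API. The `build_model` helper, the particular lists of activations, optimizers and dropout rates, the epoch count and the `validation_split` are only illustrative choices on my part, and it assumes the `input_train` / `output_train` arrays built further down:

    from keras.models import Sequential
    from keras.layers import Dense, Dropout

    def build_model(activation="softplus", dropout_rate=0.0):
        """Same 8-8-16-16-1 architecture as my model below; only the activation
        function and the (optional) dropout rate are varied."""
        model = Sequential()
        model.add(Dense(8, input_dim=3, kernel_initializer="glorot_normal", activation=activation))
        if dropout_rate > 0.0:
            model.add(Dropout(dropout_rate))
        model.add(Dense(8, kernel_initializer="glorot_normal", activation=activation))
        model.add(Dense(16, kernel_initializer="glorot_normal", activation=activation))
        model.add(Dense(16, kernel_initializer="glorot_normal", activation=activation))
        model.add(Dense(1, activation="softplus"))  # positive output, as in the model below
        return model

    results = {}
    for activation in ["softplus", "relu", "tanh"]:
        for optimizer in ["adam", "sgd", "rmsprop"]:
            for dropout_rate in [0.0, 0.2]:
                model = build_model(activation=activation, dropout_rate=dropout_rate)
                model.compile(loss="mse", optimizer=optimizer)
                # hold out 20% of the rows so the comparison is not purely in-sample
                history = model.fit(input_train, output_train, epochs=200,
                                    batch_size=32, verbose=0, validation_split=0.2)
                results[(activation, optimizer, dropout_rate)] = history.history["val_loss"][-1]

    # rank configurations by final validation MSE (lower is better)
    for config, val_mse in sorted(results.items(), key=lambda kv: kv[1]):
        print(config, val_mse)

The idea is to keep the architecture fixed and vary one factor at a time; for a fairer comparison I would also fix the random seeds and average each configuration over several runs.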

This is what the data looks like, where y is the variable we are trying to learn, the inputs A and B are zeros or ones (representing the XOR gate), and the last input, t, takes values in [0, 1]:

    training_data.head()
    Out[62]:

          A   B      t         y
    0  0.0  1.0  0.0171  35.982006
    1  1.0  0.0  0.9064  28.974946
    2  1.0  0.0  0.3912  27.142839
    3  1.0  0.0  0.1957  29.134787
    4  0.0  1.0  0.9094  31.281645
    # as_matrix() has been removed from pandas; to_numpy() is the current equivalent
    input_train = training_data[['A', 'B', 't']].to_numpy()
    output_train = training_data[['y']].to_numpy()

This is what the neural network looks like in Python:

    from keras.models import Sequential
    from keras.layers import Dense, Activation, Dropout
    from keras.optimizers import Adam

    # four softplus hidden layers (8-8-16-16 units) and one output unit
    model = Sequential()
    model.add(Dense(8, input_dim=3, kernel_initializer="glorot_normal"))
    model.add(Activation("softplus"))
    model.add(Dense(8, kernel_initializer="glorot_normal"))
    model.add(Activation("softplus"))
    model.add(Dense(16, kernel_initializer="glorot_normal"))
    model.add(Activation("softplus"))
    model.add(Dense(16, kernel_initializer="glorot_normal"))
    model.add(Activation("softplus"))
    model.add(Dense(1))
    model.add(Activation("softplus"))  # softplus keeps the output positive

    adam = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
    model.compile(loss='mse', optimizer=adam)

    # nepochs and input_test are defined elsewhere in my script
    history = model.fit(input_train, output_train, epochs=nepochs, batch_size=32, verbose=2)
    self_pred = model.predict(input_train)  # fitted values on the training data
    test_pred = model.predict(input_test)   # predictions on the test inputs
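
For point 5, this is roughly how I produce the loss-per-epoch plot and the overlay of fitted values. It is only a sketch: it assumes matplotlib plus the `history`, `self_pred`, `output_train` and `training_data` objects from the code above.

    import matplotlib.pyplot as plt

    # MSE loss per epoch (history is the object returned by model.fit above)
    plt.figure()
    plt.plot(history.history["loss"])
    plt.xlabel("epoch")
    plt.ylabel("MSE loss")

    # fitted values overlaid on the training targets, plotted against t
    plt.figure()
    plt.scatter(training_data["t"], output_train.ravel(), s=10, label="training data")
    plt.scatter(training_data["t"], self_pred.ravel(), s=10, marker="x", label="fitted values")
    plt.xlabel("t")
    plt.ylabel("y")
    plt.legend()
    plt.show()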