I did a small experiment that suggests that as networks get deeper we should train them multiple times with different initialisation parameters and use a voting scheme at inference. Below is my rationale. Interested in people's thoughts.
In the previous lessons we learnt that parameter initialisation is very important. However, Kaiming initialisation is still derived from random numbers, so we should not assume we get a good starting position when we train a network. If we make just one attempt we could get unlucky; multiple attempts reduce our chances of starting off on the wrong foot. It also means we get to explore different parts of the network's state space, because optimisation only starts minimising loss after initialisation. So if we train from several different starting positions and keep those models for inference, we increase our chances of success, because we explored a broader space that allows the models to collectively calibrate against the data.
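To make the inference side concrete, here is a minimal sketch of the kind of voting scheme I have in mind (the function name, shapes, and tie-breaking are just my assumptions, not from any library):

```python
import numpy as np

def majority_vote(preds, n_classes):
    """preds: (n_models, n_samples) array of predicted class indices,
    one row per independently initialised model. Returns the class each
    sample received the most votes for (ties go to the lowest index)."""
    onehot = np.eye(n_classes, dtype=int)[preds]   # (n_models, n_samples, n_classes)
    return onehot.sum(axis=0).argmax(axis=1)       # tally votes, pick the winner

# e.g. three models disagree on two samples
preds = np.array([[0, 1],
                  [0, 2],
                  [1, 2]])
print(majority_vote(preds, n_classes=3))   # → [0 2]
```

Averaging the models' predicted probabilities instead of hard voting would be the other obvious combination rule; the point is only that the saved models are combined rather than discarded.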
I did a small experiment to show how this might play out with Kaiming initialisation. The left and right charts (and the green and red histograms) show the means and standard deviations of the activations after each consecutive matrix multiplication. I simulated 1000 initialisations and performed 20 (L) consecutive matrix multiplications. What is interesting is the range: as L increases, the range of both the mean and the standard deviation widens, which suggests we are more likely to randomly choose an unlucky initialisation as L increases. FYI I used some of @jamesd's code from his great blog.
import math
import numpy as np
import matplotlib.pyplot as plt

def kaiming(m, h):
    return np.random.normal(size=m*h).reshape(m, h) * math.sqrt(2./m)

data = []
inputs = np.random.normal(size=512)
for i in range(1000):                       # 1000 independent initialisations
    data.append([])
    x = inputs.copy()
    for j in range(20):                     # 20 consecutive layers (L)
        a = kaiming(512, 512)
        x = np.maximum(a @ x, 0)            # linear layer + ReLU
        data[i].append((x.mean(), x.std()))
data = np.array(data)                       # shape (1000, 20, 2) so we can slice it

fig, ax = plt.subplots(1, 2, figsize=(20, 10))
ax[0].plot(data[:, :, 0].T, '.', color='gray', alpha=0.1)
ax[0].set_title('mean')
ax[0].set_xlabel('layer')
ax[1].plot(data[:, :, 1].T, '.', color='gray', alpha=0.1)
ax[1].set_title('std')
ax[1].set_xlabel('layer');
Also a histogram plot.
import pandas as pd
import seaborn as sns

data = np.asarray(data)                     # ensure it is an array for slicing
layers, means, stds = [], [], []
for layer in range(20):
    mean = data[:, layer, 0]
    std = data[:, layer, 1]
    layers.extend([layer + 1] * len(mean))  # 1-indexed layer label
    means.extend(mean)
    stds.extend(std)
df = pd.DataFrame({'layers': layers, 'means': means, 'stds': stds})

g = sns.FacetGrid(df, row="layers", hue="layers", aspect=15, height=4)
g.map(sns.distplot, 'means', kde=False, bins=100, color='green')
g.map(sns.distplot, 'stds', kde=False, bins=100, color='red')
g.map(plt.axhline, y=0, lw=1, clip_on=False);
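To put a single number on the widening range, here is a scaled-down standalone rerun (100 runs, 256 units, and the seed are arbitrary choices of mine to keep it quick) that measures the across-run spread of the activation mean at each depth:

```python
import math
import numpy as np

def kaiming(m, h):
    return np.random.normal(size=(m, h)) * math.sqrt(2. / m)

np.random.seed(0)
runs, width, depth = 100, 256, 20
x0 = np.random.normal(size=width)

means = np.empty((runs, depth))
for i in range(runs):
    x = x0.copy()
    for j in range(depth):
        x = np.maximum(kaiming(width, width) @ x, 0)  # linear layer + ReLU
        means[i, j] = x.mean()

# standard deviation, across runs, of the activation mean at each layer
spread = means.std(axis=0)
print(spread[0], spread[-1])   # the spread widens with depth
```

The intuition for why the spread compounds: each layer multiplies the activation scale by a slightly random factor, so the run-to-run variation accumulates multiplicatively with depth rather than averaging out.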