Applying part 2 to tabular data

As I said already, it is a bug, and I showed details of the PyTorch team discussing it in the lesson slides. Check it out! :slight_smile:

I saw the G+ discussion shown in lesson 9. It focused on the \sqrt{5}, and I noticed something else (the 3 instead of \sqrt{3} problem, even when you state that relu and not leaky_relu is your nonlinearity), but I guess it's all just (wrongly implemented) details of the same thing, and the issue that was filed after Jeremy raised it will hopefully improve all of these…

The issue itself doesn't explicitly say what the problem is, only that there is a problem. It would be nice if Kaiming He could step in and state his opinion about the correct usage of his init, so it won't be used in the wrong way :slight_smile:

How did you get \sqrt{3} for the 'relu' nonlinearity? If you pass the relu parameter, you get the correct gain of \sqrt{2} from the calculate_gain function:

elif nonlinearity == 'relu':
    return math.sqrt(2.0) 

and in this case it works out correctly

std = gain / math.sqrt(fan) 

which gives \sqrt{\frac{2}{fan}}, and then, in order to calculate the bounds of the uniform distribution, we multiply by \sqrt{3}.

Thanks Nick, it's true about the ReLU, I made a mistake above.
I'll try to state more clearly what I see:

  1. The default initialization of linear layers in PyTorch always calls kaiming_uniform with nonlinearity=leaky_relu and a=sqrt(5). I didn't see a way to use the default PyTorch initializer with a different nonlinearity.
  2. The calculate_gain function invoked from kaiming_uniform returns \sqrt{3} for the arguments leaky_relu and a=sqrt(5).
  3. After a gain of \sqrt{3} is calculated, kaiming_uniform multiplies it again by \sqrt{3}, which gives back 3.

So the default init is wrong for ReLU, not because of the \sqrt{5} but because of the extra \sqrt{3} that is multiplied in later.

What happens when explicitly stating ReLU for kaiming_uniform? calculate_gain will return \sqrt{2}, which is then multiplied again by \sqrt{3}, so the bound will be \sqrt{6} (over \sqrt{fan}), which might be correct :slight_smile:

Anyhow, hopefully soon I’ll implement the LSUV with great success and we won’t have to discuss these magic number sqrts anymore!


Thanks Yonatan.
To me that multiplication by \sqrt{3} seems correct. Let me explain my intuition, and maybe you can point out where I am wrong. In the Kaiming paper they say that for ReLU we want to initialize our weights from a distribution with 0 mean and standard deviation \sqrt{\frac{2}{n}}. This is exactly what we get when using nonlinearity = 'relu':

std = gain / math.sqrt(fan) #where gain = math.sqrt(2.0)

Next we can sample either from a normal distribution or from a uniform one. In the case of the kaiming_normal function:

std = gain / math.sqrt(fan)
with torch.no_grad():
    return tensor.normal_(0, std)

But in order to sample from the uniform distribution we need the bounds. Since the uniform distribution on the interval [-bound, bound] has std = \frac{bound}{\sqrt{3}}, to get the bound we have to multiply the std by \sqrt{3}, which is done in the kaiming_uniform function:

std = gain / math.sqrt(fan)
bound = math.sqrt(3.0) * std  # Calculate uniform bounds from standard deviation
with torch.no_grad():
    return tensor.uniform_(-bound, bound)

As I said, I could be wrong; I just want to make sure I also understand what is going on there.


Ok, so that's the reason! Thanks Nick! Now I understand it better!
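
To convince myself, here is a quick numerical sanity check (just a sketch; the tensor shape is arbitrary):

import math
import torch
from torch.nn import init

fan_in = 400
w = torch.empty(1000, fan_in)                       # a Linear-like weight: (out_features, in_features)
init.kaiming_uniform_(w, nonlinearity='relu')       # gain = sqrt(2), std = gain/sqrt(fan_in), bound = sqrt(3)*std

print(w.std().item(), math.sqrt(2 / fan_in))        # empirical std vs. sqrt(2/fan)
print(w.abs().max().item(), math.sqrt(6 / fan_in))  # max |w| vs. the uniform bound sqrt(6/fan)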

Applying LSUV

In the most recent lesson of the course (11), in perfect timing for our progress here, Jeremy introduced his version of an LSUV implementation. I guess he also decided to skip the orthonormal part…
His function is the following:

def lsuv_module(m, xb):
    h = Hook(m, append_stat)   # hook that records the mean and std of this module's output

    # mdl is the full model (defined outside); each forward pass refreshes h.mean and h.std
    while mdl(xb) is not None and abs(h.mean)  > 1e-3: m.bias -= h.mean
    while mdl(xb) is not None and abs(h.std-1) > 1e-3: m.weight.data /= h.std

    h.remove()
    return h.mean,h.std

Some preliminary remarks about LSUV:

  1. At first I didn't understand why the iteration is necessary - by definition, anything you divide by its own std will have std=1 after that division. But the reason for the iteration here is that we want std=1 for the activations after the nonlinearity layer. The nonlinearity makes the mean and std less predictable, and an easy solution here (which I'm not sure is guaranteed to work every time) is to iterate until we are close to what we want.
  2. Jeremy's code iterates on the first-level modules in the model. Our model is currently flat: all the layers, nonlinearities, etc. are on the same level. Since, as I wrote above, we are interested in normalizing the activations only after the nonlinearity (and also to keep things working with the lesson's code), I have to aggregate the layers in the model into [Linear, ReLU] pairs. Each pair will then be initialized as one module, so the activations are normalized only after the nonlinearity.

I made my adaptation of Jeremy's code so it applies LSUV to our deep FC network. That included importing the data as before, into notebook 7. I also made the following additions/replacements:

class FCLayer(nn.Module):
    def __init__(self, ni, no, sub=0.0, **kwargs):
        super().__init__()
        self.linear = nn.Linear(ni, no)
        self.relu = GeneralRelu(sub=sub, **kwargs)
    
    def forward(self, x): return self.relu(self.linear(x))
    
    @property
    def bias(self): return -self.relu.sub
    @bias.setter
    def bias(self,v): self.relu.sub = -v
    @property
    def weight(self): return self.linear.weight

def get_fc_model(data, layers):
    model_layers = []
    for k in range(len(layers)-2):
        model_layers.append(FCLayer(layers[k], layers[k+1], sub=0))
    model_layers.append(nn.Linear(layers[-2], layers[-1])) # last layer is without ReLU
    return nn.Sequential(*model_layers)

def init_fc_(m, f, nl, a):
    if isinstance(m, nn.Linear):
        f(m.weight, a=a, nonlinearity=nl)
        if getattr(m, 'bias', None) is not None: m.bias.data.zero_()
    for l in m.children(): init_fc_(l, f, nl, a) # recursively proceed into all children of the model

def init_fc(m, uniform=False, nl='relu', a=0.0):
    f = init.kaiming_uniform_ if uniform else init.kaiming_normal_
    init_fc_(m, f, nl, a)

def get_learn_run(layers, data, lr, cbs=None, opt_func=None, uniform=False):
    model = get_fc_model(data, layers)
    init_fc(model, uniform=uniform)
    return get_runner(model, data, lr=lr, cbs=cbs, opt_func=opt_func)

and the actual model initialization happens here (using the same config as before):

sched = combine_scheds([0.3, 0.7], [sched_cos(0.003, 0.6), sched_cos(0.6, 0.002)]) 
cbfs = [Recorder,
        partial(AvgStatsCallback,accuracy),
        partial(ParamScheduler, 'lr', sched)]

layers = [m] + [40]*10 + [c]

learn,run = get_learn_run(layers, data, 0.01, cbs=cbfs)
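
Training is then the usual runner call (80 epochs here, to match the metrics below; plot_loss is the lesson Recorder's batch-loss plot):

run.fit(80, learn)
run.recorder.plot_loss()   # batch-level training loss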

This is the loss evolution when training the model in the usual way (with the default Kaiming init):
image
and these are the metrics after 80 epochs:

train: [0.560090118024993, tensor(0.7870)]
valid: [0.6289463995033012, tensor(0.7657)]

The LSUV module just worked (after I made all the changes). Here are the means and stds before the LSUV init:

0.20017094910144806 0.37257224321365356
0.10089337080717087 0.16290006041526794
0.030267242342233658 0.05773892626166344
0.04704240337014198 0.05144619196653366
0.0308739822357893 0.051715634763240814
0.03219463303685188 0.047352347522974014
0.02379254810512066 0.041318196803331375
0.031485360115766525 0.04568878561258316
0.04652426764369011 0.05621001869440079
0.04491248354315758 0.06492853164672852

and after the LSUV:

(0.3296891152858734, 0.9999774694442749)
(0.3093450665473938, 0.9999678134918213)
(0.3306814730167389, 0.9998359680175781)
(0.2864744961261749, 1.0004031658172607)
(0.2594984173774719, 0.9999544024467468)
(0.23850639164447784, 1.0)
(0.20919467508792877, 0.9999997019767761)
(0.2371777594089508, 0.9999346137046814)
(0.16776584088802338, 0.9999942779541016)
(0.17556752264499664, 0.9999728202819824)
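
For reference, producing these numbers followed the lesson's pattern (a sketch; find_modules, get_batch, Hooks and append_stat come from the course notebooks):

learn,run = get_learn_run(layers, data, 0.01, cbs=cbfs)
mods = find_modules(learn.model, lambda o: isinstance(o,FCLayer))   # the [Linear, ReLU] modules
xb,yb = get_batch(data.train_dl, run)                               # one batch to measure the stats on
mdl = learn.model.cuda()
with Hooks(mods, append_stat) as hooks:
    mdl(xb.cuda())
    for hook in hooks: print(hook.mean, hook.std)                   # stats before LSUV
for mod in mods: print(lsuv_module(mod, xb.cuda()))                 # stats after LSUV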

A significant improvement. Let’s train it:
image
Wow, still slightly bumpy but much, much better. It's faster and more stable (the original init sometimes didn't converge even after 80 epochs). The plateau is gone…

And the final metrics are much better for the training, and less good for validation:

train: [0.3298365177516958, tensor(0.8721)]
valid: [0.6690512438988577, tensor(0.7803)]

According to Jeremy, we have 2 stages in training a model:

  1. overfit.
  2. reduce overfitting.

I didn't put a good figure for that here, but I can tell you that now (unlike before) the validation loss dives to around 0.60 before going up again, so we managed to pass the first of the two stages: overfitting.

Next I'll try to show more detailed stats of what is going on in each layer, and then I'll try to reduce the overfitting to improve the validation score!


Those graphs are great. Crazy how much faster it seems to be! I have a question on adding layers. If I'm understanding correctly, what you are using is just a single linear layer with a ReLU activation, but how would initialization need to change if there were multiple layers? Also, is your code on GitHub or anywhere? I'd be interested in pulling it to do some exploration if that's available. Thanks again for all of these snippets of insight. I am going to start doing this myself while I'm floundering through Jeremy's notebooks, breaking them down into chunks that make sense to me. I really love seeing the broken-down thought process as you dive into the content.

Thanks Kevin, I'm not sure I understood your question -
I'm using 10 layers currently, each with its own ReLU activation, and the LSUV initialization I showed above is for that configuration. The layer shapes are stored in the list object layers:

layers = [m] + [40]*10 + [c]

where m is the number of inputs, [40]*10 in Python is a list of ten 40s (the hidden layer sizes), and c is the number of classes (outputs).
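
Concretely, for the Otto data (93 features and 9 product classes, if I remember right) that expands to:

m, c = 93, 9                   # Otto: 93 input features, 9 product classes (from memory)
layers = [m] + [40]*10 + [c]   # [93, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 9]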

About putting my notebooks on GitHub - I'd like to, but I'm not doing it yet because:

  1. I currently have limited time, which I'd rather invest in research and not in making the notebooks publishable. But later I will.
  2. I'm not sure what the policy of this course is, and whether we can already put things we see here in a public place.
  3. I have to learn a few more things about git…

Anyhow, until I publish the notebooks, I'm doing my best to give all the relevant pieces of code to reproduce what I'm doing here on the forum. If you take all the code pieces above and put them into the course's notebook 7, it should (hopefully - I never tried) work. You also need to download the Otto data, of course. If it doesn't work, feel free to ask me here!

Displaying train and validation loss

I really wanted to show the train and validation losses here, so we could see when the model overfits and, in general, what's going on during training. Jeremy's Recorder callback currently only records the batch losses. I'd like to add a recorder for the epoch stats, i.e. the average train and validation metrics after each epoch. We already calculate these numbers with the AvgStatsCallback class, so we just have to pull them at the right time, after each epoch, and record them in a list.

I modified the Recorder class to account for that in the following way:

  1. add self.train_stats_list, self.valid_stats_list = [], [] in the begin_fit method.
  2. add an after_epoch method as:
    def after_epoch(self):
        self.train_stats_list.append(self.avg_stats.train_stats.avg_stats)
        self.valid_stats_list.append(self.avg_stats.valid_stats.avg_stats)
    
    which will append each epoch's stats to the lists (self.avg_stats here is the AvgStatsCallback, which the Runner exposes to other callbacks).
  3. Now I can plot the train and validation losses by adding this method:
    def plot_epoch_loss(self): 
        plt.plot([l[0] for l in self.train_stats_list], c="blue")
        plt.plot([l[0] for l in self.valid_stats_list], c="orange")
        plt.legend(["train", "validation"])
    

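After a fit, the epoch-level curves can then be drawn through the runner's handle on the recorder (the same access pattern as run.recorder.valid_stats_list in the depth loop further down):

run.recorder.plot_epoch_loss()   # epoch-level train vs. validation loss
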
And here are the results, after training 80 epochs, 10 hidden layers each with 40 units, no LSUV:
image
And with LSUV (notice the overfitting):
image

Different architecture

From the beginning I didn't really like the 10x40-unit layers. I prefer an architecture that forces the net to evolve (evolve = my fantasy, not sure it's right in reality), such as [1000, 500, 200, 100, 50, 20]. Let's try it…
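
Concretely, that only changes the layers list (same helpers as before):

layers = [m] + [1000, 500, 200, 100, 50, 20] + [c]
learn,run = get_learn_run(layers, data, 0.01, cbs=cbfs)
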
Here are the results without LSUV:
image
Huh, seems like in this case we are overfitting without LSUV as well! So the architecture might be an important factor in the net’s learning ability in addition to the initialization!

Let's check how it does with LSUV:
image
That looks bad! Actually, the net with LSUV, in this case, seemed to get NaNs after epoch 18 or so, and the plot only shows up to that point. What happened?

I’ll try again (different random init):
image
Now it worked. The training is faster than without LSUV, and the training loss is almost 0. But can it be that LSUV makes things unstable under this architecture? Why the NaNs before? I can see a "rough" region around epochs 20-30 which might cause numerical problems, but I don't really know. Probably more trials are in order before I can draw any conclusions.

This is it for now, I’ll continue in the next post. Let me know if you have any insights about what I found here…


Note the change in your y intercept in each plot. Looks to me like LSUV is simply training better. So it gets to the point it can overfit, which you don’t get to otherwise.

So you simply need to add some regularization - try dropout.


Very deep net (100 layers)

Another experiment at this stage - why not train a 100-layer-deep net? Will LSUV allow the net to learn anything? Let's try… Here are the training stats for a 100-layer-deep net without LSUV:
image
The net learns something and then gets stuck in the plateau phase “forever”.

with LSUV:
image
Hmm, even LSUV has its limits. BTW, the stds are very close to 1.0 all the way up to the 100th layer, so the init is not the only issue. Maybe the learning rate should change? I checked 0.1 and 0.001 but the result is similar.

I'd like to see exactly when the net breaks. To do that, I'll iterate over various net depths and check the minimal validation loss obtained while training at each depth. This is the loop code:

hidden_sizes = np.unique(np.logspace(0, 4, 30, base=3, dtype=int))  # depths to try: 1..81 hidden layers, log-spaced
min_valid_loss = []
for hs in hidden_sizes:
    print("hidden layers: ", hs)
    layers = [m] + [50]*hs + [c]
    learn,run = get_learn_run(layers, data, 0.01, cbs=cbfs)
    mods = find_modules(learn.model, lambda o: isinstance(o,FCLayer))
    xb,yb = get_batch(data.train_dl, run)
    mdl = learn.model.cuda()
    with Hooks(mods, append_stat) as hooks:
        mdl(xb.cuda())
        # for hook in hooks: print(hook.mean,hook.std)
    for mod in mods: lsuv_module(mod, xb.cuda())    # LSUV init, module by module
    run.fit(50, learn)
    min_valid_loss.append(min([l[0] for l in run.recorder.valid_stats_list]))
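
Plotting the result is then just (a small sketch):

plt.plot(hidden_sizes, min_valid_loss)
plt.xlabel("number of hidden layers")
plt.ylabel("min validation loss")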

Here is the result: minimal validation loss as a function of the network depth.
image
Which seems to hint that for this problem we are better off with 2 layers (!) and not a deep net at all.

Why? How can a very shallow net be better than a deeper one, assuming we solved the init problem and it's able to learn? I feel that if I'm able to answer that thoroughly, my ability to solve problems with deep nets will increase a lot :slight_smile:

I also want to see the minimal training loss; I think this will help. Here is the same graph as above (with a new random initialization) with both train and validation losses:
image
Now, that made me check what happens after ~10 layers that causes the training loss to jump. It's NaNs. As the number of layers increases, the number of epochs until the network hits NaNs decreases. That explains why we see bad train and validation losses for deeper nets.

Actually, up to a depth of 10 things seem good - the training loss is declining, which hints that good regularization might reduce the validation loss. So now the question is more focused - why do we suddenly get NaNs after several successful epochs with the deeper nets? I played with learning rates and was surprised to see that they did not seem to matter much for this issue.

Next (also in accordance with Jeremy's reply above), I think I'll check the effect of regularization - dropout.
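
A minimal way to try that would be a dropout layer after the ReLU - just an untested sketch (FCLayerDrop and the p value are hypothetical):

class FCLayerDrop(nn.Module):
    # hypothetical variant of FCLayer with dropout applied after the ReLU
    def __init__(self, ni, no, sub=0.0, p=0.1, **kwargs):
        super().__init__()
        self.linear = nn.Linear(ni, no)
        self.relu = GeneralRelu(sub=sub, **kwargs)
        self.drop = nn.Dropout(p)

    # (keep the bias/weight properties from FCLayer if lsuv_module should still work on this)
    def forward(self, x): return self.drop(self.relu(self.linear(x)))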


You’ll need bn for deep nets btw. Or selu (with appropriate init and activation and stuff, which is v fiddly).


Hmm, great, so first I’ll add batchnorm, see if I can train deeper nets, and then I’ll regularize.

I added batch norm by modifying the FCLayer class in the following way:

class FCLayer(nn.Module):
    def __init__(self, ni, no, bn=True, sub=0.0, **kwargs):
        super().__init__()
        self.bn = bn
        self.linear = nn.Linear(ni, no)
        if bn: self.BatchNorm = nn.BatchNorm1d(no)   # batchnorm between the linear layer and the ReLU
        self.relu = GeneralRelu(sub=sub, **kwargs)

    def forward(self, x):
        x = self.linear(x)
        if self.bn: x = self.BatchNorm(x)
        return self.relu(x)

That results in a very nice training graph! Without LSUV!
image

Now let's see if LSUV does anything when applied to the batch-norm network. Oh - running the LSUV init and then fitting, like this:

learn,run = get_learn_run(layers, data, 0.01, cbs=cbfs)
mods = find_modules(learn.model, lambda o: isinstance(o,FCLayer))
xb,yb = get_batch(data.train_dl, run)
mdl = learn.model.cuda()
with Hooks(mods, append_stat) as hooks:
    mdl(xb.cuda())
    for hook in hooks: print(hook.mean,hook.std)
for mod in mods: lsuv_module(mod, xb.cuda())

run.fit(50, learn)

results in NaNs all the way from the first epoch. What is going on? It seems that the LSUV is making everything NaN. Let's check more deeply. These are the means and stds of the activations of each layer after the batch norm, before LSUV:

0.15797513723373413 0.39113742113113403
0.15434011816978455 0.37467488646507263
0.15448042750358582 0.3945228159427643
0.1196012943983078 0.33887019753456116
0.17741800844669342 0.4215557277202606
0.12779605388641357 0.3248080611228943
0.14722740650177002 0.370301753282547
0.13303065299987793 0.3530457317829132
0.1496778130531311 0.38954412937164307
0.13442032039165497 0.3293791115283966

So batch norm already does a nice job of normalizing our activations (not perfect though, probably because it lacks the LSUV iteration). So maybe LSUV is not so important anymore, but why doesn't it work?

I checked what is going on in the inner loop of the LSUV module, and got it. Look at these std values of the weights, during the LSUV weight iteration loop:

w s  tensor(0.1484, device='cuda:0', grad_fn=<StdBackward0>)
w s  tensor(0.3650, device='cuda:0', grad_fn=<StdBackward0>)
w s  tensor(0.8979, device='cuda:0', grad_fn=<StdBackward0>)
w s  tensor(2.2088, device='cuda:0', grad_fn=<StdBackward0>)
w s  tensor(5.4337, device='cuda:0', grad_fn=<StdBackward0>)
w s  tensor(13.3668, device='cuda:0', grad_fn=<StdBackward0>)
w s  tensor(32.8823, device='cuda:0', grad_fn=<StdBackward0>)
w s  tensor(80.8900, device='cuda:0', grad_fn=<StdBackward0>)

It keeps looping until infinity breaks the loop.
And now I understand: batchnorm undoes whatever scaling LSUV applies to the weights, so the measured activation std stays below 1, and the LSUV loop keeps dividing by it, growing the real weight std toward infinity.
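
One defensive tweak I might try (my own sketch, not from the lesson) is to cap the number of iterations, so the loop can't run away when batchnorm keeps the measured std fixed:

def lsuv_module_capped(m, xb, max_iters=10):
    # same idea as lsuv_module above (relies on the same global mdl),
    # but gives up after max_iters passes instead of looping forever
    h = Hook(m, append_stat)
    for _ in range(max_iters):
        mdl(xb)
        if abs(h.mean) > 1e-3: m.bias -= h.mean
        else: break
    for _ in range(max_iters):
        mdl(xb)
        if abs(h.std - 1) > 1e-3: m.weight.data /= h.std
        else: break
    h.remove()
    return h.mean, h.std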

Now I discovered something important:
BatchNorm was actually covered (thoroughly, all types, running norm, etc.) in the previous lesson!!
I was so occupied with the LSUV initialization in the last few days that I completely forgot about that part of the course. I feel a bit stupid now, but I'll adhere to my strict "publish everything I'm doing" policy, won't delete or edit what I did above (maybe it will be useful to someone), and keep going :slight_smile:

First I'll align my code with the lesson's way of adding batch norm to our framework. Maybe it will also solve the issues above. So I define a function fc_layer, and instead of calling the class FCLayer I call this function:

def fc_layer(ni, no, bn, **kwargs):
    layers = [nn.Linear(ni, no), GeneralRelu(**kwargs)]
    if bn: layers.append(nn.BatchNorm1d(no))   # batchnorm after the nonlinearity, as in the lesson
    return nn.Sequential(*layers)

I also update the get_fc_model function so it creates all the hidden layers with bn=True.
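
The change is small; roughly this (my sketch, with the bn flag just passed through to fc_layer):

def get_fc_model(data, layers, bn=True, **kwargs):
    model_layers = [fc_layer(layers[k], layers[k+1], bn=bn, **kwargs)
                    for k in range(len(layers)-2)]
    model_layers.append(nn.Linear(layers[-2], layers[-1]))  # last layer: no ReLU, no batchnorm
    return nn.Sequential(*model_layers)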

I am going to stop trying to make the LSUV init work with BatchNorm. After all, they do a similar thing, but BN keeps doing it as the net learns. The only advantage I can attribute to LSUV is the iteration to find better statistics after the nonlinearity. BN doesn't have that iteration, but as far as I understand, if we place the BN layer after the nonlinearity, it will have the same effect, as it will make sure whatever comes out of our nonlinearity has unit std. I'm not completely sure about that yet.

A pedagogical digression - considering what I wrote above, I'm not sure LSUV should be taught after batchnorm, as it creates the impression that it is "better" or more recent (at least it did for me). As it seems now, batchnorm makes LSUV redundant. If anyone thinks differently, please let me know!

So let’s see the loss as a function of network depth, with BatchNorm:
image image
Nice! I put the LSUV version of this graph on the right for comparison.

During training the deeper nets had some explosively large validation values. NaNs appeared only in the last experiment, with 81 layers. So all in all the behavior with BN seems to be better in terms of convergence. If I had to choose an architecture, I would choose the point where the training loss is minimal in the graph above, i.e. ~7 layers, and try to regularize it. Maybe that's what I'll do next.

  • After thinking some more, I realize BN can allow a situation where the weights are crazy large but the activations are still normal, which is not so good. Maybe we do need LSUV or something similar after all to make sure that the weights themselves are not doing anything funky. Maybe L2 regularization would do it. I’ll think some more…
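
If I try L2, the simplest route is probably weight decay in the optimizer - something like this (a sketch; the 1e-4 value is a placeholder, and it assumes get_runner passes opt_func through to build the optimizer, as in the lesson):

from functools import partial
from torch import optim

opt_func = partial(optim.SGD, weight_decay=1e-4)   # L2 regularization via weight decay (placeholder value)
learn,run = get_learn_run(layers, data, 0.01, cbs=cbfs, opt_func=opt_func)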

Not at all! :slight_smile: LSUV and BN both are from 2015, and they are doing different things (both with the goal of stabilizing training). As mentioned in the lesson, BN can’t properly be used with RNNs, and has trouble (in the classic form) with small batch sizes. In addition, starting out with a good init helps a bit even with BN.


@yonatan365 - this paper might be of interest to you. It’s a completely different approach from the awesome work you’re doing above, but might give you some food for thought. I’m an awful programmer, but am going to try to implement it over the next few weeks.


One thing I may suggest is to get local git running. As you are making many changes, it may help to track them and/or revert back, and also save disk space. I note as I say this that I am now duty-bound to explain how. Well, I don't quite have that knowledge, except that it is possible. Here is a description of the difference between local and remote repositories. I am pretty sure you don't need GitHub to do this. Also, there are other git hosts out there.

Difference between local and remote git repositories

Also, this how-to was very, very helpful - 10/10 for adelphus.

I am going to try this myself as time permits with your code.

Many thanks for your input and foresight.

EDIT: In the steps of the link above I worked slightly differently, so I had different results.

  • in step 2a I used notebooks 05…07
  • so when I got to step 2b I had to clone the repo
  • then copy in another local file and use add/commit
  • then push with -u to master
  • in each case I had to use git config for user.name and user.email

I found this book useful for understanding git internals and what happens when.

English Version of ProGit book pdf/epub/mobi


Maybe it would be better to apply LSUV to the output of the linear layer, before the batch norm?
But then we lose the ability to take into account the ReLU which comes after the batchnorm.

I have done some quick experiments, and I also see LSUV sometimes pushing values to NaN if it's applied after a Linear -> BatchNorm -> ReLU (but only sometimes). The issue is with the mean part; the standard deviation seems to work fine applied after the ReLU.

I will try to apply it only after the Linear and use a ReLU with sub=0.5 to compensate.


Nice post.

I have not looked at the dataset in any detail, but here are some things I would investigate.

Is there a class imbalance? If so, adjust for it by oversampling the underrepresented classes.

Run the tabular data through a CNN by treating each product id the same way we would treat an image, where rows are entries for the matching id, columns are the features activated in that row, and there is one channel.

If that doesn't work, try feeding the CNN different representations. For example, rows are entries for the matching id, columns are the features activated in that row, and each feature is also a channel that is repeated across all columns.

Replies to the replies:

@RogerS49 - thanks for the git advice! I hope I'll soon overcome my time/courage constraints and get to it; I know it's going to improve my life. I'm pretty sure that if you write about your journey into learning git, it will be relevant to many of us here.

@tomsthom - following Jeremy's reply, namely that BN and LSUV are NOT redundant together, I'll definitely make an effort to make them work together. Your idea to normalize activations after the linear layer sounds like the easiest solution, as it won't require iterating at all. I hope to find time soon to report more about this. Do let me know what you find with the ReLU sub!

@maral - Thanks for the advice. As for class imbalance - I tried to keep preprocessing to a minimum here, as my interest is in checking the modeling methods we learn, and I don't want to mix in more factors and hyperparameters such as sampling methods, etc. The CNN sounds like a great idea, along the lines of the link that was posted above by knesgood. As I said, my first priority now is to follow the course's methods, but I hope that afterwards I'll get to checking CNNs on tabular data.

Thanks everyone for your replies!
I'm in some busy days now, so it will probably take a bit longer until my next code post. Meanwhile, if anyone wants to share her/his experiences, insights, failures, or ideas with tabular data, please do!