[Invitation to open collaboration] Practice what you learn in the course and help animal researchers! 🐔

Just thought I’d throw in a quick little example of xresnet etc. Note this is not a pretrained model, and what I found is that, epoch for epoch, it’s fairly similar, minus a few training bits that you’ll notice. My current setup is (rough sketch of wiring it together below the list):

  • Mish activation
  • Self-Attention
  • Label Smoothing Cross Entropy
  • Ranger optimizer
  • Cosine Annealing fit function

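Roughly, the pieces wire together like this - just a sketch, not my exact notebook code, and it assumes the fastai2-era imports and an already-built dls (one way to build it is shown right below):

from fastai2.vision.all import *  # in current fastai: fastai.vision.all

net = xresnet18(pretrained=False, n_out=dls.c,
                act_cls=Mish, sa=True)  # Mish activation + self-attention
# (the first conv gets converted to 1 input channel further down)

learn = Learner(dls, net,
                loss_func=LabelSmoothingCrossEntropy(),
                opt_func=ranger,        # Ranger optimizer
                metrics=error_rate)

learn.fit_flat_cos(5, 4e-3)             # flat LR, then cosine annealing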
I also normalized our data using the stats of the first batch of data.
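In fastai2, an empty Normalize() computes its mean/std from the first batch it sees during setup, so something like this (same imports as above; path and the getters are placeholders for our data):

dblock = DataBlock(blocks=(ImageBlock(cls=PILImageBW), CategoryBlock),
                   get_items=get_image_files, get_y=parent_label,
                   batch_tfms=[Normalize()])  # no stats given -> taken from the first batch
dls = dblock.dataloaders(path)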

For the architecture it was an xresnet18 where I modified the first input layer like so (we don’t have pretrained weights, so it’s just converting the conv2d):

import torch.nn as nn

# swap the 3-channel stem conv for a 1-channel one
l = nn.Conv2d(1, 32, kernel_size=(3, 3), stride=(2, 2),
              padding=(1, 1), bias=False)
# sum over the input-channel dim (already 1 here, so effectively a no-op)
l.weight = nn.Parameter(l.weight.sum(dim=1, keepdim=True))
net[0][0] = l
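Quick sanity check that the surgery worked (the input size here is made up):

import torch
x = torch.randn(2, 1, 128, 128)  # batch of two 1-channel spectrograms
print(net(x).shape)              # torch.Size([2, n_out])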

In the first epoch alone I was able to get 7% error, with a finish of 2.8%. However, if you notice, I wasn’t quite training properly or something, because epoch 3 spiked to 18% error. Running another test now :slight_smile:

[image: training results]


Why would this step be required if it is not a pretrained model? Aren’t the weights being initialized randomly?

Weights are already initialized on the call to xresnet18 - init_cnn runs during construction. If you swap in a brand-new layer like this, you’d need to rerun init_cnn(net) so it gets initialized too. I checked on this myself, see this under the __init__ of XResNet:

        super().__init__(
            *stem, nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            *blocks,
            nn.AdaptiveAvgPool2d(1), Flatten(), nn.Dropout(p),
            nn.Linear(block_szs[-1]*expansion, c_out),
        )
        init_cnn(self)

Thanks, will check it out and get back. I’m still not sure I understand this. :slight_smile:

I’m learning it too. We’d actually want the weights from the original, already-initialized model here. So it would be:

w = net[0][0].weight
net[0][0].weight = nn.Parameter(w.sum(dim=1, keepdim=True))

@barnacl another thing we can do is simply:

net[0][0] = nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1)
init_cnn(net)
w = net[0][0].weight
net[0][0].weight = nn.Parameter(w.sum(dim=1, keepdim=True))

Won’t doing w.sum(dim=1) defeat the purpose of the Kaiming init?
Averaging, or just initializing that one channel, seems like what we need to be doing (not sure how much of a difference that would make in training though).
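Working the variance out (assuming the default 3-channel stem, so the weight has shape (32, 3, 3, 3) under Kaiming normal in fan_in mode):

w = net[0][0].weight                  # per-weight variance 2/27 (fan_in = 3*3*3)
w_sum = w.sum(dim=1, keepdim=True)    # variance 3 * 2/27 = 2/9, exactly what Kaiming
                                      # prescribes for the new fan_in of 1*3*3 = 9
w_mean = w.mean(dim=1, keepdim=True)  # variance 2/81, 9x smaller than Kaiming's 2/9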


:+1: .

@radek: Created a pull request for a model based on Fastai2_audio


Sorry if I missed some details, but what is the number 32767 when we do audio.append(x[0] / 32767)? I think it is the maximum value and we divide by it to get audio in the range (0, 1), right? Thanks

@muellerzr living on the edge with those high lrs and ranger :slightly_smiling_face: thx so much for contributing to this, really appreciate it!!! would be great if you could please submit a PR when you are ready :blush:

@adpostma, this is looking terrific!!! Thank you very much for this! :pray:

:partying_face::tada::beers: we now officially have the first PR merged into the repo! Take a look at the amount of information @adpostma was able to squeeze out from the recordings using fastai2_audio:

[image: notebook results]

I have also now added a little leaderboard:
[image: leaderboard]

The more interesting ways of working with the data we can document and get into the repo, regardless of the results, the better :slight_smile:


Earlier I was using librosa, which reads in wav files and normalizes the output to [-1, 1]. 32767 is the max value for a 16-bit signed int, which is the most common format in which the values in a wav file are stored. If you are interested, I wrote a little bit about working with wav files and how the data is stored here.
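For example (hypothetical file name, scipy just for illustration):

import numpy as np
from scipy.io import wavfile

rate, data = wavfile.read("some_recording.wav")  # int16 samples in [-32768, 32767]
audio = data.astype(np.float32) / 32767          # now roughly in [-1, 1]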

You are right though, I should have explained what I was doing there. Updated the notebook to include this now.

Did you move the repo? The links are broken.

thx for letting me know @florianl :slightly_smiling_face: I renamed `Introduction.ipynb` to `introduction.ipynb`, which broke the links.

I used the trick from iafoss’ example (found in this notebook) which centers elements in images.
This makes the images look like so:

The results look good; unfortunately, the model performance is worse than with the original dataset at this point.


Centering horizontally could be quite helpful, I feel. The problem here is probably that you are also centering vertically, and that loses information. The x axis on a spectrogram is just time, so shifting the image towards the center can probably help. But the y axis is frequency - if you move the shapes up or down to align them, you lose information.

CNNs are not that great at figuring out where things are (great material on that from Uber research here), but it is still probably better not to align things vertically when it comes to spectrograms.

Wonder what your results would be with only horizontal alignment :slight_smile:


Couldn’t you just pass the c_in parameter, e.g. xresnet18(c_in=1), and achieve the same result? :slight_smile:
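If I’m reading the XResNet source right, the stem’s first conv is built from c_in, so:

net = xresnet18(c_in=1)
print(net[0][0])
# Conv2d(1, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)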


This makes sense! The centering was out of curiosity; I wanted to see if it has any effect on the model :slight_smile:

I am training it without vertical centering right now.
But wouldn’t I lose information with only horizontal centering as well? How long a “coo” is might also be an important feature which I am losing, right?


Horizontal centering doesn’t affect how long a “coo” is, only when it starts. I think it kind of normalizes the signal, because intuitively, when the signal starts should not carry any importance.

Vertical centering is different, because where your signal sits on the y axis reflects how high-pitched the vocalization is. For example, in humans, males have lower-frequency voices than females. If you center everything vertically, you can no longer differentiate male from female :smiley:


I set up the dataset so the images are centered only horizontally.
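Roughly along these lines - not iafoss’ actual code, just a hypothetical stand-in with an arbitrary thresh:

import numpy as np

def center_horizontally(img, thresh=0.1):
    # img: 2D spectrogram, signal assumed brighter than the background
    cols = np.where(img.max(axis=0) > thresh)[0]  # time steps containing signal
    if len(cols) == 0:
        return img
    shift = img.shape[1] // 2 - (cols[0] + cols[-1]) // 2
    out = np.zeros_like(img)
    src = np.arange(img.shape[1])
    dst = src + shift
    keep = (dst >= 0) & (dst < img.shape[1])
    out[:, dst[keep]] = img[:, src[keep]]  # shift columns, pad with zeros
    return out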

In the tables below, the left side is the training with vertical and horizontal centering, the right side with horizontal centering only. One can actually see that the model could learn some information from the vertical position! The training generally doesn’t seem to be that stable, though.

learn = Learner(dls, xresnet50(c_in=1), metrics=error_rate).to_fp16()


CNNs do most computation on objects in the center - the regions towards the edges receive less scrutiny, and the closer to the edge you go, the more pronounced this effect becomes. This has to do with the effective receptive field size, but even for a single 3x3 conv without padding, the pixel in the upper left hand corner will be processed by a kernel only once, the adjacent pixel to its right twice, and the next pixel in line thrice (this assumes a stride of 1).
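A quick way to see the once/twice/thrice counting - a minimal sketch where a transposed convolution of an all-ones tensor counts how many kernel applications touch each input pixel:

import torch
import torch.nn.functional as F

n = 8
out_ones = torch.ones(1, 1, n - 2, n - 2)   # one 1 per 3x3 kernel application
k = torch.ones(1, 1, 3, 3)
coverage = F.conv_transpose2d(out_ones, k)  # visits per input pixel
print(coverage[0, 0, 0, :4])                # tensor([1., 2., 3., 3.])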

As a rule of thumb, where possible, it is good to center the object and not have important information appear by the edges. In theory positioning the signal horizontally in the center should be helpful to the CNN in processing the spectrogram, and I agree with you, this shouldn’t be removing any information.
