Share your work here (Part 2)

No, definitely avoid that.

3 Likes

I modified model_summary a little in the 11_train_imagenette notebook to:

def model_summary(model, find_all=False):
    # grab one batch from the validation set
    xb,yb = get_batch(data.valid_dl, learn)
    # hook either every linear-like layer, or just the top-level children
    mods = find_modules(model, is_lin_layer) if find_all else model.children()
    f = lambda hook,mod,inp,out: print(f'{mod}\n{out.shape}\n------------------------------------------------------------------------------')
    # register the print hook on each module, run one forward pass, then the hooks are removed
    with Hooks(mods, f) as hooks: model(xb)

Then I ran model_summary(learn.model, find_all=True), and it prints out each module and its output shape:


Or model_summary(learn.model):

I found it helpful to see in one place how the modules change the output shape, so I thought I’d share it!

2 Likes

Great minds think alike - that’s what the summary in nb 08 does too! :slight_smile:

FYI you can write this more conveniently as:

print(f'{mod}\n{out.shape}\n{"-"*40}')
2 Likes

Here is a small Medium post summarizing the paper on training BERT with LAMB that was introduced during Lecture 11.
As always, corrections and comments improving the style and content are welcome.

1 Like

Hey, tired of reading how everything went right? Want to see a bunch of AC/DC references shoved into an article?

Then look no further. I entered the Kaggle VSB power line competition, which went horribly wrong for me, and wrote it up here anyway.

That heatmap can’t be right at all!

4 Likes

TfmsManager: visually tune your transforms

I’ve published a tool to quickly visualize and tune a complex chain of data transforms (Audio & Image).



15 Likes

How to have a Swift kernel in Colab:

3 Likes

Ever since getting into deep learning, and making my first PR to PyTorch last year, I’ve been interested in digging into what’s behind the scenes of the Python wrappers we use, and understanding more about what’s going on at the GPU level.

The result was my talk "CUDA in your Python: Effective Parallel Programming on the GPU", which I had the chance to present at the PyTexas conference this past weekend.

I would love any feedback on the talk, as I’m giving it again at PyCon in ~3 weeks.

15 Likes

Google Colab template (link) - the exported modules are taken care of, so you can pretend you have a local Jupyter with all the previous lessons.

3 Likes

When development began last fall on Fast.ai 1.0, I decided to try writing my own version in Swift so that I could learn more about how Fast.ai is put together, demystify the things that it does, and learn PyTorch better. It also let me keep practicing my Swift skills. Since a couple of the Part 2 lessons are going to use Swift, I thought I’d share what I’ve created so far for anyone here who is interested. It might be useful for those who want to see how things like callbacks, training loops, and closures can be done in Swift, as well as how to run Python and PyTorch code from within Swift.

I’ve created a Docker setup for ease of installation, along with some examples that can be run: MNIST, CIFAR-10, Dogs vs. Cats Redux (Kaggle), the Kaggle Planet competition, and Pascal VOC 2007. You can also submit the output to Kaggle for the two Kaggle competitions. I’m going to try to add to the readme on how things are architected, but for now it just has installation instructions and how to run the examples. Also, unfortunately it only supports CPU for now. Here’s a link to the repo: https://github.com/sjaz24/SwiftAI

7 Likes

Thanks @stephenjohnson! Would love to hear if you see anything in our dev_swift notebooks that you think could be improved based on your experiences.

I think the Medium link might be broken. It won’t let me click it.

I’ll take a look and let you know.

I did a small experiment that suggests that as networks get deeper we should train them multiple times using different initialisation parameters and use a voting scheme for inference. Below is my rationale. Interested in people’s thoughts.

In the previous lessons we learnt that parameter initialisation is very important. However, Kaiming initialisation is still derived from random numbers, so we should not assume we get a good starting position when we train a network. If we make just one attempt we could get unlucky; if we try multiple attempts we reduce our chances of starting off on the wrong foot. It also means we get to explore different parts of the network’s state space, because training is designed to minimise loss and that process only starts after initialisation. So if we use different starting positions and save those models for inference, we increase our chances of success (because we explored a broader space that allows the models to collectively calibrate against the data).
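To make it concrete, here is a rough sketch of the kind of multi-init training and voting I have in mind (just an illustration on a toy PyTorch classifier with synthetic data; the model, training loop and sizes are placeholders, not a real experiment):

import torch
import torch.nn as nn
import torch.nn.functional as F

# toy data: 1000 samples, 20 features, 3 classes
torch.manual_seed(0)
x = torch.randn(1000, 20)
y = torch.randint(0, 3, (1000,))

def make_model():
    m = nn.Sequential(nn.Linear(20, 50), nn.ReLU(), nn.Linear(50, 3))
    # re-initialise the linear layers with Kaiming, using whatever random state is current
    for layer in m.modules():
        if isinstance(layer, nn.Linear): nn.init.kaiming_normal_(layer.weight)
    return m

def train(model, epochs=5):
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
    return model

# train the same architecture from several different random initialisations
models = []
for seed in range(5):
    torch.manual_seed(seed)
    models.append(train(make_model()))

# inference: each model votes, the majority class wins
with torch.no_grad():
    votes = torch.stack([m(x).argmax(1) for m in models])  # (n_models, n_samples)
    preds = votes.mode(dim=0).values                       # majority vote per sample

The only thing that varies between the models is the random seed used at initialisation, so any gain from the vote would come purely from the different starting positions.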

I did a small experiment to show how this might play out with Kaiming initialisation. The left and right charts (and the green and red histograms) represent the means and standard deviations of the activations after each consecutive matrix multiplication. I simulated 1000 initialisations and performed 20 (L) consecutive matrix multiplication operations. What is interesting is the range: as L increases, the range in mean and standard deviation increases, which suggests we are more likely to randomly choose an unlucky initialisation as L increases. FYI I used some of @jamesd’s code from his great blog.

import math
import numpy as np
import matplotlib.pyplot as plt

def kaiming(m,h):
    # Kaiming/He init for an (m,h) weight matrix feeding a ReLU
    return np.random.normal(size=m*h).reshape(m,h)*math.sqrt(2./m)

data = []

inputs = np.random.normal(size=512)

# 1000 random initialisations, each pushed through 20 ReLU layers
for i in range(1000):
    data.append([])
    x = inputs.copy()
    for j in range(20):
        a = kaiming(512, 512)
        x = np.maximum(a @ x, 0)
        # record the mean and std of the activations after each layer
        data[i].append((x.mean(), x.std()))

data = np.array(data)  # shape: (1000, 20, 2)

fig, ax = plt.subplots(1, 2, figsize=(20,10))
ax[0].plot(data[:,:,0].T, '.', color='gray', alpha=0.1)
ax[0].set_title('mean')
ax[0].set_xlabel('layer')
ax[1].plot(data[:,:,1].T, '.', color='gray', alpha=0.1)
ax[1].set_title('std')
ax[1].set_xlabel('layer');

Also a histogram plot.

import pandas as pd
import seaborn as sns

layers = []
means = []
stds = []

# flatten the (1000, 20, 2) results into long format for plotting
for layer in range(20):
    mean = data[:,layer,0]
    std = data[:,layer,1]
    l_values = len(mean)
    layer += 1  # 1-indexed layer number
    layers.extend([layer]*l_values)
    means.extend(mean)
    stds.extend(std)

df = pd.DataFrame({'layers': layers, 'means': means, 'stds': stds})

plt.figure()
g = sns.FacetGrid(df, row="layers", hue="layers", aspect=15, height=4)
g.map(sns.distplot, 'means', kde=False, bins=100, color='green')
g.map(sns.distplot, 'stds', kde=False, bins=100, color='red')
g.map(plt.axhline, y=0, lw=1, clip_on=False);

7 Likes

Interesting work.

I have a question regarding your suggestion of training multiple NNs with different inits and ensembling their predictions.

After having a good init (mean close to 0 and std close to 1 through all the layers) with Kaiming or LSUV, what is the point of training the same model multiple times and ensembling their predictions?
If I want to do ensembling, wouldn’t it be much better to train NNs with different architectures or hyperparameters to get more diversity (as the goal of ensembling, if I understand correctly, is to get uncorrelated errors)?
I am not sure, but I think that would be a better use of computation?

Hi, thanks for pointing it out. Just fixed it. :slightly_smiling_face:

1 Like

I purposefully did not mention ensembles because that term comes with its own connotations. I don’t see this replacing ensembles; I see it as another option for trying to improve a model’s performance. I guess the proof is in the results. When I get a chance I will try it and report back.

Edit…

Having thought about it a little more, it reminds me of how random forests work. A random forest produces many models (trees), and each model gets a vote. The randomness is in the features given to each model. In the approach I’ve suggested, the randomness is in each model’s initialisation parameters.

1 Like

This is an intriguing idea, @maral. It would be interesting to see a pilot experiment in which you implement your idea of training models with multiple initializations and demonstrate improved accuracy or reduced training time or both!

I had an interesting idea: split the pretrained embedding matrix into two groups, trainable and frozen. During training, you only update the rows for vocab items that were missing from the pretrained matrix and leave the others frozen. This lets you learn the domain-specific word embeddings while leaving the more general language model components frozen.
Blog post

Code
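If it helps, here is a rough sketch of the gradient-masking trick in plain PyTorch (assuming a dense-gradient nn.Embedding; the sizes and pretrained_idx below are just placeholders, not from my actual code):

import torch
import torch.nn as nn

vocab_sz, emb_sz = 10000, 300
emb = nn.Embedding(vocab_sz, emb_sz)   # weights would be loaded from the pretrained matrix

# rows that WERE found in the pretrained vectors stay frozen
pretrained_idx = torch.tensor([0, 1, 2, 5, 7])  # placeholder indices
frozen_mask = torch.zeros(vocab_sz, 1)
frozen_mask[pretrained_idx] = 1.

# zero the gradient on frozen rows, so only the missing-word rows get updated
emb.weight.register_hook(lambda grad: grad * (1. - frozen_mask))

With this, an optimiser step leaves the pretrained rows untouched and only the embeddings for words missing from the pretrained vocab get trained.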

3 Likes

One thing in deep learning that always confuses me is deciding on the proper use of weight initialization. I did a lot of study on this but could not figure out any guidelines…