Hi everyone! I just finished Lesson 3 and took a lot of notes; I’d be happy to share them if there’s a good place.

Anyway, I appreciated how in this lesson Jeremy helps develop our intuition about how and why these networks work, using the universal approximation theorem and the neat visualizations showing how different learning rates do a better or worse job of finding the loss minimum.
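As a toy illustration of that learning-rate intuition (my own sketch, not code from the lesson): gradient descent on f(w) = w², where each step multiplies w by (1 − 2·lr), so a tiny rate crawls toward the minimum, a moderate one converges quickly, and a too-large one diverges.

```python
# Gradient descent on f(w) = w^2, whose gradient is 2*w.
# Each step: w <- w - lr * 2 * w = (1 - 2*lr) * w.

def descend(lr, w=1.0, steps=20):
    for _ in range(steps):
        w -= lr * 2 * w  # gradient of w^2 is 2w
    return w

for lr in (0.01, 0.4, 1.1):
    # lr=0.01: slow shrink; lr=0.4: rapid convergence; lr=1.1: divergence
    print(f"lr={lr}: w after 20 steps = {descend(lr):.4f}")
```

The middle rate overshoots the minimum on every step but still lands closer each time, which is the behavior the lesson’s visualizations show.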

This all came together especially well for me, because I previously took a Coursera course on Discrete Optimization (from the University of Melbourne). In solving various NP-hard problems with constraint programming and local search, I got to know the “lay of the land” of search spaces and how to traverse a loss landscape. Here are some concepts with direct application to this course:

- The less you know about the space you’re searching, the more you want to incorporate randomness
  - SGD introduces randomness, and according to the likes of Tishby, it is essential for network performance
  - Mixed-precision floats introduce randomness too
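To make the SGD point concrete, here is a minimal sketch (my own, on made-up toy data) showing that minibatch gradients are noisy estimates that scatter around the full-batch gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: y = 3x plus a little noise.
x = rng.normal(size=1000)
y = 3.0 * x + rng.normal(scale=0.1, size=1000)

def grad(w, idx):
    """Gradient of mean squared error over the examples in idx."""
    return np.mean(2 * (x[idx] * w - y[idx]) * x[idx])

full = grad(0.0, np.arange(len(x)))  # exact full-batch gradient: one fixed number
minis = [grad(0.0, rng.choice(len(x), size=32)) for _ in range(5)]  # five noisy estimates

print(full)
print(minis)
```

Every minibatch points roughly the same way as the full gradient, but each one is perturbed, and that jitter is the randomness SGD injects into the search.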

- The more you know about the problem, the more you want to bake that knowledge into the model architecture
  - CNNs leverage assumptions of spatial locality
    - I once did an experiment where I trained a CNN and an MLP to learn the generating functions of elementary cellular automata, and plotted the accuracy of each model against the entropy of the function. Holding other factors constant, the CNN **far** outperformed the MLP.
      - It appeared that the MLP was compressing the data (Shannon entropy)
      - It appeared that the CNN was learning the function (Kolmogorov complexity)
  - RNNs match problems with sequences
  - But don’t overstate it! Be careful: AlphaZero beats Stockfish
- Apparently these two points together form the No Free Lunch theorem, which sort of restates Bayes’ theorem
- When searching for global minima, you want to take big steps in the beginning, and then take smaller steps later on
  - Temperature schedule in simulated annealing
  - Also almost *exactly* what Jeremy said re: learning rate
    - Except the idea of gradually ramping up to the max learning rate in phase one was new to me
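Here is a sketch of the parallel (my own illustrative numbers, not Jeremy’s exact schedule): a geometric temperature decay from simulated annealing next to a rough one-cycle-style shape that warms up to the max learning rate and then anneals down.

```python
import math

def annealing_temperature(t, T0=1.0, decay=0.95):
    """Simulated annealing: the temperature (effective step size)
    shrinks geometrically over time."""
    return T0 * decay ** t

def one_cycle_lr(t, total=100, max_lr=1.0):
    """Rough one-cycle shape: warm up linearly to max_lr over the
    first ~30% of training, then anneal down with a cosine.
    The 30% split and cosine are illustrative choices, not the
    lesson's exact schedule."""
    warmup = int(0.3 * total)
    if t < warmup:
        return max_lr * t / warmup
    progress = (t - warmup) / (total - warmup)
    return max_lr * 0.5 * (1 + math.cos(math.pi * progress))

print([round(one_cycle_lr(t), 2) for t in (0, 15, 30, 65, 99)])
# → [0.0, 0.5, 1.0, 0.5, 0.0]
```

Both schedules end in the same place, taking small careful steps late in the search; the one-cycle version just adds the warm-up phase at the start.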

What is the point of sharing all this? One is to point out that the Discrete Optimization course is a great resource that will help develop your intuition. Coming at the same problem from different angles is a good way to learn and get comfortable with the subject.

Two is to start a discussion, if anyone finds this interesting.

One last thing: https://en.wikipedia.org/wiki/Level_set#Level_sets_versus_the_gradient

> If the function *f* is differentiable, the gradient of *f* at a point is either zero or perpendicular to the level set of *f* at that point.

The way they describe this in the wiki article sounds an awful lot like ReLU. I wonder if there’s something there?
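The quoted fact is easy to sanity-check numerically. Here is a toy example of my own for f(x, y) = x² + y², whose level sets are circles:

```python
import numpy as np

# For f(x, y) = x^2 + y^2, the level set through a point p is the
# circle of radius |p|. The tangent direction to that circle at
# (x, y) is (-y, x); the gradient of f there is (2x, 2y).
# Perpendicularity means their dot product is zero.

def grad_f(p):
    x, y = p
    return np.array([2 * x, 2 * y])

def tangent(p):
    x, y = p
    return np.array([-y, x])  # tangent to the circle through p

p = np.array([0.6, 0.8])  # a point on the unit circle
print(np.dot(grad_f(p), tangent(p)))  # → 0.0: gradient ⊥ level set
```

Algebraically the dot product is 2x·(−y) + 2y·x = 0 at every point, matching the Wikipedia statement.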