Live coding 16

Hi @moody, I liked your comments on Excel during the session very much. Jeremy once said (in 2019, to be exact) something super encouraging for people (including me) who want to dive into Excel:

I use it every day.
I use it to see how algorithms behave when everything is laid out in front of me.

I finally managed to invest in MS365 yesterday and have started playing with the Excel files from the 2022 version of our course. However, I have forgotten most of the Excel I learnt many years ago, which makes it hard to even play around with the files effectively. I will probably have to watch some tutorials to pick things up one by one. But a much nicer way for us to learn would be to have Jeremy give us one or more walkthrus of his Excel practice in deep learning, or anything he uses it for every day. :grin: Sure, Excel is not a notation, but since Jeremy uses it every day, I do believe it deserves a few walkthrus or live codings. What do you think?

It’s very encouraging to hear you appreciate the use of Excel 45:06. So I wonder whether you could share your experience of using Excel. How useful, or unexpectedly useful, has Excel been for you? Thanks!

4 Likes

A slightly more detailed note for live coding 16 (built from the video timeline by @fmussari)

00:00 - Start

01:04 - About Weighting (WeightedDL)

01:50 - Nick has been applying all the techniques learnt from chapter 2 to the Paddy competition. Jeremy has not practised curriculum learning.

03:08 - Distribution of the test set vs the training set. Why don’t we want a balanced dataset? We want our training set to be more like the test set. When is it appropriate to use WeightedDL? When the distribution of the training set differs from that of the test set.
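For the opposite case, where you do want to oversample some classes, a weighted dataloader draws samples according to per-sample weights. A minimal sketch of one common weighting choice (inverse class frequency), in plain Python with toy labels rather than fastai's actual WeightedDL:

```python
from collections import Counter

def inverse_freq_weights(labels):
    """Per-sample weights inversely proportional to class frequency,
    so a weighted sampler draws each class roughly equally often."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

labels = ["normal"] * 8 + ["blast"] * 2   # imbalanced toy labels
w = inverse_freq_weights(labels)
print(sum(w[:8]), sum(w[8:]))             # → 1.0 1.0  (equal total weight per class)
```

As the session notes say, though: if your training distribution already matches the test distribution, leave it alone.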

03:35 - Is curriculum learning related to boosting? What is boosting, and what is the pitfall of boosting if you are not careful? What is curriculum learning? Using the subset of data the model did poorly on more often.

04:38 - Are the labels ever wrong, by accident or from real-life complexity? Do read the techniques in chapter 2 and experiment with them, which should be enough for this case.

06:40 - Image annotation issues: Paddy Kaggle discussion
Don’t knock out hard samples, and don’t knock out wrongly labelled samples either, as the test set may be consistently wrongly labelled too. Again, review chapter 2.

08:23 - UNIFESP X-ray Body Part Classifier Competition

10:20 - Medical images / DICOM images
What are the troubles of using this type of medical image?

10:57 - fastai for medical imaging
There is a small sublibrary, fastai.medical.imaging, that can handle DICOM directly

11:40 - JPEG 2000 compression; a fastai medical imaging tutorial is available

12:40 - ConvNet paper and Sylvain’s AdamW blog post

13:50 - On Research Field

15:30 - When is a paper worth reading?
A paper from a Kaggle competition, or one showing good results from experiments with less data or less time.
Papers on transfer learning, and papers by people whose work you have read and liked before, and by their colleagues.

17:14 - Quoc V. Le

17:50 - What do you do when your model is trained on a dataset that is not quite the same as the data seen during deployment? Try to capture the data during deployment, because these are the real data you want to train your model with. Also try semi-supervised learning and transfer learning to squeeze the most out of the data you collect during deployment.

20:30 - What would you do when some of the dataset has been updated or changed to some extent, e.g., a new medical machine is producing new images for your dataset? Use fine-tuning; it won’t take much time or data to fine-tune your model. Training on the entire dataset for longer won’t solve this problem.

21:33 - What if you don’t have enough data for some category? Then don’t use the model for that category. Use a binary sigmoid as the last layer instead of softmax. Have a human reviewer in the loop.
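The softmax-vs-sigmoid point can be seen with a tiny numeric example (toy logits, plain Python): softmax is forced to spread probability 1 across the known classes, while independent per-class sigmoids can all stay low, effectively saying "none of the above":

```python
import math

def softmax(zs):
    exps = [math.exp(z) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

logits = [-2.0, -1.5, -3.0]               # the model is unsure about every class
print(sum(softmax(logits)))               # softmax always sums to 1: forced to pick a class
probs = [sigmoid(z) for z in logits]
print(all(p < 0.5 for p in probs))        # → True: every class below threshold, "none of the above"
```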

23:50 - Question about submitting to Kaggle
Creating a good validation set is very important

24:50 - Always have a validation set. When is a random split appropriate? What should the validation set be like? It should be as similar to the test set and deployment data as possible. You should check whether the training set and the test set have similar distributions. If the test set and training set were not randomly selected, you should be alarmed.
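A minimal sketch of the two splitting strategies, with hypothetical helper names and toy data (not fastai's splitters): a random split suits independent samples drawn from the same distribution, while a group-aware split avoids leakage when related samples (same patient, same field, same time period) exist:

```python
import random

def random_split(items, valid_pct=0.2, seed=42):
    """Random split: fine when samples are independent and the
    training data looks like the test data."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * valid_pct)
    return items[cut:], items[:cut]

def group_split(items, groups, valid_groups):
    """Group-aware split: related samples never straddle the split."""
    train = [x for x, g in zip(items, groups) if g not in valid_groups]
    valid = [x for x, g in zip(items, groups) if g in valid_groups]
    return train, valid

imgs = [f"img{i}" for i in range(10)]
patients = [i // 2 for i in range(10)]            # two images per patient
train, valid = group_split(imgs, patients, {4})   # hold out patient 4 entirely
print(valid)                                      # → ['img8', 'img9']
```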

27:19 - Radek comments on the uses of comparing your validation set results (as many as you like, since they are computed locally) with public leaderboard results (only 2 per day)

29:30 - Where did we get to in the last lesson?

31:20 - GradientAccumulation on Jeremy’s Scaling Up: Road to the Top, Part 3 Notebook
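The idea behind GradientAccumulation can be sketched without any framework: summing micro-batch gradients, each scaled by micro-batch size over total, reproduces the full-batch gradient, so you get the effective batch size of a big batch with the memory footprint of a small one. A purely illustrative toy linear model:

```python
def grad_wrt_w(w, xs, ys):
    """d/dw of mean squared error for y_hat = w*x over a batch."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
w = 0.5

full = grad_wrt_w(w, xs, ys)          # gradient over the whole batch at once

# accumulate over micro-batches of 2, scaling each by micro/total size
acc = 0.0
for i in range(0, 4, 2):
    mb_x, mb_y = xs[i:i + 2], ys[i:i + 2]
    acc += grad_wrt_w(w, mb_x, mb_y) * len(mb_x) / len(xs)

print(abs(full - acc) < 1e-12)        # → True: identical gradient, less memory per step
```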

37:20 - “Save & Run” a Kaggle notebook

38:55 - Plans for the next lessons: what the outputs of a model (multi-target loss, softmax, cross-entropy loss, binary sigmoid) and the inputs (embeddings in collaborative filtering) look like

40:55 - Plans for the next lessons: what the “middle” (convnet) of a model looks like
Planned for the next lesson or lesson 8

41:32 - How to debug middle layers? It will be covered in Part 2: a deep dive into the middle layers with advanced debugging techniques, as in the previous Part 2; collaborative filtering will lead us into it.

42:53 - The ethical side also deserves more attention; see Rachel’s 2020 lecture video.

44:30 - fastai1/courses/dl1/excel/ : how underappreciated Excel has been, and how useful and helpful it actually is.

4 Likes

Actually, I have found some Excel-walkthru-like content in the 2020 Lesson 2. 45:50 How to read a paper and experiment with the data and model in Excel.

Does anyone know of more like this? Thanks

3 Likes

Hi @Daniel, I am glad you like Excel too. If Jeremy wants to share his Excel tips, I would love to attend. BTW, now that Excel supports LAMBDA functions, it has become even more powerful.

I know you have great attention to detail, so I have tried to answer in as much detail as possible. Please feel free to ask further questions if anything is unclear.

Coming from an accounting background, Excel is my universal tool, with strong applications in forecasting and scenario modelling. When I came across fast.ai, Jeremy used Keras at first; then he moved to TensorFlow. In the middle of the course, he changed again to PyTorch. For someone who had not been learning Python for long, it was hard to cope with different frameworks. So I focused on learning the underlying concepts. Jeremy used Excel to explain softmax (maths), cross-entropy (maths), gradient descent and its variations (maths and the Solver add-in), convolution (visualisation), and recommendation systems (matrix multiplication and the Solver add-in). That is how I could follow along with Part 1 in 2017.

I am a visual person. I need to “see” before I can absorb new information or concepts. I found dropout very unintuitive. WHY would we spend all this time training a model (much slower and more expensive to train back then), only to delete some of the features/activations the model just learnt??? And yet, with dropout, the model generalises better!? I couldn’t process this concept in my head. :dizzy_face:
So I did the visualisation (note: Jeremy explained the detailed operation in Lesson 8 1:08:42 a few days ago). All of a sudden, I GOT IT!!! (For those who don’t have Excel, all the files were previously converted to Google Sheets.)
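For anyone who prefers code to spreadsheet cells, here is a minimal "inverted dropout" sketch (toy activations; real libraries additionally skip dropout at inference time):

```python
import random

def dropout(acts, p=0.5, seed=0):
    """Inverted dropout: zero each activation with probability p,
    scale survivors by 1/(1-p) so the expected sum is unchanged."""
    rng = random.Random(seed)
    return [0.0 if rng.random() < p else a / (1 - p) for a in acts]

acts = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(acts, p=0.5)
print(dropped)   # some activations zeroed, the rest doubled
```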

Attending Part 2 in spring 2018 was a big stretch for me. Reading ML research papers, full of maths notation, was intimidating. Again, I tried to learn the concepts and immediately fell back into my Excel comfort zone. I managed to reproduce the focal loss graph in Excel first and then reproduce it again in Python. So I learned it twice. (I just realised this helped flatten my forgetting curve.) While I was running (and impatiently waiting for) the Part 2 notebooks, I kept using Excel to understand and experiment with the following concepts:

  • Gradient explosion for CycleGAN
  • Wasserstein GAN (comparing L1 and L2 losses)
  • Artifact when up-sampling images using different methods and kernel sizes

If you are interested, here is the repo. Feedback is welcome. :blush:

Over the years, deep learning frameworks and libraries have come to do most of the heavy lifting for us. We don’t even need to fine-tune a cats-vs-dogs classifier anymore. Knowing the impact of, and reasons for, picking certain parameters and loss functions is far more important.

How useful or unexpectedly useful Excel has been for you?

Additionally, I use Excel extensively for project management (general projects or even deep learning projects) in my corporate career. I use it to:

  • develop checklists based on the concept of Work Breakdown Structure
  • keep brief minutes that contain decisions and actions only (a tab for each meeting, so I can follow up on action items every meeting and make my team accountable for their tasks)
  • keep track of project deadlines, milestones and leaves
  • register data collection (since we needed to collect our own ground-truth dataset)
  • explore the best visualisation options (much easier to change chart types in Excel than in Python)
  • mock up model specifications (broken down into input, process, and output) to avoid misunderstanding, and use the predefined output for User Acceptance Testing later. (Very important for system customisation projects, to ensure they are delivered on time and on budget)

I successfully applied the above with a multi-disciplinary team, located in four different time zones, to deliver a deep learning project: using computer vision for digital pathology. Last year, my team published the findings in Nature’s Scientific Reports. Most of the techniques we used were covered in Part 1. But how to apply them to existing problems and execute within limited resources is still challenging.

In summary, if Jeremy had not used Excel in his teaching, I would not have contemplated learning deep learning at all. Without fastai, I might still be using Excel and working in Finance/Accounting. But now fastai has opened up a whole new world for me to explore.

PS. Thank you for all your detailed notes. They are very helpful. 深度碎片,加油!

9 Likes

Wow, this reply is a huge gold mine, thank you so much Sarada! I am digging into it now

1 Like

Reproducing this list in Excel step by step should be a great exercise for me!

Wow, this is very nice! An even more advanced and challenging exercise!

Very true! It’s one thing to try newly implemented techniques and say they work better because of a lower error rate; it’s another to ‘see’ how the new techniques behave differently and how that difference may contribute to a better result.

This sounds very interesting! I’d love to learn more from you someday about how to build a dataset professionally using Excel.

So, you have actually used Excel at every step of your deep learning project, from input to model building to output? Wow! That’s amazing! I’d love to hear more of the story of how you did it!

Congratulations! This is huge!

Yeah, Jeremy should definitely teach us his tips for using Excel, particularly for deep learning. I guess we should give Jeremy more positive feedback on Excel with deep learning.

2 Likes

Sorry, in this case, not for the deep learning model. But I used this approach to outline the specifications for proprietary software customisation and delivered the project on the first day of a new financial year. So it is possibly out of scope here.

3 Likes

I see, thanks for the clarification.

1 Like

It is so intriguing that one Excel workbook lets you experiment with the foundations of deep learning (forward and backward propagation, i.e. gradient descent) and the essential techniques for improving gradient descent (e.g., momentum, Adam, etc.) on a single-neuron architecture (i.e., the simplest linear function y = a*x + b).
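The single-neuron sheet translates to a few lines of Python: fitting y = a*x + b to the target y = 2*x + 30 with plain SGD (toy data; the learning rate is chosen small enough to keep the updates stable):

```python
# Fit y = a*x + b to the target y = 2*x + 30, mirroring the
# single-neuron Excel sheet with per-sample SGD updates.
xs = list(range(1, 21))
ys = [2 * x + 30 for x in xs]

a, b, lr = 1.0, 1.0, 0.001
for _ in range(20_000):
    for x, y in zip(xs, ys):
        err = (a * x + b) - y
        a -= lr * 2 * err * x   # analytic de/da for e = err^2
        b -= lr * 2 * err       # analytic de/db

print(round(a, 2), round(b, 2))  # → 2.0 30.0
```

Note how slowly b crawls toward 30 compared with a: the intercept sees a much smaller gradient, which is exactly the "many, many iterations" behaviour discussed later in this thread.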

The first thing I want to experiment with in the Excel spreadsheet is building a two-neuron architecture y = c*(a*x+b)+d (one linear function on top of another) instead of the one-neuron y = a*x + b. I don’t know much calculus, but what Jeremy showed us in Excel and in Lecture 5 (2018) 1:34:00 makes me willing to try.

I have no formal schooling in calculus, but I sort of understand the basics of derivatives through fastai videos and Ng’s videos. So I was really thrilled to see that Jeremy put both the numerical and the analytical derivatives of the error with respect to a and b in the Excel sheet, and showed that there is not much difference between the two in this example. (I further tested that when running SGD with the estimated/numerical derivatives of a and b, the error goes down to values very similar to those obtained with the analytical derivatives.) This way, even with 4 weights in the two-neuron architecture, I don’t need to search online for 4 analytical derivative formulas; instead I can calculate their numerical/finite-difference derivatives with almost a single formula.

\frac{\partial e}{\partial a} \approx \frac{(((a + 0.01)x + b)c + d - y)^2 - ((ax + b)c + d - y)^2}{0.01}
\frac{\partial e}{\partial b} \approx \frac{((ax + (b + 0.01))c + d - y)^2 - ((ax + b)c + d - y)^2}{0.01}
\frac{\partial e}{\partial c} \approx \frac{((ax + b)(c + 0.01) + d - y)^2 - ((ax + b)c + d - y)^2}{0.01}
\frac{\partial e}{\partial d} \approx \frac{((ax + b)c + (d + 0.01) - y)^2 - ((ax + b)c + d - y)^2}{0.01}

Are these the correct numerical/finite-difference derivatives for y = c*(a*x+b)+d? If not, what are the correct formulas? I am not confident about them because the error explodes too quickly.
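One quick way to sanity-check formulas like these is to compare the finite difference against the analytic derivative in Python (toy values; using a much smaller step than 0.01 for accuracy). For e = ((a*x+b)*c + d - y)^2, the chain rule gives de/da = 2*((a*x+b)*c + d - y)*c*x:

```python
def err(a, b, c, d, x, y):
    """Squared error of the two-neuron model y_hat = c*(a*x + b) + d."""
    return ((a * x + b) * c + d - y) ** 2

a, b, c, d, x, y, h = 1.0, 1.0, 1.0, 1.0, 3.0, 36.0, 1e-6

# forward finite difference (same shape as the Excel formula, smaller step)
num_da = (err(a + h, b, c, d, x, y) - err(a, b, c, d, x, y)) / h

# analytic derivative via the chain rule
inner = (a * x + b) * c + d - y
ana_da = 2 * inner * c * x

print(abs(num_da - ana_da) < 1e-3)   # → True: the finite difference is correct
```

So the formulas above are valid finite-difference approximations; if the error explodes, the likely culprit is the learning rate (or the large effective slope discussed below), not the derivatives, though the coarse 0.01 step does add some bias.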

Below is my worksheet named “basic SGD 2 neurons” in the workbook you can download from here

1 Like

Another question I have is about momentum.
In the basic SGD worksheet, we calculate the updated parameter b as new b = old b - 0.0001 * de/db. In momentum, we instead use new b = old b - 0.0001 * (modified de/db), where the modified derivative is calculated as modified de/db = 0.1 * de/db + 0.9 * (its previous modified value). But where do the first modified de/db (-18.33) and the first modified de/da (98.246) come from? Jeremy didn’t mention them in the lecture above at 1:53:44.
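For what it's worth, a common convention (and possibly what the sheet does, though I cannot confirm from the video) is to seed the moving average with the first raw gradient, so -18.33 and 98.246 would simply be the first de/db and de/da with no previous value to blend in. A sketch of that convention, with the thread's 0.9/0.1 weighting:

```python
def momentum_updates(grads, beta=0.9):
    """Exponential moving average of gradients; the first entry is
    seeded with the first raw gradient (one common convention --
    another is to seed with zero)."""
    avg = grads[0]                 # seed: nothing to average yet
    out = [avg]
    for g in grads[1:]:
        avg = beta * avg + (1 - beta) * g
        out.append(avg)
    return out

ms = momentum_updates([-18.33, -20.0, -15.0])
print(ms)   # smoothed gradients, drifting slowly toward recent values
```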

1 Like

I guess a good night’s sleep helped.

3 Likes

y = c*(a*x+b)+d, where a, b, c, d are constants
y = a*c*x + (c*b+d), so a*c and (c*b+d) are constants

Therefore, it is the same as y = a*x + b

Walk away and think about what you tried to achieve. Don’t give up. We are here to help. :slightly_smiling_face:

PS. In your example, the constants are a=2, b=30, c=3 and d=10. With learning rate = 0.0001 and initial values a=1, b=1, c=1 and d=1, it will take many, many iterations before you can reach b=30.
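The collapse is easy to confirm numerically, using the constants above (a=2, b=30, c=3, d=10): the two-layer form and the single collapsed line agree at every input.

```python
def two_linear(x, a, b, c, d):
    return c * (a * x + b) + d          # two stacked linear "neurons"

def one_linear(x, a, b, c, d):
    return (a * c) * x + (c * b + d)    # collapsed into one linear layer

# identical for every input: no activation means no extra capacity
print(all(two_linear(x, 2, 30, 3, 10) == one_linear(x, 2, 30, 3, 10)
          for x in range(-10, 11)))     # → True
```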

2 Likes

Based on your “data” tab, I replaced x with a sequential integer starting from 1, so you can see their relationship: just different slopes (a vs a*c) and intercepts (b vs c*b+d).

2 Likes

Tip: Trace Precedents and Trace Dependents are the debuggers of Excel. Use Remove Arrows (a button below them) to remove the blue lines.

3 Likes

Thanks a lot Sarada!

You are very right: it does boil down to a simple linear function, just with larger constants a*c and (c*b + d). Also, the exploding error early on is probably due to the large difference between a*c and my initial value of 1.

So, what was I trying to do? How did I come up with y = c*(a*x+b)+d?

I want to build a slightly more complex model to find our target y = 2*x + 30. Jeremy used the simplest model, y = a*x + b, with two parameters a and b, and I want my slightly more complex model to have more than 2 parameters. If we can think of the model in terms of neurons, then I can picture the simplest model as a single neuron with a and b as weights (without an activation function). If so, could a slightly more complex model be two connected neurons (one with a and b, the other with c and d) without activation functions?

I wonder whether the absence of an activation function is the reason why the two neurons essentially collapse into one.

Then if I add an activation function such as a ReLU to the first neuron, would it prevent the collapse? I wonder how I would prove it. If it does not collapse, then I assume the derivative formulas I posted earlier should still work. (I will keep experimenting to explore these questions.)

1 Like

Great tips, thanks a lot!

1 Like

Let me try to prove it via visualisation. Let Z be the ReLU of a*x+b. In Excel, that is max(a*x+b, 0), or cell C6 in the example below. So, with the ReLU, the formula becomes y = c*max(a*x+b, 0) + d: a bigger linear function with “a bent elbow”.

ReLU looks like the image below (a 45-degree upward straight line when x is greater than zero). So the constants (a, b, c and d) affect the slopes (two different slopes apply, for x greater than zero and x less than zero) and the intercepts.
[image: plot of the ReLU function]

Does that make sense to you? Can you “see” the power of visualisation? That is how I learned and did all the experiments. :slightly_smiling_face:
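The same point can be checked numerically (toy constants): with a ReLU in between, the slope below the elbow differs from the slope above it, so no single line a'*x + b' can reproduce the function, and the two neurons no longer collapse.

```python
def relu_net(x, a, b, c, d):
    """Two-neuron model with a ReLU between the layers."""
    return c * max(a * x + b, 0) + d

# slope on each side of the elbow (a=1, b=0 puts the elbow at x=0)
lo = relu_net(-10, 1, 0, 3, 2) - relu_net(-11, 1, 0, 3, 2)   # flat region
hi = relu_net(10, 1, 0, 3, 2) - relu_net(9, 1, 0, 3, 2)      # sloped region
print(lo, hi)   # → 0 3  (two different slopes: not a single line)
```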

3 Likes

I heard somewhere (maybe in one of Jeremy’s previous lectures) that if we don’t have a non-linearity at each layer, the result is as if we had just one layer. I can’t remember where I heard it.

A neural network is an Affine-ReLU sandwich and you need the delicious ReLU “filling” in between the slices of Affine “bread” :smiley: :sandwich:

2 Likes

Thanks for the reply, Mike. You are absolutely right! The experiments I explored in Excel also confirm that models with ReLU are easier to train than those without.

I will share more of the Excel experiments later.

1 Like

In fact, based on my experiments, having two neurons (4 weights) without ReLU performs worse than a single-neuron model when the target function is a simple linear function with 2 weights.

1 Like