# Glossary of Deep Learning Terms for fast.ai

Terms are organized alphabetically.
For abbreviations, see the fastai Abbreviation Guide.

This is a work in progress and I hope people contribute. My vision is that this is a place for short descriptions of each term: about 3 lines or sentences. If people want to add links to blogs and papers that would be cool too. Maybe we can add some sort of voting tally to put the best blogs at the top of any lists. Cheers!

# A

## Activation

An activation is a number that gets calculated. It is the result of either an affine function (matrix multiplication, more generally) or an activation function (a nonlinearity such as ReLU). Activations do not come only from matrix multiplications; the outputs of activation functions are activations too.

## Activation Function

An activation function is any non-linear function applied to the weighted sum of the inputs of a neuron in a neural network. The presence of activation functions makes neural networks capable of approximating virtually any function. Commonly used functions include ReLU (Rectified Linear Unit), tanh, sigmoid and variants of these. These are element-wise functions: they never change the dimensions of the matrix, only its contents.
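As a rough illustration of the element-wise point above, here is a minimal sketch in plain Python. The helper names (`apply_elementwise`, `shape`) and the input matrix are made up for this example; they are not part of fastai or PyTorch.

```python
import math

def relu(x):
    """ReLU: negative values become 0, everything else passes through."""
    return max(0.0, x)

def sigmoid(x):
    """Sigmoid: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def apply_elementwise(fn, matrix):
    """Apply fn to every entry of a list-of-lists matrix (hypothetical helper)."""
    return [[fn(value) for value in row] for row in matrix]

def shape(matrix):
    """Return (rows, columns) of a list-of-lists matrix (hypothetical helper)."""
    return (len(matrix), len(matrix[0]))

# Made-up 2x3 matrix of weighted sums coming out of an affine function.
pre_activations = [[-2.0, 0.0, 3.0],
                   [1.5, -0.5, 4.0]]

relu_out = apply_elementwise(relu, pre_activations)
sigmoid_out = apply_elementwise(sigmoid, pre_activations)
tanh_out = apply_elementwise(math.tanh, pre_activations)

# The contents change, the dimensions do not.
print(relu_out)
print(shape(pre_activations), shape(relu_out), shape(sigmoid_out), shape(tanh_out))
```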

## Adam

Adam is an adaptive learning rate algorithm. Updates are estimated directly from running averages of the first and second moments of the gradient, and they include a bias correction term.
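To make the running averages and the bias correction concrete, here is a minimal single-parameter sketch. The function name `adam_minimize`, the toy objective and the hyperparameter values are assumptions chosen for illustration; this is not how fastai or PyTorch implement their optimizers.

```python
import math

def adam_minimize(grad_fn, x, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a 1-D function with Adam, given a function returning its gradient."""
    m = 0.0  # running average of the gradient (first moment)
    v = 0.0  # running average of the squared gradient (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction: m and v start at 0, so early averages are too small.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Toy objective (an assumption for this example): f(x) = (x - 3)^2, gradient 2 * (x - 3).
x_min = adam_minimize(lambda x: 2 * (x - 3.0), x=0.0)
print(x_min)  # should end up close to 3
```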

## Affine Functions

These are linear functions, possibly with a constant (bias) added. If you are multiplying things together and adding them up, that is a linear function. They are not always exactly matrix multiplications: convolutions are matrix multiplications with some of the weights tied, so it is more accurate to call them affine functions. If you stack one affine function on top of another, the result is still an affine (linear) function. That is why you put a nonlinearity such as ReLU in between: it is what makes a deep neural network more expressive than a single affine function.
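The claim that stacked affine functions collapse into one can be checked by hand. The sketch below uses plain Python lists; the helpers `affine` and `matmul` and all the weights are made up for this example.

```python
def affine(weights, bias):
    """Return f(x) = weights @ x + bias for list-based matrices and vectors."""
    def f(x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(weights, bias)]
    return f

def matmul(a, b):
    """Multiply two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# Made-up weights and biases for two stacked affine layers.
w1, b1 = [[1.0, 2.0], [0.0, -1.0]], [0.5, 1.0]
w2, b2 = [[3.0, 1.0], [2.0, 0.0]], [-1.0, 2.0]
f1, f2 = affine(w1, b1), affine(w2, b2)

# The stack f2(f1(x)) equals one affine function with
# W = w2 @ w1 and b = w2 @ b1 + b2.
w_combined = matmul(w2, w1)
b_combined = [sum(w * b for w, b in zip(row, b1)) + extra
              for row, extra in zip(w2, b2)]
f_combined = affine(w_combined, b_combined)

x = [2.0, -3.0]
stacked = f2(f1(x))
collapsed = f_combined(x)

# With a ReLU in between, the stack no longer collapses to that single affine function.
with_relu = f2([max(0.0, v) for v in f1(x)])

print(stacked, collapsed, with_relu)
```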

## Augmentation

See Data Augmentation

# B

## Backpropagation

Backpropagation is the primary algorithm for computing the gradients that gradient descent uses to train neural networks. First, the output values of each node are calculated in a forward pass. Then, the partial derivative of the error with respect to each parameter is calculated in a backward pass through the computation graph. Gradient descent then uses these derivatives to update the parameters.

```
parameters -= parameters.grad * learning_rate
```
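For a feel of the forward pass, the backward pass and the update rule together, here is a hand-written sketch for the smallest possible model, `pred = w * x + b` with a squared error. The model, the data point and the learning rate are assumptions for illustration; real frameworks compute the derivatives automatically.

```python
def forward(w, b, x):
    """Forward pass: compute the prediction."""
    return w * x + b

def backward(w, b, x, y):
    """Backward pass: return the loss and its partial derivatives via the chain rule."""
    pred = forward(w, b, x)
    error = pred - y
    loss = error ** 2
    dloss_dpred = 2 * error
    grad_w = dloss_dpred * x    # because d(pred)/dw = x
    grad_b = dloss_dpred * 1.0  # because d(pred)/db = 1
    return loss, grad_w, grad_b

# Made-up starting parameters, single training example and learning rate.
w, b = 0.0, 0.0
x, y = 2.0, 5.0
learning_rate = 0.05

losses = []
for _ in range(50):
    loss, grad_w, grad_b = backward(w, b, x, y)
    losses.append(loss)
    # Same update rule as above: parameters -= gradient * learning rate.
    w -= grad_w * learning_rate
    b -= grad_b * learning_rate

print(losses[0], losses[-1], forward(w, b, x))
```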

# C

## Categorical variable

Categorical variables are mainly strings. You can also treat date parts as categories, e.g. day of the week, month, day of the month.
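As a small sketch of turning dates into categories, the snippet below extracts the day of the week and maps each category to an integer code. The helper names (`day_of_week`, `encode`) and the dates are made up for this example; fastai and pandas have their own utilities for this.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def day_of_week(d):
    """Turn a date into a categorical value (hypothetical helper)."""
    return DAY_NAMES[d.weekday()]

def encode(values):
    """Map each distinct category to an integer code (hypothetical helper)."""
    vocab = {v: i for i, v in enumerate(sorted(set(values)))}
    return [vocab[v] for v in values], vocab

# Made-up dates for illustration.
dates = [date(2024, 1, 15), date(2024, 1, 17), date(2024, 1, 22)]
days = [day_of_week(d) for d in dates]
codes, vocab = encode(days)

print(days)
print(codes, vocab)
```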

## Channel(s)

Also known as dimensions.
For images, the colors: red, green, blue (3 channels) or grayscale (1 channel).
For language data: different types of embeddings.

## Continuous variable

Includes continuous numbers, e.g. age, price, quantities.

## Cross entropy

• cross entropy loss

# D

## Dataframe

Data organized in a table, for example loaded from a .csv file. DataFrame is a term from pandas.

## Dataset

The data that you feed the algorithm.

## Dependent

For example, a dependent variable tensor is

# E

## Embedding

A mapping of a word or a sentence into a vector, e.g. word2vec.

## Epoch

## Exploding Gradient

When the standard deviation becomes equal to infinity.

Used in CNNs

See “Layers”

# G

## Gradient Clipping

A technique to prevent exploding gradients in deep neural networks, commonly used with RNNs.
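One common variant rescales the whole gradient vector when its norm exceeds a threshold. The sketch below shows that idea in plain Python; the function name `clip_grad_norm`, the threshold and the gradients are assumptions for illustration, not a library API.

```python
import math

def clip_grad_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm (hypothetical helper)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# Made-up gradients: the first set is too large, the second is already small enough.
clipped, norm_before = clip_grad_norm([3.0, 4.0], max_norm=1.0)
untouched, small_norm = clip_grad_norm([0.3, 0.4], max_norm=1.0)

norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after, clipped, untouched)
```

Clipping by norm keeps the direction of the gradient and only shortens it, which is why it is often preferred over clipping each value separately.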

# I

## Independent

For example, an independent variable tensor is

x

## Iterate, Iterator

To go through objects (add types of objects?) one at a time.
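In Python, going through objects one at a time looks roughly like this. The `batches` list is made up for the example; a training data loader is iterated the same way.

```python
# Made-up list of mini-batches standing in for a data loader.
batches = [[1, 2], [3, 4], [5, 6]]

it = iter(batches)            # create an iterator over the batches
first = next(it)              # pull a single item from it
rest = [batch for batch in it]  # a for loop pulls the remaining items one at a time
exhausted = next(it, None)    # nothing left, so the default is returned

print(first, rest, exhausted)
```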

# L

## Label

A tag applied to an object.
For example, the name of an image.

## Layers

There are exactly two types of layers: those containing parameters and those containing activations.

• Input - Special activation layer at the start of the neural network. It doesn’t get calculated. It’s just there.

• Output - Activation layer at the end of the neural network. It contains the results of applying the final activation function (either sigmoid or softmax) to the previous layer’s output.

• Hidden - The layers in between. They can be convolutions/matrix multiplications (linear) or ReLUs (nonlinear).

• Dense

• Sparse

• Fully Connected

## LSTM

Long Short-Term Memory networks use a memory gating mechanism that helps prevent the vanishing gradient problem.

# M

## MNIST

A data set of 60,000 training and 10,000 test examples of handwritten digits. Each image is 28×28 pixels large.

# N

## Named Entity Recognition (NER)

Named-entity recognition (NER), also known as entity identification, entity chunking and entity extraction, is a subtask of information extraction that seeks to locate and classify named entity mentions in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

# O

## Optimizer

• layer group learning rate optimizer

# P

## Parameters

Parameters include weights and biases. These numbers are initialized randomly at first, but it is better to initialize them with Kaiming initialization (`kaiming_normal_()` from `torch.nn.init`). Then our model learns them: we update the parameters with the gradient descent algorithm.
These parameters are multiplied with the input activations, resulting in a matrix product.

```
parameters -= parameters.grad * learning_rate
weights = weights - weights.grad * learning_rate
```
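To show what Kaiming initialization does to the numbers, here is a hand-rolled sketch in plain Python. It assumes the usual ReLU variant, where weights are drawn from a normal distribution with standard deviation `sqrt(2 / fan_in)`. The function names and layer sizes are made up; in practice you would call the PyTorch initializer rather than write this yourself.

```python
import math
import random

def kaiming_normal(fan_in, fan_out, seed=0):
    """Draw a fan_out x fan_in weight matrix with std sqrt(2 / fan_in) (hand-rolled sketch)."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def matvec(weights, x):
    """Multiply the weight matrix with a vector of input activations."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

# Made-up layer sizes and input activations.
weights = kaiming_normal(fan_in=256, fan_out=64)
inputs = [1.0] * 256
outputs = matvec(weights, inputs)

# Compare the empirical spread of the weights with the target value.
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
empirical_std = math.sqrt(sum((w - mean) ** 2 for w in flat) / len(flat))
target_std = math.sqrt(2.0 / 256)

print(len(outputs), empirical_std, target_std)
```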

## Pooling

• Max-pooling
• Average-pooling
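The two variants listed above differ only in how each window is reduced to a single number. A minimal sketch on a made-up 4×4 input with non-overlapping 2×2 windows (the helper names `pool2d` and `mean` are assumptions for this example):

```python
def pool2d(image, size, reduce_fn):
    """Reduce each non-overlapping size x size window of image with reduce_fn."""
    rows, cols = len(image), len(image[0])
    return [[reduce_fn([image[r + dr][c + dc]
                        for dr in range(size) for dc in range(size)])
             for c in range(0, cols - size + 1, size)]
            for r in range(0, rows - size + 1, size)]

def mean(values):
    """Average of a list of numbers."""
    return sum(values) / len(values)

# Made-up 4x4 single-channel input.
image = [[1, 3, 2, 0],
         [5, 7, 1, 1],
         [0, 2, 4, 8],
         [2, 0, 6, 2]]

max_pooled = pool2d(image, 2, max)   # keeps the largest value in each window
avg_pooled = pool2d(image, 2, mean)  # keeps the average of each window

print(max_pooled)
print(avg_pooled)
```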

# R

## ReLU, Rectified Linear Units

Often used as activation/non-linear function.

```
f(x) = max(0, x)
```

# S

## Semi-supervised learning

See supervised learning

## Sigmoid

Mostly used in the last layer for calculating outputs.
It predicts values between 0 and 1. Any number of values can be high (near 1) at the same time; they do not need to add up to 1. So it’s used in multi-label classification, like the planet dataset from Kaggle, where a single image can have agriculture, road, and water at the same time.
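A small sketch of the multi-label case: each label gets its own independent sigmoid, so several can be high at once. The raw scores, the 0.5 threshold and the fourth label (`cloudy`) are assumptions added for illustration.

```python
import math

def sigmoid(x):
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Labels from the example above, plus one hypothetical label that should be off.
labels = ["agriculture", "road", "water", "cloudy"]
raw_scores = [2.0, 1.5, 3.0, -3.0]  # made-up outputs of the last layer

probs = [sigmoid(s) for s in raw_scores]
predicted = [label for label, p in zip(labels, probs) if p > 0.5]

print(predicted)
print(sum(probs))  # not constrained to equal 1
```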

## Softmax

Only used in the last layer for calculating output probabilities. It predicts values between 0 and 1 and these values have to add up to 1. Softmax likes to predict a single label.
We use it in a multi-class classification where you have to choose between mutually exclusive classes like MNIST, CIFAR or dog breeds competition on Kaggle.

Jeremy explains it in the excel spreadsheet

In a multi-class scenario like the one above, we take the exponential of each output/prediction value. We then add up all the exponential values (99.81 in the spreadsheet example). To get the softmax of a class, we divide the exponential value of that class by the sum. Mathematically: `softmax(x_i) = exp(x_i) / sum_j exp(x_j)`.
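The steps described above (exponentiate, sum, divide) can be sketched in plain Python as follows. The scores are made up, and subtracting the largest score before exponentiating is an extra numerical-stability step that is not in the description; it does not change the result.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that add up to 1."""
    largest = max(scores)
    # Extra step (not in the description above): shifting by the max avoids overflow.
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw outputs for three mutually exclusive classes.
scores = [1.0, 2.0, 5.0]
probs = softmax(scores)
best_class = probs.index(max(probs))

print(probs, sum(probs), best_class)
```

Because of the exponential, the largest score takes most of the probability mass, which is why softmax tends to pick a single label.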

## Supervised learning

• semi-supervised learning

# T

`y`

## Tensor

n-dimensional array designed to work on GPU for accelerated numerical computations.

## Test-Driven Development (TDD)

A software development process that relies on the repetition of a very short development cycle: requirements are turned into very specific test cases, then the software is improved to pass the new tests, only. This is opposed to software development that allows software to be added that is not proven to meet requirements. (An excerpt from Wikipedia.)

Not exactly a deep learning concept. However, it was mentioned several times during Part 2, so it is worth mentioning.

Used in NLP

# U

## Universal Approximation Theorem

Matrix multiplications and ReLUs stacked one after another have an amazing mathematical property described by the universal approximation theorem (UAT): with big enough weight matrices, and enough of them, the stack can approximate any arbitrarily complex mathematical function to any arbitrarily high level of accuracy.

# V

## Validation

• Model Validation
• Cross Validation

## Validation set

## Vanishing Gradient

Standard deviation going to zero.

# W

## Word2vec

Looks good! It is probably worth creating a wiki from this post.

Done.

It’s a good start.

Just curious, how in depth do you plan to go? Are you trying to keep the glossary specific to Deep Learning and for fast.ai only?

I frequently visit these online resources when I need to refer to commonly used terms:

We need a constantly updated deep learning glossary. Thanks.

I think this could eventually be moved into the fastai docs, or even to something like `https://glossary.fast.ai`, for a wider audience (and the fastai fellows, of course). It has every chance of being frequently updated, thorough and comprehensive, given how many people are involved in the community.