(Will be updated with edited video later)
- Welcome to lesson 4
- NLP review
- Text classification: assigning documents to categories
- Example: during the week, a lawyer classifying legal texts
- 3 steps in NLP
- Problem: we have 25,000 reviews in the dataset, and for each one only a single bit of information (positive or negative)
- Our neural networks are matrix multiplications + non-linearities and start from random weights. That is not enough to learn how to speak English
- Until recently neural nets didn't do well on this kind of classification; there is not enough information => the trick is transfer learning
- Pretrained language model: a model that learns to predict the next word in a sentence, which requires knowledge of English and of the world
- N-grams vs. deep neural nets
- There is a lot of information in predicting the next word => to do it well you have to learn, roughly, how to speak English
- Wikipedia dataset: built a language model from all of Wikipedia, about a billion tokens
- Ideally it would learn something like this sentence
- Start by training on Wikipedia and make that model available to everybody for transfer learning
- Then take that, do transfer learning, and get a model that is good at predicting the next word of movie reviews
- Then it will understand that "my favourite actor is Tom ___", know movie names, and know whether particular directors are bad
- No external labels needed: the labels are built into the data itself (self-supervised learning)
- q: Does it work for text from messages and forums, informal English, slang and domain abbreviations? => yes, you can fine-tune it with slang etc.
- Language models can be powerful: blog post from SwiftKey about their predictive keyboard
- Andrej Karpathy example: training on LaTeX documents to generate automated papers
- IMDb notebook - looking at the DataBunch
- What happens behind the scenes, looking at the labels, tokens
- Numericalization: build a vocab and replace each word with its vocab id
- Restrict the vocab to a reasonable size => rare words become the unknown token (xxunk); see the sketch below
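A minimal plain-Python sketch of the numericalization idea (not fastai's actual implementation; the toy corpus and vocab cap are my own):

```python
from collections import Counter

# Toy corpus standing in for tokenized movie reviews (assumes tokenization already happened).
docs = [["this", "movie", "was", "great"],
        ["this", "movie", "was", "terrible"],
        ["great", "acting", ",", "terrible", "plot"]]

# Build a vocab of the most frequent tokens, capped at a maximum size.
max_vocab = 6
counts = Counter(tok for doc in docs for tok in doc)
itos = ["xxunk"] + [tok for tok, _ in counts.most_common(max_vocab)]  # id -> string
stoi = {tok: i for i, tok in enumerate(itos)}                          # string -> id

# Numericalization: replace each token with its vocab id; rare words map to xxunk (id 0).
ids = [[stoi.get(tok, 0) for tok in doc] for doc in docs]
print(itos)
print(ids)
```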
- Data Block API
- Another approach with the Data Block API: say what your list type is, how to split it, how to label it…
- Looking at the whole IMDB dataset
- Start with the language model
- 10% validation split, but for the language model we only need to put the labels aside; the independent variables (the texts themselves) can still be used => so the texts in the test set can be used for language-model training
- Labeling
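A sketch of the language-model data setup with the Data Block API, assuming fastai v1 and a `path` pointing at the IMDb data; method names (e.g. `split_by_rand_pct`) vary a bit between fastai v1 releases:

```python
from fastai.text import *

bs = 48  # batch size; value is an assumption
data_lm = (TextList.from_folder(path)
           # The language model needs no labels, so texts from the test set can be used too;
           # just hold out a random 10% for validation.
           .split_by_rand_pct(0.1)
           # The "label" of a language model is simply the next word of each text.
           .label_for_lm()
           .databunch(bs=bs))
```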
- Learner -> language_model_learner -> RNN
- Reduce dropout (drop_mult below 1) to avoid underfitting
- fit
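The learner and fit step, roughly as in the lesson-era fastai v1 API (newer v1 releases pass an architecture such as `AWD_LSTM` instead of `pretrained_model`):

```python
# Pretrained on Wikitext-103; drop_mult < 1 scales all dropout down to counter underfitting.
learn = language_model_learner(data_lm, pretrained_model=URLs.WT103, drop_mult=0.3)
learn.fit_one_cycle(1, 1e-2, moms=(0.8, 0.7))

# Then unfreeze, fine-tune the whole language model, and save the encoder for the classifier.
learn.unfreeze()
learn.fit_one_cycle(1, 1e-3, moms=(0.8, 0.7))
learn.save_encoder('fine_tuned_enc')
```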
- Layers
- Create a classifier (lots of good info here; sketch below)
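A sketch of the classifier stage under the same fastai v1 assumptions; the key point is reusing the language model's vocab and loading its fine-tuned encoder:

```python
# Classifier data: reuse the language model's vocab so the word ids line up.
data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)
             .split_by_folder(valid='test')
             .label_from_folder(classes=['neg', 'pos'])
             .databunch(bs=bs))

learn = text_classifier_learner(data_clas, drop_mult=0.5)
learn.load_encoder('fine_tuned_enc')   # bring in the fine-tuned language-model encoder
learn.fit_one_cycle(1, 2e-2, moms=(0.8, 0.7))
```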
- Examples of SOTA
- Next up: tabular data
- q: where does the learning rate come from, the magic number 2.6 (to the fourth)? -> discriminative learning rates (sketch below)
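Gradual unfreezing with discriminative learning rates, as a sketch: `slice(lo, hi)` spreads the learning rate across the layer groups from `lo` for the earliest layers up to `hi` for the head, and the lesson's rule of thumb sets `lo = hi / 2.6**4`:

```python
learn.freeze_to(-2)                    # unfreeze the last two layer groups
learn.fit_one_cycle(1, slice(1e-2 / (2.6**4), 1e-2), moms=(0.8, 0.7))

learn.freeze_to(-3)                    # then one more group
learn.fit_one_cycle(1, slice(5e-3 / (2.6**4), 5e-3), moms=(0.8, 0.7))

learn.unfreeze()                       # finally the whole model
learn.fit_one_cycle(2, slice(1e-3 / (2.6**4), 1e-3), moms=(0.8, 0.7))
```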
- Tabular data
- List of Tabular analysis use cases
- Pinterest: neural nets for tabular data were more accurate and needed less maintenance
- Until now there was no easy way to train neural nets on tabular data => now it is easy with fastai.tabular
- Example of how easy it is (taken from the documentation; sketch below)
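A sketch along the lines of the fastai.tabular documentation example on the Adult dataset (column names and the split indices are as I recall them from the docs, so treat them as assumptions):

```python
from fastai.tabular import *
import pandas as pd

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')

dep_var = 'salary'
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [FillMissing, Categorify, Normalize]   # preprocessing applied to the dataframe

data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
        .split_by_idx(list(range(800, 1000)))
        .label_from_df(cols=dep_var)
        .databunch())

learn = tabular_learner(data, layers=[200, 100], metrics=accuracy)
learn.fit_one_cycle(1, 1e-2)
```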
- q: what are the 10% of cases where you would not default to neural nets? => try both random forests and neural nets and use whatever works
- Looking at the Tabular example notebook
- q: how to combine tokenized data with metadata? => not up to that yet => same as having categorical and continuous variables => text into an RNN, images into a CNN, concatenate them later for end-to-end training
- q: will scikit-learn and gradient boosting become outdated in the future? it's hard to predict, we'll see
- 4 questions
- layers=[200,100]
- Collaborative filtering: who bought what, who liked what… e.g. a two-column table of data
- q: does fastai tabular work with big CSV data? => read with pandas in chunks, or use Dask, or Spark (sketch below)
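For the pandas-in-chunks option, a minimal sketch (file name, column, and filter step are hypothetical):

```python
import pandas as pd

parts = []
for chunk in pd.read_csv('big_file.csv', chunksize=1_000_000):
    # Do per-chunk filtering/feature work here so the whole file never sits in memory.
    parts.append(chunk[chunk['label'].notna()])
df = pd.concat(parts, ignore_index=True)
```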
- collaborative filtering, MovieLens dataset
- cold start problem
- q: text with emojis, or languages with other character sets?
- fastai has a model zoo with language models for more languages and domain areas
- q: time series on tabular data? RNN? => next week, but not an RNN: extract extra columns like weekend, holiday, was the store open, etc. (sketch below)
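One way to extract those extra columns is fastai v1's `add_datepart` helper (the toy dataframe here is hypothetical); it expands a date column into things like Year, Month, Week, Dayofweek, Is_month_end, so a plain tabular model can pick up weekly and seasonal effects without an RNN:

```python
from fastai.tabular import *
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2015-07-31', '2015-08-01']),
                   'Sales': [5263, 0]})
add_datepart(df, 'Date', drop=True)   # adds the derived date columns in place
print(df.columns)
```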
- q: source for cold start problem
- break, and the halfway point of the course; next will be digging deeper into theory and code…
Break
- back from break, using Excel to explore the data
- tip: Ctrl + arrow key jumps to the end of the block
- model building in excel using data that is a table of most active users and most watched movies
- matrix multiplication in Excel starting from two random-number matrices: 5 random numbers per user and 5 per movie… take the dot product for every cell in the table
- loss function in Excel using SUMXMY2 (sum of squared differences); see the sketch below
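The same calculation as the spreadsheet, sketched in numpy (the sizes and the "unrated = 0" convention are assumptions): predictions are dot products of the two random factor matrices, the loss is what SUMXMY2 computes, and Solver/gradient descent then tweaks the factors to shrink it:

```python
import numpy as np

np.random.seed(0)
n_users, n_movies, n_factors = 15, 15, 5

# Two random matrices: 5 numbers per user and 5 per movie.
user_factors = np.random.randn(n_users, n_factors)
movie_factors = np.random.randn(n_movies, n_factors)

# The prediction for every cell of the table is the dot product of the two vectors.
preds = user_factors @ movie_factors.T

# A pretend ratings grid; 0 stands for "not rated", like the blanks in the sheet.
ratings = np.random.randint(0, 6, size=(n_users, n_movies))
rated = ratings > 0

# What SUMXMY2 computes: the sum of squared differences, here over rated cells only.
loss = ((preds[rated] - ratings[rated]) ** 2).sum()
print(loss)
```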
- gradient descent in excel
- install add-ins (Solver)
- Data -> Solver
- get_collab_learner
- (might be edited away, font size change etc…)
- live stream offline
stream swap
- back, collaborative filtering notebook
- the function that was called: get_collab_learner => dig into source code => know your editor
- EmbeddingDotBias model
- PyTorch models: nn.Module
- define __init__ and forward; PyTorch takes care of the gradients for you
- An embedding is a matrix of weights
- add a single number per user and per movie, called bias terms, to account for movies that are popular and users that watch many or few movies => a very simple, well-working linear model
- tweak at the end: min & max score
- the sigmoid asymptotes to 0 and 5 in this case; not strictly necessary, but it makes things easier for our model: it can spend more of its weights predicting what we care about, which makes it easier to learn the right thing
- this, the world's most boring neural net, turns out to give close to SOTA performance; adding the sigmoid makes a big difference (sketch below)
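A minimal PyTorch sketch in the spirit of fastai's EmbeddingDotBias (class and argument names here are my own): two embeddings, two bias terms, a dot product, and a sigmoid scaled to the rating range:

```python
import torch
import torch.nn as nn

class DotBias(nn.Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5)):
        super().__init__()
        self.u_weight = nn.Embedding(n_users, n_factors)   # an embedding is just a matrix of weights
        self.m_weight = nn.Embedding(n_movies, n_factors)
        self.u_bias = nn.Embedding(n_users, 1)              # "this user rates everything high/low"
        self.m_bias = nn.Embedding(n_movies, 1)              # "this movie is popular/unpopular"
        self.y_range = y_range

    def forward(self, users, movies):
        dot = (self.u_weight(users) * self.m_weight(movies)).sum(dim=1)
        res = dot + self.u_bias(users).squeeze(1) + self.m_bias(movies).squeeze(1)
        lo, hi = self.y_range
        # Scaled sigmoid: squash into the rating range so the model needn't spend
        # capacity learning that scores live between 0 and 5.
        return torch.sigmoid(res) * (hi - lo) + lo

model = DotBias(n_users=100, n_movies=200, n_factors=40)
preds = model(torch.tensor([0, 1, 2]), torch.tensor([3, 7, 9]))   # predicted ratings
```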
- q: VIM setup? folding + jumping around (ctags in vim); VS Code is also good. Don't just look at the code on GitHub… local: VS Code, terminal: vim or emacs
- closing with: how we are going to build on top of collaborative filtering in the next lessons… the concepts we need to learn about
- what happens when you use a neural net for image recognition: looking at a single pixel, values 0-255 (actually normalized to a mean of 0 and standard deviation of 1)
- how much linear algebra do I need to know? matrix products: what do they do, and how do the dimensions match up
- activation function such as ReLU (=> a vector of the same size)
- the size of the second-to-last weight matrix (-10 sec)
- terminology: parameters in pytorch, or weights (less accurate, because there can be biases as well)
- activations are results of calculations (-10sec)
- everything that does a calculation is called a layer and results in a set of activations
- special layer at the start: the input layer; at the end is a set of activations called the outputs; the outputs of a net are really just the activations of the last layer
- in collaborative filtering we added an extra activation function at the end: a sigmoid scaled between 0 and 5; there is often an activation function at the end
- inputs, weights, activations, activation functions (-20 sec); see the sketch below
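A tiny worked example of that vocabulary in code (sizes are arbitrary): parameters/weights, a matrix product plus ReLU producing activations, and outputs that are just the last layer's activations:

```python
import torch

x = torch.randn(1, 784)                           # input layer, e.g. a flattened 28x28 image
w1, b1 = torch.randn(784, 50), torch.zeros(50)    # parameters: weights and biases
w2, b2 = torch.randn(50, 10), torch.zeros(10)

a1 = torch.relu(x @ w1 + b1)    # layer: matrix product + ReLU -> 50 activations
out = a1 @ w2 + b2              # outputs = the activations of the last layer
print(a1.shape, out.shape)      # (1,784) @ (784,50) -> (1,50); (1,50) @ (50,10) -> (1,10)
```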
- loss function
- next week: loss function (cross entropy), activation function (softmax), plus fine-tuning > unfreezing, transfer learning
- q: how can we apply language models to other tasks
- q: how do you decide how many layers, and how many activations in each layer
- coming back to that
- thanks
- end of lecture