(This is a wiki - feel free to edit.)
<<< Wiki: Lesson 10 | Wiki: Lesson 12 >>>
Lesson resources
- Lesson video
- The ImageNet localization challenge has all the classification data too
- French/English Training Data - 2.4 GB
- Lesson notes from @hiromi
Lesson papers
- Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
- Neural Machine Translation by Jointly Learning to Align and Translate - original paper introducing the attention approach covered in class
- DeViSE: A Deep Visual-Semantic Embedding Model
- Grammar as a Foreign Language - contains a concise summary of the attention mechanism
- Papers and SoTA results
Timeline (incomplete)
- (0:00:00) 1 cycle policy blog
- (0:03:58) Google demo of seq to seq
- (0:05:40) seq to seq models - machine translation
- (0:07:20) Why are we learning about seq to seq?
- (0:08:40) Four big wins of neural machine translation
- (0:09:20) Bidirectional GRU with attention
- (0:09:55) Introducing the problem
- (0:13:15) Think about language modeling vs neural translation
- (0:13:35) Neural translation with a seq2seq model
- (0:14:22) concat pooling
- (0:18:00) Seq2seq as a general-purpose approach
- (0:18:20) Prerequisite: Lesson 6
- (0:19:50) Char Loop concat model
- (0:21:40) Stacking one RNN on top of another
- (0:22:46) Translation start
- (0:23:25) Translating French questions to English questions instead of general language translation
- (0:42:40) Separate training and validation sets
- (0:43:30) Creating DataLoaders, plus a Sampler trick that batches sentences of similar length together (see the sampler sketch after the timeline)
- (0:47:06) First encoder-decoder architecture. Uses a GRU RNN.
- (0:50:28) A PyTorch module has a weight attribute; the weight is a Variable that has a data attribute, and the data attribute is a tensor (see the sketch after the timeline)
- (0:54:28) Question: Since we only keep embeddings for words seen in training, why don't we keep embeddings for all words in case new words appear in the test set?
- (0:55:35) Using a vocabulary larger than 40,000 words
- (1:00:50) Explaining the decoder architecture
- (1:11:00) Results of the first architecture
- (1:13:09) PAUSE
- (1:14:00) Question about regularization techniques on seq2seq models and the AWD-LSTM architecture
- (1:16:40) Bidirectional LSTM architecture
- (1:21:00) Question: Why do you have to have an end to the loop?
- (1:22:39) Teacher forcing architecture
- (1:31:03) Attentional model
- (1:40:11) Second explanation of attention in an RNN
- (1:55:51) DeViSE
- (2:11:48) nmslib: super-fast library for finding nearest neighbors in high-dimensional spaces (see the usage sketch after the timeline)
- (2:13:03) Searching WordNet noun classes on ImageNet
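The sampler trick at 0:43:30 sorts examples by length so each batch needs very little padding. Below is a minimal sketch of the idea (fastai 0.7 ships a SortSampler and a randomized SortishSampler that do this properly); the dataset, key function, and DataLoader wiring in the usage comment are hypothetical:

```python
from torch.utils.data import Sampler

class SortSampler(Sampler):
    """Yield dataset indices ordered by sequence length (longest first),
    so each batch contains similarly sized sentences and padding is minimal."""
    def __init__(self, data_source, key):
        self.data_source, self.key = data_source, key  # key(i) -> length of example i

    def __len__(self):
        return len(self.data_source)

    def __iter__(self):
        return iter(sorted(range(len(self.data_source)), key=self.key, reverse=True))

# Hypothetical usage: trn_ds holds (french_ids, english_ids) pairs, pad_collate pads each batch
# sampler = SortSampler(trn_ds, key=lambda i: len(trn_ds[i][0]))
# dl = DataLoader(trn_ds, batch_size=64, sampler=sampler, collate_fn=pad_collate)
```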
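A quick illustration of the attribute chain mentioned at 0:50:28 (module → weight → data). On the PyTorch 0.3 used in the course, weight is a Variable; on 0.4+ it is an nn.Parameter, which is a Tensor subclass:

```python
import torch.nn as nn

emb = nn.Embedding(5, 3)          # any module with learnable weights
print(type(emb.weight))           # nn.Parameter (a Variable in pre-0.4 PyTorch)
print(type(emb.weight.data))      # the raw torch.Tensor holding the numbers
emb.weight.data.normal_(0, 0.01)  # mutate the underlying tensor in place, e.g. to re-initialize
```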
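A short usage sketch for the nmslib mention at 2:11:48; the random data and the hnsw/cosinesimil/k=10 parameter choices are placeholders, not the course's exact settings:

```python
import numpy as np
import nmslib

vecs = np.random.randn(1000, 300).astype(np.float32)    # e.g. image features or word vectors

index = nmslib.init(method='hnsw', space='cosinesimil')  # approximate cosine-similarity index
index.addDataPointBatch(vecs)
index.createIndex({'post': 2}, print_progress=True)

idxs, dists = index.knnQuery(vecs[0], k=10)              # 10 approximate nearest neighbours of one query
```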
Other resources
Helpful stuff
- Stephen Merity’s talk on Attention and Memory in Deep Learning Networks
- Precision vs Performance in approximate nearest neighbours
- Benchmarking nearest neighbors
- Colin Raffel’s talk about the Attention Mechanism
- Blog post explaining DeViSE; model deployed on AWS to play with; resources that show how to Dockerize and deploy a PyTorch model
Additional papers
- Understanding BLEU Score
- Poincaré Embeddings for Learning Hierarchical Representations
- Unsupervised Machine Translation Using Monolingual Corpora Only
- Unsupervised Neural Machine Translation
Other libraries
- Tensor2Tensor - a Google DL mini-library with many datasets and tutorials for various seq2seq tasks
Useful code for wrapping a PyTorch nn.Module in a fastai Learner
# Build the seq2seq module: word vectors (fr_vecd/en_vecd), vocabularies (*_itos), embedding sizes, hidden size nh, max output length enlen_90
rnn = Seq2SeqRNN(fr_vecd, fr_itos, dim_fr_vec, en_vecd, en_itos, dim_en_vec, nh, enlen_90)
# Move it to the GPU, wrap it as a fastai model, and tie it to the ModelData object md to get a Learner
learn = RNN_Learner(md, SingleModel(to_gpu(rnn)), opt_fn=opt_fn)
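From here the usual fastai 0.7 Learner workflow applies, as in the lesson's translate notebook: assign a sequence-to-sequence loss to learn.crit, then call learn.lr_find() and learn.fit().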