(This is a wiki - feel free to edit.)
<<< Wiki: Lesson 10 | Wiki: Lesson 12 >>>
Lesson resources
- Lesson video
- The ImageNet localization challenge has all the classification data too
- French/English Training Data - 2.4 GB
- Lesson notes from @hiromi
Lesson papers
- Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
- Neural Machine Translation by Jointly Learning to Align and Translate - original paper introducing the attention approach covered in class
- DeViSE: A Deep Visual-Semantic Embedding Model
- Grammar as a Foreign Language - contains a concise summary of the attention mechanism
- Papers and SoTA results
Timeline (incomplete)
- (0:00:00) 1 cycle policy blog
- (0:03:58) Google demo of seq to seq
- (0:05:40) seq to seq models - machine translation
- (0:07:20) Why are we learning about seq to seq?
- (0:08:40) Four big wins of neural machine translation
- (0:09:20) Bidirectional GRU with attention
- (0:09:55) Introducing the problem
- (0:13:15) Think about language modeling vs neural translation
- (0:13:35) Neural translation with a seq2seq model
- (0:14:22) concat pooling
- (0:18:00) Seq2seq as a general-purpose approach
- (0:18:20) Prerequisite: Lesson 6
- (0:19:50) Char Loop concat model
- (0:21:40) Stacking one RNN on top of another
- (0:22:46) Translation start
- (0:23:25) Translating French questions to English questions instead of general language translation
- (0:42:40) Separate training and validation sets
- (0:43:30) Creating DataLoaders, plus a Sampler trick that batches sentences of similar length together (see the sampler sketch after the timeline)
- (0:47:06) First encoder-decoder architecture. Uses a GRU RNN.
- (0:50:28) A PyTorch module has a weight attribute; the weight is a Variable that has a data attribute, and the data attribute is a tensor (see the sketch after the timeline)
- (0:54:28) Question: Since we only keep embeddings for words seen in training, why don't we keep embeddings for all words in case new words appear in the test set?
- (0:55:35) Using a vocabulary larger than 40,000 words
- (1:00:50) Explaining the decoder architecture
- (1:11:00) Results of the first architecture
- (1:13:09) PAUSE
- (1:14:00) Question about regularization techniques on seq2seq models and the AWD-LSTM architecture
- (1:16:40) Bidirectional LSTM architecture
- (1:21:00) Question: Why do you have to have an end to the loop?
- (1:22:39) Teacher forcing architecture
- (1:31:03) Attentional model
- (1:40:11) Second explanation of attention in an RNN
- (1:55:51) DeViSE
- (2:11:48) nmslib: super-fast library for finding nearest neighbors in high-dimensional spaces (see the usage sketch after the timeline)
- (2:13:03) Searching WordNet noun classes on ImageNet
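The sampler trick at 0:43:30 sorts examples by length so each batch needs very little padding. Below is a minimal sketch of the idea (fastai 0.7 ships a SortSampler and a randomized SortishSampler that do this properly); the dataset, key function, and DataLoader wiring in the usage comment are hypothetical:

```python
from torch.utils.data import Sampler

class SortSampler(Sampler):
    """Yield dataset indices ordered by sequence length (longest first),
    so each batch contains similarly sized sentences and padding is minimal."""
    def __init__(self, data_source, key):
        self.data_source, self.key = data_source, key  # key(i) -> length of example i

    def __len__(self):
        return len(self.data_source)

    def __iter__(self):
        return iter(sorted(range(len(self.data_source)), key=self.key, reverse=True))

# Hypothetical usage: trn_ds holds (french_ids, english_ids) pairs, pad_collate pads each batch
# sampler = SortSampler(trn_ds, key=lambda i: len(trn_ds[i][0]))
# dl = DataLoader(trn_ds, batch_size=64, sampler=sampler, collate_fn=pad_collate)
```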
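A quick illustration of the attribute chain mentioned at 0:50:28 (module → weight → data). On the PyTorch 0.3 used in the course, weight is a Variable; on 0.4+ it is an nn.Parameter, which is a Tensor subclass:

```python
import torch.nn as nn

emb = nn.Embedding(5, 3)          # any module with learnable weights
print(type(emb.weight))           # nn.Parameter (a Variable in pre-0.4 PyTorch)
print(type(emb.weight.data))      # the raw torch.Tensor holding the numbers
emb.weight.data.normal_(0, 0.01)  # mutate the underlying tensor in place, e.g. to re-initialize
```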
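A short usage sketch for the nmslib mention at 2:11:48; the random data and the hnsw/cosinesimil/k=10 parameter choices are placeholders, not the course's exact settings:

```python
import numpy as np
import nmslib

vecs = np.random.randn(1000, 300).astype(np.float32)    # e.g. image features or word vectors

index = nmslib.init(method='hnsw', space='cosinesimil')  # approximate cosine-similarity index
index.addDataPointBatch(vecs)
index.createIndex({'post': 2}, print_progress=True)

idxs, dists = index.knnQuery(vecs[0], k=10)              # 10 approximate nearest neighbours of one query
```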
Other resources
Helpful stuff
- Stephen Merity’s talk on Attention and Memory in Deep Learning Networks
- Precision vs Performance in approximate nearest neighbours
- Benchmarking nearest neighbors
- Colin Raffel’s talk about the Attention Mechanism
- Blog post explaining DeViSE; model deployed on AWS to play with; resources that show how to Dockerize and deploy a PyTorch model
Additional papers
- Understanding BLEU Score
- Poincaré Embeddings for Learning Hierarchical Representations
- Unsupervised Machine Translation Using Monolingual Corpora Only
- Unsupervised Neural Machine Translation
Other libraries
- Tensor2Tensor - a Google DL mini-library with many datasets and tutorials for various seq2seq tasks
Useful code for wrapping a PyTorch nn.Module in a fastai Learner
# Build the seq2seq module: word vectors (fr_vecd/en_vecd), vocabularies (*_itos), embedding sizes, hidden size nh, max output length enlen_90
rnn = Seq2SeqRNN(fr_vecd, fr_itos, dim_fr_vec, en_vecd, en_itos, dim_en_vec, nh, enlen_90)
# Move it to the GPU, wrap it as a fastai model, and tie it to the ModelData object md to get a Learner
learn = RNN_Learner(md, SingleModel(to_gpu(rnn)), opt_fn=opt_fn)
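From here the usual fastai 0.7 Learner workflow applies, as in the lesson's translate notebook: assign a sequence-to-sequence loss to learn.crit, then call learn.lr_find() and learn.fit().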