Does the bi-directional RNN concat 2 RNNs, or does it “stack” them on top of each other?
Why can’t we have return_sequences=True for the second bidirectional LSTM?
Are we doing this because it’s not feasible to pass the entire stack of embeddings into a CNN?
I wouldn’t think a 16x120 dimensional input vector would be too large.
I guess this stems from an underlying question I have regarding why we don’t just treat text problems the same way we do images. Images have relationships between pixels and shapes that are complex and rely on positional information. Why doesn’t that work with word or phoneme embeddings?
Can you please repeat why we need a bidirectional LSTM?
@janardhanp22 It’s bidirectional because you want to know the phonemes that came both before and after a given phoneme.
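To the concat-vs-stack question above: Keras’s `Bidirectional` wrapper (in its default “concat” merge mode) runs two independent RNNs, one over the reversed sequence, and concatenates their outputs per timestep, so the feature dimension doubles. A minimal NumPy sketch of that idea (not the course code — a toy tanh RNN standing in for an LSTM):

```python
import numpy as np

def simple_rnn(xs, W_x, W_h):
    # plain tanh RNN over a (timesteps, d_in) sequence; returns all hidden states
    h = np.zeros(W_h.shape[0])
    out = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
        out.append(h)
    return np.stack(out)

rng = np.random.default_rng(0)
T, d_in, d_hid = 5, 8, 16
xs = rng.normal(size=(T, d_in))

# two independent RNNs: one reads left-to-right, the other right-to-left
fwd = simple_rnn(xs, rng.normal(size=(d_hid, d_in)), rng.normal(size=(d_hid, d_hid)))
bwd = simple_rnn(xs[::-1], rng.normal(size=(d_hid, d_in)), rng.normal(size=(d_hid, d_hid)))[::-1]

# "bidirectional" = concatenate per timestep, so the feature dim doubles
bi = np.concatenate([fwd, bwd], axis=-1)
print(bi.shape)  # (5, 32)
```

So each timestep now carries context from both directions: the forward states summarize everything before it, the backward states everything after.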
When return_sequences=True, does the output have to be exactly 1 time step ahead? Can the RNN do n steps ahead instead?
Also, what happens exactly when you have two RNNs stacked on top of each other with return_sequences=True? Do you get nested sequences?
For translation problems, how does the network know when to stop?
Say my input is
PAD PAD hi how are you
is the output (in another language)
PAD PAD hi how are you PAD
Does the neural network have to learn when to stop?
@harveyslash Yes, typically an end-of-sentence (EOS) token is added to the training set.
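At inference time the decoder then generates one token at a time and stops as soon as it emits EOS, with a hard cap at the maximum output length. A toy greedy-decoding loop showing the idea (`predict_next` here is a hypothetical stand-in for the trained decoder):

```python
# Toy greedy decoding loop, illustrating how an EOS token ends generation;
# a badly trained model that never emits EOS would run all the way to MAX_LEN.
EOS = "<eos>"
MAX_LEN = 10

def predict_next(prefix):
    # stand-in for the trained decoder: here it emits EOS after three tokens
    return EOS if len(prefix) >= 3 else f"tok{len(prefix)}"

def greedy_decode():
    out = []
    for _ in range(MAX_LEN):          # hard cap guards against runaway output
        tok = predict_next(out)
        if tok == EOS:                # stop as soon as the model says "done"
            break
        out.append(tok)
    return out

print(greedy_decode())  # ['tok0', 'tok1', 'tok2']
```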
So it’s possible for a badly trained network to just keep generating and fill the entire max-length output?
Is it possible within this dataset to separate out proper nouns from non-proper nouns, and check accuracy on each group? Since it seems like proper nouns (like McConville, or Missoula) may have more non-standard spellings if they are basically relics of former spellings of things.
Q: Do you think trying beam search instead of replicating the last layer might give better results?
@harveyslash yes, that could happen
@cody That’s a reasonable hypothesis. You could probably use a dictionary dataset to label nouns as proper/non-proper in the original training set.
Won’t the weightings be heavily impacted by the padding done to the input set?
Is the alignment function “a” shared among all i-j pairs, or do we train a separate alignment for each pair?
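For reference, in Bahdanau-style additive attention a single alignment MLP is shared: the same parameters score every (decoder step i, encoder step j) pair. A minimal NumPy sketch of that sharing (illustrative dimensions, not the lesson’s code):

```python
import numpy as np

# Bahdanau-style additive attention: one alignment model (W_s, W_h, v)
# scores every (decoder step i, encoder step j) pair -- the weights are
# shared, not trained separately per pair.
rng = np.random.default_rng(2)
d = 4
W_s = rng.normal(size=(d, d))   # projects the decoder state
W_h = rng.normal(size=(d, d))   # projects each encoder state
v = rng.normal(size=d)

def align(s_i, h_j):
    # same parameters used for every i-j pair
    return v @ np.tanh(W_s @ s_i + W_h @ h_j)

enc = rng.normal(size=(6, d))     # encoder states h_1..h_6
s = rng.normal(size=d)            # one decoder state s_i
scores = np.array([align(s, h) for h in enc])
weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention weights
print(weights.shape)  # (6,) -- one weight per encoder position, summing to 1
```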
That was our friends from Kaiser - thanks for the offer; I’ll definitely take you up on that as more progress is made.
Oops - I forgot to discuss this! Will do next week.
We’ll discuss beam search next week.
Note: the complete collection of Part 2 video timelines is available in a single thread for keyword search.
Part 2: complete collection of video timelines
Here’s the Lesson 12 video timeline, probably the most theoretical lesson so far
Lesson 12 video timeline:
00:00:05 K-means clustering in TensorFlow
00:06:00 ‘find_initial_centroids’, a simple heuristic
00:12:30 A trick to make TensorFlow feel more like Pytorch,
& other tips around Broadcasting, GPU tensors and co.
00:24:30 Student’s question about “figuring out the number of clusters”
00:26:00 “Step 1 was to copy our initial_centroids into our GPU”,
“Step 2 is to assign every point to a cluster”
00:29:30 ‘dynamic_partition’, one of the crazy GPU functions in TensorFlow
00:37:45 Digression: “Jeremy, if you were to start a company today, what would it be?”
00:40:00 Intro to next step: NLP and translation deep-dive, with CMU pronouncing dictionary
00:55:15 Create spelling_bee_RNN model with Keras
01:17:30 Question: “Why not treat text problems the same way we do images?”
01:26:00 Graph for Attentional Model on Neural Translation
01:32:00 Attention Models (cont.)
01:37:20 Neural Machine Translation (research paper)
01:44:00 Grammar as a Foreign Language (research paper)