Lesson 6 discussion

(Chase ) #41


I’m struggling to wrap my head around the stateful LSTM. I’ve been running some experiments with time series data (think sine wave) and this is what I’m seeing.

Experiment A: Input data is (N_observations, 1 time step, M_features)
Experiment B: Input data is (N_observations, 8 time steps, M_features)
stateful=True, shuffle=False

The results show Model B outperforming Model A by a comfortable margin. But if the hidden state is stateful, why does it matter how many time steps I show the model per observation? In Experiment A, shouldn't the most recent hidden state capture the information from the previous 8 time steps (or whatever it has learned to be important)?
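For concreteness, the two input layouts being compared can be built like this (the make_windows helper and the sine series are illustrative, not from the post):

```python
import numpy as np

def make_windows(series, n_steps):
    """Slice a 1-D series into (N_observations, n_steps, 1) windows,
    with the value following each window as the target."""
    xs = np.array([series[i:i + n_steps] for i in range(len(series) - n_steps)])
    ys = series[n_steps:]
    return xs[..., None], ys          # trailing axis = single feature

series = np.sin(np.linspace(0, 20 * np.pi, 1000))

# Experiment A: one time step per observation
xa, ya = make_windows(series, 1)      # xa.shape == (999, 1, 1)

# Experiment B: eight time steps per observation
xb, yb = make_windows(series, 8)      # xb.shape == (992, 8, 1)
```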

(Jeremy Howard) #42

The stateful RNN keeps the hidden state across batches, but doesn’t backprop between batches (truncated BPTT).
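A toy NumPy sketch of what that means for the forward pass (illustrative code, not from the course; shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
Wx = rng.normal(size=(3, 5))          # input-to-hidden weights
Wh = rng.normal(size=(5, 5))          # hidden-to-hidden weights

def run_batch(x_batch, h):
    """Forward pass of a toy RNN cell over one batch of time steps."""
    for x in x_batch:
        h = np.tanh(x @ Wx + h @ Wh)
    return h

batch1 = rng.normal(size=(8, 3))
batch2 = rng.normal(size=(8, 3))

h = np.zeros(5)                       # initial hidden state
h = run_batch(batch1, h)              # stateful=True: this state...
h = run_batch(batch2, h)              # ...seeds the next batch
# Training, however, backpropagates only within each run_batch call:
# the weight update for batch2 treats its incoming h as a constant.
```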

(David Kagan) #43

Thanks for that! I believe @jeremy forgot to add np.squeeze.
model.fit(np.squeeze(np.stack(xs,1)), y, batch_size=64, nb_epoch=8)
This removes the extra unnecessary dimension.

The result will be the same as with using np.concatenate.
model.fit(np.concatenate(xs,axis=1), y, batch_size=64, nb_epoch=8)
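To see why the two calls are equivalent, here is a small NumPy check (the array shapes are illustrative):

```python
import numpy as np

# Suppose xs is a list of 8 input arrays, each of shape (batch, 1),
# as in the lesson. Shapes here are just for illustration.
xs = [np.arange(4).reshape(4, 1) + i for i in range(8)]

a = np.squeeze(np.stack(xs, 1))       # stack -> (4, 8, 1), squeeze -> (4, 8)
b = np.concatenate(xs, axis=1)        # directly (4, 8)

assert (a == b).all()
```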

(Niyas Mohammed) #44

You have 2 separate issues:

  1. Your Notebook freezes when you run training
  2. Your model is always predicting a space, as described by @genkiro

The first issue is addressed here:

I run into this on some machines. What I do is just wait for a while (a little longer than whatever the ETA said when it froze) and then open the same notebook in a new tab. Then, instead of re-running the same cell, I run the next cell. It turns out that it's not the kernel but the webpage that dies; the kernel keeps training in the background.

Regarding the second issue, no one has given us a reasonable explanation here yet, but clearly some people (including me) are running into this. I see that this is just for the 3-char model. My subsequent models (Section: Our first RNN!!) are predicting well like Jeremy’s model.

I replicated the issue with the complete collection of Sherlock Holmes here: https://github.com/niazangels/courses/blob/lesson-6/deeplearning1/my-nbs/lesson-6.ipynb

My wild guess is that this is because our 3-char model does not have context. Our text contains a lot of spaces, so perhaps the model learns that predicting a space is generally a good idea. I could be completely off, though.


That sounds reasonable, thanks for your solution. Perhaps RNN models often run into the first issue during training.

(Niyas Mohammed) #47

I did some more digging today… I tried printing the top 10 predictions instead of just one. It turns out the model is overconfident in predicting a space, and space is almost always the top prediction. The order of the next-best predictions seems to change depending on the seed input.
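Printing the top-k predictions can be done with a small helper like this (the top_k function and the toy distribution are my illustration, not code from the notebook):

```python
import numpy as np

def top_k(probs, chars, k=10):
    """Return the k most probable characters with their probabilities."""
    idx = np.argsort(probs)[::-1][:k]
    return [(chars[i], probs[i]) for i in idx]

# Illustrative softmax output over a tiny vocabulary (not real model output):
# space dominates, mirroring the overconfidence described above.
chars = [' ', 'e', 't', 'a', 'o']
probs = np.array([0.60, 0.15, 0.12, 0.08, 0.05])

print(top_k(probs, chars, k=3))
```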

I also tried to predict a running string of 500 characters based on a sliding window of 3 characters. This also gave me 500 spaces. Here is my notebook

@jeremy @rachel could you give us any pointers? I think we spent well over half an hour on this problem :slight_smile:

(Jeremy Howard) #48

It looks like the model hasn’t really learnt anything other than the relative frequency of chars. Try different optimizers, architectures, etc to see if you can get it to learn something about the ordering of chars as well. And check your input data to ensure that it is in the format you’re expecting.


Hello everyone,

I finished the course but I didn’t have much time for experiments. Now I’m on them, basically making sure that I understand the whole process of CNNs for different datasets, and also Collaborative Filtering and RNN.

Now I’m with RNNs, and I tried to replicate the text generation example with my own text/dataset. Whenever I input a sequence, the next char is predicted fine, as I would expect. With this in mind, I extended the next-char prediction function to generate text: using the function from the course, I repeatedly removed the first char of the seed string and appended the newly predicted char to its end, iterating for 1000 steps to generate 1000 new chars of text.
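The generation loop described here can be sketched as follows (toy_predict is a stand-in for the real model call; none of this is code from the post). One thing worth noting: deterministic argmax decoding always produces the same output char for the same window, which is a common mechanism behind repeating output; sampling from the softmax distribution instead usually breaks such loops.

```python
def generate(seed, n_chars, predict_next):
    """Repeatedly predict from the last len(seed) chars and append,
    as described in the post."""
    out = seed
    for _ in range(n_chars):
        out += predict_next(out[-len(seed):])
    return out

# Stand-in predictor; the real one would embed the window, call
# model.predict, and take np.argmax (or sample) over the softmax output.
def toy_predict(window):
    return window[0].swapcase()

print(generate("abc", 5, toy_predict))    # -> abcABCab
```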

The problem I have is that the generated text tends to loop indefinitely once it gets stuck in a sequence of chars (which actually happens very soon after the seed ends). For instance, if I give the model the seed “My dog was”, the model is very likely to generate text like this: “My dog was barking at the house was barking at the house was barking at the house…” until the generation loop ends.

So I thought it could be a problem of training sequence length, since the model used only the previous 8 chars to predict the 9th, which seems like a short memory for this kind of application. However, I tried sequence lengths of 20 and 50 and got the same problem as described above.

Therefore, I’m starting to think I’m missing something, but I can’t figure it out since I’m not an expert at all… I was tempted to try other code from the internet, but I want to understand why this is happening and what I am missing.

So I would really appreciate it if some of you could share experiences with similar problems, or suggest possible causes or even solutions. Thanks in advance!!

(Niyas Mohammed) #50

@justinho @genkiro @kelin-christi @idano @ozym4nd145

Thanks, Jeremy- this worked! I changed the activation of the dense_hidden layers to relu. The model also converged faster. The loss at the end of first 4 epochs was 2.8690.

Just to make sure, I printed out the top 10 predictions for the seed:

Full section here: https://gist.github.com/niazangels/372248eb0a5aa8163bffe200f76f67a5


great job!

(Teemu Kurppa) #52

I bumped into the same problem of the 3-char model always predicting a space ' ', both when I rewrote the notebook for Python 3 and when running the original notebook by @jeremy on the AWS instance.

Turns out it is this line that messes up the learning:


Skip it and just use the default learning rate; even after just 4 epochs, the model is able to predict:

'phi' -> 'l'
' th' -> 't'
' an' -> 'd'

(Corbin Albert) #53

I am sure that you have figured this out within the last 6 months :smile:, but for the sake of anyone else wondering this same thing, Jeremy was switching between 2 different notebooks-- char-rnn and char-dnn. All of the code in char-dnn can be found in the Lesson 6 Jupyter Notebook.

(Corbin Albert) #54

When I was looking at the notebook, the jump to returning sequences was causing a bit of confusion. As teaching and explaining is a sure way to learning, I thought I would post an explanation here by going step by step of what the model is doing.

So we’ve got two places where information enters the model. The bottom of Jeremy’s graph needs to be initialized as zeros. This array of zeros has 75,110 rows of 42 columns. The number of rows corresponds to the number of full sets of 8 predictions we will be making, so it is the same as for each of the other 8 inputs. If this doesn’t make sense, hopefully it will soon.

But let’s just go on a character-by-character journey. First, let’s take one single row of the zeros and pass it through a dense_in layer. You will remember that this will use glorot initialization and activate with relu, outputting a vector of 256 hidden nodes. Because we initialized with zeros, these are still all going to be zero, go figure. So we’ve essentially changed a vector of 42 zeros into a vector of 256 zeros. Nice? Nice.

So now we’ve got some sort of hidden state of zeros. Entirely separate from this action, we pass our first character into an Input layer which then goes into an embedding layer where this character is now represented by 42 weights. These weights are initialized uniformly.

In this case, because we are using a mini-batch size of one, and Embedding layers return a 3D tensor of size (batch_size, sequence_length, output_dim), it will be of size (1, 1, 42). This embedding layer is then flattened, which has no real consequence other than to reduce the needless dimensions that were output and allow us to pass it to a dense_in layer. Again, this will take the 42 uniformly initialized latent factors and matrix multiply by glorot initialized weights into a 256 node vector.

Now, remember that zeros hidden state that we initialized? Just for shits and giggles, we’re going to pass that through another dense layer, this time a dense_hidden, which will use an identity matrix for its initialized weights and output the same vector of 256 zeros. Why? Because it’s in the code. And Hinton says so.

This new vector of 256 zeros (which is the same as the old vector of 256 zeros) will then be merged with the aforementioned character vector of 256 values (recall: created by multiplying the 42 uniformly initialized embedding weights by the glorot-initialized weights of the dense_in layer before going through relu) by adding the two vectors together. I think it is helpful to think of this as a completed hidden layer for the first character. This layer will be the layer which continually gains “state” and will apply to all subsequent characters.

So, this hidden state is going to be preserved for the next character, but before we get to that, we need to estimate the first output, which is guessing the second character off of the context of the first. To achieve this, we pass it through a dense_out layer, with glorot initialized weights, and use a softmax function to estimate the next character. This Softmax dense_out layer ONLY BELONGS TO THE SECOND CHARACTER GUESS. All associated weights will only affect the first guess.

It can then backpropagate to try to reduce its error. So now these initialized weights will have been changed a bit in, hopefully, the right direction. The backprop that occurred for the first character’s layers will not have had any effect on the subsequent characters’ layers, except through some of the context that is building up in the recurrent hidden layer.

Alright, now character two is going to go through its own embedding layer, again with its own initialization, which is then passed to its own glorot initialized, relu activated dense_in layer just like char one was.

The PREVIOUSLY MERGED hidden layer from the first character (the one that started as zeros) will go through its own dense_hidden layer with relu activation and identity initialization. It then merges again with the output of the second character’s embedding after it has gone through its dense_in.

A second output layer, unique to predicting the 3rd character given the first two, with glorot initialized weights, will use softmax again to predict the 3rd character. Backprop ensues.

The third character comes in, goes through dense_in, the previously merged hidden goes through another hidden dense layer with identity initialization. Why? Who knows. These two merge again. Another unique output layer predicts the 4th character, and backpropagates.

So on, so forth.

Now, when the characters go through again, they will go through the exact same process, but have had lots of slightly tuned parameters from each pass through and continue learning in this way. Bear in mind that the hidden state will be reinitialized with zeros. So the state that has built up from the characters being input will be lost, but the weights in the embeddings and other layers will have moved.

And that, I sure as hell hope, is how the “returning sequences” section works.
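The character-by-character walkthrough above can be compressed into a plain-NumPy forward pass. This is a sketch under assumptions: the sizes 42/256/8 come from the lesson, I use one shared set of weights per layer type for brevity (the post describes per-position output layers), and the initializers are simplified.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, n_hidden, seq_len = 42, 256, 8               # sizes from the lesson

W_in     = rng.normal(0, 0.05, (vocab, n_hidden))   # dense_in (glorot-ish)
W_hidden = np.eye(n_hidden)                         # dense_hidden (identity init)
W_out    = rng.normal(0, 0.05, (n_hidden, vocab))   # dense_out

def relu(x):
    return np.maximum(x, 0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def forward(embedded_chars):
    """embedded_chars: seq_len vectors of length 42 (one embedded char each).
    Returns one softmax prediction per time step (returning sequences)."""
    h = np.zeros(n_hidden)                    # zeros hidden state, as above
    preds = []
    for c in embedded_chars:
        # merge (add) the char path and the hidden path, as described above
        h = relu(c @ W_in) + relu(h @ W_hidden)
        preds.append(softmax(h @ W_out))      # predict the *next* char
    return preds

preds = forward(rng.normal(size=(seq_len, vocab)))
```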

(Corbin Albert) #55

So I am trying to build the “Theano Only” RNN, but it isn’t learning–the error stays pretty much the same the entire time. Hoping for a bit of assistance.


Hi @yashkatariya, @rteja1113 - I faced a similar issue to the one Yashkatariya mentioned. I tried the code snippet you posted, and that helped me fix it. But I still face a different issue: when I try to use the MixIterator object in a model fit, I get the error below.

new_model.fit_generator(mi, mi.N, epochs=8, validation_data=(val, val_labels))

ValueError: output of generator should be a tuple (x, y, sample_weight) or (x, y). Found: None

Can you pls let me know if you faced any similar issue? Thanks and appreciate your help.

(Yash Katariya) #57

The data the generator is getting is None. Use Lambda callbacks (e.g. Keras LambdaCallback) to debug.

(Ravi Teja Gutta) #58

Hi @MLNewbie, can you share some of your code ?


Hi @rteja1113 - Here is the code snippet and error information. Pls let me know if you need more details.

Code: (Just trying to mix val data into training data, since I am getting low validation accuracy)
mi_trn_batches = gen.flow(trn, trn_labels, batch_size=batch_size*0.75)
mi_val_batches = gen.flow(val, val_labels, batch_size=batch_size*0.25)
mi = MixIterator([mi_trn_batches, mi_val_batches])
new_model.fit_generator(mi, mi.N, epochs=8, validation_data=(val, val_labels))


TypeError: ‘MixIterator’ object is not an iterator
ValueError: output of generator should be a tuple (x, y, sample_weight) or (x, y). Found: None

Can you pls guide me on how to give MixIterator object as input to fit_generator?

Pls let me know if you need more details. Appreciate your help…
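The “‘MixIterator’ object is not an iterator” TypeError usually means the class defines next() (the Python 2 protocol) but not __next__() (the Python 3 protocol), which fit_generator relies on. Here is a hedged sketch of a MixIterator that satisfies both; this is my simplified version, not the exact class from the course:

```python
import numpy as np

class MixIterator:
    """Draws one batch from each wrapped iterator per step and concatenates
    them, so training and validation samples get mixed into every batch.
    Simplified sketch, not the course's exact class."""
    def __init__(self, iters):
        self.iters = iters

    def __iter__(self):
        return self

    def __next__(self):                      # Python 3 iterator protocol
        xs, ys = zip(*[next(it) for it in self.iters])
        return np.concatenate(xs), np.concatenate(ys)

    next = __next__                          # Python 2 compatibility

# Toy infinite batch generators standing in for gen.flow(...)
def toy(batch_size, label):
    while True:
        yield np.full((batch_size, 2), label), np.full(batch_size, label)

mi = MixIterator([toy(3, 0), toy(2, 1)])
x, y = next(mi)                              # x.shape == (5, 2)
```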

(Ravi Teja Gutta) #60

Hi @MLNewbie, can you show code for MixIterator ?


Hi @rteja1113, I used the code that you had shared with yashkatariya. There was an error showing ‘.N’ as not available, and one of the other threads in the forum said to change it to ‘.n’. Apart from that, it’s the same code you shared.