Purpose of RNNs and Theano


(chris) #1

I have to confess that I’m struggling to see the killer application for Recurrent Neural Networks. Can anyone suggest some additional reading I should do? So far the applications I’ve seen remind me of the Markov chain parody generators like Chomskybot. I was an early adopter of Swiftkey, so I get that, but honestly I’d rather speak than use a keyboard.

For the second part of the lesson, you took me inside the sausage factory (Keras) to see how the sausage is made. I’m not sure I want to know how the sausage is made! I was perfectly happy eating sausages. Before we go any further into the sausage factory, why are we here?


#2

Yeah, I was hoping to apply RNNs to a Kaggle competition just as we did with CNNs. Or, using RNNs to understand and generate speech would be cool.


(Jeremy Howard) #3

Good questions. In general, you (probably) need an RNN any time you could benefit from long-term dependencies, and/or want to handle sequences that vary a lot in length. Currently, RNNs are the state of the art in speech recognition and language translation. Both of these tasks require generating a sequence, which is something I haven’t seen done effectively with CNNs.

For those that like sausages, you may like to try other flavors. However, at this stage, few people are making good sausages, so in many situations the flavor you want isn’t available - so for now, at least, you’ll often find you need to know something about how to make them, to get the kind you need.

Also, it helps to understand how they’re made, in order to make your own, better, sausages (which you’ll often want to do, whilst the existing menu is so sparse and of sub-optimal quality!)

PS: I changed the thread title so I can start the lesson discussion with links to the resources - hope that’s OK…


(learner) #4

I invested some time in Karpathy’s minimal RNN Python code, because understanding the implementation helps me follow discussions of deep learning architectures that assume detailed knowledge of the inner workings.
After some refactoring I managed to get it working on Nietzsche, but the results are poor: after 100K steps it can only write words of up to 5 letters in proper English; anything longer than 5 letters is not a word. This is not because of a bug, but simply because the purpose of the original code example is to teach, not to perform.
I would appreciate some guidance on what to try to improve performance (batches, stacking up layers, LSTM, normalization?).

Here is the gist link. (If GitHub fails to open the gist, the Gisto app opens it fine.)


(Jeremy Howard) #5

I’d suggest trying the trick I mentioned in the lesson for simple RNNs: use an identity matrix to initialize your hidden-to-hidden weight matrix, and use ReLU instead of tanh.

Also, try scaling the random weights in your other matrices using Glorot initialization.
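For reference, the two suggestions above could be sketched in NumPy roughly as follows. This is only an illustration under assumptions: the parameter names (Wxh, Whh, Why, bh, by) and the column-vector layout are hypothetical conventions chosen for the sketch, not taken from the gist, and the uniform Glorot variant is just one common form of that initialization.

```python
import numpy as np

def init_irnn_params(vocab_size, hidden_size, seed=0):
    """Identity init for the recurrent matrix, Glorot-uniform for the others."""
    rng = np.random.default_rng(seed)

    def glorot_uniform(fan_out, fan_in):
        # Glorot/Xavier uniform: limit = sqrt(6 / (fan_in + fan_out))
        limit = np.sqrt(6.0 / (fan_in + fan_out))
        return rng.uniform(-limit, limit, size=(fan_out, fan_in))

    return {
        "Wxh": glorot_uniform(hidden_size, vocab_size),  # input -> hidden
        "Whh": np.eye(hidden_size),                      # hidden -> hidden, identity
        "Why": glorot_uniform(vocab_size, hidden_size),  # hidden -> output
        "bh": np.zeros((hidden_size, 1)),                # hidden bias
        "by": np.zeros((vocab_size, 1)),                 # output bias
    }

def rnn_step(params, x, h_prev):
    """One forward step of a simple RNN with ReLU in place of tanh."""
    pre = params["Wxh"] @ x + params["Whh"] @ h_prev + params["bh"]
    h = np.maximum(0.0, pre)              # ReLU activation
    y = params["Why"] @ h + params["by"]  # unnormalized scores (logits)
    return h, y
```

With the identity recurrent matrix and zero biases, the hidden state is initially just carried forward from step to step and the current input is added to it.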

What input sequence size are you using?

PS: Great project to work on! We’ll be building something similar from scratch next week.


(learner) #6

OK, I implemented ReLU instead of tanh, and added the identity and Glorot initialization as suggested. I also implemented gradient checking and was able to verify that the gradients are correct.
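For readers who have not done gradient checking before, a minimal sketch of the idea is below. It is not the code from the gist: the helper names are hypothetical, and it assumes the loss is exposed as a zero-argument function that reads the weight array being perturbed.

```python
import numpy as np

def numerical_gradient(loss_fn, w, eps=1e-5):
    """Central-difference estimate of d(loss)/dw; perturbs w in place and restores it."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        loss_plus = loss_fn()
        w[idx] = old - eps
        loss_minus = loss_fn()
        w[idx] = old  # restore the original value
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

def relative_error(a, b):
    """Largest elementwise relative difference between two gradient arrays."""
    return np.max(np.abs(a - b) / np.maximum(1e-12, np.abs(a) + np.abs(b)))
```

A typical use is to compare the backprop gradient of one weight matrix against `numerical_gradient` on a short sequence and check that `relative_error` is tiny; a large value usually points to a bug in the backward pass.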

After implementing ReLU, things got worse. After 20 steps the network grows over-confident: the softmax predicts the next char with p=0.999 and most other probabilities are <1e-100. The negative log-probability of those shoots up to infinity, so the loss shoots up to inf, and after a while it turns to nan, which gets backpropagated everywhere.

Sounds like I need some sort of regularization. Which kind would be appropriate in this case, L2?


(learner) #7

I figured out a few more things:
The repeated multiplication by the state-to-state transform (W_hh) makes the state vector keep growing in the same direction until its elements get very big.

The purpose of the tanh is to keep the components of the hidden state vector within [-1, 1].

Now that I have replaced the tanh with a ReLU, the values in the hidden state vector grow unbounded and eventually reach values above +1000. When those values hit the softmax and get exponentiated, the result is floating point overflow, which gets backpropagated and eventually takes the whole computation into the ‘nan’ realm.
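The overflow in the softmax itself (though not the unbounded growth of the hidden state) can be avoided by subtracting the largest logit before exponentiating and computing the loss from log-probabilities. The sketch below assumes a 1-D array of logits and a single target index; the function name is hypothetical.

```python
import numpy as np

def softmax_cross_entropy(logits, target_index):
    """Numerically stable softmax cross-entropy loss and its gradient w.r.t. the logits."""
    shifted = logits - np.max(logits)  # largest entry becomes 0, so exp() cannot overflow
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    loss = -log_probs[target_index]    # finite even when the target probability underflows
    dlogits = np.exp(log_probs)        # softmax probabilities
    dlogits[target_index] -= 1.0       # gradient: probs - one_hot(target)
    return loss, dlogits
```

With this formulation a badly wrong, over-confident prediction gives a large but finite loss instead of inf, so nan no longer originates in this step.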

I read a paper that recommends against applying L2 or dropout to the state-to-state transform because it erases the memory of the RNN, so I guess we need the tanh or a sigmoid there in the state-to-state transform to hold things together.

I am using these hyperparameters:
hidden_size = 100 # size of hidden layer of neurons
seq_length = 20 # number of steps to unroll the RNN for
learning_rate = 1e-5

I came across a paper where they apply batch normalization to the state-to-state transform (https://arxiv.org/pdf/1603.09025v4.pdf), so I am looking for the formulas to implement it, especially the formula for the analytic gradient of the batch normalization step, to implement it in the backpropagation algorithm.
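As a starting point, the forward and backward pass of a generic batch normalization step can be sketched as below. This is the standard per-feature formulation over a minibatch of shape (N, D), not the exact recurrent variant from the linked paper, and the function names are hypothetical. Note that it needs a minibatch (N > 1), so it assumes the training loop has been changed to process several sequences at once.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature (column) of x over the batch axis, then scale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)                # biased variance over the batch
    inv_std = 1.0 / np.sqrt(var + eps)
    xhat = (x - mu) * inv_std
    y = gamma * xhat + beta
    cache = (xhat, gamma, inv_std)
    return y, cache

def batchnorm_backward(dy, cache):
    """Analytic gradients of the loss w.r.t. x, gamma and beta, given dy = dL/dy."""
    xhat, gamma, inv_std = cache
    n = dy.shape[0]
    dbeta = dy.sum(axis=0)
    dgamma = (dy * xhat).sum(axis=0)
    dxhat = dy * gamma
    # Accounts for the dependence of the batch mean and variance on every x in the batch.
    dx = (inv_std / n) * (
        n * dxhat - dxhat.sum(axis=0) - xhat * (dxhat * xhat).sum(axis=0)
    )
    return dx, dgamma, dbeta
```

The `dx` expression is the part that is easy to get wrong, so it is worth comparing it against a numerical gradient before wiring it into backpropagation through time.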


(Jeremy Howard) #8

Try this paper for understanding how to get the ReLU/identity trick working: https://arxiv.org/abs/1504.00941


(learner) #9

Thanks! I’ll try reproducing their results on the NLP task: IRNN (1 layer, 1024 units + linear projection with 512 units before softmax).
There are some differences in the dataset, though.
I wonder if I should use character embeddings instead of the one-hot encoding of the char index in the vocabulary (as I am currently using)?
The main differences that I see on a first read of the paper are that they set the bias vectors to zero and use more hidden units, but that may be because their vocabulary is larger.
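On the embeddings question, it may help to note that multiplying a one-hot vector by a weight matrix is the same as looking up one row of that matrix, so the two approaches differ mainly in where the learned table sits and how wide it is. A small sketch, with hypothetical function names and an embedding matrix E of shape (vocab_size, embed_dim):

```python
import numpy as np

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given index."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def embed_by_matmul(E, index):
    """One-hot encode the character, then multiply by the embedding matrix."""
    return one_hot(index, E.shape[0]) @ E

def embed_by_lookup(E, index):
    """Directly select the row for the character; same result, no wasted multiplications."""
    return E[index]
```

Both functions return the same vector for any character index; the lookup form simply skips the multiplications by zero.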


(anamariapopescug) #10

Hi @chris - in addition to the NLP applications Jeremy mentioned, RNNs are useful for any task which involves tagging parts of a sentence (e.g. labeling opinions or opinion targets in reviews, labeling names of biomedical entities in medical text, etc.). But from my reading and talking to people, you want to use LSTMs if you have enough data - if you don’t, in the words of Richard Socher “get more data!” :)


(chris) #11

@anamariapopescug yes, having spoken with one of my doctors and listened to a radiologist talk about brain MRI, I can see how sentiment analysis could be useful. The radiologists use a language that even the clinicians don’t understand, and they’re also very cagey, i.e. they tend not to say what they think unless there’s zero chance of being sued.


(Jeremy Howard) #12

That’s likely to be a good idea.


(Xinxin) #13

Just saw a cool application of LSTMs to disease progression patterns. If you are working with sequential data (e.g. long-term clinical research data), RNNs are super useful.


(Jeremy Howard) #14

@xinxin.li.seattle got a link? Sounds interesting!


(Xinxin) #15

@jeremy Subtyping Parkinson’s Disease with Deep Learning Models. They are the winners of the 2016 PPMI Data Challenge. I learned about their work briefly in today’s webinar.

They are working with MJF to make the code available, but it’s going to be a process. I will contact the authors about their publications because I am very interested in their work too.

You can get the data used in this research through a fairly easy review process.


(Alejandro) #16

I recall from Lesson 6 you were talking about how using embeddings in char rnn was better than using one hot encoding. I wonder how much better it really was for you?


(jerry liu) #17

I recently deployed LSTM models for prediction and anomaly detection on time series data from IoT sensors.

From my experience so far, prediction is straightforward, but anomaly detection with an autoencoder or variational autoencoder needs a bit more work.

For prediction, LSTM models are probably comparable to statistical ML methods such as ARIMA (conjecture based on structure of data, but I don’t know for sure – it looks complicated!). I think it could be interesting to see if models generated from one user’s device can be used to bootstrap models for another device.

In real practice, the process of labeling states or anomalies is very difficult and expensive! I’m still experimenting with different unsupervised and semi-supervised methods for time series data.


(Darshan Bagul) #18

Hello all,

Can you please help me understand the derivation of the equation for the derivative of the cost function with respect to the input weights (dJ/dWi) in an RNN? Thanks!
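The equation being asked about is not reproduced in the thread, so the following is only a sketch for a vanilla tanh RNN under assumed notation (W_i for input weights, W_h for recurrent weights, W_o for output weights, cross-entropy loss summed over time steps); the lesson's own notation may differ.

```latex
% Assumed model (not stated in the thread):
%   a_t = W_i x_t + W_h h_{t-1} + b,   h_t = \tanh(a_t),
%   \hat{y}_t = \mathrm{softmax}(W_o h_t + c),   J = \sum_{t=1}^{T} J_t  (cross-entropy)
\begin{align}
\delta_t &\equiv \frac{\partial J}{\partial a_t}
  = \left(1 - h_t \odot h_t\right) \odot
    \left( W_o^\top (\hat{y}_t - y_t) + W_h^\top \delta_{t+1} \right),
  \qquad \delta_{T+1} = 0, \\
\frac{\partial J}{\partial W_i} &= \sum_{t=1}^{T} \delta_t \, x_t^\top .
\end{align}
```

The sum over t appears because W_i is shared across all time steps, and the W_h^T δ_{t+1} term is how the loss at later steps reaches the pre-activation at step t (backpropagation through time).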