GRUs acting forgetful?

Hi,

I’m trying to train Baidu’s speech recognition model, DeepSpeech. Basically, I put five GRU layers on top of a 1D convolution layer with a kernel size of five and 1000 filters. I log the word error rate (WER) on each mini-batch, and I realized that the WER for a sample is much lower right after training on the mini-batch that contains that sample.
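To be concrete, here is a rough Keras sketch of the architecture I mean; the input feature size and vocabulary size below are placeholders I made up for illustration, and my real model also has the CTC plumbing on top:

```python
from tensorflow.keras import layers, models

# Rough sketch of the architecture, not my exact code.
# The 161 spectrogram bins and vocab_size are placeholder assumptions.
vocab_size = 29

model = models.Sequential([
    # 1D convolution: 1000 filters, kernel size 5, over (time, features)
    layers.Conv1D(filters=1000, kernel_size=5, activation='relu',
                  input_shape=(None, 161)),
    # five stacked GRU layers, 1000 units each, returning full sequences
    layers.GRU(1000, return_sequences=True),
    layers.GRU(1000, return_sequences=True),
    layers.GRU(1000, return_sequences=True),
    layers.GRU(1000, return_sequences=True),
    layers.GRU(1000, return_sequences=True),
    # per-timestep character distribution, decoded with CTC in practice
    layers.TimeDistributed(layers.Dense(vocab_size, activation='softmax')),
])
```

Now look at this: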

Truth: “his elevation above everyone the identity of his sober interests with those of the state at large is calculated to make him the people’s natural representative his word has therefore a genuine authority and his ascendency not being invidious is able to secure internal peace even when not enlightened enough to insure prosperity or to avoid foreign wars”

Prediction just after training on its mini-batch: “his alvation above every one the adentity of his soler interest with those of the state at large is cowculated to make him the puople’s notural representative his word as there for agenuane othority and his assenasy nop being invideous is able to secure interval peace even when notinlikeingd anough to insure prosperity or to avoid forgn wars” WER: 0.08

Prediction just before its mini-batch: “his elmation above every one the adentitvy of his soler anterest withous of the stay at large is cowlklated to make him theppeople’s natural representative has word has their for a genwone of thority and his assenzi not being imvidius is able to secure in trul peace even when not an liked tha nough to in sure prosperity or to afvord forn wors” WER: 0.13

Prediction at the end of the epoch: “his elevation above every one fea denidty of hi so re interest withthose of the stayi t large is cowculatid to make him bhap people’s natural representative has worn as therfore jenwon fority and his assaysy not being imbillious is able to secure internal pece even when not in lined than ough to hind sur prossperetly or two avoid form mors” WER: 0.15

Prediction at the end of the next epoch: “his elometion above every one be a dentinty of his solp ery interest with those of the stay ot large is cowculated to make him thet people’s natural represenantive has word as there fore a genuwine forty and his ascency nut being invillious is able to seccure in trol peace even when not in lined than nough to himsure prosperity or to a void form mwores” WER: 0.14

The last three predictions were made after reloading the saved weights. Notice that the prediction made just after training on that mini-batch is much better than the ones made before and after it. My guess is that the GRUs (each with 1000 units) are forgetting, but I’m rather new to RNNs, so I’m looking for help figuring out what’s going on :wink:
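In case it matters, the WER numbers above are just word-level edit distance divided by the number of words in the truth. A minimal sketch of the kind of helper I use (rewritten here for illustration, not copied from my actual code):

```python
def wer(truth, prediction):
    """Word error rate: word-level edit distance / number of truth words."""
    ref, hyp = truth.split(), prediction.split()
    # dp[i][j] = edits needed to turn the first i truth words
    # into the first j predicted words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)
```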

For more information: I sorted the first epoch in reverse order of sample duration (see the sketch below). Baidu suggests sorting by duration, but my GPU RAM overflows if I do that; I believe that by sorting in reverse I still get the benefit of putting similar-length samples in the same mini-batch. They only sort the first epoch, and I do the same.
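Concretely, the first-epoch ordering is roughly this (the sample list and batch size are made-up placeholders):

```python
# (duration_seconds, audio_path, transcript) tuples; values are made up
samples = [
    (12.4, "utt1.wav", "..."),
    (3.1, "utt2.wav", "..."),
    (7.8, "utt3.wav", "..."),
]

# Baidu sorts the first epoch by duration; sorting their way overflows
# my GPU RAM, so I sort in reverse (longest first) instead.
samples.sort(key=lambda s: s[0], reverse=True)

batch_size = 16  # whatever fits in memory
batches = [samples[i:i + batch_size]
           for i in range(0, len(samples), batch_size)]
# Either way, neighbouring samples have similar durations, so each
# mini-batch needs little padding; later epochs are not sorted.
```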

Thanks