I was reimplementing the BABI MEMNN as described in the lectures, and noticed that the line
substory = [[str(i)+":"] + x for i,x in enumerate(story) if x]
in the parse_stories function is super-critical to performance. This is obvious in retrospect: the way that the network attends to sentences is independent of time, whereas the correct answers to the questions are not. Adding the extra token lets sentences get encoded differently based on position. Interestingly, the Keras babi_memnn.py example does not have the location token, yet still (apparently) reaches 100% accuracy (after lots of epochs).
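To make the order problem concrete, here is a small sketch in plain Python. The helper names (tag_with_position, sentence_bags) and the toy stories are made up for illustration and are not from the lecture notebook; only the list comprehension is the line quoted above. It treats each story as an unordered collection of bag-of-words sentences, which is roughly what attention over summed embeddings sees.

```python
def tag_with_position(story):
    # Same comprehension as the quoted line: prepend "<index>:" to each
    # non-empty sentence, so the index becomes one more token in the bag.
    return [[str(i) + ":"] + x for i, x in enumerate(story) if x]

def sentence_bags(story):
    # Hypothetical helper: an order-free view of a story, i.e. a sorted
    # list of sentences where each sentence is a sorted bag of tokens.
    return sorted(tuple(sorted(sentence)) for sentence in story)

kitchen = ["Mary", "went", "to", "the", "kitchen"]
garden = ["Mary", "went", "to", "the", "garden"]

story_a = [kitchen, garden]  # Mary ends up in the garden
story_b = [garden, kitchen]  # Mary ends up in the kitchen

# Without position tokens the two stories are indistinguishable.
same_without_tags = sentence_bags(story_a) == sentence_bags(story_b)

# With position tokens the bags differ, so the order can be recovered.
same_with_tags = (sentence_bags(tag_with_position(story_a))
                  == sentence_bags(tag_with_position(story_b)))

print(same_without_tags, same_with_tags)
```

The two stories need different answers to "Where is Mary?", but only the tagged version gives the network anything to tell them apart with.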
I thought it’d be interesting to re-implement the memory network so that we don’t need the location tokens, which seem sort of artificial to me. Code is below. Basically I’m just running the substory vectors through an LSTM to give the network some concept of order, and then attending over the hidden states of the LSTM. This network gets to ~100% accuracy in ~4 epochs, which is a little slower than Jeremy’s implementation, but it may be more robust to other ways of framing the problem.
from keras import backend as K
from keras.layers import (Input, Embedding, LSTM, Reshape, TimeDistributed,
                          Lambda, Activation, Dense, dot)
from keras.models import Model

# X_train -> vectorized stories
# Q_train -> vectorized questions
# y_train -> (sparsely) vectorized answers
# vocab_size, emb_dim and max_n_sentences are assumed to be defined already

# --
# Computing attention

# Encode the question into a single vector
inp_q = Input(shape=Q_train.shape[1:])
emb_q = Embedding(input_dim=vocab_size, output_dim=emb_dim)(inp_q)
emb_q = LSTM(emb_dim)(emb_q)
emb_q = Reshape((1, emb_dim))(emb_q)

# Embed each sentence as a sum of word vectors, then run an LSTM over sentences
inp_x = Input(shape=X_train.shape[1:])
emb_x1 = TimeDistributed(Embedding(input_dim=vocab_size, output_dim=emb_dim))(inp_x)
emb_x1 = Lambda(lambda x: K.sum(x, axis=2), output_shape=(max_n_sentences, emb_dim))(emb_x1)
emb_x1 = LSTM(emb_dim, return_sequences=True)(emb_x1)

# Softmax over the dot product of each hidden state with the question
att_mask = dot([emb_x1, emb_q], axes=2)
att_mask = Reshape((max_n_sentences,))(att_mask)
att_mask = Activation('softmax')(att_mask)
att_mask = Reshape((max_n_sentences, 1))(att_mask)

# --
# Applying attention mask

# Second, separately trained encoding of the story
emb_x2 = TimeDistributed(Embedding(input_dim=vocab_size, output_dim=emb_dim))(inp_x)
emb_x2 = Lambda(lambda x: K.sum(x, axis=2), output_shape=(max_n_sentences, emb_dim))(emb_x2)
emb_x2 = LSTM(emb_dim, return_sequences=True)(emb_x2)

# Attention-weighted sum of the hidden states, then predict the answer word
att_emb = dot([att_mask, emb_x2], axes=1)
att_emb = Reshape((emb_dim, ))(att_emb)
out = Dense(vocab_size, activation='softmax')(att_emb)

model = Model(inputs=[inp_x, inp_q], outputs=out)
model.compile(loss='sparse_categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])
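For anyone who finds the Reshape/dot bookkeeping hard to follow, the attention step for a single example can be sketched in NumPy. This is only an illustration of the arithmetic, not the Keras model itself: the function name attend, the sizes and the random inputs are all made up, and the hidden states here stand in for the LSTM outputs.

```python
import numpy as np

def attend(hidden_states, query):
    """Dot-product attention for one example.

    hidden_states: (n_sentences, emb_dim), one vector per sentence
    query:         (emb_dim,), the encoded question
    """
    scores = hidden_states @ query             # (n_sentences,)
    scores = scores - scores.max()             # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax
    summary = weights @ hidden_states          # (emb_dim,) weighted sum
    return weights, summary

rng = np.random.default_rng(0)
n_sentences, emb_dim = 5, 8                    # arbitrary toy sizes
H = rng.normal(size=(n_sentences, emb_dim))    # stand-in for LSTM hidden states
q = rng.normal(size=emb_dim)                   # stand-in for the question encoding

weights, summary = attend(H, q)
print(weights.shape, summary.shape)
```

Note that the model above uses two separately trained story encodings (emb_x1 for the scores, emb_x2 for the weighted sum), whereas this sketch reuses a single H for both to keep it short.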
There are obviously lots of ways to improve the base memory network, but I thought this was one that addresses a fairly major pitfall.
Another pitfall is that the question is only used to compute the attention mask, so if there were two candidate answers in the same substory, there’d be no way to tell them apart. We should probably use the question encoding again after the attention mask is applied, e.g. right before the out layer in the above network.
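One possible way to do that, sketched in NumPy rather than Keras so it runs on its own: concatenate the attended story vector with the question encoding before the final projection. I haven't trained this variant, so treat it as an untested idea. The function name output_logits, the weights and the sizes are all hypothetical; in the Keras model this would presumably mean merging att_emb with a flattened emb_q before the Dense layer.

```python
import numpy as np

def output_logits(att_emb, q_enc, W, b):
    # Feed both the attended story vector and the question encoding
    # into the output layer, so the answer can depend on the question
    # even after the attention mask has been applied.
    features = np.concatenate([att_emb, q_enc])   # (2 * emb_dim,)
    return features @ W + b                       # (vocab_size,)

rng = np.random.default_rng(0)
emb_dim, vocab_size = 8, 20                       # arbitrary toy sizes
W = rng.normal(size=(2 * emb_dim, vocab_size))    # stand-in for trained weights
b = np.zeros(vocab_size)

att_emb = rng.normal(size=emb_dim)   # same attended sentence for both questions
q1 = rng.normal(size=emb_dim)        # e.g. a "who" question
q2 = rng.normal(size=emb_dim)        # e.g. a "where" question

logits_1 = output_logits(att_emb, q1, W, b)
logits_2 = output_logits(att_emb, q2, W, b)
print(np.allclose(logits_1, logits_2))
```

With the original design both questions would get identical logits once they attend to the same sentence; here the output can differ because the question still feeds into the last layer.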