Kaggle NLP Competition - Toxic Comment Classification Challenge

sermakarevich · February 25, 2018, 3:50pm

A very simple example of word polarity analysis based on logistic regression coefficients:

https://www.kaggle.com/sermakarevich/words-polarity-based-on-lr-weights
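
As a rough illustration of the idea (not the code from the linked kernel), the sketch below fits a tiny logistic regression by gradient descent on an invented six-comment corpus and sorts words by their learned weight. The corpus, the learning rate, and the L2 penalty are all assumptions made for the example; a real run would use the competition data and a library such as scikit-learn.

```python
import math

# Toy corpus invented for illustration (1 = toxic, 0 = clean).
docs = [
    ("you are an idiot", 1),
    ("what a stupid idiot", 1),
    ("stupid comment", 1),
    ("thank you for the help", 0),
    ("great comment thank you", 0),
    ("what a great help", 0),
]

vocab = sorted({word for text, _ in docs for word in text.split()})
index = {word: i for i, word in enumerate(vocab)}


def featurize(text):
    """Binary bag-of-words vector over the toy vocabulary."""
    x = [0.0] * len(vocab)
    for word in text.split():
        if word in index:
            x[index[word]] = 1.0
    return x


X = [featurize(text) for text, _ in docs]
y = [label for _, label in docs]

weights = [0.0] * len(vocab)
bias = 0.0
learning_rate, l2 = 0.5, 0.01  # assumed values, not tuned

# Plain batch gradient descent on the L2-regularised log loss.
for _ in range(500):
    grad_w = [0.0] * len(vocab)
    grad_b = 0.0
    for xi, yi in zip(X, y):
        z = bias + sum(w * v for w, v in zip(weights, xi))
        p = 1.0 / (1.0 + math.exp(-z))
        err = p - yi
        for j, v in enumerate(xi):
            grad_w[j] += err * v
        grad_b += err
    n = len(X)
    weights = [w - learning_rate * (g / n + l2 * w)
               for w, g in zip(weights, grad_w)]
    bias -= learning_rate * grad_b / n

# Sort words from most "clean" (negative weight) to most "toxic" (positive).
polarity = sorted(zip(vocab, weights), key=lambda pair: pair[1])
for word, weight in polarity:
    print(f"{word:>8s} {weight:+.3f}")
```

Words that appear only in the toxic examples end up with positive weights and words that appear only in the clean ones with negative weights, which is the polarity reading the post refers to.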

hiromi · February 25, 2018, 7:30pm

Here are attempts by classmates to load up the dataset with multiple labels (towards the bottom of the thread) if you find it helpful.

treyko · February 25, 2018, 9:20pm

Thank you so much for the help. I’m going to check out that discussion!

balajib26 · February 26, 2018, 6:28pm

I am training a bidirectional LSTM with pretrained GloVe embeddings on a Crestle GPU. It takes 1 hour per epoch. Is that normal?
When I trained a CNN with pretrained GloVe embeddings, it took only 1 minute per epoch.

sermakarevich · February 26, 2018, 8:04pm

With CuDNNLSTM, 1 epoch takes 2-3 minutes on a GTX 1080 Ti with 300-dimensional embeddings.

balajib26 · February 27, 2018, 5:06am

How much time did it take to train the CNN? Was it significantly less?

sermakarevich · February 27, 2018, 5:13am

Maybe 4-6 minutes for the same single bidirectional LSTM layer with a single FC layer of size 128.

balajib26 · February 27, 2018, 5:20am

Is anything wrong with this code?

# Imports assumed for Keras 2.x (not shown in the original post)
from keras.models import Sequential
from keras.layers import (Embedding, Bidirectional, LSTM, RepeatVector,
                          GlobalMaxPool1D, Dense, Dropout)
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint

# Create the LSTM model (vocab_size and embedding_matrix are defined elsewhere)
model = Sequential()
model.add(Embedding(vocab_size, 300, weights=[embedding_matrix],
                    input_length=500, trainable=False))
model.add(Bidirectional(LSTM(50, dropout=0.2, recurrent_dropout=0.2)))
model.add(RepeatVector(500))
model.add(Bidirectional(LSTM(50, return_sequences=True)))
model.add(GlobalMaxPool1D())
model.add(Dense(50, activation="relu"))
model.add(Dropout(0.1))
model.add(Dense(6, activation="sigmoid"))

# Compile the model
Adam_opt = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
model.compile(optimizer=Adam_opt, loss='binary_crossentropy', metrics=['acc'])

# Stop early when val_loss stalls and keep only the best checkpoint
early_stopping = EarlyStopping(monitor='val_loss', patience=5, mode='min')
save_best = ModelCheckpoint('/home/nbuser/toxic.hdf', save_best_only=True,
                            monitor='val_loss', mode='min')

history = model.fit(X_train, y_train, validation_data=(X_eval, y_eval),
                    epochs=1, verbose=1, callbacks=[early_stopping, save_best])

Bodhi94 · February 27, 2018, 5:35am

Looks okay.
I tried 2 bidirectional LSTM layers, but the model did not improve as much as I expected.

balajib26 · February 27, 2018, 5:43am

I used CuDNNLSTM instead of LSTM, and that helped: it brought down the training time.

sermakarevich · February 27, 2018, 6:04am

Well, it is just deeper and wider. I use an input_length of 100-250; you use 500. I use 1 LSTM; you use 2 LSTMs. I do not use RepeatVector, which, I assume, makes the output of LSTM1 500 times deeper. Plus, you use LSTM instead of CuDNNLSTM. This might be the difference.
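
To make the RepeatVector point concrete, here is a small pure-Python shape trace of the model posted above. It is a sketch based on how the Keras layers are documented to behave (Bidirectional concatenating both directions by default, RepeatVector copying its input vector n times); the helper names `repeat_vector` and `trace_shapes` are made up for illustration.

```python
def repeat_vector(vec, n):
    """Mimic what RepeatVector(n) does to one sample: copy a 1-D vector n times."""
    return [list(vec) for _ in range(n)]


def trace_shapes(maxlen=500, emb_dim=300, units=50, repeats=500,
                 dense_units=50, n_labels=6):
    """Per-sample output shape after each layer of the posted model."""
    bi = 2 * units  # forward and backward outputs are concatenated by default
    return [
        ("Embedding", (maxlen, emb_dim)),
        ("Bidirectional LSTM (last output only)", (bi,)),
        ("RepeatVector", (repeats, bi)),
        ("Bidirectional LSTM (return_sequences=True)", (repeats, bi)),
        ("GlobalMaxPool1D", (bi,)),
        ("Dense (relu)", (dense_units,)),
        ("Dense (sigmoid)", (n_labels,)),
    ]


for name, shape in trace_shapes():
    print(f"{name:45s} {shape}")

# Every timestep fed to the second LSTM is the same vector.
copies = repeat_vector([0.1, 0.2, 0.3], 4)
```

Read this way, the first LSTM summarises the whole comment into a single 100-dimensional vector, and the second LSTM then runs over 500 identical copies of it, which adds compute without adding any per-token information.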

balajib26 · February 27, 2018, 6:10am

Yes, removing 1 LSTM and using CuDNNLSTM made it better.
Is 1 LSTM enough to learn a good representation?

sermakarevich · February 27, 2018, 6:23am

It is hard to say what kind of architectures people use in this competition. Some have said they can achieve 0.987 public and 0.99+ CV with a single model. My best GRU model gets 0.9811 public and 0.987 CV. This very simple GRU gets 0.983 on the public leaderboard, but I have no idea about its CV score.

balajib26 · February 27, 2018, 6:28am

The number of words per sentence (maxlen) in my model is 500. What maxlen did you use?

sermakarevich · February 27, 2018, 6:30am

I tried 100-500. The GRU I shared in my previous reply uses 100. On a forum I saw a recommendation to start with 100.
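
For readers unsure what maxlen actually does to a comment, the sketch below mimics the default behaviour of Keras `pad_sequences` (pad and truncate at the front, pad value 0) in plain Python. The function name `pad_or_truncate` and the token ids are made up for illustration, and it assumes maxlen is at least 1.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Force a token-id sequence to exactly maxlen items.

    Long sequences keep their last maxlen tokens; short ones are
    left-padded with `value`. Assumes maxlen >= 1.
    """
    kept = list(seq)[-maxlen:]
    return [value] * (maxlen - len(kept)) + kept


short_comment = [12, 7, 93]
long_comment = [5, 8, 13, 21, 34, 55]

padded = pad_or_truncate(short_comment, maxlen=5)
truncated = pad_or_truncate(long_comment, maxlen=4)
print(padded)
print(truncated)
```

A smaller maxlen shortens every sequence the recurrent layer has to step through, which is one reason 100 trains much faster than 500, at the cost of cutting off the start of long comments.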

balajib26 · February 27, 2018, 6:31am

OK. Thanks!!

devm2024 · February 28, 2018, 7:44am

Anyone planning to train embeddings for this competition?

treyko · March 1, 2018, 8:20am

I’m having trouble setting the sequential attribute for a field to False. My code includes the following.

LABEL = data.Field(sequential=False)

But when I check whether sequential is False or not…

hasattr(LABEL, 'sequential')

I get “True”. When I try to set the attribute…

LABEL.sequential = False
hasattr(LABEL, 'sequential')

I still get True!

Any help would be much appreciated!

sermakarevich · March 1, 2018, 11:09am

Sources that I found useful for this competition:

hiromi · March 1, 2018, 2:58pm

From the Python documentation:

hasattr(object, name)
The arguments are an object and a string. The result is True if the string is the name of one of the object’s attributes, False if not. (This is implemented by calling getattr(object, name) and seeing whether it raises an AttributeError or not.)

So it is not checking whether the variable has the value of True or False - it is simply checking whether that variable exists. Hope that helps!
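
A minimal sketch of the difference, using a stand-in class rather than the real `data.Field` (the `Field` class and the `missing_attr` name below are hypothetical and exist only for this example):

```python
class Field:
    """Hypothetical stand-in for a field object with a `sequential` attribute."""

    def __init__(self, sequential=True):
        self.sequential = sequential


LABEL = Field(sequential=False)

# hasattr only reports that the attribute exists, whatever its value is.
exists = hasattr(LABEL, "sequential")

# To see the value itself, read the attribute directly (or use getattr).
value = LABEL.sequential
same_value = getattr(LABEL, "sequential")

# getattr with a default avoids an AttributeError for names that do not exist.
fallback = getattr(LABEL, "missing_attr", None)

print(exists, value, same_value, fallback)
```

So to check the setting, inspect `LABEL.sequential` itself rather than calling `hasattr`.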