I can't use the whole dataset as I keep running out of memory, so I used code to break the training and testing datasets into 5 chunks, trained on each chunk (approximately 35,000 rows), and then saved the output predictions. I therefore have 5 prediction files and was using the method above to join them all into one file.
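The splitting itself is nothing fancy, roughly along these lines (a simplified sketch; assumes the data is loaded as a pandas DataFrame and the file names are placeholders):

```python
import numpy as np
import pandas as pd

train = pd.read_csv('train.csv')

# Split the training data into 5 roughly equal pieces so each fits in memory
n_chunks = 5
chunk_size = int(np.ceil(len(train) / n_chunks))
for i in range(n_chunks):
    chunk = train.iloc[i * chunk_size:(i + 1) * chunk_size]
    chunk.to_csv('train_chunk_{}.csv'.format(i), index=False)
```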
Were you able to use the whole dataset without running out of memory or were you also using smaller chunks?
@amritv I was able to train with all of the data. What GPU are you using?
Do you mean that you trained on each class separately, or did you just split it up randomly?
If you split the training data randomly into 5 chunks and still predicted on all of the test data (all classes), then it won't affect the shape of your test predictions and you could still use the reference code I provided before. Personally, I don't think you should predict on only part of the test data each time (if you aren't splitting by class) unless you ran into memory issues with the test data as well, but running predictions alone shouldn't take too much memory.
If you split training and generated predictions separately by class, then you could still create a new submission file from scratch without any concatenation. If you are having memory issues, I highly recommend using bcolz to save and reload your prediction arrays.
Below is an example of how that process might look.
Use the bcolz functions below for saving and loading prediction arrays.
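These are essentially the same style of helpers used in the fast.ai notebooks (sketched from memory, so double-check against your own utils):

```python
import bcolz

def save_array(fname, arr):
    # Persist a numpy array to disk as a compressed bcolz carray
    c = bcolz.carray(arr, rootdir=fname, mode='w')
    c.flush()

def load_array(fname):
    # Load the whole carray back into memory as a regular array
    return bcolz.open(fname)[:]
```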
You'll want to save a prediction array each time after you run predictions on a chunk, e.g. `save_array('toxic_preds.bc', toxic_preds)`.
Once you have finished saving all 5 prediction arrays, reload them all back into one submission file. You could run this in a new notebook, and make sure the training notebook is shut down to save memory.
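Roughly like this, assuming you saved one array per class; the file names and the use of the competition's sample_submission.csv are just my assumptions:

```python
import pandas as pd

# Class labels for the toxic comment competition
labels = ['toxic', 'severe_toxic', 'obscene', 'threat',
          'insult', 'identity_hate']

# sample_submission.csv supplies the 'id' column and row ordering
subm = pd.read_csv('sample_submission.csv')
for label in labels:
    # e.g. 'toxic_preds.bc', 'severe_toxic_preds.bc', ... (placeholder names)
    subm[label] = load_array(label + '_preds.bc').ravel()

subm.to_csv('submission.csv', index=False)
```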
Or, if you just wanted to average your test predictions (and hopefully get better overall predictions), you could still use bcolz for that: save the arrays separately first, then reload them and take the average before writing the final averaged predictions into one submission file.
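The averaging version would look something like this, again with placeholder file names (one full test-set prediction array saved per training chunk):

```python
import numpy as np
import pandas as pd

# Reload the 5 per-chunk prediction arrays (placeholder file names)
chunk_preds = [load_array('chunk_{}_preds.bc'.format(i)) for i in range(5)]

# Element-wise mean across the 5 models' predictions
avg_preds = np.mean(chunk_preds, axis=0)   # shape: (n_test_rows, n_classes)

labels = ['toxic', 'severe_toxic', 'obscene', 'threat',
          'insult', 'identity_hate']
subm = pd.read_csv('sample_submission.csv')
subm[labels] = avg_preds
subm.to_csv('submission_avg.csv', index=False)
```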
Finally crossed the 0.98 threshold with a single model (a PyTorch LSTM with fastText embeddings, using fast.ai goodness to take advantage of SGDR, etc.).
I've done minimal pre-processing and I'm curious what folks here are doing with their comment text. In particular, are you:
- Removing stop words?
- Using lemmatization in your tokenization?
What are you removing in your cleanup and/or tokenization process? For example, I'm doing my best to remove URLs and usernames, and to replace emojis with their textual equivalents.
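For concreteness, that kind of cleanup pass looks roughly like this (a simplified sketch, not my exact code; the regexes and the third-party `emoji` package call are just illustrative):

```python
import re
import emoji  # third-party package; provides demojize()

def clean_comment(text):
    # Replace emojis with their textual names, e.g. ":thumbs_up:"
    text = emoji.demojize(text)
    # Drop URLs
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
    # Drop @-style usernames (adjust the pattern to your data's conventions)
    text = re.sub(r'@\w+', ' ', text)
    # Collapse repeated whitespace
    return re.sub(r'\s+', ' ', text).strip()
```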
Also, I've found thus far that a bi-directional LSTM outperforms a bi-directional GRU (all else being equal). Thought that was interesting.
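For anyone curious, the architecture is conceptually along the lines of the sketch below (a minimal PyTorch version with made-up sizes, not my exact code); swapping `nn.LSTM` for `nn.GRU` is the only change needed for the comparison:

```python
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    # Hypothetical sizes; 300 matches fastText embeddings, 6 = number of labels
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=64, n_classes=6):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                           bidirectional=True)   # try nn.GRU here to compare
        self.fc = nn.Linear(hidden_dim * 2, n_classes)

    def forward(self, x):
        emb = self.embedding(x)       # (batch, seq_len, emb_dim)
        out, _ = self.rnn(emb)        # (batch, seq_len, 2 * hidden_dim)
        pooled, _ = out.max(dim=1)    # max-pool over the time dimension
        return self.fc(pooled)        # raw logits; apply sigmoid for probabilities
```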
Hi everyone, I hope whoever participated in the competition did well on the private leaderboard!
I joined this competition a bit too late, so my first priority is getting better at PyTorch. Did anyone write their scripts using PyTorch and would be willing to share them publicly? It seems most of the public kernels on Kaggle were written in Keras.
@VishnuSubramanian Are you the author of Deep Learning with PyTorch? Your book has been helpful for me in getting started with PyTorch!