How to get the best results?

Hello everyone,
I recently completed FAST.AI's Machine Learning course and Part 1 of the Deep Learning course. After that, I tried my hand at two Kaggle competitions:

(Both kernels are properly indexed. If you like them, I would appreciate an upvote, and I am open to any kind of review, good or bad.)

  1. Digit-Recognizer (Current standing: 355 - Top 15%) - Accuracy: 99.571%
    My Kernel : Training your own CNN using PyTorch
    I wrote my own DataLoader, an LR annealing class (cosine with warm restarts) and data augmentation, added a t-SNE plot, and showed convolution in action.
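For readers unfamiliar with the term, cosine annealing with warm restarts can be sketched roughly as below. This is an illustration only, not the code from the kernel: the function name, the default learning rates and the cycle lengths are all assumptions.

```python
import math


def cosine_lr_with_restarts(step, base_lr=1e-2, min_lr=1e-5,
                            cycle_len=100, cycle_mult=2):
    """Learning rate at a given training step (hypothetical helper).

    Within a cycle the rate decays from base_lr to min_lr along a half
    cosine. When the cycle ends the rate jumps back to base_lr (the
    "warm restart") and the next cycle is cycle_mult times longer.
    """
    if step < 0 or cycle_len <= 0 or cycle_mult < 1:
        raise ValueError("step must be >= 0, cycle_len > 0, cycle_mult >= 1")

    # Walk forward through the cycles until we find the one containing `step`.
    length = cycle_len
    while step >= length:
        step -= length
        length *= cycle_mult

    progress = step / length  # 0.0 at the start of a cycle, close to 1.0 at its end
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Print the schedule at a few steps around the first restart (step 100).
for s in (0, 50, 99, 100, 200, 299, 300):
    print(s, cosine_lr_with_restarts(s))
```

PyTorch also ships a scheduler for this, `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`, which may be preferable to a hand-written class.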

In case the kernel is TL;DR, here is what I want to ask:

a) The main speed bottleneck is data augmentation. I apply random augmentation (rotation + translation) to the whole dataset every epoch. I am using OpenCV for it, but it is still slow. Isn't OpenCV supposed to be faster than torchvision, PIL, etc.? My implementation is about 10 times slower than a Keras implementation I have seen in other kernels, which finishes an epoch with data augmentation in about 2-3 seconds, whereas mine takes about half a minute. Still, the Keras implementation gives better results (about 97-98%). What am I missing here?
b) What else can I do to improve my accuracy? I am already normalizing the data and using 4 convolutional layers with BatchNorm and Dropout.
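To make question (a) concrete, here is a minimal sketch of one way to augment a whole batch at once instead of looping over images: rotation and translation are folded into a single inverse affine mapping and applied with vectorised NumPy indexing. It is not the code from my kernel and I have not benchmarked it against OpenCV or Keras. The function name, the default ranges and the nearest-neighbour sampling are assumptions made for illustration.

```python
import numpy as np


def random_affine_batch(images, max_deg=10.0, max_shift=2.0, rng=None):
    """Randomly rotate and translate every image in a (N, H, W) array.

    Hypothetical helper: each image gets its own angle and shift, and the
    warp uses nearest-neighbour sampling with zero padding.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, h, w = images.shape

    # One random angle (radians) and one (tx, ty) shift per image.
    theta = np.deg2rad(rng.uniform(-max_deg, max_deg, size=n))
    tx = rng.uniform(-max_shift, max_shift, size=n)[:, None, None]
    ty = rng.uniform(-max_shift, max_shift, size=n)[:, None, None]
    cos = np.cos(theta)[:, None, None]
    sin = np.sin(theta)[:, None, None]

    # Output pixel coordinates, centred on the image middle and un-shifted.
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    xc = xs[None] - cx - tx
    yc = ys[None] - cy - ty

    # Inverse rotation: where in the source image each output pixel comes from.
    src_x = np.rint(cos * xc + sin * yc + cx).astype(int)
    src_y = np.rint(-sin * xc + cos * yc + cy).astype(int)

    # Pixels that map outside the source image stay zero.
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    img_idx = np.broadcast_to(np.arange(n)[:, None, None], valid.shape)

    out = np.zeros_like(images)
    out[valid] = images[img_idx[valid], src_y[valid], src_x[valid]]
    return out


# Example: augment a fake batch of 64 MNIST-sized images.
batch = np.random.default_rng(0).random((64, 28, 28)).astype(np.float32)
augmented = random_affine_batch(batch, rng=np.random.default_rng(1))
print(augmented.shape)
```

The design idea is simply to pay the Python overhead once per batch rather than once per image; whether that closes the gap with the Keras kernels is exactly what I am unsure about.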

  2. Costa Rican Household Poverty Level Prediction (195/619 - Top 32%) - Accuracy: 42.1% (Highest: 44.8%)
    My Kernel : Learning to make diff models +DataClean +FeatEng

In case the kernel is TL;DR, here is what I want to ask:

a) I have used RandomForest, LightGBM and a neural network with downsampled, upsampled and original data. RandomForest (acc: 0.420) and LightGBM (0.421) give the best results with upsampled data. The neural network gives very poor results with both upsampled data (even worse: 0.20) and downsampled data (0.33), regardless of whether I use 2 hidden layers or 5. Is there no way to make a neural network learn when only a few data points are available?
b) During the competition I came across SMOTE. In your opinion, what is the best method to increase the number of data points for a class that has only a few?
c) Which algorithms work best for stacking?
d) Will stacking work for a dataset this small, or for a dataset where one class has far fewer data points than the others?
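For context on question (b), the core idea of SMOTE is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. Below is a minimal NumPy sketch of that idea, assuming purely numeric features; the function name and parameters are hypothetical, and it is not the code used in my kernel.

```python
import numpy as np


def smote_like(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic rows from minority-class samples X_min.

    Hypothetical helper illustrating the SMOTE idea: pick a minority
    point, pick one of its k nearest minority neighbours, and sample a
    point on the line segment between the two.
    """
    rng = np.random.default_rng() if rng is None else rng
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)

    # Pairwise squared Euclidean distances; a point is not its own neighbour.
    dist = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Random base point, random neighbour of it, random position in between.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


# Example: grow a 20-row minority class by 100 synthetic rows.
minority = np.random.default_rng(0).normal(size=(20, 4))
synthetic = smote_like(minority, n_new=100, rng=np.random.default_rng(1))
print(synthetic.shape)
```

The imbalanced-learn package provides a tested implementation (`imblearn.over_sampling.SMOTE`). Either way, oversampling should be applied to the training folds only, so that synthetic points do not leak into validation.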

I have also written a post on Medium, on a topic taken from the Project Ideas of the Numerical Linear Algebra course taught by @rachel :
Speed Up your Algorithms Part 1 — PyTorch
If you like it, please give it some claps and follow me. I will be writing more soon.

If you liked my work, I would like to connect with you on:
Twitter (I have recently created this account), Medium, Kaggle

I have done Andrew Ng's course and Kirill Eremenko's course, and both are good in their own right, but it was FAST.AI that enabled me to jump into Kaggle competitions. I also learned some of the latest state-of-the-art techniques, such as LR annealing, finding the initial LR, data visualization techniques and many more.
The course on Numerical Linear Algebra is also great! I think it is important to have some knowledge of how things happen under the hood, and of how techniques like SVD, LR decomposition, etc. work and where they are applied.
Thank you @jeremy and @rachel for making this MOOC!!

@rachel , @jeremy
I wrote one more article on Numba: