Deep Learning on Kaggle tabular data

(Reijer) #1


I am unable to predict on the whole test set within a reasonable amount of time on Kaggle.

The project I am currently working on uses the PUBG dataset on Kaggle. My goal is just to train for a single cycle, write the predictions to a CSV, and submit that file to the competition. Whenever I try to create the submission CSV, it runs for a few hours and then the kernel crashes. This is not entirely unreasonable to me, since there are 1.9 million matches to predict and the Kaggle kernel does not seem very fast. However, using that kernel is the only way to submit. Is there something I should be doing differently to get my predictions in a reasonable amount of time (less than 3-4 hours)?

Additionally, I am only training using part of the data since otherwise I run out of RAM.

My Kaggle notebook is on GitHub because, for some reason, it does not commit properly on Kaggle.

I realize this might be a question for Kaggle, let me know if I need to move it to the forum there.



(Zachary Mueller) #2

Try turning your entire prediction dataset into a DataBunch object or a data loader. That way you can call get_preds() on the whole set, which should be a lot faster. Have a look at the source code for get_preds() to see how it works.
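
To illustrate why batching helps, here is a minimal, library-free sketch. It is not fastai code: `predict_batch` is a hypothetical stand-in for a model's batched forward pass, the `winPlacePerc` column name is an assumption about the submission format, and the rows are synthetic. The point is only that the number of model calls drops from one per row to one per batch, and that the CSV is written incrementally instead of being held in memory.

```python
import csv
import io
from itertools import islice


def batches(rows, batch_size):
    """Yield successive lists of at most batch_size rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


def predict_batch(batch):
    # Hypothetical stand-in for a batched model call:
    # returns the mean of each row's features as its "prediction".
    return [sum(features) / len(features) for _, features in batch]


def write_submission(rows, out, batch_size=4096):
    """Predict batch by batch and stream the results to a CSV file object."""
    writer = csv.writer(out)
    writer.writerow(["Id", "winPlacePerc"])  # assumed submission header
    n_calls = 0
    for batch in batches(rows, batch_size):
        preds = predict_batch(batch)
        n_calls += 1
        writer.writerows(
            (row_id, f"{p:.4f}") for (row_id, _), p in zip(batch, preds)
        )
    return n_calls


# Synthetic data: 10,000 rows of (id, features).
rows = [(f"id{i}", [i % 10 / 10, 0.5]) for i in range(10_000)]
buf = io.StringIO()
n_calls = write_submission(rows, buf, batch_size=4096)
print(f"{n_calls} model calls for {len(rows)} rows")
```

With a real model, the batch size would be limited by GPU or CPU memory rather than chosen arbitrarily.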


(Pavel) #3

Hello. I haven’t worked with Kaggle kernels. Do they use a GPU?
If so, you can dramatically increase your batch size in databunch(bs=32), especially since you are using such a small model (layers=[200,100]).
When I worked on a desktop with my 1070 GPU (8 GB of RAM), I managed to reach a batch size of 16k and more. That can dramatically increase your speed.
It’s not very clear why the kernel crashed. If you are hitting some kind of safety timeout, increasing the batch size can help (though obviously not if RAM is the problem).
I would also suggest looking closely at your embeddings in learn.model. Since cat_names = ['Id', 'groupId', 'matchId'] are supposedly unique values, each of these columns can end up with the maximum embedding size (600 by default for each column, if I remember correctly). The embeddings could then exceed the complexity of the rest of your model (layers=[200,100]), increasing RAM consumption without really helping much. I would suggest throwing away the unique-valued columns or turning them into continuous ones.
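
A rough back-of-the-envelope check of the embedding point, as a sketch only. It assumes the fastai v1 default sizing rule as I recall it, `min(600, round(1.6 * n_cat**0.56))`, and the column cardinalities below are made-up placeholders, not the real PUBG counts. It ignores the extra "missing" category and any bias terms.

```python
def emb_size(n_cat, cap=600):
    """Embedding width for a column with n_cat categories (assumed default rule)."""
    return min(cap, round(1.6 * n_cat ** 0.56))


def emb_params(n_cat):
    """Number of weights in the embedding table: one row per category."""
    return n_cat * emb_size(n_cat)


# Hypothetical cardinalities for near-unique ID columns.
cardinalities = {"Id": 1_000_000, "groupId": 500_000, "matchId": 20_000}

for name, n_cat in cardinalities.items():
    print(f"{name}: width {emb_size(n_cat)}, {emb_params(n_cat):,} weights")

total_emb = sum(emb_params(n) for n in cardinalities.values())
hidden = 200 * 100  # weights between the two hidden layers in layers=[200,100]

print(f"embedding weights: {total_emb:,}")
print(f"hidden-layer weights: {hidden:,}")
print(f"embedding memory at 4 bytes per weight: {total_emb * 4 / 1e9:.1f} GB")
```

Under these assumptions a single column with a million distinct values already needs hundreds of millions of weights, which dwarfs the dense layers and goes a long way toward explaining high RAM use.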
