Another treat! Early access to Intro To Machine Learning videos

patrickluvsoj · August 28, 2018, 8:26pm

Hoping to get advice/guidance on how to handle large files so that I can run a random forest.

The data is 7GB and it's from a Kaggle comp called TalkingData AdTracking Fraud Detection Challenge. I was able to load the data by specifying the data types in a dictionary and passing that to read_csv(), but as soon as I started trying to process the data, I started hitting memory errors. Specifically, they came up when I tried running add_datepart() and to_feather().

For additional context, I am using Gradient on Paperspace with a GPU machine that has 30GB of RAM and 8 cores. Given this, I was wondering what's the best way to process large files and run random forests.
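For reference, the dtype-dictionary loading described above looks roughly like the sketch below. The column names and integer widths are assumptions based on how the competition's train file is usually described, and the tiny in-memory CSV is a made-up stand-in for the real 7GB file, so adjust both to the actual data.

```python
import io

import pandas as pd

# Made-up stand-in for the real train.csv; column names are assumed.
SAMPLE_CSV = """ip,app,device,os,channel,click_time,is_attributed
83230,3,1,13,379,2017-11-06 14:32:21,0
17357,3,1,19,379,2017-11-06 14:33:34,0
35810,3,1,13,379,2017-11-06 14:34:12,1
"""

# Assumed widths: narrow unsigned integers instead of the default int64.
# Check the real value ranges before relying on these.
DTYPES = {
    "ip": "uint32",
    "app": "uint16",
    "device": "uint16",
    "os": "uint16",
    "channel": "uint16",
    "is_attributed": "uint8",
}


def load(source):
    """Read the CSV with explicit dtypes and parse the timestamp column."""
    return pd.read_csv(source, dtype=DTYPES, parse_dates=["click_time"])


df = load(io.StringIO(SAMPLE_CSV))
default_df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Compare the in-memory footprint of the two versions (in bytes).
small_bytes = df.memory_usage(deep=True).sum()
default_bytes = default_df.memory_usage(deep=True).sum()
print(small_bytes, default_bytes)
```

On the real file you would pass the path to load() instead of the StringIO object. Checking memory_usage() after loading is a cheap way to see how much headroom is left before the heavier processing steps.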

From what I found in other forum threads, people seem to be splitting the files, but I was hoping someone has run into this and has a specific example they can share here. Thank you!
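The splitting idea from those threads can be sketched roughly as follows: read the file in chunks, do the date processing per chunk, and only then combine the results. This is a minimal sketch, not the method from any particular thread. It uses plain pandas `.dt` accessors in place of add_datepart(), the column names are assumed, and the sample data and chunk size are made up for illustration.

```python
import io

import pandas as pd

# Made-up stand-in for the real file; column names are assumed.
SAMPLE_CSV = """ip,click_time,is_attributed
83230,2017-11-06 14:32:21,0
17357,2017-11-06 14:33:34,0
35810,2017-11-06 15:34:12,1
45745,2017-11-07 16:01:02,0
"""


def process_in_chunks(source, chunksize):
    """Read the CSV piece by piece and extract date parts per chunk."""
    parts = []
    reader = pd.read_csv(source, parse_dates=["click_time"], chunksize=chunksize)
    for chunk in reader:
        # Keep only small integer date parts, then drop the wide timestamp.
        chunk["hour"] = chunk["click_time"].dt.hour.astype("uint8")
        chunk["day"] = chunk["click_time"].dt.day.astype("uint8")
        parts.append(chunk.drop(columns=["click_time"]))
    # Combine the already-shrunk chunks into one frame.
    return pd.concat(parts, ignore_index=True)


result = process_in_chunks(io.StringIO(SAMPLE_CSV), chunksize=2)
print(result)
```

The point of working per chunk is that the temporary copies made during processing are only as big as one chunk, not the whole dataset. If the combined result is still too large, each processed chunk could be written to its own file instead of being collected in a list.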

Update!! - Found the following post, which gave me the answers. Not sure why I didn't find it earlier: Most effective ways to merge "big data" on a single machine