During last night’s class, @jeremy was discussing the curse of dimensionality. So we can concentrate on the code, the TL;DR is: high dimensionality means having so many features that making good predictions becomes effectively impossible. To resolve this, you drop the less important features, and the curse is solved because our code can decide whether each feature is essential or not.
However, the downside seems to be that having many features means holding much more data in our precious memory (I have 2x 1080s with 8GB each in my rig). My memory is already eaten up just trying to play with the training set for this Kaggle competition.
Does it make sense to run a test set and determine which features to drop?
Is it better to keep features and split the test sets?
As @jeremy and @yinterian mentioned in Lecture 1, reducing the mini-batch size should help when you have more data than fits in your GPU memory. It was also demoed in the lecture, if you missed it.
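In plain PyTorch terms, the batch size is just an argument to the data loader; a minimal sketch (the tensors below are random placeholders, not the competition data):

import torch
from torch.utils.data import TensorDataset, DataLoader

# Placeholder data: 10,000 samples with 20 features each
X = torch.randn(10_000, 20)
y = torch.randint(0, 2, (10_000,))

dataset = TensorDataset(X, y)

# A smaller batch_size means fewer samples sit on the GPU at once,
# so each forward/backward pass needs less GPU memory.
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for xb, yb in loader:
    pass  # xb and yb would be moved to the GPU and fed to the model here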
It can also be useful to run PCA (Principal Component Analysis) to reduce the number of features, though it sacrifices some information and goes against the deep learning trend of “learn your features”.
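For reference, a minimal scikit-learn sketch (X here is a random placeholder feature matrix):

import numpy as np
from sklearn.decomposition import PCA

# Placeholder feature matrix: 1,000 rows, 50 features
X = np.random.rand(1000, 50)

# Keep enough components to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)  # same rows, fewer columns than the original 50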
See the first Machine Learning lecture: I don’t believe in the “curse of dimensionality” at all. I don’t suggest (almost) ever dropping features, unless you’ve found after using them in the model that they didn’t help. We’ll look at different ways to regularize a model.
I wouldn’t suggest doing this - PCA is a linear technique, so using it to feed a non-linear model (e.g. a neural net) is likely to throw away valuable information.
My fault for not explaining better. The problem is that I don’t get far enough along in the code for batch size to matter. pandas errors out because the CSV is so large, and the kernel dies.
There are 125,497,041 lines in train.csv alone.
That is even before I start adding features, like the other items or the date split - the kernel dies before that. Perhaps this is a problem with my system’s RAM (32GB), not the GPU. If I cut the data down to about a million rows, everything appears to be okay. The other files load fine since they are well under 125MM rows.
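For now I’m getting by with loading only a slice of the file and telling pandas the dtypes up front so nothing defaults to 64-bit; a rough sketch (the column names and dtypes below are just illustrative, not the actual competition schema):

import pandas as pd

# Read only the first million rows and downcast the numeric columns,
# so the DataFrame fits comfortably in RAM while prototyping.
df = pd.read_csv(
    'train.csv',
    nrows=1_000_000,
    dtype={'store_id': 'int32', 'item_id': 'int32', 'sales': 'float32'},
    parse_dates=['date'],
)

print(df.memory_usage(deep=True).sum() / 1024**2, 'MB')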
There is a chunksize param in pandas’ read_csv which helps you read huge files in chunks. This will read the data serially, though; the question is how to create stochastic batches. You might have to do some preprocessing of the CSV outside your notebook to randomize the data (or sample as you read - see the sketch after the snippet below).
import pandas as pd

filename = '/tmp/test.csv'
chunksize = 100

for chunk in pd.read_csv(filename, chunksize=chunksize):
    do_something_with(chunk)  # chunk is a pandas DataFrame with up to `chunksize` rows
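If all you need is a random subsample that fits in memory (rather than the whole file shuffled on disk), one option is to sample a fraction of each chunk as you go; a rough sketch (the fraction and seed are arbitrary):

import pandas as pd

filename = '/tmp/test.csv'
sampled_chunks = []

# Keep ~1% of the rows from each chunk, giving an approximately
# random subsample without ever loading the full file.
for chunk in pd.read_csv(filename, chunksize=1_000_000):
    sampled_chunks.append(chunk.sample(frac=0.01, random_state=42))

sample = pd.concat(sampled_chunks, ignore_index=True)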