Handling Memory: Reduce features, split test sets, or user error?

During last night’s class, @jeremy was discussing the curse of high dimensionality. So we can concentrate on the code, the TL;DR is: high dimensionality means having so many features that it becomes very hard to make good predictions. To resolve this, you drop the less important features, and the curse is “solved” because the code can decide which ones are essential.

However, the catch seems to be that having many features means holding much more data in our precious memory (I have 2x 1080s with 8GB in my rig). My memory is already eaten up just trying to play with the training set for this Kaggle competition.

  1. Does it make sense to run a test set and determine which features to drop?
  2. Is it better to keep features and split the test sets?
  3. Am I mishandling the memory in PyTorch?

As @jeremy and @yinterian mentioned in Lecture 1, reducing the mini-batch size should help when you have more data than fits in your GPU memory. It was also demoed in the lecture, in case you missed it.

Mini-batch size discussion during the lecture

(A search for ‘batch’ in the in-class discussion thread should turn up these conversations.)


It is also useful to run PCA (Principal Component Analysis) to reduce the number of features, though it sacrifices some information and goes against the deep learning trend of “learn your features”.
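
For example, here is a minimal sketch of that approach with scikit-learn (the array `X` and its shape are made up for illustration; in practice you would pass your own feature matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

# X stands in for an (n_samples, n_features) matrix of numeric features
X = np.random.rand(1000, 200)

# keep just enough components to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, '->', X_reduced.shape)  # e.g. (1000, 200) -> (1000, k)
```

Note that the transformed columns are linear combinations of the originals, so some interpretability is lost along with the discarded variance.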

See the first Machine Learning lecture: I don’t believe in the “curse of dimensionality” at all. I don’t suggest (almost) ever dropping features, unless you’ve found after using them in the model that they didn’t help. We’ll look at different ways to regularize a model.


Here is the code to use if you want to lower the batch size (bs):

```python
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(resnet34, sz), bs=32)
learn = ConvLearner.pretrained(resnet34, data, precompute=True)
learn.fit(0.01, 1)
```


BTW @yinterian, when posting code, try doing this:

```python
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(resnet34, sz), bs=32)
learn = ConvLearner.pretrained(resnet34, data, precompute=True)
learn.fit(0.01, 1)
```

That renders like this:

```python
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(resnet34, sz), bs=32)
learn = ConvLearner.pretrained(resnet34, data, precompute=True)
learn.fit(0.01, 1)
```

I wouldn’t suggest doing this - PCA is a linear technique, so using it to feed a non-linear model (e.g. a neural net) is likely to throw away valuable information.


My fault for not explaining better. The problem is that I don’t get far enough along in the code to set the batch size. Pandas errors out because the CSV is so large, and the kernel dies.

There are 125,497,041 rows in that one train.csv alone.

Even before I start adding features, like the other items or the date split, the kernel dies. Perhaps this is a problem with my system’s RAM (32GB) rather than the GPU. If I cut the data down to about a million rows, everything appears to be okay. The other datasets load fine since they are well under 125 million rows.

There is a chunksize parameter in pandas that lets you read huge files in chunks. This reads the data serially, though. The question is how to create stochastic batches; you might have to do some preprocessing of the CSV outside your notebook to randomize the data.

```python
import pandas as pd

filename = '/tmp/test.csv'
chunksize = 100
for chunk in pd.read_csv(filename, chunksize=chunksize):
    do_something_with(chunk)  # chunk is a pandas DataFrame with up to `chunksize` rows
```
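
If even one full pass won’t fit in RAM, a rough sketch (my own suggestion, not something from the lecture) is to keep only a random fraction of each chunk as you go, so the concatenated sample fits in memory:

```python
import pandas as pd

filename = '/tmp/test.csv'  # placeholder path, as in the snippet above
sampled_chunks = []
for chunk in pd.read_csv(filename, chunksize=100_000):
    # keep a random 1% of each chunk; tune frac to your RAM budget
    sampled_chunks.append(chunk.sample(frac=0.01, random_state=42))

df_sample = pd.concat(sampled_chunks, ignore_index=True)
```

This only gives you a random subsample for experimentation, not true stochastic batches over the full file.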


OK, thank you!

Since we are working with images, you usually have a folder with the training images and you don’t load all of them at the same time. The DataLoader (http://pytorch.org/docs/master/data.html) helps with the process of creating batches. The DataLoader needs a dataset; datasets are defined here: https://github.com/fastai/fastai/blob/master/fastai/dataset.py.
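
As a rough sketch of that pattern in plain PyTorch (the dataset class and tensors below are invented for illustration, not the fastai classes):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Minimal example: wraps pre-loaded feature and label tensors."""
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        # only the requested item is returned, so batches are assembled
        # lazily rather than loading everything at once
        return self.features[idx], self.labels[idx]

features = torch.randn(10_000, 20)        # toy data
labels = torch.randint(0, 2, (10_000,))

loader = DataLoader(ToyDataset(features, labels), batch_size=32, shuffle=True)
for xb, yb in loader:
    pass  # each iteration yields one shuffled mini-batch of 32 samples
```

For images the same idea applies, except `__getitem__` would open and transform a single image file from disk instead of indexing an in-memory tensor.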


Check the Kaggle kernels for that competition - they show how to read that file in under 4GB of RAM.

This kernel shows how to load the dataset within memory.
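
One common trick for this (not necessarily what that kernel does) is to give pandas explicit, smaller dtypes instead of letting every column default to 64-bit types; the column names and dtypes below are only placeholders to be matched against the actual train.csv:

```python
import pandas as pd

# Placeholder column names/dtypes; adjust to the real CSV schema.
dtypes = {
    'id': 'int32',
    'store_nbr': 'int8',
    'item_nbr': 'int32',
    'unit_sales': 'float32',
}

df = pd.read_csv('train.csv', dtype=dtypes, parse_dates=['date'])
df.info(memory_usage='deep')  # check how much RAM the frame actually uses
```

Halving the width of the numeric columns roughly halves the memory footprint of the loaded DataFrame.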