[Adv] Significant changes to fastai just pushed

I just made some major changes to fastai. I would love help testing them out, getting comments, etc. They should fix some problems and open up some new options. To use them, run git pull and then conda env update.

Most importantly, I’ve removed opencv entirely, and replaced it with PIL. All image transforms now receive a PIL object, not a numpy array. (@yinterian I haven’t tested whether any of your coords transform stuff is impacted. Note in particular I’ve changed how resize() behaves - I think the copying of the background probably needs to be optional to work with the coords code).
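
As a rough illustration of what "receives a PIL object" means for a transform, here is a minimal sketch using plain Pillow. The function name resize_short_side and its behavior are hypothetical and are not fastai's own transform code; the sketch only shows that the input and output are both PIL images rather than numpy arrays.

```python
from PIL import Image

def resize_short_side(im, targ):
    """Hypothetical transform: scale a PIL image so its shorter side equals `targ`."""
    w, h = im.size                      # PIL reports (width, height)
    ratio = targ / min(w, h)
    new_size = (max(targ, round(w * ratio)), max(targ, round(h * ratio)))
    return im.resize(new_size, Image.BILINEAR)

# A blank 400x200 image stands in for a real photo.
im = Image.new("RGB", (400, 200))
out = resize_short_side(im, 100)
print(type(out).__name__, out.size)
```

Code that still needs an array can convert at the boundary with numpy.asarray(im), assuming numpy is installed.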

Since I removed opencv, that meant that I could go back to using pytorch’s DataLoader. This should therefore fix all memory problems that people have been having. It also reduces the GPU memory usage (opencv was setting aside some GPU memory for every process).

I’ve made a conda package containing pillow-simd, but I named it just ‘pillow’, and put it in a new ‘fastai’ channel, which the environment now makes the top priority channel. Therefore you should find your pillow lib gets replaced with the much faster pillow-simd version I created. I’ve compiled both MacOS and Linux versions.

I discovered that inceptionresnet-v2 and inception-v4 were not training well on dogs v cats after unfreezing. I think I’ve tracked it down to an interesting issue with batchnorm. Basically, updating the batchnorm moving statistics causes these models to fall apart pretty badly. So I’ve added a new learn.bn_freeze(True) method to freeze all bn statistics. This should only be called with precompute=False, and after training the fully connected layers for at least one epoch. I’d be interested to hear if people find this new option helps any models they’ve been fine-tuning.
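
One way to picture freezing batchnorm statistics in plain PyTorch is sketched below. This is an assumption about the general technique (putting batchnorm layers in eval mode so their running mean and variance stop updating), not a description of how learn.bn_freeze(True) is actually implemented. The helper name freeze_bn_stats and the toy model are hypothetical.

```python
import torch
import torch.nn as nn

def freeze_bn_stats(model):
    """Hypothetical helper: put every batchnorm layer in eval mode
    so its running mean/variance are no longer updated."""
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.eval()

# Toy model standing in for a pretrained network.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
model.train()            # train() puts batchnorm back into updating mode...
freeze_bn_stats(model)   # ...so the freeze has to be re-applied afterwards

bn = model[1]
before = bn.running_mean.clone()
model(torch.randn(4, 3, 16, 16))   # forward pass in "training" mode
stats_unchanged = torch.equal(before, bn.running_mean)
print(stats_unchanged)
```

Note that in this sketch the batchnorm weights and biases can still receive gradients; only the moving statistics are held fixed.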

Finally, I’ve changed the meaning of the parameter to freeze_to() so it now refers to the index of a layer group, not of a layer. I think this is more convenient and less to learn for students, since we use layer groups when we set learning rates, so I think this method should be consistent with that.
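
To make the layer-group indexing concrete, here is a small standalone sketch in plain PyTorch. The free function freeze_to and the three toy groups are hypothetical, and the sketch assumes the semantics "freeze every group before index n, train groups n onward"; it is not the library's own code.

```python
import torch.nn as nn

def freeze_to(layer_groups, n):
    """Hypothetical sketch: freeze all layer groups before index n,
    and leave groups n onward trainable."""
    for i, group in enumerate(layer_groups):
        for p in group.parameters():
            p.requires_grad = i >= n

# Three toy layer groups; the first group alone contains several layers.
layer_groups = [
    nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 4)),
    nn.Sequential(nn.Linear(4, 4), nn.ReLU()),
    nn.Sequential(nn.Linear(4, 2)),
]

freeze_to(layer_groups, 2)   # index of a layer group, not of a layer
trainable = [all(p.requires_grad for p in g.parameters()) for g in layer_groups]
print(trainable)
```

The index here lines up with the position you would use when giving one learning rate per layer group.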

Phew - that’s a lot! Don’t worry if none of this makes sense to you - we’ll cover all these concepts in time. But feel free to ask if you’re interested in learning more about anything I’ve mentioned.

Thank you so much for removing the opencv dependency. It was the hardest thing to install and get working. I tried lesson1.ipynb on a P2 instance and ran into this error at the same line even after repeated restarts. nvidia-smi showed that the GPU was fine. It might be something with multiprocessing? Previously it worked fine. I have not tried the AMI yet. My setup uses Docker and nvidia-docker. I can try the AMI and report back as well.

Seems a little weird, but I did git pull and conda env update on Crestle, opened up lesson1.ipynb and the kernel keeps dying. Later, I will try the AWS AMI as well.

For Crestle, do I need to do anything else to get the notebook up and running with these updates?

[Edit]: Works perfectly on AWS!

I tried the AWS AMI and it worked fine. The error above may be something with my Docker setup. I will look into it more. But the AWS AMI on a P2 seemed to work fine.

You rock!

I resolved my error with the help of this post: https://discuss.pytorch.org/t/imagenet-example-is-crashing/1363/2

I was using Docker and had to use --ipc=host for the multiprocessing to work correctly (https://github.com/pytorch/pytorch#docker-image). Glad that was a false alarm :slight_smile:

@jeremy I just tried this updated version (with opencv replaced) on an AWS p2.xlarge with the resnext50 notebook, and it ran smoothly, without the crash that happened with the previous version. Memory consumption has improved a lot. It also seems possible to increase the batch size now.

I’m running this both on an AWS p2.xlarge and Mac OS Sierra local system with a Titan X card and I’m able to run lesson 1 so far on both systems.

Previously on my Mac, it was getting stuck executing this code:

learn.fit(1e-2, 3, cycle_len=1)

after one epoch.

On the Mac, I have been getting this warning before and after the changes:

/Users/kmatsuda/anaconda3/envs/fastai2/lib/python3.6/site-packages/torch/nn/modules/container.py:67: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
  input = module(input)

It sounds like implicit dimensions are deprecated and the printout is just a warning, but I haven’t yet figured out how to print a backtrace on a warning in Python. I did build PyTorch from source to run on the GPU, though, since pip install on the Mac is CPU only.

Can you use conda install to get a GPU pytorch on Mac?

Sorry, I meant to say conda install, but couldn’t edit my post. I used conda install for everything except things like iso_week, etc. But that is correct: conda install couldn’t install a GPU build for the Mac, and on the PyTorch website the only choice for GPU on the Mac was installing from source. (Ah, I didn’t see the edit screen before.)

@kmatsuda are you using an external GPU enclosure or something? (Because there aren’t any CUDA compatible Macs otherwise, AFAIK).

Just be aware that you’re in uncharted territory there! I don’t think many people have tried Pytorch in that environment, and we’ve certainly not used fastai there. If you want to avoid odd issues, I’d suggest using AWS or Crestle.

Yes, that is true. It is not a normal configuration. I have been using both AWS and my local system and I had switched to just using AWS for the class until I saw your update. I’m using an older Mac Silver Tower with a Titan X that has been updated to run EFI.

Most of the time it works fine, but I figured I would mention it if anybody else was crazy enough to try it. :slight_smile:

@jeremy I ran inceptionresnet_2 on the dog breeds dataset and got the following results. It seemed that after bn_freeze and unfreeze, the accuracy actually dropped. Am I doing this the right way?

Thanks for all the updates! I’m running the updated repo and everything seems to be working well on my machine (gtx 1070, Ubuntu 16.04).

One thing I noticed though is that it looks like I am not able to load any model weights or precomputed activations that were saved prior to the update. I’m guessing there might be some incompatibility issues there.

Rock star!

I am using Crestle and have made sure that I am using the newest version with git pull. It shows that I do have the latest version:

This is the error I am getting:

There is no .ipynb checkpoint in the folder listed. Is there something I am doing wrong? It worked fine on Crestle prior to the pull.

There is, but you can’t see it because Linux hides directories that start with ‘.’ by default. Use ls -a to see it. Then you can remove that directory.
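
If you would rather check from inside the notebook, a small Python sketch like the following lists hidden directories under a data folder. The helper name find_hidden_dirs and the temporary folder layout are made up for the demo, and it assumes the stray directory is a Jupyter .ipynb_checkpoints folder.

```python
import os
import tempfile

def find_hidden_dirs(root):
    """Return the paths of all directories under `root` whose name starts with '.'."""
    hidden = []
    for dirpath, dirnames, _ in os.walk(root):
        for d in dirnames:
            if d.startswith("."):
                hidden.append(os.path.join(dirpath, d))
    return sorted(hidden)

# Demo on a throwaway folder that mimics a dataset with a stray hidden directory.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "train", ".ipynb_checkpoints"))
os.makedirs(os.path.join(root, "valid"))

found = find_hidden_dirs(root)
print(found)
```

After checking the list, each unwanted directory can be removed with shutil.rmtree(path).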

Thanks @jeremy!

There were a bunch of ‘.’ files; I must have uploaded them by mistake when I was uploading my dataset.

Hi, on AWS the changes run smoothly. But after updating on Paperspace today, I get this error when running the imports in lesson1. Any ideas on how to fix this?

I’ll check it out. But we may need help from @dillon on that one…
