This is pretty cool - my initial plan was to use spot p2.xlarge instances and mount a persistent EBS volume as the notebook root, but I’ve had trouble getting p2 instances to stick around for more than an hour or so, and this appears to be a little cheaper than what I was paying for spot instances anyway.
My initial experience with it has mostly been good - it was really easy to get it set up and get into the box. The only thing that seems slightly unfortunate is the I/O - unzipping dogscats.zip took over 10 minutes on this box, but takes under a minute on a p2.xlarge with an attached EBS volume. Not sure why it is so slow or whether that can be improved, but it isn’t terrible.
I tried to get the first lesson from this course running on it, and probably due to how quickly some of these libraries are changing, I had to make a handful of changes to the environment and still didn’t get it fully working…
Here’s what I had to do (I’m fairly new to both Python and deep learning, so let me know if there is anything I’m missing):
- Switch to the Python 2.7 env:
source activate py2.7-env
- Install a bunch of missing libraries:
conda install Pillow
conda install scikit-learn
conda install bcolz
- Remove the installed versions of Keras and Theano and re-install them via pip:
conda remove keras
conda remove theano
pip install theano==0.9.0
pip install keras==1.2.2
- Tell Keras to use Theano instead of TensorFlow (I got an error running Vgg16 with TF - not sure if it’s supported or not, but I know the Amazon AMI is set up to use Theano)
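Concretely, that meant editing ~/.keras/keras.json so it points at Theano - something like this (I believe image_dim_ordering also needs to be "th" for the course notebooks, since the VGG16 weights assume Theano’s channel ordering):

```json
{
    "image_dim_ordering": "th",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "theano"
}
```

Setting KERAS_BACKEND=theano in the environment before starting the notebook should also override the backend for that session.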
After all of that, I get the following now when running it:
> Exception: ('The following error happened while compiling the node', GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0}), '\n', 'nvcc return status', 2, 'for cmd', '/usr/local/cuda/bin/nvcc -shared -O3 -Xlinker -rpath,/usr/local/cuda/lib64 -arch=sm_37 -m64 -Xcompiler -fno-math-errno,-Wno-unused-label,-Wno-unused-variable,-Wno-write-strings,-DCUDA_NDARRAY_CUH=c72d035fdf91890f3b36710688069b2e,-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION,-fPIC,-fvisibility=hidden -Xlinker -rpath,/home/ubuntu/.theano/compiledir_Linux-4.4--aws-x86_64-with-debian-stretch-sid-x86_64-2.7.14-64/cuda_ndarray -I/home/ubuntu/.theano/compiledir_Linux-4.4--aws-x86_64-with-debian-stretch-sid-x86_64-2.7.14-64/cuda_ndarray -I/usr/local/cuda/include -I/home/ubuntu/anaconda3/envs/py2.7-env/lib/python2.7/site-packages/theano/sandbox/cuda -I/home/ubuntu/anaconda3/envs/py2.7-env/lib/python2.7/site-packages/numpy/core/include -I/home/ubuntu/anaconda3/envs/py2.7-env/include/python2.7 -I/home/ubuntu/anaconda3/envs/py2.7-env/lib/python2.7/site-packages/theano/gof -L/home/ubuntu/.theano/compiledir_Linux-4.4--aws-x86_64-with-debian-stretch-sid-x86_64-2.7.14-64/cuda_ndarray -L/home/ubuntu/anaconda3/envs/py2.7-env/lib -o /home/ubuntu/.theano/compiledir_Linux-4.4--aws-x86_64-with-debian-stretch-sid-x86_64-2.7.14-64/tmpYDFYaZ/ea4e203b6529466794536f8a1bfa77ae.so mod.cu -lcudart -lcublas -lcuda_ndarray -lcudnn -lpython2.7', "[GpuDnnConv{algo='small', inplace=True}(<CudaNdarrayType(float32, 4D)>, <CudaNdarrayType(float32, 4D)>, <CudaNdarrayType(float32, 4D)>, <CDataType{cudnnConvolutionDescriptor_t}>, Constant{1.0}, Constant{0.0})]")
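Since the failing command above is an nvcc invocation, it’s probably worth checking what CUDA toolchain the box actually has - a quick sketch (the paths assume the stock /usr/local/cuda install, so adjust if yours differs):

```shell
# CUDA toolkit version, if nvcc is where the error output says it is
if [ -x /usr/local/cuda/bin/nvcc ]; then
    /usr/local/cuda/bin/nvcc --version
else
    echo "nvcc not found at /usr/local/cuda/bin/nvcc"
fi

# Driver version and GPU status
if command -v nvidia-smi > /dev/null; then
    nvidia-smi
else
    echo "nvidia-smi not found"
fi

# cuDNN version, if the header is in the usual place
if [ -f /usr/local/cuda/include/cudnn.h ]; then
    grep -A 2 "#define CUDNN_MAJOR" /usr/local/cuda/include/cudnn.h
else
    echo "cudnn.h not found at /usr/local/cuda/include/cudnn.h"
fi
```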
I’m guessing I’m missing some of the CUDA libraries/drivers or have the wrong versions, but when I tried to follow the steps from install-gpu.sh, I ran out of disk space (it looks like the root volume only has 20 GB). I’ll try freeing up some space by deleting the Python 3 env and attempt to get it working again tomorrow.
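For the disk space side, this is roughly the plan - check what’s left, see what’s taking up room, then remove the unused env (the anaconda3 path is taken from the error output above, and the env name is a guess, so check conda env list first):

```shell
# How much space is left on the root volume?
df -h /

# Size up the conda envs (path from the error output above)
du -sh /home/ubuntu/anaconda3/envs/* 2>/dev/null | sort -h || true

# To actually free the space, remove the unused Python 3 env -
# the name here is hypothetical, so list envs first to find yours:
# conda env list
# conda env remove --name py3-env
```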