Hello, I’ve been attempting to get started with deep learning for almost 6 months now. I’ve not had much luck on my own and I’m hoping someone can help me get over the initial hurdle of getting a desktop system up and running for basic experiments.
I have limited time, and the inertia involved is such that I’ve realized no forward progress is being made.
It’s difficult to find information that’s relevant to getting a system up and running and making sure the result is repeatable. After deep searching I’ve found several posted how-tos, and every one of them to date lacks some critical step that prevents it from working. I’ve dumped maybe a few hundred hours into this project now, and I’m no closer to starting than I was 6 months ago.
I’m simply trying to set up a very basic desktop system to use my Nvidia RTX 2070 Super for some deep learning projects with the PyTorch and fastai libraries. I’m running Ubuntu 18.04 LTS. For background, I’m a Linux IT professional with several years of programming experience, currently reaching upper-level math (taking differential equations).
I’ve tried purchasing a Jetson Nano and getting set up, and that was plagued with technical issues as well as restrictive EULAs.
I’m aware that it’s possible to set up temporary cloud nodes to follow along with the fastai course, either through Google or elsewhere, but I’ve personally found sessions with these services to be continually troubled.
This is why I would like to set up a system that just works, and that I can reimage to a known working state when it doesn’t. With those services I’ve had weird errors where the node goes down, parts of the page appear cached, and other inconsistencies that don’t match what’s expected for the lessons, sabotaging my efforts.
Is there a place where I can post a bounty or something to get some help with this? I have some very creative and interesting ideas I would like to investigate in deep learning, but I need a reliable and consistent rig before I can do anything. The cloud services simply don’t offer that, and the information needed to do this on a standalone workstation is just not readily available.
Google results are absolutely horrible; I’ve been up to page 50 on every related keyword search I can think of, and it’s almost all junk or misleading. Can someone help me?
@than3 Thanks for sharing your experience. It is indeed totally overwhelming, and a good reminder for us that someone with your level of experience can also get frustrated.
We like to call this the secret walk through fire that all of us have to face while working on DL problems; once you enter the club, you don’t tell anyone how hard it is to set up a box.
I’m just kidding. It actually took me 13 days (IIRC) to set everything up the first time, so I can completely relate.
Since you have a strong machine and have weighed the other options, I would be happy to help debug. Here’s an outline of the steps that work for me; happy to provide more details or help debug:
- Install Ubuntu (fresh)
- Download the latest CUDA and install just that (it also installs the NVIDIA drivers for you; just follow the instructions)
- Unzip cuDNN
- Set the paths (I always forget the export step)
- Restart, install Anaconda, then install fastai with conda
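The "set the paths" step above usually boils down to two exports. A minimal sketch, assuming the CUDA installer created the usual /usr/local/cuda symlink; append these to ~/.bashrc so they survive the restart:

```shell
# Hedged sketch of the path-setting step, assuming CUDA lives under
# the /usr/local/cuda symlink the installer typically creates.
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```

After sourcing, `nvcc --version` is a quick way to confirm the toolkit is actually on the path.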
@init_27 Thanks. I appreciate any help you or anyone else can provide. I’ve made it as far as getting the NVIDIA runtime installed with Docker, and I’ve made some progress on a Dockerfile, but I have not yet gotten a working, repeatable Docker image set up.
Initially there were some problems with the host BIOS, which prompted switching hardware from AMD to an Nvidia Jetson, and then to a standard Nvidia GPU setup when that failed and Nvidia support was unwilling to help. I also had to disable the APU (the CPU is a Ryzen 5, and the GPU wasn’t working properly until I disabled it, most likely due to ACPI table bugs in the ASUS firmware).
I’ve been through more than a few iterations of fresh installs trying to isolate the issues, so I’ve automated the process with some scripts to assist. I’ve included them as links to pastebin.
I have a caching drive setup where an SSD is paired with several mechanical drives in writeback mode, which is why I move Docker onto that large drive pair. It is battery-backed to prevent unexpected data loss.
The Dockerfile I’ve modified/been trying to cobble together into a working instance is here. It’s based on a post I found, here.
The obvious challenges: the Docker image is huge because of the modified permissions (without them, the build fails at an earlier stage), and then it doesn’t build. I’ve tweaked the script more than a few times, isolating undocumented dependencies that were not installed.
Ultimately I hit a brick wall with an error at Step 13 of the build, which is where I’ve been stuck for the past month. Debug output. Edit: specifically, the error claims nodejs needs to be installed and fails, but nodejs is installed in an earlier step.
Had some time today to work on this and made my way a bit further, to the next rabbit hole.
Corrected the nodejs error: conda doesn’t install the correct version of nodejs for the required dependency. I also noted some other cosmetic issues as TODOs for later.
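For anyone hitting the same wall, the general shape of that kind of fix is to pin a newer nodejs from conda-forge rather than taking the default channel’s version. A minimal Dockerfile sketch; the version floor here is an assumption, so adjust it to whatever the JupyterLab extension build actually demands:

```dockerfile
# The default channel's nodejs was too old for the extension build;
# pull a newer build from conda-forge (">=12" is an assumed floor).
RUN conda install -y -c conda-forge "nodejs>=12"
```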
I’ve got the image built but I’m now stuck again.
There are two issues. First, the fastai.utils submodule is missing when fastai is installed using conda. It’s needed since it’s one of the few documented ways to determine whether the underlying components are working as expected during troubleshooting.
Second, it appears that the documented steps to install the NVIDIA runtime are incorrect and/or outdated.
For whatever reason, nvidia-smi isn’t available inside the Docker image when using nvidia/cuda:10.2-base as the base image with the runtime installed.
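One general thing worth checking here: the legacy --runtime=nvidia flag only resolves if the runtime is registered with the Docker daemon. A sketch of the /etc/docker/daemon.json entry that registration normally requires, assuming nvidia-container-runtime is installed (restart the daemon after editing); this is a general sketch, not a confirmed fix for the image above:

```json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```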
Here’s my current Dockerfile.
@than3 I’ve never tried installing with Docker, so I’m not sure I’d be able to help debug.
Since this path seems painful, have you considered doing a native install instead?
Hi @than3, I know you explicitly asked for instructions to build your own machine, but have you tried DataCrunch.io, where you rent a dedicated server? I haven’t tried it myself, but I imagine it’s more reliable than, for example, the Google Colab environment.
@init_27 Sorry for the delayed response; it’s been a bit of a crazy week.
I’ve tried to stay away from native installs whenever possible, primarily because I’ve had several bad experiences where package dependency updates clobbered parts of the system, introducing unstable behavior that couldn’t be corrected and forced an OS reinstall.
Docker, in my experience, provides the isolation needed to avoid many of these common deployment issues. Repeatability is another reason I favor Docker: with an image I can drop the needed software and configuration onto any local system from a cloud endpoint, and it should work the same as on any other, with minimal effort to get up and running with all the requirements.
@johannesstutz Thanks for the suggestion. I hadn’t considered (or heard of) DataCrunch, but I imagine I would run into exactly the same issues I’m currently experiencing with the Docker image, unless they provide a working configuration out of the box.
Right now I’m leaning more towards solutions that reduce troubleshooting complexity, and running on a third party server would increase the possible points of failure which would increase complexity. I’ll take a look at them but I don’t think it would be a good fit for me, at least at this time.
OK, after more trial and error than I was expecting, here it is, so people don’t have to reinvent the wheel setting up a repeatable ML environment.
In case anyone else is following this with similar issues, I’ve posted links below to my current Dockerfile and the associated scripts for setting this up on Ubuntu 18.04 LTS.
- Following safe practice, as always: if you don’t understand the scripts, don’t run them. I’ve attempted to document, as comments within the scripts, some of the details that prevented this from working previously.
The Docker image is now working properly. There was a slight hardware/mainboard issue related to the BIOS and the Ryzen 5 APU being visible on the ASUS B450 Prime series mainboard; disabling the APU graphics in the BIOS was mandatory due to BIOS bugs. The software portion appears to be working well enough for now to run basic tests reliably and consistently.
Step 1: Setup Docker
# I have my Docker base located somewhere other than the default; if this isn't needed,
# comment out the appropriate section where the script sets up the symbolic link (ln).
DOCKER_BASE=path/to/base bash docker_install.sh
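The symlink section that comment refers to follows a common relocate-and-link pattern. Here is a dry-run sketch using throwaway placeholder paths under /tmp so it can be run safely; the real script applies the same pattern to Docker’s storage directory, with the daemon stopped first:

```shell
# Demonstrates the relocate-and-symlink pattern with throwaway paths;
# docker_install.sh applies the same idea to Docker's storage directory.
src="/tmp/docker-base-demo/src"   # stand-in for the default location
dst="/tmp/docker-base-demo/dst"   # stand-in for the large drive pair
mkdir -p "$src"
mv "$src" "$dst"                  # relocate the directory wholesale
ln -s "$dst" "$src"               # old path keeps resolving via the link
```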
Step 2: Install the NVIDIA Runtime
# Nvidia's provided installation instructions are inconsistent between previous versions of the Docker image and the runtime installation requirements. Use this instead.
Step 3: Build the Dockerfile
# A number of adjustments were made to the original script due to updates and known issues in the conda-forge modules directory. Run:
docker build . -t test
Step 4: Get other files
The installation uses several files provided by the repo that philiplies runs. The post is here.
# --shm-size must be set explicitly as the default provides nowhere near enough shared memory
# and it will choke with a non-descriptive error. It also cannot be adjusted from inside the container.
docker run --rm -p 8888:8888 -v /virt/ai/persist:/opt/notebooks --runtime=nvidia --shm-size 5G test
Notes: The Dockerfile is both built and run from the /virt/ai base folder. The base folder should contain the following files:
Dockerfile, ipython_config.py, run.sh, seed.py, torchtest.py, as they are needed during the build process. The persist directory holds persistent files.
Happy holidays students and researchers.
@init_27 @muellerzr Thanks everyone for your assistance with getting this resolved.
Well, it’s been about a year. I published the Dockerfile for my working environment, and within a few months it was broken.
For those looking at possibly using this setup: currently, the Dockerfile build is broken.
From what I’ve been able to gather, JupyterLab moved to a new extension model in 3.x, which broke TensorBoard.
The issue is open in the JupyterLab repo, and a year later no PRs that resolve it have been accepted. The workaround is to downgrade JupyterLab to 2.x, but doing that causes issues in other packages earlier in the build, and TensorBoard still doesn’t actually work (500 error).
AFAIK, there currently isn’t a solid workaround.
Unfortunately, I simply don’t have the time to go down another rabbit hole to get a basic environment working again. If anyone is interested in revising the Dockerfile so it builds, have at it.