Build a Deep Learning Box that runs fast.ai for $99 ish

How I built a system that runs Fast.Ai for $99 ish:

A month or so ago I ordered a couple of NVIDIA Jetson Nano Development Kits for $99 each. They advertised that the SBC had 4GB of RAM, an ARM CoreA-57 processor, and best of all, runs CUDA 10.0. Not only that, they said it runs Pytorch 1.0. Wow, I thought this could run Fast.Ai. So I ordered the systems from SparkFun Sparkfun and waited (because they were on back order and hadn’t shipped yet), until last Thursday when they arrived. The system boots from a micro SD card, so I bought a few 64GB class 10 cards for about $15 each, it supports USB 3.0 and Gigabit Ethernet.

I download and flashed the image according to these instructions , plugged the card in and booted the GUI of what looks like the Ubuntu 18.04 desktop. Since the board only has 4GB of RAM for use by both the processor and CUDA, I knew a swap drive was needed. I created one of 8GB (see below). Since I’m using this box for fast.ai, I can SSH in and save myself all the RAM used for the GUI (about 400MB), so I got rid of the GUI using another script I wrote (which also deletes some memory hog programs I wasn’t using.

I then naively tried to follow the instructions of the fast.ai website and these forums to create fast.ai. No such luck.

The Good, the Bad and the Ugly:

Unlike conda (Anacona 3.X) installing packages using pip (and in this case pip3) is no fun. The pip install regime uses python’s setup tools, which is imperfect at best. When you specify dependencies (packages your package needs to run) (and fastai and pytorch have a lot of them), It looks here and there on your machine, then gives up and tries to download and build the necessary packages from pypy (the python website). You can guess how well that’s going to work in an environment where you’re running on brand new hardware. The good news is that NVIDIA provides a pip wheel (a kind of almost pre-built package) for pytorch 1.0 (I’m not including the URL on purpose, keep reading for more good news). The bad news is that the fast.ai dependencies completely ignore that package (and yes, I tried changing the version dependencies in fast.ai setup.py). So for the next two solid days, I found apt packages that the machine would install to fulfill the dependencies, and figured out how to build the dependencies that didn’t have apt packages. The ugly news is that that when you install fast.ai using pip you must first install all this stuff and then use pip fast.ai –no-dependencies. This means that when fast.ai changes it will be up to you to find and fix broken packages. But even though I didn’t install virtualenv (a virtual environment like conda) because I was in a hurry to see if could get this working, I’ve built and tested an install script (see below) that does this all for you automatically (but it does take a couple of hours to run).

Setting up Jupyter Notebook:

After installing fast.ai, I wanted to test its performance using the same fastai course V3 part 2 notebooks I’m running on my big linux box (Core I-9, 64GB RAM, two NVIDIA GTX-1080s). But first I had install jupyter. That also proved to be an interesting install, but again after I was done I wrote an install script so you won’t have to do things like; change the default listening IP to 0.0.0.0 from locahost, setup a password, etc.) and now I was almost ready to how well the notebooks ran.

Memory isn’t everything, but it’s definitely something:

Back in the old days (of say 2010), 4GB was a lot of memory. And If you’re not using the GPU on this board, it is enough to get your notebooks running well (the 8 GB of swap file helps quite a bit). But if you’re using CUDA, it doesn’t run on the swap disk, so you need each and every byte of that 4GB. To get that, it’s time to jettison the GUI and run via a remote console using SSH. So with a few more steps (see below), I was able to get the total memory for linux and jupyter notebooks down to around 500MB, leaving around 3.5GB for CUDA.

And now the Results:

Notebook:		Section		Jetson Time		Big linux box time:	Notes:
00_exports		N/A			Runs			Runs			uses only python
01_matmul		initial 	4.52 sec		554ms		python only
01_matmul		broadcasting		1.2ms			182us			python only
01_matmul		Einsum		324us			154us			uses pytorch
01_matmul		pytorch op		153us			50.7us		uses pytorch
02_fully_conn	Foundations		949ms		2.74ms		uses pytorch
02_fully_conn	Gradients		?? gave up		1.1sec			uses (10.2GB), 5.7GB of swap space (takes forever)
02_fully_conn	Refactor fwd		4.77sec		11.5ms		fits in RAM
02_fully_conn	Refactor bwd 	20mins		1.115 sec		uses (9.9GB),  4.5GB of swap space
02_fully_conn	Without einsum	19 sec			151msec
02_fully_conn	nn.Linear		10.1 sec		1.1sec
06_cuda_cnn_h	ConvNet		1min 11 sec		1.87 sec		uses 1.3GB fits in RAM
06_cuda_cnn_h	CUDA			50.3 sec		2.27 sec		uses CUDA 2.3GB fits in RAM
06_cuda_cnn_h	CUDA refactor	30.9 sec		2.29 sec		uses CUDA 2.3GB fits in RAM

I was also able to run some of the other notebooks, but didn’t save the timings.

OK, it’s a lot slower, but it works!

Will this work for YOLO, Resnet50, GANS, maybe for inference, but probably not for training. But to model a small problem, quickly and for free, definitely! This is not a toy system.
OK, so it’s possible, now how do I get one?

Follow the instructions here. The only step missing is how to download the fastai_docs notebooks from git, because that post is not in this private (for now) section of the fast.ai forum.

25 Likes

Great writeup!

Do you have a repo / docker image of the whole setup?

I’ve experimented with the Movidius on an RaspberryPi and as you mentioned, dealing with dependencies on lean edge systems is no fun.

Very cool, thanks so much for going through the pain, and sharing the outcome! I’ve wanted to play with one of these since I saw them announced, very interesting to see the performance. You’re right, it’s not a toy! I’m generally interested in making models more mobile/embedded/edge friendly but haven’t done a ton of experimentation on it (yet). I’m convinced this is the area that needs development to truly see the wide distribution and rollout of DL’s potential. A huge part of the world can’t afford a google cloud machine or a Linux box, and many currently untapped application areas (agriculture, remote medicine, field science, aerospace) would really benefit from powerful embedded DL systems.

2 Likes

Thanks, I agree. My next step is to build embedded systems based on fast.ai with these. @twairball asked about docker and a repo. Sorry, but if you follow the instructions here it includes complete build instructions and all of the scripts you need in a handy zip file. Not a repo but close enough to get you up and running!

5 Likes

For a great post describing the Jetson Nano performance (not for deep learning), see this post

1 Like

This is amazing work. That’s exactly what I was going to do when I got mine. I waited a bit too long to order mine, and I now I have to wait until next month.

It’s possible to make a docker image to run on ARM, but it doesn’t look like there’s an official nvidia-docker image on arm64 yet according to this thread

There is this project which works on the TX2, however which might be worth investigating.

1 Like

Can the JN boot from a USB port? I’m wondering if an NVME SSD can be attached via USB.

Update 11-Apr-2019: I just read the syonyk link mentioned above. He does use a USB SSD (256GB) to boot and shows benchmarks compared to SD card. He doesn’t mention the SSD model or interface. I suspect performance will be somewhat better for a low latency NVME SSD drive using a recent USB interface chip.

Agreed, I’m waiting to see if he has any luck with that.

My experiences with docker have not been great, but if you manage to get one working let us know.

Latest Results:
Decided to run a few more examples and quantify just how much slower the Jetson Nano really is.

I think the primary advantage of these kinds of systems is the ability to use them for inference on large complex models which would normally be slow to infer on. I don’t think they’re expected to be used for inference.

Although - side thought - I wonder if you could make a cluster of them & do distributed learning that was comparable with a large desktop GPU, but less expensive, or more power efficient.

Anyway - I feel like the “correct” application of these things is to dramatically improve the performance of real time image processing on video for self-driving cars, or drones, for example. Basically allowing you to use a full ResNet50 architecture with large weights, which e.g. an RPi or Arduino could never keep up with.

What are the chances of you doing some testing on inference times on desktop cpu vs. desktop gpu vs. jetson cpu vs. jetson gpu? :slight_smile: I know it’s a lot to ask…

2 Likes

The chances are good, If you have a notebook you’d like me to run on both, give me a gist of it and I’ll be happy to run it, side by side and post the results.

1 Like

Thanks for taking all the time to work this out @Interogativ

The instructions worked perfectly.

I miss not having nvidia-smi to monitor GPU usage, but stumbled on this great tool called jtop which you can get at https://github.com/rbonghi/jetson_stats

It gives you nice stats like:

And you can even plot your GPU utilisation in the terminal:

2 Likes

Oh, and if you’re looking for an enclosure and have access to a 3D printer, then I recommend this model. I got a friend to print it out for me and it works and looks great.

image

This particular variation has holes to mount wifi antennas, though at the moment I’ve just going with membrane ones.

I bought the M2 wifi / bluetooth card recommended in this post and it’s been pretty solid. Though the price of the card and the antennas is almost as much as I paid for the Nano in the first place!

Have you considered using Archiconda?