GPU-enabled Swift for TensorFlow build for Nvidia Jetson devices

bradlarson · May 5, 2019, 1:59pm

Within the install guide discussion, @Interogativ and I had been discussing how to get Swift for TensorFlow working on Nvidia’s Jetson devices, and I believe I finally have a fully GPU-enabled build operational. I wanted to pull this out into its own topic, in case anyone else was interested.

The Nvidia Jetson single-board computers are interesting for exploring inference at the edge, because they combine a relatively low-power ARM64 processor with CUDA-compatible mobile Nvidia GPUs in a small package. In particular, their new $99 Jetson Nano provides a Maxwell-based GPU supporting CUDA 10.0 and cuDNN 7.3.1 along with a quad-core CPU (9/10/2020: this is now CUDA 10.2 and cuDNN 8).

EDIT 9/10/2020: After a lengthy hiatus (due to issues with my build system), we finally have a new Swift for TensorFlow toolchain available that supports all Jetson devices running JetPack 4.4 (CUDA 10.2). This new toolchain can be downloaded here, and this mailing list announcement has more details and benchmarks.

I’ll leave the below for posterity, but you no longer need any special instructions for building a Swift for TensorFlow toolchain for the Jetson devices. The Swift toolchain and TensorFlow components now properly recognize and build for the ARM64 devices, and the JetPack installation provides a much more stable build environment than it did a year ago.

Here’s what I’d previously written:

Traditionally, it had been difficult to get a Swift toolchain building correctly on ARM64, but Neil Jones’ repository here has instructions on how to make that work now. Their latest builds didn’t have TensorFlow support or CUDA enabled, but with a few slight changes I was able to get that building. Here are two toolchains I’ve built and temporarily hosted:

Both of these work on the Jetson devices I’ve tried them on (Jetson Nano, Jetson Xavier), but they do require the latest JetPack (Nvidia’s OS / tools image). CUDA 10.0 and cuDNN 7.3.1 are pre-installed by JetPack, so you can skip over those install steps in the guide. I also found that I needed to install the following packages:

sudo apt-get install python3-venv python3-dev libcurl4-openssl-dev libfreetype6-dev

to get the Swift Jupyter kernel to install correctly. I may be missing a package or two in there.
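
Since the package list above may be incomplete, a small check like the following can report which of the named packages are still missing before you attempt the Jupyter kernel install. This is only a sketch: the helper names (`dpkg_installed`, `missing_packages`) are made up for illustration, and it assumes a Debian/Ubuntu-based image where `dpkg-query` is available.

```python
import subprocess

# Packages named in the post above; the author notes the list may be incomplete.
REQUIRED = [
    "python3-venv",
    "python3-dev",
    "libcurl4-openssl-dev",
    "libfreetype6-dev",
]


def dpkg_installed(package):
    """Return True if dpkg reports the package as fully installed."""
    try:
        result = subprocess.run(
            ["dpkg-query", "-W", "-f=${Status}", package],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
        )
    except FileNotFoundError:
        # dpkg-query itself is missing, so this is not a Debian/Ubuntu system.
        return False
    return result.returncode == 0 and result.stdout.strip() == "install ok installed"


def missing_packages(packages, is_installed=dpkg_installed):
    """Return the subset of packages that the checker reports as not installed."""
    return [name for name in packages if not is_installed(name)]
```

Calling `missing_packages(REQUIRED)` on the device returns the names that still need `apt-get install`. The `is_installed` parameter exists only so the logic can be exercised without touching the system package database.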

While the Jetson Nano has enough processing power and a CUDA-compatible GPU for doing training, it does have a problem with memory. It only has 4 GB of memory onboard, and shares that between CPU and GPU. On the Nano, once I’ve loaded up the Jupyter notebook server and Chromium browser, the system only has ~500 MB of available memory left. As a result, once I try to load a large CUDA tensor (such as is created when loading the MNIST dataset in one of the notebooks), the GPU runs out of available memory and allocation fails. This shouldn’t be as much of a problem on the more powerful Jetson devices, like the TX2 with its 8 GB of memory or the Xavier with 16.
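
To put rough numbers on this, here is a minimal sketch that compares the size of one tensor against the memory the kernel reports as available. It assumes the MNIST training images are held as a single 60000 x 28 x 28 tensor of 32-bit floats, which works out to 188,160,000 bytes (the same byte count that appears in the allocation warning in the MNIST log later in this thread). The `SAMPLE_MEMINFO` text is illustrative, not captured from a real device, and the function names are made up for this example.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor; 4 bytes per element assumes float32."""
    total = bytes_per_element
    for dim in shape:
        total *= dim
    return total


def mem_available_bytes(meminfo_text):
    """Extract MemAvailable from /proc/meminfo-style text and convert kB to bytes."""
    for line in meminfo_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "MemAvailable:":
            return int(fields[1]) * 1024
    raise ValueError("MemAvailable not found")


# Assumed shape of the MNIST training images (60,000 images of 28 x 28 pixels).
MNIST_TRAIN_SHAPE = (60000, 28, 28)

# Illustrative values only: roughly 500 MB available, as described in the post.
SAMPLE_MEMINFO = "MemTotal:        4059712 kB\nMemAvailable:     512000 kB\n"

needed = tensor_bytes(MNIST_TRAIN_SHAPE)
available = mem_available_bytes(SAMPLE_MEMINFO)
print("tensor needs %d bytes, %d bytes available (%.0f%% of the budget)"
      % (needed, available, 100.0 * needed / available))
```

On a real device, pass the contents of `/proc/meminfo` instead of the sample text. Because the Nano shares one pool of memory between CPU and GPU, a host copy and a device copy of the same tensor both come out of this budget, so a single-copy estimate like this one is a lower bound.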

The Jetson Nano wasn’t going to be the optimal training computer, but for $99 for a full computer capable of running accelerated Swift for TensorFlow it could be a good entry-level platform for experimentation. It’s certainly useful for edge inference, and it should be easy to transfer Swift for TensorFlow code and models developed elsewhere to these single-board computers. The TX2 and Xavier provide a lot more processing power for robotics and other applications.

I posted my build process over in the Swift for TensorFlow mailing list, for reference. Some of that is now obsolete, because no patches are needed to get the current Swift for TensorFlow toolchain to build on Jetson devices.

Interogativ · May 5, 2019, 3:58pm

Thanks Brad. I’ve been traveling for the last few days, so I haven’t had a chance to try your build. I’ll try it when I get back later this week. Thanks for the hard work!

sjama · August 2, 2019, 5:10pm

@bradlarson I followed your instructions and successfully installed Swift for TensorFlow with GPU support on my Nvidia Jetson Nano.

But because your build was based on S4TF 0.3, I tried to get the latest and greatest by building it myself. I failed miserably.

Any chance you would be willing to update this for the 0.4 release? I think you have the more powerful Jetson device at hand.

I would really appreciate it.

bradlarson · August 2, 2019, 5:35pm

Absolutely, I’ll spin up the Jetson Xavier I used for this and see if I can’t get a build going for 0.4. It should be fairly straightforward, but just takes a little time for the compilation. Unfortunately, I don’t have this at the office yet, but I’ll get it going tonight.

sjama · August 2, 2019, 6:05pm

Thanks @bradlarson looking forward to it.

sjama · August 2, 2019, 6:12pm

Also, Nvidia released JetPack 4.2.1 recently and that’s what I’m running on the Nano. Not sure if that’s relevant.

bradlarson · August 6, 2019, 1:31am

As a quick update, this might take a little longer than I thought. A lot changed internally between 0.3 and 0.4, so the patches I had aren’t working out of the box. I need to devote a little more time to this, so it might be a few more days to get this updated.

Sorry for the delay.

sjama · August 6, 2019, 5:38am

No worries, I was planning to work on it on the weekend anyway.

bradlarson · August 13, 2019, 9:18pm

I can’t edit the main post (it’s over two months old), but here’s a new CUDA-enabled Swift for TensorFlow toolchain for Jetson devices (1.2 GB). It is based on a nightly snapshot as of August 11, 2019, so it is slightly newer than the 0.4 release.

Sorry it took so long. I had to update a few elements of my build system, and it’s a ~16 hour build cycle on a Jetson Xavier. What’s nice is that no patches are required anymore to build for Jetson aarch64; the native Swift for TensorFlow toolchain now builds cleanly for these devices.

sjama · August 14, 2019, 3:29am

Excellent. Thanks @bradlarson, I really appreciate it.

In my failed attempt to build it on the Nano, I did notice that patches were no longer necessary.
16 hours on the Xavier? No wonder the Nano copped out after a few.

sjama · August 15, 2019, 2:40pm

@bradlarson not sure which models you use to test, but I have not been able to successfully run any model. This is the output from running the swift-apis README example.

Fatal error: CUDA runtime implicit initialization on GPU:0 failed. Status: unknown error: file /mnt/data/Development/swift-source/tensorflow-swift-apis/Sources/TensorFlow/Core/Runtime.swift, line 262
Current stack trace:
0    libswiftCore.so                    0x0000007f99ca1294 swift_reportError + 64
1    libswiftCore.so                    0x0000007f99d02df8 _swift_stdlib_reportFatalErrorInFile + 128
2    libswiftCore.so                    0x0000007f99a41248 <unavailable> + 1405512
3    libswiftCore.so                    0x0000007f99c2489c <unavailable> + 3385500
4    libswiftCore.so                    0x0000007f99a40a8c <unavailable> + 1403532
5    libswiftTensorFlow.so              0x0000007f9909c6f0 dumpTensorContent<A>(_:_:) + 0
6    libswiftTensorFlow.so              0x0000007f98f48a54 checkOk(_:file:line:) + 444
7    libswiftTensorFlow.so              0x0000007f990997dc _ExecutionContext.init() + 2460
8    libswiftTensorFlow.so              0x0000007f990995f4 _ExecutionContext.__allocating_init() + 52
9    libswiftTensorFlow.so              0x0000007f990995d8 <unavailable> + 2586072
10   libpthread.so.0                    0x0000007f9c739c00 <unavailable> + 60416
11   libswiftCore.so                    0x0000007f99cbeb60 swift_once + 112
12   libswiftTensorFlow.so              0x0000007f98f488c8 _ExecutionContext.global.unsafeMutableAddressor + 32
13   libswiftTensorFlow.so              0x0000007f98f48694 TFE_Op.init(_:_:) + 264
14   libswiftTensorFlow.so              0x0000007f98f583e4 static _ExecutionContext.makeOp(_:_:) + 308
15   libswiftTensorFlow.so              0x0000007f98f5832c makeOp(_:_:) + 168
16   libswiftTensorFlow.so              0x0000007f98fd72ac static Raw.mul<A>(_:_:) + 160
17   libswiftTensorFlow.so              0x0000007f990be290 static Tensor<>.* infix(_:_:) + 56
18   libswiftTensorFlow.so              0x0000007f98f3f6b0 static Tensor<>.* infix(_:_:) + 168
19   libswiftTensorFlow.so              0x0000007f98f3ed3c Tensor<>.init<A>(randomUniform:generator:lowerBound:upperBound:) + 884
20   libswiftTensorFlow.so              0x0000007f98f4093c Tensor<>.init<A>(glorotUniform:generator:) + 360
21   libswiftTensorFlow.so              0x0000007f98f4750c Dense.init<A>(inputSize:outputSize:activation:generator:) + 344
22   libswiftTensorFlow.so              0x0000007f98f47dfc Dense.init(inputSize:outputSize:activation:) + 408
25   swift                              0x00000000004d5c8c <unavailable> + 875660
Stack dump:
0.      Program arguments: /home/saeed/s4tf/latest/bin/swift -frontend -interpret simple.swift -Xllvm -aarch64-use-tbi -disable-objc-interop -color-diagnostics -module-name simple 
1.      Swift version 5.1-dev (LLVM 200186e28b, Swift 4df23c8e12)
/home/saeed/s4tf/latest/bin/swift[0x3f5e780]
Trace/breakpoint trap (core dumped)

jeremy · August 15, 2019, 9:41pm

@bradlarson I made the top post a wiki post so you should be able to edit it now.

bradlarson · August 16, 2019, 3:13pm

@sjama - Do you see any output in the console above that which would indicate memory exhaustion? I had real troubles with that on the Nano.

A secondary possibility is that there is a mismatch in CUDA / cuDNN versions between JetPack versions. I’m using an older JetPack, and I had assumed that linking against the cuDNN on that would allow for forwards compatibility with newer JetPacks. Maybe that’s not the case. I didn’t want to wipe my build system to upgrade to a new JetPack, but maybe I’ll need to try that.
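
One way to check for such a mismatch is to read the installed CUDA and cuDNN versions on the device and compare them with the versions a toolchain was built against. The sketch below only parses text. It assumes the CUDA version is recorded in a file such as `/usr/local/cuda/version.txt` and that the cuDNN version macros live in a header such as `/usr/include/cudnn.h`; both locations are assumptions and may differ between JetPack releases. The "built against" values are the ones quoted in the first post and may not apply to later builds, and the sample inputs are illustrative rather than taken from any particular JetPack.

```python
import re

# Versions quoted in the first post for the original toolchains (may not apply to later builds).
BUILT_AGAINST_CUDA = (10, 0)
BUILT_AGAINST_CUDNN = (7, 3, 1)


def parse_cuda_version(text):
    """Parse text like 'CUDA Version 10.0.326' into (major, minor), or None."""
    match = re.search(r"CUDA Version (\d+)\.(\d+)", text)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)))


def parse_cudnn_version(header_text):
    """Parse the CUDNN_MAJOR/MINOR/PATCHLEVEL macros into a tuple, or None."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"^\s*#define\s+%s\s+(\d+)" % name, header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


def describe_mismatches(cuda, cudnn):
    """Return human-readable notes for every version that differs from the build."""
    notes = []
    if cuda != BUILT_AGAINST_CUDA:
        notes.append("CUDA: installed %s, built against %s" % (cuda, BUILT_AGAINST_CUDA))
    if cudnn != BUILT_AGAINST_CUDNN:
        notes.append("cuDNN: installed %s, built against %s" % (cudnn, BUILT_AGAINST_CUDNN))
    return notes


# Illustrative inputs only, not the contents of any particular JetPack image.
sample_cuda = "CUDA Version 10.0.326\n"
sample_cudnn = "#define CUDNN_MAJOR 7\n#define CUDNN_MINOR 5\n#define CUDNN_PATCHLEVEL 0\n"

for note in describe_mismatches(parse_cuda_version(sample_cuda),
                                parse_cudnn_version(sample_cudnn)):
    print(note)
```

A reported difference does not by itself prove incompatibility; it only narrows down where to look.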

@jeremy - Thanks for the wiki status change, I’ll keep the main post updated with new builds.

sjama · August 16, 2019, 3:33pm

@bradlarson that was the entire output from the simple README model.
Here’s the output from the MNIST model, although I never expected MNIST to successfully train on the Nano, as you previously alluded to.

Reading data from files: train-images-idx3-ubyte, train-labels-idx1-ubyte.
Constructing data tensors.
2019-08-16 16:25:21.232663: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 188160000 exceeds 10% of system memory.
2019-08-16 16:25:21.493828: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2019-08-16 16:25:21.494532: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5597e80380 executing computations on platform Host. Devices:
2019-08-16 16:25:21.494588: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-08-16 16:25:21.500000: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-08-16 16:25:21.517124: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:973] ARM64 does not support NUMA - returning NUMA node zero
2019-08-16 16:25:21.517308: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: NVIDIA Tegra X1 major: 5 minor: 3 memoryClockRate(GHz): 0.9216
pciBusID: 0000:00:00.0
2019-08-16 16:25:21.517357: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-08-16 16:25:21.517487: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:973] ARM64 does not support NUMA - returning NUMA node zero
2019-08-16 16:25:21.517727: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:973] ARM64 does not support NUMA - returning NUMA node zero
2019-08-16 16:25:21.517817: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
Fatal error: CUDA runtime implicit initialization on GPU:0 failed. Status: unknown error: file /mnt/data/Development/swift-source/tensorflow-swift-apis/Sources/TensorFlow/Core/Runtime.swift, line 262
Current stack trace:
0    libswiftCore.so                    0x0000007f82ea6294 swift_reportError + 64
1    libswiftCore.so                    0x0000007f82f07df8 _swift_stdlib_reportFatalErrorInFile + 128
2    libswiftCore.so                    0x0000007f82c46248 <unavailable> + 1405512
3    libswiftCore.so                    0x0000007f82e2989c <unavailable> + 3385500
4    libswiftCore.so                    0x0000007f82c45a8c <unavailable> + 1403532
5    libswiftTensorFlow.so              0x0000007f832f86f0 dumpTensorContent<A>(_:_:) + 0
6    libswiftTensorFlow.so              0x0000007f831a4a54 checkOk(_:file:line:) + 444
7    libswiftTensorFlow.so              0x0000007f832f57dc _ExecutionContext.init() + 2460
8    libswiftTensorFlow.so              0x0000007f832f55f4 _ExecutionContext.__allocating_init() + 52
9    libswiftTensorFlow.so              0x0000007f832f55d8 <unavailable> + 2586072
10   libpthread.so.0                    0x0000007f82a39c00 <unavailable> + 60416
11   libswiftCore.so                    0x0000007f82ec3b60 swift_once + 112
12   libswiftTensorFlow.so              0x0000007f831a48c8 _ExecutionContext.global.unsafeMutableAddressor + 32
13   libswiftTensorFlow.so              0x0000007f831a4694 TFE_Op.init(_:_:) + 264
14   libswiftTensorFlow.so              0x0000007f831b43e4 static _ExecutionContext.makeOp(_:_:) + 308
15   LeNet-MNIST                        0x000000556769472c <unavailable> + 157484
16   LeNet-MNIST                        0x00000055676927e0 <unavailable> + 149472
17   LeNet-MNIST                        0x00000055676931f4 <unavailable> + 152052
18   LeNet-MNIST                        0x0000005567692950 <unavailable> + 149840
19   LeNet-MNIST                        0x000000556786699c <unavailable> + 2066844
20   libc.so.6                          0x0000007f684eb600 __libc_start_main + 224
21   LeNet-MNIST                        0x000000556768c494 <unavailable> + 124052
Trace/breakpoint trap (core dumped)

bradlarson · August 16, 2019, 3:43pm

Thanks for checking, I’ll flash my Nano with the latest JetPack and see if I can reproduce this on my end. This is sounding less like memory exhaustion and more like a CUDA problem. Could be the cuDNN mismatch, although that usually throws a different error. I’ll see what I can find to fix this.

bradlarson · September 10, 2020, 2:31pm

I’m very sorry about how long it took me to return to this, but intermittent issues with my Jetson Xavier build system kept slowing me down. Those have all been resolved, and advancements in the Swift toolchain, the TensorFlow components we use, and the JetPack OS images have made building new CUDA-enabled toolchains for the Jetson devices a lot simpler. I’ve edited the main post, but we have a new hosted Swift for TensorFlow toolchain for Jetson devices that currently aligns with our 0.11 release. This requires JetPack 4.4 on your Jetson device (CUDA 10.2, cuDNN 8). I’ve posted some performance benchmarks in the mailing list announcement here.

This includes full support for our new XLA-backed X10 backend, which should help models better squeeze into the limited memory of the Jetson boards. To my knowledge, this might be one of the first examples of XLA running on an ARM64 device. I will caution that we’re labeling this toolchain as experimental, because it hasn’t been exhaustively tested.

It sounds like there’s community interest in maintaining these builds going forward, so we can see about releasing these more regularly alongside our normal point releases.