Memory Leak Training Imagenette/Harebrain?

aleDL · July 14, 2019, 1:15pm

I was trying to run the https://github.com/fastai/harebrain example from the lessons with swift run.

When training, memory usage keeps increasing until the run fails, whether on the CPU or the GPU. (Titan RTX with 24GB, and 64GB of system RAM)

I’m running the latest Swift Toolchains 0.4rc2 … and am getting the same issue on the Mac and Linux.
What’s the best way to debug this?

Below is the error when running on the GPU; it always fails on epoch 21.

Fatal error: OOM when allocating tensor with shape[128,64,64,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc: file /swift-base/tensorflow-swift-apis/Sources/TensorFlow/Bindings/EagerExecution.swift, line 299
Current stack trace:
0 libswiftCore.so 0x00007fa266e7b8b0 swift_reportError + 50
1 libswiftCore.so 0x00007fa266eeaaa0 swift_stdlib_reportFatalErrorInFile + 115
2 libswiftCore.so 0x00007fa266e12ace + 3738318
3 libswiftCore.so 0x00007fa266e12c47 + 3738695
4 libswiftCore.so 0x00007fa266be0c4d + 1436749
5 libswiftCore.so 0x00007fa266de7a78 + 3562104
6 libswiftCore.so 0x00007fa266be00a9 + 1433769
7 libswiftTensorFlow.so 0x00007fa26728b8e0 + 2672864
8 libswiftTensorFlow.so 0x00007fa2670f0a60 checkOk(:file:line:) + 461
9 libswiftTensorFlow.so 0x00007fa2670f7b90 TFE_Op.evaluateUnsafe() + 506
10 libswiftTensorFlow.so 0x00007fa2670f8400 TFE_Op.execute( + 132
11 libswiftTensorFlow.so 0x00007fa267101094 + 1056916
12 libswiftTensorFlow.so 0x00007fa26714c020 static Raw.div(: + 791
13 libswiftTensorFlow.so 0x00007fa2672f8d20 static Tensor<>./ infix(:_ + 54
14 libswiftTensorFlow.so 0x00007fa267368395 + 3576725
15 libswiftTensorFlow.so 0x00007fa2673ccbae + 3988398
16 libswiftTensorFlow.so 0x00007fa2673baeb5 + 3915445
17 libswiftTensorFlow.so 0x00007fa2673e8185 + 4100485
18 libswiftTensorFlow.so 0x00007fa267396ff9 + 3768313
19 run 0x00005635050e6ed0 + 798416
20 run 0x00005635050f020c + 836108
21 run 0x00005635050c5b1d + 662301
22 run 0x0000563505101872 + 907378
23 run 0x00005635050f2592 + 845202
24 run 0x00005635050f2712 + 845586
25 run 0x00005635050bfb67 + 637799
26 run 0x00005635050ed7d2 + 825298
27 run 0x0000563505146692 + 1189522
28 run 0x000056350514d221 + 1217057
29 run 0x00005635051463b1 + 1188785
30 run 0x000056350514d289 + 1217161
31 run 0x000056350512dbe3 + 1088483
32 run 0x000056350507375e + 325470
33 run 0x0000563505076cc2 + 339138
34 run 0x000056350506bcc6 + 294086
35 run 0x0000563505074651 + 329297
36 run 0x000056350506c5b4 + 296372
37 run 0x0000563505082ed2 + 388818
38 run 0x000056350506c219 + 295449
39 run 0x000056350506baec + 293612
40 run 0x000056350506c7b3 + 296883
41 run 0x0000563505082892 + 387218
42 libswiftTensorFlow.so 0x00007fa26738bbfa + 3722234
43 libswiftTensorFlow.so 0x00007fa2673d0544 + 4003140
44 run 0x0000563505071ee2 + 319202
45 run 0x00005635050764a5 + 337061
46 run 0x000056350514b304 + 1209092
47 run 0x000056350514e1a1 + 1221025
48 run 0x000056350514addc + 1207772
49 run 0x000056350514e1f1 + 1221105
50 run 0x00005635051445a3 + 1181091
51 run 0x00005635050a1b7a + 514938
52 run 0x00005635050a2ba4 + 519076
53 run 0x00005635050a178a + 513930
54 run 0x00005635050a2c72 + 519282
55 libswiftTensorFlow.so 0x00007fa2672603cc + 2495436
56 libswiftTensorFlow.so 0x00007fa26746a333 + 4633395
57 libswiftTensorFlow.so 0x00007fa26725f990 Differentiable.valueWithGradient(in:) + 1554
58 run 0x0000563505097921 + 473377
59 run 0x0000563505098d33 + 478515
60 run 0x0000563505099f01 + 483073
61 run 0x000056350518e530 + 1484080
62 libc.so.6 0x00007fa24b012ab0 __libc_start_main + 231
63 run 0x000056350504fa4a + 178762
Illegal instruction (core dumped)
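
On the question of how to debug this, one low-effort first step is to log the process's resident memory once per epoch, which shows whether host memory grows steadily and by roughly how much. The following is only a minimal sketch under stated assumptions: `residentMemoryKB` and `trainOneEpoch` are hypothetical names that are not part of harebrain or the TensorFlow APIs, and the helper reads `/proc/self/status`, so it works on Linux only and returns nil on the Mac. It reports host RAM, not GPU memory.

```swift
import Foundation

/// Hypothetical helper: resident set size (VmRSS) of this process in kB.
/// Returns nil when /proc/self/status is unavailable (for example on macOS).
func residentMemoryKB() -> Int? {
    guard let status = try? String(contentsOfFile: "/proc/self/status",
                                   encoding: .utf8) else {
        return nil
    }
    for line in status.split(separator: "\n") where line.hasPrefix("VmRSS:") {
        // The line has the form "VmRSS:   123456 kB"; the second field is the value.
        let fields = line.split(whereSeparator: { $0 == " " || $0 == "\t" })
        if fields.count >= 2 {
            return Int(fields[1])
        }
    }
    return nil
}

/// Placeholder for the real per-epoch training work (assumption).
func trainOneEpoch() {
    // Call the actual training step here.
}

var previousKB = residentMemoryKB()
for epoch in 1...3 {
    trainOneEpoch()
    if let nowKB = residentMemoryKB() {
        // Change in resident memory since the previous measurement.
        let deltaKB = previousKB.map { nowKB - $0 } ?? 0
        print("epoch \(epoch): RSS \(nowKB) kB (change \(deltaKB) kB)")
        previousKB = nowKB
    }
}
```

A steadily positive change from epoch to epoch would be consistent with a leak on the host side; a flat reading would suggest looking at GPU allocations instead.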

aleDL · July 14, 2019, 1:35pm

Tried running with the 0.3.1 toolchain but got a ton of code errors.

vguerra · July 15, 2019, 11:39am

Hello @aleDL,

A memory leak during the differentiation phase was fixed recently; the fix is supposed to address TF-621, which might be the same leak you are experiencing. What I am not sure about is whether rc2 contains this fix.

aleDL · July 15, 2019, 1:38pm

After looking at the issue you pointed to, I also tried running the 09_optimizer notebook with the nightly toolchain and had the same problem when starting learner.fit() in the notebook.

rahulbhalley · July 15, 2019, 2:47pm

Do you know when the OOM-fixed version of S4TF will go live on Google Colab? Or do you know where this kind of plan is discussed online? I am stuck. I was training a GAN for my book on S4TF, and it used to train on 0.3.1 (I used Colab for the GPU; I can’t run it locally on my MBP), but now it just goes OOM.

vguerra · July 17, 2019, 11:26am

Hello @rahulbhalley,

Richard is working on solving this issue (ref: https://github.com/tensorflow/swift-apis/issues/364).