Kernel crash when fitting model in Swift

(Thom M) #1

Not quite sure where to start debugging this one…

I’m replicating the budding Audio library in Swift, and have managed to load audio into Tensors and put them into a DataBunch, but the kernel dies (reliably) when I try to train the model.

I’m using the model arch from 08_data_block, and when I call learner.fit(1), the progress bar pops up, and the kernel crashes.

To be clear, the model trains properly in both 08_data_block and 11_imagenette on my setup.

In the notebook I get this error:

Fatal error: No algorithm worked!: file /swift-base/swift/stdlib/public/TensorFlow/CompilerRuntime.swift, line 2108
Current stack trace:
0    libswiftCore.so                    0x00007fa079e12e00 _swift_stdlib_reportFatalErrorInFile + 115
1    libswiftCore.so                    0x00007fa079d5b06c <unavailable> + 3035244
2    libswiftCore.so                    0x00007fa079d5b15e <unavailable> + 3035486
3    libswiftCore.so                    0x00007fa079ba2a12 <unavailable> + 1231378
4    libswiftCore.so                    0x00007fa079d27d42 <unavailable> + 2825538
5    libswiftCore.so                    0x00007fa079ba1ef9 <unavailable> + 1228537
6    libswiftTensorFlow.so              0x00007fa077219022 <unavailable> + 598050
7    libswiftTensorFlow.so              0x00007fa077217770 checkOk(_:file:line:) + 508
8    libswiftTensorFlow.so              0x00007fa07723be70 _TFCCheckOk(_:) + 81
9    libswiftTensorFlow.so              0x00007fa07723be60 _swift_tfc_CheckOk + 9

In the jupyter console I get this error:

$ python: /swift-base/swift/include/swift/SIL/AbstractionPattern.h:299: void swift::Lowering::AbstractionPattern::initSwiftType(swift::CanGenericSignature, swift::CanType, swift::Lowering::AbstractionPattern::Kind): Assertion `signature || !origType->hasTypeParameter()' failed.

Any ideas on where to start? The notebook is in the fastai_audio branch here.

1 Like

(brett koonce) #2

I got this as well when trying to use resnet34 on lesson 11 earlier, for what it’s worth.

0 Likes

(Thom M) #3

Bumping this one. The notebook is merged into the fastai_docs master branch (I didn’t really mean to include it in the PR, and would be happy to remove it until it works!). The crash is still happening with the 0.3.1 release, with the same error messages.

For what it’s worth, this is the function in SIL/AbstractionPattern.h whose assertion is failing:

  void initSwiftType(CanGenericSignature signature, CanType origType,
                     Kind kind = Kind::Type) {
    assert(signature || !origType->hasTypeParameter());
    TheKind = unsigned(kind);
    OrigType = origType;
    GenericSig = CanGenericSignature();
    if (OrigType->hasTypeParameter())
      GenericSig = signature;
  }

…but I don’t know which Swift code is calling this function. TBH I’m most interested in how I could debug this. I’m going to try setting this up in Xcode with a non-CUDA toolchain and see if that gets me anywhere, but I’m not sure whether that’s the best path.

Would appreciate any help with either educated guesses about what’s going wrong (my gut says it’s something to do with tensor types) or tips towards an effective debugging mechanism (e.g. are there log files or something?).

Cheers again :slight_smile:

0 Likes

(brett koonce) #4

Sorry, I wasn’t clear the other day about what I meant. When I switched from the xresnet18 to the xresnet34 variant I got this error as well, which suggests to me that we’re running out of memory, and the process is then just dying and throwing a weird error as a result.

I’d suggest running the CPU-only version with Jupyter and seeing if your code works there, to rule out memory as the issue. I tried to do it here, but my install is acting up.

PS: your code is using shellCommand, which isn’t imported by default in your notebook right now. You might open a PR that adds it, or make sure 00_load_data.swift gets imported somewhere.

0 Likes

(Thom M) #5

Cheers Brett. FWIW it still crashes if I change it to an xresnet18.

I’ll give the CPU-only version a crack.

ps. your code is using shellCommand,

Yep, it’s just that I haven’t updated it in a few weeks; it’s using the .shell extension now.

0 Likes

(Marc Rasi (S4TF Team)) #6

The error information you get in Jupyter when you hit a compiler bug is not great. Trying to compile this as a SwiftPM executable might get you some more useful information.

@pcuenq recently added a feature to https://github.com/latenitesoft/NotebookExport that lets you export notebook cells to a SwiftPM executable quite easily. You could try using that to create one, and then swift run <name> it.
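For what it’s worth, the shell side of that workflow might look roughly like the sketch below. The directory and target names here are purely illustrative; the real ones come from however NotebookExport writes out the package, so check its README.

```
# Hypothetical names -- NotebookExport determines the actual package layout.
cd ExportedNotebook          # the SwiftPM package produced by the export
swift build                  # compile; full compiler diagnostics go to stderr
swift run ExportedNotebook   # run the executable target outside Jupyter
```

Running it this way means compiler crashes and assertions surface directly in your terminal instead of just killing the kernel.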

3 Likes

(Thom M) #7

Thanks Marc. I’m just getting time to look at this again. I’ve wrapped the code up into an executable and (after a little bit of tweaking) it compiles successfully. Unfortunately, when run it throws pretty much the same error as I saw in the notebook earlier:

Fatal error: No algorithm worked!: file /swift-base/swift/stdlib/public/TensorFlow/CompilerRuntime.swift, line 2123
Current stack trace:
0    libswiftCore.so                    0x00007fbb021eb4a0 _swift_stdlib_reportFatalErrorInFile + 115
1    libswiftCore.so                    0x00007fbb0213330c <unavailable> + 3035916
2    libswiftCore.so                    0x00007fbb021333fe <unavailable> + 3036158
3    libswiftCore.so                    0x00007fbb01f7a6c2 <unavailable> + 1230530
4    libswiftCore.so                    0x00007fbb02100292 <unavailable> + 2826898
5    libswiftCore.so                    0x00007fbb01f79ba9 <unavailable> + 1227689
6    libswiftTensorFlow.so              0x00007fbb01024572 <unavailable> + 599410
7    libswiftTensorFlow.so              0x00007fbb01022cc0 checkOk(_:file:line:) + 508
8    libswiftTensorFlow.so              0x00007fbb01047ad0 _TFCCheckOk(_:) + 81
9    libswiftTensorFlow.so              0x00007fbb01047ac0 _swift_tfc_CheckOk + 9
Illegal instruction (core dumped)

Is there a way for me to go deeper into this? I (unsurprisingly) can’t find many Linux command-line Swift debugging/stack-tracing tools… @pcuenq or @asparagui, do you know of any Swift compilation flags I can set, or logs I can check, that might point me in the right direction?

In the meantime, I’m off to try to set up the CPU-only version.

@asparagui, as for running out of memory, I don’t see how I could be hitting that limit: these tensors are very small, it still crashes with a batch size of 8, and nvidia-smi doesn’t show any memory pressure before the kernel dies…

1 Like

(Thom M) #8

Minor detail, but a possible clue: it looks like the error is happening “below” Swift proper. In order to get the notebook to compile and run as a module, I had to wrap the fit line thus:

do {
    try learner.fit(1)
} catch {
    print("Unexpected error: \(error).")
}

But the error I got in the shell didn’t include that “Unexpected error:” line, so the crash never surfaced as a catchable Swift error. It looks like the failure happened down in TF (or whatever’s just underneath the surface…), which is what actually barfed.

A bit of a chink in the armour of the promised “nice easy error messages that link directly to your code” advantages of S4TF…! (It’s OK, I’m still a fan.)

1 Like

(Brennan Saeta (S4TF Team)) #9

Hey ThomM!

Yeah, S4TF right now runs on top of the existing TensorFlow C++ runtime. This runtime doesn’t have the diagnostic information we want properly plumbed through everywhere. In the second half of this year, we’re going to be re-doing significant fractions of the software stack, based around MLIR. MLIR is designed to have perfect source location tracking, which will very much help with exactly the error messages you’re encountering.

Hope that helps provide a bit more context. (In short, you’re absolutely right that the software stack today is nowhere close to fulfilling its promise. We’re still excited about the future; it is just going to take us a little while to get there…) If you have any further questions / concerns, please don’t hesitate to reach out! :slight_smile:

All the best,
-Brennan

3 Likes

(Thom M) #10

Thanks Brennan, that is useful context! All the more reason to be excited for the future. I look forward to it.

In the meantime I’m going to leave this alone for a little while; there’s enough non-modeling work to be done around audio to keep me busy until things have settled down on the NN layer front!

1 Like