Swift Multithreading: Very poor performance. Beware

(Stephen Johnson) #1

One of the potential upsides of having FastAI in Swift would be the ability to make better use of multithreading than is possible in Python. However, while working on a small project, I encountered extremely poor performance (multiple times slower) when I tried to add multithreading to speed it up. The single-threaded version far outperformed my multithreaded version, even when the multithreaded version was utilizing all seven of my i7 cores. In my application the threads share a common array (a struct), with each thread only reading the data and performing some other work based on the array’s contents. So there is no writing, and no added synchronization since there are no updates. Yet for some reason I have yet to figure out, this results in the extremely poor performance. Having each thread make its own copy of the array, or passing a copy to the thread, resolves the issue.

Looking at the Swift source, I don’t see any threading synchronization going on, but I do see a lot of checks for reference uniqueness; those use a Builtin that I haven’t yet cracked open to try to understand. I mention this issue because I assume any multithreaded implementation of FastAI may share DataSources, DataLoaders, etc. and could suffer the same fate. If anyone knows the underlying cause of this issue, that would be great.

Here is a simple sample program which illustrates the issue:

import Foundation

class SomeThread : Thread {

    private let numbers: [Double]

    public init(numbers: [Double]) {
        self.numbers = numbers//.map { $0 } // Remove the first comment to fix the issue
    }

    override func main() {

        // do some calculations
        for loop in 1...5 {
            var sum = 0.0
            for num in numbers {
                sum += num
            }
            
            var min = numbers.first!
            for num in numbers {
                if num < min {
                    min = num
                }
            }

            var max = numbers.first!
            for num in numbers {
                if num > max {
                    max = num
                }
            }

            var squareRoots = [Double]()
            for num in numbers {
                squareRoots.append(num.squareRoot())
            }
            print("Loop \(loop) done.")
        }
    }

}

var numbers = [Double]()
for _ in 1...1_000_000 {
    numbers.append(Double.random(in: 0...1))
}

var threads = [SomeThread]()
for _ in 1...10 {
    let thread = SomeThread(numbers: numbers)
    threads.append(thread)
    thread.start()
}

let start = Date()
outer: repeat {
    for thread in threads {
        if !thread.isFinished {
            Thread.sleep(forTimeInterval: 0.01)
            continue outer
        }
    }
    break
} while true
let end = Date()
print(start)
print(end)

Here’s the output with the issue. Notice how the loop outputs proceed in lockstep, with no interleaving between threads. This took almost 2 minutes to run.

root@27e01c28a48d:/swiftthreads# swift run
Loop 1 done.
Loop 1 done.
Loop 1 done.
Loop 1 done.
Loop 1 done.
Loop 1 done.
Loop 1 done.
Loop 1 done.
Loop 1 done.
Loop 1 done.
Loop 2 done.
Loop 2 done.
Loop 2 done.
Loop 2 done.
Loop 2 done.
Loop 2 done.
Loop 2 done.
Loop 2 done.
Loop 2 done.
Loop 2 done.
Loop 3 done.
Loop 3 done.
Loop 3 done.
Loop 3 done.
Loop 3 done.
Loop 3 done.
Loop 3 done.
Loop 3 done.
Loop 3 done.
Loop 3 done.
Loop 4 done.
Loop 4 done.
Loop 4 done.
Loop 4 done.
Loop 4 done.
Loop 4 done.
Loop 4 done.
Loop 4 done.
Loop 4 done.
Loop 4 done.
Loop 5 done.
Loop 5 done.
Loop 5 done.
Loop 5 done.
Loop 5 done.
Loop 5 done.
Loop 5 done.
Loop 5 done.
Loop 5 done.
Loop 5 done.
2019-05-24 19:47:11 +0000
2019-05-24 19:49:07 +0000

With the fix, this runs in 14 seconds, and notice how the loop numbers start to interleave.

root@27e01c28a48d:/swiftthreads# swift run
Loop 1 done.
Loop 1 done.
Loop 1 done.
Loop 1 done.
Loop 1 done.
Loop 1 done.
Loop 1 done.
Loop 2 done.
Loop 1 done.
Loop 2 done.
Loop 2 done.
Loop 1 done.
Loop 2 done.
Loop 1 done.
Loop 2 done.
Loop 2 done.
Loop 2 done.
Loop 3 done.
Loop 2 done.
Loop 3 done.
Loop 3 done.
Loop 2 done.
Loop 3 done.
Loop 2 done.
Loop 3 done.
Loop 3 done.
Loop 3 done.
Loop 4 done.
Loop 3 done.
Loop 4 done.
Loop 4 done.
Loop 3 done.
Loop 4 done.
Loop 3 done.
Loop 4 done.
Loop 4 done.
Loop 4 done.
Loop 5 done.
Loop 4 done.
Loop 5 done.
Loop 5 done.
Loop 4 done.
Loop 5 done.
Loop 4 done.
Loop 5 done.
Loop 5 done.
Loop 5 done.
Loop 5 done.
Loop 5 done.
Loop 5 done.
2019-05-24 19:51:01 +0000
2019-05-24 19:51:15 +0000
0 Likes

(Parker Schuh) #2

This is very interesting. I haven’t quite been able to verify that this is exactly what happens, but it seems like the array is getting copied every time it is accessed, which is why your .map {} solves the problem. I just ran some tests and couldn’t find any tricks to make the problem go away. It may be a problem with Swift’s begin_access system. I’ll update if I find anything else.
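One way to see whether a copy has actually happened is to check whether two arrays point at the same underlying buffer. This is just a diagnostic sketch, not part of the original code: the `sharesStorage` helper is something I made up, and it compares base addresses inside nested `withUnsafeBufferPointer` closures (the pointers are only valid inside the closures, so the comparison has to happen there).

```swift
import Foundation

// Returns true if both arrays are backed by the same copy-on-write buffer.
// Pointers from withUnsafeBufferPointer are only valid inside the closure,
// so the comparison is done in the innermost scope.
func sharesStorage(_ a: [Double], _ b: [Double]) -> Bool {
    a.withUnsafeBufferPointer { pa in
        b.withUnsafeBufferPointer { pb in
            pa.baseAddress == pb.baseAddress
        }
    }
}

let original = [Double](repeating: 1.0, count: 1_000)
let shared = original            // plain assignment: no copy yet, same buffer
let copied = original.map { $0 } // map builds a brand-new buffer

print(sharesStorage(original, shared)) // true: storage is shared
print(sharesStorage(original, copied)) // false: independent copy
```

This confirms why the `.map { $0 }` workaround helps: it severs the shared buffer, so each thread touches memory no other thread is reference-counting.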

1 Like

(Brad Larson) #3

The first thing I’d note is that you’ll want to make sure you’re building for release with optimizations on (-O) before benchmarking Swift code. You’ll see wildly different performance compared to debug builds without optimization (-Onone). For example, on my quad-core i5 I see the following results at the two optimization levels:

-Onone, no map: 71.6 seconds
-Onone, map: 11.7 seconds

-O, no map: 0.14 seconds
-O, map: 0.18 seconds

The map numbers varied a lot more than the non-map ones, but they were generally higher, as expected given the insertion of that additional operation.

For a true timing, you’ll want to make sure to move your

let start = Date()

to be before

var threads = [SomeThread]()

because work will be starting as soon as you fire off those threads.

As a design suggestion, I’d highly recommend using Grand Central Dispatch instead of manual thread allocation. GCD (Dispatch) uses thread pooling that should scale for available cores on your system, and its operation-based API really simplifies asynchronous code.

For example, you can do an equivalent to the above using the following:

import Foundation
import Dispatch

var numbers = [Double]()
for _ in 1...1_000_000 {
    numbers.append(Double.random(in: 0...1))
}

let semaphore = DispatchSemaphore(value: 0)

let calculationBlock = {
    
    for loop in 1...5 {
        var sum = 0.0
        for num in numbers {
            sum += num
        }
        
        var min = numbers.first!
        for num in numbers {
            if num < min {
                min = num
            }
        }
        
        var max = numbers.first!
        for num in numbers {
            if num > max {
                max = num
            }
        }
        
        var squareRoots = [Double]()
        for num in numbers {
            squareRoots.append(num.squareRoot())
        }
        print("Loop \(loop) done.")
    }
    
    semaphore.signal()
}

let start = Date()

for _ in 1...10 {
    DispatchQueue.global().async(execute: calculationBlock)
}

for _ in 1...10 {
    semaphore.wait()
}

let end = Date()
print("Time elapsed: \(end.timeIntervalSince(start))")
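For what it’s worth, if you don’t need the flexibility of firing off blocks individually, `DispatchQueue.concurrentPerform(iterations:execute:)` blocks the calling thread until every iteration finishes, so the semaphore bookkeeping disappears entirely. A minimal sketch of the same workload (just the sum, for brevity):

```swift
import Foundation
import Dispatch

var numbers = [Double]()
for _ in 1...1_000_000 {
    numbers.append(Double.random(in: 0...1))
}

let start = Date()

// concurrentPerform runs the closure `iterations` times across the global
// concurrent pool and returns only when every iteration has completed,
// so no semaphore is needed to wait for the workers.
DispatchQueue.concurrentPerform(iterations: 10) { index in
    var sum = 0.0
    for num in numbers {
        sum += num
    }
    print("Worker \(index) sum: \(sum)")
}

let end = Date()
print("Time elapsed: \(end.timeIntervalSince(start))")
```

Note that the iterations may run in any order and should each be reasonably chunky, since there’s some dispatch overhead per iteration.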
4 Likes

(Stephen Johnson) #4

The example was meant to illustrate the issue being observed. It was deliberately very simple and contrived, to remove all other possible dependencies that might cause the issue, hence not using any other libraries like GCD.

0 Likes

(Stephen Johnson) #5

Indeed, running in release mode, which turns on optimizations, gets it running in milliseconds instead of almost 2 minutes. I can’t imagine why there should be such a wide variance between the two modes. :thinking:

swift run -c release

0 Likes

(Brad Larson) #6

The difference in benchmark performance between -Onone and -O optimization levels in Swift is something that’s tripped up a lot of people, including myself on more than one occasion. It’s been a topic of discussion since the language was first unveiled.

There’s been some fun exploration of this in Stack Overflow questions like this one. You can see that, at the time that question was asked, the -Onone level would leave in a bunch of retain / release calls that were visible in the profiling output. I could see how this might also defeat or interfere with the copy-on-write optimizations for Array in the above, leading to the odd behavior we see here.
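For anyone curious about the uniqueness checks Stephen spotted in the Swift source: the public face of that Builtin is `isKnownUniquelyReferenced(_:)`, and a toy copy-on-write wrapper shows roughly how Array uses it. This is a simplified sketch for illustration, not Array's actual implementation (the real one lives in the standard library and goes through a Builtin directly):

```swift
// A toy copy-on-write box, mimicking the uniqueness check that Array
// performs internally before mutating its buffer.
final class Storage {
    var values: [Double]
    init(_ values: [Double]) { self.values = values }
}

struct CoWArray {
    private var storage: Storage
    init(_ values: [Double]) { storage = Storage(values) }

    var values: [Double] { storage.values }

    mutating func append(_ v: Double) {
        // Copy only if some other reference shares this buffer.
        if !isKnownUniquelyReferenced(&storage) {
            storage = Storage(storage.values)
        }
        storage.values.append(v)
    }
}

var a = CoWArray([1, 2, 3])
let b = a          // struct copy: both now share one Storage instance
a.append(4)        // uniqueness check fails, so a gets its own copy first
print(a.values)    // [1.0, 2.0, 3.0, 4.0]
print(b.values)    // [1.0, 2.0, 3.0]
```

The relevant point for this thread is that the check rides on the reference count, and reference counting is atomic. Ten threads all retaining and releasing one shared buffer means ten cores hammering the same cache line, which is one plausible reason the unoptimized shared-array version is so much slower than the per-thread-copy version.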

2 Likes