Fast file enumeration in Swift

Hi all.

Jeremy pointed out in passing that the fastai get_files function was enumerating the imagenette files faster than the Swift fetchFiles functions (which uses mxcl’s Path library). I got curious about why and had a look.

First, I checked if removing Path speeded up the Swift script. It did, by about 30%, apparently because the Path initializer which takes URLs was slow. However, this still left the Swift function much slower than the Python one. I did a little more digging and here’s what I found:

  • the Python code is faster on Linux.
  • the Swift code is faster on macOS
  • the Python code is based on os.walk which is based on the scandir family of system calls.
  • the Swift code (and Path) is based on Foundation, which is based on fts_open and friends (at least on Linux. I didn’t check the Darwin version).

Seems like the thing to do is to try manually recursing and using other bits of Foundation API, to write a Swift file enumerator based on scandir, or to update Foundation to use scandir on Linux.

The scripts are here if anyone is curious: https://github.com/algal/get_files

4 Likes

This was bothering me too. I adapted a directory walking function I wrote in C years ago, and this is what it looks like in Swift using C-interop:

import Glibc

public func traverse(_ path: String, recurse: Bool = false, handler: ((String) -> ())) {
    let sb: UnsafeMutablePointer<stat> = UnsafeMutablePointer<stat>.allocate(capacity: 1)
    lstat(path, sb)
    let isFile = sb.pointee.st_mode & S_IFREG == S_IFREG
    let isDir = sb.pointee.st_mode & S_IFDIR == S_IFDIR
    //TODO: realpath() links
    sb.deallocate()
    
    if isFile { handler(path) }
    if isDir && recurse {
        guard let dirp = opendir(path) else { return }
        while let dir_entry = readdir(dirp) {
            let entryName = withUnsafeBytes(of: dir_entry.pointee.d_name) { (rawPtr) -> String in
                let ptr = rawPtr.baseAddress!.assumingMemoryBound(to: CChar.self)
                return String(cString: ptr)
            }
            guard entryName != "." else { continue }
            guard entryName != ".." else { continue }
            
            let newPath = path.appendingPathComponent(entryName)
            traverse(newPath, recurse: recurse, handler: handler)
        }
        closedir(dirp)
    }
}

Despite the obscure pointer types, it’s still quite legible. This is fetchFiles updated to use it:

public func fetchFiles_traverse(path: Path, recurse: Bool = false, extensions: [String]? = nil) -> [Path] {
    var res: [Path] = []
    traverse(path.string, recurse: recurse) { filename in 
        if extensions == nil || extensions!.contains(filename.pathExtension.lowercased()) {
            res.append(Path(filename) ?? Path.cwd/filename)
        }
    }
    return res
}

And these were my results inside my Ubuntu Jupyter environment:

  • Swift - Path.ls(): 738.35ms
  • Swift - traverse: 147.56ms
  • Python - scandir: 42ms (!!)

We are converting C strings to Swift Strings to Paths, but still. I guess that if we used more aggressive compiler optimizations we could squeeze some performance, but this is possibly showing the better performance of scandir(3) vs readdir(3) in Linux.

1 Like

That code looks good @pcuenq. There’s some useful notes on performance of the Python version in ‘Rationale’ here. Your code looks to be doing the right thing (ie no unnecessary stat). Maybe try removing Path and just do strings, and see if that fixes perf? If so, we know where to look…

Very cool. It definitely seems like Python is doing it the right way with scandir. I just cleaned up the little benchmarking repo I setup, so it now times the Python function only and not process setup code. And I can’t reproduce the good Swift macOS timing I got using Foundation before – so I am clearly confused. Anyway, if you’re hacking on your readdir-based version, you might find it a useful for comparisons.

1 Like

I can’t make sense of it, but removing Path and appending results to an array of String has worse performance in my case (~500ms).

I tried to skip "." and ".." before converting to String and it didn’t help. I wanted to try with a tree walker version that uses C strings (and only converts to String when the handler is called), but it’s a pain - sprintf and family are not available; unsafe pointer conversions are confusing; etc.

So to recap, Path.ls() uses Swift Foundation under the hood. I found Foundation’s enumerate method was surprisingly slow on Linux. (Foundation on Linux is provided by the “core libs” library, a completely different implementation from Foundation on macOS and iOS, by the way.)

News: I asked about this on another forum and a gentleman named Mark Rowe got curious, did some investigating, discovered a bug in Foundation itself where it’s calling stat twice, fixed it, found it improved performance by 40%, and raised a PR. It’s here: https://github.com/apple/swift-corelibs-foundation/pull/2193

So in good time file enumeration in foundation should speed up, along with the Path library which depends on it.

Here was the bug: the problem is the tricky API URL(fileURLWithPath:). If you call it without passing an additional argument to indicate if the file system path is a directory, then it hits the file system to check. So to write fast code that creates file URLs, you should tell supply the info of whether it’s a file or directory at initialization time.

It would require more careful benchmarking to assess if this is responsible for all of the performance issues in the Swift version, or if UTF String encoding or some other issue is also relevant.

2 Likes

Sounds like good progress! Note that there shouldn’t be any stat call until/unless you actually need the info it provides, however (see for instance the code from @pcuenq above). That’s why Python has a separate DirEntry class that represents the info available without doing stat.

Yeah. It seems like the underlying NSURLDirectoryEnumerator uses fts_read, which provides per-file information, and does not use fts_children, which like scandir provides stat information on all of a a directory’s contents via one call. I put a note on the PR pointing this out. Maybe someone will have a go at that later. The Linux implementation of the Foundation API is quite new, so I suppose a lot has not been optimized.

You are right, we only need the first stat and then we can leverage the d_type field (which is not supported in all filesystems) in the dirent struct returned by readdir. I made the change and got it down to 140ms, still slower. And if I reset the kernel and re-run the cells, the first time I get ~500ms, so there’s some additional caching going on somewhere.

This is the updated code, in case you are interested. We could make the d_type argument a Swift enum and decorate it with helper functions, but we wouldn’t be gaining much clarity in this example.

public func traverse(_ path: String, recurse: Bool = false, d_type: UInt8? = nil, handler: ((String) -> ())) {
    let isFile, isDir: Bool

    // Links ignored for now. Assume d_type is either nil, DT_REG or DT_DIR
    if let d_type = d_type {
        // We already have the information
        isFile = d_type == DT_REG
        isDir = d_type == DT_DIR
    } else {
        // stat(). We only need to do this the first time
        let sb: UnsafeMutablePointer<stat> = UnsafeMutablePointer<stat>.allocate(capacity: 1)
        lstat(path, sb)
        isFile = sb.pointee.st_mode & S_IFREG == S_IFREG
        isDir = sb.pointee.st_mode & S_IFDIR == S_IFDIR
        sb.deallocate()
    }
    
    if isFile { handler(path) }
    if isDir && recurse {
        guard let dirp = opendir(path) else { return }
        while let dir_entry = readdir(dirp) {
            let entryName = withUnsafeBytes(of: dir_entry.pointee.d_name) { (rawPtr) -> String in
                let ptr = rawPtr.baseAddress!.assumingMemoryBound(to: CChar.self)
                return String(cString: ptr)
            }
            guard entryName != "." else { continue }
            guard entryName != ".." else { continue }
            
            var d_type: UInt8? = nil
            if dir_entry.pointee.d_type != DT_UNKNOWN {
                d_type = nil
            }
            let newPath = path.appendingPathComponent(entryName)
            traverse(newPath, recurse: recurse, d_type: d_type, handler: handler)
        }
        closedir(dirp)
    }
}
1 Like