Reference for parallel scan?

In Lesson 6 of part 1, @jeremy mentioned that scan could be implemented in parallel (also mentioning that it is a surprising and extraordinary result). I agree that it is surprising and extraordinary, and would love to have a reference to the paper.


There is some information here for the prefix sum example:

In general if the operation is associative there is an efficient implementation:

I like this article: . There’s also a CUDA course that covers this: