Reference for parallel scan?

In Lesson 6 of part 1, @jeremy mentioned that scan could be implemented in parallel (also mentioning that it is a surprising and extraordinary result). I agree that it is surprising and extraordinary, and would love to have a reference to the paper.

Thanks!

There is some information here for the prefix sum example:

In general if the operation is associative there is an efficient implementation:

I like this article: https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html . There’s also a CUDA course that covers this: https://www.udacity.com/course/intro-to-parallel-programming--cs344