I was reading the Knowledge distillation paper and it said “we trained a single large neural net with two hidden layers of 1200 rectified linear hidden units on all 60,000 training cases”

I thought that each layer had 1 activation. So what's with the 1200 ReLUs?

A layer doesn't have a single activation — it is made up of many units (neurons), and each unit produces its own activation. "Two hidden layers of 1200 rectified linear hidden units" means each of the two hidden layers contains 1200 units whose activation function is the ReLU, so every forward pass rectifies 1200 values per hidden layer, not 600.

Note also that units are not the same thing as parameters. The parameters are the weights and biases on the connections between layers; a net with 1200-unit layers has far more than 1200 parameters.
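To make the unit/activation/parameter distinction concrete, here is a minimal numpy sketch of a net with two 1200-unit ReLU hidden layers. The input size 784 and output size 10 are assumptions (standard MNIST shapes, which is the dataset the paper uses):

```python
import numpy as np

# Assumed MNIST shapes: 784 input pixels, 10 output classes.
layer_sizes = [784, 1200, 1200, 10]

rng = np.random.default_rng(0)
weights = [rng.standard_normal((n_in, n_out)) * 0.01
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x):
    # ReLU after each hidden layer; the final layer stays linear (logits).
    for i, (W, b) in enumerate(zip(weights, biases)):
        x = x @ W + b
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)  # ReLU: one activation per unit
    return x

# Each hidden layer produces 1200 activations, one per unit:
h = np.maximum(rng.standard_normal(784) @ weights[0] + biases[0], 0.0)
print(h.shape)  # → (1200,)

# The parameters are the weights and biases, not the units:
n_params = sum(W.size + b.size for W, b in zip(weights, biases))
print(n_params)  # → 2395210
```

So each hidden layer rectifies 1200 values on every forward pass, while the trainable parameter count (784·1200 + 1200·1200 + 1200·10 weights, plus biases) runs into the millions.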