Ok, here is the link to the cleaned-up notebook.
Also, I tried one of the new convnextv2_nano models, keeping the same style and content weights and using all the GELU layers, and got wildly different (and bad) results. I haven't had a chance to investigate why, though.