CLIP embeddings + tokenizer

Hello. I was wondering why we use padding="max_length" as opposed to padding=True. The latter gives shorter embeddings of shape (4, 17, 768) rather than (4, 77, 768) in the stable_diffusion.ipynb example.
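To make the shape difference concrete, here is a toy sketch in plain Python of the two padding modes. The `pad_batch` helper, the token ids, and the pad id of 0 are all made up for illustration. It is not the Hugging Face tokenizer, only a rough mimic of padding to a fixed length versus padding to the longest prompt in the batch, and it assumes every prompt is already shorter than `max_length`.

```python
def pad_batch(batch, mode, max_length=77, pad_id=0):
    """Toy stand-in for tokenizer padding (hypothetical helper, not the HF API)."""
    if mode == "max_length":
        target = max_length                       # always pad to the fixed length
    elif mode == "longest":
        target = max(len(seq) for seq in batch)   # pad to the longest prompt
    else:
        raise ValueError(f"unknown padding mode: {mode}")
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in batch]
    return input_ids, attention_mask


# Four fake prompts with 17, 9, 12 and 5 tokens (ids are arbitrary).
batch = [list(range(1, n + 1)) for n in (17, 9, 12, 5)]

fixed_ids, fixed_mask = pad_batch(batch, "max_length")
dynamic_ids, dynamic_mask = pad_batch(batch, "longest")

print(len(fixed_ids), len(fixed_ids[0]))      # 4 77
print(len(dynamic_ids), len(dynamic_ids[0]))  # 4 17
```

With a 768-wide encoder on top, those two batches would correspond to the (4, 77, 768) and (4, 17, 768) outputs mentioned above.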

In a similar fashion, we only use the input_ids and not the attention_mask. Was there a reason for this? E.g., text_encoder(uncond_input.input_ids.to("cuda"))[0].half() was used as opposed to text_encoder(**{k: v.to("cuda") for k, v in uncond_input.items()})[0].half(). Won’t the latter be more accurate, since the CLIP encoder won’t pay attention to the padded tokens?
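For anyone unsure what I mean by not paying attention to padded tokens, here is a minimal sketch of how an attention mask is usually applied: masked positions get a score of negative infinity before the softmax, so they end up with zero weight. The `masked_softmax` function and the scores are invented for illustration and are not taken from the CLIP implementation. It also assumes at least one position is unmasked.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over attention scores, forcing zero weight where mask == 0."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)                         # subtract the max for stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5, 0.5]   # made-up scores; the last two are padding
no_mask = masked_softmax(scores, [1, 1, 1, 1])
with_mask = masked_softmax(scores, [1, 1, 0, 0])

print(no_mask)    # padding positions still receive some weight
print(with_mask)  # padding positions receive exactly zero weight
```

Without the mask, some of the attention weight leaks onto the padding positions, which is the effect I was asking about.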

And this brings me to the final question. I assumed that CLIP encoders embed text/images into a single 768-dimensional vector. I assumed this would be done by averaging the output embeddings, similar to how sentence transformers work, or with something like the CLS token in BERT.
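To spell out the two pooling ideas I had in mind, here is a small sketch on a fake (3 tokens, 3 dimensions) output instead of the real (77, 768) one. `mean_pool` and `pick_token` are hypothetical helpers, not functions from any library, and the numbers are arbitrary.

```python
def mean_pool(hidden, mask):
    """Average the token vectors whose mask entry is 1 (sentence-transformers style)."""
    kept = [vec for vec, m in zip(hidden, mask) if m == 1]
    dim = len(hidden[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def pick_token(hidden, index):
    """Use one token's vector as the summary (CLS-style pooling)."""
    return hidden[index]


hidden = [
    [1.0, 0.0, 2.0],   # token 0
    [3.0, 2.0, 0.0],   # token 1
    [9.0, 9.0, 9.0],   # padding token, should not affect the mean
]
mask = [1, 1, 0]

print(mean_pool(hidden, mask))   # [2.0, 1.0, 1.0]
print(pick_token(hidden, 0))     # [1.0, 0.0, 2.0]
```

Either way the result is one vector per prompt, whereas the notebook keeps the full per-token sequence.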

P.S. Both ways worked regarding the tokenizer. I just think it's less load for the GPU if the sequence is shorter.