Hello. I was wondering why we use `padding="max_length"` as opposed to `padding=True`. The latter gives a shorter output of shape `(4, 17, 768)` instead of `(4, 77, 768)` in the `stable_diffusion.ipynb` example.
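To make the comparison concrete, here is a minimal pure-Python sketch of the two padding strategies (no `transformers` dependency; the token ids and pad id are made up, and 77 is CLIP's fixed context length):

```python
# Toy token-id sequences of different lengths (ids are illustrative).
batch = [[49406, 320, 1125, 49407],
         [49406, 320, 1125, 539, 320, 2368, 49407]]

PAD_ID = 49407      # illustrative pad token id
MAX_LENGTH = 77     # CLIP's fixed context length

def pad_longest(seqs, pad_id):
    """Like padding=True: pad each sequence to the longest one in the batch."""
    longest = max(len(s) for s in seqs)
    return [s + [pad_id] * (longest - len(s)) for s in seqs]

def pad_max_length(seqs, pad_id, max_length):
    """Like padding="max_length": pad each sequence to a fixed length."""
    return [s + [pad_id] * (max_length - len(s)) for s in seqs]

dynamic = pad_longest(batch, PAD_ID)
fixed = pad_max_length(batch, PAD_ID, MAX_LENGTH)

print([len(s) for s in dynamic])  # [7, 7]  — as long as the longest input
print([len(s) for s in fixed])    # [77, 77] — always the fixed length
```

So `padding=True` produces batches whose sequence length varies with the longest prompt, while `padding="max_length"` always gives a fixed 77-token sequence.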
In a similar fashion, we only use the `input_ids` and not the `attention_mask`. Was there a reason for this? For example, `text_encoder(uncond_input.input_ids.to("cuda"))[0].half()` was used as opposed to `text_encoder(**{k: v.to("cuda") for k, v in uncond_input.items()})[0].half()`. Won't the latter be more accurate, since the CLIP encoder then won't pay attention to the padded tokens?
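To illustrate what I mean about the padded tokens, here is a toy sketch (made-up scores, not CLIP's actual attention) of how an attention mask removes the weight given to pad positions inside a softmax:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Raw attention scores from one query to a 6-token sequence,
# where the last two positions are padding (values are made up).
scores = [2.0, 1.0, 0.5, 1.5, 0.3, 0.2]
attention_mask = [1, 1, 1, 1, 0, 0]

# Without a mask, the pad positions receive non-zero attention weight.
unmasked = softmax(scores)

# With a mask, pad scores are pushed to -inf before the softmax,
# so their weight becomes exactly zero.
masked_scores = [s if m == 1 else float("-inf")
                 for s, m in zip(scores, attention_mask)]
masked = softmax(masked_scores)

print(unmasked[-2:])  # small but non-zero weights on the pad tokens
print(masked[-2:])    # [0.0, 0.0]
```

With the mask, all the attention weight is redistributed over the real tokens, which is why I'd have expected passing `attention_mask` to change the result.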
And this brings me to my final question. I assumed that CLIP encoders embed text/images into a single 768-dimensional vector, produced either by averaging the output token embeddings (similar to how sentence transformers work) or by taking a special token's embedding (like the CLS token in BERT). So I was surprised to see a full sequence of per-token embeddings instead.
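To make my assumption concrete, this is the kind of pooling I had in mind (a toy sketch with made-up numbers and 3 dimensions instead of 768; not what the notebook actually does):

```python
# Toy per-token embeddings: 4 tokens, 3 dimensions each (768 in real CLIP).
token_embeddings = [
    [1.0, 2.0, 3.0],   # position 0, e.g. a CLS/BOS-style token
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [1.0, 1.0, 1.0],
]

# Mean pooling (sentence-transformers style): average over the sequence axis,
# collapsing (seq_len, dim) down to a single (dim,) vector.
dim = len(token_embeddings[0])
mean_pooled = [sum(tok[d] for tok in token_embeddings) / len(token_embeddings)
               for d in range(dim)]

# CLS-style pooling (BERT style): take one special token's embedding.
cls_pooled = token_embeddings[0]

print(mean_pooled)  # [3.25, 4.0, 4.75]
print(cls_pooled)   # [1.0, 2.0, 3.0]
```

Either way the result is a single vector per prompt, rather than the `(77, 768)` per-token output the notebook feeds to the UNet.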
P.S. Both padding options worked with the tokenizer. I just think it's less load for the GPU to carry if the sequence is shorter.