“It is a truth universally acknowledged, that a stable diffusion user in possession of a good prompt, must be in want of two things: well-formed hands and ungarbled text.”
It has been noted that models like DALL-E 2 and Stable Diffusion can struggle with text. Although the models have become better at rendering fine details, letters in particular often come out garbled.
What do we know about text in the context of stable diffusion?
The Parti paper from Google notes that spelling capability appears to emerge only once models reach a sufficient parameter count.
From Parti: “A portrait photo of a kangaroo wearing an orange hoodie and blue sunglasses standing on the grass in front of the Sydney Opera House holding a sign on the chest that says Welcome Friends!”
As a rough comparison, Stable Diffusion is currently around 1B parameters (DALL-E 2 is 3.5B and Imagen is 4.6B).
So the question is whether text and spelling can be improved in a smaller model like Stable Diffusion. Is that possible, or is there a phase shift around 6.7B parameters that enables outlier features?
There are some short-term workarounds. As long as the model produces text-like markings, these can be identified by an OCR model (e.g. pytesseract) and replaced in place with a known font. If needed, the replaced words can be blended back into the image's style by running image-to-image at low strength. This approach is limited in the orientations and styles it can achieve, and it adds the complexity and compute cost of an OCR stage to the pipeline.
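The replace-and-blend step can be sketched in plain numpy. The function name, box format, and pixel-space blend below are my own stand-ins: a real pipeline would get the box from pytesseract, render the corrected word with an actual font, and do the blending with a genuine low-strength image-to-image pass rather than a pixel mix.

```python
import numpy as np

def replace_and_blend(image, box, clean_patch, blend=0.25):
    """Paste a patch of cleanly rendered text over an OCR-detected box,
    then mix a little of the original pixels back in, as a crude
    pixel-space stand-in for a low-strength image-to-image pass."""
    x, y, w, h = box  # box format assumed here: (left, top, width, height)
    out = image.copy()
    region = out[y:y + h, x:x + w].astype(np.float32)
    patch = clean_patch.astype(np.float32)
    out[y:y + h, x:x + w] = ((1 - blend) * patch + blend * region).astype(image.dtype)
    return out

# Toy usage: a black image with a white "rendered word" patch pasted in.
img = np.zeros((10, 10), dtype=np.uint8)
word_patch = np.full((2, 3), 255, dtype=np.uint8)
fixed = replace_and_blend(img, (1, 2, 3, 2), word_patch, blend=0.25)
```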
The authors of the Imagen paper noted the importance of the text encoder for both image fidelity and image-text alignment, so enhancing the model's text encoder could improve results for generating images containing text.
Fine-tuning on more text-oriented data would also likely help. The original model was trained on a large subset of LAION-5B, with the final rounds of training done on "LAION-Aesthetics v2 5+", a subset of 600 million captioned images that a model predicted humans would rate at least 5 out of 10 when asked how much they liked them. Such a subset likely favors artistic results at the expense of examples with legible text.
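A first pass at assembling such a text-oriented subset could be a simple caption heuristic like the one below. The cue words are my guess at what correlates with legible text appearing in the image, not anything LAION actually uses for filtering.

```python
import re

# Cue words (my assumption) that suggest the paired image contains legible text.
TEXT_CUES = re.compile(
    r'\b(sign|text|says?|reads?|reading|labeled|titled|captioned|written)\b',
    re.IGNORECASE,
)

def is_text_oriented(caption):
    """Heuristic filter for building a text-heavy fine-tuning subset:
    keep captions with a text cue word or a quoted string."""
    return bool(TEXT_CUES.search(caption)) or '"' in caption
```

Captions selected this way would then be fine-tuning candidates, ideally after a second check (e.g. running OCR on the image itself).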
One other idea would be to use the output of an OCR model during training. Would it be feasible to add a loss term to the denoising objective that pushes the model toward producing OCR-able text?
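One hypothetical shape for that objective, sketched in numpy for illustration: the usual denoising MSE plus a character-recognition cross-entropy term. The recognizer is assumed to exist and be differentiable (standard OCR tools like pytesseract are not); `char_logits` stands in for its per-character output, and `ocr_weight` is an invented knob.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def combined_loss(noise_pred, noise_target, char_logits, char_targets, ocr_weight=0.1):
    """Denoising MSE plus a cross-entropy term that rewards the decoded
    image reading as the intended characters. In a real setup char_logits
    would come from a differentiable recognizer run on the decoded image."""
    mse = np.mean((noise_pred - noise_target) ** 2)
    probs = softmax(char_logits)
    ce = -np.mean(np.log(probs[np.arange(len(char_targets)), char_targets] + 1e-9))
    return mse + ocr_weight * ce

# Toy usage: logits that confidently predict the right characters give a
# near-zero OCR term; predicting the wrong characters inflates the loss.
logits = np.array([[10.0, 0.0, 0.0],
                   [0.0, 10.0, 0.0]])
good = combined_loss(np.zeros(4), np.zeros(4), logits, np.array([0, 1]))
bad = combined_loss(np.zeros(4), np.zeros(4), logits, np.array([1, 0]))
```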
Of course, accurate text generation also creates moderation challenges, such as the need for post-generation, OCR-powered profanity filtering (e.g. see Elmo holding up a sign that says "fluck you" on r/weirddalle).
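A minimal post-generation filter along these lines could fuzzy-match OCR'd words against a blocklist, so near-miss spellings like "fluck" are still caught. The function name, threshold, and similarity measure here are placeholders of my choosing.

```python
from difflib import SequenceMatcher

def flag_ocr_words(ocr_words, blocklist, threshold=0.75):
    """Flag OCR'd words that closely resemble a blocklisted term.
    Fuzzy matching catches the slightly garbled spellings that
    generated text tends to produce."""
    flagged = []
    for word in ocr_words:
        w = word.lower()
        if any(SequenceMatcher(None, w, bad).ratio() >= threshold for bad in blocklist):
            flagged.append(word)
    return flagged
```

An image whose OCR output triggers the filter could then be blocked or regenerated.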
Has anyone else experimented with improving the text generation performance of stable diffusion?