Isn’t this an object localization problem rather than multi-label classification? As far as I know, MultiCategoryBlock only classifies whether a category is present in the picture or not; it cannot provide the location or the number of occurrences of a category.
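To illustrate what I mean, here is a minimal sketch in plain Python of the kind of presence/absence target that multi-label classification trains on. This is not fastai's actual code: `vocab` and `multi_hot` are made-up names for the example, and I'm assuming the labels get turned into a multi-hot vector, which is how I understand it to work.

```python
# Hypothetical vocabulary of letter classes, purely for illustration.
vocab = ["a", "b", "c", "d"]

def multi_hot(labels, vocab):
    """Encode labels as a presence/absence vector over vocab."""
    present = set(labels)  # duplicates and ordering are discarded here
    return [1.0 if c in present else 0.0 for c in vocab]

# Two different images: one shows "abba", the other shows "ab".
target_abba = multi_hot(list("abba"), vocab)
target_ab = multi_hot(list("ab"), vocab)

print(target_abba)               # [1.0, 1.0, 0.0, 0.0]
print(target_ab)                 # [1.0, 1.0, 0.0, 0.0]
print(target_abba == target_ab)  # True
```

Both images end up with the same target, so a model trained on it has no way of learning how many times a letter occurs or where it is.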
I haven’t tried any of this myself, but maybe you can check out RetinaNet, CRNNs, or fastai’s bounding boxes. However, I don’t know how these approaches expect your training data to be labeled, and manually providing bounding boxes for each letter seems like a lot of work.