Labeling training data with non-Latin characters

hellogoodbye · November 28, 2019, 1:10pm

Hi! I am using a Korean dataset of museum images. When I load the images with ImageDataBunch, a function labels each image by its category, but the category names are in Korean.

data = ImageDataBunch.from_name_func(PATH/'data/images', fnames, label_func=func, ds_tfms=get_transforms(), size=224, bs=bs).normalize(imagenet_stats)
data.show_batch(rows=3, figsize=(7,6))

These are some of the errors I am seeing:
UserWarning: You are labelling your items with CategoryList.
Your valid set contained the following unknown labels, the corresponding items have been discarded.
석 gray schist, 나무 나무에 채색, 도자기010
  if getattr(ds, 'warn', False): warn(ds.warn)
/Users/...../anaconda3/lib/python3.7/site-packages/matplotlib/backends/backend_agg.py:211: RuntimeWarning: Glyph 53664 missing from current font.
  font.set_text(s, 0.0, flags=flags)
... ...

I changed func to return a random English word as the label instead, and then these issues went away.

Is there a way to specify that my labels use non-Latin characters?

mrfabulous1 · January 8, 2020, 10:17pm

Hi hellogoodbye, hope all is well.
Not sure whether you have sorted this problem out yet.
https://docs.python.org/2.4/lib/standard-encodings.html
Could you use some sort of codec, like the ones in the link, to do some conversion?
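
A rough sketch of that idea, in case it helps. Python 3 strings are already Unicode, so a conversion may not be needed at all; this only shows one way to get ASCII-only stand-ins for the labels and turn them back afterwards. The helper names to_ascii_label and from_ascii_label are made up for illustration and are not part of fastai.

```python
import unicodedata


def to_ascii_label(label: str) -> str:
    """Return an ASCII-only stand-in for a label (hypothetical helper)."""
    # NFC stores Hangul as precomposed syllables, so the same word
    # always produces the same stand-in regardless of how it was typed.
    normalized = unicodedata.normalize("NFC", label)
    # The built-in unicode_escape codec writes non-ASCII characters
    # as backslash escapes.
    return normalized.encode("unicode_escape").decode("ascii")


def from_ascii_label(ascii_label: str) -> str:
    """Invert to_ascii_label, recovering the NFC form of the label."""
    return ascii_label.encode("ascii").decode("unicode_escape")


labels = ["도자기", "나무", "석"]
ascii_labels = [to_ascii_label(label) for label in labels]
restored = [from_ascii_label(a) for a in ascii_labels]

print(ascii_labels)
print(restored == [unicodedata.normalize("NFC", label) for label in labels])
```

You could return to_ascii_label(...) from func and convert back with from_ascii_label when you need the readable name.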

Cheers mrfabulous1
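
A note on the two warnings quoted in the first post. The first one says that some labels appear in the validation set but not in the training set, which reads as a data-split matter rather than a script matter. The second, "Glyph 53664 missing from current font", comes from matplotlib: the font it is drawing with has no glyph for that character. A minimal sketch of one possible workaround follows, assuming a Hangul-capable font is installed. The font names in HANGUL_FONTS are guesses that depend on the machine, and choose_font and use_hangul_font are hypothetical helpers.

```python
def choose_font(candidates, installed):
    """Return the first candidate font name that is installed, else None."""
    installed = set(installed)
    for name in candidates:
        if name in installed:
            return name
    return None


# Assumed candidates; which of these exist depends on the machine.
HANGUL_FONTS = ("AppleGothic", "NanumGothic", "Malgun Gothic", "Noto Sans CJK KR")


def use_hangul_font():
    """Point matplotlib at a Hangul-capable font if one is installed."""
    # Imported here so the rest of the sketch runs without matplotlib.
    import matplotlib
    from matplotlib import font_manager

    installed = [f.name for f in font_manager.fontManager.ttflist]
    name = choose_font(HANGUL_FONTS, installed)
    if name is not None:
        matplotlib.rcParams["font.family"] = name
    return name


# The number in the warning is a Unicode code point. This checks whether
# it falls inside the Hangul Syllables block (U+AC00 to U+D7A3).
missing = 53664
print(hex(missing), 0xAC00 <= missing <= 0xD7A3)
```

Calling use_hangul_font() before data.show_batch(...) should let the Korean labels render in the plot titles if one of the candidate fonts is found. If it returns None, no candidate was installed and the list needs a font that exists on that system.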