What’s the best way to go about doing Optical Character Recognition (OCR)?

Hi all,

I am working on an OCR model using fastai. I trained on the IAM dataset using TensorFlow and got 65% accuracy. I’m hoping to get better results using PyTorch/fastai…

For now I want to solve it step by step using v3 API…

Here’s my understanding -

  1. Create a databunch where the datasets return <vision.Image.Image, torch.tensor (containing the text in image)>

  2. Create a pytorch model - this should be normal

  3. Create a new stepper to compute batch target lengths that are fed into CTC loss function

  4. Create a learner using above 3 and train.

Are the assumptions above correct? Can anyone advise me if I’m missing something or if I should do anything differently?
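
For step 3, here is a minimal sketch of how text labels could be turned into padded integer targets plus per-sample lengths, which is the kind of input a CTC loss expects. The helper names `build_vocab` and `encode_batch` are made up for illustration, and index 0 is assumed to be reserved for the CTC blank. In PyTorch, the two lists would be wrapped in tensors and passed as `targets` and `target_lengths` to `torch.nn.CTCLoss`.

```python
def build_vocab(texts):
    """Map each character to an integer id, keeping 0 free for the CTC blank."""
    chars = sorted(set("".join(texts)))
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode_batch(texts, vocab, pad_id=0):
    """Return (padded_targets, target_lengths) for a batch of strings."""
    lengths = [len(t) for t in texts]
    max_len = max(lengths)
    # Characters keep their reading order; short words are right-padded.
    targets = [[vocab[ch] for ch in t] + [pad_id] * (max_len - len(t)) for t in texts]
    return targets, lengths

texts = ["move", "to", "stop"]
vocab = build_vocab(texts)
targets, lengths = encode_batch(texts, vocab)
print(targets, lengths)
```

Unlike a multi-label target, this keeps character order and repeated characters, which is what a sequence loss needs.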

So far, I have been able to load all the images into a databunch with the following code:

%%time

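# Read the word annotations, skipping the first 18 lines.
with open('words.txt', 'r') as f:
    lines = f.readlines()[18:]

# Map each image id (first field) to its transcription (last field).
fields = [l.strip().split() for l in lines]
data_dict = {l[0]: l[-1] for l in fields}

def label(fpath) -> list:
    "Return the transcription for an image file as a list of characters."
    return list(data_dict[fpath.stem])  # for now we are losing the order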

deg = 5
tr_tfm = [perspective_warp(magnitude=(-0.,0.)), rotate(degrees=(-deg,deg))]
val_tfm = [perspective_warp(magnitude=(-0.,0.)), rotate(degrees=(-deg,deg))]

data = (ImageItemList.from_folder('words')
     .random_split_by_pct()
     .label_from_func(label)
     .transform((tr_tfm, val_tfm), size=(32, 128), resize_method=ResizeMethod.SQUISH)
     .databunch())
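
As an aside on the parsing step, here is a small sketch that skips header lines by their content instead of a hard-coded line count. It assumes header lines start with `#`; the sample lines and ids are made up and only mimic the whitespace-separated layout the code above relies on (id first, transcription last). `read_labels` is a hypothetical helper name.

```python
import io

def read_labels(lines):
    """Build {image id: transcription} from whitespace-separated lines."""
    labels = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # assumed comment/header marker
            continue
        fields = line.split()
        labels[fields[0]] = fields[-1]
    return labels

# Made-up sample standing in for the annotation file.
sample = io.StringIO(
    "# header line\n"
    "img-00-00 ok 154 408 768 27 51 AT A\n"
    "img-00-01 ok 154 507 766 213 48 NN MOVE\n"
)
data_dict = read_labels(sample)
print(data_dict)
```

With a real file, `read_labels(open('words.txt'))` would be used the same way.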

I have two doubts

  1. The targets are not in order, what needs to be changed for them to be in order? Is this because I am returning a list of outputs in the label function?

  2. How to ensure that the images’ aspect ratio is maintained (i.e., not stretched?). I have tried ResizeMethod.PAD but there’s an error in concatenating the dataset since the transformed images don’t have the same shape…
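
On the second doubt: batching fails whenever the images in a batch end up with different shapes, so every image has to reach the same final size. One common approach is to scale the image to fit inside the target box while keeping its aspect ratio, then pad the remainder. The sketch below only shows that arithmetic for a 32x128 target; it is not fastai code, and `fit_with_padding` is a made-up helper.

```python
def fit_with_padding(h, w, target_h=32, target_w=128):
    """Return (new_h, new_w, pad_h, pad_w) for an aspect-preserving resize."""
    # Use the smaller scale factor so the image fits in both dimensions.
    scale = min(target_h / h, target_w / w)
    new_h = max(1, min(target_h, round(h * scale)))
    new_w = max(1, min(target_w, round(w * scale)))
    # Whatever is left over has to be filled with padding.
    return new_h, new_w, target_h - new_h, target_w - new_w

for h, w in [(64, 128), (40, 400), (20, 30)]:
    print((h, w), "->", fit_with_padding(h, w))
```

After resizing to `(new_h, new_w)` and padding by `(pad_h, pad_w)`, every image is exactly 32x128, so the batch can be stacked.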

Please help.

I think you need to use MultiCategoryList for your labels. The easiest way would be to make your label function return a string with the letters separated by a character (like ';') and then pass label_delim=';' when you call label_from_func.
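
A note on the ordering question: a multi-label target is typically turned into a multi-hot vector over the classes, which records which characters occur but not their order or how often they repeat. The plain-Python illustration below (not fastai code; `multi_hot` and `as_sequence` are made-up helpers) shows why that representation cannot tell anagrams apart, while a sequence of class indices can.

```python
def multi_hot(text, classes):
    """1 where the character occurs in text, 0 otherwise (order and counts are lost)."""
    present = set(text)
    return [1 if c in present else 0 for c in classes]

def as_sequence(text, classes):
    """Class index per character, in reading order."""
    index = {c: i for i, c in enumerate(classes)}
    return [index[c] for c in text]

classes = sorted(set("listen" + "silent"))
# Both words contain the same letters, so the multi-hot vectors are identical.
print(multi_hot("listen", classes) == multi_hot("silent", classes))
# The index sequences differ because they keep the reading order.
print(as_sequence("listen", classes) == as_sequence("silent", classes))
```

So for OCR the target probably needs to stay a sequence all the way to the loss, whichever label class is used to load it.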

For the aspect ratio, try without any transform first, to check it doesn’t come from the warping or the rotation.

I modified it that way to return ';'.join(list(data[fpath.stem])) and added label_delim to label_from_func as you suggested. But now fastai shows the following warning on databunch creation:

UserWarning: There seems to be something wrong with your dataset, can't access these elements in self.train_ds: 44595,73249,57586,49432,60372...

and throws an index error on calling data.show_batch

IndexError: index 59178 is out of bounds for axis 0 with size 43886

Please advise…
Thank you! :)

Just try to access one element in data.train_ds.y, or look at data.train_ds.y.items and data.train_ds.y.__class__ to see what’s going on.

len(data.train_ds.y) returns 38879 whereas len(data.train_ds.x) returns 92256.

I guess the error comes from not being able to load strings that have a single character… But I can’t confirm it, since I checked the source code and can’t find where label_from_func uses label_delim. Could you let me know which part of the source code would help me out?

It is quite daunting to read source code when you don’t know where to look.
I’ve just started writing the code and there’s a long way to go. Thank you for helping me @sgugger :)

But what do you have in data.train_ds.y.items[0], for instance, and what is data.train_ds.y.__class__? (If it’s not MultiCategoryList, the bug is there.) Then, if you can load index 0 or 1, what happens when you try to load from data.train_ds.y.items the indices where it tells you there is a problem?

I can load indices and the output seems to be as expected. __class__ returns MultiCategoryList, and data.train_ds.y.items is returning a list of lists of integers.

But on loading a larger index, there’s a difference between image and the target, which made me suspect not all targets are being loaded properly.

Can you check with data.train_ds.x.items[10000] the filename for this image?

It is the same image as data.train_ds.x[10000].

I meant the filename. Because the label is your function applied to that filename.

It returns PosixPath('words/c06/c06-027/c06-027-01-05.png')

I’ve been loading the target from a dictionary of {fpath: label}, I’ll inspect if something’s awry with it and update this comment…
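
For the dictionary check, a quick audit along these lines might help. It is only a sketch: `audit_labels` is a made-up helper, the paths and labels are stand-ins, and the "risky" characters are an assumption about what could collide with a delimiter-based label format (the `;` delimiter itself, or a double quote).

```python
from pathlib import Path

def audit_labels(paths, labels, risky_chars=';"'):
    """Split file paths into those with no label and those whose label has risky characters."""
    missing, risky = [], []
    for p in map(Path, paths):
        text = labels.get(p.stem)
        if text is None:
            missing.append(p.name)
        elif any(ch in text for ch in risky_chars):
            risky.append(p.name)
    return missing, risky

# Stand-in data; in practice paths would be the image files and labels the dictionary.
paths = ["words/a/a-00.png", "words/a/a-01.png", "words/a/a-02.png"]
labels = {"a-00": "move", "a-01": 'say;"'}
missing, risky = audit_labels(paths, labels)
print(missing, risky)
```

If `missing` is empty but `risky` is not, the dictionary itself is fine and the problem is more likely in how the joined string gets split again.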

EDIT:

Originally, when the label function's return statement is

return list(data_dict[fpath.stem])

there is no error and the data loads one-to-one. The only issue is that the target is MultiCategory instead of MultiCategoryList.

On changing it to

return ';'.join(list(data_dict[fpath.stem]))

only some targets are loading in train_ds.y as MultiCategoryList whereas all the images loaded in train_ds.x without any problem.
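
One possible explanation worth ruling out, though it is not confirmed anywhere in this thread: if the library splits the delimited labels with Python's `csv` module (an assumption on my part, not verified against the fastai source), then any label containing a double quote opens a quoted field that swallows the following labels until the next quote. That would produce fewer targets than images. The labels below are made up to show the effect with the standard library alone.

```python
import csv

# Five delimited labels; two of them contain a double-quote character.
labels = ["a;b", '";x', "c;d", 'e;";f', "g;h"]

# Default quoting: a quote starts a quoted field that can span several inputs.
rows = list(csv.reader(labels, delimiter=";"))
print(len(labels), len(rows))
for row in rows:
    print(row)

# Disabling quote handling keeps one row per label.
safe_rows = list(csv.reader(labels, delimiter=";", quoting=csv.QUOTE_NONE))
print(len(safe_rows))
```

Handwriting transcriptions often include quote marks and semicolons as words of their own, so checking for those characters in the label dictionary would be a cheap first test.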

Now that is weird. You could have None for some targets, but having them forgotten is really weird.
What is your latest code?

@yesh @sgugger I’m also interested in this problem, but I find it difficult to create a databunch in fastai. All my images are saved in an HDF5 (.h5) file because there are too many to put in one folder.
All I have is a file image.h5 that stores all the images (as NumPy arrays) and a dataframe that links each image’s filename to its label (text).
Could you please give me some insight into how I can create a databunch from it?
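
One possible starting point, sketched under assumptions: a small map-style dataset class (`__len__` plus `__getitem__`, the protocol a PyTorch `DataLoader` works with) that looks images up by name in any keyed store. It assumes each image is stored under a key equal to the filename in the dataframe; `KeyedImageDataset` and the stand-in data are made up. With h5py, indexing the file by key returns a dataset object that still has to be read into an array (for example with `[()]`), and converting to tensors is left out here.

```python
class KeyedImageDataset:
    """Dataset-style wrapper: item i is (image stored under names[i], its text label)."""

    def __init__(self, store, pairs):
        self.store = store        # any object supporting store[name], e.g. an open HDF5 file
        self.pairs = list(pairs)  # (image name, label text) tuples

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, i):
        name, text = self.pairs[i]
        return self.store[name], text

# Stand-in data; in practice `store` would be the opened image.h5 and
# `pairs` would come from the dataframe's filename and label columns.
store = {"img_0": [[0, 255], [255, 0]], "img_1": [[255, 255], [0, 0]]}
pairs = [("img_0", "cat"), ("img_1", "dog")]
ds = KeyedImageDataset(store, pairs)
print(len(ds), ds[1])
```

Reading images lazily in `__getitem__` means the whole file never has to be loaded into memory at once.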

Hi Yesh, I am also working on OCR with fastai. Did you get your code running?

Not really, I was occupied with other work. If people are interested, I can create a github project with access to a sample dataset of 20000 images of large numbers (like captcha) and the preliminary pytorch code that I have written that is giving around 60% word accuracy. Please let me know so I can add the interested people as collaborators.
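
For anyone comparing numbers, here is a sketch of how a word accuracy figure could be computed from CTC outputs. It assumes "word accuracy" means the fraction of predictions that match the target exactly, and it uses greedy decoding (collapse repeats, then drop blanks) with blank id 0; the frame ids are made up.

```python
from itertools import groupby

def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse repeated ids, then drop blanks (greedy CTC decoding)."""
    return [k for k, _ in groupby(frame_ids) if k != blank]

def word_accuracy(predictions, targets):
    """Fraction of predictions that match their target exactly."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

# Per-frame argmax ids for two images (made-up values).
frames = [[0, 3, 3, 0, 3, 5, 5, 0], [2, 2, 0, 0, 4, 4, 4, 0]]
decoded = [ctc_greedy_decode(f) for f in frames]
print(decoded)
print(word_accuracy(decoded, [[3, 3, 5], [2, 5]]))
```

Note that a blank between two identical ids is what lets the decoder output a doubled character.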

Hello, I got interested in using fastai to create an image-to-LaTeX converter. I suppose the problem is similar to using fastai for OCR? I am aware of the proposal to use MultiCategoryList, but the im2latex dataset does not have convenient delimiters to easily apply label_delim. Just wondering if you have made progress on your problem.
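
On the missing delimiters: one option might be to tokenize the LaTeX string yourself before building the labels. The regex below is a rough sketch, not a full LaTeX lexer, and `tokenize_latex` is a made-up helper. It treats a command such as `\frac`, an escaped symbol such as `\{`, or any single non-space character as one token.

```python
import re

# A command (\frac), an escaped symbol (\{), or any single non-space character.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|\\.|\S")

def tokenize_latex(formula):
    """Split a LaTeX string into coarse tokens."""
    return TOKEN_RE.findall(formula)

tokens = tokenize_latex(r"\frac{a}{b} + x^2")
print(tokens)
```

The token list can then be mapped to integer ids directly, or joined with a delimiter that is guaranteed not to appear in any token.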