Questions on OCR using SSD

Let's say I have some newspaper images and I feed them into SSD. I get a bunch of bounding boxes, with one character in each box. What are some ideas for taking those and reassembling sentences / paragraphs? For that last part, maybe I can use an LSTM to improve the accuracy.

One idea: maybe I can train a model that takes `[(char_token, x1, y1, x2, y2) for i in range(total_characters)]` and outputs a list of strings in token form?
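Before reaching for a learned model, a simpler baseline I've been thinking about is a pure geometric heuristic: cluster the character boxes into lines by vertical position, then sort left-to-right within each line. Here's a rough sketch (the function name, box format `(char, x1, y1, x2, y2)` with a top-left origin, and the `line_tol` threshold are all my own assumptions, not anything from SSD itself):

```python
# Hypothetical sketch: group per-character boxes into text lines by
# vertical center, then read each line left to right.
# Assumed box format: (char, x1, y1, x2, y2), y increasing downward.

def boxes_to_text(boxes, line_tol=0.5):
    """Cluster character boxes into lines, then sort each line by x."""
    if not boxes:
        return ""
    # Sort by vertical center so same-line boxes end up adjacent.
    boxes = sorted(boxes, key=lambda b: (b[2] + b[4]) / 2)
    lines = [[boxes[0]]]
    for b in boxes[1:]:
        prev = lines[-1][-1]
        prev_cy = (prev[2] + prev[4]) / 2
        cy = (b[2] + b[4]) / 2
        height = prev[4] - prev[2]
        # Same line if vertical centers differ by < a fraction of char height.
        if abs(cy - prev_cy) <= line_tol * height:
            lines[-1].append(b)
        else:
            lines.append([b])
    # Within each line, order characters by their left edge.
    return "\n".join(
        "".join(ch for ch, *_ in sorted(line, key=lambda b: b[1]))
        for line in lines
    )

boxes = [
    ("H", 0, 0, 10, 12), ("i", 12, 0, 16, 12),
    ("o", 12, 20, 22, 32), ("k", 0, 21, 10, 33),
]
print(boxes_to_text(boxes))  # → "Hi\nko"
```

Something like this would give a first-pass string that an LSTM-based language model could then clean up (e.g. fixing characters the detector confused). Multi-column newspaper layouts would break this naive version, though — you'd probably need to segment columns first.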

P.S.: I am still on lesson 12.