Neural string parsing

I’m trying to parse a string (a financial product definition) and classify each character. The simplest example is:

10m CHF5y

For this example, 10m should be classified as the size, the space should be ignored, CHF is the currency, and 5y is the length (a.k.a. the tenor).
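
For concreteness, here is a minimal sketch of the per-character targets for that string (the tag names SIZE, IGNORE, CCY and TENOR are just placeholders I'm using for illustration):

```python
# Per-character labels for "10m CHF5y"; tag names are illustrative.
text = "10m CHF5y"
labels = (["SIZE"] * 3        # "10m"
          + ["IGNORE"]        # " "
          + ["CCY"] * 3       # "CHF"
          + ["TENOR"] * 2)    # "5y"
assert len(labels) == len(text)
```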

I’m using character embeddings and a bidirectional LSTM, but I’m not getting very good results. I’ve had to generate my own training data (randomised strings following some patterns), and I’m using a different set of patterns for training and validation (e.g. the validation patterns may put the currency before the size).
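
Roughly, the model looks like this. This is just a sketch in Keras with placeholder hyperparameters (vocab size, embedding width, LSTM units), not the exact code from my notebook:

```python
# Sketch of the architecture: char embedding -> BiLSTM -> softmax per char.
# Hyperparameters here are placeholders, not the notebook's actual values.
from tensorflow.keras import layers, models

NUM_CHARS = 128   # assumed char vocabulary size (index 0 reserved for padding)
NUM_TAGS = 4      # SIZE, IGNORE, CCY, TENOR

model = models.Sequential([
    layers.Embedding(NUM_CHARS, 16, mask_zero=True),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Dense(NUM_TAGS, activation="softmax"),  # applied per timestep
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```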

I’m finding that the training loss gets very low within just one epoch of 3000 samples, but the validation loss gets worse after about 1500 samples and never gets very good. It’s not overfitting the data as such, since I’m only running one epoch, but it seems to be overfitting the patterns in the data.

Am I doing anything obviously wrong? Do I just need better data?

Here’s my Colab notebook, which has some more info. I can provide the training data or the generator script if desired.

https://colab.research.google.com/drive/1hOgcHJCHokos3e8gIuvDZsTzKdO6WSJA

I tried writing a bunch more string patterns, and it improved the performance quite a lot. I think that with lots of examples generated from just a few patterns, the model learns to classify characters by their position within the string rather than by the combination of characters within each pattern, and this leads to poor generalisation.
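
To illustrate what "more patterns" means, here is a hypothetical stripped-down version of the kind of generator I mean (the field formats and names are made up for illustration), randomising the field order per sample instead of baking in one fixed layout:

```python
# Hypothetical stripped-down generator: randomise field order and spacing
# per sample so labels can't be predicted from absolute position alone.
import random

CURRENCIES = ["CHF", "USD", "EUR", "GBP"]

def make_sample():
    fields = [(f"{random.randint(1, 500)}m", "SIZE"),
              (random.choice(CURRENCIES), "CCY"),
              (f"{random.randint(1, 30)}y", "TENOR")]
    random.shuffle(fields)  # e.g. currency before size, as in validation
    text, labels = "", []
    for i, (chunk, tag) in enumerate(fields):
        if i > 0 and random.random() < 0.5:  # spaces are optional ("CHF5y")
            text += " "
            labels.append("IGNORE")
        text += chunk
        labels.extend([tag] * len(chunk))
    return text, labels

print(make_sample())  # e.g. ('CHF 5y10m', ['CCY', 'CCY', 'CCY', 'IGNORE', ...])
```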

Still, I’d love to get some pointers on ways to improve my model architecture or training method.