I’m trying to parse a string (a financial product definition) and classify each character. The simplest example is:
For this example, the input string would be "10m CHF 5y":
10m should be classified as size, the space should be ignored,
CHF is the currency, and
5y is the length (aka tenor).
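Concretely, the per-character targets would look something like the sketch below (assuming the string is "10m CHF 5y"; the label names SIZE/CCY/TENOR/IGNORE are placeholders, not necessarily the ones in my notebook):

```python
# Hypothetical per-character labelling for one example string.
# One label per character; spaces get an "ignore" label.
text = "10m CHF 5y"
labels = [
    "SIZE", "SIZE", "SIZE",   # "10m" -> size
    "IGNORE",                 # " "   -> ignored
    "CCY", "CCY", "CCY",      # "CHF" -> currency
    "IGNORE",                 # " "   -> ignored
    "TENOR", "TENOR",         # "5y"  -> length (tenor)
]
assert len(text) == len(labels)  # alignment check: 10 chars, 10 labels

for ch, lab in zip(text, labels):
    print(repr(ch), lab)
```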
I’m using character embeddings feeding a bidirectional LSTM, but I’m not getting very good results. I had to generate my own training data (randomised, following some patterns), and I use a different set of patterns for training and validation (e.g. the validation patterns may put the currency before the size).
I’m finding that the training loss gets very low within a single epoch of 3,000 samples, but the validation loss starts getting worse after about 1,500 samples and never gets very good. It isn’t overfitting the data as such, since I’m only running one epoch, but it seems to be overfitting the patterns in the data.
Am I doing anything obviously wrong? Do I just need better data?
Here’s my Colab notebook, which has some more info. I can provide the training data or the generator script if desired.