Text generation metrics

Hi, first of all, thanks to this great community. This is actually my first time posting here, and I’m so happy to be part of these fantastic discussions.

I’m working on a project that is mainly about text generation in the Persian language. I have been trying different architectures, including AWD-LSTM, GPT-base, and so on. The metrics I’ve used so far (as I saw in the fastai courses) are loss, perplexity, and accuracy.
I also tried fine-tuning the models on downstream tasks, but that can take a lot of time.
I wonder if there are other metrics specific to text generation that I could use to compare different architectures on the same dataset?


A widely used metric is perplexity. It measures almost exclusively a language model’s ability to predict the next token, and it doesn’t necessarily correlate with how well the model would perform on downstream tasks like classification. There’s also accuracy, but I’ve found that perplexity is often a better representation of your model’s generative abilities.
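To make the two metrics concrete, here is a minimal sketch in plain Python. It assumes you already have, for each position in a held-out text, the log-probability the model assigned to the true next token and the model's top-1 predicted token id. The function names and the numbers below are made up for illustration; they are not from fastai or any other library.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) of the true tokens."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

def next_token_accuracy(predicted_ids, target_ids):
    """Fraction of positions where the top-1 prediction equals the true next token."""
    correct = sum(p == t for p, t in zip(predicted_ids, target_ids))
    return correct / len(target_ids)

# Hypothetical values: probabilities the model gave to the true next tokens.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.125)]
ppl = perplexity(log_probs)

# Hypothetical token ids: top-1 predictions vs. the actual next tokens.
acc = next_token_accuracy([3, 7, 1, 4], [3, 7, 2, 4])

print(f"perplexity={ppl:.2f}, accuracy={acc:.2f}")
```

Since the cross-entropy loss is the mean negative log-likelihood, perplexity is just `exp(loss)`. One caveat when comparing architectures: perplexity is computed per token, so the numbers are only directly comparable if the models share the same tokenization and vocabulary.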

Also, I’d advise you to play around with each model you train to gain a deeper appreciation of its pros and cons. For example, you might find that GPT’s grammar is better, but that AWD-LSTM outputs more creative text.

By the way, I speak Persian too, so I’d be interested to know what preprocessing steps you apply to your data and what results you’ve gotten so far 🙂.

Have a nice weekend!


Thanks @BobMcDear
I’ve used the GPT-2 language model and fine-tuned it on Persian, based on this great work by @pierreguillou: “Faster than training from scratch — Fine-tuning the English GPT-2 in any language with Hugging Face and fastai v2 (practical case with Portuguese)”. The results looked good (the model keeps context), but the bottleneck is generating long text. Also, I didn’t have many compute resources, so that may have affected the results too.
