AI-Generated Spam Countermeasures

Hey all, I wanted to start a topic to discuss countermeasures against AI-generated spam (and similar problems like mass disinformation campaigns and fake news).

It’s a hot topic right now: OpenAI has announced they are releasing their full GPT-2 model to researchers to study its detection and misuse potential. And a project of my own may be one of the first instances of a text-generation model being used in the wild for trolling.

In my case, I’ve added an invisible Unicode watermark to the text my model generates, so that text naively copied and pasted elsewhere can be easily detected and, if necessary, filtered. Unfortunately, this watermark would be trivial to remove (and it does nothing to stop bad actors from training their own models and using them for nefarious purposes).
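I haven't described my exact scheme here, so what follows is only a minimal sketch of the general idea, not my actual implementation. It assumes a hypothetical scheme that encodes a short tag (here `"gen"`) as bits, writing each bit as one of two zero-width characters (U+200B ZERO WIDTH SPACE for 0, U+200C ZERO WIDTH NON-JOINER for 1). All function names are made up for illustration.

```python
from typing import Optional

# Hypothetical encoding: two invisible characters stand for the bits 0 and 1.
ZERO = "\u200b"  # ZERO WIDTH SPACE -> bit 0
ONE = "\u200c"   # ZERO WIDTH NON-JOINER -> bit 1
INVISIBLE = {ZERO, ONE}


def watermark(text: str, tag: str = "gen") -> str:
    """Hide `tag` inside `text` as a run of zero-width characters."""
    bits = "".join(f"{byte:08b}" for byte in tag.encode("utf-8"))
    mark = "".join(ONE if bit == "1" else ZERO for bit in bits)
    # Insert right after the first character so it rides along with a copy/paste.
    return text[:1] + mark + text[1:]


def extract_mark(text: str) -> Optional[str]:
    """Recover the hidden tag, or None if no complete mark is present."""
    bits = "".join("1" if ch == ONE else "0" for ch in text if ch in INVISIBLE)
    if not bits or len(bits) % 8:
        return None
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8", errors="replace")


def is_watermarked(text: str, tag: str = "gen") -> bool:
    """What a forum filter might check before accepting a post."""
    return extract_mark(text) == tag


def strip_mark(text: str) -> str:
    """Removing the mark is a one-liner, which is exactly the weakness."""
    return "".join(ch for ch in text if ch not in INVISIBLE)


if __name__ == "__main__":
    marked = watermark("Hello world")
    print(is_watermarked(marked))              # True
    print(is_watermarked(strip_mark(marked)))  # False
```

The last two lines illustrate both halves of the point above: naive copies are easy to flag, but anyone who knows (or guesses) which code points are used can delete them, and normalizing or retyping the text destroys the mark entirely.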

I’ve seen some people working on machine learning detectors for GPT-2-generated text. And xkcd floated the idea of programmatically judging the quality of posts (warning: foul language within!) so that if bots are going to spam, at least they’ll be spamming constructive, helpful content.

I’m interested in what other attack vectors there are, what ideas for countermeasures people have, and whether you’ve seen any other approaches tried already.

This seems like an area that is going to need lots of attention in the future.


It looks like OpenAI has published a dataset of GPT-2 output here, along with baseline detection models.

I’m curious whether a detector that performs well on GPT-2 output also works on output from other models like AWD-LSTM, or whether each generation model needs its own detector.