Finding the real gold in my text

I have thousands of social media texts that describe drug and medication reactions. Many of them concern the same medications, but I don't want to model text about those specific meds; I want to identify language that describes reactions to medications like the ones in my data.

That is, for me the “gold” is not a mention of a specific medication, but rather a mention of a reaction to it (or any medication).

My idea is that I should be able to model language that talks about drug reactions and apply that to any drug/medication. I should be able to download social media feeds that mention my (current) medication of interest; it's very easy to use pattern matching for that task. The NLP part is to examine my social media data for language that discusses actual personal crises in relation to the medication. This way I hope to eliminate news articles, rants, gossip, etc.

How do I deal with this data to eliminate an untoward focus on an actual drug name? Do I just replace it with a generic name in the text I feed to the model? But then wouldn’t it “go to town” assigning importance to that generic replacement word? I need something like a tf-idf approach, but I didn’t see how to do that with the techniques @jeremy covered with the arXiv and IMDb data sets.

If anyone has an interest in collaborating with me and my wife @Sedigh on this kind of project, we would love to hear from you :wink:


Sounds interesting! If you include text covering lots of other medications as well, and replace all medication names with ‘’, maybe that would suffice?
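
As a rough illustration of that replacement step, here is a minimal sketch using plain regular expressions. The medication list, the `xxmed` placeholder token, and the function names are all made up for the example; none of them come from this thread or from the fastai library.

```python
import re

# Hypothetical medication list and placeholder token (assumptions for this sketch).
MED_NAMES = ["ibuprofen", "metformin", "flu shot"]
MED_TOKEN = "xxmed"

def build_med_pattern(names):
    # Put longer names first so multi-word names win over shorter overlaps.
    ordered = sorted(names, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    # Word boundaries avoid matching inside longer words; matching ignores case.
    return re.compile(r"\b(?:" + alternation + r")\b", flags=re.IGNORECASE)

# Compile once and reuse for every text.
MED_PATTERN = build_med_pattern(MED_NAMES)

def mask_medications(text, token=MED_TOKEN):
    # Replace every listed medication name with the single placeholder token.
    return MED_PATTERN.sub(token, text)

print(mask_medications("I took Ibuprofen last night and now I am feeling terrible"))
```

Using one shared token for every medication keeps the sentence structure intact while hiding which drug is being discussed. Whether the model then over-weights the token is something only an experiment will show.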

Thanks @jeremy

If you think that this approach will work then that is a fairly simple thing to do, but would the model not end up learning that language does not require a subject? “I took ‘’ last night and now I am feeling terrible”, “my daughter reacted to the ‘’ shot”, “’’ is killing me”, etc.? This is why I asked whether there is a word frequency weighting/adjustment we might apply to our text in the fastai library, so that it can learn the structure of the conversation without being overly influenced by the specific medication names. Then I wouldn’t need to deal with those med names specifically.
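
To make the kind of weighting I mean concrete, here is a small plain-Python sketch of inverse document frequency, the "idf" half of tf-idf. It is not a fastai feature; the function name and the toy texts (including the made-up drug name `drugx`) are assumptions for illustration only.

```python
import math
from collections import Counter

def inverse_document_frequency(docs):
    # Count, for each word, how many documents contain it at least once.
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))
    # Words that occur in every document get a weight of log(1) = 0.
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

# Toy texts with a hypothetical drug name that appears in all of them.
docs = [
    "drugx made me dizzy",
    "drugx gave me a rash",
    "drugx is in the news",
]
idf = inverse_document_frequency(docs)
print(idf["drugx"], idf["rash"])
```

In a corpus collected by searching for one medication, that name appears in nearly every text, so a weighting like this pushes it toward zero while rarer reaction words keep a higher weight.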

Elsewhere I am asking a question about how to clean out Twitter ids etc. from a text stream by replacing that text with an empty string. If I should replace my medication names with an empty string then I need an efficient mechanism for it, and I am looking for help to clean data of URLs and Twitter ids in that post - maybe you have a suggestion?
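
For what it's worth, the kind of cleanup I have in mind looks roughly like the sketch below. The patterns are simple assumptions of mine (they will not catch every URL form) and the function name is made up; compiling the patterns once is the only efficiency measure here.

```python
import re

# Simple, assumed patterns: http(s)/www links and @handles not preceded by a word character.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HANDLE_RE = re.compile(r"(?<!\w)@\w+")
SPACE_RE = re.compile(r"\s+")

def clean_tweet(text):
    # Replace links and handles with a space, then collapse leftover whitespace.
    text = URL_RE.sub(" ", text)
    text = HANDLE_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()

print(clean_tweet("@bob check https://t.co/abc this is killing me"))
```

The lookbehind in `HANDLE_RE` is there so that email addresses such as `a@b.com` are left alone.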

It’s almost impossible to know answers to questions like this by considering them theoretically. You just have to try them out and see what works!