Dealing with URLs, hashtags, and Twitter IDs in social media data

I would like to do NLP on Reddit submissions, Twitter data, and the like. I am faced with a lot of noise in the data, particularly URLs, hashtags, and Twitter IDs. These will not necessarily be removed by eliminating infrequent words.

Is this a case of devising a mechanism to clean the data beforehand, or is there a way of dealing with these sorts of things in the library?

Hey @Chris_Palmer … A friend of mine is also working on Twitter data sentiment analysis regarding drugs and their side effects. He used a tool called RapidMiner for Twitter data collection and processing. Maybe this could help you too.


Thanks Vishal

I have looked into RapidMiner in the past, but not in relation to text mining. I have the data already; it's just the cleansing I need help with, so I will check out RapidMiner again to see how it can help!

Thanks again :smile:


hi Chris, can you say a bit more about what you need exactly? I’ve worked on Twitter data in the past and will be working on some Reddit stuff soon - for our Twitter papers, we usually implemented custom preprocessing of text data. In general the preprocessing is task specific - if you tell me a bit more about what you’re trying to do I may be able to make some suggestions.

Hi @anamariapopescug

Thanks for replying! I have the entire Reddit submissions data set from 2006 until August 2015, which I have imported into a SQL database so that it's easier to query. I have then extracted the submissions that mention vaccinations by using a wildcard search - and of course this data is not at all clean.

Additionally I have Twitter conversations around the same subject.

I don’t want to manually clean it if I can avoid it; I would rather do that programmatically. But before embarking on any processing I thought I would ask if there was anything built in to the library that allows us to deal with words that are obviously a URL, hashtag, or Twitter ID.

Any help with cleaning is very welcome, but at this stage I want to know if I even need to worry about doing that - perhaps there is an inbuilt mechanism, or a flag to tell it to ignore these, or perhaps the nature of the algorithms means that I need not worry about these?

If not, and I need to clean them out, then maybe there is something available in Python or even in Linux that allows me to put these words aside. Ideally I would want to submit my text to a mechanism that strips these out, maybe replacing them with UNK, and that does a word count per submission after that to determine if there is anything useful left :slight_smile:
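A minimal sketch of the kind of mechanism described above, using only Python's standard `re` module. The function names (`mask_noise`, `count_content_words`) and the patterns are illustrative assumptions, not part of any library discussed in this thread, and the patterns are deliberately simple (for instance, the mention pattern would also match the tail of an email address).

```python
import re

# Illustrative patterns only; real tweets contain edge cases these will miss.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def mask_noise(text, placeholder="UNK", mask_hashtags=False):
    """Replace URLs and @mentions (optionally hashtags) with a placeholder."""
    text = URL_RE.sub(placeholder, text)
    text = MENTION_RE.sub(placeholder, text)
    if mask_hashtags:
        text = HASHTAG_RE.sub(placeholder, text)
    return text


def count_content_words(text, placeholder="UNK"):
    """Count the tokens that remain once placeholder tokens are ignored."""
    tokens = re.findall(r"#?\w+(?:'\w+)?", text)
    return sum(1 for token in tokens if token != placeholder)


tweet = ("@donttreadonme65 @KCarslin Shot that hurt like Hell "
         "but feeling better today TY #tweetiepie #besties")
masked = mask_noise(tweet)
print(masked)
print(count_content_words(masked))
```

A record whose count comes back as zero, such as a tweet consisting only of mentions, could then be set aside before any modelling.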

To give an example of the scope of the issue look at the following twitter conversation - even if I said to drop words with a frequency of less than 10 it would not eliminate these, and also the useful information in the text is almost non-existent once the twitter ids have been eliminated.

@donttreadonme65 @KCarslin @virleehol @Steven31015146 @I_LUV_IDF Steroid shot ?
@donttreadonme65 @KCarslin @virleehol @Steven31015146 @I_LUV_IDF Lol... They do work well though.. Worth it !
@donttreadonme65 Good here little one, just lunching. You feeling any better? @parkman_sandi @virleehol @Steven31015146 @I_LUV_IDF
@KCarslin @virleehol @donttreadonme65 @Steven31015146 @I_LUV_IDF it's very very short.. I'm going to try from my iPad..
@I_LUV_IDF @Steven31015146 @KCarslin @parkman_sandi @virleehol Awe a sweetie pie:-)
@donttreadonme65 @KCarslin @virleehol @Steven31015146 @I_LUV_IDF in the bumm too.. No doubt ! Those will light your behind up !! Ouch !
@donttreadonme65 @KCarslin @parkman_sandi @virleehol @I_LUV_IDF sorry I'm late, but I pray for your healing
@parkman_sandi @KCarslin @virleehol @Steven31015146 @I_LUV_IDF True whole hip and ass sore lol
@parkman_sandi Done, no laughing. Got it? ;-) @virleehol @donttreadonme65 @Steven31015146 @I_LUV_IDF
@parkman_sandi @KCarslin @virleehol @Steven31015146 @I_LUV_IDF Shot that hurt like Hell but feeling better today TY #tweetiepie #besties
@KCarslin @virleehol @donttreadonme65 @Steven31015146 @I_LUV_IDF yes born in May 
@donttreadonme65 Dispatch, K9 needs backup!!! @I_LUV_IDF @Steven31015146 @parkman_sandi @virleehol
@KCarslin @virleehol @donttreadonme65 @Steven31015146 @I_LUV_IDF @KCarslin DM me your email address and I can..
@KCarslin @donttreadonme65 @parkman_sandi @virleehol
@parkman_sandi @KCarslin @virleehol @Steven31015146 @I_LUV_IDF OMG yes:(
@parkman_sandi Still no luck? @virleehol @donttreadonme65 @Steven31015146 @I_LUV_IDF
@parkman_sandi You'd mentioned getting in October, wasn't sure if a kitten at time. @virleehol @donttreadonme65 @Steven31015146 @I_LUV_IDF
@KCarslin @virleehol @donttreadonme65 @Steven31015146 @I_LUV_IDF @KCarslin too late.. Already laughed. Saw the DM before seeing this ! 
@KCarslin @I_LUV_IDF @Steven31015146 @parkman_sandi @virleehol Lol yea he could you some
@donttreadonme65 @KCarslin @virleehol @Steven31015146 @I_LUV_IDF Sorry to hear that. Sending hugs & positive energy your way ! xoxo
@Steven31015146 With you on that Doc. Get well soon little one. :'( @donttreadonme65 @parkman_sandi @virleehol @I_LUV_IDF
@parkman_sandi Yep, seen yours first also. ;-) Kitteh is still pretty young.?. @virleehol @donttreadonme65 @Steven31015146 @I_LUV_IDF
@KCarslin @parkman_sandi @virleehol @Steven31015146 @I_LUV_IDF Thanks Sandy my grandson had the flu so now I have it:( I have COPD so worse

Hi Chris. I’ll let @jeremy confirm the presence/absence of any specific preprocessing in the library, but generally speaking that kind of processing is custom (e.g. you write Python preprocessing scripts). You have to be a bit careful because for example hashtags can be very informative (e.g. “#pain”, “#headache” - if you’re looking at reactions to medications for example), so you may not want to throw them out. You can try replacing urls and account ids with UNK tokens - like you said, you might not have that much salient content left afterwards :). Not sure how much Twitter data you have, but if it’s very sparse you may want to consider a simpler/different approach (and throwing out infrequent words may actually hurt because reactions to medications may only be mentioned infrequently) …

Yes, you are right about the hashtags and the infrequent words - in fact the focus of the requirement is on infrequent but serious reactions!

Do you know of any Python routines that would help, to get me started at least? I have only started getting to grips with Python recently; my background is in SQL and before that other DBMS languages like FoxPro, so I cannot immediately get stuck into solution mode in Python.

I’d search for “Python string manipulation” or “Python string processing” - there are many nice Python tutorials (I think there was a 1 day workshop before the beginning of the course where @jeremy and Yannet offered some links ?). I think you’ll unfortunately have to do a whole bunch of pre- and post-processing for any kind of NLPish task, whether in Python, Perl, Java, etc.

oh yes, that Reddit dump! Looking at it for a collaboration as well :). I would say for your task you might find that Reddit data is more fruitful, as Twitter comments can be more idiosyncratic, very short, etc. You can definitely mine them but it may require more work (or a different approach) - again, depends on how big your Twitter corpus is.


I would do this with regex. Read the data into a Pandas DataFrame and clean it up with a regex substitution. I wrote a regex to replace all http URLs and @usernames with an empty value as a starter.
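The regex itself did not survive in this copy of the thread. The following is a minimal sketch of that kind of substitution, not the original poster's expression: it assumes a DataFrame with a `text` column (a hypothetical name) and uses `Series.str.replace` with `regex=True`.

```python
import pandas as pd

# Assumed pattern: http/https links or @usernames. Hashtags are left alone.
NOISE_PATTERN = r"https?://\S+|@\w+"


def clean_column(series):
    """Drop URLs and @usernames, then tidy up the leftover whitespace."""
    return (
        series.str.replace(NOISE_PATTERN, "", regex=True)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )


df = pd.DataFrame({
    "text": [
        "@KCarslin @virleehol Steroid shot ?",
        "Read this http://t.co/abc #pain",
    ]
})
df["clean"] = clean_column(df["text"])
print(df["clean"].tolist())
```

Passing `"UNK"` instead of `""` as the replacement would keep a marker where each URL or username used to be.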


Pandas also has read_sql, so you can read data from the DB directly, do your cleaning, and then store the cleaned data back into the database via to_sql or write it out with to_csv. That way you only have to do the pre-processing once. @Chris_Palmer
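A rough sketch of that round trip, using an in-memory SQLite database so it runs on its own. The table names (`tweets`, `tweets_clean`), the column names, and the `clean_table` helper are assumptions made for illustration; with another database you would pass a different connection.

```python
import sqlite3

import pandas as pd


def clean_table(con, source_table="tweets", target_table="tweets_clean"):
    """Read rows from SQL, strip URLs and @usernames, write the result back."""
    df = pd.read_sql(f"SELECT id, text FROM {source_table}", con)
    df["text"] = (
        df["text"]
        .str.replace(r"https?://\S+|@\w+", "", regex=True)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )
    # if_exists="replace" overwrites the target table on every run.
    df.to_sql(target_table, con, if_exists="replace", index=False)
    return len(df)


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tweets (id INTEGER, text TEXT)")
con.executemany(
    "INSERT INTO tweets VALUES (?, ?)",
    [
        (1, "@KCarslin @virleehol Steroid shot ?"),
        (2, "@parkman_sandi Thanks http://t.co/x #besties"),
    ],
)
con.commit()

rows_written = clean_table(con)
cleaned = con.execute("SELECT text FROM tweets_clean ORDER BY id").fetchall()
print(rows_written, cleaned)
```

Writing to a separate table keeps the raw text available in case the cleaning rules need to change later.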


The twitter corpus is not that large - 158,000 records.

As you say not as useful as Reddit data due to the size of tweets, but they are based on published data sets of mentions of adverse drug reactions, which I got from Diego Lab.

Actually, Diego Lab has a number of published collections, the 158,000 drug chatter collection (originally 247,000 records) is not labelled, but they also have smaller, labelled sets - I have one of 6,670 records (after downloading).

With the large set I have isolated those that mention vaccines, and have gone back and retrieved the entire conversation, not just the initial mention of the reaction (which their published IDs point to).

And of course I need to be able to mine tweets for this kind of conversation…

If you would like to collaborate on something with the Reddit corpus then we are definitely interested.

Regarding the Reddit corpus, it was quite a process getting the data out of the massive JSON file into records in SQL, but I have that now and it's easy to extract data based on any wildcard SQL search. I will publish my process for extracting it to SQL when I get time, hopefully soon, but I am very happy to help anyone that might want to make use of what I have.

Alternatively if you have a way of getting data directly from the JSON file without it taking a month of Sundays, I would love to know about it!

Thanks Ramesh - a nice and concise solution!

I will check it out :smiley:

Let me know if you need any help with the pre-processing step as you work with this data. I would be happy to help out and/or collaborate as you work on this.

If you have the files as JSON, don’t push them to a database. Leave them as JSON and read them directly into Pandas via read_json, or use MongoDB if the data is too big.

This is all new territory to me, whereas I know SQL, so I took that approach. :slight_smile:
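A hedged sketch of the read_json route for a file too large to load at once. It assumes the dump stores one JSON object per line and that each record has a `title` field; both are assumptions about the file, and `filter_submissions` is a hypothetical helper. With `lines=True` and a `chunksize`, `pandas.read_json` returns an iterator of DataFrames instead of one large frame.

```python
import io

import pandas as pd


def filter_submissions(json_lines, keyword, chunksize=100_000):
    """Scan line-delimited JSON in chunks and keep rows whose title matches."""
    matches = []
    reader = pd.read_json(json_lines, lines=True, chunksize=chunksize)
    for chunk in reader:
        # Plain substring match, case-insensitive; missing titles count as no match.
        mask = chunk["title"].str.contains(keyword, case=False, regex=False, na=False)
        matches.append(chunk[mask])
    return pd.concat(matches, ignore_index=True)


# Tiny in-memory stand-in for the real file; a file path can be passed instead.
sample = io.StringIO(
    '{"id": "a1", "title": "Flu vaccine side effects"}\n'
    '{"id": "a2", "title": "Cat pictures"}\n'
    '{"id": "a3", "title": "VACCINE schedule question"}\n'
)
result = filter_submissions(sample, "vaccine", chunksize=2)
print(result["id"].tolist())
```

Only the matching rows are held in memory, so the chunk size can be tuned to whatever the machine tolerates.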

MongoDB sounds like the way to go - there are over 200,000,000 records…

Thanks Ramesh. I will digest your advice so far, and get back to you! I really appreciate your offer :smile:

If you already have it imported to SQL, you could stick with it, since that’s something you are comfortable with and you have the data there now.

For sure, I have the data now and it’s great to be able to work with it in a familiar technology, but at the time I did this I was wishing I knew how to do a “big data” approach to it - so much to learn!