Dealing with URLs, hashtags, and Twitter IDs in social media data

Hi Chris. I’ll let @jeremy confirm whether the fast.ai library does any specific preprocessing of this kind, but generally speaking that processing is custom (e.g. you write your own Python preprocessing scripts). You have to be a bit careful, because hashtags can be very informative (e.g. “#pain”, “#headache” if you’re looking at reactions to medications), so you may not want to throw them out. You can try replacing URLs and account IDs with UNK tokens, though like you said, you might not have much salient content left afterwards :). Not sure how much Twitter data you have, but if it’s very sparse you may want to consider a simpler/different approach (and throwing out infrequent words may actually hurt, because reactions to medications may only be mentioned infrequently) …
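
For what it’s worth, here is a minimal sketch of the kind of custom script I mean. It is plain Python with the standard `re` module, not anything from the fast.ai library, and the regexes and the names `URL_RE`, `MENTION_RE` and `normalize_tweet` are just my assumptions for illustration. It replaces URLs and @account mentions with a placeholder token and leaves hashtags alone:

```python
import re

# Assumed patterns for illustration; real tweets may need stricter or looser rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# The lookbehind avoids matching the "@" inside things like email addresses.
MENTION_RE = re.compile(r"(?<!\w)@\w+")


def normalize_tweet(text, url_token="UNK", mention_token="UNK"):
    """Replace URLs and @mentions with placeholder tokens; keep hashtags as-is."""
    text = URL_RE.sub(url_token, text)
    text = MENTION_RE.sub(mention_token, text)
    # Collapse any runs of whitespace left behind by the substitutions.
    return " ".join(text.split())


print(normalize_tweet("@bob this med gives me a #headache https://t.co/abc123"))
```

You could also pass distinct tokens (say, one for URLs and one for mentions) if you think the model might benefit from telling them apart; whether that helps will depend on your data.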