Crawling a page for all linked text

msp · April 20, 2018, 9:41am

Hi,

I was describing word embeddings to a friend of mine, and he enthusiastically asked me to run them on his website, which contains thousands of links (mostly to HTML and PDF documents).

Does anyone have a handy tool for going from a single url to a concatenation of all the linked texts?

Cheers!

suryatk · April 21, 2018, 12:18am

Beautiful Soup or Scrapy can help you with this. With Scrapy it is quite straightforward to extract all the links and HTML elements from a webpage.
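
To give a rough idea of the shape of it, here is a minimal sketch that uses only the Python standard library instead of Beautiful Soup or Scrapy (either of those would likely be more robust on messy real-world HTML). The names `LinkAndTextParser`, `extract_links_and_text`, `fetch_html`, and `concatenate_linked_text` are made up for this example, and it assumes the linked HTML pages are small enough to fetch one by one. PDF links are collected but skipped, since they need a separate text extractor.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.request import urlopen


class LinkAndTextParser(HTMLParser):
    """Collects absolute http(s) link targets and visible text from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if not href:
                return
            # Resolve relative links and drop any "#fragment" part.
            absolute, _fragment = urldefrag(urljoin(self.base_url, href))
            is_web_link = urlparse(absolute).scheme in ("http", "https")
            if is_web_link and absolute not in self.links:
                self.links.append(absolute)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract_links_and_text(html, base_url):
    """Return (list of absolute links, visible text) for one HTML string."""
    parser = LinkAndTextParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)


def fetch_html(url):
    """Hypothetical helper: download a page and decode it leniently as UTF-8."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def concatenate_linked_text(start_url, fetch=fetch_html):
    """Follow every link on start_url (one level deep) and join the page texts."""
    links, _ = extract_links_and_text(fetch(start_url), start_url)
    texts = []
    for link in links:
        if urlparse(link).path.lower().endswith(".pdf"):
            continue  # PDFs are not HTML; handle them with a PDF text extractor
        _, text = extract_links_and_text(fetch(link), link)
        texts.append(text)
    return "\n".join(texts)
```

Something like `corpus = concatenate_linked_text("https://example.com/")` would then produce one string to feed into the embedding training. The `fetch` argument is only there so the crawl can be tried offline with canned pages; a real crawl would also want error handling and some politeness delay between requests.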

msp · April 23, 2018, 3:52pm

Thanks for the scrapy tip. I also found this:

which seems to be effective for extracting text from PDF.