Crawling a page for all linked text

Hi,

I was describing word embeddings to a friend of mine, and he enthusiastically asked me to run them on his website, which contains thousands of links (mostly to HTML and PDF documents).

Does anyone have a handy tool for going from a single URL to a concatenation of all the linked texts?

Cheers!

Beautiful Soup or Scrapy can help you with this. With Scrapy it is quite straightforward to extract all the links and HTML elements from a webpage.
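To make the idea concrete, here is a rough sketch of the "one URL to concatenated linked text" step. It deliberately uses only the Python standard library (`html.parser` and `urllib`) as a stand-in, so it is not the Beautiful Soup or Scrapy API; those libraries handle messy HTML, politeness, and scale far better. The names `parse_page`, `fetch_html`, and `linked_text` are made up for illustration, and PDF links are simply skipped because they need a separate text extractor.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class _PageParser(HTMLParser):
    """Collects href targets and visible text from one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style content.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def parse_page(html, base_url):
    """Return (absolute http(s) links, visible text) for one HTML page."""
    parser = _PageParser()
    parser.feed(html)
    parser.close()
    links = []
    for href in parser.links:
        # Resolve relative links and drop #fragments so duplicates collapse.
        absolute = urldefrag(urljoin(base_url, href)).url
        if absolute.startswith(("http://", "https://")) and absolute not in links:
            links.append(absolute)
    return links, " ".join(parser.text_parts)


def fetch_html(url):
    """Download a page and decode it (assumes UTF-8 if no charset is given)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def linked_text(start_url, fetch=fetch_html):
    """Concatenate the visible text of every HTML page linked from start_url."""
    links, _ = parse_page(fetch(start_url), start_url)
    texts = []
    for link in links:
        if link.lower().endswith(".pdf"):
            continue  # PDFs need a dedicated text extractor
        try:
            _, text = parse_page(fetch(link), link)
        except (OSError, ValueError):
            continue  # skip links that cannot be fetched
        texts.append(text)
    return "\n\n".join(texts)
```

Calling `linked_text("https://example.com/")` (a placeholder URL) would follow links one level deep only. For thousands of links you would probably want rate limiting and a `robots.txt` check, which Scrapy provides out of the box.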


Thanks for the scrapy tip. I also found this:

which seems to be effective for extracting text from PDF.