Crawling a page for all linked text


I was describing word embeddings to a friend of mine, and he enthusiastically asked me to run them on his website, which contains thousands of links (mostly to HTML and PDF documents).

Does anyone have a handy tool for going from a single URL to a concatenation of all the linked texts?


Beautiful Soup or Scrapy can help you with this. With Scrapy it is quite straightforward to extract all the links and other HTML elements from a webpage.
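
To give a rough idea of the link-extraction step, here is a minimal sketch. Note that it uses only Python's standard library (html.parser and urllib.parse) rather than Beautiful Soup or Scrapy, so that it runs without installing anything. The names LinkCollector and extract_links and the example.com URL are made up for illustration, and it works on an HTML string you have already downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the absolute target of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name != "href" or not value:
                continue
            # Resolve relative links against the page URL and drop any #fragment.
            absolute, _fragment = urldefrag(urljoin(self.base_url, value))
            # Skip mailto:, javascript: and other non-web schemes.
            if urlparse(absolute).scheme not in ("http", "https"):
                continue
            # Keep the first occurrence of each link, in page order.
            if absolute not in self.links:
                self.links.append(absolute)


def extract_links(html, base_url):
    """Return the unique http(s) links found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical sample page, standing in for the downloaded HTML.
sample = """
<a href="/docs/intro.html">Intro</a>
<a href="papers/report.pdf">Report</a>
<a href="/docs/intro.html#top">Intro again</a>
<a href="mailto:someone@example.com">Contact</a>
"""
links = extract_links(sample, "https://example.com/index.html")
```

From there you would download each link and pull out its text, with HTML and PDF targets needing different extractors. On a real site, Scrapy also takes care of things like request scheduling and following links that this sketch ignores.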


Thanks for the scrapy tip. I also found this:

which seems to be effective for extracting text from PDF.