Language: best practices - datasets for text classification

Moody · March 5, 2018, 3:42am

Regarding the issues you listed in the README:

1. Entire sentences are showing up as a single word; individual words are not being separated.
2. Need to isolate the title, author, affiliation, figure captions, etc. fields from the extracted text.

For 1, I think the issue is in line 7 (not line 14). I used a different approach, and it seems to work. You should now be able to tokenize at the text level.
Niyoji-JACS2008.ipynb · GitHub
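
In case it helps, here is a minimal sketch of what tokenizing at the text level can look like. It is not the code from the notebook, just an illustration: `tokenize` is a made-up helper, and the regex is one simple assumption about what counts as a word (letters and digits, with internal apostrophes or hyphens kept).

```python
import re

# Hypothetical helper, not taken from the notebook: split one text string
# into lowercase word tokens instead of treating each sentence as a token.
WORD_PATTERN = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")


def tokenize(text):
    """Return the word tokens found in a single string of text."""
    return WORD_PATTERN.findall(text.lower())
```

Calling `tokenize` on the whole extracted text (one string) gives a flat list of words. If you iterate over a list of sentences instead, each sentence ends up being treated as one item, which may be what is happening in your case.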

For 2, I think you can get all the paper information (title, author, etc.) via the repository API. I previously got the title and summary via the arXiv API with a few lines of code.
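
Roughly, the arXiv version looks like the sketch below. This is only an illustration and assumes the paper is on arXiv; for another repository the endpoint and response format will differ. The function names `parse_arxiv_feed` and `fetch_arxiv_metadata` are made up for this example. The arXiv API returns an Atom feed, so the standard library is enough to parse it.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace used by the feed
ARXIV_API = "http://export.arxiv.org/api/query"


def _clean(text):
    """Collapse the line breaks and extra spaces the feed leaves in text fields."""
    return " ".join((text or "").split())


def parse_arxiv_feed(feed_xml):
    """Return one dict (title, summary, authors) per entry in an Atom feed."""
    root = ET.fromstring(feed_xml)
    papers = []
    for entry in root.findall(ATOM + "entry"):
        papers.append({
            "title": _clean(entry.findtext(ATOM + "title")),
            "summary": _clean(entry.findtext(ATOM + "summary")),
            "authors": [
                _clean(author.findtext(ATOM + "name"))
                for author in entry.findall(ATOM + "author")
            ],
        })
    return papers


def fetch_arxiv_metadata(arxiv_id):
    """Query the arXiv API for one identifier (needs network access)."""
    query = urllib.parse.urlencode({"id_list": arxiv_id})
    with urllib.request.urlopen(ARXIV_API + "?" + query) as response:
        return parse_arxiv_feed(response.read())
```

Calling `fetch_arxiv_metadata` with an arXiv identifier should return a list with one dict containing the title, summary, and author names. Keeping the parsing separate from the download makes it easy to test the parser on a saved feed.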

Hope this helps.