Regarding the issues listed in your README:
- entire sentences are showing up as a word; individual words are not being separated
- need to isolate title, author, affiliation, figure captions etc fields from the extracted text
For 1, I think the issue is in line 7 (not line 14). I used a different approach and it seems to work. Now you should be able to tokenize it at the text level.
Niyoji-JACS2008.ipynb · GitHub
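For illustration only, here is a rough sketch of splitting extracted text into individual words. It is not the code from the notebook: the function name `tokenize_words` and the regex rules are assumptions, and it uses only the Python standard library.

```python
import re

def tokenize_words(text):
    """Split raw extracted text into lowercase word tokens (hypothetical helper)."""
    # Rejoin words hyphenated across a line break, e.g. "nano-\nparticles".
    text = re.sub(r"-\s*\n\s*", "", text)
    # Keep runs of letters/digits, allowing an internal apostrophe or hyphen.
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", text.lower())

words = tokenize_words("Gold nano-\nparticles were synthesized. The yield was 95%.")
print(words)
```

A dedicated tokenizer from an NLP library would likely handle chemical names and abbreviations better; the regex is only meant to show words coming out as separate items.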
For 2, I think you can get all the paper information (title, author, etc.) via the repository's API. I have previously retrieved the title and summary via the arXiv API with a few lines of code.
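As a rough sketch of the arXiv case, the query endpoint returns an Atom feed that can be parsed with the standard library. The helper names `parse_arxiv_feed` and `fetch_arxiv_metadata` are made up for this example, and this only covers papers that are on arXiv; a journal paper would need that publisher's or repository's own API.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"          # Atom XML namespace
ARXIV_API = "http://export.arxiv.org/api/query"  # arXiv query endpoint

def _clean(text):
    # Collapse the line breaks and extra spaces arXiv leaves in text fields.
    return " ".join((text or "").split())

def parse_arxiv_feed(xml_text):
    """Extract title, summary, and author names from an Atom feed string."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall(ATOM + "entry"):
        papers.append({
            "title": _clean(entry.findtext(ATOM + "title")),
            "summary": _clean(entry.findtext(ATOM + "summary")),
            "authors": [_clean(a.findtext(ATOM + "name"))
                        for a in entry.findall(ATOM + "author")],
        })
    return papers

def fetch_arxiv_metadata(arxiv_id):
    """Query arXiv for one paper ID (needs network access)."""
    url = ARXIV_API + "?" + urllib.parse.urlencode({"id_list": arxiv_id})
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_arxiv_feed(resp.read().decode("utf-8"))
```

Calling `fetch_arxiv_metadata` with an arXiv identifier should return a list with one dictionary per matching paper. Fields such as affiliation or figure captions are generally not in this feed, so those would still have to come from the extracted text.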
Hope this helps.