Project idea: Resume parser

I want to work on a project that automatically parses resumes, but I have no background in text mining. Can someone suggest a good way to approach it, especially how to split a resume into its various sections? I am assuming the resumes are in PDF format, and I plan to use the Python library tika to convert them from PDF to text. Any suggestions about the project would be welcome.

I’ve done a similar project at work (I obviously can’t share the details), but I can give you some pointers:

You can use pdfminer.six to read the resumes if they are OCR’d, that is, if the PDF already contains a text layer. This gives you the bounding-box coordinates of each character. It can be messy, but it is certainly doable to piece the characters into words using those coordinates.
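As a rough sketch of the "piece the letters into words" step, the function below takes a list of character boxes and merges them into words by looking at the horizontal gap between neighbours. The tuple layout, the function name, and the `gap_ratio` and `line_tol` thresholds are assumptions made for illustration; they are not part of pdfminer.six, and the values would need tuning on real resumes.

```python
from typing import List, Tuple

# Hypothetical input format: (character, x0, y0, x1, y1), with the origin at
# the bottom-left of the page as in PDF coordinates.
CharBox = Tuple[str, float, float, float, float]


def group_chars_into_words(chars: List[CharBox],
                           gap_ratio: float = 0.3,
                           line_tol: float = 2.0) -> List[str]:
    """Merge character boxes into words, top-to-bottom and left-to-right."""
    # 1. Cluster characters into lines by their baseline (y0), top line first.
    lines: List[List[CharBox]] = []
    for box in sorted(chars, key=lambda c: -c[2]):
        if lines and abs(lines[-1][0][2] - box[2]) <= line_tol:
            lines[-1].append(box)
        else:
            lines.append([box])

    # 2. Within each line, start a new word when the horizontal gap is large
    #    relative to the width of the current character.
    words: List[str] = []
    for line in lines:
        line.sort(key=lambda c: c[1])
        current = ""
        prev_x1 = None
        for ch, x0, _y0, x1, _y1 in line:
            if ch.isspace():
                if current:
                    words.append(current)
                    current = ""
                prev_x1 = None
                continue
            if current and prev_x1 is not None and x0 - prev_x1 > gap_ratio * (x1 - x0):
                words.append(current)
                current = ""
            current += ch
            prev_x1 = x1
        if current:
            words.append(current)
    return words


if __name__ == "__main__":
    sample = [
        ("H", 10, 100, 16, 110), ("i", 16, 100, 19, 110),
        ("y", 25, 100, 31, 110), ("o", 31, 100, 37, 110),
        ("A", 10, 80, 16, 90),
    ]
    print(group_chars_into_words(sample))
```

Clustering into lines first, rather than sorting everything by coordinates in one pass, keeps small baseline jitter within a line from scrambling the reading order.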

If the resume isn’t OCR-ready (for example, it is a scanned image), you’ll need to use Tesseract to extract the text from it. You can output either plain text or an XML-style format; the latter can provide additional information, such as where each character is and how big it is, and you can infer the important sections from this.
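One way to use the size information is to treat lines that are noticeably taller than the typical line as probable section headings. The sketch below assumes you have already reduced the OCR output to `(text, height)` pairs; that input format, the function name, and the `scale` threshold are all illustrative assumptions rather than anything Tesseract provides directly.

```python
from statistics import median
from typing import List, Tuple


def find_heading_lines(lines: List[Tuple[str, float]],
                       scale: float = 1.2) -> List[str]:
    """Return the text of lines at least `scale` times the median line height."""
    if not lines:
        return []
    typical = median(height for _text, height in lines)
    return [text for text, height in lines if height >= scale * typical]


if __name__ == "__main__":
    ocr_lines = [
        ("John Doe", 18.0),
        ("Experience", 14.0),
        ("Engineer at Example Corp", 10.0),
        ("Built internal tools", 10.0),
        ("Maintained data pipelines", 10.0),
        ("Education", 14.0),
        ("BSc Computer Science", 10.0),
    ]
    print(find_heading_lines(ocr_lines))
```

The median is used instead of the mean so that a few very large lines, such as the candidate's name, do not drag the baseline upward.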

I’m not sure how helpful deep learning would be here. If you see a handful of common resume formats, you could perhaps use a DL classifier to tell them apart and then send each resume to a format-specific postprocessor that parses the information out.

Thank you for your response; I will give pdfminer.six a try. Text extraction is not really the issue, though. The main problem is how to efficiently divide the text into the various sections of the resume so that each one can be handed to a different processor. Parsing the full text at once causes problems, such as how to tell dates associated with work experience apart from those associated with education.
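A simple baseline for this kind of segmentation is to look for lines that match known section headings and assign everything that follows to that section. The sketch below is only a starting point: the heading aliases, the function names, and the four-digit-year pattern are assumptions for illustration, and real resumes will need a much larger vocabulary.

```python
import re
from typing import Dict, List

# Hypothetical heading vocabulary; extend it for the resumes you actually see.
SECTION_HEADINGS = {
    "experience": ["experience", "work experience", "employment", "work history"],
    "education": ["education", "academic background"],
    "skills": ["skills", "technical skills"],
}
_LOOKUP = {alias: name
           for name, aliases in SECTION_HEADINGS.items()
           for alias in aliases}


def split_into_sections(text: str) -> Dict[str, List[str]]:
    """Assign each non-empty line to the most recent section heading seen."""
    sections: Dict[str, List[str]] = {"header": []}
    current = "header"
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Normalise "WORK EXPERIENCE" or "Education:" before the lookup.
        key = re.sub(r"[^a-z ]", "", line.lower()).strip()
        if key in _LOOKUP:
            current = _LOOKUP[key]
            sections.setdefault(current, [])
            continue
        sections[current].append(line)
    return sections


def years_by_section(sections: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Collect four-digit years per section so dates keep their context."""
    pattern = re.compile(r"\b(?:19|20)\d{2}\b")
    return {name: pattern.findall(" ".join(lines))
            for name, lines in sections.items()}


if __name__ == "__main__":
    resume = (
        "Jane Doe\njane@example.com\n\n"
        "WORK EXPERIENCE\nExample Corp, 2018-2021\n\n"
        "Education:\nBSc, 2014-2018\n"
    )
    parts = split_into_sections(resume)
    print(parts)
    print(years_by_section(parts))
```

Because the years are extracted per section, a date found under the experience heading is never confused with one found under education.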

Hi, did you find a solution to that problem? I’m working on the same subject and facing the same issue: extracting information from unstructured text (resumes).

You can try an existing resume parser, or use a project on GitHub as a reference.

Have you tried Big Help Desk’s resume parsing API service? It is very accurate and easy to set up, and they have sample code for many different languages.