Help for my own project - How to combine continuous and NLP input

Hey community,

I hope you are all doing well despite all of the lockdown problems.

I have recently started digging deeper into deep learning again and am trying to find a solution to the problem below. I hope somebody here can point me in the right direction.

I am helping a friend automate the processing of PDFs from the local harbor. Because many of these documents have slightly different formats, hand-coding rules for each one would be too much effort. I therefore thought I would try to come up with a NN architecture that takes the text from a document and extracts the right information.

What I want to do is train an NLP NN to classify a piece of text as belonging to one category or another (for example, whether an input is an address or a product category). In a next step, I want to use the location of the word in the document to estimate its distance to the nearest term of interest (maybe a bit like a k-nearest-neighbor approach, where the neighbors are the key category terms).
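A minimal sketch of the distance idea, assuming each word and each key term comes with a bounding box (x0, y0, x1, y1) from the PDF extractor. The function names, labels, and coordinates are hypothetical and only illustrate the calculation:

```python
import math

def center(box):
    """Return the (x, y) center of a bounding box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def nearest_key_term(word_box, key_terms):
    """Return (label, distance) of the key term closest to word_box.

    key_terms maps a label to the bounding box of that term.
    Distance is the Euclidean distance between box centers.
    """
    wx, wy = center(word_box)
    best_label, best_dist = None, math.inf
    for label, box in key_terms.items():
        kx, ky = center(box)
        dist = math.hypot(wx - kx, wy - ky)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist

# Hypothetical layout, coordinates in PDF points
key_terms = {
    "address": (50, 100, 120, 112),
    "product": (50, 300, 130, 312),
}
word_box = (140, 102, 200, 114)
label, dist = nearest_key_term(word_box, key_terms)
```

The resulting distance (and the word's own x/y position) could then be used as continuous input features alongside the text.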

My question is whether you know of any neural nets that can combine NLP and continuous variables. (I have come across the option of including categorical variables as additional embeddings, but I do not think this would work here.)

Another potential solution would be to have several NNs back to back and use the output from the one before. Still, I would ideally like an architecture that can make sense of this in one go.
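One common way to handle both kinds of input in a single network is to turn the text into a vector, concatenate the continuous features onto it, and feed the joint vector into the classifier layers. Below is a rough NumPy sketch of such a forward pass. The sizes, the random weights, the class order, and the choice of continuous features are all assumptions for illustration; a real model would be built and trained in a deep learning framework:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: vocabulary, embedding width, continuous inputs, classes
VOCAB_SIZE, EMB_DIM, N_CONT, N_CLASSES = 1000, 16, 3, 2

# Randomly initialised parameters; in a real model these would be learned.
embedding = rng.normal(size=(VOCAB_SIZE, EMB_DIM))
W = rng.normal(size=(EMB_DIM + N_CONT, N_CLASSES)) * 0.1
b = np.zeros(N_CLASSES)

def forward(token_ids, cont_features):
    """Return class probabilities for one text span.

    token_ids: integer ids of the tokens in the span.
    cont_features: continuous inputs, e.g. scaled x, y, and distance
                   to the nearest key term.
    """
    text_vec = embedding[token_ids].mean(axis=0)        # average token embeddings
    joint = np.concatenate([text_vec, cont_features])   # text + continuous inputs
    logits = joint @ W + b
    exp = np.exp(logits - logits.max())                 # numerically stable softmax
    return exp / exp.sum()

token_ids = np.array([12, 407, 33])   # hypothetical token ids
cont = np.array([0.23, 0.10, 0.85])   # hypothetical features scaled to [0, 1]
probs = forward(token_ids, cont)      # e.g. [P(address), P(product category)]
```

Scaling the continuous features to a similar range as the embedding values usually matters here, since otherwise one input type can dominate the other.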

I would highly appreciate any feedback or pointers in the right direction!

Have a great holiday.

Cheers

Sorry, I did not completely understand the problem from the description.

This seems like extracting important fields such as address and product from PDFs and putting them into a structured format.

I would probably start with a spaCy pipeline combined with rule-based matching. Search for spaCy PDF text extraction before attempting a NN.

Hey msivanes,

Thanks for getting back to me.

spaCy PDF extraction might be a sensible place to start, at least for a first look.
However, the layout of the PDFs I am looking at is very disjointed. When a program reads the text from the PDFs, it does not recognize which pieces of text belong together.

An example is here:

Here, it recognizes the ‘Amount of Insurance’ but then continues with the ‘INSURANCE - If Carrier…’ row. Then it jumps to the next item in the lower row, ‘NIL’, before continuing with the second row of the ‘INSURANCE - If Carrier…’ text. Therefore, a normal rule-based text extraction approach won’t work, which is why I wanted to look at a solution that is text-based as well as location-based.
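To illustrate what "location-based" could mean in the simplest case: if the extractor also returns coordinates for each text fragment, the fragments can be regrouped by column before reading them top to bottom. This is only a sketch; the coordinates, the column edges, and the placeholder for the second line are made up, and real documents would need the column edges to be detected rather than hard-coded:

```python
import bisect

def group_by_column(fragments, column_edges):
    """Group text fragments into columns, each read top to bottom.

    fragments: list of (text, x0, y0), the top-left corner of each fragment,
               with y growing downward.
    column_edges: sorted x-positions where each column starts.
    """
    columns = {}
    for text, x0, y0 in fragments:
        # Index of the last column edge at or left of the fragment
        col = max(bisect.bisect_right(column_edges, x0) - 1, 0)
        columns.setdefault(col, []).append((y0, x0, text))
    return {
        col: [text for _, _, text in sorted(items)]
        for col, items in sorted(columns.items())
    }

# Fragments in the jumbled order the extractor returns them
# (hypothetical coordinates)
fragments = [
    ("Amount of Insurance", 40, 100),
    ("INSURANCE - If Carrier…", 200, 100),
    ("NIL", 40, 120),
    ("(second line of the insurance text)", 200, 112),
]
result = group_by_column(fragments, column_edges=[0, 180])
```

With these made-up positions, ‘Amount of Insurance’ and ‘NIL’ end up together in the first column, and the two lines of the insurance text end up together in the second.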

I hope that makes it clearer.