I have a question about extracting data from structured documents such as HTML, and I am hoping someone can point me in the right direction. I suspect this is an active research topic at the moment, since a couple of hours of googling turned up a lot of material on it.
How would one go about extracting data from a structured document using DL/ML? Suppose you have a large dataset of HTML documents that contain information about mortgage payments. The HTML structure may change subtly from one document to the next, but the documents are all very similar, and the biggest difference between them is the mortgage payment figures.
The HTML contains the usual data found on a mortgage bill: the starting capital, interest, amortization, etc. I am wondering if it is possible to make a NN learn the structure of the document and then extract the right information. Writing a regex for this wouldn't work, since the subtle changes in the HTML structure would break it.
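To make the problem concrete, here is a minimal sketch of the non-ML baseline I would compare a NN against. It assumes Python's standard `html.parser`; the field labels, the two sample documents and the `extract_fields` helper are all made up for illustration and are not taken from my real data. It ignores the tags entirely and takes the text node that follows each known label, so it tolerates some layout changes but would still break if a label were reworded or the value came before the label.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects non-empty text nodes in document order, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(text)


def extract_fields(html, labels):
    """Map each known label to the text node that directly follows it."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    texts = collector.texts
    result = {}
    for i, text in enumerate(texts[:-1]):
        key = text.rstrip(":").lower()
        if key in labels and key not in result:
            result[key] = texts[i + 1]
    return result


# Hypothetical labels and documents, for illustration only.
LABELS = {"starting capital", "interest", "amortization"}

doc_a = (
    "<table>"
    "<tr><td>Starting capital</td><td>250,000</td></tr>"
    "<tr><td>Interest</td><td>1,200</td></tr>"
    "<tr><td>Amortization</td><td>800</td></tr>"
    "</table>"
)

doc_b = (
    "<div><span>Starting capital:</span> <b>250,000</b></div>"
    "<div><span>Interest:</span> <b>1,200</b></div>"
    "<div><p>Amortization:</p><p>800</p></div>"
)

fields_a = extract_fields(doc_a, LABELS)
fields_b = extract_fields(doc_b, LABELS)
print(fields_a)
print(fields_b)
```

Both sample documents give the same fields here even though one uses a table and the other uses nested divs. What I am really after is whether a learned model could handle the cases where a hand-written rule like this stops working.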
Appreciate all suggestions and comments. Love this site and the community for its openness and collaboration.