NN to extract data from HTML?


(Steindor Ellertsson) #1

Hey guys!

I have a question about extracting data from a structured format like HTML, and I am hoping that someone might be able to point me in the right direction. I suspect this might be an active area of research, since I could find a lot of info on it after a couple of hours of googling around.

How would one go about extracting data from a structured document using DL/ML? Suppose you have a large dataset of HTML documents, all with a very similar structure, containing information about mortgage payments. There might be subtle differences in the HTML structure between documents, but they would all be very similar, and the biggest change between documents would be the mortgage payment numbers.

In the HTML you would find the usual kind of data a mortgage bill contains: the starting capital, interest, amortization, etc. I am wondering whether it's possible to make a NN learn the structure of the document and then extract the right information. Writing a RegEx for this wouldn't work, since there are subtle changes in the HTML structure between documents.

Appreciate all suggestions and comments. Love this site and the community for its openness and collaboration.


#2

I was actually thinking about building something similar, but for text content on news story pages. I currently have a bunch of scrapers for multiple sites, and they are a pain in the butt to maintain. If you have a bunch of static HTML docs, however, something like Scrapy or Beautiful Soup might be a better choice.
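For static documents with a stable layout, a parser-based approach really can be a few lines. Here is a minimal sketch pulling label/value pairs out of a simple table; the thread mentions Beautiful Soup, but the standard-library html.parser is used here just to keep the sketch dependency-free, and the bill layout and field names are made up for illustration:

```python
from html.parser import HTMLParser

class RowExtractor(HTMLParser):
    """Collect (label, value) pairs from two-cell table rows."""
    def __init__(self):
        super().__init__()
        self.rows, self.cells, self.in_td = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
        elif tag == "tr":
            self.cells = []

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False
        elif tag == "tr" and self.cells:
            self.rows.append(tuple(self.cells))

    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())

# Hypothetical mortgage-bill fragment
html = """<table>
<tr><td>Interest</td><td>123.45</td></tr>
<tr><td>Amortization</td><td>678.90</td></tr>
</table>"""

p = RowExtractor()
p.feed(html)
fields = dict(p.rows)
print(fields)  # {'Interest': '123.45', 'Amortization': '678.90'}
```

Of course, this is exactly the kind of code that breaks when the site's markup shifts, which is what motivates the model-based ideas below.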

What I was thinking was an approach similar to the IMDb sentiment classifier: first train a "language model", except in this case it would be a language model of HTML from various web pages, then process it as usual (I'm not sure yet whether tokenizing all the JS and HTML tags will be a problem). The part I'm hung up on is labeling the dataset, since what would really be happening is predicting the probability that an individual token is text rather than HTML or JS, and I'm not sure of the approach there. Is an HTML document an array of embeddings? Probably, but then how do I efficiently make predictions on a token-by-token basis? Is that the wrong approach? idk!
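One way around the labeling hang-up: an HTML parser already knows which spans are markup and which are visible text, so it can generate token labels automatically, and a sequence model could then train on the resulting (token, is_text) pairs. A minimal sketch with the standard library (the tokenization here is deliberately naive, just to show the data shape):

```python
from html.parser import HTMLParser

class Labeller(HTMLParser):
    """Emit (token, label) pairs: label 1 = visible text, 0 = markup."""
    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        self.pairs.append((f"<{tag}>", 0))

    def handle_endtag(self, tag):
        self.pairs.append((f"</{tag}>", 0))

    def handle_data(self, data):
        for tok in data.split():
            self.pairs.append((tok, 1))

lab = Labeller()
lab.feed("<p>Monthly interest <b>123.45</b></p>")
print(lab.pairs)
# [('<p>', 0), ('Monthly', 1), ('interest', 1), ('<b>', 0),
#  ('123.45', 1), ('</b>', 0), ('</p>', 0)]
```

The parser does the labeling for free on any page it can parse, so the expensive part becomes collecting diverse pages, not hand-annotating tokens.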


(Steindor Ellertsson) #3

Yes, exactly. I have already built this function using BeautifulSoup, and it works in around 90-95% of cases, but some documents have a 3-4 character change in the text structure, and then the RegEx isn't enough and everything breaks.
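Since the breakage comes from small markup or whitespace changes, one thing worth trying before a model is a more tolerant pattern that absorbs them (flexible whitespace between tags, for instance). A sketch, with illustrative field names:

```python
import re

# Two hypothetical versions of the same bill fragment; doc_b has a few
# extra characters of whitespace that would break an exact-match pattern.
doc_a = "<td>Interest</td><td>123.45</td>"
doc_b = "<td>Interest </td> <td> 123.45</td>"

# \s* soaks up optional whitespace wherever the documents vary.
pattern = re.compile(r"Interest\s*</td>\s*<td>\s*([\d.]+)")

for doc in (doc_a, doc_b):
    print(pattern.search(doc).group(1))  # 123.45 both times
```

This obviously only covers variation you can anticipate, which is the limitation that motivates the learned approach.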

Regarding the HTML and how to treat it, it is probably best to tokenize it and treat it as text. One would think the model would learn to put lower weights on the less important HTML tags, since they would be much the same across documents and not carry much signal for the classification task. It certainly depends on the type of data you want to extract.

I have a problem with structuring and labeling the data as well. Classifiers usually train on a fixed set of labels, so they can only predict the probability that a given document contains values from that specific label set.

I need the model to be able to say: given these 10-15 tokens, extract the next 5 tokens, or something of that kind.
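That "given these tokens, extract the next few" framing maps naturally onto sequence tagging: each token gets a label saying whether it is part of the value to extract, as in NER-style BIO tagging. A sketch of what the training data would look like, with made-up tokens and a hypothetical B-VAL label:

```python
# Aligned token and label sequences: the model reads the surrounding
# context tokens and marks the span to pull out, rather than relying on
# an exact character offset that breaks when the markup shifts.
tokens = ["<td>", "Interest", "</td>", "<td>", "123.45", "</td>"]
labels = ["O",    "O",        "O",     "O",    "B-VAL",  "O"]

train_example = list(zip(tokens, labels))
print(train_example)
```

A tagger trained on pairs like this can tolerate small insertions or deletions around the value, because it predicts per token instead of matching a fixed pattern.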

Maybe using images and concatenating them with the HTML might be a way to go? Like pix2code?

Maybe this might be a model that could work? https://blog.floydhub.com/turning-design-mockups-into-code-with-deep-learning/


#4

Yeah, I thought about doing an OCR-type thing, but that would probably be way slower than working with characters. You'd have to run a headless browser and take screenshots of every single page, which would take much longer, with the added bonus that headless browsers crash all the time when they inevitably eat all the memory on the system.

Unfortunately, we may be overthinking this quite a bit. I'm honestly not sure the processing needed for an entire page of HTML is more efficient than just using regex or a parsing library. Something like:

I need the model to be able to say: I have this 10 - 15 tokens, now I need to extract the next 5 tokens - or something of that kind.

sounds a lot like positive lookbehind in regex, which in a backtracking engine can blow up exponentially in the worst case. Not sure what the time complexity of the NN would be; maybe O(n)?
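For reference, here's what that lookbehind looks like in Python's re module; the field name is a made-up example, and note that Python requires the lookbehind pattern itself to be fixed-width:

```python
import re

# Hypothetical bill text; (?<=...) asserts what must precede the match
# without including it in the result.
text = "Interest: 123.45 Amortization: 678.90"
m = re.search(r"(?<=Interest: )\d+\.\d+", text)
print(m.group())  # 123.45
```

This matches the "I know the 10-15 tokens before it, give me what comes next" shape, as long as the preceding context really is fixed.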


(Steindor Ellertsson) #5

Yeah, you might be right. Positive lookbehind looks like something that might work. I'll try that first before building a complex DL/ML solution. The right solution is usually the simplest one :)

Thanks for the help!

