I’m building a model to classify clauses within legal documents. Instead of trying to classify the entire document (searching for a needle in a haystack), I’m thinking of providing better supervision by training a model to classify each paragraph or text snippet.
How would you suggest splitting a variety of legal documents into their separate clauses? My impression is that a solution should exist, because the equivalent is possible with images (e.g., bounding box detection). But NLP seems to work a bit differently.
I’m considering training a seq-to-seq RNN to automatically annotate a document with clause beginning and ending tags. Would that work, given that legal documents are long texts?
So the input document could be:
1. This is some important clause.
2. Cool guys include:
i) Erik
ii) Sam
iii) Teddy
3. Bad guys include:
i) Gary
ii) Jennifer
But only when Gary is drunk.
And the output of the model would be as shown below. {*} and {/*} are the annotations added by the model to break the document into separate parts.
{*}1. This is some important clause.{/*}
{*}2. Cool guys include:
i) Erik{/*}
{*}2. Cool guys include:
ii) Sam{/*}
{*}2. Cool guys include:
iii) Teddy{/*}
{*}3. Bad guys include:
i) Gary
But only when Gary is drunk.{/*}
{*}3. Bad guys include:
ii) Jennifer
But only when Gary is drunk.{/*}
Are there any other possible solutions I should consider?
@echan00 did you end up making progress on this? I’m looking at a closely related problem: comparing clauses across similar contracts. Without a large amount of training data, I would think (and I don’t consider myself an authority here) that the best way to make the transformation you’re talking about would be to build it into a parser that restructures the text so that it can be usefully parsed by pre-existing models, one clause at a time, into triples or whatever.
Could you extract the text first, and then deal with the text/clause bounding via a combination of regex and/or NLP tools afterwards? From what I’m reading you’re trying to do the separation on the IMAGES first, and then extract the text once you’ve clipped those images?
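To make the regex idea concrete, here is a minimal sketch that reproduces the expansion from the example in the first post, using only the standard library. It rests on assumptions that are mine and not from the thread: top-level clauses start with "1.", "2.", and so on; sub-items start with lowercase roman numerals such as "i)"; and any unnumbered line after the sub-items is a qualifier that applies to every sub-item of that clause. The names `split_clauses`, `TOP_LEVEL`, and `SUB_ITEM` are hypothetical. Real contracts use many more numbering schemes, so treat this as a starting point and not a general solution.

```python
import re

# Assumed numbering conventions (taken from the toy example only).
TOP_LEVEL = re.compile(r"^\d+\.\s")     # e.g. "2. Cool guys include:"
SUB_ITEM = re.compile(r"^[ivx]+\)\s")   # e.g. "ii) Sam"

DOC = """1. This is some important clause.
2. Cool guys include:
i) Erik
ii) Sam
iii) Teddy
3. Bad guys include:
i) Gary
ii) Jennifer
But only when Gary is drunk."""


def split_clauses(text):
    """Return one snippet per sub-item, each repeating its parent clause."""
    clauses = []  # each entry: header lines, sub-items, trailing qualifier lines
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if TOP_LEVEL.match(line) or not clauses:
            # A new numbered clause (or the very first line of the document).
            clauses.append({"header": [line], "items": [], "trailer": []})
        elif SUB_ITEM.match(line):
            clauses[-1]["items"].append(line)
        elif clauses[-1]["items"]:
            # Unnumbered line after sub-items: assumed to qualify all of them.
            clauses[-1]["trailer"].append(line)
        else:
            # Unnumbered line before any sub-item: continuation of the header.
            clauses[-1]["header"].append(line)

    snippets = []
    for clause in clauses:
        if not clause["items"]:
            snippets.append("\n".join(clause["header"]))
            continue
        for item in clause["items"]:
            snippets.append("\n".join(clause["header"] + [item] + clause["trailer"]))
    return snippets


if __name__ == "__main__":
    for snippet in split_clauses(DOC):
        print("{*}" + snippet + "{/*}")
```

Each returned snippet can then be passed to a per-clause classifier. If the rules turn out to be too brittle for a given set of documents, the same output could serve as weak labels for training a learned segmenter.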
Take a look at spaCy and NER. I also wonder about these people: how unique are they? There may be more than one person with the same name, so identifying them by more than just their name is also a concern. Then, once you have identified them, can you classify them as bad by what they do? For example, “Teddy killed his neighbour” (bad). Or, more complicated: “Teddy arrived in his car, then he drove over Janice”. This requires coreference resolution.