Parsing legal clauses from contracts


I’m building a model to classify clauses within legal documents. Instead of trying to classify the entire document (searching for a needle in a haystack), I’m thinking of providing better supervision by training a model to classify each paragraph or text snippet.

How would you suggest splitting a variety of legal documents into their separate clauses? My impression is that a solution should exist, because the equivalent is possible with images (e.g. bounding box detection). But NLP seems to work a bit differently.

I’m considering training a seq-to-seq RNN to automatically annotate a document with clause beginning and ending tags. Would that work since legal documents are long texts?

So the input document could be:

1. This is some important clause.

2. Cool guys include:
i) Erik
ii) Sam
iii) Teddy

3. Bad guys include:
i) Gary
ii) Jennifer
But only when Gary is drunk.

And the output of the model would be as shown below. {*} and {/*} are the annotations added by the model to help break the document into separate parts.

{*}1. This is some important clause.{/*}

{*}2. Cool guys include:
i) Erik{/*}

{*}2. Cool guys include:
ii) Sam{/*}

{*}2. Cool guys include:
iii) Teddy{/*}

{*}3. Bad guys include:
i) Gary
But only when Gary is drunk.{/*}

{*}3. Bad guys include:
ii) Jennifer
But only when Gary is drunk.{/*}
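For comparison, the target output for this toy example can also be produced by a purely rule-based baseline. The sketch below is only an illustration under strong assumptions: top-level clauses start with "1. ", "2. ", and so on; sub-items start with a roman numeral followed by ")"; and any other line inside a clause is a qualifier that applies to every sub-item. The names `TOP`, `SUB` and `expand_clauses` are made up for this example, and real contracts will break these rules often.

```python
import re

# Assumed numbering conventions (hypothetical, taken from the toy example only).
TOP = re.compile(r"^\d+\.\s")                       # "1. ", "2. ", ...
SUB = re.compile(r"^[ivxlc]+\)\s", re.IGNORECASE)   # "i) ", "ii) ", ...


def expand_clauses(text):
    """Return one snippet per leaf clause.

    Each snippet repeats the parent heading, then one sub-item (if any),
    then every unnumbered line found in that clause.
    """
    clauses = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if TOP.match(line):
            clauses.append({"heading": line, "subs": [], "tail": []})
        elif not clauses:
            continue  # text before the first numbered clause is ignored
        elif SUB.match(line):
            clauses[-1]["subs"].append(line)
        else:
            clauses[-1]["tail"].append(line)

    snippets = []
    for clause in clauses:
        # A clause without sub-items still yields one snippet.
        for sub in clause["subs"] or [None]:
            parts = [clause["heading"]]
            if sub is not None:
                parts.append(sub)
            parts.extend(clause["tail"])
            snippets.append("\n".join(parts))
    return snippets


if __name__ == "__main__":
    doc = """1. This is some important clause.

2. Cool guys include:
i) Erik
ii) Sam
iii) Teddy

3. Bad guys include:
i) Gary
ii) Jennifer
But only when Gary is drunk.
"""
    for snippet in expand_clauses(doc):
        print("{*}" + snippet + "{/*}\n")
```

A baseline like this could also be used to generate noisy training labels for a learned tagger, assuming the rules hold for at least part of the corpus.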

Are there any other possible solutions I should consider?

Check out the Python Natural Language Toolkit (NLTK).

@clipmaker is there something in the library that you think would help solve this problem? I’m not aware of anything.

@echan00 did you end up making progress on this? I’m looking at a closely related problem: comparing clauses across similar contracts. Without a large amount of training data, I would think (and I don’t consider myself an authority in this case) that the best way to make the transformation you’re talking about is to build it into a parser that restructures the text so that pre-existing models can usefully parse it, one clause at a time, into triples or whatever.

Could you extract the text first, and then deal with the text/clause bounding via a combination of regex and/or NLP tools afterwards? From what I’m reading you’re trying to do the separation on the IMAGES first, and then extract the text once you’ve clipped those images?
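A minimal sketch of the regex route, assuming the text has already been extracted and that every top-level clause starts a line with a number followed by a period and a space. The pattern and the function name `split_clauses` are hypothetical; other numbering schemes (e.g. "Section 4" or "(a)") would need their own patterns.

```python
import re

# Assumption: a top-level clause begins at the start of a line with "<number>. ".
CLAUSE_START = re.compile(r"^\d+\.\s", re.MULTILINE)


def split_clauses(text):
    """Split extracted text into top-level clauses.

    Sub-items and unnumbered lines stay attached to the clause they follow.
    Any text before the first numbered clause is dropped.
    """
    starts = [m.start() for m in CLAUSE_START.finditer(text)]
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    sample = (
        "1. This is some important clause.\n\n"
        "2. Cool guys include:\ni) Erik\nii) Sam\n"
    )
    for clause in split_clauses(sample):
        print(repr(clause))
```

Each returned clause can then be passed to whatever NLP tooling handles the finer-grained splitting or classification.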

Take a look at spaCy and NER. I also wonder about these people: how unique are they? There may be more people with the same name, so identifying them by more than just their name is also a concern. Then, once you have identified them, can you classify them as bad by what they do? For example, “Teddy killed his neighbour” (bad). Or, more complicated: “Teddy arrived in his car, then he drove over Janice”. This requires co-reference resolution.


I’m also looking for a solution to the same problem. Please help if you were able to solve this. Thanks.