Hello,
I’m building a model to classify clauses within legal documents. Instead of trying to classify the entire document (searching for a needle in a haystack), I’m thinking of providing better supervision by training a model to classify each paragraph or text snippet.
How would you suggest splitting a variety of legal documents into their separate clauses? My impression is that a solution should exist, since the analogous task is possible with images (e.g., bounding-box detection), but NLP seems to work a bit differently.
I’m considering training a seq-to-seq RNN to automatically annotate a document with clause beginning and ending tags. Would that work, given that legal documents are long texts?
So the input document could be:
1. This is some important clause.
2. Cool guys include:
i) Erik
ii) Sam
iii) Teddy
3. Bad guys include:
i) Gary
ii) Jennifer
But only when Gary is drunk.
The output of the model would then be as shown below. {*} and {/*} are the annotations the model adds to break the document into separate parts. Note that the parent line is repeated for each of its sub-items.
{*}1. This is some important clause.{/*}
{*}2. Cool guys include:
i) Erik{/*}
{*}2. Cool guys include:
ii) Sam{/*}
{*}2. Cool guys include:
iii) Teddy{/*}
{*}3. Bad guys include:
i) Gary
But only when Gary is drunk.{/*}
{*}3. Bad guys include:
ii) Jennifer
But only when Gary is drunk.{/*}
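For comparison, a rule-based baseline that reproduces the example above can be sketched in a few lines of Python. This is only a sketch under assumptions that may not hold for real contracts: top-level clauses start with a number and a period, sub-items start with a roman numeral and a closing parenthesis, and any unnumbered line that follows the sub-items applies to the whole parent clause. The names `split_clauses`, `TOP_LEVEL` and `SUB_ITEM` are made up for illustration.

```python
import re

# Assumed numbering conventions (hypothetical, taken from the toy example only).
TOP_LEVEL = re.compile(r"^\d+\.\s")                        # "1. ", "2. ", ...
SUB_ITEM = re.compile(r"^[ivxlcdm]+\)\s", re.IGNORECASE)   # "i) ", "ii) ", ...


def split_clauses(text):
    """Split a numbered document into self-contained clause snippets."""
    sections = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if TOP_LEVEL.match(line) or not sections:
            # A new top-level clause (or the very first line of the document).
            sections.append({"head": [line], "items": [], "tail": []})
        elif SUB_ITEM.match(line):
            sections[-1]["items"].append(line)
        elif sections[-1]["items"]:
            # Assumption: unnumbered text after sub-items belongs to the parent.
            sections[-1]["tail"].append(line)
        else:
            # Unnumbered text before any sub-item continues the parent line.
            sections[-1]["head"].append(line)

    clauses = []
    for section in sections:
        if not section["items"]:
            clauses.append("\n".join(section["head"]))
        else:
            # Repeat the parent line and the trailing text for every sub-item.
            for item in section["items"]:
                clauses.append("\n".join(section["head"] + [item] + section["tail"]))
    return clauses


DOCUMENT = """1. This is some important clause.
2. Cool guys include:
i) Erik
ii) Sam
iii) Teddy
3. Bad guys include:
i) Gary
ii) Jennifer
But only when Gary is drunk."""

for clause in split_clauses(DOCUMENT):
    print("{*}" + clause + "{/*}")
```

Real legal documents use many more numbering schemes and nesting levels, so rules like these would probably serve best as a baseline, or as a way to generate weak labels for training a learned model.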
Are there any other possible solutions I should consider?