LegalTech: Information Extraction in legal documents

NaturalTech
7 min read · Feb 7, 2020


Keywords: LegalTech, Information Extraction, text mining, statistical methods, n-grams, named-entity recognition (NER), big data, legal documents

1. NLP and LegalTech

The legal industry is another vertical where NLP technologies have been flourishing in recent years; this vertical is also known as LegalTech. The areas of growth in LegalTech focus on:

· Providing tools or a marketplace to connect clients with lawyers

· Providing tools for consumers and businesses to complete legal matters by themselves

· Data and contract analytics for e-discovery of insightful relationships

· Automation of legal writing and other substantive aspects of legal practice

Technological applications in contract management, e-discovery, and other high-volume areas are standardizing, automating, and ‘productizing’ what were once labor-intensive tasks performed by lawyers at law firms.

Law firms manage tens of thousands of cases, each containing related legal documents (usually contracts or licenses, their amendments, invoices, blueprints, letters, etc.). Currently, looking up a specific piece of information is almost impossible. Our goal is to automate the processing of these cases: to categorize each document and to extract the most critical data about the contract subject, the mentioned parties, terms, and fees. Information extraction, along with full-text search, would enable the owners to filter their documents and easily perform other reporting tasks and analyses.

For example, extracting key data elements from contracts is not an easy task. Done manually, you would need to open each and every contract and search through the full document for many different criteria. Time studies have shown that extracting any given attribute (namely, the parties and the parameters of the clauses for each type of contract) takes about 2 minutes per attribute. If you plan to extract 30 metadata elements (for example: the Landlord, the Tenant, the rent, etc. for lease contracts), that is about an hour per contract. With 10,000 contracts, it adds up to 10,000 person-hours.

In this post, we will focus on how to extract information from a huge data set of legal documents, specifically contracts, by appealing to NLP techniques in order to save many of those person-hours.

2. The need for an NLP hybrid approach: statistical methods and named-entity recognition

Machine Learning (ML) techniques, especially those based on the new Deep Learning paradigm, have proved to be very good at text classification and are widely adopted by the industry. However, those techniques require a training corpus tagged with the criteria the classifier is intended to induce. Thus, ML-based text classification algorithms could easily identify types of contracts (lease, indemnity, license agreements, etc.), but they would hardly perform well at extracting legal terms and parties for each one of those types of contracts, unless we had thousands of documents previously tagged for those entities. Besides, each classifier would be domain-dependent, so the need for new tagged data sets would continue as long as we face new types of contracts.

Document classification (image credit: Gunjit Bedi)

On the other hand, all contracts and legal documents provide a reliable hint for getting insights about the parties and the terms mentioned in the document: capitalized phrases. From a full parser’s point of view, those phrases act as if they were Named-Entities (traditionally referring to Persons, Organizations, or Places) within the grammatical logic of sentences. That’s why we will call them “false Named-Entities”.

We can take advantage of this writing practice, widely adopted in legal documents, as Named-Entity Recognition (NER) is an NLP task with highly effective results, easily adaptable to general-domain scenarios and at a very low cost in linguistic resources, since it does not require a training corpus.

Of course, getting an immediate insight into the parties and legal terms for each type of contract is not enough to understand the complex relationships between those entities: for example, ‘who’ pays ‘whom’ ‘what amount’ under ‘which conditions’ in a lease contract. That’s why we will induce this kind of relationship through a statistical approach called n-grams, taking the most informative contexts around those “false Named-Entities” that the collection of documents converges on.

Contracts are especially challenging, complex, long documents. First, they convey a specific structure (whereas clauses, exhibits, definitions, etc.). Second, each genre of legal text tends to have its own typical format. As our approach will take advantage of that specificity for each type of contract (lease, license agreement, share purchase, etc.), we need to prepare different data sets. The following website offers plenty of resources to gather:

https://contracts.onecle.com/

For the purpose of this post, we will work with 15 lease contracts of around 50 pages each.

As mentioned above, all contracts appeal to capitalized letters in order to point out the legal terms and parties described in the documents. Each type of contract entails specific parties and terms:

· Lessor, Lessee, Tenant, Landlord, etc. for lease contracts

· Indemnitee, Indemnifier, Indemnitor, etc. for indemnity contracts

· Acquired Entity, Common Stock, Series A Participating Preferred Stock, Subscriber, etc. for a share-purchase agreement

We expect the documents to be consistently redundant about the conditions involving those salient “false named-entities”. Thus, they will act as seeds around which different instances of contracts of the same type express the same kinds of relationships (at most parameterized differently for each instance).

An example of a parameterized convergent string around the typical Maintenance legal clause of a lease contract follows:

“Tenant agrees to reimburse Landlord for such maintenance and service at the rate of $RATE per 1,000 square feet, in the amount of CURRENCY AMOUNT per month as additional rent.”

As we can see, “legalese” utterances around legal terms and involved parties follow a fairly parameterized pattern. Each individual instance of a contract may define which legal clauses are invoked, with which parameters, and which ones are not.
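To make the idea concrete, here is a minimal sketch of how such a parameterized clause can be matched once the template is known. The regex, the parameter names rate and amount, and the sample clause are all illustrative assumptions, not part of our pipeline:

```python
import re

# Hypothetical pattern for the Maintenance clause above: the party names are
# fixed, while RATE and CURRENCY AMOUNT are per-contract parameters.
maintenance_clause = re.compile(
    r"Tenant agrees to reimburse Landlord for such maintenance and service\s+"
    r"at the rate of \$(?P<rate>[\d.,]+) per 1,000 square feet,\s+"
    r"in the amount of \$?(?P<amount>[\d.,]+) per month as additional rent"
)

sample = ("Tenant agrees to reimburse Landlord for such maintenance and "
          "service at the rate of $2.50 per 1,000 square feet, in the "
          "amount of $425.00 per month as additional rent.")

match = maintenance_clause.search(sample)
if match:
    print(match.group("rate"), match.group("amount"))  # 2.50 425.00
```

In practice, of course, we do not know the templates in advance; the point of the n-gram stage below is precisely to induce these converging contexts from the data.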

3. Insight Discovery (ID): first stage with “false named-entities”

We will refer to this NLP-based experiment as Insight Discovery (ID). The whole approach involves two stages: identification of “false named-entities” and induction of n-grams.

We split the original input text on the ‘\n\n’ separator (two newline symbols in a row), as some contracts were uploaded to the website with their sentences broken by a problematic single newline symbol inside the string. Then, we iterate sentence by sentence and token by token, extracting and counting those “false named-entities”.

Also, we cut the list below 2 occurrences, as we expect those “false named-entities” to act as seeds for convergent n-grams. Namely, a single occurrence of a certain “false named-entity” cannot yield two or more partially overlapping strings containing it, which is the rationale behind n-gram models: https://en.wikipedia.org/wiki/N-gram
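A minimal sketch of this first stage might look as follows. The regex for capitalized phrases is a simplifying assumption of ours; in particular, it does not exclude sentence-initial capitalization, which a real implementation should handle:

```python
import re
from collections import Counter

def false_named_entities(text, min_count=2):
    """Count capitalized phrases ("false named-entities") in a contract."""
    counts = Counter()
    # Split on '\n\n' only: some contracts come with sentences wrapped by a
    # stray single newline, so a single '\n' is not a sentence boundary.
    for paragraph in text.split("\n\n"):
        sentence = " ".join(paragraph.split())  # undo mid-sentence wrapping
        # One capitalized token, optionally followed by more capitalized
        # tokens, e.g. "Landlord" or "Series A Participating Preferred Stock".
        for phrase in re.findall(r"[A-Z][a-zA-Z]*(?:\s+[A-Z][a-zA-Z]*)*",
                                 sentence):
            counts[phrase] += 1
    # Cut the list below min_count occurrences: a phrase seen only once
    # cannot seed two or more partially overlapping n-grams.
    return {phrase: n for phrase, n in counts.items() if n >= min_count}
```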

4. Insight Discovery (ID): second stage with n-grams and sementities mapping

Once identified, the “false named-entities” will act as seeds for n-grams. This statistical approach exploits the natural convergence among strings with overlapping partitions. A useful n-gram is the most significant, longest string containing the seed (from bigrams up to 20-grams), so that it captures the syntactic context where the seed tends to occur. Each “gram” is a token slot to be occupied by a token or a multi-token phrase.
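As a sketch of the idea, the following helper enumerates every n-gram of a sentence, from bigrams up to 20-grams, and keeps those containing at least one seed. The helper name seed_ngrams is ours; NLTK only provides the tokenizer and the ngrams utility:

```python
from nltk import ngrams, word_tokenize  # requires nltk.download('punkt') once

def seed_ngrams(sentence, seeds, n_min=2, n_max=20):
    """Yield each n-gram (n_min..n_max tokens) that contains a seed."""
    tokens = word_tokenize(sentence)
    for n in range(n_min, min(n_max, len(tokens)) + 1):
        for gram in ngrams(tokens, n):
            # Substring check so multi-token seeds also match.
            if any(seed in " ".join(gram) for seed in seeds):
                yield gram
```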

We will need to boost the n-gram model with some linguistic features (a combined sketch follows this list):

a) Identification of stopwords through a list: Stopwords are useful to articulate linguistic utterances, acting as “glue dough” between content words. However, they convey no meaning on their own, so it is useless to let them fill the right-most or left-most slots of an n-gram. Therefore, they are only allowed in other positions, where they clarify the linguistic presentation of content words.

b) Filtering real proper nouns against a dictionary of common words: In some large documents, repeated occurrences of proper nouns in the same linguistic contexts would lead to artificial convergence WITHIN a single document. In other words, if a contract agrees on the lease of Norriton Business Campus, located in East Norriton Township, Montgomery County, Pennsylvania, that long n-gram is likely to appear as a repeated string only in that very contract and is not expected to show convergence with other lease contracts. That’s why we need to filter out resulting n-grams that contain any real proper nouns.

c) Sentence boundaries, used to optimize the combination of string partitions and thus reduce the n-gram combinatorics.

d) Iteration of n-grams from bigrams (2-grams) to 20-grams via partitions of a string (sentence): The magic here is performed pretty much by the Python library NLTK (https://www.nltk.org/) and its nltk.probability.FreqDist class.

e) Normalized score for n-grams: Shorter n-grams are more likely to occur than longer ones, so raw occurrence counts are not a fair criterion to sort the resulting n-grams. You may want to take into account the length and also the number of seeds (“false named-entities”) contained. For example, the following normalized score could lead to a relevant ordering of the output:

normalized_score = occurrences * log2(length_of_ngram_in_tokens) * seeds
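Putting features (a), (b), and (e) together, a sketch of the ranking step could look like this. The stopword and proper-noun sets and the helper name rank_ngrams are assumptions for illustration; the score is the formula above:

```python
from math import log2
from nltk.probability import FreqDist

def rank_ngrams(candidate_ngrams, seeds, stopwords, proper_nouns):
    """Count candidate n-grams and sort them by the normalized score above."""
    freq = FreqDist(candidate_ngrams)  # occurrences of each n-gram, as in (d)
    ranked = []
    for gram, occurrences in freq.items():
        # (a) stopwords may not fill the left-most or right-most slot
        if gram[0].lower() in stopwords or gram[-1].lower() in stopwords:
            continue
        # (b) drop n-grams containing real proper nouns, which only
        # converge within a single document
        if any(token in proper_nouns for token in gram):
            continue
        text = " ".join(gram)
        seeds_contained = sum(1 for seed in seeds if seed in text)
        if seeds_contained == 0:
            continue
        # (e) normalized_score = occurrences * log2(length in tokens) * seeds
        score = occurrences * log2(len(gram)) * seeds_contained
        ranked.append((score, gram))
    return sorted(ranked, reverse=True)
```

Sorting by this score pushes long, seed-rich, frequently repeated contexts to the top, which is exactly where the convergent “legalese” patterns live.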

You may also like:

Aspect-based Sentiment Analysis vs. Plain Sentiment Analysis: SentiLecto NLU capabilities explained

Benchmarking: How does SentiLecto NLU API perform in Portuguese with respect to other APIs? (part 1)

Benchmarking: How does SentiLecto NLU API perform in Spanish with respect to other APIs? (part 1)

Benchmarking: How does SentiLecto NLU API perform in Spanish with respect to other APIs? (part 2)

Detecting breaking and hot news in real time. NewsFlow: the evolution of news

Natural Language Generation & Robojournalism: How TecnoNews automatically publishes news more informative than sources

The role of knowledge graphs in robojournalism at SentiLecto project


NaturalTech

NaturalTech is a technology company with a specific goal: make computers understand natural language the way native speakers do.