Technology deep dive (but don’t be put off if you are non-technical; we will keep it high-level)
Last week we discussed why information extraction and automated document summaries for Due Diligence are such a difficult nut to crack.
This week we discuss how it can be done, with AI. But not just any AI. AI is a rather fluid label. Some people even claim that techniques like Regular Expressions constitute AI. While regex certainly has a use case, it has major downsides for information extraction and summary purposes, as we discussed last week.
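To illustrate the brittleness of pattern-based approaches, here is a minimal, hypothetical sketch (the regex and sentences are invented for illustration, not taken from any real contract): a regex that captures a termination date only matches one surface form and misses a paraphrase carrying the same meaning.

```python
import re

# A hypothetical pattern for dates written as "terminates on DD Month YYYY".
pattern = re.compile(r"terminates on (\d{1,2} \w+ \d{4})")

matched = "This agreement terminates on 31 December 2025."
paraphrased = "Termination of this agreement shall occur no later than 31 December 2025."

print(pattern.search(matched))      # finds the date
print(pattern.search(paraphrased))  # None: same meaning, different wording
```

A human reader immediately sees that both sentences convey the same termination date, but the pattern only finds the first; covering every phrasing with more patterns quickly becomes unmanageable.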
Sound familiar? Then you must have read our post on Large Language Models for Automated Redaction. Indeed, in essence it is the exact same problem. Whether you want to redact information in a document, or extract information from a document, in both cases you have to find that information first… and traditional methods don’t cut it.
That said, you may still want to continue reading: it is a similar nut to crack, but not the same one. The devil is in the details (isn’t it always?).
Let’s first discuss the key technology underlying both solutions.
Machine Learning and Large Language Models (LLMs)
The key technology is machine learning. But again, not just any machine learning. You need a technique that uses the semantics of the text containing the information, and those semantics can only be captured by using the whole, unabridged text and the sequence of its “tokens” (words, numbers, etc.). We have found that the only way to do this reliably and accurately is with “Large Language Models” (the technology underlying ChatGPT).
What is a Large Language Model?
Essentially an LLM is a Neural Network, trained on a large body of text (e.g., Wikipedia, Common Crawl etc.). LLMs can predict the next word (or word in between), given its preceding (or surrounding) sequence of words. They are trained to do that by leaving words out, to fill in the blanks, so to speak. You may know the so-called Cloze test, which tests your ability to fill in the blanks, e.g. “monkeys like to … bananas“. You could say that LLMs are trained to ace that test!
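The fill-in-the-blank objective can be illustrated with a deliberately tiny sketch. This is of course nothing like a real LLM (which uses a neural network trained on billions of tokens); it merely shows the Cloze-style idea of predicting a missing word from its surrounding context, using an invented three-sentence corpus and simple counting.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for the web-scale text that real LLMs train on.
corpus = [
    "monkeys like to eat bananas",
    "children like to eat apples",
    "monkeys like to climb trees",
]

# "Training": for each word, count how often it fills the blank between
# its left and right neighbours (a crude Cloze-style objective).
blanks = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i in range(1, len(words) - 1):
        context = (words[i - 1], words[i + 1])
        blanks[context][words[i]] += 1

def fill_blank(left, right):
    """Predict the most likely word between two context words."""
    candidates = blanks.get((left, right))
    return candidates.most_common(1)[0][0] if candidates else None

print(fill_blank("to", "bananas"))  # -> "eat"
```

A real LLM does the same thing in spirit, but predicts from the entire preceding (or surrounding) text rather than just two neighbouring words, which is what lets it pick up grammar, style and, with enough data, semantics.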
As a result, these models learn the patterns of language, such as grammar and style (which is what makes ChatGPT so eloquent!), but also, given enough training data, semantics and logic. This is what enables ChatGPT to produce informative text, though not always accurately (an interesting read on this topic: ChatGPT is no stochastic parrot. But it also claims that 1 is greater than 1).
These models have the additional advantage that they can be trained without humans having to “label” the data first[1]. This makes it feasible to train LLMs on very large amounts of data with relatively little human effort, apart from some very smart data scientists and, of course, huge computing resources.
How we use Large Language Models at Imprima
At Imprima, we use an LLM customised for information extraction and the subsequent creation of automated summaries, trained on the types of documents typically found in DD. We have trained it in multiple languages, and it predicts across languages: accuracy in one language is improved by the training data in the others working in conjunction with it.
It is designed and trained to categorise single tokens (words, numbers, etc.), whole sentences and/or paragraphs (or clauses), so that it can subsequently find the information to be extracted. This has several advantages. Because it uses the structure of the text, it relies on the meaning of the items to be extracted rather than their exact wording. It therefore bypasses the major disadvantage of traditional techniques such as search or Regular Expressions (see last week’s blog post).
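Imprima’s actual model is proprietary, but the step from token-level categories to extracted information can be sketched generically. The example below assumes hypothetical token labels in the widely used BIO format (B- marks the beginning of an item, I- its continuation, O everything else) and shows how such labels are grouped into extracted spans; the tokens and labels are invented for illustration.

```python
# Hypothetical output of a token-classification step (invented for illustration).
tokens = ["The", "lease", "expires", "on", "31", "December", "2025", "."]
labels = ["O", "O", "O", "O", "B-DATE", "I-DATE", "I-DATE", "O"]

def extract_spans(tokens, labels):
    """Group B-/I- tagged tokens into labelled spans."""
    spans, current, kind = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                spans.append((kind, " ".join(current)))
            current, kind = [tok], lab[2:]
        elif lab.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                spans.append((kind, " ".join(current)))
            current, kind = [], None
    if current:
        spans.append((kind, " ".join(current)))
    return spans

print(extract_spans(tokens, labels))  # -> [("DATE", "31 December 2025")]
```

Note that the label depends on what the tokens mean in context, not on their surface form, which is precisely why this approach is robust to the rewording that defeats search and regex.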
That said, its overarching advantage is its accuracy and reliability, which go far beyond what any other automated information-extraction method can achieve. We showed this in a previous blog post (Smart VDRs – it is all about accuracy), and we will discuss it further in the next blog post.
Conclusion
Traditional search or regex techniques, although often presented as AI, will not do the job. Advanced AI, on the other hand – in particular when based on LLMs – can truly transform the Due Diligence process, as it enables the identification, extraction and summarisation of the key information relevant to DD.
Though many worry about AI’s potential threats to humanity (Runaway AI Is an Extinction Risk, Experts Warn), others focus mainly on its benefits for business (Can AI help you solve problems?), and so do we. That is why we embarked on this path, focusing our R&D efforts on AI as early as five years ago.
In next week’s blog post we will discuss how accurate and reliable LLM-based information extraction is, by showing results on actual DD data, and how it works in practice.
Stay tuned!
Are you looking for a VDR with fully integrated AI-DD software? Speak to our sales team or check out our Smart Summaries page here.
Footnotes
[1] … at least before fine-tuning and/or reinforcement learning.