Optimizing Regulatory Monitoring: An Evaluation of RAG RegAssist Performances

Heka.ai
Nov 15, 2024


Introduction

Each year, companies face a growing number and scope of obligations. Failing to comply with these requirements exposes them to legal and reputational consequences, with impacts on growth. For a robust and effective compliance function, it is essential to be aware of upcoming rules requiring the design and implementation of appropriate compliance programs.

RegAssist, the chatbot of RegReview — a Heka product — proves to be a valuable asset for users in that context. It enables them to automatically collect, qualify, and pre-analyze regulatory updates by asking questions across a large set of documents. This functionality allows users to skip these initial steps and focus on thorough analyses, ensuring a more efficient and comprehensive compliance process.

A regulatory watch search engine like RegReview shows users what is new in the regulatory compliance arena. The RegAssist feature enhances the interactivity of the search engine by allowing users to ask questions directly about the texts. It also identifies and displays the relevant passages where the requested information is located.

To better meet the challenges of regulatory monitoring, recent improvements have been made to RegAssist to incorporate time-sensitive responses and source-associated techniques, as developed in this previous technical article.

Given the critical nature of regulatory compliance and the need for timely responses, maintaining optimal performance for RegAssist is crucial. An evaluation process had to be set up to assess the chatbot’s improvements and quantify its relevance. This article outlines our approach to evaluating RegAssist. We will start by presenting a business use case built specifically for this article, followed by a detailed exploration of the overall approach. The evaluation process includes building an evaluation pipeline, developing metrics for relevance assessment and ultimately implementing them.

Elaborating use cases

To strengthen RegAssist and enable it to function like a legal/compliance specialist, we trained it on real-life scenarios drawn from day-to-day work. Here is the list of criteria that were followed to mirror the working methods of operational teams (i.e. RegReview’s users):

  1. Overarching criterion: What kind of questions does a compliance specialist typically ask? The first and general criterion followed was to reproduce questions that are commonly asked by people who work in compliance.
  2. Relevance (reliance on key sources on a given topic): What a compliance specialist needs and wants is not a generic recitation, but rather a clear answer or confirmation backed by an authoritative source. Our knowledge of both compliance regulations and the work of regulators allowed us to feed RegAssist with what a specialist would look for, so that it learns to function precisely like one.
  3. Date of publication: To avoid bias towards only recent publications, use cases were created not only from articles published shortly before the drafting of the cases, but also from those published a year prior or at any point in between. We combined the criteria of publication date and authoritativeness to generate the most accurate responses and guarantee that the engine emulates experts’ approach.
  4. Language: The languages used to formulate both questions and answers were English and French to avoid linguistic biases and broaden the scope of source materials.
  5. Latitude of the information sought: The use cases addressed both relatively broad topics, which required lengthy and complex answers, and narrow, one-off pieces of information — including news items, such as an appointment.
  6. Definitions: Although theoretically belonging to the previous category, questions pointing to definitions are so critical in practice that they warrant their own designation, as they are key for legal/compliance experts.
  7. Information in the news: As compliance specialists need to keep abreast of the ‘breaking news’ in their area, RegAssist was also trained on this type of information.
  8. Complex technical questions: In our training of RegAssist, we also built-in highly technical questions, mixing several notions and projecting them into a scenario (e.g. a specific sector), to introduce an additional layer of complexity.
  9. General neophyte questions: To ensure that RegAssist could assist not only Subject Matter Experts but also neophytes, we included general questions such as the legal acceptability of a given practice.

Example of Use Case

As an illustration of how these criteria were applied to create use cases, one of them pertained to a broad question, for which the most relevant answer was found in a specific document that could be considered as the key source.

The question was “Are Chambers of Commerce and Industry exposed to risks of corruption?”. Intuitively, one might assume that they are, insofar as risks of corruption are widespread and can virtually affect anyone, natural or legal persons. Yet a specific document underpins this question: a guide published by the French Anti-Corruption Agency on May 15th, 2024, to help Chambers of Commerce and Industry implement a program to prevent and detect bribery and similar offences (Guide pratique à l’attention des Chambres de Commerce et d’Industrie pour la mise en place d’un dispositif de prévention et de détection des atteintes à la probité).

The key information appears on page 6 of the guide: “Like every public or private actor, CCIs [Chambers of Commerce and Industry] are exposed to risks of bribery (active and passive) and influence peddling. […]”

Measured against the above-mentioned criteria, this is definitely a question that compliance officers could ask. Given that the guide is issued by the French Anti-Corruption Agency, it stands as a key source — one that compliance officers would likely rank as the most relevant to address their inquiry. The guide was published in French just days before the creation of the use case and, although the question is very general, the answer was specifically provided by one key document whose content offers the information sought and illustrates it with examples.

Implementing evaluation in RegAssist

To ensure RegAssist remains effective and efficient, we must continuously assess its performance. Regular evaluations with relevant metrics help us identify areas for improvement, ensuring the tool evolves to meet its users’ changing needs. The goal is to be able to evaluate RegAssist’s performance each time improvements are made, which is why we need identical tests for accurate before-and-after comparisons.

The primary objective of this chatbot evaluation is simple: assess the retriever’s ability to fetch the expected text passages, called chunks, that feed the chatbot’s responses. A simplified schema of the RegAssist chatbot is shown in Figure 1.

Figure 1: Simplified RegAssist Architecture

Let’s see, step by step, how the assessment pipeline was built using the previous example.

  1. Defining Business Test Scenarios for the Retriever

The first step involves defining various business test scenarios that the retriever component of RegAssist needs to handle. These scenarios are designed to cover a wide range of regulatory compliance situations that the chatbot is likely to encounter. By simulating real user conditions, we ensure that RegAssist is thoroughly tested against relevant and challenging scenarios.

Constructing a use case involved creating real-life scenarios (as tackled in the previous section) that best reflect users’ behaviors on the RegReview product: for example, the parameters a user selects for the chatbot (time frame, sources, keywords, etc.), and the relevant passages that should be retrieved by RegAssist. These were then turned into usable scenarios for evaluating the chatbot within the pipeline.

These meticulously crafted scenarios enable us to perform consistent and meaningful comparisons of RegAssist’s performance over time, ensuring any improvements are accurately measured and validated.

2. Preparing an ingestion script for use cases provided by the Compliance Team (CPL)

Next, we prepare a script that ingests the use cases provided by the CPL team and stores them for future use. This script automates the loading of use cases into the testing environment, ensuring that all scenarios are correctly and efficiently integrated. This preparation is crucial for creating a realistic testing framework.

The code consists of several key functions and a main execution block forming a system to locate and validate text chunks within articles based on specific scenarios as detailed below in Figure 2.

Figure 2: Scenario ingestion architecture

The `find_matching_chunks` function searches for text chunks within articles indexed in a Pinecone index based on expected expressions. It takes a scenario containing a question, article IDs, and expected expressions, and returns a dictionary with the matching chunks. If no matching chunks are found, it raises an error.

The `find_expressions` function checks if a chunk of text contains an expected expression using a regular expression search, returning `True` if a match is found and `False` otherwise.

During scenario ingestion, the scenarios list is processed: each scenario is iterated through to retrieve its search parameters and article IDs. The script calls `find_matching_chunks` to find the matching chunks, inserts the processed scenario into the database, and logs successes and errors. If critical errors occur, the ingestion exits.
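For illustration, here is a minimal sketch of what these two functions could look like. The Pinecone lookup is abstracted behind a hypothetical fetch_article_chunks callable, and the scenario field names are assumptions chosen for the example rather than the exact production code:

import re


def find_expressions(chunk_text: str, expression: str) -> bool:
    # Return True if the expected expression appears anywhere in the chunk text
    return re.search(re.escape(expression), chunk_text, flags=re.IGNORECASE) is not None


def find_matching_chunks(scenario: dict, fetch_article_chunks) -> dict:
    # Locate the chunks that contain the scenario's expected expressions.
    # `fetch_article_chunks` stands in for the Pinecone index lookup and returns
    # the chunks of an article as {"id": ..., "text": ...} dictionaries.
    matching_chunks = {}
    for article_id in scenario["article_ids"]:
        for chunk in fetch_article_chunks(article_id):
            for expression in scenario["expected_expressions"]:
                if find_expressions(chunk["text"], expression):
                    matching_chunks[chunk["id"]] = chunk["text"]
                    break
    if not matching_chunks:
        raise ValueError(f"No matching chunks found for scenario: {scenario['question']}")
    return matching_chunks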

Here is the format of the scenario provided by the Compliance team:

{
"Name": "Finance",
"Search Parameters": {
"Source(s)": " French Anti-Corruption Agency ",
"Topic(s)": "Corruption",
"Keywords": ["CCI", "chamber of commerce and industry", "corruption", "moralization"],
"Min Date": None,
"Max Date": "05/16/2024”
},
"Question": "Are chambers of commerce and industry exposed to risks of corruption?",
"Article ID" : "664556d120300cef055e8714",
"Relevant Passages" : [
"Like any public or private actor, CCIs are exposed to risks of corruption (active or passive) and influence peddling. "
"Given the public service missions assigned to them, CCIs are also exposed to risks of illegal interest-taking and extortion. "
"CCI officials and employees exercising control over entities controlled by the CCI may be exposed to the risk of revolving doors if they move within this entity. "
"Due to their status as contracting authorities, CCIs are also exposed to risks of favoritism. "
"Because their resources are public funds, CCIs are exposed to risks of embezzlement of public funds. "
"These six criminal offenses are grouped in the rest of the document under the generic term 'breaches of integrity.' "
"The purpose of this guide is to provide the consular network with operational tools for preventing and detecting "
"these risks of breaches of integrity. "
"The table below lists various decisions concerning CCIs and highlighting certain risks of breaches of integrity to which CCIs are exposed:"
]
}

Our script, which converts the scenario provided by the CPL team into a format compatible with our database, produces the scenario in this form:

{"question": "Are chambers of commerce and industry exposed to risks of corruption?",
"search_articles_ids": [“664556d120300cef055e8714”]
"expected_chunks_ids":
[“ea338efa-088b-41a5-832c-1e38723066bf", “5b3e4883-452a-41fe-9537-caae1d9270f9",
“61e825bb-84fb-48b3-9c78-69aa4a2a04c5", “7b1000fd-cdff-43df-933f-6180100db78c“,
“58d6c3f5-f35c-463a-b154-97df7bee87d8"]
}

Now we have a scenario containing the important information: the question, the ID of the relevant article, and the text segments’ IDs pertinent to the answer, ready for storage.

3. Creating Endpoints to Access the Scenarios

We then create endpoints that allow access to the test scenarios, serving as interfaces for retrieving and executing them. By having dedicated access points, we streamline the testing process and ensure that scenarios can be easily managed and modified as needed.

Two main types of endpoints are then implemented:

  • Endpoints related to scenarios: used for inserting, deleting, updating and retrieving them
  • Endpoints related to results: used to run tests, get evaluation results or get the branches on which tests were performed
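To give an idea of their shape, here is a minimal Flask sketch of two scenario endpoints. The route names and the in-memory store are assumptions for illustration only; the real endpoints also handle authentication, validation and the results-related operations:

from flask import Flask, jsonify, request

app = Flask(__name__)
scenario_store = {}  # stand-in for the real scenario database


@app.route("/scenarios", methods=["POST"])
def insert_scenario():
    # Store a processed scenario (question, article IDs, expected chunk IDs)
    scenario = request.get_json()
    scenario_store[scenario["question"]] = scenario
    return jsonify(scenario), 201


@app.route("/scenarios", methods=["GET"])
def list_scenarios():
    # Return every stored scenario so the test job can iterate over them
    return jsonify(list(scenario_store.values())), 200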

4. Defining Relevant Metrics (Precision, Accuracy, Recall) for evaluation

In addition to access endpoints, we also develop endpoints for inserting test results, including relevant performance metrics such as precision, accuracy, and recall, as defined below in Figure 3. These metrics are crucial for quantitatively evaluating RegAssist’s performance and identifying specific areas for improvement.

In order to mathematically represent the presence or absence of chunks and calculate these key metrics, we need to convert the list of expected chunk IDs and the list of chunk IDs retrieved by our current retriever into binary lists (0 or 1). The variable top_final sets the length of the list containing all the retrieved chunk IDs.

Figure 3: Schema of metrics computing

If there are fewer expected chunk IDs than retrieved ones, we extend the expected list with 0s, which increases the False Positive count for the retrieved IDs. Conversely, if there are fewer retrieved than expected chunk IDs, we extend the retrieved list with 0s, which increases the False Negative count. It is important to note that the order of retrieval is significant.

Here is a little reminder about precision, accuracy and recall formulas:
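Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

Accuracy = (TP + TN) / (TP + FP + FN + TN)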

Where:

  • TP (True Positive) — The number of retrieved chunks that are expected in the scenario. It means 1 in the expected list and 1 in the retrieved list.
  • FP (False Positive) — The number of retrieved chunks that are not expected in the scenario. It means 0 in the expected list and 1 in the retrieved list.
  • FN (False Negative) — The number of expected chunks that were not retrieved. It means 1 in the expected list and 0 in the retrieved list.
  • TN (True Negative) — The number of chunks that are neither retrieved nor expected: 0 in the expected list and 0 in the retrieved list.

An increase in False Positives (FP) leads to a decrease in precision. If the precision is too low, the final set of results is deemed too large. On the other hand, an increase in False Negatives (FN) leads to a decrease in recall. If the recall is too low, the final set of results is considered too small.
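To make this concrete, here is a simplified, set-based sketch of the comparison. It ignores the ordering and 0-padding details described above, so it is an approximation of the computation rather than the production code:

def evaluate_retrieval(expected_ids, retrieved_ids):
    # Compare expected and retrieved chunk IDs and return accuracy, precision and recall
    expected, retrieved = set(expected_ids), set(retrieved_ids)
    tp = len(expected & retrieved)   # retrieved and expected
    fp = len(retrieved - expected)   # retrieved but not expected
    fn = len(expected - retrieved)   # expected but not retrieved
    tn = 0                           # no "correctly ignored" chunks in this simplified view

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}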

5. Implementing the test job: test runs

Finally, we implement the testing component that executes the defined scenarios and collects the results. It runs the chatbot through all test cases, records its performance, and stores the results for further analysis. Composed of two main functions, the implementation ensures consistent and repeatable evaluations, allowing us to monitor RegAssist’s performance over time, as detailed in the following schema:

Figure 4: Evaluation process after RAG modifications

The first function is designed to initiate a new evaluation process once per test session. It sets an environment variable for the branch name, creates a test client from a Flask application, and then sends a POST request to start the evaluation. The request includes several parameters to allow identification of each run: admin role claims, an identity value in the headers, and a JSON payload with the current date, time, and branch name. It also checks for a 201 status code to confirm the evaluation process was successfully created and returns the JSON response for further use in tests.

The second function tests the retriever on the various scenarios. First, it sends a GET request to retrieve the scenario details, then a POST request to query the retriever, and finishes by calculating the performance metrics (accuracy, precision, and recall). This final step is accomplished by comparing the expected and retrieved chunks’ IDs, and then sending another POST request to store these metrics in the database. Each step checks for successful status codes to ensure proper execution.
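As an illustration of this second function, here is a condensed sketch built on the Flask test client. The endpoint paths and payload fields are assumptions for the example, and evaluate_retrieval refers to the helper sketched in the previous section:

def run_scenario_test(app, scenario_id, branch_name):
    # Run one scenario against the retriever and store its metrics
    client = app.test_client()

    # 1. Retrieve the scenario details (question, article IDs, expected chunk IDs)
    resp = client.get(f"/scenarios/{scenario_id}")
    assert resp.status_code == 200
    scenario = resp.get_json()

    # 2. Query the retriever with the scenario's question and search parameters
    resp = client.post("/retriever/query", json={
        "question": scenario["question"],
        "article_ids": scenario["search_articles_ids"],
    })
    assert resp.status_code == 200
    retrieved_ids = resp.get_json()["returned_chunks_ids"]

    # 3. Compute the metrics and store them for the current branch
    metrics = evaluate_retrieval(scenario["expected_chunks_ids"], retrieved_ids)
    resp = client.post("/results", json={
        "scenario_id": scenario_id,
        "branch": branch_name,
        "returned_chunks_ids": retrieved_ids,
        **metrics,
    })
    assert resp.status_code == 201
    return metrics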

Here is the final scenario after the run:

{
"scenario_id": scenario_id,
"accuracy": 1.0,
"precision": 0.6,
"recall": 0.6,
"returned_chunks_ids": [“ea338efa-088b-41a5-832c-1e38723066bf", “61e825bb-84fb-48b3-9c78-69aa4a2a04c5", “58d6c3f5-f35c-463a-b154-97df7bee87d8"]
}

Now the retriever has been assessed and its performance metrics have been stored!

Conclusion

In conclusion, ensuring the effectiveness and efficiency of RegAssist requires continual performance assessment through rigorous and consistent testing. By defining detailed business test scenarios that simulate real user interactions and regulatory compliance situations, we can comprehensively evaluate the retriever’s ability to fetch relevant text excerpts accurately. This structured approach, underpinned by automated ingestion scripts and well-defined endpoints, enables us to maintain a robust and realistic testing framework.

By implementing precise evaluation metrics such as precision, accuracy, and recall, we can quantitatively measure RegAssist’s performance and identify specific areas for improvement. Regular evaluations using the same test scenarios ensure that we can compare performance before and after any changes, providing a clear view of the tool’s evolution.

Ultimately, this comprehensive assessment pipeline not only helps validate the improvements made but also enhances RegAssist’s overall functionality and user satisfaction, ensuring it remains a reliable tool for regulatory compliance needs.

Océane Wauquier and Maxime Charpentier
