Text Extraction: Comparing DONUT and Amazon Textract in the Insurance Field

Heka.ai
14 min read · May 21, 2024


At Sia Partners, our Computer Vision Lab is dedicated to advancing new computer vision technologies and complementary tools. We operate at the cutting edge of OCR research, continually exploring novel approaches to improve the accuracy and efficiency of text recognition. In our last article, we presented how we use OCR tools to support companies in the energy sector. We continually conduct benchmarks and technology watch, ensuring that our solutions consistently outperform industry standards. That is why we have taken a keen interest in LLMs (Large Language Models), which represent a true breakthrough in the technology industry. Since their arrival in 2018, they have significantly enhanced our capabilities and encouraged us to explore additional open-source models such as DONUT, an LLM-based solution that we will detail later. LLMs take advantage of the transformer architecture and of the large amounts of data they are trained on.

Conversely, Amazon Textract, a well-known and established cloud-based service from Amazon Web Services (AWS), is transforming the way organizations handle their documents. Amazon Textract harnesses the power of machine learning and Optical Character Recognition (OCR) to automatically extract and analyse text, forms, tables, and structured data from a diverse range of documents. Whether it is processing scanned documents, PDFs, or even handwritten text, Textract excels at making sense of information buried within paper and digital files.

Today we propose to compare DONUT and Amazon Textract on a use case we worked on at Sia Partners for the insurance field: the automatic reading of car accident reports. We will introduce this specific use case, then present the Amazon Textract model, followed by the generative model DONUT. Finally, we will compare these two solutions and discuss their effectiveness in the context of the insurance industry.

Note to the reader: if you wish to learn more about our benchmark, the next two chapters dive into the details of the model configurations; otherwise, you can go directly to the results of the benchmark: Donut vs Textract: revealing processing differences.

Accelerating Accident Report Processing with AI Field Detection using Amazon Textract

In many countries, car accident reports are still filled out manually using standardized paper forms, which is time-consuming and error-prone. Automating this process with AI models for field detection can significantly streamline operations, enhance accuracy, and expedite claims processing. This digitization effort aims to improve customer experience and reduce the costly errors associated with manual documentation. The objective is twofold: field detection (for example, the licence plate number or the driver's ID details) and value identification. As you can see below on the picture of an unfilled report (Figure 1), the document contains dozens of standard text fields such as names, addresses, and contract numbers, but also checkboxes, drawing areas, signature fields, etc.

Figure 1: Car accident report

As a starting point for this work, we had access to a database of nearly 3,000 real French car accident reports from 2017 to 2020. These reports have been digitized and saved as images at a decent resolution, as almost all of them measure approximately 5000x3000 pixels. However, the overall quality is average: the forms are filled out on low-quality paper (carbon paper) and, most of the time, on the hood of a car, right after an accident.

With this baseline, we can now dive into the application we developed and how we currently use Textract (Figure 2) in our use case. In collaboration with the insurance company providing the data, we developed a platform that uses Textract to automatically read and extract all the information present in the above-mentioned accident reports. Every report is sent to the Textract REST API (one call per report). In return, the API sends back a JSON response containing all the information found in the report: every detected value appears in the dictionary with associated parameters such as the coordinates of the text, its relation to other elements, the confidence score of the value, etc. With that information, we can divide the document into large sections and associate each field with the raw value filled in by the insured. These sections correspond to business characteristics and can be identified on the image by the bold numbers from 1 to 15. This task is more complex than it seems. Car accidents are traumatic experiences, and manually filling out a regulatory form, often in bad conditions (at the accident location), is not an easy task. It is therefore hard to read every character on the report properly, even for a human, let alone for an OCR tool that was not specifically trained on this type of document. That is why we implemented post-processing functions (Figure 2) based on regex character manipulation, distance evaluation, business rules, etc., to complete the document reading and cross-validate it against national databases. For instance, we ensured that the licence plates have a proper format (there are different formats defined in France) and exist in a specific national database.
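To make this step concrete, here is a minimal sketch of such a call using the boto3 SDK; the file name and region are placeholders, and the real platform adds the section mapping and post-processing described above on top of this raw output.

```python
# Minimal sketch of one Textract call per report (file name and region are
# illustrative). The FORMS feature returns key-value pairs in addition to the
# raw words.
import boto3

textract = boto3.client("textract", region_name="eu-west-1")

with open("accident_report_001.jpg", "rb") as f:
    response = textract.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["FORMS"],
    )

# Every detected element comes back as a "Block" with its text, confidence
# score and normalized bounding box, which we later use to map values to the
# numbered sections of the report.
for block in response["Blocks"]:
    if block["BlockType"] == "WORD":
        box = block["Geometry"]["BoundingBox"]
        print(block["Text"], round(block["Confidence"], 1), box["Left"], box["Top"])
```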

After all these steps, we needed to evaluate the performance of the overall pipeline. To compare the predictions with their respective targets, we developed an interface (a web page) that allows us to annotate reports and thus compute performance metrics. Since each field has its own post-processing, we decided to use a separate evaluation metric for each of them. For these tasks, we used a simple metric: "is the prediction equal to the label?".
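Concretely, this per-field exact-match metric can be computed as in the short sketch below; the field names and values are hypothetical.

```python
# Per-field exact-match accuracy: "is the prediction equal to the label?"
# `predictions` and `labels` are hypothetical lists of dicts, one per report.
from collections import defaultdict

def exact_match_per_field(predictions, labels):
    correct, total = defaultdict(int), defaultdict(int)
    for pred, gold in zip(predictions, labels):
        for field, target in gold.items():
            total[field] += 1
            if pred.get(field) == target:
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}

print(exact_match_per_field(
    [{"licence_plate": "AB-123-CD"}, {"licence_plate": "EF-456-GH"}],
    [{"licence_plate": "AB-123-CD"}, {"licence_plate": "EF-456-GI"}],
))  # {'licence_plate': 0.5}
```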

Figure 2: TEXTRACT model pipeline for inference

Donut: the generative approach

Donut stands for Document Understanding Transformer. It is an OCR-free Visual Document Understanding model, proposed in 2022 in the paper "OCR-free Document Understanding Transformer" by Clova AI Research, a South Korean lab. Conventional document extraction models usually rely on OCR models to read the text on the document image and then focus on modelling the understanding part (Figure 3). Donut, in contrast, combines visual understanding and text generation in a single model and frees itself from OCR: it analyses the image directly to understand its content as a whole rather than one character at a time (Figure 4). It can thus generate a coherent and complete description of the image in a single pass. The original code can be found here.

Figure 3: Conventional document information extraction pipeline
Figure 4: Donut pipeline

Implementing Donut for field detection

With Textract, we had the possibility to divide the reports into crops based on business rules, and we decided to train Donut on these crops. Using crops reduces the size of the input images, resulting in lower memory usage and faster training and inference. Donut is also likely to perform better at decoding, since it can give more attention to the details of each type of crop. Finally, crops prevent Donut from paying attention to areas of the report that we are not interested in. For our use case, using Donut in its parsing configuration was the most relevant choice, as we need to detect all the fields of a car accident report. The Question Answering mode (Q&A: asking a question as a prompt and expecting an answer as an output) was considered, but Donut appeared to be less efficient on this task, and there is little added value in being able to ask a question in our case.
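As a rough illustration of this cropping step, the sketch below cuts one section out of a report image from a normalized bounding box (the same coordinate convention Textract returns); the box values and file names are made up.

```python
# Cut a business section out of a report image from a normalized bounding box.
from PIL import Image

def crop_section(image_path, box):
    """box: dict with normalized Left/Top/Width/Height in [0, 1]."""
    image = Image.open(image_path)
    w, h = image.size
    left, top = box["Left"] * w, box["Top"] * h
    right, bottom = left + box["Width"] * w, top + box["Height"] * h
    return image.crop((int(left), int(top), int(right), int(bottom)))

# Hypothetical coordinates of the "vehicle" section for driver A
crop = crop_section("accident_report_001.jpg",
                    {"Left": 0.05, "Top": 0.55, "Width": 0.45, "Height": 0.12})
crop.save("report_001_vehicle_A.jpg")
```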

Figure 5: Example of a crop from the accident report
Figure 6: Donut and Textract detailed pipeline

First, to prepare the model for training, we need to label all the fields (type of the field + content) of each crop. The next step is tokenization. Let's first break down what a token is and why tokenization is necessary. Tokenization is the process of dividing a piece of text into smaller units called tokens, usually words or subwords. This process is crucial because it provides a structured representation of the text that can be easily processed by machine learning models. The component responsible for this task is called the tokenizer.

In the preprocessing step to make the text compatible with Donut’s tokenizer, special tokens are added to serve as tags for structuring the text. For example, a preprocessing text just before being passed in the tokenizer might look like this:

<s><s_name>Antoine</s_name><s_surname>Dupont</s_surname><s_address>12 rue Victor Hugo</s_address></s>

Here, the text is enclosed within <s> tags, with specific tokens like <s_name>, <s_surname>, and <s_address> indicating the different fields. These special tokens structure the information within the text, making it easier for Donut's tokenizer to understand and process.

Once the text has been preprocessed in this manner, it is tokenized: the preprocessed text is converted into a sequence of tokens that the model can process. In the case of Donut, the XLMRobertaTokenizer is used. It breaks the text down into individual words and subwords for further processing. These tokens form the input data for training the model to associate the text with its corresponding fields, enabling it to learn to extract and label the relevant information from each crop.
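For illustration, the sketch below enriches the tokenizer (accessed here through the Hugging Face DonutProcessor, which wraps an XLM-RoBERTa tokenizer) with the hypothetical field tags from the example above and tokenizes the tagged string; the checkpoint is the public donut-base model, not our fine-tuned one.

```python
# Add the field tags as special tokens, then tokenize the tagged sequence.
from transformers import DonutProcessor

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")

fields = ["name", "surname", "address"]                      # illustrative fields
field_tags = [f"<s_{f}>" for f in fields] + [f"</s_{f}>" for f in fields]
processor.tokenizer.add_special_tokens({"additional_special_tokens": field_tags})

sequence = ("<s><s_name>Antoine</s_name><s_surname>Dupont</s_surname>"
            "<s_address>12 rue Victor Hugo</s_address></s>")
token_ids = processor.tokenizer(sequence, add_special_tokens=False).input_ids
print(processor.tokenizer.convert_ids_to_tokens(token_ids))
```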

After that, we train the model to familiarize itself with assigning these texts to the given fields. The inference model consists of two steps:

Vision Encoder (SwinTransformer): The SwinTransformer serves as the vision encoder, tasked with recognizing the information contained in the image. SwinTransformer employs a window-based self-attention mechanism to capture contextual information across different spatial scales within the image. This allows it to effectively extract meaningful features from the input image, facilitating subsequent processing by the text decoder.

Text Decoder (BART): BART acts as the text decoder, responsible for structuring the information detected by the encoder and generating understandable text. BART, or Bidirectional and Auto-Regressive Transformers, is a transformer-based model specifically designed for sequence-to-sequence tasks, such as text generation and summarization. It leverages both auto-regressive and bidirectional training objectives to effectively reconstruct input sequences and generate coherent outputs.

The output token sequence is then converted to JSON, a format chosen for its high representation capacity.
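The sketch below illustrates this two-step inference with the Hugging Face implementation: the Swin encoder consumes the crop image, the BART decoder generates the tagged token sequence, and the sequence is converted to JSON. The checkpoint path and the task prompt are assumptions for illustration (the fine-tuning itself is described in the next section).

```python
# Encoder-decoder inference on one crop, followed by token-to-JSON conversion.
import re
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

model_dir = "path/to/finetuned-donut"                 # hypothetical checkpoint
processor = DonutProcessor.from_pretrained(model_dir)
model = VisionEncoderDecoderModel.from_pretrained(model_dir)

image = Image.open("report_001_vehicle_A.jpg").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values  # Swin encoder input

task_prompt = "<s>"                                   # assumed start token
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(                             # BART decoder generation
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
)

sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "")
sequence = sequence.replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop the start token
print(processor.token2json(sequence))  # e.g. {"name": "Antoine", "surname": "Dupont"}
```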

Finetuning Donut

After defining the architecture of our model, we finetuned Donut for our specific use case, as shown in the upper part of Figure 6. We focused on making Donut detect and read the content of the 14 main fields present in a car accident report, the same fields that were tested with Textract. We must note that Textract was built for more general use, whereas here we focus our efforts on finetuning Donut on these specific 14 fields for testing purposes:

  • Date of the accident
  • Hour of the accident
  • Place of the accident
  • Circumstances
  • Name of the insured
  • Signature
  • Contract number
  • Vehicle brand
  • Licence plate
  • Country of the vehicle registration
  • Name of the insurance company
  • Country of the insured
  • Postal code of the insured
  • County of the accident

From our annotations of roughly 3,000 accident reports, we obtained a total of 6,787 crops in which the selected fields are grouped. However, not all the fields are always filled in by the drivers, so the proportion of fields is unequal: the number of annotations per field ranges from 88 instances (date of the accident) to almost 3,000 for the licence plate.

We split our dataset into a training, validation and test set. Our test set was made of all the crops from the 100 accident reports that we had already used to test the Textract model, so that we can compare both models on the same test images.
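Since the test set is defined at report level (all the crops coming from the 100 Textract test reports), the split can be sketched as follows; the record layout and the validation ratio are assumptions.

```python
# Report-level split: the Textract test reports form the Donut test set too.
def split_by_report(crops, textract_test_ids, val_ratio=0.1):
    test = [c for c in crops if c["report_id"] in textract_test_ids]
    remaining = [c for c in crops if c["report_id"] not in textract_test_ids]
    n_val = int(len(remaining) * val_ratio)
    return remaining[n_val:], remaining[:n_val], test  # train, validation, test
```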

Here is the procedure for preprocessing and finetuning the Donut model on the accident reports use case:

  1. Tokenizer enrichment:
    We provide the model with the list of all the fields it is likely to encounter in our images (signature, circumstances, etc.). The model enriches its tokenizer with all these fields.
  2. Conversion:
    We convert the annotated training dataset into sequences of tokens and convert the images into tensors.
  3. Pre-trained model:
    We initialize the weights with the base pre-trained model.
  4. Resizing:
    We resize the embedding layer to match the newly added tokens and adjust the image size of our encoder to match our dataset (see the sketch after this list).
  5. Parameters:
    - Given the relatively small size of the training dataset, particularly for certain categories, and after several empirical tests, the number of training epochs is set to 10. Beyond that, performance does not improve significantly.
    - The learning rate has been set to 0.001, in line with other training sessions similar to our use case.
    - The batch size has been set to 2, mainly due to capacity limitations.
  6. Training:
    Donut's training lasted around 2 hours on the training dataset, and the inference time for one crop image is less than 3 seconds (as a comparison, the Textract pipeline takes between one and four seconds of inference per document).
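The sketch below ties steps 1 to 5 together with the Hugging Face Donut classes; the field names, the crop resolution, and the maximum sequence length are illustrative assumptions rather than the exact production values.

```python
# Fine-tuning setup for Donut: enrich the tokenizer, resize the embeddings and
# the encoder input size, and fix the training hyperparameters.
from transformers import DonutProcessor, VisionEncoderDecoderModel

image_height, image_width = 1280, 960     # assumed crop resolution after resizing
max_length = 128                          # assumed maximum output sequence length

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# Step 1 -- tokenizer enrichment with the fields the model may encounter
fields = ["licence_plate", "signature", "circumstances"]   # truncated, illustrative list
field_tags = [f"<s_{f}>" for f in fields] + [f"</s_{f}>" for f in fields]
processor.tokenizer.add_special_tokens({"additional_special_tokens": field_tags})

# Step 4 -- resize the embedding layer to the enlarged vocabulary and adjust the
# encoder (and image processor) to the crop resolution
model.decoder.resize_token_embeddings(len(processor.tokenizer))
model.config.encoder.image_size = [image_height, image_width]
model.config.decoder.max_length = max_length
processor.image_processor.size = {"height": image_height, "width": image_width}

# Step 5 -- training hyperparameters used in our runs
training_config = {"epochs": 10, "learning_rate": 1e-3, "batch_size": 2}
```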

Donut vs Textract: revealing processing differences

Figure 7: Preprocessing pipeline of the accident reports

These two models are built very differently, which makes them difficult to compare directly. Donut and Textract differ in the way they are implemented. Donut strongly relies on a training phase based on annotations, whereas Textract does not need a training phase to adapt to the use case. However, because Donut is trained on our data, it can learn the particularities we ask it to pay attention to. For instance, if the image contains the handwritten hour "14h17" and we specify that the correct annotation is 14:17, Donut will learn to replace "h" with ":" when outputting the time, while Textract will not adapt and will still return "14h17". That is why we need to apply post-processing actions to get the desired results.
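The sketch below shows the kind of rules this post-processing layer contains, with the "14h17" normalisation mentioned above and a simplified check of the current French "AA-123-AA" plate format; the real rule set is broader and also queries national databases.

```python
# Simplified examples of the regex-based post-processing applied to Textract output.
import re

def normalize_hour(raw):
    """Turn handwritten hours such as '14h17' or '14 h 17' into '14:17'."""
    match = re.search(r"(\d{1,2})\s*[hH:.]\s*(\d{2})", raw)
    return f"{int(match.group(1)):02d}:{match.group(2)}" if match else raw

def is_valid_french_plate(raw):
    """Check the post-2009 SIV format 'AA-123-AA' (separators are tolerated)."""
    plate = re.sub(r"[\s-]", "", raw.upper())
    return re.fullmatch(r"[A-Z]{2}\d{3}[A-Z]{2}", plate) is not None

print(normalize_hour("14h17"))             # 14:17
print(is_valid_french_plate("AB-123-CD"))  # True
```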

Another difference is that Donut detects the fields from an image by itself, without knowing in advance which fields to detect. Not only does Donut need to predict a field's content, it must also detect the field type from the image. The probability of errors is therefore higher: even before predicting the content of a field, the model can make a mistake on the type of field it is predicting. Furthermore, it can forget fields during the detection phase or detect one that does not exist in the considered image.

Figure 8: Post-processing

Donut vs Textract: comparing results

In the previous paragraph we presented a comparison of the architectures of Donut and Textract.

To continue our exploration, we present here the performance over the 100 reports in the test set mentioned above. As a reminder, there are two steps in the "Textract pipeline": (1) Textract prediction and (2) post-processing. The graph below shows the performance of Textract without post-processing, Textract with post-processing, and Donut without post-processing.

Figure 9: Comparing efficiency analysis (%)

To compare the two models fairly, we need to compare their performance at the same stage of the pipeline. Since no post-processing was applied to Donut, we compare its results with Textract's results without post-processing. Globally, Textract performs better on most fields. Donut performs better on fields where Textract is weak, and worse on fields less represented in the training dataset. This reflects the limits of the Donut model: it relies heavily on the quality and quantity of the training dataset we provide. If Donut has few training images for a specific field, it is likely to perform poorly at extracting information from that field. Also, like many other models, Donut has more difficulty predicting text-dense areas such as the address.

Several avenues can be explored to improve Donut's performance. First, adding more training data is an easy and obvious way to improve Donut's performance on fields that have very few instances, but the annotation phase can be very time-consuming. We could also reshape the images by adding white areas so that resizing does not distort certain images. Exploring other hyperparameter combinations is another possible track. Finally, combining Donut with Textract could be a promising idea: we already have a pipeline that works well at identifying the fields inside an image, and we could imagine keeping this part and only using the text-decoding part of Donut to make the predictions.
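The white-padding idea mentioned above can be sketched in a few lines with Pillow: letterbox each crop to the encoder's input size so that thin or wide crops keep their aspect ratio instead of being stretched; the target size here is an assumption.

```python
# Pad a crop with white borders to the target size while preserving aspect ratio.
from PIL import Image, ImageOps

def pad_to_size(image, target_size=(960, 1280), fill=(255, 255, 255)):
    """Resize `image` to fit target_size (width, height) and pad with white."""
    return ImageOps.pad(image.convert("RGB"), target_size, color=fill)

padded = pad_to_size(Image.open("report_001_vehicle_A.jpg"))
padded.save("report_001_vehicle_A_padded.jpg")
```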

Discussing Strengths and weaknesses

The fundamental difference between the Textract-based approach (Figure 2) and Donut (Figure 6) lies in how information is extracted from visual documents. Textract relies on OCR, whereas Donut adopts a Transformer-based approach that does not require OCR. Donut's strengths are that it can handle documents in different languages without prior specification. It also does not require off-the-shelf OCR engines/APIs, which significantly reduces the energy consumption and greenhouse gas emissions associated with these engines and makes its inference much lighter. Once trained, Donut is free to use. However, Donut demands substantial computing power and memory for training and fine-tuning, depending on the amount of data processed. For our use case, we used a GPU with 64GB of RAM for the preprocessing and training phase of Donut, but for inference, only a CPU and 4GB of RAM are needed.

In comparison, Amazon Textract offers robust performance and ease of integration, but it lacks Donut's customization potential, which makes a post-processing layer necessary. It uses optical character recognition (OCR) algorithms that require considerable computing power and energy, potentially leading to a greater environmental impact. Contrary to Donut, it is billed based on usage.

Conclusion

In conclusion, the divergence between Textract and Donut lies in their approaches to document extraction. Textract relies on OCR and has been trained on extensive volumes of documents, making it highly effective at extracting information from classic document types. Donut and other GenAI approaches, on the other hand, revolutionize older methods by eliminating the reliance on costly OCR, enabling linguistic flexibility, and reducing the need for manual post-processing. These models mark a significant advancement in the field of visual document understanding, accelerating processing workflows and improving information extraction accuracy.

However, the main issue we encountered with Donut is that it requires a lot of training data to reach a high rate of precision. Donut was originally fine-tuned for different use cases such as decoding receipts, scanned administrative documents, or images from a limited set of document types (emails, letters, memos, and so on). If we take this pretrained model and run inference on a brand-new image of a completely different type from the ones it was trained on, it will not perform well at decoding its content.

It is important to note that there is room for further exploration and improvement with Donut. Some potential enhancements remain untapped, presenting promising avenues for future development. Similarly, while Textract has a more established market presence than Donut, there is potential for refining post-processing methods, especially on specific but widespread use cases like car accident reports.

@Côme STEPHANT, @Mathieu JUNCKER & Florent Cottier
