Introduction
Leveraging large text datasets is a recurring need for our customers. Use cases range from analyzing answers to open-ended questions in customer opinion surveys, to processing meeting minutes or studying customer feedback collected from external sources. Yet all these subjects share a common underlying question:
What insights can be drawn from general topic opinions or responses to open-ended questions?
The easy accessibility of Large Language Models (LLMs) has facilitated their integration into text mining use cases over the last few months, significantly enhancing the performance of previous pipelines.
We distinguish two phases in the processing of a dataset for text mining purposes:
- 1. A “topic modeling” phase: its goal is to determine the most prevalent topics from a corpus of documents.
- 2. A “topic classification” phase: in this phase, each document is associated with one of the topics identified in phase 1.
In this article, we present the topic modeling methodology we applied to customer use cases, which outperforms the more conventional topic modeling methodologies covered in a series of articles previously published on our blog ([1] — Topic Modeling : An end-to-end process for semi-automatic topic modeling from a huge corpus of short texts, [2] — A journey to state of the art performances in topic analysis) (phase 1).
We also provide a comparison of several zero-shot approaches for classifying text into the detected categories (phase 2).
How can a Large Language Model leverage textual datasets?
Traditional topic modeling methodologies, such as BERTopic, perform phase 2 (text clustering to assign a cluster to each document) before phase 1 (cluster interpretation to obtain salient topics). The approach presented here instead capitalizes on the summarization capabilities of LLMs.
We had the opportunity to work with a variety of datasets, applying the methodology described in this article. For the sake of consistency in data understanding, as well as addressing confidentiality concerns, we chose to use a French dataset of customer reviews previously used in our work.
1- Introduction to Large Language Models for analyzing 200,000 customer reviews
The dataset at our disposal consists of 200,000 online reviews for a delivery agency. Being already familiar with this dataset, we were aware of the main topics mentioned and the varying quality of the reviews in terms of insights.
Unlike our prior articles, where we relied on conventional machine and deep learning methods to perform text mining, we will now focus on leveraging LLMs, more specifically the GPT-4o mini and LLaMa-3.1–8B models.
- GPT-4o mini: this model was chosen for its raw performance [3] and its performance-to-price ratio.
- LLaMa-3.1–8B: it is one of the most capable models of its size in terms of parameter count [3][4], which makes it easier to run on-premises and enables more sensitive data to be processed on dedicated instances.
Both models also excel at multilingual tasks, a key feature given the datasets we handle during projects. However, using an LLM as a summarizer to extract topics is limited by context size: it is not (yet) viable to feed 200,000 reviews to an LLM, as the full set of reviews corresponds to around 3M tokens (counted with the tiktoken library).
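As an order of magnitude, the token count can be estimated as in the minimal sketch below (the o200k_base encoding used by the GPT-4o family is assumed, and the review list is a placeholder for the real dataset):

```python
import tiktoken

# o200k_base is the encoding used by the GPT-4o family of models.
enc = tiktoken.get_encoding("o200k_base")

# Placeholder: in practice this list holds the 200,000 reviews of the dataset.
reviews = [
    "Livraison rapide, rien à redire.",
    "Colis perdu et aucune réponse du service client.",
]

total_tokens = sum(len(enc.encode(review)) for review in reviews)
print(f"Total tokens in the corpus: {total_tokens:,}")
```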
Since topic modeling is more valuable when text datasets are large, the following pipeline was created and used to overcome the problem of context size.
2- Framework of this experiment: Leveraging the summarization and extraction capacities of LLMs
The approach follows a map-reduce methodology, tailored specifically for topic modeling purposes. It is an unsupervised way to extract the most relevant topics from the reviews.
There are three main steps in this method as can be seen on Figure 1:
(1) Step 1 — Iteratively prompt an LLM for summarization purposes.
Reviews are randomly shuffled and successively concatenated until the prompt reaches the maximum context size of the LLM used; the process is then repeated with the remaining reviews. The 200,000 reviews end up separated into several review groups. Each group is then given as input to the LLM with the objective of summarizing the most notable issues raised in that set of reviews.
Following this technique, we obtain about 200 summaries. Each summary contains the main ideas from its group of reviews and is also more structured, which facilitates the topic extraction process.
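A minimal sketch of this packing-and-summarization step is given below, assuming the OpenAI chat completions API, an illustrative token budget and a simplified prompt (the exact prompts used in our experiments are not reproduced here):

```python
import random

import tiktoken
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set
enc = tiktoken.get_encoding("o200k_base")
TOKEN_BUDGET = 100_000  # illustrative budget, kept below the model's context limit


def build_batches(reviews, budget=TOKEN_BUDGET):
    """Shuffle the reviews and greedily concatenate them until the token budget is reached."""
    reviews = list(reviews)
    random.shuffle(reviews)
    batches, current, current_tokens = [], [], 0
    for review in reviews:
        n_tokens = len(enc.encode(review))
        if current and current_tokens + n_tokens > budget:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(review)
        current_tokens += n_tokens
    if current:
        batches.append(current)
    return batches


def summarize_batch(batch):
    """Ask the LLM to summarize the most notable issues raised in one group of reviews."""
    prompt = (
        "Here is a set of customer reviews for a delivery agency:\n\n"
        + "\n".join(batch)
        + "\n\nSummarize the most notable issues raised in these reviews."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.1,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# `reviews` is the list of 200,000 review strings, loaded beforehand.
summaries = [summarize_batch(batch) for batch in build_batches(reviews)]
```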
(2) Step 2 — Use the summaries to extract the main topics
However, even after the initial summarization step (1), the combined tokens from all summaries remained too numerous to be used directly as input to the model. On the other hand, a second round of summarization posed several risks, including overgeneralization. To overcome both issues, we iteratively sampled a subset of the output summaries to identify the main topics.
Repeating the sampling process 20 times helps prevent selection bias. It also smooths out the stochastic nature of LLMs, as topics extracted from the same text may differ from one inference to another despite a low temperature setting. In this use case, up to 50% of the 200 intermediate summaries were used in each sample to remain as exhaustive as possible. Each intermediate summary covers approximately 1,000 ± 20 reviews; for use cases with different amounts of data, this threshold can be adjusted.
Following this approach, a comprehensive list of topics was successfully generated.
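The sampling loop can be sketched as follows, reusing the `client` and the `summaries` list from the previous snippet (the prompt wording is illustrative):

```python
import random

N_ITERATIONS = 20
SAMPLE_FRACTION = 0.5  # up to 50% of the ~200 intermediate summaries


def extract_topics(sampled_summaries):
    """Prompt the LLM to list the main topics discussed in a sample of summaries."""
    prompt = (
        "Here are summaries of customer reviews for a delivery agency:\n\n"
        + "\n\n".join(sampled_summaries)
        + "\n\nList the main topics discussed in these summaries, one per line."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.1,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


topic_lists = []
for _ in range(N_ITERATIONS):
    sample = random.sample(summaries, k=int(len(summaries) * SAMPLE_FRACTION))
    topic_lists.append(extract_topics(sample))
```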
(3) Step 3 — Extract one final list of topics.
By inputting the 20 lists of step 2 into a final LLM prompt with specific instructions, a consolidated list of final topics was obtained.
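This consolidation step amounts to a single final prompt over the 20 intermediate lists; a sketch with an illustrative instruction is shown below:

```python
consolidation_prompt = (
    "Here are 20 lists of topics extracted from samples of customer review summaries:\n\n"
    + "\n\n".join(topic_lists)
    + "\n\nMerge these lists into a single deduplicated list of the main topics, "
    "giving each topic a short, self-explanatory name."
)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0.1,
    messages=[{"role": "user", "content": consolidation_prompt}],
)
final_topics = response.choices[0].message.content
print(final_topics)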
When comparing the results with those obtained from previous classical approaches (statistical methods such as TF-IDF, graph-based methods, or Latent Dirichlet Allocation), we noted a significant increase in topic quality: the model generates directly interpretable topics for stakeholders. However, some of them may overlap in certain aspects.
This approach was implemented with the GPT-4o mini and LLaMa-3.1–8B models tested in parallel, ensuring a separate evaluation of each. The topics extracted by both models are quite similar; in the presented use case, LLaMa-3.1–8B not only delivers results of comparable quality to GPT-4o mini but also offers the advantage of being a lighter model (Figure 2). The ability to host it in a partitioned environment dedicated to our task also enables us to process more sensitive data. However, execution time is longer with this model. For this study, the APIs provided by Groq were used, whose inference time is competitive and whose cost is low.
3 — Comparing zero-shot approaches to associate a review with its topic(s).
Once the topics are identified, the next step is to tag each review with the corresponding topic(s). This amounts to a multi-label classification, as one review can have zero, one or several topics.
The absence of annotated data led us to evaluate different zero-shot approaches. Unlike methods that require annotated examples, zero-shot approaches applied to text aim to determine whether the text belongs to a class the model was never trained on.
This type of approach therefore does not require training on annotated data, which yields faster development and makes it an interesting complement to an unsupervised approach such as topic modeling.
a) Establishing a baseline: a simple regex
The first standard approach to assign each review to a category is to use regular expressions matching keywords linked to each category. This provides a suitable baseline for the use case, given that the reviews are long.
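A minimal regex baseline could look like the following; the keyword lists and most of the topic labels are purely illustrative:

```python
import re

# Illustrative keyword patterns per topic; the actual keyword lists are use-case specific.
TOPIC_PATTERNS = {
    "Retard de livraison": re.compile(r"\b(retard|délai|en retard)\b", re.IGNORECASE),
    "Colis endommagé": re.compile(r"\b(endommagé|cassé|abîmé)\b", re.IGNORECASE),
    "Manque de service à la clientèle": re.compile(
        r"\b(service client|injoignable|aucune réponse)\b", re.IGNORECASE
    ),
}


def regex_classify(review: str) -> list[str]:
    """Return every topic whose keyword pattern matches the review (possibly none)."""
    return [topic for topic, pattern in TOPIC_PATTERNS.items() if pattern.search(review)]


print(regex_classify("Mon colis est arrivé cassé et avec trois jours de retard."))
# -> ['Retard de livraison', 'Colis endommagé']
```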
b) GPT-4o mini as a binary classifier
By leveraging vast datasets and advanced deep learning architectures, LLMs excel at tasks like natural language understanding, text generation and summarization. Research shows that LLMs can also be used as zero-shot classifiers for simple classification tasks [5]. The GPT-4o mini model was therefore tested as a binary classifier, prompted to decide whether a review belongs to each topic.
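Below is a sketch of how GPT-4o mini can be queried as a succession of binary classifiers, one call per (review, topic) pair; the prompt shown is illustrative and not the exact one used in our experiments, and the `client` is the OpenAI client defined earlier:

```python
def llm_binary_classify(review: str, topic: str) -> bool:
    """Ask the LLM whether the review mentions the given topic (yes/no answer)."""
    prompt = (
        f'Customer review: "{review}"\n\n'
        f'Does this review mention the following topic: "{topic}"? '
        "Answer with 'yes' or 'no' only."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")


def classify_review(review: str, topics: list[str]) -> list[str]:
    """Run one binary classification per topic and keep the topics answered 'yes'."""
    return [topic for topic in topics if llm_binary_classify(review, topic)]
```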
c) LLaMa-3.1–8B as a binary classifier
The performance of LLaMa-3.1–8B at extracting relevant topics prompted us to also test it for classification. It is used in the same way as GPT-4o mini.
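With the Groq Python SDK, which mirrors the OpenAI chat completions interface, the same binary prompt can be served by LLaMa-3.1–8B; the model identifier below is an assumption to be checked against Groq's current catalog:

```python
from groq import Groq

groq_client = Groq()  # assumes the GROQ_API_KEY environment variable is set


def llama_binary_classify(review: str, topic: str) -> bool:
    """Same binary prompt as above, served by LLaMa-3.1-8B through the Groq API."""
    prompt = (
        f'Customer review: "{review}"\n\n'
        f'Does this review mention the following topic: "{topic}"? '
        "Answer with 'yes' or 'no' only."
    )
    response = groq_client.chat.completions.create(
        model="llama-3.1-8b-instant",  # assumed Groq model identifier for LLaMa-3.1-8B
        temperature=0.0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```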
d) BART
To compare with large-scale Transformer-based pre-trained language models, the BART model was also evaluated in a zero-shot setting. BART is a large encoder-decoder model already fine-tuned for natural language inference tasks. It scores the review's membership in each class and outputs the classes whose score exceeds a fixed threshold (0.5 in this study). The classes used here are the topics found in the “Topic Modeling” phase.
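One way to reproduce this behaviour is the Hugging Face zero-shot classification pipeline with the `facebook/bart-large-mnli` checkpoint, scoring each topic independently and thresholding at 0.5 (the topic labels below are illustrative):

```python
from transformers import pipeline

# facebook/bart-large-mnli is BART fine-tuned on a natural language inference task.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

THRESHOLD = 0.5
topics = ["Retard de livraison", "Colis endommagé", "Manque de service à la clientèle"]


def bart_classify(review: str) -> list[str]:
    """Score the review against every topic independently and keep those above the threshold."""
    result = classifier(review, candidate_labels=topics, multi_label=True)
    return [
        label
        for label, score in zip(result["labels"], result["scores"])
        if score >= THRESHOLD
    ]
```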
e) Performance evaluation
To validate the model results, about 500 reviews were manually assigned to their relevant topic(s). The performance of the different methods was then evaluated using metrics such as F1-score, accuracy and recall for each topic. The models' F1-scores are given in Figure 5.
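Per-topic metrics can be computed with scikit-learn from binary indicator matrices built on the ~500 annotated reviews; the values below are illustrative only:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# y_true and y_pred are binary indicator matrices of shape (n_reviews, n_topics),
# built from the manually annotated reviews (illustrative values here).
topics = ["Retard de livraison", "Colis endommagé", "Manque de service à la clientèle"]
y_true = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0]])

for i, topic in enumerate(topics):
    print(
        f"{topic}: "
        f"F1={f1_score(y_true[:, i], y_pred[:, i]):.2f}, "
        f"recall={recall_score(y_true[:, i], y_pred[:, i]):.2f}, "
        f"accuracy={accuracy_score(y_true[:, i], y_pred[:, i]):.2f}"
    )
```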
The associated costs are described in Figure 6.
Note that the cost of the “Classification” phase depends on the number of topics to which a review can belong: the more topics, the higher the cost. In our case, 7 topics were possible and the GPT tokenizer was used to count the number of tokens in our inputs and outputs.
4 — Interpretation
Generative models were found to be slightly better than regex at associating reviews with their topic(s), at a very reasonable cost. The most general topic, “Manque de service à la clientèle” (lack of customer service), is the one on which every approach fails, possibly due to its vagueness.
While the other approaches were used as a succession of binary classifiers, BART's weaker performance can be attributed to its use as a multi-class classifier.
In our delivery agency use case, a simple regex classifier may have been enough to provide an initial labeling of the reviews, but this is most likely due to the simplicity of the topics found after our topic modeling phase. This is why we preferred using an LLM as annotator for datasets with a richer jargon. For other classification use cases where performance is crucial, fine-tuning a specific BERT model on annotated data remains ideal ([6] — [7]).
Conclusion
We have come to appreciate the benefits of LLMs for text mining. The aim of this article is to present an accessible, versatile, fast (in execution time) and inexpensive (in total estimated cost) methodology for dealing with similar use cases.
From an unstructured set of texts and instructions in a prompt format, we converged on a list of latent topics in our corpus. Different zero-shot classification approaches can then assign a review to its category. LLMs showed the highest performance for our use case.
However, there are still several concerns associated with relying on third-party APIs to process sensitive data, and the size of these models can also be an obstacle when the LLM is hosted on-premises. Nevertheless, the results obtained with LLaMa-3.1–8B (a much lighter model that can be hosted on-premises more easily) remain convincing enough to justify this type of approach. Although the use of pre-trained LLMs delivers strong performance for our uses, it also comes with a major interpretability drawback due to the lack of transparency regarding the training data of those models.
References
[1] Topic Modeling : Un processus de bout en bout pour extraire les thèmes d’un corpus de textes : https://www.heka.ai/fr/nos-publications/topic-modeling-un-processus-de-bout-en-bout-pour-extraire-les-themes-dun-corpus-de
[2] A journey to state of the art performances in topic analysis : https://heka-ai.medium.com/a-journey-to-state-of-the-art-performances-in-topic-analysis-8a8bf69de4e3
[3] LMSYS Chatbot Arena (LLM leaderboard) : https://chat.lmsys.org/
[4] LLaMa-3 evaluation details : https://github.com/meta-llama/llama3/blob/main/eval_details.md
[5] LLMs as zero-shot classifiers for easy tasks : https://arxiv.org/pdf/2312.01044.pdf
[6] ChatGPT and finetuned BERT: A comparative study for developing intelligent design support systems : https://www.sciencedirect.com/science/article/pii/S2667305323001333 (ChatGPT appears as good as BERT for sentence classification on domain-specific datasets but less efficient for phrase classification.)
[7] Benchmark comparing the performance of RoBERTa models against GPT-3.5 models : https://www.kdnuggets.com/2023/04/best-architecture-text-classification-task-benchmarking-options.html