A journey to state-of-the-art performance in topic analysis

Heka.ai
May 19, 2022


Topic Modeling: the path to discovering relevant topics in a set of documents

Introduction

This article covers the methods that are currently most useful for Topic Modeling, a field of Natural Language Processing that aims to automatically determine the topics inside a document or a set of documents. Beyond the well-known LDA approach, other methods have flourished in recent years, giving hope for better performance.

Let’s first go back to the purpose of topic analysis. Many situations require categorizing documents that can be so numerous that a direct human analysis would require a huge amount of work and time. Yet this work must be done to make people’s lives easier: helping them find specific information and understanding their feedback to take corrective actions are the two main reasons that drive us to topic modeling.

From our side, our mission was to help a postal service company extract relevant feedback from agencies’ reviews. Sometimes business experts already have intuitions about what they want to extract and measure, which was the case with the postal service company, but sometimes they don’t, or fear they are missing something. That is why work on topic analysis should include one or several steps of topic modeling, and not only topic detection (topic detection will be the subject of Part 2 of this journey).

Data

Let us come back to the postal service company. It manages a network of more than 15,000 agencies and gets feedback from Google reviews, Twitter, dedicated satisfaction surveys, and other sources. Google alone accounts for more than 10,000 reviews, which cannot be analyzed consistently by humans alone, even though each review is relatively short (limited to 4,000 characters). Let us see what we can do to find out what is on people’s minds, with no a priori assumptions.

Evaluation Criteria

In order to get a grasp of how well a method extracts topics from a collection of documents, we need a measure of the relevance of the topics, which lets us quantify how interpretable the topics are to humans. Note that no method will directly give you a synthetic expression of a topic; rather, each will give you several collections of words or expressions per topic.

One of the most famous measures of relevance in topic modeling is the coherence score. It measures the degree of semantic similarity between the high-scoring words of a topic. Several versions exist, like the CV and UMass scores, which are based on probability theory. Word2Vec has its own coherence score based on cosine similarity, applied both between words inside the same topic (measuring intra-topic similarity, i.e. how coherent a topic is) and across two different topics (measuring inter-topic similarity, i.e. how well separated two topics are). This approach is inspired by the “silhouette coefficient” used in clustering to select a number of clusters. Usually, the more topics, the better the silhouette coefficient; past a certain point, however, the extra precision hurts the readability of the topics. That is why an elbow method is often used to find a good trade-off.
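
As a minimal sketch, here is how the CV coherence score can be computed with gensim’s CoherenceModel (the tokenized reviews and topic word lists below are illustrative placeholders, not the project data):

```python
# Minimal sketch: computing the CV coherence score with gensim.
# The tokenized reviews and topic word lists are illustrative placeholders.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

tokenized_reviews = [
    ["office", "closed", "before", "time"],
    ["staff", "friendly", "fast", "service"],
    ["opening", "hours", "too", "short"],
]
topics = [
    ["opening", "hours", "closed", "time"],    # candidate topic 1
    ["staff", "service", "friendly", "fast"],  # candidate topic 2
]

dictionary = Dictionary(tokenized_reviews)
cm = CoherenceModel(
    topics=topics,
    texts=tokenized_reviews,
    dictionary=dictionary,
    coherence="c_v",  # the CV coherence score used in this article
)
print(cm.get_coherence())            # aggregate score across topics
print(cm.get_coherence_per_topic())  # one score per topic
```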

Two scores were used in this analysis: the classic CV coherence score, and the silhouette coefficient combined with an elbow method. Moreover, in our use case, business experts already had an idea of the kind of topics and keywords they were expecting from the reviews, and wanted to confront it with an agnostic vision. In this case, it can be interesting to analyze the similarity between the expected topics and keywords and the ones extracted by the unsupervised topic model. That is what we did, using a cosine similarity measure.
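
To illustrate the idea, here is a hedged sketch of such a comparison using sentence embeddings and cosine similarity (the model name and keyword lists are assumptions made for the example, not the exact ones used in the project):

```python
# Hypothetical sketch: comparing an expert-defined topic with an extracted one
# via cosine similarity of sentence embeddings.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

expected_topic = "opening hours timetable schedule closed"        # expert keywords
extracted_topic = "hours open close morning saturday timetable"   # model keywords

vectors = encoder.encode([expected_topic, extracted_topic])
score = cosine_similarity(vectors[0:1], vectors[1:2])[0, 0]
print(f"similarity between expected and extracted topic: {score:.2f}")
```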

Latent Dirichlet Allocation and clustering: the popular approach

Latent Dirichlet Allocation is the best-known approach to topic modeling. It is based on a probabilistic framework: Dirichlet distributions underlie both the distribution of topics inside documents and the distribution of words inside topics. It treats each document as a bag of words, and ignores syntax and context.
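
For reference, a minimal LDA fit with gensim could look like the following sketch (the corpus and hyperparameters are illustrative):

```python
# Minimal LDA sketch with gensim; corpus and number of topics are illustrative.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

tokenized_reviews = [
    ["office", "closed", "before", "closing", "time"],
    ["very", "friendly", "staff", "fast", "service"],
    ["opening", "hours", "too", "short", "on", "saturday"],
]

dictionary = Dictionary(tokenized_reviews)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_reviews]  # bag-of-words

lda = LdaModel(corpus=bow_corpus, id2word=dictionary,
               num_topics=2, passes=10, random_state=42)

# Per-document topic probabilities (the vectors later combined with sBERT embeddings)
doc_topics = [lda.get_document_topics(bow, minimum_probability=0.0) for bow in bow_corpus]
print(lda.print_topics())
```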

To mitigate this drawback, we can combine LDA with sentence transformers (sBERT). The architecture of the model we used is described below. Documents are vectorized using sentence transformers, and those vectors are concatenated with the topic probability vectors produced by the LDA model for each document. We also used a version where LDA is left out. A cluster analysis is then performed with UMAP and K-Means, and the most relevant keywords of each cluster are extracted via TF-IDF.

Architecture of the model
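
As an illustration, here is a hedged sketch of such a pipeline. The library choices, the embedding model name, and the weighting parameter β are assumptions consistent with the description above, not the exact implementation:

```python
# Hedged sketch of the sBERT + β*LDA + clustering pipeline described above.
# Library choices and the weighting parameter beta are illustrative assumptions.
import numpy as np
import umap
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

def extract_topics(reviews, lda_doc_topics, n_clusters=10, beta=0.5):
    """reviews: list of raw texts; lda_doc_topics: (n_docs, n_topics) probability matrix."""
    # 1. Sentence embeddings
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(reviews)

    # 2. Concatenate embeddings with (weighted) LDA topic probabilities
    features = np.hstack([embeddings, beta * lda_doc_topics])

    # 3. Dimensionality reduction and clustering
    reduced = umap.UMAP(n_neighbors=15, n_components=5, random_state=42).fit_transform(features)
    labels = KMeans(n_clusters=n_clusters, random_state=42).fit_predict(reduced)

    # 4. Keywords per cluster via TF-IDF over the concatenated documents of each cluster
    cluster_docs = [" ".join(r for r, l in zip(reviews, labels) if l == c)
                    for c in range(n_clusters)]
    tfidf = TfidfVectorizer(stop_words="english")
    scores = tfidf.fit_transform(cluster_docs).toarray()
    vocab = np.array(tfidf.get_feature_names_out())
    return {c: vocab[np.argsort(scores[c])[::-1][:10]].tolist() for c in range(n_clusters)}
```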

Several optimization steps are involved:

  • the number of topics in LDA is optimized through the CV coherence score
  • the relative importance of LDA versus the sentence embeddings is controlled through a weighting parameter β
  • UMAP and K-Means require a hyperparameter optimization step to find the best number of clusters for K-Means, and the number of neighbors and components for UMAP (a silhouette-based sweep is sketched below).
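
A minimal sketch of such a sweep, assuming the silhouette coefficient and elbow heuristic described earlier (the parameter grid is illustrative):

```python
# Illustrative sweep over the number of K-Means clusters using the silhouette
# coefficient; 'reduced' would be the UMAP-reduced feature matrix from the pipeline above.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_curve(reduced, k_range=range(2, 30)):
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, random_state=42).fit_predict(reduced)
        scores[k] = silhouette_score(reduced, labels)
    # In practice the elbow is picked by eye or with a knee-detection heuristic;
    # here we simply return the full curve for inspection.
    return scores
```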

Note that in this method, there is no direct control over the independence of topics, which can overlap one another.

Correlation Explanation

That is when Correlation Explanation (CorEx) models come on stage. This model seeks latent topics that best explain the dependencies between words in the documents.

This model requires tuning the number of topics through an elbow method on the total correlation explained by the topics. It can also be improved by assigning ‘anchor words’ to each topic, making the process semi-supervised. These words can be chosen manually or by calculating and maximizing their mutual information with the topic.
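
A minimal sketch with the corextopic package is given below; the corpus and anchor words are illustrative, not the ones defined with the business experts:

```python
# Minimal sketch of a semi-supervised CorEx topic model with anchor words,
# using the 'corextopic' package. Corpus and anchors are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from corextopic import corextopic as ct

reviews = [
    "the office was closed before the official closing time",
    "very friendly staff and fast service",
    "opening hours are too short on saturday",
]

vectorizer = CountVectorizer(binary=True, stop_words="english")
doc_word = vectorizer.fit_transform(reviews)
words = list(vectorizer.get_feature_names_out())

model = ct.Corex(n_hidden=5, seed=42)  # n_hidden = number of latent topics
model.fit(
    doc_word,
    words=words,
    anchors=[["opening", "hours"], ["staff", "service"]],  # illustrative anchor words
    anchor_strength=3,
)

print(model.tc)  # total correlation explained, used for the elbow method
for i, topic in enumerate(model.get_topics(n_words=5)):
    print(i, [word for word, *_ in topic])
```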

Let’s put it all together: which method falls behind?

As said before, the number of topics for each method was adjusted through the elbow method applied to the CV coherence score or the silhouette score, and it was further refined by inspecting the words output by each model for each topic. These output words representing the topics were then compared to a predefined set of topics and keywords created by business experts. This allowed us to compute a cosine similarity score between the predefined topics and the topic models’ outputs, and to compare the performance of the different models.

As shown below, the four models (two main approaches, each with two variants) are neck and neck, with similarity scores ranging from 0.71 to 0.74. The combination of sBert and LDA on the one hand, and the CorEx model with anchor words on the other, are the models that perform best.

Cosine similarity score between model output topics and predefined topics: comparison across models

Let’s dive a little further into the words shaping the topics output by the models. We did not represent the “CorEx + anchor words” variant, which gave nearly the same results as “CorEx” but enabled us to add other topics, which we will discuss later. The three models extracted topics related to quality of service and opening hours. These are actually the main concerns in the customers’ feedback, with more than 80% of the reviews dealing with them. The three models were good at spotting them, but not in the same way. The “sBert + LDA + clustering” model seemed to identify a subtlety about opening hours: on one side, the concern that the timetable is insufficient, with words like “opening hours”, “timetable”, or the names of weekdays involved; on the other, the concern that the opening hours are not respected and the office is closed at a time it shouldn’t be, with “close” and words around notices being mentioned, which is often the reason customers complain about the office being unexpectedly closed. The words describing quality of service often shaped several topics, since it represents the vast majority of the reviews.

Last but not least, the “CorEx” model surfaced interesting topics that were not pinpointed by the others, like the concerns around Covid-19, security, and power of attorney. It was also good at giving variations and synonyms. Both of these characteristics can be attributed to the correlation ingredient of the method.

Keywords output by the topic models

As could be expected, the three models did not perform well at finding the more specific topics that were of concern to the business experts: PRM accessibility, Business Services, Equipment, and Pricing (this was improved with the use of anchor words in CorEx).

Following these observations, which model should be preferred? It seems that they all have their own interest and can give their own insights, so in the absence of time and resource constraints, all four could be used to get the most out of these automatic extraction techniques. However, as shown below, the “CorEx” method is faster and its runtime grows logarithmically with the number of documents, whereas the runtime of the “sBert + β*LDA + clustering” methods grows linearly.

Time (seconds) versus number of documents. Left: linear scale. Right: logarithmic scale. Resources used: 4 vCPU / 8 processors / 16 GB RAM.

As a result, the Correlation Explanation approach should be prioritized in all cases, but the “sBert + clustering” method should also be considered to extract further information, especially when the documents are short and not too numerous.

This first step of our journey also showed the value of having business experts predefine topics. This can help extract more information from topic models (through anchor words or through analysis), and it will also be useful for the second step of the journey: automating document categorization and insight extraction through supervised topic detection.

Learn more about topic modeling with the next article here.

