An end-to-end process for semi-automatic topic modeling from a huge corpus of short texts

Heka.ai
10 min read · Nov 10, 2022


Introduction

We propose an end-to-end process for applying topic modeling to any business case while minimizing the human resources required. This article follows our previous article about topic modeling, which presented a detailed benchmark of various topic modeling techniques applied to a specific business case.

Let’s first remind ourselves what topic modeling is and why we need it. Topic modeling is a Natural Language Processing task that aims to extract meaningful topics from a huge collection of documents. Reading the full corpus would require a massive amount of time and human resources; the idea behind topic modeling is to automate this step.

As in the previous article, we will focus on extracting topics from reviews of postal service agencies, applying the whole process step by step:

  • A data preprocessing step (automatic)
  • Hyperparameter tuning for several models based on Bayesian optimization (automatic)
  • A deep analysis of the results (semi-automatic)

Data preprocessing

The dataset contains about 30,000 short reviews. After deleting stopwords, there are 33,782 unique words left in the full dataset. A few preprocessing steps are standard for any NLP project:

  1. Texts are set to lowercase.
  2. Punctuation and numbers are removed.
  3. The text is lemmatized (i.e., each word is replaced by its root form).
  4. Finally, stopwords (words that carry little meaning) are removed from the dataset.

Here is an example of preprocessed data:

Original document:

“Même si la traçabilité n’est pas aussi précise qu’avec Chronopost (pas le même tarif !!!)les envois sur la polynésie sont toujours dans des délais correct, je n’ai eu à ce jour aucun problème avec eux.”

(English translation: “Even if the tracking is not as precise as with Chronopost (not the same price!!!), shipments to Polynesia always arrive within reasonable delays; to this day I have had no problem with them.”)

Preprocessed document:

“si être aussi précise chronopost tarif envoi être toujours délai correct avoir avoir jour aucun problème”

Plenty of Python NLP libraries provide functions to handle these classical preprocessing steps. In our case, we worked with OCTIS, an NLP library specialized in topic modeling.
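As an illustration of the underlying logic, the four steps above can be written as a small standalone pipeline with spaCy’s French model (fr_core_news_sm). This sketch is only indicative and is not the exact pipeline used in our experiments:

import spacy

# French pipeline providing tokenization, lemmatization and stopword flags
nlp = spacy.load("fr_core_news_sm")

def preprocess(text: str) -> str:
    doc = nlp(text.lower())                      # 1. lowercase
    lemmas = [
        token.lemma_                             # 3. keep the lemma (root form) of each word
        for token in doc
        if not token.is_punct                    # 2. remove punctuation...
        and not token.like_num                   #    ...and numbers
        and not token.is_stop                    # 4. remove stopwords
        and not token.is_space
    ]
    return " ".join(lemmas)

print(preprocess("Les envois sur la Polynésie sont toujours dans des délais corrects !"))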

Metrics

Coherence

As mentioned in our previous article, coherence is a classic metric for evaluating topic models. It quantifies how semantically consistent the words within a topic are, based on how often they co-occur in documents. There are several coherence measures, but they all share the following structure:

An external reference corpus is used to estimate occurrence and co-occurrence probabilities for words.

Given two input words w and w’, we can use this probability to compute the conditional probability that a document contains the word w’ given the fact that it contains w. This conditional probability can be used to build a direct confirmation measure (many choices are possible for constructing this measure).

Now let’s consider a set of words W, and a word w belonging to this set. We can compute the sum of the direct confirmation measures between w and all the other words w’ in W to create the indirect confirmation measure of w. It means that, given a word w, we use all the words w’ of the set W as intermediates for computing the score of w.

Finally, the scores of all words in W are aggregated to compute the coherence score. For this analysis we chose to work with the C_V coherence score, which corresponds to the general structure explained above with specific choices for the direct confirmation measure, the indirect confirmation measure and the aggregation method.
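As an illustration, the C_V score can be computed with gensim’s CoherenceModel; the documents and topics below are toy data made up for the example:

from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# Tokenized reference documents (toy data for illustration)
texts = [
    ["colis", "livraison", "retard", "facteur"],
    ["compte", "banque", "virement", "conseiller"],
    ["colis", "facteur", "boite", "livraison"],
]
dictionary = Dictionary(texts)

# Top words of each topic, as returned by a topic model (toy data)
topics = [
    ["colis", "livraison", "facteur", "retard"],
    ["banque", "compte", "virement", "conseiller"],
]

# C_V coherence: word probabilities are estimated on the reference texts
cv = CoherenceModel(topics=topics, texts=texts, dictionary=dictionary,
                    coherence="c_v", topn=4)
print(cv.get_coherence())  # aggregated coherence over all topics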

Diversity

Another important metric is diversity, which measures how broad the topic model is, i.e. its ability to capture as much information as possible. There are basically two kinds of diversity measures: some are based on word tokens considered as sets, and others on probability distributions.

The most widely used measure is the average Jaccard similarity (JS). For two topics t1 and t2, each represented by its set of top words, it is the ratio of shared words:

JS(t1, t2) = |t1 ∩ t2| / |t1 ∪ t2|

We obtain a diversity score by taking 1 − JS(t1, t2), averaged over all pairs of topics.

Another metric based on word tokens considered as sets is the percentage of unique words among the top 10 words of all topics; this is the metric chosen for our experiments, illustrated in the sketch below.
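To make the two set-based measures concrete, here is a small sketch computing both of them on toy topics (the word lists are purely illustrative):

from itertools import combinations

# Top-10 words of each topic (illustrative)
topics = [
    ["colis", "livraison", "facteur", "retard", "suivi",
     "delai", "envoi", "reception", "perdu", "attente"],
    ["banque", "compte", "virement", "conseiller", "carte",
     "agence", "rendez", "vous", "attente", "guichet"],
]

# Diversity as the proportion of unique words among all top-10 words
def unique_word_diversity(topics):
    all_words = [w for topic in topics for w in topic]
    return len(set(all_words)) / len(all_words)

# Diversity as 1 minus the average pairwise Jaccard similarity
def jaccard_diversity(topics):
    sims = [len(set(a) & set(b)) / len(set(a) | set(b))
            for a, b in combinations(topics, 2)]
    return 1 - sum(sims) / len(sims)

print(unique_word_diversity(topics))   # 0.95: only "attente" appears twice
print(jaccard_diversity(topics))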

Probability-based metrics generally use the Kullback-Leibler (KL) divergence to compare the topics’ distributions over words.

A balance has to be found between coherence and diversity, and a human eye is needed to assess the relative importance of these two metrics for a specific business case.

Metrics with predefined topics

Depending on the use case, a topic list can be predefined. In this article we focus on metrics and optimization without predefined topics; this section only provides general background on topic modeling.

The list can be complete or partial. The goal is to compare the list of topics found by the model with the predefined topics and score the match. Specific metrics can be used for this purpose. In the following section we will mainly focus on two of them: cosine similarity and Triangle area Similarity — Sector area Similarity (TS-SS).

Cosine similarity

Cosine similarity is a well-known metric used in many NLP applications. It measures the similarity between two vectors.

Given two vectors A and B, the cosine similarity cos(θ) is given by the following formula:

cos(θ) = (A · B) / (||A|| ||B||)

Since the word-count or probability vectors used here have non-negative components, the value lies between 0 and 1: the closer the score is to 1, the more similar the vectors are.

One major drawback of cosine similarity is that it does not take the vectors’ magnitudes into account, which the TS-SS metric does.

Triangle area Similarity — Sector area Similarity (TS-SS)

As described in this paper, TS-SS combines cosine similarity and Euclidean distance to take into account both the direction and the magnitude of the vectors.

Given two vectors A and B, the first part of the metric is the triangle similarity TS(A, B), the area of the triangle formed by the two vectors:

TS(A, B) = (||A|| · ||B|| · sin(θ′)) / 2, with θ′ = arccos(cos(θ)) + 10°

We add a minimum angle of 10° so that the triangle area never collapses to zero when the two vectors are collinear.

And the sector similarity SS is given by:

SS(A, B) = π · (ED(A, B) + MD(A, B))² · (θ′ / 360)

where MD(A, B) is the magnitude difference, i.e. the absolute value of the difference between ||A|| and ||B||, ED(A, B) is the Euclidean distance between A and B, and θ′ is expressed in degrees.

Finally, the two areas are multiplied to obtain the metric:

TS-SS(A, B) = TS(A, B) × SS(A, B)

TS-SS ranges from 0 to positive infinity. As opposed to cosine similarity, the closer TS-SS is to 0, the closer the vectors are.
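As an illustration, both measures can be implemented in a few lines with NumPy, following the formulas above (the input vectors below are toy examples):

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def ts_ss(a, b):
    # Angle between the vectors, plus the 10 degree offset (kept in radians here)
    theta = np.arccos(np.clip(cosine_similarity(a, b), -1.0, 1.0)) + np.radians(10)
    # Triangle Similarity: area of the triangle formed by the two vectors
    ts = np.linalg.norm(a) * np.linalg.norm(b) * np.sin(theta) / 2
    # Sector Similarity: area of a circular sector built from the
    # Euclidean distance (ED) and the magnitude difference (MD)
    ed = np.linalg.norm(a - b)
    md = abs(np.linalg.norm(a) - np.linalg.norm(b))
    ss = np.pi * (ed + md) ** 2 * np.degrees(theta) / 360
    return ts * ss

a = np.array([0.2, 0.5, 0.3])
b = np.array([0.1, 0.6, 0.3])
print(cosine_similarity(a, b))   # close to 1: similar directions
print(ts_ss(a, b))               # close to 0: similar vectors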

Comparison of Cosine Similarity and TS-SS

To conclude, in our experience with topic modeling, TS-SS is better suited for comparing models with one another, thanks to its sharper sensitivity to variations in topic relevance.

Optimization step

Once the data is preprocessed, it is time to train the models and optimize their hyperparameters. To complete what we mentioned in our previous articles about topic modeling, we chose to focus on three neural network models: NeuralLDA, ProdLDA and CTM. These models are derived from the classical Latent Dirichlet Allocation model but are coupled with deep learning techniques in order to improve performance.

In our experiments we chose the Python library Optuna for the optimization. This library implements Bayesian optimization techniques that iteratively propose new hyperparameter values maximizing a given metric.

Here, we try to maximize the mean of diversity and coherence. We assigned an equal weight to both metrics because our use case gave no strong indication of which one to favor. For a given business case, it is worth evaluating this balance more precisely and assigning different weights to the two metrics.

For each model, the idea is to run Optuna for several hundred iterations (~500). At each iteration, Optuna suggests values for all hyperparameters, a model is trained with these values, the mean of the two metrics is returned to Optuna, and the overall performance is saved, along with the hyperparameters, in a results table.
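Below is a sketch of this optimization loop with Optuna. The functions train_model, compute_coherence and compute_diversity are placeholder stubs standing in for the actual training and evaluation code, and the hyperparameter names are only illustrative:

import optuna

# Placeholders standing in for the real training and evaluation code
def train_model(params):
    return params                                # stand-in for ProdLDA / CTM training

def compute_coherence(model):
    return 0.5                                   # stand-in for the C_V coherence

def compute_diversity(model):
    return 0.8                                   # stand-in for unique-word diversity

def objective(trial):
    # Hyperparameter values suggested by Optuna at each iteration (illustrative names)
    params = {
        "num_topics": trial.suggest_int("num_topics", 5, 50),
        "dropout": trial.suggest_float("dropout", 0.0, 0.5),
        "learning_rate": trial.suggest_float("learning_rate", 1e-4, 1e-1, log=True),
    }
    model = train_model(params)
    # Equal weights on coherence and diversity, as discussed above
    return 0.5 * compute_coherence(model) + 0.5 * compute_diversity(model)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=500)
print(study.best_params, study.best_value)       # best iteration found
# study.trials_dataframe() returns the full table of iterations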

After each model’s run, convergence checks should be performed. On this graph, the metric optimized by Optuna (the mean of diversity and coherence) is plotted for each model across iterations. Due to poor performance (very low coherence values, mostly irrelevant topics), we decided to stop the NeuralLDA experiments and focus only on ProdLDA and CTM. For example, NeuralLDA proposed the topic shown below:

This topic is a mix of off-topic comments and interesting comments on various subjects. Moreover, no clear meaning can be detected from the topic’s word cloud.

Analysis step

Model selection

After the convergence checks, the idea is to visualize all iterations on a diversity-coherence graph as presented below.

One can observe that such a graph is tricky to interpret directly. Thus, it is better to plot only the Pareto frontiers, keeping the “best” iterations of each model. This type of graph is useful for comparing model families and selecting promising iterations.

In our case, even though we ran fewer iterations for NeuralLDA, we selected its iteration with a diversity equal to 1, alongside the ProdLDA iteration with the best diversity and the CTM iteration with the best coherence, resulting in three topic models to explore.

Model manual exploration

As a first step, we use a PCA visualization to see how close the topics suggested by the different models are to one another:

We can observe that the NeuralLDA topics seem to be covered by those proposed by the other models, whereas ProdLDA and CTM both propose original topics that NeuralLDA did not suggest.
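Such a view can be produced, for instance, with scikit-learn’s PCA applied to the topic-word distributions of each model. In the sketch below the topic-word matrices are random placeholders standing in for the real model outputs:

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Illustrative placeholders: one (n_topics x vocab_size) matrix per model
rng = np.random.default_rng(0)
topic_word_matrices = {
    "NeuralLDA": rng.dirichlet(np.ones(500), size=10),
    "ProdLDA": rng.dirichlet(np.ones(500), size=10),
    "CTM": rng.dirichlet(np.ones(500), size=10),
}

# Stack all topics and project them onto two principal components
all_topics = np.vstack(list(topic_word_matrices.values()))
coords = PCA(n_components=2).fit_transform(all_topics)

start = 0
for name, matrix in topic_word_matrices.items():
    end = start + matrix.shape[0]
    plt.scatter(coords[start:end, 0], coords[start:end, 1], label=name)
    start = end
plt.legend()
plt.show()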

In order to dive deeper into each topic, one should understand how each model defines a “topic”. As inherited from LDA, topics have a probabilistic definition: each topic is a probability distribution over the vocabulary, i.e., given a topic, every word in the vocabulary receives a score representing how strongly it belongs to that topic. Similarly, a document is modeled as a probability distribution over the topics: given a document, every topic receives a score representing how present it is in that document.

In order to make all topics intelligible, we chose to analyze two synthetic representations of the topics. First, we considered word clouds, built from the topic’s distribution over the vocabulary (the score of each word) and keeping only the top 50 words.

In the example above, the topic gathers comments about the banking services offered within post offices. Indeed, the postal agencies studied here also provide banking services.
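A word cloud like this one can be generated with the wordcloud package from a topic’s word scores; the weights below are illustrative only:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Illustrative weights for the top words of one topic (banking services)
topic_word_scores = {
    "banque": 0.09, "compte": 0.07, "virement": 0.05,
    "conseiller": 0.04, "carte": 0.04, "agence": 0.03,
}

wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(topic_word_scores)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()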

The second representation we opted for is what we call the “top” documents.

Let w_{d,t} be the weight of topic t in the distribution of document d. Then, given a topic t, we compute a document score s_t(d):

s_t(d) = ((w_{d,t} − min_t) / (max_t − min_t)) / |d|

where (w_{d,t} − min_t) / (max_t − min_t) is the min-max topic-normalized weight of topic t in the distribution of document d, and min_t and max_t are respectively the lowest and highest weights over all documents for topic t.

The denominator is a penalization on the document length |d|, because we observed that without it the score tends to favor long comments.

Once all documents are scored, we can select the top 5 or top 10 in order to read the most representative documents. This representation allows us to go deeper into each topic and to distinguish close topics on specific points.
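Here is a sketch of this scoring. doc_topic_matrix stands for the (documents × topics) weight matrix produced by the model and documents for the corresponding preprocessed texts; both are toy placeholders below:

import numpy as np

def top_documents(doc_topic_matrix, documents, topic, k=5):
    """Return the k documents most representative of a given topic."""
    weights = doc_topic_matrix[:, topic]
    # Min-max normalization of the topic weights over all documents
    normalized = (weights - weights.min()) / (weights.max() - weights.min())
    # Penalize long documents by dividing by their length (in tokens)
    lengths = np.array([len(doc.split()) for doc in documents])
    scores = normalized / lengths
    best = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in best]

# Illustrative toy data: 3 documents, 2 topics
documents = ["colis livraison retard facteur",
             "compte banque virement conseiller agence",
             "colis perdu suivi"]
doc_topic_matrix = np.array([[0.9, 0.1],
                             [0.2, 0.8],
                             [0.7, 0.3]])
print(top_documents(doc_topic_matrix, documents, topic=0, k=2))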

Conclusion

We have seen an end-to-end process for running topic modeling on a specific business case. This process involves mostly automatic steps but can also rely on manual techniques especially for model selection and topic exploration.

Manual model selection adds customization to the process, since different business cases will lead to different model choices; this step therefore proves necessary. Last but not least, it strengthens the user’s involvement, which increases confidence in the produced results.

Conversely, the overall process could be improved by automating the topic exploration step, which consists in reformulating all topics in a human-intelligible form. A further direction would be to use NLP summarization models to produce a few fluent sentences describing each topic.

