Labeling text clusters with keywords

Heka.ai
9 min read · Jun 8, 2023


We explore several keyword extraction techniques for labeling text clusters obtained from a Text Clustering or Topic Modeling pipeline. This work follows our previous articles on Topic Modeling and Text Clustering (here and here).

Introduction

As a reminder, the aim of a Topic Modeling framework is to detect topics in a huge corpus of texts without having to spend a massive amount of time analyzing the full corpus. This is a Natural Language Processing (NLP) task, and the literature is full of benchmarks assessing the performance of topic modeling techniques.

Here, the objective is to design the very last step of a topic modeling pipeline. Once we have detected “topics” or “clusters”, how do we use them to easily gather information from the corpus? Manual exploration of the clusters can be performed on the detected topics; however, the best scenario would be to have short and descriptive labels associated with each cluster. This task is named cluster labeling.

There are two main families of techniques when it comes to cluster labeling: extractive and generative. As their names suggest, extractive techniques extract labels from the clusters' own words, while generative techniques generate original labels that do not necessarily use the clusters' words.

This article will focus on extractive techniques based on keyword extraction methods.

We tested these techniques on the same dataset as in the previous articles: a corpus of 200,000 online reviews of delivery agencies, using an existing clustering with 12 clusters. Below is a table showing one random document from each of the 12 clusters.

Table 1- Example of a review for each cluster

Performance assessment

The cluster labeling task is completely unsupervised, so the question of performance assessment is crucial: we need a way to compare the techniques against each other.

We propose to decompose performance into two main components: intelligibility and accuracy.

The intelligibility component captures how easily a human can guess the meaning of a topic from the extracted keywords. The goal is to assess how understandable the keywords are; for example, if they are too different from each other, they can be considered unintelligible. For each technique, we counted the number of clusters to which we were able to manually assign an intelligible label using the extracted keywords. This count represents the intelligibility score of the technique (regardless of the accuracy of the assigned label).

The accuracy component captures how truly representative the extracted keywords are. To assess it, we compared each cluster's content with the keywords and manually checked how accurate they were. For each cluster, we manually created two ground truths: an extractive label and an abstract label. We then computed ROUGE-1 scores between these manual labels and the proposed keywords. This score represents the accuracy of the technique and is min-max normalized to facilitate the benchmark. We also added a manual evaluation of the keywords' accuracy.

Table 2 — Example of an abstract label and extractive label (ground truth) for cluster 3.

We must emphasize that this performance assessment is quite subjective, since it involves a lot of human judgment. One must also bear in mind that the ROUGE-1 score can be quite limited when it comes to assessing abstract labels: it penalizes any word that differs from the reference, so synonyms get no credit under this metric.
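To make the metric concrete, here is a minimal ROUGE-1 F1 sketch: it counts overlapping unigrams between a reference label and the extracted keywords (the function name and example strings are ours, for illustration only):

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: unigram overlap between a reference label
    and a candidate string of extracted keywords."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    # Clipped overlap: each reference token can be matched at most
    # as many times as it occurs in the candidate.
    overlap = sum((Counter(ref) & Counter(cand)).values())
    recall = overlap / len(ref)
    precision = overlap / len(cand)
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)
```

For instance, `rouge1_f1("late delivery", "delivery late package")` scores 0.8: both reference words are recovered, but the extra word lowers precision. A synonym such as "shipment" for "delivery" would score zero, which is exactly the limitation noted above.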

Statistical techniques

Statistical techniques are the most classical approach to topic labeling. They are usually based on word and n-gram statistics within each cluster.

First, there is a cluster-level version of the well-known TF-IDF metric, called c-TF-IDF: a TF-IDF (term frequency, inverse document frequency) score applied to classes instead of individual documents. All documents of a cluster are concatenated into a single document, and these per-cluster documents are used to compute the score. Applied to our dataset, this technique shows good accuracy results but poor intelligibility results.
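As a rough sketch of the idea, the snippet below concatenates each cluster's documents and scores terms with a c-TF-IDF-style formula (we follow the common tf · log(1 + A/f_t) variant, where A is the average word count per class and f_t the term's corpus frequency; function and variable names are ours):

```python
import math
from collections import Counter

def c_tf_idf(cluster_docs):
    """cluster_docs: {cluster_id: [doc, ...]} -> {cluster_id: {term: score}}.
    Concatenate each cluster's documents into one, then score each term by
    its class frequency times log(1 + avg_words / corpus_frequency)."""
    class_counts = {c: Counter(" ".join(docs).lower().split())
                    for c, docs in cluster_docs.items()}
    total_freq = Counter()
    for counts in class_counts.values():
        total_freq.update(counts)
    avg_words = sum(sum(c.values()) for c in class_counts.values()) / len(class_counts)
    scores = {}
    for c, counts in class_counts.items():
        n_c = sum(counts.values())  # words in this class
        scores[c] = {t: (f / n_c) * math.log(1 + avg_words / total_freq[t])
                     for t, f in counts.items()}
    return scores
```

The top-scoring terms per cluster would then serve as its candidate label.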

Next, we explored a probabilistic differential technique based on Pointwise Mutual Information (PMI). This probabilistic score measures how much a term in a document contributes to classifying the document into its cluster. A detailed explanation can be found here. The technique is intrinsically differential: it is meant to separate clusters from each other. It achieved better scores but is more computationally expensive than c-TF-IDF.
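A simplified illustration of the differential idea, assuming unigram counts as probability estimates: PMI(t, c) = log(P(t|c) / P(t)), so terms that are frequent in a cluster but rare in the overall corpus get the highest scores (the helper below is our own toy version, not the exact pipeline used in the benchmark):

```python
import math
from collections import Counter

def pmi_keywords(cluster_docs, top_k=5):
    """Rank terms per cluster by PMI(t, c) = log(P(t | c) / P(t)).
    Terms specific to a cluster score high; terms shared across
    the whole corpus score near zero."""
    class_counts = {c: Counter(" ".join(d).lower().split())
                    for c, d in cluster_docs.items()}
    corpus = Counter()
    for cnt in class_counts.values():
        corpus.update(cnt)
    total = sum(corpus.values())
    out = {}
    for c, cnt in class_counts.items():
        n_c = sum(cnt.values())
        scores = {t: math.log((f / n_c) / (corpus[t] / total))
                  for t, f in cnt.items()}
        out[c] = [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]
    return out
```

Note how a word like "shipping" that appears in every cluster gets PMI ≈ 0, while cluster-exclusive words dominate the ranking, which is precisely the differential behavior described above.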

Finally, we explored n-gram methods. These methods extract “keyphrases” instead of keywords by scoring every n-gram of the texts (sequences of n words) and keeping the best ones. The YAKE algorithm is a 5-step keyphrase extraction framework based on term frequencies. Applied to our clusters, it reached a high intelligibility score and an average accuracy score.
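The n-gram counting at the core of such methods can be sketched as below. This is only a toy frequency scorer standing in for YAKE's first steps (the real algorithm adds casing, position, dispersion and relatedness statistics on top of frequency); the stopword list and names are illustrative:

```python
from collections import Counter

def top_ngrams(text, n=2, top_k=3,
               stopwords=frozenset({"the", "a", "of", "and", "to"})):
    """Toy keyphrase scorer: count every n-gram (sequence of n words)
    that contains no stopword, and keep the most frequent ones."""
    words = text.lower().split()
    grams = Counter(
        " ".join(words[i:i + n])
        for i in range(len(words) - n + 1)
        if not any(w in stopwords for w in words[i:i + n])
    )
    return [g for g, _ in grams.most_common(top_k)]
```

On a cluster of delivery complaints, a recurring bigram like "late delivery" would surface immediately, which is why such phrases tend to be more intelligible than isolated unigrams.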

Table 3 — Example of keywords returned using statistical techniques.

Embedding-based techniques

Another family of extractive techniques for topic labeling is embedding-based techniques. The idea is to embed the texts we want to label. Embeddings represent the semantic meaning of words and their relations to each other, which can help identify and label topics more accurately, in a way statistical techniques cannot.

During our exploration, we mainly focused on KeyBERT, a keyword extraction technique that uses BERT embeddings to find the keywords most similar to a document. We consider a document to be the concatenation of all texts within a cluster. First, the document embedding is computed with BERT. Then, embeddings are computed for candidate n-grams. Finally, cosine similarity identifies the candidate embeddings most similar to the document embedding; the top-ranked n-grams can be considered the most descriptive keywords for the whole document.
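The ranking step can be sketched as follows. For readability we pass in precomputed toy vectors; in practice both the document and candidate embeddings would come from a BERT model (the function names and vectors here are our own illustration, not KeyBERT's API):

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_candidates(doc_emb, candidate_embs, top_k=2):
    """KeyBERT-style ranking: keep the candidate n-grams whose
    embeddings are most cosine-similar to the document embedding."""
    ranked = sorted(candidate_embs.items(),
                    key=lambda kv: -cosine(doc_emb, kv[1]))
    return [word for word, _ in ranked[:top_k]]
```

With real embeddings, semantically related candidates (e.g. "delivery" for a cluster about shipments) end up closest to the cluster-level document vector.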

Instead of plain cosine similarity, it is possible to use Maximal Marginal Relevance (MMR) or Max Sum Similarity (MSS). Both approaches allow for greater diversity in the extracted keywords by considering both the similarity of candidates to the document and their similarity to previously selected keywords.

In our case, we used KeyBERT with both French and English BERT embeddings (using a translated version of the dataset for the English runs). The French version produced one of the lowest accuracies of all the methods we tested. The English version produced better results, with a slight further improvement when combined with Maximal Marginal Relevance.

In terms of intelligibility, all KeyBERT approaches gave relatively average results: we were able to identify the overall meaning of 5 or 6 clusters out of 12.

Embedding-based methods can be computationally demanding, especially when processing long documents, which is common when working with large clusters of text. As a result, in this study we only generated 1-gram keywords to mitigate the computational cost, which could partly explain the relatively average results observed.

Table 4 — Example of keywords returned using the KeyBERT approach.

Graph-based techniques

Using graphs is another way to identify the main keywords and keyphrases within a text. A graph is a set of nodes with connections (edges) between them. A text can be represented as a graph by, for example, turning words or groups of words into nodes, with edges encoding how strongly terms are connected. The relevance of a word in the text is then related to its connections to other words, and graph-based algorithms use this information to surface the best candidates.

These techniques are inspired by the PageRank algorithm, which Google uses to rank web pages in its search engine results. It works by counting the number and quality of links to a page to determine how important the page is: nodes are web pages, and each page's weight depends on the weights of the pages linking to it. To go further on this subject, we suggest reading this article.

For keyword extraction, several methods build on the same principle; they differ in their definitions of nodes and edges. A graph-based algorithm returns as keywords the nodes most strongly connected to the other nodes. The text is usually preprocessed to keep only words with specific POS tags (e.g., only nouns, proper nouns, and verbs).

The first algorithm we looked at was TextRank. In this method, nodes are the words of the document, so all words of the cluster are candidate keywords. A window size is defined, and any pair of words within this window is connected by an undirected edge. This method obtained the highest extractive ROUGE score of all the methods tested (see table). However, we were able to identify only 5 topics out of 12.
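The mechanism can be sketched in a few lines: build the co-occurrence graph over the token sequence, then run a PageRank-style power iteration. This is a bare-bones sketch (no POS filtering or convergence check, names are ours), not a full TextRank implementation:

```python
from collections import defaultdict

def textrank(words, window=2, damping=0.85, iters=50):
    """Minimal TextRank sketch: undirected co-occurrence graph
    (edge between any two words within `window` positions of each
    other), scored by PageRank-style power iteration."""
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for other in words[i + 1:i + window + 1]:
            if other != w:
                neighbors[w].add(other)
                neighbors[other].add(w)
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        # Each node redistributes its score evenly among its neighbors.
        score = {w: (1 - damping) + damping * sum(score[v] / len(neighbors[v])
                                                  for v in neighbors[w])
                 for w in neighbors}
    return sorted(score, key=score.get, reverse=True)
```

The word that co-occurs with the widest variety of other words ends up with the highest score, mirroring how highly linked pages rank first in PageRank.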

An extension of TextRank is PositionRank. This algorithm uses the same graph as above, but the word weights are biased according to their positions in the document. The idea is to assign larger weights to words that occur early, since the first words of a document tend to be more informative than later ones. The bias of each candidate word is its inverse position in the document; if the same word appears multiple times, its position weights are summed.

Moreover, candidate words that occur together, specifically nouns and adjectives, are collapsed into a phrase. The scores of the individual words are summed, and the top-scoring keyphrases are returned as the final results.
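The position bias described above can be computed as follows (a small sketch with our own names; the normalized biases would then feed a biased PageRank over the same co-occurrence graph):

```python
from collections import defaultdict

def position_bias(words):
    """PositionRank's node bias: each occurrence of a word at
    (1-indexed) position p contributes 1/p; occurrences are summed,
    then biases are normalized to sum to 1."""
    raw = defaultdict(float)
    for pos, w in enumerate(words, start=1):
        raw[w] += 1.0 / pos
    total = sum(raw.values())
    return {w: b / total for w, b in raw.items()}
```

A word appearing at positions 1 and 3 gets bias 1 + 1/3 before normalization, so early and repeated words dominate the final ranking.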

In our case, the intelligibility score of this method was the same as TextRank's, but its extractive ROUGE score was much lower.

Finally, another graph-based technique is TopicRank. In this method, within each of the initial clusters, we apply another clustering algorithm to identify topics, i.e. clusters of similar keyphrases. Each topic is then a node, and the weight between two topics is based on the strength of their semantic relation. Keyphrases are generated by selecting a candidate from each of the top-ranked topics. This technique achieved the highest intelligibility score of the graph-based methods: we were able to identify 7 topics out of 12, and the extractive ROUGE score is 87%. However, it is really slow to compute.

Table 5 — Example of keywords returned using the graph-based approaches.
Table 6 — Accuracy and intelligibility results across all techniques.

Conclusion

Figure 1 — Intelligibility score vs accuracy score for each technique

We could not find a technique strictly better than all the others: a trade-off must be found between accuracy and intelligibility. For instance, TextRank is highly accurate but poorly intelligible, while YAKE reached high intelligibility but poor accuracy.

We have also found that statistical techniques usually give results comparable to graph-based techniques. This was expected, since both families rely on similar metrics built on word frequency measures. Embedding-based techniques, on the other hand, give quite different results, and even though they performed poorly on our dataset, they can be used to find complementary results. The quality of the embeddings used with KeyBERT seems to play a key role in performance. Cluster labeling may require more specific embeddings than those of a large pre-trained model such as BERT; indeed, BERT is not specifically trained to capture the nuances of reviews written in French on fairly similar topics.

From the computational performance perspective, all the techniques presented in the table perform about the same, except for PMI and TopicRank, which take somewhat longer. For a real-world case, the optimal pipeline would combine two techniques: either YAKE or TextRank, plus a KeyBERT-based technique, since their results are complementary.

Finally, we have noted that for each labeling technique, some of the extracted keywords may be repeated among the clusters. While this does not interfere with the understanding of the clusters or the comparison of labeling methods, applying a deduplication algorithm on the extracted keywords could offer additional insight by ensuring that each cluster has a distinct set of keywords.

Another method for labeling text clusters, which we have not explored in this article, is to use generative models. The idea would be to analyze all the texts in a cluster and generate a label synthesizing the main topics. Such models could also go beyond labeling and produce a short explanation of the label, providing further context and insight into the data.


Heka.ai

We design, deploy and manage AI-powered applications