How to leverage AI to avoid endless annotation campaigns?

Heka.ai
Oct 6, 2023 · 8 min read

Annotation challenges

Companies now have access to ever-growing volumes of textual data. From simple user feedback or surveys, they can extract actionable information and take advantage of what they have collected to improve the user experience. For example, identifying the questions most frequently asked to customer service can help define a chatbot that assists customers with most of their needs.

However, inference methods often require many annotated samples: for any given task, the algorithms usually have to be trained on examples annotated with the labels we are trying to predict. Depending on the type of data, annotation can be costly and time-consuming. Judicial data, for example, must be labelled by notaries, whose work is both expensive and slow. Companies therefore seek to minimize the amount of labelling carried out while optimizing the results of their algorithms. This trade-off between quality, time and cost is the essence of this article.

A look at SOTA Methods

Data processing models have improved considerably over the years, reaching exceptional performance on several natural language processing tasks. Classical methods have given way to deep learning models, particularly Transformers, which remain to this day the most advanced architectures for text processing.

Our work draws inspiration from several papers, notably Scaling Laws for Neural Language Models, which empirically shows that model performance follows a power law with respect to dataset size, and Beyond Neural Scaling Laws: Beating Power Law Scaling via Data Pruning, which shows that this empirical law can be beaten by smartly selecting examples from the dataset. This article can be considered an extension of the latter work to textual data.

Our approach

Our study focused on the Newsgroup text classification dataset, a classic benchmark composed of posts from the Newsgroup forum spread across 20 classes. The classes cover diverse topics, ranging from religion to electronics. It is not a simple dataset, mainly because of the large number of classes and the contextual similarity of some topics. For instance, religion.misc, religion.christianity, and atheism share similar vocabulary and are therefore hard to classify properly. These characteristics make it an excellent dataset for our annotation use case.

The first step is to process the unannotated data. We decided to group similar examples together to avoid annotating near-identical texts. The idea is to combine machine learning methods to extract representative samples from each cluster. This yields distinct examples to annotate, which minimizes redundancy in the datasets to be labelled.

Afterwards, once a few samples are labelled, one can wonder how to smartly add new ones while minimizing redundancy in the training data and improving its overall quality. This approach is known as “active learning”.

Active Learning works in the following way:

  • Collect data and annotate an initial set of samples.
  • Train a model on the selected annotated samples, then use it to single out the examples it finds hardest, i.e. the more complex data our final model will have to handle.
  • Once this data is singled out, annotate it and add it to the training data of the final model, increasing sample diversity and model performance.

Hands-on: let’s implement these methods!

A reminder on word representation: NLP experts, you can skip to the next part!

Let us start this part with a quick reminder on how words come to be represented for algorithms to use. Machine learning models do not directly exploit strings of text; what they exploit instead is a vector representation of those strings. Several methods have been developed to turn a list of texts into vectors, and we will look at the two most commonly used ones nowadays:

1 — TF-IDF vectorization is performed as follows:

  • Create a vocabulary dictionary containing the words in the corpus. The number of samples in which each word appears, referred to as its document frequency, is also calculated.
  • For each sample of the corpus, count how often each term occurs in that sample alone, referred to as its term frequency.
  • The vector coefficients for each sample are proportional to the term frequencies weighted by the inverse of the document frequencies.

This technique makes it possible to downweight words appearing in too many different samples. However, there is one main problem that cannot be avoided: the lack of consideration of word order and of the context in which each word is used. For example, ‘I shoot using your bow’ and ‘I bow to you’ will be close in a TF-IDF vectorization because they share surface words, whereas a context-sensitive approach would distinguish them much more effectively.
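
As a toy illustration (a quick sketch, not part of the article's pipeline), the entire TF-IDF similarity between these two sentences comes from the shared token ‘bow’, regardless of its meaning:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["I shoot using your bow", "I bow to you"]
tfidf = TfidfVectorizer().fit_transform(sentences)

# Non-zero similarity driven purely by the surface form "bow",
# even though the two sentences use the word in very different senses.
print(cosine_similarity(tfidf[0], tfidf[1]))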

TF-IDF vectorization is easily accessible through its scikit-learn implementation:

from typing import Tuple

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


def get_features(df: pd.DataFrame) -> Tuple[np.ndarray, np.ndarray]:
    """
    Performs TF-IDF vectorization on the whole dataframe,
    then vectorizes the annotated training examples.

    Args:
        df (pd.DataFrame): annotated dataframe used to train the XGBoost on

    Returns:
        full_features (array): all the texts from the dataframe, vectorized
        train_features (array): all the annotated texts from the dataframe, vectorized
    """
    # Ignore terms that appear in fewer than 20 documents
    vectorizer = TfidfVectorizer(min_df=20)
    full_texts = df.text.values
    full_features = vectorizer.fit_transform(full_texts)
    # Reuse the fitted vocabulary to vectorize the annotated subset only
    train_examples = df[df["annotated"]].text.values
    train_features = vectorizer.transform(train_examples)
    return full_features, train_features
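
For context, here is one way the pieces could fit together on the Newsgroup data (a hedged sketch: it uses scikit-learn’s fetch_20newsgroups loader and flags a random initial subset as annotated, whereas the article selects the initial samples through the clustering described below):

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Illustrative setup: one possible way to build the dataframe expected by get_features
newsgroups = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
df = pd.DataFrame({"text": newsgroups.data, "label": newsgroups.target})

# For illustration, flag a random subset as annotated; in the article's approach,
# this initial set comes from the clustering step described below.
rng = np.random.default_rng(0)
df["annotated"] = False
df.loc[rng.choice(len(df), size=500, replace=False), "annotated"] = True

full_features, train_features = get_features(df)
print(full_features.shape, train_features.shape)  # (n_samples, vocab_size) and (500, vocab_size)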

2 — Sentence-BERT embeddings:

The state-of-the-art method is to use a transformer model. The operation and training of such a model are described in Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. This type of vectorization makes it possible to take into account the context, the interactions between words, and the use of synonyms.

Nevertheless, the main shortcomings of such an approach are the lack of control and transparency over the results produced by the model, as well as a lengthy and cumbersome pre-training. However, this method is easily accessible through the sentence-transformers library, whose pretrained models spare us the need for pre-training.

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer


def get_features(df: pd.DataFrame) -> np.ndarray:
    """
    Uses a SentenceTransformer model to get the features from our list of texts

    Args:
        df (pd.DataFrame): the dataframe that will be used and preprocessed. Here, we use its "text" column.

    Returns:
        features: an array with all texts encoded; with the MiniLM model, each text is encoded as a 384-dimensional vector
    """
    texts = df["text"].tolist()
    # Pretrained Sentence-BERT model from the sentence-transformers library
    st_model = SentenceTransformer('all-MiniLM-L6-v2')
    features = st_model.encode(texts)
    return features

Capturing the singularity of your data

Once we have converted texts into float vectors, we can move on to clustering the unannotated data.

This step must create clusters large enough to limit the number of annotations required, yet numerous enough to give a fine-grained representation of the data. The best approach is to embed texts into dense vectors with Sentence-BERT (384-dimensional with the MiniLM model above), then use a dimensionality reduction tool like UMAP and pair it with a clustering tool like HDBSCAN, which keeps cluster assignment probabilities.

import hdbscan
import numpy as np
import umap


def generate_clusters(features: np.ndarray,
                      n_neighbors: int,
                      n_components: int,
                      min_cluster_size: int,
                      random_state: int = None) -> hdbscan.HDBSCAN:
    """
    Generates an HDBSCAN cluster object after reducing embedding dimensionality with UMAP

    Args:
        features (array): embeddings of the texts, generated by "get_features"
        n_neighbors (int): UMAP number-of-neighbors parameter, controlling the preservation of global distances
        n_components (int): UMAP number-of-components parameter, i.e. the reduced dimensionality
        min_cluster_size (int): minimum number of elements in a cluster, passed to HDBSCAN
        random_state (int): integer that sets up the seed

    Returns:
        clusters (hdbscan.HDBSCAN): fitted HDBSCAN cluster object
    """
    # Reduce the embeddings to a low-dimensional space where density-based clustering behaves better
    umap_embeddings = umap.UMAP(n_neighbors=n_neighbors,
                                n_components=n_components,
                                metric='cosine',
                                random_state=random_state).fit_transform(features)

    # Cluster the reduced embeddings; samples that fit in no cluster are labelled -1
    clusters = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                               metric='euclidean',
                               cluster_selection_method='eom').fit(umap_embeddings)

    return clusters

The main reason we use UMAP rather than clustering the BERT embeddings directly is that it lets us modulate cluster size by playing with the hyperparameters of both UMAP and HDBSCAN: the dimensionality reduction performed by UMAP changes the HDBSCAN result, and we can then fine-tune HDBSCAN's hyperparameters to get satisfactory results.

The objective behind this clustering is to get diverse clusters composed of very similar samples. We do not want each cluster to contain too many samples, but we do not want clusters with too few samples either.

Empirically, the cluster size that gives the most relevant clusters is between 20 and 50 samples. This is reached by tuning the hyperparameters of the UMAP and HDBSCAN methods so that we finally obtain between 75 and 100 clusters.

This number of clusters allows us to cluster roughly half of the original dataset's examples, the remaining ones being left “unclustered”. These unclustered examples are basically sentences that are too different from the rest under our given settings. As we want representative sentences that effectively describe each cluster as stereotypical starting examples, these outliers will not be the first ones to be considered.
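
As a quick diagnostic, one can inspect the fitted HDBSCAN object to check that the hyperparameters land in this range (a sketch with illustrative hyperparameter values, not necessarily the ones used in our experiments; unclustered samples are the ones HDBSCAN labels -1):

import numpy as np

clusters = generate_clusters(features, n_neighbors=15, n_components=5,
                             min_cluster_size=20, random_state=42)

labels = clusters.labels_
n_clusters = labels.max() + 1
cluster_sizes = np.bincount(labels[labels >= 0])

print(f"{n_clusters} clusters, median size {np.median(cluster_sizes):.0f}")
print(f"{np.mean(labels == -1):.0%} of samples left unclustered")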

Once the clustering process is done, we randomly extract from each cluster a portion α of samples, defined by the user. This portion gives us representative data points on which our model will be trained, as sketched below. At this point, our unclustered examples might be useful, especially if we consider that we have extracted enough representative samples from our clusters but want a few more sentences for the training process.
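
A minimal way to perform this extraction could look like the following sketch (an illustration, assuming the dataframe carries a "cluster" column filled with clusters.labels_ and that the portion α is passed as alpha):

import pandas as pd

def select_representatives(df: pd.DataFrame, alpha: float = 0.2, random_state: int = 42) -> pd.Index:
    """Randomly picks a portion alpha of samples in each cluster (label -1, i.e. unclustered, is skipped)."""
    clustered = df[df["cluster"] != -1]
    sampled = (clustered.groupby("cluster", group_keys=False)
                        .apply(lambda g: g.sample(frac=alpha, random_state=random_state)))
    return sampled.index

# These indices are the ones sent for annotation before training the first model
df["annotated"] = False
df.loc[select_representatives(df, alpha=0.2), "annotated"] = True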

Leveraging model uncertainty to improve performance

Once this model is trained, we can focus on improving its performance with more annotations selected through active learning.

So, after we have selected and annotated the samples, we vectorize them and train a classifier that computes class probabilities for the non-annotated examples. Once this is done, we take the most uncertain examples and label them. We iterate this process several times, annotating examples at each step and retraining a model on the new dataset to output new scores.

import numpy as np
import pandas as pd


def compute_uncertainty(df: pd.DataFrame, clf_scores: np.ndarray) -> np.ndarray:
    """
    Computes uncertainty scores for all texts given the classifier probability scores
    Uncertainty is set to 0 for annotated examples

    Args:
        df (pd.DataFrame): dataframe we want to get uncertainty scores for
        clf_scores (array): probabilities predicted by the classifier for each class, for each sample

    Returns:
        uncertainty_scores (array): uncertainty scores of the classifier for each sample
    """
    # Least-confidence uncertainty: 1 minus the probability of the most likely class
    uncertainty_scores = 1 - np.max(clf_scores, axis=1)
    # Already-annotated samples should never be selected again
    indexes = df[df["annotated"]].index
    uncertainty_scores[indexes] = 0
    return uncertainty_scores
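
Putting the pieces together, one iteration of this loop could look like the following sketch (it reuses the TF-IDF get_features and compute_uncertainty functions above together with an XGBoost classifier, as in the results below, and assumes the dataframe has a default integer index with "text", "annotated" and "label" columns):

import numpy as np
import pandas as pd
import xgboost as xgb

def active_learning_iteration(df: pd.DataFrame, n_to_annotate: int = 50) -> np.ndarray:
    """Trains on the annotated samples and returns the indices of the samples to annotate next (sketch)."""
    # Vectorize the whole corpus and the annotated subset
    full_features, train_features = get_features(df)
    train_labels = df.loc[df["annotated"], "label"].values

    # Train a classifier on the annotated samples
    clf = xgb.XGBClassifier(objective="multi:softprob")
    clf.fit(train_features, train_labels)

    # Score every sample and keep the most uncertain, not-yet-annotated ones
    clf_scores = clf.predict_proba(full_features)
    uncertainty_scores = compute_uncertainty(df, clf_scores)
    return np.argsort(uncertainty_scores)[::-1][:n_to_annotate]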

This iterative process enables us to obtain a better, more diverse dataset than randomly selecting texts. Even more complex active learning methods, which also take the sample distribution into account, can be implemented. However, during our experiments, these methods did not yield a substantial increase in accuracy.

In the figure below, active learning performance is compared to other techniques on the post classification task of the Newsgroup dataset. It shows that our active learning sample selection process brought a consistent accuracy improvement of about 3% over random selection, across several different train-test splits. This confirms that active learning-based sample selection gives the user more relevant samples for the classification task.

Accuracy results training an XGBoost model for the “Newsgroup” dataset with different query methods

Conclusion

The methods presented in this article allow us to select the samples that are most relevant to label in order to optimize model performance.

Whether starting without any annotated samples or building on already-annotated ones to determine which samples to label next, we have produced a smart selection process that increases dataset diversity and limits annotation time.

In fact, recent advances in Generative AI have shown that it is possible to improve performance by generating self-annotated samples. Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias outlines how Gen AI can be leveraged to generate relevant samples along with their annotations. It even shows that a few generated samples are better for training models from the start than randomly selecting the same number of samples from your dataset! This is another promising avenue for exploiting AI to obtain relevant data to train machine learning models.

At the NLP Lab of Sia Partners, we work daily to take advantage of the latest improvements in Generative AI. As we look forward to sharing our work on that topic, we can only recommend following our Medium account!
