Clustering Unstructured Text with LLM Embeddings and HDBSCAN

Clustering Unstructured Text with LLM Embeddings and HDBSCAN


In this article, you will learn how to build a text clustering pipeline by combining large language model embeddings with HDBSCAN, a density-based clustering algorithm, to automatically discover topics in unlabeled text data.

Topics we will cover include:

  • How to generate text embeddings for raw documents using a pre-trained sentence-transformers model.
  • How to reduce the dimensionality of those embeddings with UMAP to prepare them for clustering.
  • How to apply HDBSCAN to automatically discover topic clusters and visualize the results.
Clustering Unstructured Text with LLM Embeddings and HDBSCAN

Clustering Unstructured Text with LLM Embeddings and HDBSCAN

Introduction

The current era of Generative AI seems to primarily focus on chat interfaces and prompts, but the range of applications of large language models, or LLMs for short, is not limited to just that. Indeed, one of their most powerful downstream abilities consists of turning raw, messy, unstructured text into semantically rich mathematical representations called embeddings. Once that’s done, we can use these text representations for a variety of machine learning use cases, with clustering being no exception.

In particular, embeddings can be combined with advanced, density-based clustering techniques like HDBSCAN, allowing as a result for the discovery of hidden topics, patterns, or categories in your collection of text documents: all without the need for prior labeling.

This article shows how to construct a text-based clustering pipeline from scratch. We will use a freely available dataset containing text instances, as well as an open-source LLM that has been trained for generating embeddings — i.e. a so-called embedding model. The icing on the cake: we’ll use free and handy, modern Python libraries providing implementations of clustering algorithms like HDBSCAN.

Step-by-Step Walkthrough

First, let’s start by installing the key Python libraries we will need:

  • Sentence transformers, to load a pre-trained LLM for embedding generation from Hugging Face — you’ll need a Hugging Face API key, also called an access token, to be able to load the model.
  • Umap-learn, to apply an algorithm to reduce the dimensionality of embeddings.

Likewise, if you are working on a local IDE instead of a cloud notebook environment and don’t have scikit-learn and pandas, you may need to install them too.

Now we start the coding part by getting some fresh data. The fetch_20newsgroups function, which fetches a dataset containing texts from categorized news articles, will do. Note that even though the dataset contains labels, we will omit them, as we are pretending not to know this information for the sake of clustering these data instances into groups based on similarity. Also, we sample down the dataset to 150 instances, which will be representative enough for our example.

Output:

The next step is to obtain the embeddings from raw texts. To do this, we load all-MiniLM-L6-v2 from Hugging Face’s sentence-transformers library. This is a lightweight yet effective model to obtain embeddings quickly.

Since the embedding dimension is originally too high for clustering purposes, we now apply a dimensionality reduction technique by using the UMAP algorithm from the namesake library installed earlier:

Now our numerical embedding vectors associated with news articles consist of five dimensions (attributes) only. Let’s see if this compact representation is meaningful enough to obtain insightful clustering by applying the HDBSCAN algorithm, which is a density-based clustering approach:

Important: the clustering results are partly influenced by the hyperparameter settings we defined for HDBSCAN. I recommend you try out other configurations for the minimum cluster size and other hyperparameters to explore how this affects results.

Result:

It looks like HDBSCAN detected two clusters associated with high-density regions in the data space. Would there also be noisy points that were not allocated to either of these two clusters? Let’s check:

Output:

Seems like all data points in the sample of 150 were allocated to either one of the two clusters identified, thus hinting at the clue that the news articles might easily separable according to topic.

For extra insight, we can show some cluster visualizations with the aid of the supplementary code provided below, which shows a scatterplot for every pairwise combination of the five existing components that describe each data point:

Result:

Clustering visualizations

By trying different configurations for HDBSCAN, you may come across results in which the number of identified clusters could be different from two. Just give it a try!

Wrapping Up

Once we have gone through the process of building the text-based clustering pipeline, it is worth concluding by pointing out the key reasons why putting together LLM embeddings with HDBSCAN is worth it. These include the ability to retain and capture, to some extent, the true semantic meaning and linguistic nuances of the original text, thanks to the properties inherent to embeddings obtained through sentence-transformers. Moreover, HDBSCAN automatically determines an optimal number of clusters and is able to detect outlying points that might be noise or outliers that would distort group-level statistics.



Source link