Ch 7: Topic Modeling, Clustering & Theme Discovery at Scale

Foundations

What topic modeling is, how embeddings enable semantic clustering, and the end-to-end pipeline from raw text to grouped vectors.

Topic Modeling vs Classification

Classification starts with predefined labels and assigns data to known buckets. Topic modeling is exploratory — it discovers latent themes from the data itself before any taxonomy exists.

💡 Classification is sorting mail into labeled bins; topic modeling is dumping all mail on a table and discovering what the recurring themes actually are.

When to Use Which

Classification is the right tool when you have a stable, well-defined label set and enough labeled training data. Topic modeling is the right tool when you do not know what the categories should be, or when you suspect the existing taxonomy is missing important themes.

In practice, teams often use topic discovery before building a formal taxonomy. It helps reveal recurring issues, hidden subpopulations, and language patterns used by real users. Once themes stabilize, they can be codified into a classification system. See Topic 3: The Discovery Pipeline for how this handoff works.

Key Differences

Dimension	Classification	Topic Modeling
Labels	Predefined by humans	Discovered from data
Supervision	Supervised (needs labeled data)	Unsupervised or semi-supervised
Goal	Decision assignment	Pattern discovery
Output	Hard label per item	Cluster membership or topic distribution
Evaluation	Accuracy, F1, precision/recall	Coherence, human judgment, utility

Interview Framing

In interviews, say that topic modeling is about pattern discovery, while classification is about decision assignment. Strong candidates explain that the two are complementary: topic discovery informs taxonomy design, and classification operationalizes it. See Topic 8: Evaluating Topic Quality for how to measure whether discovered topics are actually useful.

→ Topic modeling discovers structure; classification applies it. The best pipelines use discovery to inform the taxonomy that classification then enforces.

Python Example

# Contrast: classification assigns to known labels,
# topic modeling discovers groups from embeddings

from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
import numpy as np

# --- Classification: labels are predefined ---
X_train = np.random.randn(100, 384)   # embedding vectors
y_train = np.random.randint(0, 5, 100)  # known labels: 0-4
clf = LogisticRegression().fit(X_train, y_train)
predictions = clf.predict(X_train)  # assigns to existing buckets

# --- Topic modeling: labels emerge from data ---
X_unlabeled = np.random.randn(500, 384)  # no labels at all
km = KMeans(n_clusters=8, random_state=42)
clusters = km.fit_predict(X_unlabeled)   # discovers groupings
# clusters are numbers (0-7), NOT meaningful names
# human review or LLM labeling is needed next
print(f"Found {len(set(clusters))} clusters to review")

Follow-up Questions

Can you do topic modeling on already-classified data?

Yes, and it is often valuable. Running topic discovery on data that was already classified can reveal sub-themes within categories or show that existing labels are too coarse. For example, a "billing" category might contain distinct sub-clusters for refund requests, pricing confusion, and payment failures.

How does topic modeling relate to traditional methods like LDA?

Latent Dirichlet Allocation (LDA) was the dominant approach for years. It models documents as mixtures of topics, where each topic is a distribution over words. Embedding-based approaches have displaced LDA in many production systems because they capture semantic similarity rather than just word co-occurrence, and their clusters pair naturally with LLM-based labeling.

When should you skip topic modeling and go straight to classification?

Skip discovery when you have a stable, well-validated taxonomy and enough labeled examples. If the label set has been validated by domain experts and has not changed in months, building a classifier directly is more efficient. Topic modeling adds value when the domain is new, the taxonomy is stale, or you suspect hidden patterns.

Embedding-Based Clustering

Embedding-based methods represent each text unit in a semantic vector space, so clustering can group items that are conceptually related even when they share no exact keywords. This is why modern topic discovery relies on embeddings rather than word counts.

💡 Embeddings are like GPS coordinates for meaning — texts about the same topic end up near each other on the semantic map, regardless of the words they use.

Why Keywords Fall Short

Traditional bag-of-words or TF-IDF approaches group documents that share surface-level tokens. But real users describe the same issue in many different ways: "my order never arrived," "delivery missing," and "package lost" all mean the same thing but share almost no keywords. Embedding models capture this semantic equivalence because they are trained on massive corpora of contextual language.
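
To make the keyword gap concrete, the sketch below measures token overlap (Jaccard similarity) between the three phrasings above. The `tokens` and `jaccard` helpers and the naive tokenizer are illustrative assumptions, not part of any library.

```python
import re

def tokens(text):
    """Lowercase word tokens from a deliberately naive tokenizer."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Share of tokens two texts have in common (0 = none, 1 = same token set)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

phrasings = ["my order never arrived", "delivery missing", "package lost"]

# Compare every pair of phrasings
overlaps = {
    (a, b): jaccard(a, b)
    for i, a in enumerate(phrasings)
    for b in phrasings[i + 1:]
}
for (a, b), score in overlaps.items():
    print(f"{score:.2f}  {a!r} vs {b!r}")
```

Every pair scores zero, so any method that relies on shared tokens has nothing to group these tickets by.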

The Embedding Advantage

Embeddings give you better semantic grouping, and LLMs can then summarize or name the discovered clusters. That combination is often more useful to product teams than traditional topic-word lists alone. See Topic 9: LLMs in Topic Workflows for how language models enhance the labeling step.

Semantic recall: Groups items by meaning, not surface form
Multilingual capability: Multilingual embedding models cluster across languages
Composability: Embeddings integrate with vector databases, rerankers, and downstream classifiers

Practical Considerations

The quality of clustering is bounded by the quality of the embedding model. A general-purpose model may group medical texts differently than a domain-tuned one. Choose your embedding model with the same care you would give a classifier: test on representative samples and verify the results with the checks in Topic 8: Evaluating Topic Quality.

→ Embeddings capture semantic similarity where keyword methods fail — making them the foundation for modern topic discovery.

Python Example

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# 1. Load an embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# 2. Encode a batch of support tickets
tickets = [
    "My package never arrived",
    "Delivery is missing from my doorstep",
    "Order lost in transit",
    "App keeps crashing on Android",
    "The mobile app freezes constantly",
    "Cannot open the app after update",
]
embeddings = model.encode(tickets)

# 3. Cluster: semantically similar tickets group together
km = KMeans(n_clusters=2, random_state=42)
labels = km.fit_predict(embeddings)

for ticket, label in zip(tickets, labels):
    print(f"  Cluster {label}: {ticket}")
# Tickets about delivery group together,
# tickets about app crashes group together,
# even though they share no keywords

Follow-up Questions

Which embedding models work best for topic discovery?

General-purpose models like all-MiniLM-L6-v2 or BGE-large work well for broad corpora. For domain-specific data (medical, legal, financial), fine-tuned or domain-adapted models often produce tighter, more meaningful clusters. Always validate embedding quality by inspecting nearest-neighbor lists on representative samples.
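
A nearest-neighbor inspection can be sketched in a few lines of NumPy. The 2-D vectors below are hand-made stand-ins for real embeddings, and `nearest_neighbors` is a hypothetical helper name.

```python
import numpy as np

def nearest_neighbors(embeddings, query_idx, k=3):
    """Indices of the k rows most cosine-similar to row query_idx."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit[query_idx]
    sims[query_idx] = -np.inf  # never return the query itself
    return np.argsort(sims)[::-1][:k]

texts = [
    "refund not received",
    "where is my refund",
    "app crashes on launch",
    "app freezes",
]
# Toy vectors standing in for model output (assumption for illustration)
toy = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])

for i, text in enumerate(texts):
    neighbor = nearest_neighbors(toy, i, k=1)[0]
    print(f"{text!r} -> {texts[neighbor]!r}")
```

With real embeddings, read the neighbor lists for a few dozen sampled items. If the neighbors do not look related to a domain expert, clustering will not fix that.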

Does embedding dimensionality affect clustering quality?

Higher-dimensional embeddings (768, 1024) capture more nuance but can make clustering harder due to the curse of dimensionality. This is why dimensionality reduction (see Topic 4: Dimensionality Reduction) is commonly applied before clustering. The sweet spot depends on the diversity of your corpus.

Can you mix embeddings from different models?

No. Embeddings from different models live in different vector spaces and are not directly comparable. Mixing them will produce meaningless clusters. If you need to combine data embedded at different times, use the same model version and keep track of model upgrades that require re-embedding.
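
One way to enforce this is to store the model identity next to every batch of vectors and refuse to merge mismatched batches. The sketch below is one possible design. `EmbeddedBatch` and `merge_batches` are hypothetical names, and the second model name is a placeholder.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddedBatch:
    model_name: str   # model identifier, including version
    dim: int          # embedding dimensionality
    vectors: tuple    # the embedded items

def merge_batches(batches):
    """Concatenate batches only if they came from the same model."""
    models = {(b.model_name, b.dim) for b in batches}
    if len(models) != 1:
        raise ValueError(f"Cannot mix embedding spaces: {sorted(models)}")
    merged = []
    for batch in batches:
        merged.extend(batch.vectors)
    return merged

january = EmbeddedBatch("all-MiniLM-L6-v2", 384, ((0.1,) * 384,))
february = EmbeddedBatch("all-MiniLM-L6-v2", 384, ((0.2,) * 384,))
other = EmbeddedBatch("some-other-model", 768, ((0.3,) * 768,))  # placeholder name

combined = merge_batches([january, february])
print(f"Merged {len(combined)} vectors from one model")
```

When the guard raises, the remedy is to re-embed the older data with the current model rather than to work around the check.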

The Discovery Pipeline

A practical topic discovery pipeline follows a fixed sequence: clean the text, choose the unit of analysis, embed the data, optionally reduce dimensionality, cluster the vectors, extract representatives, and finally label the clusters with human or LLM review.

💡 Think of the pipeline as a funnel: raw messy text at the top, validated themes at the bottom. Each stage filters noise and adds structure.

Stage-by-Stage Breakdown

Clean the text: Remove boilerplate, signatures, duplicates, and irrelevant metadata. Garbage in, garbage out applies strongly to clustering.
Choose the unit: Decide whether to embed entire documents, paragraphs, sentences, or ticket-level units. The unit shapes what clusters can represent.
Embed: Convert text units into dense vectors using an embedding model appropriate for the domain.
Reduce dimensions (optional): Apply UMAP or PCA to make clustering more effective in lower dimensions. See Topic 4: Dimensionality Reduction.
Cluster: Apply a clustering algorithm suited to the data shape and scale. See Topic 5: Choosing a Clustering Algorithm.
Extract representatives: Pull the items closest to each cluster centroid as evidence for labeling.
Label and validate: Use LLMs, human reviewers, or both to name each cluster. See Topic 6: Naming Clusters.

Scalability Considerations

In interviews, note that scalability depends on batching, approximate indexing, incremental updates, and sampling strategies for review. A topic modeling system must be designed like a data product, not just a notebook experiment. The difference between a demo and a production system is often the last three stages.

Anti-patterns

Skipping text cleaning is the most common pipeline failure. Boilerplate signatures, auto-replies, and templated text create spurious clusters that waste review time. Another anti-pattern is choosing the wrong unit of analysis — embedding entire multi-topic documents when paragraph-level units would produce cleaner clusters.

→ The pipeline matters more than the algorithm. Raw clusters do not explain themselves — the last stages (labeling, validation) are where value is created.

Python Example

# End-to-end topic discovery pipeline
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import umap
import numpy as np

def topic_discovery_pipeline(texts, n_topics=8):
    # Step 1: Clean (simplified)
    cleaned = [t.strip() for t in texts if len(t.strip()) > 20]

    # Step 2: Embed
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(cleaned, show_progress_bar=True)

    # Step 3: Reduce dimensionality
    reducer = umap.UMAP(n_components=10, random_state=42)
    reduced = reducer.fit_transform(embeddings)

    # Step 4: Cluster
    km = KMeans(n_clusters=n_topics, random_state=42)
    labels = km.fit_predict(reduced)

    # Step 5: Extract representatives (nearest to centroid)
    representatives = {}
    for c in range(n_topics):
        mask = labels == c
        dists = np.linalg.norm(reduced[mask] - km.cluster_centers_[c], axis=1)
        idxs = np.where(mask)[0][np.argsort(dists)[:3]]
        representatives[c] = [cleaned[i] for i in idxs]

    # labels index into `cleaned` (short texts were dropped), so return it too
    return cleaned, labels, representatives

# Step 6: Send representatives to LLM for naming

Follow-up Questions

How do you choose the right unit of analysis?

The unit should match the granularity of the themes you want to find. Sentence-level clustering captures fine-grained topics but can be noisy. Document-level clustering captures broader themes but misses multi-topic documents. Paragraph-level is often a practical middle ground for support tickets and reviews.
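
The effect of the unit choice is easy to see on a single multi-topic ticket. This is a minimal sketch with regex-based splitting. `split_units` is a hypothetical helper, and production systems would likely use a proper sentence segmenter.

```python
import re

def split_units(document, unit="paragraph"):
    """Split one document into the chosen unit of analysis."""
    if unit == "document":
        parts = [document]
    elif unit == "paragraph":
        parts = re.split(r"\n\s*\n", document)      # blank-line boundaries
    elif unit == "sentence":
        parts = re.split(r"(?<=[.!?])\s+", document)  # naive sentence boundaries
    else:
        raise ValueError(f"Unknown unit: {unit}")
    return [p.strip() for p in parts if p.strip()]

ticket = (
    "The app crashes when I open it. I reinstalled twice.\n\n"
    "Also, I was charged twice this month."
)

for unit in ("document", "paragraph", "sentence"):
    units = split_units(ticket, unit)
    print(f"{unit}: {len(units)} unit(s)")
```

At document level the crash report and the billing complaint share one vector. At paragraph level they separate cleanly, which is why paragraphs are often the practical choice for tickets like this one.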

How do you handle duplicate or near-duplicate texts?

Deduplicate aggressively before clustering. Near-duplicates (auto-replies, templates, copy-pasted complaints) will dominate cluster centroids and obscure real themes. Use MinHash or embedding cosine similarity to flag near-duplicates, then keep only one representative from each group.
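
The embedding-similarity variant can be sketched as a greedy filter. The 0.95 threshold and the `drop_near_duplicates` name are assumptions. The loop is quadratic in the worst case, so large corpora would need MinHash or an approximate index instead.

```python
import numpy as np

def drop_near_duplicates(embeddings, threshold=0.95):
    """Keep the first item of each near-duplicate group (greedy, cosine-based)."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i, vec in enumerate(unit):
        # Keep the item only if it is not too similar to anything already kept
        if not kept or np.max(unit[kept] @ vec) < threshold:
            kept.append(i)
    return kept

# Toy vectors: rows 0, 1, and 3 point in almost the same direction
toy = np.array([
    [1.0, 0.0],
    [0.999, 0.01],
    [0.0, 1.0],
    [1.0, 0.0],
])
kept = drop_near_duplicates(toy)
print(f"Kept rows: {kept}")
```

Tune the threshold by reading the pairs it removes. A value that is too low will collapse distinct complaints that merely use similar wording.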

What batch sizes work for embedding at scale?

Most embedding models process batches of 32-256 texts efficiently on a GPU. For millions of documents, use batched inference with progress tracking and checkpointing. Approximate nearest-neighbor libraries (FAISS, ScaNN) then speed up the neighbor searches that clustering and deduplication depend on, and FAISS also includes a k-means implementation that scales to large vector sets.
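
Batching with checkpoints can be sketched as follows. `embed_in_batches` and `fake_embed` are hypothetical. In practice `embed_fn` would wrap a real model call, and the checkpoint directory must be dedicated to a single corpus and model.

```python
import tempfile
from pathlib import Path

import numpy as np

def embed_in_batches(texts, embed_fn, out_dir, batch_size=64):
    """Embed texts batch by batch, saving each batch so a crash can resume."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for start in range(0, len(texts), batch_size):
        path = out_dir / f"batch_{start:09d}.npy"
        if path.exists():  # already done in an earlier run
            continue
        vectors = np.asarray(embed_fn(texts[start:start + batch_size]))
        np.save(path, vectors)
    # Zero-padded names sort in batch order
    files = sorted(out_dir.glob("batch_*.npy"))
    return np.vstack([np.load(f) for f in files])

def fake_embed(batch):
    """Stand-in for a real model: [length, vowel count] per text."""
    return [[len(t), sum(ch in "aeiou" for ch in t)] for t in batch]

texts = [f"ticket number {i}" for i in range(10)]
with tempfile.TemporaryDirectory() as tmp:
    vectors = embed_in_batches(texts, fake_embed, tmp, batch_size=4)
    n_files = len(list(Path(tmp).glob("batch_*.npy")))
print(vectors.shape, n_files)
```

Rerunning the function against the same directory skips finished batches, which matters when a multi-hour embedding job fails partway through.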

Dimensionality Reduction

High-dimensional embedding spaces can be noisy and hard for clustering algorithms to partition cleanly. Dimensionality reduction reveals local structure, denoises the space, and makes clusters easier to separate — but it can distort distances if applied carelessly.

💡 Dimensionality reduction is like looking at a 3D sculpture from the best angle — you lose a dimension but can suddenly see the structure clearly.

High-dimensional space: clusters overlap due to noise in many dimensions.

When to Reduce

Reduction is a tool, not a mandatory step. You use it when it improves cluster structure or interpretability, then verify the result with representative examples rather than trusting a plot alone. Common scenarios where reduction helps:

Visualization: 2D projections with UMAP or t-SNE help teams visually inspect cluster structure
Clustering performance: Some algorithms (K-Means, HDBSCAN) work better in moderate dimensions (10-50)
Denoising: Removing noisy dimensions can sharpen cluster boundaries

UMAP vs PCA vs t-SNE

Method	Preserves	Speed	Best For
PCA	Global variance	Fast	Initial compression, preprocessing
t-SNE	Local neighborhoods	Slow	2D visualization only
UMAP	Local + some global structure	Moderate	Clustering prep + visualization

The Distortion Trade-off

Every reduction method distorts some distances. PCA may compress meaningful differences into discarded components. UMAP may create artificial gaps between points that were close in high dimensions. The safest approach is to cluster in the reduced space but validate using original embeddings — check that nearest neighbors in the original space actually belong to the same cluster.
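
The validation step described above can be expressed as a single number: for each item, the share of its nearest neighbors in the original space that landed in the same cluster. The sketch below is one possible formulation. `neighbor_agreement` is a hypothetical name, and the toy data are synthetic blobs rather than real embeddings.

```python
import numpy as np

def neighbor_agreement(original, labels, k=5):
    """Mean share of each point's k nearest original-space neighbors
    (by cosine similarity) that carry the same cluster label."""
    unit = original / np.linalg.norm(original, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)                 # ignore self-matches
    nearest = np.argsort(sims, axis=1)[:, ::-1][:, :k]
    same_cluster = labels[nearest] == labels[:, None]
    return same_cluster.mean()

rng = np.random.default_rng(0)
blob_a = rng.normal([5.0, 0.0, 0.0], 0.1, size=(20, 3))
blob_b = rng.normal([0.0, 5.0, 0.0], 0.1, size=(20, 3))
original = np.vstack([blob_a, blob_b])

good_labels = np.array([0] * 20 + [1] * 20)   # matches the true blobs
bad_labels = np.arange(40) % 2                # ignores the structure

good = neighbor_agreement(original, good_labels)
bad = neighbor_agreement(original, bad_labels)
print(f"good labels: {good:.2f}, bad labels: {bad:.2f}")
```

A low score after clustering in a reduced space suggests that the reduction pulled apart items the embedding model considered close.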

→ Reduce dimensions when it helps clustering, but always validate results against the original embeddings. Never trust a 2D plot as ground truth.

Python Example

import umap
import numpy as np
from sklearn.decomposition import PCA

# Original embeddings: 1000 items x 384 dimensions
embeddings = np.random.randn(1000, 384)

# Option A: PCA to 50 dims (fast, preserves global variance)
pca_reduced = PCA(n_components=50).fit_transform(embeddings)
print(f"PCA: {embeddings.shape} -> {pca_reduced.shape}")

# Option B: UMAP to 10 dims (preserves local structure)
umap_reduced = umap.UMAP(
    n_components=10,       # target dimensionality
    n_neighbors=15,        # local neighborhood size
    min_dist=0.0,          # 0.0 packs points tightly, which suits clustering
    random_state=42
).fit_transform(embeddings)
print(f"UMAP: {embeddings.shape} -> {umap_reduced.shape}")

# Option C: UMAP to 2 dims for visualization only
vis_2d = umap.UMAP(n_components=2).fit_transform(embeddings)
# WARNING: do not cluster on 2D projections!
# Use higher dims (10-50) for clustering, 2D for plots only

Follow-up Questions

Can you cluster directly in the original high-dimensional space?

Yes, and sometimes it works well, especially with cosine-similarity-based methods. The question is whether reduction improves cluster quality for your specific data. Test both and compare using silhouette scores or manual inspection of representative examples.

What UMAP hyperparameters matter most?

The two most important are n_neighbors (controls local vs. global focus; 15-50 is typical) and min_dist (controls how tightly points cluster; lower values create denser clusters). For clustering, use min_dist close to 0. For visualization, values around 0.1-0.5 produce more readable plots.

Should you reduce before or after clustering?

Reduce before clustering if the algorithm struggles in high dimensions (e.g., K-Means with many noisy features). If using a method that handles high dimensions well (e.g., cosine-based hierarchical clustering), reduction is optional. Either way, validate cluster quality using the original embeddings afterward.

Choosing a Clustering Algorithm

The right clustering algorithm depends on what you expect the data to look like. K-Means assumes spherical clusters. Density-based methods handle irregular shapes and noise. Hierarchical methods give coarse-to-fine exploration. No algorithm can rescue weak embeddings.

💡 Choosing a clustering algorithm is like choosing a map projection — each preserves different properties, and the best choice depends on what you need to see.

Algorithm Comparison

Algorithm	Cluster Shape	Needs k?	Handles Noise?	Scale
K-Means	Spherical	Yes	No	Very large datasets
HDBSCAN	Arbitrary	No	Yes (noise label)	Medium-large
Agglomerative	Any (via linkage)	Optional	No	Small-medium
Gaussian Mixture	Ellipsoidal	Yes	Partial	Medium
Spectral	Non-convex	Yes	No	Small-medium

Judgment Over Brand Loyalty

In interviews, show judgment rather than brand loyalty. The right algorithm depends on data distribution, scale, and the need for interpretability. Considerations:

Do you know k? If not, use HDBSCAN or elbow/silhouette analysis with K-Means
Is noise expected? HDBSCAN explicitly labels outliers; K-Means forces everything into a cluster
Need hierarchy? Agglomerative clustering produces dendrograms for coarse-to-fine exploration
Scale? K-Means and Mini-Batch K-Means scale to millions; HDBSCAN struggles beyond ~500K without tricks

The Embedding Quality Floor

No clustering method can rescue weak embeddings or ill-defined units of analysis. If your embedding model does not separate the concepts you care about, no algorithm will find clean clusters. Always verify embedding quality first (see Topic 2: Embedding-Based Clustering).

→ Choose the algorithm that matches your data shape and scale. When in doubt, try both K-Means and HDBSCAN and compare the results with human review.

Python Example

from sklearn.cluster import KMeans
import hdbscan
from sklearn.metrics import silhouette_score
import numpy as np

embeddings = np.random.randn(500, 50)  # after dim reduction

# --- K-Means: need to choose k ---
# Use silhouette score to find best k
best_k, best_score = 2, -1
for k in range(2, 15):  # silhouette is defined for k >= 2
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(embeddings)
    score = silhouette_score(embeddings, labels)
    if score > best_score:
        best_k, best_score = k, score
print(f"Best k={best_k}, silhouette={best_score:.3f}")

# --- HDBSCAN: no k needed, handles noise ---
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=10,   # min points to form a cluster
    min_samples=5,          # core point density threshold
    metric='euclidean'
)
hdb_labels = clusterer.fit_predict(embeddings)
n_clusters = len(set(hdb_labels)) - (1 if -1 in hdb_labels else 0)
noise_pct = (hdb_labels == -1).mean() * 100
print(f"HDBSCAN: {n_clusters} clusters, {noise_pct:.1f}% noise")

Follow-up Questions

How do you decide k when using K-Means?

Use the elbow method (plot inertia vs. k and look for a bend) or silhouette analysis (maximize average silhouette score). Neither is definitive — always validate the chosen k by inspecting cluster representatives. Domain knowledge often narrows the range before any metric is computed.

When should you use HDBSCAN instead of K-Means?

Use HDBSCAN when clusters have irregular shapes, when you expect noise or outliers, or when you do not want to predefine the number of clusters. HDBSCAN is especially useful for exploratory work where forcing everything into a cluster would mask the true structure.

Can you combine multiple clustering methods?

Yes. Ensemble clustering runs multiple algorithms and merges their results. A simpler approach is to use HDBSCAN for initial discovery, then assign the noise points using a nearest-centroid classifier trained on the HDBSCAN clusters. This gives you the flexibility of density-based discovery with full coverage.
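
The simpler approach can be sketched directly with NumPy, using plain nearest-centroid assignment rather than a trained classifier. `assign_noise_to_nearest` is a hypothetical helper, and the labels below imitate HDBSCAN output where -1 marks noise.

```python
import numpy as np

def assign_noise_to_nearest(embeddings, labels):
    """Give every noise point (-1) the label of its nearest cluster centroid."""
    labels = np.asarray(labels).copy()
    cluster_ids = np.unique(labels[labels >= 0])
    if cluster_ids.size == 0:
        return labels  # nothing to assign to
    centroids = np.stack(
        [embeddings[labels == c].mean(axis=0) for c in cluster_ids]
    )
    noise_idx = np.where(labels == -1)[0]
    if noise_idx.size:
        # Distance from every noise point to every centroid
        dists = np.linalg.norm(
            embeddings[noise_idx][:, None, :] - centroids[None, :, :], axis=2
        )
        labels[noise_idx] = cluster_ids[dists.argmin(axis=1)]
    return labels

points = np.array(
    [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0], [1.0, 1.0], [9.0, 9.0]]
)
hdbscan_like = np.array([0, 0, 1, 1, -1, -1])
full = assign_noise_to_nearest(points, hdbscan_like)
print(full)
```

Forcing every outlier into a cluster reintroduces the K-Means weakness, so a distance cutoff that leaves far-away points unassigned is worth considering.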

Production & Evaluation

Once clusters exist, the real work begins: naming them for human consumption, monitoring drift, evaluating quality, and avoiding the mistakes that erode trust.

Naming Clusters

A useful cluster label should summarize the theme, not simply repeat the most frequent token. Good labels come from a combination of top terms, representative examples, and an LLM or human summarizer that sees enough evidence to describe the cluster accurately.

💡 Naming clusters is like writing newspaper headlines — the goal is to tell someone what the story is about, not to list every word in the article.

The Naming Problem

Raw cluster output is a number — "Cluster 3." That is useless to a product manager, operations lead, or executive. The cluster must be named in a way that enables decision-making. "Billing friction during checkout" is more useful than "payment issue words." This is a human-factors problem as much as a modeling problem.

Naming Strategies

Top-term extraction: Look at the most frequent or distinctive words in the cluster. Fast but often superficial.
Representative example review: Read the 5-10 items closest to the centroid. Slower but more accurate.
LLM summarization: Feed representative examples to an LLM and ask for a concise theme label. See Topic 9: LLMs in Topic Workflows.
Human-in-the-loop: Domain experts review clusters and assign operational names. Most reliable but least scalable.
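
Top-term extraction, the first strategy in the list, can be sketched with the standard library. The scoring rule (in-cluster count weighted by the share of a word's occurrences that fall inside the cluster) is an assumption chosen for simplicity. Class-based TF-IDF variants are a common alternative.

```python
import re
from collections import Counter

STOPWORDS = {"a", "after", "for", "is", "my", "not", "on", "still", "the", "was"}

def words(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def top_terms(texts, labels, cluster_id, n=3):
    """Words that are both frequent in the cluster and concentrated in it."""
    inside = Counter(
        w for t, l in zip(texts, labels) if l == cluster_id for w in words(t)
    )
    overall = Counter(w for t in texts for w in words(t))
    scored = {w: c * (c / overall[w]) for w, c in inside.items()}
    # Sort by score, then alphabetically for a stable result
    return sorted(scored, key=lambda w: (-scored[w], w))[:n]

texts = [
    "refund not received for my order",
    "still waiting for refund",
    "refund was charged twice",
    "app crashes on login",
    "app freezes after update",
    "app login fails",
]
labels = [0, 0, 0, 1, 1, 1]

for cluster_id in (0, 1):
    print(cluster_id, top_terms(texts, labels, cluster_id, n=2))
```

The result is a starting point for a label, not a label: "refund" says what the cluster mentions, not what is going wrong.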

Operational Labels

The best labels are operational — they tell someone what to do, not just what the cluster contains. Compare:

Weak Label	Strong Label	Why Better
payment words	Billing friction during checkout	Identifies the pain point and context
login cluster	Password reset flow failures	Specifies the failure mode
misc negative	Delivery delays on international orders	Actionable for the logistics team

→ Cluster naming is where data science meets communication. Labels should enable decisions, not just describe statistics.

Python Example

import openai

def name_cluster(representative_texts, client):
    """Use an LLM to generate an operational cluster name."""
    examples = "\n".join(
        f"- {t}" for t in representative_texts[:10]
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"""Below are representative examples from a cluster
of customer support tickets. Generate a concise,
operational label (5-8 words) that a product manager
could use to prioritize action.

Examples:
{examples}

Label:"""
        }],
        max_tokens=30,
        temperature=0.0
    )
    return response.choices[0].message.content.strip()

# Example usage
# label = name_cluster(representatives[cluster_id], client)

Follow-up Questions

How do you validate that a cluster name is accurate?

Show the label and 10-20 random (not just centroid-near) examples to a domain expert. If the expert agrees that 80%+ of examples match the label, the name is adequate. If agreement drops below 60%, the cluster may need splitting or the label needs revision.
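
The thresholds above translate into a small decision helper. The 80% and 60% cutoffs come from the text. The "borderline" outcome for scores in between is an assumption, as is the `judge_label` name.

```python
def judge_label(expert_votes, accept_at=0.8, reject_below=0.6):
    """expert_votes: one boolean per sampled example (True = matches the label)."""
    if not expert_votes:
        raise ValueError("Need at least one reviewed example")
    agreement = sum(expert_votes) / len(expert_votes)
    if agreement >= accept_at:
        verdict = "adequate"
    elif agreement < reject_below:
        verdict = "split cluster or revise label"
    else:
        verdict = "borderline: review more examples"  # assumed middle case
    return agreement, verdict

votes = [True] * 17 + [False] * 3   # expert agreed on 17 of 20 random examples
agreement, verdict = judge_label(votes)
print(f"{agreement:.0%} agreement -> {verdict}")
```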

What if a cluster contains multiple themes?

That cluster is likely too broad. Options include increasing k, using HDBSCAN to discover sub-structure, or applying a second round of clustering within the problematic cluster. Mixed clusters are a signal that the embedding or reduction step may also need tuning.

How often should cluster names be refreshed?

Refresh names whenever the underlying data distribution shifts significantly — typically every 1-3 months for fast-moving products. See Topic 7: Evolving Topics Over Time for monitoring strategies. Stale labels erode trust faster than stale clusters.

Evolving Topics Over Time

Topics drift as products change, events occur, and new language enters the corpus. A production system must support periodic re-embedding, incremental clustering, or time-sliced analysis so teams can see whether themes are growing, shrinking, splitting, or merging.

💡 Topics are living organisms, not fossils. They grow, split, merge, and die over time — a one-time snapshot misses the story.

Why Static Snapshots Fail

Topic discovery is not a one-time report. Customer complaints shift with product releases. Research themes evolve with new publications. Support tickets change seasonally. A topic model trained in January may be stale by March. In interviews, show that you understand topic evolution as a temporal analytics problem, not only a static NLP task.

Monitoring Strategies

Time-sliced clustering: Run clustering separately on each time window (week, month) and compare cluster compositions
Incremental clustering: Assign new data to existing clusters, flagging items that do not fit any cluster as potential new topics
Drift detection: Track cluster centroid movement, size changes, and member turnover over time
Re-embedding cycles: Periodically re-embed and re-cluster the entire corpus to catch structural shifts
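
Incremental clustering, the second strategy in the list, can be sketched as nearest-centroid assignment with a distance cutoff. The `assign_or_flag` name and the cutoff value are assumptions. A sensible cutoff depends on the embedding model and would be calibrated on existing cluster members.

```python
import numpy as np

def assign_or_flag(new_vectors, centroids, max_distance):
    """Assign each new item to its nearest centroid, or -1 if none is close."""
    dists = np.linalg.norm(
        new_vectors[:, None, :] - centroids[None, :, :], axis=2
    )
    nearest = dists.argmin(axis=1)
    too_far = dists.min(axis=1) > max_distance
    return np.where(too_far, -1, nearest)

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
incoming = np.array([[0.5, 0.2], [9.5, 10.1], [5.0, -8.0]])

assigned = assign_or_flag(incoming, centroids, max_distance=2.0)
print(assigned)  # -1 marks a candidate for a new topic
```

Items flagged with -1 can be collected and clustered on their own once enough accumulate, which is how a new theme first becomes visible.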

What to Track

Signal	What It Means	Action
Cluster growing rapidly	Emerging issue or trending topic	Escalate for review
Cluster shrinking	Issue resolved or seasonal decline	Deprioritize or archive
Cluster splitting	Theme is diversifying into sub-topics	Create sub-labels
Two clusters merging	Previously distinct issues converging	Consolidate labels
New cluster appearing	Previously unseen theme	Investigate and name

→ Topic monitoring is what separates an insight from a one-time report. Production systems must track how themes change, not just what they are today.

Python Example

import numpy as np
from sklearn.cluster import KMeans
from collections import Counter

def track_topic_drift(embeddings_by_month, n_clusters=8):
    """Compare cluster distributions across time windows."""
    months = sorted(embeddings_by_month.keys())

    # Fit once on the earliest window so a cluster ID means the same
    # thing in every month. Refitting per month would produce arbitrary,
    # non-comparable IDs unless clusters are aligned afterward.
    km = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    km.fit(embeddings_by_month[months[0]])

    results = {}
    for month in months:
        labels = km.predict(embeddings_by_month[month])

        # Track cluster sizes as proportions
        counts = Counter(labels)
        total = len(labels)
        results[month] = {c: counts[c] / total for c in range(n_clusters)}

    # Compare consecutive months
    for i in range(1, len(months)):
        prev, curr = results[months[i-1]], results[months[i]]
        for c in range(n_clusters):
            delta = curr[c] - prev[c]
            if abs(delta) > 0.05:  # 5-percentage-point shift threshold
                direction = "growing" if delta > 0 else "shrinking"
                print(f"{months[i]}: Cluster {c} {direction} ({delta:+.1%})")

    return results

Follow-up Questions

How do you align clusters across different time windows?

Cluster IDs are arbitrary, so you need to align them. Common approaches include matching clusters by centroid similarity (cosine distance between centroids) or by member overlap (Jaccard similarity of items that appear in both windows). Hungarian algorithm matching works well for automated alignment.
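
Centroid-based alignment with the Hungarian algorithm can be sketched with SciPy's `linear_sum_assignment`. The centroids below are toy values, and `align_clusters` is a hypothetical helper name.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_clusters(prev_centroids, curr_centroids):
    """Map each current cluster ID to the best-matching previous cluster ID."""
    prev_unit = prev_centroids / np.linalg.norm(prev_centroids, axis=1, keepdims=True)
    curr_unit = curr_centroids / np.linalg.norm(curr_centroids, axis=1, keepdims=True)
    cost = 1.0 - curr_unit @ prev_unit.T   # cosine distance; rows = current
    curr_ids, prev_ids = linear_sum_assignment(cost)
    return {int(c): int(p) for c, p in zip(curr_ids, prev_ids)}

prev = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# Same themes a month later, slightly shifted and in a different ID order
curr = np.array([[1.1, 0.9], [0.9, 0.1], [0.1, 1.0]])

mapping = align_clusters(prev, curr)
print(mapping)
```

The matching is forced, so every cluster gets a partner even when the best match is poor. Checking the matched pair's similarity against a minimum value helps separate a renamed cluster from a new one.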

How frequently should you re-run topic discovery?

It depends on the data velocity. For customer support, weekly or bi-weekly is common. For research papers, monthly or quarterly. Set up drift alerts so re-analysis is triggered when cluster distributions shift beyond a threshold, rather than on a fixed schedule.
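
A drift alert can be as simple as comparing two cluster-share distributions. The sketch below uses total variation distance. The 0.15 threshold and both function names are assumptions for illustration.

```python
def total_variation(p, q):
    """Half the summed absolute difference between two share distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def needs_reanalysis(prev_shares, curr_shares, threshold=0.15):
    """True when the topic mix has shifted enough to justify a re-run."""
    return total_variation(prev_shares, curr_shares) > threshold

last_week = {0: 0.5, 1: 0.3, 2: 0.2}
stable_week = {0: 0.48, 1: 0.32, 2: 0.2}
shifted_week = {0: 0.3, 1: 0.3, 2: 0.2, 3: 0.2}   # a new cluster took 20%

print(needs_reanalysis(last_week, stable_week))
print(needs_reanalysis(last_week, shifted_week))
```

The distance ranges from 0 (identical mix) to 1 (no overlap), which makes the threshold easy to explain to non-technical stakeholders.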

Can you do real-time topic discovery?

Yes, but with trade-offs. Streaming clustering (e.g., incremental DBSCAN) can assign incoming items to existing clusters in real time. However, detecting new clusters and handling splits/merges is harder in streaming mode. Most production systems use a hybrid: real-time assignment with periodic batch re-clustering.

Evaluating Topic Quality

Good topics are coherent internally, distinct from one another, and useful to decision-makers. Automatic measures can help, but manual inspection of representative examples remains essential. The question is not only "Do the clusters exist?" but "Can a team act on them?"

💡 Evaluating topics is like evaluating a map: mathematical accuracy matters, but if the map does not help you navigate, it has failed its purpose.

Automatic Measures

Metric	What It Measures	Limitation
Silhouette Score	How well-separated clusters are	Favors spherical clusters; sensitive to dimensionality
Coherence (C_v)	Semantic coherence of topic words	Designed for word-based topics, less relevant for embedding clusters
Calinski-Harabasz	Ratio of between-cluster to within-cluster dispersion	Assumes convex clusters
Davies-Bouldin	Average similarity ratio of each cluster to its most similar cluster	Sensitive to noisy clusters
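
For reference, the silhouette score in the first row has a compact definition. For item i, let a(i) be its mean distance to the other members of its own cluster and b(i) the smallest mean distance to the members of any other cluster. The per-item value and the overall score over N items are:

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
\qquad
S = \frac{1}{N} \sum_{i=1}^{N} s(i)
```

Values near 1 mean an item sits well inside its cluster, values near 0 mean it lies between clusters, and negative values suggest it may belong elsewhere.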

Human Evaluation

Automatic metrics tell you whether clusters are mathematically well-separated, but they cannot tell you whether the clusters are useful. Human evaluation should check:

Coherence: Do random examples from the same cluster feel like they belong together?
Distinctness: Can a reviewer tell clusters apart without looking at labels?
Completeness: Are important themes represented, or are they split across clusters?
Actionability: Can a product team take different actions for different clusters?

Evaluation Should Combine Both

The strongest interview answer is that evaluation should combine statistical coherence with analyst usefulness. If a cluster contains semantically mixed examples, the label is likely misleading even if a metric looks acceptable. Conversely, if humans find the clusters useful but the silhouette score is mediocre, the clusters may still be production-worthy. See Topic 6: Naming Clusters for how label quality affects evaluation.

→ Evaluation combines math and judgment. A cluster with a great silhouette score but mixed themes is worse than one with a mediocre score but clear actionability.

Python Example

from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)
import numpy as np

def evaluate_clusters(embeddings, labels):
    """Compute automatic cluster quality metrics."""
    # Filter out noise labels (HDBSCAN assigns -1)
    labels = np.asarray(labels)  # allow plain lists as input
    mask = labels >= 0
    X = embeddings[mask]
    y = labels[mask]

    if len(set(y)) < 2:
        return {"error": "Need at least 2 clusters"}

    return {
        "silhouette": silhouette_score(X, y),       # [-1, 1] higher = better
        "calinski_harabasz": calinski_harabasz_score(X, y),  # higher = better
        "davies_bouldin": davies_bouldin_score(X, y),    # lower = better
        "n_clusters": len(set(y)),
        "noise_pct": (1.0 - mask.mean()) * 100,  # percentage of points labeled -1
    }

# Usage
# metrics = evaluate_clusters(reduced_embeddings, cluster_labels)
# print(f"Silhouette: {metrics['silhouette']:.3f}")
# Then: manually inspect 10 random items per cluster

Follow-up Questions

What silhouette score is "good enough" for production?

There is no universal threshold. For text clustering, scores above 0.2 are common and often sufficient if human evaluation confirms cluster utility. Scores above 0.5 are strong. The metric matters most for comparing configurations (e.g., k=8 vs k=12) rather than as an absolute quality gate.

How do you run human evaluation efficiently at scale?

Sample 10-20 random items from each cluster (not just centroid-near items). Present them to 2-3 reviewers and measure inter-annotator agreement on whether items belong to the labeled theme. A tool like Label Studio or a simple spreadsheet with randomized assignments works well for teams of 2-5 reviewers.

Can LLMs replace human evaluation of clusters?

LLMs can supplement but not fully replace human evaluation. An LLM can quickly check whether representative examples match a proposed label, but it cannot judge whether the label is operationally useful for a specific team. Use LLMs for the first pass and humans for final validation.

LLMs in Topic Workflows

LLMs are especially useful after clustering. They can label themes, summarize representative examples, compare adjacent clusters, and generate human-readable insights from a large corpus. But LLM-generated summaries can sound crisp even when the underlying cluster is messy.

💡 The LLM is a skilled reporter, not a fact-checker. It can write a compelling headline for any cluster — your job is to verify the headline is true.

Where LLMs Add Value

Labeling: Generate concise, operational names for clusters based on representative examples (see Topic 6: Naming Clusters)
Summarization: Produce paragraph-length summaries of each theme for executive reports
Comparison: Describe how adjacent or overlapping clusters differ, helping decide whether to merge or keep separate
Taxonomy bootstrapping: Once latent topics are discovered, an LLM can propose a hierarchical taxonomy structure that organizes themes into categories and sub-categories

The Crispness Trap

An LLM will label whatever it is given. A cluster containing a mix of billing complaints, shipping issues, and random noise will still receive a confident-sounding label, and that label creates a false sense of structure.

In interviews, say that LLMs improve interpretation and reporting, but cluster quality must still be validated against raw examples. The LLM is an amplifier, not a validator.
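
One way to catch messy clusters before any label is generated is to score each cluster separately. The sketch below assumes scikit-learn's `silhouette_samples` and a cutoff of 0.1; the cutoff and the helper name `flag_messy_clusters` are illustrative assumptions, and a low score is a prompt to read raw examples, not a verdict.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def flag_messy_clusters(embeddings, labels, min_mean_silhouette=0.1):
    """Return IDs of clusters whose mean per-item silhouette is below the cutoff."""
    X = np.asarray(embeddings)
    y = np.asarray(labels)
    keep = y >= 0                      # drop noise points (label -1)
    X, y = X[keep], y[keep]
    per_item = silhouette_samples(X, y)  # needs at least 2 clusters
    return [
        int(c) for c in np.unique(y)
        if per_item[y == c].mean() < min_mean_silhouette
    ]
```

Clusters returned here are the ones where a crisp LLM label is most likely to hide a mixed bag of items.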

Prompt Design for Topic Labeling

Effective prompts for cluster labeling should:

Provide 5-10 representative examples (not all items)
Specify the desired label format (length, style, audience)
Ask for a confidence assessment alongside the label
Request alternative labels when confidence is low
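
The four guidelines above can be folded into a small prompt builder. This is a sketch only: the wording, the default audience, and the name `build_label_prompt` are assumptions, and the prompt would need tuning for a real corpus.

```python
def build_label_prompt(examples, max_examples=8, max_words=6,
                       audience="a support operations team"):
    """Assemble a cluster-labeling prompt from representative examples."""
    shown = list(examples)[:max_examples]      # a handful, not the whole cluster
    bullets = "\n".join(f"- {text}" for text in shown)
    return (
        f"You are labeling one cluster of customer feedback for {audience}.\n\n"
        f"Examples:\n{bullets}\n\n"
        f"1. Give one label of at most {max_words} words, written as a noun phrase.\n"
        "2. Rate your confidence that the label fits every example: "
        "high, medium, or low.\n"
        "3. If confidence is medium or low, give two alternative labels "
        "and say which examples do not fit."
    )
```

Asking which examples do not fit gives the reviewer a starting point for checking the label against raw data.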

→ LLMs make topic workflows faster and more readable, but they cannot validate cluster quality. Always verify LLM-generated labels against raw data.

Python Example

import openai

def compare_clusters(cluster_a_examples, cluster_b_examples, client):
    """Ask an LLM to describe how two clusters differ."""
    prompt = f"""You are analyzing two clusters of customer feedback.

Cluster A examples:
{chr(10).join('- ' + e for e in cluster_a_examples[:5])}

Cluster B examples:
{chr(10).join('- ' + e for e in cluster_b_examples[:5])}

1. Provide a short label for each cluster (5-8 words).
2. Describe the key difference between them in 1-2 sentences.
3. Should they be merged or kept separate? Why?"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content

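# Usage (assumes OPENAI_API_KEY is set in the environment)
# client = openai.OpenAI()
# print(compare_clusters(examples_a, examples_b, client))

# Caution: always validate LLM output against raw examples!
# The LLM will produce confident labels even for messy clusters.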

Follow-up Questions

Can LLMs replace the entire clustering pipeline?

Not yet at scale. LLMs can classify individual items into topics, but doing so for millions of items is prohibitively expensive and slow. Clustering with embeddings is orders of magnitude cheaper. The sweet spot is embedding-based clustering for grouping, followed by LLM-based labeling for the much smaller set of cluster representatives.
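
Selecting the small set of cluster representatives can be sketched with NumPy. The selection rule here (the items closest to each cluster centroid by Euclidean distance) is an assumption, and `pick_representatives` is a hypothetical helper name.

```python
import numpy as np

def pick_representatives(embeddings, labels, per_cluster=5):
    """Return {cluster_id: [row indices]} of the items nearest each centroid."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    reps = {}
    for cluster_id in np.unique(labels[labels >= 0]):   # skip noise (-1)
        member_idx = np.flatnonzero(labels == cluster_id)
        centroid = embeddings[member_idx].mean(axis=0)
        dists = np.linalg.norm(embeddings[member_idx] - centroid, axis=1)
        nearest = member_idx[np.argsort(dists)[:per_cluster]]
        reps[int(cluster_id)] = nearest.tolist()
    return reps
```

With a few representatives per cluster, the number of LLM calls scales with the number of clusters rather than the number of items. Centroid-near items suit labeling; for evaluation, random samples give a more honest picture of the cluster.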

How do you handle LLM hallucination in cluster labels?

Cross-check the label against the actual examples. If the label mentions a concept that does not appear in any representative example, it is likely hallucinated. A simple validation step is a second LLM call that asks: "Does this label accurately describe these examples? Answer yes/no with reasoning."
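
A cheap lexical version of that cross-check can run before any second LLM call. The sketch below is a crude heuristic under stated assumptions: it matches exact lowercase tokens, so it misses paraphrases and inflections ("refund" vs "refunds"), and the stopword list and the name `ungrounded_terms` are illustrative.

```python
import re

# Small illustrative stopword list; extend for real use
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "to", "with", "or"}

def _tokens(text):
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def ungrounded_terms(label, examples):
    """Return label terms that appear in none of the examples."""
    example_vocab = set().union(*(_tokens(e) for e in examples))
    return sorted(_tokens(label) - STOPWORDS - example_vocab)
```

A non-empty result does not prove hallucination; it marks labels worth a closer look.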

What about using LLMs for embedding instead of sentence transformers?

LLM APIs (e.g., OpenAI embeddings) can produce high-quality embeddings, but at higher cost and latency than dedicated sentence transformer models. For large-scale topic discovery (100K+ items), dedicated models like BGE or E5 running locally are more practical. Use LLM embeddings when quality matters more than cost.

Common Mistakes at Scale

Common mistakes include using the wrong unit of analysis, clustering noisy boilerplate, overinterpreting weak visualizations, and treating automatically generated labels as truth. The strongest interview answer emphasizes that topic discovery is iterative sense-making, not just chart generation.

💡 Topic modeling mistakes are like optical illusions — they look convincing until you check the raw data beneath the surface.

The Validation Imperative

Topic discovery should be treated as iterative sense-making. The goal is insight you can trust, not just an impressive chart. Every stage of the pipeline (see Topic 3: The Discovery Pipeline) can introduce errors that compound downstream.

Mistake Taxonomy

Mistake	Symptom	Fix
Wrong unit of analysis	Clusters are too broad or incoherent	Try paragraph or sentence-level embedding
Clustering boilerplate	Largest cluster is auto-replies or signatures	Clean text before embedding
Trusting 2D plots	Clusters look separated in UMAP but overlap in reality	Validate with original embeddings
Auto labels as truth	Teams act on LLM-generated labels without verification	Human review of random samples
Ignoring temporal drift	Stale clusters no longer match current data	Periodic re-clustering and monitoring
Too many clusters	Adjacent clusters are nearly identical	Merge similar clusters; reduce k
Too few clusters	Clusters contain obviously different themes	Increase k; try HDBSCAN
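
For the "too many clusters" row, near-identical clusters can be surfaced by comparing centroids. This sketch assumes cosine similarity between centroids in the original embedding space and a threshold of 0.9; both the threshold and the name `find_merge_candidates` are assumptions to adjust per corpus.

```python
import numpy as np

def find_merge_candidates(embeddings, labels, threshold=0.9):
    """Return (cluster_a, cluster_b, cosine) for centroid pairs above the threshold."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels[labels >= 0])          # skip noise (-1)
    if len(ids) < 2:
        return []
    centroids = np.stack([embeddings[labels == c].mean(axis=0) for c in ids])
    norms = np.linalg.norm(centroids, axis=1, keepdims=True)
    unit = centroids / np.clip(norms, 1e-12, None)  # avoid division by zero
    sims = unit @ unit.T
    pairs = [
        (int(ids[i]), int(ids[j]), float(sims[i, j]))
        for i in range(len(ids))
        for j in range(i + 1, len(ids))
        if sims[i, j] >= threshold
    ]
    return sorted(pairs, key=lambda p: -p[2])     # most similar pair first
```

Each returned pair is a candidate for review, not an automatic merge; reading examples from both clusters settles the question.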

Interview Signal

The strongest interview answer emphasizes validation at every stage. Weak candidates describe the algorithm; strong candidates describe how they verified the results. Mention that you would inspect representative examples, check cluster stability across reruns, and present results to domain experts before trusting them. See Topic 8: Evaluating Topic Quality for the evaluation framework.

→ The most dangerous mistake is skipping validation. Topic discovery produces hypotheses, not facts — treat every output as something that must earn your trust.

Python Example

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def validate_cluster_stability(embeddings, n_clusters=8, n_runs=5):
    """Check if clusters are stable across random seeds."""
    if n_runs < 2:
        raise ValueError("n_runs must be at least 2 to compare runs")

    all_labels = []
    for seed in range(n_runs):
        km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
        all_labels.append(km.fit_predict(embeddings))

    # Compare all pairs of runs
    scores = []
    for i in range(n_runs):
        for j in range(i + 1, n_runs):
            scores.append(adjusted_rand_score(all_labels[i], all_labels[j]))

    avg_ari = np.mean(scores)
    print(f"Average ARI across {n_runs} runs: {avg_ari:.3f}")
    # Rough rules of thumb, not hard cutoffs:
    # ARI > 0.8: stable clusters
    # ARI 0.5-0.8: some instability, investigate
    # ARI < 0.5: clusters are not reliable
    return avg_ari

Follow-up Questions

How do you convince stakeholders that cluster labels are provisional?

Frame results as hypotheses, not conclusions. Present labels alongside representative examples and confidence indicators. Use language like "this cluster appears to be about X" rather than "this category is X." Showing the raw examples builds trust and calibrates expectations.

What is the biggest difference between a notebook demo and a production topic system?

A notebook demo runs once on static data. A production system handles incremental data, temporal drift, monitoring, alerting, and human-in-the-loop validation. It also needs reproducibility (pinned seeds, versioned embeddings), access control for labeled data, and integration with downstream workflows like ticket routing or dashboards.
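
The drift-monitoring piece can be sketched as a check of new items against stored centroids. This assumes centroids saved from the last clustering run and a distance cutoff `max_distance` chosen by the team (for example from the distribution of distances in the original data); the name `drift_report` and the cutoff are hypothetical.

```python
import numpy as np

def drift_report(new_embeddings, centroids, max_distance):
    """Assign new items to the nearest stored centroid and measure how many fit poorly."""
    new = np.asarray(new_embeddings, dtype=float)
    cents = np.asarray(centroids, dtype=float)
    # Pairwise Euclidean distances, shape (n_new, n_clusters)
    dists = np.linalg.norm(new[:, None, :] - cents[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    nearest_dist = dists.min(axis=1)
    unmatched = nearest_dist > max_distance
    return {
        "assigned_cluster": nearest.tolist(),
        "unmatched_fraction": float(unmatched.mean()),
    }
```

A rising `unmatched_fraction` over successive batches is one possible trigger for re-clustering or an alert.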

How do you handle topics that are genuinely ambiguous?

Some items legitimately belong to multiple topics. Options include soft clustering (probabilistic membership), multi-label assignment, or an explicit "ambiguous" category. Do not force ambiguous items into a single cluster — this degrades the quality of the clusters they are forced into.
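
An explicit "ambiguous" flag can be sketched by comparing each item's two best-matching clusters. This assumes cosine similarity to cluster centroids, at least two clusters, non-zero vectors, and a margin cutoff of 0.05; the cutoff and the name `flag_ambiguous` are illustrative assumptions.

```python
import numpy as np

def flag_ambiguous(embeddings, centroids, min_margin=0.05):
    """Return (best_cluster, is_ambiguous) arrays based on the top-two similarity margin."""
    X = np.asarray(embeddings, dtype=float)
    C = np.asarray(centroids, dtype=float)
    unit_x = X / np.linalg.norm(X, axis=1, keepdims=True)
    unit_c = C / np.linalg.norm(C, axis=1, keepdims=True)
    sims = unit_x @ unit_c.T                      # shape (n_items, n_clusters)
    top_two = np.sort(sims, axis=1)[:, -2:]       # columns: second best, best
    margin = top_two[:, 1] - top_two[:, 0]
    return sims.argmax(axis=1), margin < min_margin
```

Flagged items can be routed to a separate bucket or given multiple labels instead of being forced into one cluster.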