Dunn Index (DI):

The Dunn Index (DI) is a metric used to determine the quality of the clustering algorithm’s output. The idea is to identify sets of clusters that are compact (intra-cluster distances are small) and well-separated (inter-cluster distances are large).

Mathematically, the Dunn Index is defined as:

In this formula:

  1. Distance between clusters: This is the shortest distance between two clusters. For two clusters and , it can be computed as:

  2. Distance within clusters: This is the maximal intra-cluster distance. For a cluster , it can be computed as:

Where is the distance between two data points.

Interpreting the Dunn Index:

  • A higher value of the Dunn Index suggests better clustering. This is because a high value would imply that the minimum inter-cluster distance is much larger than the maximum intra-cluster distance.

  • A low value suggests the opposite, indicating that the clusters might not be well-separated or are not compact.

Limitations:

  1. Computationally intensive: Calculating the DI requires computing distances between all pairs of clusters and within all clusters. This can be a limiting factor for very large datasets.

  2. Dependence on the distance metric: The DI’s value is affected by the choice of distance metric (e.g., Euclidean, Manhattan, etc.). Thus, the metric should be chosen carefully based on the nature of the data.

  3. Doesn’t capture other cluster quality aspects: While the DI focuses on cluster separation and compactness, there are other aspects of clustering quality (like shape, density, etc.) that it might not capture.

python

  from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import numpy as np
 
# Sample data
data, _ = make_blobs(n_samples=300, centers=3, random_state=42)
 
# Sample clustering
kmeans = KMeans(n_clusters=3)
labels = kmeans.fit_predict(data)
 
# Dunn Index
def dunn_index(X, labels):
    # pairwise distance matrix
    distance_matrix = np.linalg.norm(X[:, np.newaxis] - X, axis=2)
 
    # For each cluster, compute the intra-cluster distance
    intra_cluster_distances = np.array([
        np.max(distance_matrix[labels == i][:, labels == i])
        for i in np.unique(labels)
    ])
 
    # For each pair of clusters, compute the inter-cluster distance
    inter_cluster_distances = np.array([
        np.min(distance_matrix[labels == i][:, labels == j])
        for i in np.unique(labels)
        for j in np.unique(labels)
        if i != j
    ])
 
    return np.min(inter_cluster_distances) / np.max(intra_cluster_distances)
 
di = dunn_index(data, labels)
print(f"Dunn Index: {di}")