Dunn Index (DI):
The Dunn Index (DI) is an internal validation metric for assessing the quality of a clustering algorithm’s output. It favors sets of clusters that are compact (intra-cluster distances are small) and well separated (inter-cluster distances are large).
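Mathematically, the Dunn Index is defined as:

$$DI = \frac{\min_{i \neq j} \delta(C_i, C_j)}{\max_{k} \Delta(C_k)}$$

In this formula:
- Distance between clusters: This is the shortest distance between two clusters. For two clusters $C_i$ and $C_j$, it can be computed as:

$$\delta(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$$

- Distance within clusters: This is the maximal intra-cluster distance. For a cluster $C_k$, it can be computed as:

$$\Delta(C_k) = \max_{x, y \in C_k} d(x, y)$$

Where $d(x, y)$ is the distance between two data points $x$ and $y$.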
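As a quick sanity check of these definitions, the sketch below works through a toy one-dimensional case that can be verified by hand. The data values, the cluster names `A` and `B`, and the use of the absolute difference as the point-to-point distance are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical toy data: two 1-D clusters (values chosen for easy arithmetic)
A = np.array([0.0, 1.0])
B = np.array([5.0, 7.0])

# Distance between clusters: smallest distance from a point in A to a point in B
delta_AB = np.min(np.abs(A[:, None] - B[None, :]))  # |1 - 5| = 4

# Distance within a cluster: largest distance between two of its own points
diam_A = np.max(np.abs(A[:, None] - A[None, :]))    # |0 - 1| = 1
diam_B = np.max(np.abs(B[:, None] - B[None, :]))    # |5 - 7| = 2

# With only two clusters there is a single inter-cluster pair to consider
dunn = delta_AB / max(diam_A, diam_B)
print(dunn)  # 4 / 2 = 2.0
```

The gap between the clusters (4) is twice the widest cluster (2), so the index comes out as 2.0.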
Interpreting the Dunn Index:
- A higher value of the Dunn Index suggests better clustering, because it implies that the minimum inter-cluster distance is large relative to the maximum intra-cluster distance.
- A low value suggests the opposite: the clusters might not be well separated, might not be compact, or both.
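To make this interpretation concrete, the following sketch scores two synthetic datasets that share the same cluster centers but differ in spread. The centers, the `cluster_std` values, and the helper built on SciPy's `cdist` are assumptions for illustration. To isolate the effect of spread, the generating labels are used directly instead of running a clustering algorithm.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import make_blobs


def dunn_index(X, labels):
    """Dunn Index: smallest between-cluster gap over largest cluster diameter."""
    groups = [X[labels == c] for c in np.unique(labels)]
    # Largest within-cluster distance over all clusters
    max_diameter = max(cdist(g, g).max() for g in groups)
    # Smallest between-cluster distance over all distinct cluster pairs
    min_separation = min(
        cdist(groups[i], groups[j]).min()
        for i in range(len(groups))
        for j in range(i + 1, len(groups))
    )
    return min_separation / max_diameter


# Hypothetical centers, at least 10 units apart; only the spread differs
centers = [[0, 0], [10, 0], [0, 10]]
X_tight, y_tight = make_blobs(n_samples=300, centers=centers,
                              cluster_std=0.5, random_state=0)
X_loose, y_loose = make_blobs(n_samples=300, centers=centers,
                              cluster_std=3.0, random_state=0)

di_tight = dunn_index(X_tight, y_tight)
di_loose = dunn_index(X_loose, y_loose)
print(f"tight blobs: {di_tight:.3f}")
print(f"loose blobs: {di_loose:.3f}")
```

The tight configuration should score noticeably higher, since its clusters are small relative to the gaps between them, while the loose clusters overlap.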
Limitations:
- Computationally intensive: Calculating the DI requires distances between points in every pair of clusters and within every cluster. Computed naively, this means all pairwise point distances, so the cost grows quadratically with the number of samples, which can be a limiting factor for very large datasets.
- Dependence on the distance metric: The DI’s value is affected by the choice of distance metric (e.g., Euclidean, Manhattan, etc.). Thus, the metric should be chosen carefully based on the nature of the data.
- Doesn’t capture other cluster quality aspects: While the DI focuses on cluster separation and compactness, there are other aspects of clustering quality (like shape, density, etc.) that it might not capture.
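The dependence on the distance metric can be seen by scoring the same labelled data under two metrics. This is a minimal sketch, assuming SciPy's `cdist` for the distance computation. The `metric` parameter on the helper is a hypothetical addition for illustration, and the generating labels from `make_blobs` stand in for a clustering result.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import make_blobs


def dunn_index(X, labels, metric="euclidean"):
    """Dunn Index for any metric name accepted by scipy's cdist."""
    groups = [X[labels == c] for c in np.unique(labels)]
    # Largest within-cluster distance under the chosen metric
    max_diameter = max(cdist(g, g, metric=metric).max() for g in groups)
    # Smallest between-cluster distance under the chosen metric
    min_separation = min(
        cdist(groups[i], groups[j], metric=metric).min()
        for i in range(len(groups))
        for j in range(i + 1, len(groups))
    )
    return min_separation / max_diameter


X, y = make_blobs(n_samples=300, centers=3, random_state=42)

# "cityblock" is SciPy's name for the Manhattan distance
results = {m: dunn_index(X, y, metric=m) for m in ("euclidean", "cityblock")}
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

Computing distances one cluster pair at a time keeps each intermediate array small, although the total amount of work is still quadratic in the number of samples.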
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Sample data
data, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Sample clustering (seeded so the result is reproducible)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(data)


# Dunn Index
def dunn_index(X, labels):
    # Pairwise Euclidean distance matrix, shape (n_samples, n_samples)
    distance_matrix = np.linalg.norm(X[:, np.newaxis] - X, axis=2)
    clusters = np.unique(labels)

    # For each cluster, compute the intra-cluster distance (its diameter)
    intra_cluster_distances = np.array([
        np.max(distance_matrix[labels == i][:, labels == i])
        for i in clusters
    ])

    # For each pair of distinct clusters, compute the inter-cluster distance
    # (distances are symmetric, so each unordered pair is visited once)
    inter_cluster_distances = np.array([
        np.min(distance_matrix[labels == i][:, labels == j])
        for i in clusters
        for j in clusters
        if i < j
    ])

    return np.min(inter_cluster_distances) / np.max(intra_cluster_distances)


di = dunn_index(data, labels)
print(f"Dunn Index: {di}")
```