The Rand index is a measure of the similarity between two data clusterings. It is often used to measure the performance of a clustering algorithm against a “true” or ground-truth clustering.
Given:
- $TP$: the number of pairs of elements that are in the same set in both the predicted clustering and the true clustering
- $TN$: the number of pairs of elements that are in different sets in both the predicted clustering and the true clustering
- $FP$: the number of pairs of elements that are in the same set in the predicted clustering and in different sets in the true clustering
- $FN$: the number of pairs of elements that are in different sets in the predicted clustering and in the same set in the true clustering
The Rand index is computed as:

$$RI = \frac{TP + TN}{TP + TN + FP + FN}$$
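As a concrete illustration of the four pair counts, the following sketch enumerates every pair explicitly with `itertools.combinations`. The helper name `pair_counts` and the toy labels are assumptions made for this example; it is a brute-force check rather than an efficient implementation.

```python
from itertools import combinations


def pair_counts(labels_true, labels_pred):
    """Count pairs by brute force; returns (tp, tn, fp, fn)."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_pred and same_true:
            tp += 1  # together in both clusterings
        elif not same_pred and not same_true:
            tn += 1  # apart in both clusterings
        elif same_pred:
            fp += 1  # together in predicted, apart in true
        else:
            fn += 1  # apart in predicted, together in true
    return tp, tn, fp, fn


# Hypothetical toy labels for four elements (six pairs in total)
labels_true = [0, 0, 1, 1]
labels_pred = [0, 0, 0, 1]

tp, tn, fp, fn = pair_counts(labels_true, labels_pred)
rand_index = (tp + tn) / (tp + tn + fp + fn)
print(tp, tn, fp, fn, rand_index)  # 1 2 2 1 0.5
```

Half of the six pairs are treated the same way by both clusterings, so the index is 0.5.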
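The Rand index takes a value between 0 and 1. A value of 1 indicates that the two clusterings are identical up to a relabeling of the clusters. A value of 0 indicates that the two clusterings disagree on every pair of elements, which happens only in extreme cases, for example when one clustering puts all elements in a single cluster and the other puts each element in its own cluster.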
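The two endpoints can be checked on small inputs. This is a minimal sketch; the function name `rand_index` and the label lists are illustrative assumptions, and the function simply counts the pairs on which both clusterings make the same together/apart decision.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of pairs on which the two clusterings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Same partition with the cluster names swapped: the index is still 1
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0

# One big cluster versus all singletons: no pair agrees, so the index is 0
print(rand_index([0, 0, 0, 0], [0, 1, 2, 3]))  # 0.0
```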
The Rand index also has a chance-corrected version known as the Adjusted Rand Index (ARI). The ARI rescales the Rand index by the agreement expected from a random grouping of the elements, so its value is at most 1 and can be negative. An ARI of 1 indicates a perfect match, an ARI close to 0 indicates agreement no better than chance, and a negative ARI indicates agreement worse than chance. The lower bound is -0.5, not -1.
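For reference, the chance correction is commonly written as below. The symbols are introduced here and are not part of the text above: $n_{ij}$ is the number of elements in true cluster $i$ and predicted cluster $j$, $a_i$ and $b_j$ are the corresponding cluster sizes, and $n$ is the total number of elements. The expectation assumes random labelings with the cluster sizes held fixed, which is the usual convention.

$$\mathrm{ARI} = \frac{\mathrm{Index} - \mathbb{E}[\mathrm{Index}]}{\max(\mathrm{Index}) - \mathbb{E}[\mathrm{Index}]}, \qquad \mathrm{Index} = \sum_{i,j} \binom{n_{ij}}{2}$$

$$\mathbb{E}[\mathrm{Index}] = \frac{\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}}{\binom{n}{2}}, \qquad \max(\mathrm{Index}) = \frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right]$$

As a worked case, take the labels `[0, 0, 1, 1, 2, 2]` and `[0, 0, 1, 2, 2, 2]`. Here $\mathrm{Index} = 2$, $\sum_i \binom{a_i}{2} = 3$, $\sum_j \binom{b_j}{2} = 4$ and $\binom{6}{2} = 15$, so $\mathbb{E}[\mathrm{Index}] = 0.8$, $\max(\mathrm{Index}) = 3.5$, and $\mathrm{ARI} = (2 - 0.8) / (3.5 - 0.8) \approx 0.444$.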
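When evaluating clustering algorithms, the ARI is usually preferred over the raw Rand index. With many clusters, most pairs are separated in both clusterings, so the raw index sits close to 1 even for unrelated labelings, whereas the ARI stays near 0.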
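The effect can be seen by comparing two independent random labelings. The sketch below uses only the standard library; the helper `rand_and_adjusted`, the seed, and the sizes (200 items, 10 clusters) are arbitrary assumptions for illustration, and the ARI expression is the common contingency-table form rather than a reproduction of any particular library's code.

```python
import random
from collections import Counter
from math import comb


def rand_and_adjusted(labels_a, labels_b):
    """Return (RI, ARI) computed from contingency counts."""
    total_pairs = comb(len(labels_a), 2)
    joint = Counter(zip(labels_a, labels_b))
    # Pairs placed together in both clusterings, in a, and in b
    together_both = sum(comb(c, 2) for c in joint.values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    # Agreeing pairs = together in both + apart in both
    apart_both = total_pairs - together_a - together_b + together_both
    ri = (together_both + apart_both) / total_pairs

    # Chance correction; undefined (division by zero) in degenerate cases
    # such as both clusterings consisting only of singletons
    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    ari = (together_both - expected) / (maximum - expected)
    return ri, ari


rng = random.Random(0)  # fixed seed so the run is repeatable
n_items, n_clusters = 200, 10
labels_a = [rng.randrange(n_clusters) for _ in range(n_items)]
labels_b = [rng.randrange(n_clusters) for _ in range(n_items)]

ri, ari = rand_and_adjusted(labels_a, labels_b)
print(f"RI = {ri:.3f}, ARI = {ari:.3f}")
```

With ten equally likely clusters, a pair is together in a given labeling with probability 0.1, so two independent labelings agree on a pair with probability 0.1 × 0.1 + 0.9 × 0.9 = 0.82. The raw index therefore lands near 0.82 although the labelings are unrelated, while the ARI should come out close to 0.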
```python
import numpy as np
from sklearn.metrics import adjusted_rand_score


def rand_index_score(clusters_true, clusters_pred):
    """
    Compute the Rand Index between two clusterings.

    Parameters:
    - clusters_true : array-like, true labels
    - clusters_pred : array-like, predicted labels

    Returns:
    - ri : float, Rand Index
    """
    # Convert to arrays so that `==` compares element-wise (it does not for lists)
    clusters_true = np.asarray(clusters_true)
    clusters_pred = np.asarray(clusters_pred)
    n = len(clusters_true)

    # Co-membership matrices: entry (i, j) is True when i and j share a cluster
    A = clusters_true[:, None] == clusters_true[None, :]
    B = clusters_pred[:, None] == clusters_pred[None, :]

    # Each unordered pair appears twice off the diagonal. The n diagonal
    # entries (i == j) are always True but are not pairs, so subtract them.
    tp = ((A & B).sum() - n) / 2
    fp = (B.sum() - n) / 2 - tp  # same in predicted, different in true
    fn = (A.sum() - n) / 2 - tp  # different in predicted, same in true
    tn = n * (n - 1) / 2 - tp - fp - fn

    ri = (tp + tn) / (tp + fp + fn + tn)
    return ri


# Sample data
true_clusters = [0, 0, 1, 1, 2, 2]
predicted_clusters = [0, 0, 1, 2, 2, 2]

# Calculate Rand Index
ri = rand_index_score(true_clusters, predicted_clusters)
print(f"Rand Index: {ri:.4f}")

# Calculate Adjusted Rand Index using sklearn
ari = adjusted_rand_score(true_clusters, predicted_clusters)
print(f"Adjusted Rand Index: {ari:.4f}")
```