The Rand index is a measure of the similarity between two data clusterings. It is often used to measure the performance of a clustering algorithm against a “true” or ground-truth clustering.

Given:

  • : the number of pairs of elements that are in the same set in both the predicted clustering and the true clustering
  • : the number of pairs of elements that are in different sets in both the predicted clustering and the true clustering
  • : the number of pairs of elements that are in the same set in the predicted clustering and in different sets in the true clustering
  • : the number of pairs of elements that are in different sets in the predicted clustering and in the same set in the true clustering

The Rand index is computed as:

The Rand index takes a value between 0 and 1. A value of 1 indicates that the two clusterings are identical (ignoring permutations), and a value of 0 indicates that the two clusterings are completely dissimilar.

However, the Rand index has a chance-corrected version known as the Adjusted Rand Index (ARI). The ARI corrects the Rand index for the chance grouping of elements, and its values lie between -1 and 1. An ARI of 1 indicates perfect matching, an ARI of 0 indicates randomness, and an ARI of -1 indicates a perfect mismatch.

When evaluating clustering algorithms, it’s common to use the ARI because it provides a more informative measure than the raw Rand index.

python

from sklearn.metrics import adjusted_rand_score, pairwise_distances
import numpy as np
 
def rand_index_score(clusters_true, clusters_pred):
    """
    Compute the Rand Index between two clusterings.
 
    Parameters:
    - clusters_true : array-like, true labels
    - clusters_pred : array-like, predicted labels
 
    Returns:
    - ri : float, Rand Index
    """
    tp_fp = sum([len(np.where(clusters_pred == label)[0]) ** 2 for label in set(clusters_pred)])
    tp_fn = sum([len(np.where(clusters_true == label)[0]) ** 2 for label in set(clusters_true)])
    A = np.array([clusters_true == clusters_true[i] for i in range(len(clusters_true))])
    B = np.array([clusters_pred == clusters_pred[i] for i in range(len(clusters_pred))])
    tp = sum([(A[i] & B[i]).sum() for i in range(len(A))]) / 2
    fp = tp_fp / 2 - tp
    fn = tp_fn / 2 - tp
    tn = len(clusters_pred) ** 2 / 2 - tp - fp - fn
    ri = (tp + tn) / (tp + fp + fn + tn)
    return ri
 
# Sample data
true_clusters = [0, 0, 1, 1, 2, 2]
predicted_clusters = [0, 0, 1, 2, 2, 2]
 
# Calculate Rand Index
ri = rand_index_score(true_clusters, predicted_clusters)
print(f"Rand Index: {ri:.4f}")
 
# Calculate Adjusted Rand Index using sklearn
ari = adjusted_rand_score(true_clusters, predicted_clusters)
print(f"Adjusted Rand Index: {ari:.4f}")

Reference List

  1. A Comprehensive Survey of Clustering Algorithms Xu, D. & Tian, Y. Ann. Data. Sci. (2015)