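K-means is one of the most widely used clustering algorithms. It is a centroid-based method, meaning each cluster is represented by the mean (centroid) of its points, and it is among the simplest unsupervised learning algorithms.
The algorithm tries to minimize the variance within each cluster, that is, the sum of squared distances between every data point and the centroid of the cluster it belongs to. It is also how many people are first introduced to unsupervised machine learning.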
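To make the quantity being minimized concrete, here is a minimal sketch that computes the within-cluster sum of squared distances by hand and compares it with the value scikit-learn reports in `inertia_`. The six points and the `n_init` and `random_state` settings are choices made for this illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Small illustrative data set (values chosen for this sketch)
data = np.array([[2, 3], [4, 6], [8, 8], [3, 2], [10, 11], [15, 13]], dtype=float)

# n_init and random_state are set only to make the run repeatable
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# Look up, for every point, the center of the cluster it was assigned to
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]

# Within-cluster sum of squared distances, computed by hand
wcss = np.sum((data - assigned_centers) ** 2)

print("by hand: ", wcss)
print("inertia_:", kmeans.inertia_)
```

The two printed numbers should agree up to floating-point error, since `inertia_` is this same sum of squared distances.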
K-means is best used on smaller data sets because every iteration visits all of the data points: each point is compared against each of the k centroids to find the nearest one. The more points there are, the longer each pass takes.
Because the whole data set is revisited on every iteration until the assignments stop changing, standard k-means does not scale well to very large data sets.
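The loop described above can be sketched in plain NumPy. This is a teaching sketch of the classic assign-then-update iteration, not scikit-learn's implementation; the function name `kmeans_sketch` and its parameters are hypothetical, and the random initialization is the simplest possible choice.

```python
import numpy as np

def kmeans_sketch(points, k, n_iters=100, seed=0):
    """Basic assign/update k-means loop (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: distance from every point to every center, shape (n, k).
        # This is the part that touches the whole data set on each iteration.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (keep the old center if a cluster ends up empty)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

points = np.array([[2, 3], [4, 6], [8, 8], [3, 2], [10, 11], [15, 13]], dtype=float)
labels, centers = kmeans_sketch(points, k=2)
print(labels)
print(centers)
```

The `distances` array has one entry per point per center, so the work per iteration grows with both the number of points and the number of clusters.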
Implementation
Using a NumPy array:
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Example data
data = np.array([[2, 3], [4, 6], [8, 8], [3, 2], [10, 11], [15, 13]])
# Number of clusters
k = 2
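# Create and fit KMeans model
# (n_init and random_state are set explicitly so results are repeatable)
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)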
kmeans.fit(data)
# Get cluster labels and centers
labels = kmeans.labels_
cluster_centers = kmeans.cluster_centers_
# Visualize the clusters
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='viridis')
plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1], c='red', marker='x', s=200)
plt.show()
Using a Pandas DataFrame:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Example data as a NumPy array
data = np.array([[2, 3], [4, 6], [8, 8], [3, 2], [10, 11], [15, 13]])
# Convert the data to a Pandas DataFrame
df = pd.DataFrame(data, columns=['X', 'Y'])
# Number of clusters
k = 2
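# Create and fit KMeans model
# (n_init and random_state are set explicitly so results are repeatable)
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)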
kmeans.fit(df)
# Get cluster labels and centers
labels = kmeans.labels_
cluster_centers = kmeans.cluster_centers_
# Visualize the clusters
plt.scatter(df['X'], df['Y'], c=labels, cmap='viridis')
plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1], c='red', marker='x', s=200)
plt.show()
Optimal Number of Clusters
Elbow Method
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
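distortions = []
# Try up to 9 clusters, but never more clusters than there are samples
# (the example data above has only 6 points)
K = range(1, min(10, len(data) + 1))
for k in K:
    kmeanModel = KMeans(n_clusters=k, n_init=10, random_state=0)
    # If using a Pandas DataFrame, change data to df
    kmeanModel.fit(data)
    # inertia_ is the within-cluster sum of squared distances
    distortions.append(kmeanModel.inertia_)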
plt.plot(K, distortions, 'bx-')
plt.xlabel('Number of Clusters')
plt.ylabel('Distortion')
plt.title('Elbow Method')
plt.show()