The K-means clustering algorithm is a popular unsupervised learning technique used to partition a dataset into K distinct clusters. This guide will provide a step-by-step walkthrough of the K-means algorithm, focusing on its traditional implementation.
Initialization
- Choose the value of K: Determine the number of clusters you want to create. This is a crucial step as the choice of K can significantly impact the clustering results.
- Randomly initialize centroids: Select K data points from the dataset at random to serve as the initial centroids for each cluster.
Assignment
- Calculate distances: For each data point, compute its Euclidean distance to each centroid.
- Assign to nearest centroid: Assign each data point to the cluster whose centroid is closest to it.
Update
- Recalculate centroids: For each cluster, calculate the new centroid as the mean of all data points assigned to that cluster.
Repeat
- Iterate: Repeat the Assignment and Update steps until the centroids converge or a maximum number of iterations is reached. Convergence occurs when the centroids no longer change significantly between iterations.
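The steps above can be sketched in Python with NumPy. This is a minimal illustration, not a production implementation; the function name `kmeans` and the parameters `max_iters`, `tol`, and `seed` are illustrative choices, not part of any standard library.

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-6, seed=0):
    """Basic K-means: random initialization, then alternate assignment and update."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment: Euclidean distance from every point to every centroid,
        # then label each point with the index of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: each new centroid is the mean of the points assigned to it
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer move significantly.
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels
```

For example, `kmeans(np.array([[2, 3], [4, 5], [6, 7], [8, 9], [1, 2]], dtype=float), 2)` returns two centroids and a cluster label for each of the five points.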
Example
Consider a dataset of two-dimensional points:
Data points: (2, 3), (4, 5), (6, 7), (8, 9), (1, 2)
K = 2
- Initialization: Randomly select two data points as initial centroids: (2, 3) and (8, 9).
- Assignment:
- (2, 3) is closer to (2, 3) than (8, 9), so it is assigned to cluster 1.
- (4, 5) is closer to (2, 3) than (8, 9), so it is assigned to cluster 1.
- (6, 7) is closer to (8, 9) than (2, 3), so it is assigned to cluster 2.
- (8, 9) is closer to (8, 9) than (2, 3), so it is assigned to cluster 2.
- (1, 2) is closer to (2, 3) than (8, 9), so it is assigned to cluster 1.
- Update:
- Centroid of cluster 1: ((2 + 4 + 1) / 3, (3 + 5 + 2) / 3) ≈ (2.33, 3.33)
- Centroid of cluster 2: ((6 + 8) / 2, (7 + 9) / 2) = (7, 8)
- Repeat: Continue alternating the Assignment and Update steps until convergence or a maximum number of iterations is reached.
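The first iteration of the example can be checked directly with NumPy (the variable names below are illustrative):

```python
import numpy as np

points = np.array([(2, 3), (4, 5), (6, 7), (8, 9), (1, 2)], dtype=float)
centroids = np.array([(2, 3), (8, 9)], dtype=float)

# Assignment: index of the nearest centroid for each point.
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)
# labels is [0, 0, 1, 1, 0]: clusters {(2,3), (4,5), (1,2)} and {(6,7), (8,9)}

# Update: mean of the points in each cluster.
new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(2)])
# new_centroids is approximately [(2.33, 3.33), (7, 8)]
```

This reproduces the assignments and updated centroids computed by hand above.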
Limitations and Considerations
- Sensitivity to initialization: The initial centroids can significantly affect the final clustering results.
- Assumption of spherical clusters: K-means assumes that clusters are spherical and of similar size.
- Scalability: K-means can be computationally expensive for large datasets, since each iteration computes the distance from every data point to every centroid.
To address these limitations, various modifications and extensions of the K-means algorithm have been proposed, such as K-means++, hierarchical K-means, and fuzzy c-means.
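As one illustration, K-means++ addresses the initialization sensitivity by replacing the purely random seeding with a distance-weighted one: each subsequent centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far. A minimal sketch follows; the function name `kmeans_pp_init` is an illustrative choice.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-means++ seeding: the first centroid is chosen uniformly at random;
    each later centroid is sampled with probability proportional to the
    squared distance to the nearest already-chosen centroid."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # first centroid: uniform choice
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```

The resulting centroids can then be passed to a standard K-means loop in place of the random initialization; spread-out seeds tend to reduce the chance of a poor local optimum.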