K-Means is an unsupervised learning algorithm that clusters data points into k different groups, each represented by a centroid with a different arithmetic mean from the others. It is one of the most widely used clustering algorithms for unsupervised tasks. K-Means groups data points by their features based on a similarity function, usually the Euclidean distance. It is used in tasks such as anomaly detection, the study of genomics, and behavioral modeling and segmentation, among others. In this post we are going to look at how the K-Means clustering algorithm works, its strengths, weaknesses and applications.

K-Means Clustering Algorithm

The standard K-Means algorithm was proposed by Stuart Lloyd in 1957, although the term "k-means" was first used by James MacQueen in 1967. The algorithm uses an iterative refinement technique to arrive at the final clusters. Finding the optimal clustering is an NP-hard problem, since the algorithm would have to try many different combinations of data point assignments to arrive at the best possible clusters. Because of this, the algorithm uses heuristic approaches that converge quickly to a local optimum. There are several extensions of the k-means clustering algorithm, such as k-medians clustering, which uses the median instead of the mean, and Fuzzy C-Means clustering, where each data point has a fuzzy degree of membership in the clusters. K-Means has also found its use within machine learning pipelines themselves: it can serve as an initial stage that groups data into different classes, which can then be used for supervised learning.

How K-Means Clustering Algorithm Works

As the name suggests, K-Means groups the data points into k clusters, each cluster having its own defined mean. Below are the main steps of how K-Means clustering works:

  1. Randomly select and initialize a centroid value for each of the k clusters.
  2. Assign each data point to the nearest centroid, as defined by the distance function, usually the Euclidean distance.
  3. Update each centroid to the mean of the data points assigned to its cluster.

Steps 2 and 3 are repeated iteratively until convergence, which occurs when the assignments no longer change. Finding the globally optimal clustering is computationally expensive, hence the use of heuristic techniques such as Lloyd’s algorithm. Below is the K-Means clustering objective function:

$$
J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\lVert x_i^{(j)} - c_j \right\rVert^2
$$

where k is the number of clusters, n is the number of data points, x_i^{(j)} is a data point assigned to cluster j and c_j is the centroid of cluster j. Formula from http://www.saedsayad.com/clustering_kmeans.htm.
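To make the iterative refinement concrete, here is a minimal NumPy sketch of Lloyd’s algorithm; the function and variable names are our own illustration, not part of any library:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal Lloyd's algorithm: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (empty clusters are not handled in this simple sketch).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Convergence: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```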

K-Means Clustering Algorithm Example in Scikit-Learn

Scikit-learn has a sklearn.cluster.KMeans class that can be used to implement k-means clustering. The constructor, KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto'), takes in different parameters that can be adjusted to improve the performance of the model. In this post we are going to look at how the K-Means clustering algorithm is implemented on different data sets. To learn more about the sklearn KMeans class, refer to the sklearn documentation here.
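For example, a minimal instantiation with a few of these parameters set explicitly (the values here are illustrative, not tuned):

```python
from sklearn.cluster import KMeans

# n_clusters is the k we want, k-means++ seeds the initial centroids,
# and n_init re-runs the algorithm 10 times and keeps the best result.
model = KMeans(n_clusters=3, init='k-means++', n_init=10,
               max_iter=300, random_state=42)
```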

Generating Random Data Set With make_blobs
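Below is a minimal sketch of how such a data set can be generated; the sample size, number of centers, cluster spread and random_state are illustrative choices, not prescribed values:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate 300 points grouped around 4 centers (values are illustrative).
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.60, random_state=0)

# Scatter plot of the raw, unlabeled data.
plt.scatter(X[:, 0], X[:, 1], s=30)
plt.show()
```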

Output

[Figure: random data set generated with make_blobs]

Creating K-Means Clusters
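A minimal sketch of fitting the model, assuming the same make_blobs data as above and choosing k=4 to match the number of generated blobs:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Fit k-means and get a cluster label for each data point.
kmeans = KMeans(n_clusters=4, random_state=0)
labels = kmeans.fit_predict(X)

# Color each point by its assigned cluster.
plt.scatter(X[:, 0], X[:, 1], c=labels, s=30, cmap='viridis')
plt.show()
```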

Output

[Figure: the resulting k-means clusters]

Defining The Center of Clusters
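The fitted model exposes the final centroids through its cluster_centers_ attribute. A sketch that overlays them on the clusters (the plot styling is our own choice):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4, random_state=0)
labels = kmeans.fit_predict(X)

# cluster_centers_ holds the final mean (centroid) of each cluster.
centers = kmeans.cluster_centers_

plt.scatter(X[:, 0], X[:, 1], c=labels, s=30, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, marker='X')
plt.show()
```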

Output

[Figure: k-means cluster centers overlaid on the clusters]

Iris Data Set Original Classification
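A sketch of plotting the iris data colored by its true species labels; the choice of the first two features for the axes is ours for illustration:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Plot the first two features, colored by the true species labels.
plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap='viridis')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.show()
```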

Output

[Figure: original iris data set classes]

K-Means Model With Centers On Iris Data Set
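A sketch of fitting k-means on the iris features and overlaying the centers; k=3 is the natural choice since the data set has three species, and again the first two features are plotted for illustration:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# The iris data set has 3 species, so k=3 is the natural choice.
kmeans = KMeans(n_clusters=3, random_state=0)
labels = kmeans.fit_predict(X)
centers = kmeans.cluster_centers_

# First two features, colored by k-means labels, with centers overlaid.
plt.scatter(X[:, 0], X[:, 1], c=labels, s=30, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, marker='X')
plt.show()
```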

[Figure: k-means clusters with centers on the iris data set]

Pros

  • Simple to implement.
  • Easy to interpret the results.
  • Performs better than hierarchical clustering when there is a large number of variables, provided k is small.
  • Produces good results if the clusters are spherical.

Cons

  • Hard to choose a value of k that yields good results (the elbow method, sketched after this list, is a common heuristic).
  • It is sensitive to scale because it relies on the distance function (Euclidean distance).
  • Assumes that the data is spherically distributed which is not always the case in real life.
  • It is sensitive to outliers.
  • Finding the globally optimal solution to the k-means objective is NP-hard and computationally expensive, so in practice the algorithm only converges to a local optimum.
  • K-Means assumes that all variables have the same variance, which is not always true.
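As noted in the first point above, choosing k is hard. A common heuristic is the elbow method: fit the model for a range of k values, plot the inertia (the within-cluster sum of squared distances, i.e. the objective above) and pick the k where the curve flattens. A minimal sketch, reusing the make_blobs data from earlier:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# inertia_ is the k-means objective evaluated at the final clusters.
inertias = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_
            for k in range(1, 10)]

# The "elbow" where the curve stops dropping sharply suggests a good k.
plt.plot(range(1, 10), inertias, marker='o')
plt.xlabel('k')
plt.ylabel('inertia')
plt.show()
```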

Applications of K-Means Clustering Algorithm

  • Vector quantization, e.g. in image processing.
  • Behavioral modeling, such as segmentation of news articles, customers' purchasing behaviors, etc.
  • Anomaly detection.
  • Feature learning.
  • Geostatistics.
  • Healthcare Fraud Detection.
  • Robotics.
  • Study of genomics.

Conclusion

K-Means is a powerful clustering algorithm that is widely used in unsupervised learning. The algorithm partitions the data set into k clusters, each represented by a centroid holding the cluster's mean. Each data point is then assigned to the closest centroid using the Euclidean distance function, forming the clusters. K-Means has many applications, such as feature engineering in machine learning, the study of genomics, cluster analysis, and business use cases such as behavioral modeling, fraud and anomaly detection, etc. Despite being easy to implement and interpret, K-Means is sensitive to outliers. It assumes that the data is spherically distributed and that all variables have the same variance, which is not always the case in real life. Additionally, it is hard to find a value of k that immediately gives the best results. K-Means has many variations, but in this post we have just looked at the most commonly used K-Means clustering algorithm.

What’s Next

In this post we have looked at the K-Means clustering algorithm; in the next post we are going to look at the hierarchical agglomerative clustering algorithm.
