K-Means is an unsupervised learning algorithm that clusters data points into **k** groups, each represented by a centroid with a different arithmetic mean from the others. It is one of the most widely used clustering algorithms for unsupervised tasks. K-Means groups data points by their features based on a similarity function, usually the Euclidean distance. It is used in tasks such as anomaly detection, the study of genomics, and behavioral modeling and segmentation, among others. In this post we are going to look at how the K-Means clustering algorithm works, along with its strengths, weaknesses, and applications.

**K-Means Clustering Algorithm**

The concept of K-Means was introduced by Stuart **Lloyd in 1957**; the term "k-means" itself was first used by James **MacQueen in 1967**. The K-Means algorithm uses an iterative refinement technique to arrive at the final clusters. Finding the globally optimal clustering is an NP-hard problem, since it would require trying many different combinations of data point assignments to arrive at the most suitable clusters. Because of this, the algorithm uses heuristic approaches that converge quickly to a local optimum. There are several extensions of the k-means clustering algorithm, such as k-medians clustering, which uses the median instead of the mean, and Fuzzy C-Means clustering, where each data point has a fuzzy degree of membership in the clusters. K-Means has also found use within machine learning pipelines themselves: it can be used as an initial stage that groups data into different classes, which can then be used for supervised learning.

**How K-Means Clustering Algorithm Works**

As the name suggests, K-Means groups the data points into k clusters, each with its own defined mean. Below are the main steps of how K-Means clustering works:

- Randomly select and initialize a centroid for each cluster.
- Assign each data point to the nearest centroid, as defined by the distance function, usually the Euclidean distance.
- Update each centroid to the mean of the data points assigned to its cluster.

Steps 2 and 3 are repeated iteratively until convergence, which occurs when the assignments no longer change. Solving the problem exactly is computationally expensive, hence the use of heuristic techniques such as Lloyd's algorithm. The K-Means objective is to minimize the within-cluster sum of squares: J = Σ_{i=1}^{k} Σ_{x ∈ C_i} ||x − μ_i||², where μ_i is the mean of the points in cluster C_i.
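The three steps above can be sketched in plain NumPy. This is a minimal illustrative implementation of Lloyd's algorithm, not production code; the function name, defaults, and the empty-cluster guard are my own choices:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal sketch of Lloyd's algorithm for K-Means."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Step 2: assign each point to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        # Convergence: centroids (and hence assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

On well-separated data this converges in a handful of iterations; on harder data it only reaches a local optimum, which is why libraries restart from several random initializations.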

**K-Means Clustering Algorithm Example in Scikit-Learn**

Scikit-learn has a **sklearn.cluster.KMeans** class that can be used to implement k-means clustering. The class takes in different parameters that can be adjusted to improve the performance of the model. In this post we are going to look at how the K-Means clustering algorithm is implemented on different data sets. To learn more, refer to the sklearn KMeans documentation. The signature is shown below:

*KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')*

Note that this signature is from an older scikit-learn release; some parameters (such as precompute_distances and n_jobs) have since been removed.

**Generating Random Data Set With make_blobs**

```python
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (6, 5)
plt.style.use('ggplot')

# generate 600 points around 6 centers
X_gen, y_true = make_blobs(n_samples=600, centers=6, random_state=10)
x = X_gen[:, 0]
y = X_gen[:, 1]

plt.title('Random Data set')
plt.scatter(x, y, marker='x')
plt.show()
```

**Output**

**Creating K-Means Clusters**

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (8, 5)
plt.style.use('ggplot')

X_gen, y_true = make_blobs(n_samples=600, centers=6, random_state=10)
x = X_gen[:, 0]
y = X_gen[:, 1]

X = np.array(list(zip(x, y))).reshape(len(x), 2)
colors = ['red', 'green', 'blue', 'orange']

# KMeans algorithm
kmeans_model = KMeans(n_clusters=4)
kmeans_model.fit(X)

# plot each point in the color of its assigned cluster
for i, l in enumerate(kmeans_model.labels_):
    plt.plot(x[i], y[i], color=colors[l], marker='x', ls='None')

plt.title('K-Means Clusters')
plt.show()
```

**Output**

**Defining The Center of Clusters**

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (8, 5)
plt.style.use('ggplot')

X_gen, y_true = make_blobs(n_samples=600, centers=6, random_state=10)
x = X_gen[:, 0]
y = X_gen[:, 1]

X = np.array(list(zip(x, y))).reshape(len(x), 2)
colors = ['red', 'green', 'blue', 'orange', 'skyblue', 'gold']

# KMeans algorithm
kmeans_model = KMeans(n_clusters=4)
kmeans_model.fit(X)

for i, l in enumerate(kmeans_model.labels_):
    plt.plot(x[i], y[i], color=colors[l], marker='x', ls='None')

# Define the centers of the clusters
centers = kmeans_model.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)

plt.title('Cluster Centers')
plt.show()
```

**Output**

**Iris Data Set Original Classification**

```python
from sklearn.datasets import load_iris
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('ggplot')

iris = load_iris()
X = iris.data
y = iris.target

# Set the size of the plot
plt.figure(figsize=(14, 7))

# Create a colormap
colors = np.array(['red', 'green', 'blue'])

# Plot Sepal
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=colors[y], s=40)
plt.title('Sepal (Original Clusters)')
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')

# Plot Petal
plt.subplot(1, 2, 2)
plt.scatter(X[:, 2], X[:, 3], c=colors[y], s=40)
plt.title('Petal (Original Clusters)')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')

plt.show()
```

**Output**

**K-Means Model With Centers On Iris Data Set**

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('ggplot')

iris = load_iris()
X = iris.data
y = iris.target

kmeans_model = KMeans(n_clusters=3)
kmeans_model.fit(X)

# Set the size of the plot
plt.figure(figsize=(14, 7))

# Create a colormap
colors = np.array(['red', 'green', 'blue'])

# Plot Sepal
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=colors[kmeans_model.labels_], s=40)
plt.title('Sepal (K-Means Clustering)')
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')

# Define the centers of the clusters
centers = kmeans_model.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)

# Plot Petal
plt.subplot(1, 2, 2)
plt.scatter(X[:, 2], X[:, 3], c=colors[kmeans_model.labels_], s=40)
plt.title('Petal (K-Means Clustering)')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')

# Plot the petal-dimension coordinates of the cluster centers
plt.scatter(centers[:, 2], centers[:, 3], c='black', s=200, alpha=0.5)

plt.show()
```
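To quantify how closely the K-Means clusters match the original iris species, one option is the adjusted Rand index from sklearn.metrics; this goes beyond the plots above, so treat it as an optional sketch (parameter values here are my own choices):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris.data)

# 1.0 means a perfect match with the true species labels, ~0 means random
score = adjusted_rand_score(iris.target, model.labels_)
print(round(score, 2))
```

A score well above zero indicates the unsupervised clusters recover much of the species structure, even though K-Means never saw the labels.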

**Pros**

- Simple to implement.
- Easy to interpret the results.
- Performs better than hierarchical clustering when there is a large number of variables and k is small.
- Produces good results if the clusters are spherical.

**Cons**

- Hard to choose a value of the k parameter that yields good results.
- It is sensitive to scale because it relies on the distance function (Euclidean distance).
- Assumes that the data is spherically distributed which is not always the case in real life.
- It is sensitive to outliers.
- Finding the optimal solution to K-Means clustering is computationally expensive: the exact problem is NP-hard, so in practice heuristics such as Lloyd's algorithm only find a local optimum.
- K-Means assumes that all variables have the same variance, which is not always true.

**Applications of K-Means Clustering Algorithm**

- Vector quantization e.g in image processing.
- Behavioral modeling such as segmentation in news articles,customers purchasing behaviors e.t.c.
- Anomaly detection.
- Feature learning.
- Geostatistics.
- Healthcare Fraud Detection.
- Robotics.
- Study of genomics.

**Conclusion**

K-Means is a powerful clustering algorithm that is widely used in unsupervised learning. The algorithm partitions the data set into k clusters, each represented by a centroid with its own mean. The data points are then assigned to the closest centroid using the Euclidean distance function, forming the clusters. K-Means has many applications, such as feature engineering in machine learning, the study of genomics, cluster analysis, and business use cases such as behavioral modeling, fraud detection, and anomaly detection. Despite being easy to implement and interpret, K-Means is sensitive to outliers. It assumes that the data is spherically distributed and that all variables have the same variance, which is not always the case in real life. Additionally, it is hard to find a value of k that instantly gives the best results. K-Means has many variations, but in this post we have looked only at the most commonly used version of the algorithm.

**What’s Next**

In this post we have looked at the K-Means clustering algorithm; in the next post we are going to look at the hierarchical agglomerative clustering algorithm.