Elbow Method – Metric Which Helps in Deciding the Value of k in the K-Means Clustering Algorithm


In this article, I explain the Elbow Method, which helps you decide the value of k to use when clustering data with the K-Means clustering algorithm.

In the past, we did not have techniques for clustering data automatically. We had to label it by hand, which required good knowledge of the field. If you do not know the domain you are dealing with, you cannot distinguish between the items. Suppose you have a lot of content about companies, but you do not know which company belongs to which domain. Then grouping them into clusters is almost impossible, because there are too many companies and clustering them by hand would take a lot of time.

But nowadays, thanks to the evolution of machine learning techniques, it is much easier to distinguish between companies and to make clusters of similar companies by analyzing a short description of each one.

Unsupervised Learning

Clustering is an unsupervised machine learning technique that is used to group similar data points together. Unlike classification, it does not need labeled examples.

Unsupervised learning means there is no labeled output available to guide the learning process. The algorithm explores the data on its own to find patterns and produces its output accordingly.

K-Means Clustering Algorithm

The K-Means clustering algorithm is a popular method for cluster analysis in data mining and data analysis. K-Means partitions n observations into k clusters, in which each observation belongs to the cluster with the nearest mean.

Using the K-Means clustering algorithm, we can group the data into clusters. We do not have to tell the algorithm how to make the clusters, as it works that out on its own. The result is that observations belonging to the same cluster are more similar to each other than to observations in other clusters.

K-Means uses iterative refinement to produce the final clusters, based on the number of clusters requested by the user. Whatever value of k the user defines, the algorithm will distribute the data into k clusters.
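To make this concrete, here is a minimal sketch using scikit-learn's KMeans on a handful of made-up 2-D points. The data and the variable names (points, model, labels) are my own assumptions for illustration; the only thing we pass to the algorithm is the number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (made up for illustration): two visually obvious groups of 2-D points.
points = np.array([
    [1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # group near (1, 1)
    [8.0, 8.2], [8.1, 7.9], [7.9, 8.0],   # group near (8, 8)
])

# No labels are passed in; only the number of clusters k is given.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(points)

print(labels)                  # cluster index per point (the 0/1 numbering is arbitrary)
print(model.cluster_centers_)  # the mean of each cluster
```

The first three points should end up in one cluster and the last three in the other, even though we never told the algorithm which point belongs where.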
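The iterative refinement can be sketched in a few lines of NumPy. This is a simplified illustration under my own assumptions (random initial centroids picked from the data, a fixed iteration cap, a hypothetical function name kmeans_sketch), not the implementation that scikit-learn uses.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Simplified K-Means: returns (centroids, labels) for data X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        # (a centroid with no points keeps its old position).
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Made-up demo data: two small, well-separated groups.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans_sketch(X_demo, k=2)
print(labels)
print(centroids)
```

The two steps inside the loop (assign points, then move centroids) are the refinement; the loop repeats them until nothing changes.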

Elbow Method for Evaluation of K-Means Clustering

As we know, we have to decide the value of k ourselves. The Elbow Method can help us find a good value of k.

It uses the sum of squared distances (SSE, also called the sum of squared errors) between the data points and the centroid, that is, the mean, of their assigned cluster. The SSE generally keeps shrinking as k grows, so we do not look for its minimum. Instead, we pick the value of k at the point where the SSE starts to flatten out and the curve forms an elbow.

This is how the method helps to find a good value of k (the number of clusters for the dataset) and so helps in making good clusters for the given dataset.

Below I am including the code for running the Elbow Method to find a good value of k for the K-Means clustering algorithm.
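Written as a formula, the quantity looks like this. The symbols are my own notation and not from any particular library: C_j is the set of points assigned to cluster j, \mu_j is the centroid (mean) of that cluster, and x_i is a single data point.

```latex
\mathrm{SSE}(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```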
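As a quick check of what the SSE measures, the sketch below computes it by hand and compares it with the inertia_ attribute of scikit-learn's KMeans. The synthetic data from make_blobs and the names used here (results, manual_sse) are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (an assumption for illustration): 300 points around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

results = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # SSE by hand: squared distance from each point to its own cluster centroid.
    own_centroids = model.cluster_centers_[model.labels_]
    manual_sse = float(((X - own_centroids) ** 2).sum())
    # scikit-learn reports the same quantity as the inertia_ attribute.
    results[k] = (manual_sse, float(model.inertia_))

for k, (manual_sse, inertia) in results.items():
    print(f"k={k}: manual SSE={manual_sse:.1f}, inertia_={inertia:.1f}")
```

The two columns should agree up to floating-point rounding, and on data like this the SSE should drop sharply up to k = 3 and only slowly after that, which is the elbow.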

Link to the Gist file (code for Elbow Method) –>  https://gist.github.com/0d4f6e1110a1af34da888eb196e83508.git

# Code for the Elbow Method to find a good value of k for the K-Means clustering algorithm.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder: replace with your own corpus of documents as a list of strings.
# It needs at least as many documents as the largest k tried below.
corpus = ["corpus of documents as list"]

# KMeans needs numeric features, so the text is converted to a TF-IDF matrix first
# (an assumption: the original snippet does not show how the corpus was vectorized).
X = TfidfVectorizer().fit_transform(corpus)

# sse collects the sum of squared distances for each value of k
sse = []

# k_list is the range of k values we want to try
k_list = list(range(1, 10))

for k in k_list:
    # km_model is the KMeans model with k clusters
    km_model = KMeans(n_clusters=k)
    # fit the model to the data set X
    km_model.fit(X)
    # inertia_ is the sum of squared distances of the points to their closest centroid
    sse.append(km_model.inertia_)

# Plot sse against k and look for the value of k where the curve starts to flatten and bends like an elbow.
plt.figure(figsize=(6, 6))
plt.plot(k_list, sse, '-o')
plt.xlabel(r'Number of clusters *k*')
plt.ylabel('Sum of squared distance')
plt.show()

 

This gives you a good candidate value for k, with which K-Means can produce good clusters where each observation, or data point, is assigned to the cluster that suits it best. And we do not have to worry about picking the value of k at random.

Try finding the value of k with this method and check how it improves the quality of your clusters.

Jayesh Manani is a Computer Engineering graduate. He has worked with companies such as Allevents.in and Torrent Pharmaceuticals Ltd. as a Data Science Engineer. He is currently working at IIM Ahmedabad as a Research Associate, on analytical projects such as text analysis and social media analysis, with a professor at the Indian Institute of Management, Ahmedabad. He is also good at web scraping and machine learning. He has built scrapers that can be used to analyze social media data, and companies can make many decisions based on the results of that analysis. He also writes about technology on Medium.com, covering how to use technology and algorithms to deal with problems that common people face in daily life. Jayesh is passionate about technology and life. He is interested in automation, artificial intelligence, blockchain, IoT, data science, and the impact that technology can have on our society and life.

2 Replies to “Elbow Method – Metric Which helps in deciding the…”

  1. this blog was really great, never seen a great blog like this before. i think im gonna share this to my friends..

  2. Hi! Thanks for the nice article!

    I just wanted to comment on the elbow method: it can also be made quantitative (as opposed to empirical) if one introduces a quantity called “elbow strength”. With this quantity it is possible to determine the number of clusters automatically using the elbow method, and not by eye. The elbow strength was introduced in this publication: https://iopscience.iop.org/article/10.1088/2632-2153/abd87c

    and is described in detail in the supplementary material.
