In this article, I explain a method that helps you decide the value of k to use when clustering data with the K-Means clustering algorithm.
In the past, without clustering techniques, we had to label data by hand, which required good knowledge of the field. If you do not know the domain you are dealing with, you cannot distinguish between the items. Suppose you have a lot of content about companies but do not know which company belongs to which domain. Clustering them by hand is then an almost impossible task, because there are too many companies and it would take a great deal of time.
Nowadays, thanks to advances in machine learning, it is much easier to distinguish between companies and group similar ones into clusters by analyzing a short description of each.
Unsupervised Learning
Clustering is an unsupervised machine learning technique used to group similar data points together.
Unsupervised learning means there are no labeled outputs to guide the learning process. The algorithm explores the data on its own to find patterns and produces its output accordingly.
K-Means Clustering Algorithm
The K-Means clustering algorithm is a popular method for cluster analysis in data mining and data analysis. K-Means partitions n observations into k clusters, where each observation belongs to the cluster with the nearest mean.
Using the K-Means clustering algorithm, we can form clusters without telling the algorithm how to build them; it does that on its own. The result is that observations in the same cluster are more similar to each other than to observations in other clusters.
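As a minimal sketch of this behaviour, assuming scikit-learn's KMeans and a small made-up two-dimensional dataset (the points, k=2, and the n_init and random_state settings are illustrative assumptions, not the article's data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (an assumption for illustration): two visibly separate groups.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# n_init and random_state are set explicitly so the run is repeatable.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.labels_)           # cluster index assigned to each point
print(km.cluster_centers_)  # mean (centroid) of each cluster
```

We only pass the data and the number of clusters; the grouping itself is found by the algorithm.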
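K-Means uses iterative refinement to produce the final clusters, based on the number of clusters requested by the user: it repeatedly assigns each data point to its nearest centroid and then recomputes each centroid as the mean of its assigned points, until the assignments stop changing. Whatever value of k the user defines, the algorithm distributes the data into k clusters.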
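The iterative refinement can be sketched in a few lines of NumPy. This is a simplified illustration, not how scikit-learn implements it; the function name kmeans_sketch, the random initialization, and the toy data are all assumptions made for the example:

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Plain K-Means iteration; kmeans_sketch is a hypothetical helper."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Toy data (an assumption for illustration).
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

Because the starting centroids are random, a plain implementation like this can settle on different clusterings for different seeds, which is one reason library implementations usually run several initializations.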
Elbow Method for Evaluation of K-Means Clustering
As we have seen, we have to decide the value of k ourselves. The Elbow Method can help us find a good value of k.
It uses the sum of squared distances (SSE) between the data points and the centroid (mean) of their assigned cluster. The SSE generally keeps decreasing as k grows, so instead of looking for its minimum we compute it for a range of k values and pick the k at which the SSE starts to flatten out, forming an elbow in the curve.
This is how the method helps find a good value of k (the number of clusters for the dataset) and, in turn, good clusters for the given dataset.
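To make the SSE concrete, here is a small hand-computed sketch. The function name sum_squared_distances and the four toy points are assumptions for illustration; scikit-learn exposes the same quantity on a fitted model as inertia_:

```python
import numpy as np

def sum_squared_distances(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))

# Toy data (an assumption for illustration): two pairs of points.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# k = 1: every point is assigned to the overall mean (5, 1).
sse_one = sum_squared_distances(X, np.array([0, 0, 0, 0]),
                                np.array([[5.0, 1.0]]))

# k = 2: each pair gets its own centroid.
sse_two = sum_squared_distances(X, np.array([0, 0, 1, 1]),
                                np.array([[0.0, 1.0], [10.0, 1.0]]))

print(sse_one)  # 104.0
print(sse_two)  # 4.0
```

The sharp drop from k = 1 to k = 2 is the kind of change the elbow plot makes visible; adding further clusters to this toy data could only lower the SSE by the remaining 4.0.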
Below is the code for running the Elbow Method to find a good value of k for the K-Means clustering algorithm.
Link to the Gist file (code for Elbow Method) –> https://gist.github.com/0d4f6e1110a1af34da888eb196e83508.git
# Code for the Elbow Method to find a good value of k for the K-Means clustering algorithm.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# X is the data set. KMeans needs a numeric feature matrix, so a corpus of
# documents (a list of strings) must be vectorized first, e.g. with TF-IDF.
X = ...  # placeholder: replace with your vectorized corpus

# sse is the list of sums of squared distances, one entry per k
sse = []
# k_list is the range of k values we want to try
k_list = list(range(1, 10))
for k in k_list:
    # km_model is the KMeans model that we fit to the data
    km_model = KMeans(n_clusters=k)
    # fit the data (X is the data set) to km_model
    km_model.fit(X)
    # inertia_ is the sum of squared distances of points to their closest centroid
    sse.append(km_model.inertia_)

# Plot sse against k and look for the value of k where the curve starts to flatten and bends like an elbow.
plt.figure(figsize=(6, 6))
plt.plot(k_list, sse, '-o')
plt.xlabel(r'Number of clusters $k$')
plt.ylabel('Sum of squared distances')
plt.show()
This gives a good value for k, so that each observation (data point) ends up in a cluster that fits it well, and we do not have to pick the value of k arbitrarily.
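If you would rather not read the elbow off the plot by eye, one simple heuristic is to pick the k where the SSE curve bends most sharply, measured by the second difference. This is only a sketch of one possible rule, not part of the Elbow Method as described above; the function name and the SSE values are made up for illustration:

```python
def elbow_by_second_difference(k_values, sse_values):
    """Return the k where the SSE curve bends most sharply (a simple heuristic)."""
    best_k, best_bend = None, float("-inf")
    # Interior points only: the bend needs a neighbour on each side.
    for i in range(1, len(sse_values) - 1):
        drop_before = sse_values[i - 1] - sse_values[i]
        drop_after = sse_values[i] - sse_values[i + 1]
        # Large when the curve falls steeply up to k and flattens after it.
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = k_values[i], bend
    return best_k

# Made-up SSE values for illustration only (not results from real data).
k_values = [1, 2, 3, 4, 5, 6]
sse_values = [1000.0, 600.0, 200.0, 170.0, 150.0, 140.0]
elbow_k = elbow_by_second_difference(k_values, sse_values)
print(elbow_k)  # 3
```

On noisy or smoothly decaying curves this rule can be unstable, so it is worth checking its answer against the plot.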
Try finding the value of k with this method and check how it improves the quality of your clusters.
Hi! Thanks for the nice article!
I just wanted to comment that the elbow method can also be made quantitative (as opposed to empirical) if one introduces a quantity called “elbow strength”. With this quantity, it is possible to determine the number of clusters automatically using the elbow method rather than by eye. The elbow strength was introduced in this publication: https://iopscience.iop.org/article/10.1088/2632-2153/abd87c and is described in detail in the supplementary material.