Description:
- Implement the k-Means and k-Medoids clustering algorithms.
- Compare the performances of the above two clustering algorithms using standard measures such as variance, silhouette coefficient etc. If ground truth is available, you may use extrinsic measures such as BCubed precision and BCubed recall. Use at least 7 real-life clustering datasets.
Data Sets:
What is clustering:
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group or cluster are more similar to each other than to those in other groups or clusters. It is a main task of exploratory data analysis, and a common technique for statistical data analysis that is used in many fields. The fields are pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning.
Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms (k-means algorithm, k-medoids algorithm) that differ significantly in their understanding of what constitutes a cluster and how to efficiently find them. Popular notions of clusters include groups with small distances between cluster members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings depend on the individual data set and intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties.
In this article we will discuss about k-means and k-medoids algorithm.
K-Means Algorithm:
There are mainly two types of learning algorithms in machine learning. They are supervised learning and unsupervised learning algorithm. K-means clustering algorithm is an unsupervised learning algorithm which is used to clustering data
in machine learning or data science.
K-means clustering is an unsupervised learning algorithm which groups the unlabeled dataset into different clusters or groups. In “k-means” k defines the number of pre defined clusters of group that needed to be create in the given data set. For example if k=3, there will ne three clusters, for k=5 there will be five clusters and so on.
K-means helps to cluster data into different clusters in a conveninet way to discover the categories of groups in the unlabeled dataset on its own without training. K-means clusering algorithm is a centroid based algorithm, where each cluster is associated with a centroid. The main goal of this algorithm is to minimize the sum of the intra distances of data point in their corresponding clusters. The algorithm takes unlabled dataset as input and it devides the dataset into k number of groups and the process is being repeated untill it gets the best clusters. Always the value of k is predetermined in this algorithm.
The main task of k-means algorithm is like this:
- Select the number of clusters k.
- Select random k points from data sets as centroid.
- Assaign each data point to their closest centroid, which will from the k number of clusters.
- Calculate the variance and place a new centroid of each cluster.
- Repeat the third steps, which means reassign each datapoint to the new closest centroid of each cluster.
Here is a example of k-means algorithm:

Solution: Implementation of K-means algorithm using python. Data preprocessing is needed to perform k-means algorithm. Here in my solution I have performed data preprocessing to implement k-means algorithm.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
import random
import math
import copy
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import silhouette_samples, silhouette_score
class TrainModel:
def __init__(self, data, k_value,max_iteration):
self.data = data
scaler = MinMaxScaler()
self.data = scaler.fit_transform(self.data)
self.k_value = k_value
self.max_iteration=max_iteration
self.centroids = []
self.data_cluster()
def generate_column(self,col,data):
values = []
for i in range(len(data)):
values.append(data[i][col])
return values
def calculateDistance(self,x,y):
return sum((x - y)**2)**0.5
def get_closest_centroid(self,points,centroids):
closest_centroids=[]
for i in points:
distance = []
for c in centroids:
dis = self.calculateDistance(i,c)
distance.append(dis)
closest_centroids.append(np.argmin(distance))
return closest_centroids
def calculate_new_centroids(self,clusters, X):
new_centroids = []
new_df = pd.concat([pd.DataFrame(X), pd.DataFrame(clusters, columns=['cluster'])],axis=1)
for c in set(new_df['cluster']):
current_cluster = new_df[new_df['cluster'] == c][new_df.columns[:-1]]
cluster_mean = current_cluster.mean(axis=0)
new_centroids.append(cluster_mean)
return new_centroids
def get_clustered_data(self,points,centroids):
closest_centroids = self.get_closest_centroid(points,centroids)
clustered_data = {}
for i in range(self.k_value):
clustered_data[i]=[]
for i in range(len(points)):
clustered_data[closest_centroids[i]].append(points[i])
return clustered_data
def get_clusters_label(self,data_points,clusters):
labels=[]
for i in range(len(data_points)):
labels.append(0)
for i in clusters.keys():
cluster = clusters[i]
for j in range(len(cluster)):
for k in range(len(data_points)):
if (cluster[j]==data_points[k]).all():
labels[k]=i
break
return labels
def data_cluster(self):
centroid_points=random.sample(range(0, len(self.data)), self.k_value)
for i in centroid_points:
self.centroids.append(self.data[i])
for i in range(self.max_iteration):
closest_centroids = self.get_closest_centroid(self.data,self.centroids)
self.centroids=self.calculate_new_centroids(closest_centroids,np.array(self.data))
final_clusters = self.get_clustered_data(self.data,self.centroids)
cluster_labels = self.get_clusters_label(self.data,final_clusters)
silhouette_avg = silhouette_score(self.data, cluster_labels)
print("Silhouette coefficient for Kmeans is for k=",self.k_value,silhouette_avg)
dataset = pd.read_csv('iris.data',header=None).values
model = TrainModel(dataset,3,15)
K-medoids Algorithm: K-medoids algorithms is almost same as k-means algorithm. But for k-medoid algorithm centroid is chosen from dataset by choosing medoids.
Here is the algorithm:
- Initialize: select k random points out of the n data points as the medoids.
- Associate each data point to the closest medoid by using any common distance metric methods.
- While the cost decreases:
For each medoid m, for each data o point which is not a medoid:
i. Swap m and o, associate each data point to the closest medoid, recompute the cost.
ii. If the total cost is more than that in the previous step, undo the swap.
Solution: Implementation of k-medoids in python. Data preprocessing is needed to perform k-medoids algorithm. Here in my solution I have performed data preprocessing to implement k-medoids algorithm:
# -*- coding: utf-8 -*-
"""kmediods.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1RUMBXVKjZY8Q-0FvHIsBiS9WoDMyV9dw
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
import random
import math
import copy
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import silhouette_samples, silhouette_score
class TrainModel:
def __init__(self, data, k_value):
self.data = data
scaler = MinMaxScaler()
self.data = scaler.fit_transform(self.data)
self.k_value = k_value
self.kmedoids(self.data)
def get_random_medoids(self,data):
points=random.sample(range(0, len(data)), self.k_value)
medoids=[]
for i in range(self.k_value):
medoids.append(data[i])
return medoids
def get_closest_medoids(self,sample_point,medoids):
min_distance = float('inf')
closest_medoid= None
for i in range(len(medoids)):
distance = self.calculateDistance(sample_point,medoids[i])
if distance<min_distance:
min_distance = distance
closest_medoid = i
return closest_medoid
def get_clusters(self,data_points,medoids):
clusters = [[] for _ in range(self.k_value)]
for i in range(len(data_points)):
x=self.get_closest_medoids(data_points[i],medoids)
clusters[x].append(data_points[i])
return clusters
def calculate_cost(self,data_points,clusters,medoids):
cost = 0
for i in range(len(clusters)):
for j in range(len(clusters[i])):
cost+=self.calculateDistance(medoids[i],clusters[i][j])
return cost
def get_non_medoids(self,data_points,medoids):
non_medoids = []
for sample in data_points:
flag = False
for m in medoids:
if (sample==m).all():
flag = True
if flag==False:
non_medoids.append(sample)
return non_medoids
def get_clusters_label(self,data_points,clusters):
labels=[]
for i in range(len(data_points)):
labels.append(0)
for i in range(len(clusters)):
cluster = clusters[i]
for j in range(len(cluster)):
for k in range(len(data_points)):
if (cluster[j]==data_points[k]).all():
labels[k]=i
break
return labels
def kmedoids(self,data):
medoids = self.get_random_medoids(data)
clusters = self.get_clusters(data,medoids)
initial_cost = self.calculate_cost(data,clusters,medoids)
#print(len(clusters[0]),len(clusters[1]),len(clusters[2]))
while True:
best_medoids=medoids
lowest_cost=initial_cost
for i in range(len(medoids)):
non_medoids = self.get_non_medoids(data,medoids)
for j in range(len(non_medoids)):
new_medoids = medoids.copy()
for k in range(len(new_medoids)):
if (new_medoids[k] == medoids[i]).all():
new_medoids[k] = non_medoids[j]
new_clusters = self.get_clusters(data,new_medoids)
new_cost = self.calculate_cost(data,new_clusters,new_medoids)
if new_cost < lowest_cost:
lowest_cost = new_cost
best_medoids = new_medoids
if lowest_cost < initial_cost:
initial_cost = lowest_cost
medoids = best_medoids
else:
break
final_clusters = self.get_clusters(data,medoids)
cluster_labels = self.get_clusters_label(data,final_clusters)
silhouette_avg = silhouette_score(data, cluster_labels)
print("Silhouette coefficient for Kmedoidis for k=",self.k_value,silhouette_avg)
def calculateDistance(self,x,y):
return sum((x - y)**2)**0.5
dataset = pd.read_csv('iris.data',header=None).values
model = TrainModel(dataset,2)
Report:
Leave a Reply