Find Your Solution Here

Data Mining

Implementation of the k-Means and k-Medoids clustering algorithms using python

Description:

  1. Implement the k-Means and k-Medoids clustering algorithms.
  2. Compare the performances of the above two clustering algorithms using standard measures such as variance, silhouette coefficient etc. If ground truth is available, you may use extrinsic measures such as BCubed precision and BCubed recall. Use at least 7 real-life clustering datasets.

Data Sets:

What is clustering:

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group or cluster are more similar to each other than to those in other groups or clusters. It is a main task of exploratory data analysis, and a common technique for statistical data analysis that is used in many fields. The fields are pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning.

Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms (k-means algorithm, k-medoids algorithm) that differ significantly in their understanding of what constitutes a cluster and how to efficiently find them. Popular notions of clusters include groups with small distances between cluster members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings depend on the individual data set and intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties.

In this article we will discuss about k-means and k-medoids algorithm.

K-Means Algorithm:

There are mainly two types of learning algorithms in machine learning. They are supervised learning and unsupervised learning algorithm. K-means clustering algorithm is an unsupervised learning algorithm which is used to clustering data
in machine learning or data science.
K-means clustering is an unsupervised learning algorithm which groups the unlabeled dataset into different clusters or groups. In “k-means” k defines the number of pre defined clusters of group that needed to be create in the given data set. For example if k=3, there will ne three clusters, for k=5 there will be five clusters and so on.

K-means helps to cluster data into different clusters in a conveninet way to discover the categories of groups in the unlabeled dataset on its own without training. K-means clusering algorithm is a centroid based algorithm, where each cluster is associated with a centroid. The main goal of this algorithm is to minimize the sum of the intra distances of data point in their corresponding clusters. The algorithm takes unlabled dataset as input and it devides the dataset into k number of groups and the process is being repeated untill it gets the best clusters. Always the value of k is predetermined in this algorithm.

The main task of k-means algorithm is like this:

  1. Select the number of clusters k.
  2. Select random k points from data sets as centroid.
  3. Assaign each data point to their closest centroid, which will from the k number of clusters.
  4. Calculate the variance and place a new centroid of each cluster.
  5. Repeat the third steps, which means reassign each datapoint to the new closest centroid of each cluster.
The main task of k-means algorithm is like this:1. Select the number of clusters k. 2. Select random k points from data sets as centroid. 3. Assaign each data point to their closest centroid, which will from the k number of clusters. 4. Calculate the variance and place a new centroid of each cluster. 5. Repeat the third steps, which means reassign each datapoint to the new closest centroid of each cluster.

Here is a example of k-means algorithm:

Solution: Implementation of K-means algorithm using python. Data preprocessing is needed to perform k-means algorithm. Here in my solution I have performed data preprocessing to implement k-means algorithm.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
import random
import math
import copy
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import silhouette_samples, silhouette_score
 
class TrainModel:
  def __init__(self, data, k_value,max_iteration):
    self.data = data
    scaler = MinMaxScaler()
    self.data = scaler.fit_transform(self.data)
    self.k_value = k_value
    self.max_iteration=max_iteration
    self.centroids = []
    self.data_cluster()
  def generate_column(self,col,data):
    values = []
    for i in range(len(data)):
      values.append(data[i][col])
    return values
  def calculateDistance(self,x,y):
    return sum((x - y)**2)**0.5
  def get_closest_centroid(self,points,centroids):
    closest_centroids=[]
    for i in points:
      distance = []
      for c in centroids:
        dis = self.calculateDistance(i,c)
        distance.append(dis)
      closest_centroids.append(np.argmin(distance))
    return closest_centroids
  
  def calculate_new_centroids(self,clusters, X):
    new_centroids = []
    new_df = pd.concat([pd.DataFrame(X), pd.DataFrame(clusters, columns=['cluster'])],axis=1)
    for c in set(new_df['cluster']):
        current_cluster = new_df[new_df['cluster'] == c][new_df.columns[:-1]]
        cluster_mean = current_cluster.mean(axis=0)
        new_centroids.append(cluster_mean)
    return new_centroids
  def get_clustered_data(self,points,centroids):
    closest_centroids = self.get_closest_centroid(points,centroids)
    clustered_data = {}
    for i in range(self.k_value):
      clustered_data[i]=[]
    for i in range(len(points)):
      clustered_data[closest_centroids[i]].append(points[i])
    return clustered_data
  def get_clusters_label(self,data_points,clusters):
    labels=[]
    for i in range(len(data_points)):
      labels.append(0)
    for i in clusters.keys():
      cluster = clusters[i]
      for j in range(len(cluster)):
        for k in range(len(data_points)):
          if (cluster[j]==data_points[k]).all():
            labels[k]=i
            break
    return labels
 
 
  def data_cluster(self):
    centroid_points=random.sample(range(0, len(self.data)), self.k_value)
    for i in centroid_points:
      self.centroids.append(self.data[i])
    for i in range(self.max_iteration): 
      closest_centroids = self.get_closest_centroid(self.data,self.centroids)
      self.centroids=self.calculate_new_centroids(closest_centroids,np.array(self.data))
    final_clusters = self.get_clustered_data(self.data,self.centroids)
    cluster_labels = self.get_clusters_label(self.data,final_clusters)
    silhouette_avg = silhouette_score(self.data, cluster_labels)
    print("Silhouette coefficient for Kmeans is for k=",self.k_value,silhouette_avg)
 
 
dataset = pd.read_csv('iris.data',header=None).values
model = TrainModel(dataset,3,15)

K-medoids Algorithm: K-medoids algorithms is almost same as k-means algorithm. But for k-medoid algorithm centroid is chosen from dataset by choosing medoids.

Here is the algorithm:

  1. Initialize: select k random points out of the n data points as the medoids.
  2. Associate each data point to the closest medoid by using any common distance metric methods.
  3. While the cost decreases:
    For each medoid m, for each data o point which is not a medoid:
    i. Swap m and o, associate each data point to the closest medoid, recompute the cost.
    ii. If the total cost is more than that in the previous step, undo the swap.

Solution: Implementation of k-medoids in python. Data preprocessing is needed to perform k-medoids algorithm. Here in my solution I have performed data preprocessing to implement k-medoids algorithm:

# -*- coding: utf-8 -*-
"""kmediods.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1RUMBXVKjZY8Q-0FvHIsBiS9WoDMyV9dw
"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
import random
import math
import copy
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import silhouette_samples, silhouette_score


class TrainModel:
  def __init__(self, data, k_value):
    self.data = data
    scaler = MinMaxScaler()
    self.data = scaler.fit_transform(self.data)
    self.k_value = k_value
    self.kmedoids(self.data)

  def get_random_medoids(self,data):
    points=random.sample(range(0, len(data)), self.k_value)
    medoids=[]
    for i in range(self.k_value):
      medoids.append(data[i])
    return medoids
  
  def get_closest_medoids(self,sample_point,medoids):
    min_distance = float('inf')
    closest_medoid= None
    for i in range(len(medoids)):
      distance = self.calculateDistance(sample_point,medoids[i])
      if distance<min_distance:
        min_distance = distance
        closest_medoid = i
    return closest_medoid
  def get_clusters(self,data_points,medoids):
    clusters = [[] for _ in range(self.k_value)]
    for i in range(len(data_points)):
      x=self.get_closest_medoids(data_points[i],medoids)
      clusters[x].append(data_points[i])
    return clusters
  def calculate_cost(self,data_points,clusters,medoids):
    cost = 0
    for i in range(len(clusters)):
      for j in range(len(clusters[i])):
        cost+=self.calculateDistance(medoids[i],clusters[i][j])
    return cost
  def get_non_medoids(self,data_points,medoids):
    non_medoids = []
    for sample in data_points:
      flag = False
      for m in medoids:
        if (sample==m).all():
          flag = True
      if flag==False:
        non_medoids.append(sample)
    return non_medoids
  def get_clusters_label(self,data_points,clusters):
    labels=[]
    for i in range(len(data_points)):
      labels.append(0)
    for i in range(len(clusters)):
      cluster = clusters[i]
      for j in range(len(cluster)):
        for k in range(len(data_points)):
          if (cluster[j]==data_points[k]).all():
            labels[k]=i
            break
    return labels
 
  def kmedoids(self,data):
    medoids = self.get_random_medoids(data)
    clusters = self.get_clusters(data,medoids)
    initial_cost = self.calculate_cost(data,clusters,medoids)
    #print(len(clusters[0]),len(clusters[1]),len(clusters[2]))
    while True:
      best_medoids=medoids
      lowest_cost=initial_cost
      for i in range(len(medoids)):
        non_medoids = self.get_non_medoids(data,medoids)
        for j in range(len(non_medoids)):
          new_medoids = medoids.copy()
          for k in range(len(new_medoids)):
            if (new_medoids[k] == medoids[i]).all():
              new_medoids[k] = non_medoids[j]
          new_clusters = self.get_clusters(data,new_medoids)
          new_cost = self.calculate_cost(data,new_clusters,new_medoids)
          if new_cost < lowest_cost:
              lowest_cost = new_cost
              best_medoids = new_medoids
      if lowest_cost < initial_cost:
        initial_cost = lowest_cost
        medoids = best_medoids 
      else:
        break
    final_clusters = self.get_clusters(data,medoids)
    cluster_labels = self.get_clusters_label(data,final_clusters)
    silhouette_avg = silhouette_score(data, cluster_labels)

    print("Silhouette coefficient for Kmedoidis for k=",self.k_value,silhouette_avg)

  def calculateDistance(self,x,y):
    return sum((x - y)**2)**0.5
  

dataset = pd.read_csv('iris.data',header=None).values
model = TrainModel(dataset,2)

Report:

Leave a Reply