
Implementing the k-means algorithm in Python

1. Introduction

k-means is an unsupervised learning algorithm mainly used for cluster analysis. It treats a number of points in the data set as cluster centers, computes the distance from each sample to those centers, and groups together the samples that are closest to the same center. The most important aspect of k-means is therefore the selection of the cluster centers.

The algorithm flow is briefly summarized as follows (a toy sketch is given after the list):

  • Choose the number of clusters;
  • Calculate the distance from each sample to the selected cluster centers and assign each sample to its closest center;
  • Iterate a set number of times, continuously updating the cluster centers;
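To make these three steps concrete, here is a minimal toy sketch of the assign/update loop in plain NumPy. It is only an illustration under my own assumptions (the function name kmeans_sketch and the random initialization are mine); the article's full implementation, with a different initialization, appears in section 3.

import numpy as np

def kmeans_sketch(x, k=3, max_iter=100, seed=0):
    """Toy k-means: random initial centers, then alternate assign/update steps."""
    rng = np.random.default_rng(seed)
    # pick k distinct samples as the initial centers (illustrative choice only)
    centers = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    labels = np.zeros(len(x), dtype=int)
    for _ in range(max_iter):
        # assignment step: each sample goes to its nearest center
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center becomes the mean of its assigned samples
        for j in range(k):
            if (labels == j).any():
                centers[j] = x[labels == j].mean(axis=0)
    return labels, centers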

2. k-means algorithm process

[Figure: k-means algorithm process]

Here the initial cluster centers are not chosen randomly, because randomly chosen centers may end up overlapping. Instead, the two samples that are farthest apart are taken as the first two initial cluster centers; each subsequent center is chosen by finding, for every already-selected center, its nearest unselected sample, and then taking the one of those candidates whose distance is largest, repeating until the defined k is reached. The cluster centers in the plot are updated only when the old and new centers differ. In practice, the quality of k-means depends largely on the number of clusters and on the choice of the initial centers, and the algorithm is best suited to samples that naturally group into compact clusters.
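A compact sketch of this selection scheme follows. The helper name farthest_point_init and the use of a precomputed pairwise distance matrix are my own illustrative choices; the kmean method in section 3 embeds the same logic inline.

import numpy as np

def farthest_point_init(dis, k):
    """Pick k initial center indexes from a pairwise distance matrix dis:
    start with the two farthest samples, then repeatedly take, among the
    nearest unchosen neighbours of the chosen centers, the farthest one."""
    n = len(dis)
    i, j = np.unravel_index(np.argmax(dis), dis.shape)
    chosen = [int(i), int(j)]
    while len(chosen) < k:
        rest = [p for p in range(n) if p not in chosen]
        # (distance, index) of each chosen center's nearest unchosen sample
        candidates = [min((dis[c][p], p) for p in rest) for c in chosen]
        # keep the candidate whose distance to its center is largest
        chosen.append(max(candidates)[1])
    return chosen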

2.1 Selection of the number of clusters

  • Graphical method: the most direct approach is to plot the data and see how many groups the samples form. This works for low-dimensional (1-3 dimensional) data; high-dimensional data is hard to plot directly, but you can select, via feature extraction, the dimensions that best reflect the differences between samples (these "differences" are hard to define precisely, since this is unsupervised learning; informally, the plotted samples should form several well-separated groups), or reduce the data to a low dimension and plot that instead.
  • Exhaustive-k method: define a series of candidate cluster numbers, evaluate each of them with a clustering metric, and choose the one with the best score (see the sketch after this list).
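As a side note, the same exhaustive-k idea can also be sketched with scikit-learn (not used in this article's own code; this assumes scikit-learn is installed and uses its silhouette score as the evaluation metric):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(x, k_range=range(2, 10)):
    """Fit k-means for each candidate k and return the k with the best silhouette score."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x)
        scores[k] = silhouette_score(x, labels)
    return max(scores, key=scores.get), scores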

2.2 Clustering evaluation metrics

There are many evaluation metrics; here I've chosen some simple ones that are easy to implement.

2.2.1 Silhouette coefficient

Reference: /item/%E8%BD%AE%E5%BB%93%E7%B3%BB%E6%95%B0/17361607?fr=aladdin (Baidu Baike entry on 轮廓系数, the silhouette coefficient)

[Figure: silhouette coefficient formula]

Supplementary: if S_i is close to 0, the sample is most likely on the boundary between two clusters; the closer it is to -1, the more likely the sample belongs to another cluster.
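For reference, the silhouette coefficient shown in the figure can be written out as follows (this is the standard definition; the contourCoefficient method in the code below follows the same structure, though its distance averaging differs slightly in detail):

\[
a_i = \frac{1}{|C_{\text{in}}|}\sum_{x \in C_{\text{in}}} d(x_i, x), \qquad
b_i = \min_{C \neq C_{\text{in}}} \frac{1}{|C|}\sum_{x \in C} d(x_i, x)
\]
\[
S_i = \frac{b_i - a_i}{\max(a_i, b_i)}, \qquad
S = \frac{1}{n}\sum_{i=1}^{n} S_i
\]

where C_in is the cluster containing sample x_i and d is a distance measure; S_i close to 1 means the sample sits well inside its own cluster.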

2.2.2 Compactness indicators

[Figure: compactness indicator formula]

This computes, for each cluster, the average distance CP_i of the samples in that cluster from the cluster center, and then averages the CP_i values over all clusters; the formula is written out below.
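Written out (mirroring the description above; the CP method in the code computes a close variant of this):

\[
CP_i = \frac{1}{|\Omega_i|}\sum_{x \in \Omega_i}\lVert x - w_i\rVert, \qquad
\overline{CP} = \frac{1}{k}\sum_{i=1}^{k} CP_i
\]

where Omega_i is the set of samples assigned to cluster i and w_i is its center; smaller values mean tighter clusters.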

2.2.3 Spacing indicators

[Figure: spacing indicator formula]
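The original figure for this indicator is not reproduced here; based on the SP method in the code below, the spacing indicator is the average pairwise distance between cluster centers:

\[
SP = \frac{2}{k^2 - k}\sum_{i=1}^{k}\sum_{j=i+1}^{k}\lVert w_i - w_j\rVert
\]

where w_i is the i-th cluster center; larger values mean better-separated clusters.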

3. Code

Dataset: the classic Iris dataset, which contains 150 samples of three different species, with four traits per sample.

# Data sets
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.4,3.4,1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
5.0,3.0,1.6,0.2,Iris-setosa
5.0,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.0,3.2,1.2,0.2,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
4.4,3.0,1.3,0.2,Iris-setosa
5.1,3.4,1.5,0.2,Iris-setosa
5.0,3.5,1.3,0.3,Iris-setosa
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5.0,3.5,1.6,0.6,Iris-setosa
5.1,3.8,1.9,0.4,Iris-setosa
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
5.7,2.8,4.5,1.3,Iris-versicolor
6.3,3.3,4.7,1.6,Iris-versicolor
4.9,2.4,3.3,1.0,Iris-versicolor
6.6,2.9,4.6,1.3,Iris-versicolor
5.2,2.7,3.9,1.4,Iris-versicolor
5.0,2.0,3.5,1.0,Iris-versicolor
5.9,3.0,4.2,1.5,Iris-versicolor
6.0,2.2,4.0,1.0,Iris-versicolor
6.1,2.9,4.7,1.4,Iris-versicolor
5.6,2.9,3.6,1.3,Iris-versicolor
6.7,3.1,4.4,1.4,Iris-versicolor
5.6,3.0,4.5,1.5,Iris-versicolor
5.8,2.7,4.1,1.0,Iris-versicolor
6.2,2.2,4.5,1.5,Iris-versicolor
5.6,2.5,3.9,1.1,Iris-versicolor
5.9,3.2,4.8,1.8,Iris-versicolor
6.1,2.8,4.0,1.3,Iris-versicolor
6.3,2.5,4.9,1.5,Iris-versicolor
6.1,2.8,4.7,1.2,Iris-versicolor
6.4,2.9,4.3,1.3,Iris-versicolor
6.6,3.0,4.4,1.4,Iris-versicolor
6.8,2.8,4.8,1.4,Iris-versicolor
6.7,3.0,5.0,1.7,Iris-versicolor
6.0,2.9,4.5,1.5,Iris-versicolor
5.7,2.6,3.5,1.0,Iris-versicolor
5.5,2.4,3.8,1.1,Iris-versicolor
5.5,2.4,3.7,1.0,Iris-versicolor
5.8,2.7,3.9,1.2,Iris-versicolor
6.0,2.7,5.1,1.6,Iris-versicolor
5.4,3.0,4.5,1.5,Iris-versicolor
6.0,3.4,4.5,1.6,Iris-versicolor
6.7,3.1,4.7,1.5,Iris-versicolor
6.3,2.3,4.4,1.3,Iris-versicolor
5.6,3.0,4.1,1.3,Iris-versicolor
5.5,2.5,4.0,1.3,Iris-versicolor
5.5,2.6,4.4,1.2,Iris-versicolor
6.1,3.0,4.6,1.4,Iris-versicolor
5.8,2.6,4.0,1.2,Iris-versicolor
5.0,2.3,3.3,1.0,Iris-versicolor
5.6,2.7,4.2,1.3,Iris-versicolor
5.7,3.0,4.2,1.2,Iris-versicolor
5.7,2.9,4.2,1.3,Iris-versicolor
6.2,2.9,4.3,1.3,Iris-versicolor
5.1,2.5,3.0,1.1,Iris-versicolor
5.7,2.8,4.1,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
7.6,3.0,6.6,2.1,Iris-virginica
4.9,2.5,4.5,1.7,Iris-virginica
7.3,2.9,6.3,1.8,Iris-virginica
6.7,2.5,5.8,1.8,Iris-virginica
7.2,3.6,6.1,2.5,Iris-virginica
6.5,3.2,5.1,2.0,Iris-virginica
6.4,2.7,5.3,1.9,Iris-virginica
6.8,3.0,5.5,2.1,Iris-virginica
5.7,2.5,5.0,2.0,Iris-virginica
5.8,2.8,5.1,2.4,Iris-virginica
6.4,3.2,5.3,2.3,Iris-virginica
6.5,3.0,5.5,1.8,Iris-virginica
7.7,3.8,6.7,2.2,Iris-virginica
7.7,2.6,6.9,2.3,Iris-virginica
6.0,2.2,5.0,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2.0,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
6.7,3.3,5.7,2.1,Iris-virginica
7.2,3.2,6.0,1.8,Iris-virginica
6.2,2.8,4.8,1.8,Iris-virginica
6.1,3.0,4.9,1.8,Iris-virginica
6.4,2.8,5.6,2.1,Iris-virginica
7.2,3.0,5.8,1.6,Iris-virginica
7.4,2.8,6.1,1.9,Iris-virginica
7.9,3.8,6.4,2.0,Iris-virginica
6.4,2.8,5.6,2.2,Iris-virginica
6.3,2.8,5.1,1.5,Iris-virginica
6.1,2.6,5.6,1.4,Iris-virginica
7.7,3.0,6.1,2.3,Iris-virginica
6.3,3.4,5.6,2.4,Iris-virginica
6.4,3.1,5.5,1.8,Iris-virginica
6.0,3.0,4.8,1.8,Iris-virginica
6.9,3.1,5.4,2.1,Iris-virginica
6.7,3.1,5.6,2.4,Iris-virginica
6.9,3.1,5.1,2.3,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
6.8,3.2,5.9,2.3,Iris-virginica
6.7,3.3,5.7,2.5,Iris-virginica
6.7,3.0,5.2,2.3,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica
#python3

#----------------------------
# author: little shark
# date: 2022/3/12
"""
Kmean Realization
"""

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import random

class Algo:
	def __init__(self):
		self.x = None
		self.y = None
		self.rows, self.cols = None, None
		self.dis = None
		self.dis_i = None
		self.rbf_i = None
		self.G_i = None
		self.GDis_i = True
		
	def loadData(self,file):
		"""
		Read file
		"""
		df=pd.read_csv(file,delimiter=",",header=None)
		self.x = df.iloc[:,0:-1].values
		self.y = df.iloc[:,-1].values
		
		self.rows, self.cols = self.x.shape
		#print(self.x)
		
	def dataNorm(self):
		"""
Normalized data
"""
		assert  != None or  !=None, "First you must loadData!!!"
		
		#get min matrix
		min_ = self.x.min(axis=0).reshape(1,self.cols)
		
		#get max matrix
		max_ = self.x.max(axis=0).reshape(1,self.cols)
		
		#get range matrix
		range_=max_-min_

		#update self.x
		self.x = (self.x - min_)/range_
	
	def euDistances(self,x):
		"""
		Get EuclideanDistances matrix
		"""
		assert self.x is not None, "First you must loadData!!!"
		
		# Input matrix size
		rows,cols = x.shape
		
		#define an empty distance matrix
		dis = np.zeros(rows**2).reshape(rows,rows)
		
		#fill dis matrix by calculating distance of each point
		for i in range(rows):
			for j in range(rows):
				if i == j:
					dis[i][j] = 0
					continue
				d=sum(pow(x[i]-x[j],2))**0.5
				dis[i][j]=d
				dis[j][i]=d
		
		#cache the Euclidean distance matrix
		self.dis = dis.copy()
		self.dis_i=True
		
		return dis
		
		
	def kmean(self,k=3,maxIteration=100,x=None,print_=False):
		"""
kmean entities
"""
		assert  != None or  !=None, "First you must loadData!!!"
		assert () != None, "you must import a data to kmean"
		
		# Input matrix size
		rows,cols = x.shape
		
		# Define a list of labels representing the categories of all samples
		label = np.zeros(rows)
		labelCode=[i for i in range(k)]
		
		# Calculate the distance matrix of the input matrix
		dis = self.euDistances(x)
		
		#------------------------------------------
		# Select initial cluster centers, number = k
		# The two farthest points are used as cluster centers #
		max_=0
		maxI=0
		maxJ=0
		for i in range(rows):
			for j in range(rows):
				if dis[i][j] > max_:
					max_ = dis[i][j]
					maxI = i
					maxJ = j
		k1I = maxI
		k2I = maxJ
		k1=x[k1I].reshape(1,cols)
		k2=x[k2I].reshape(1,cols)
		
		Ci=[k1I,k2I] # Preserve cluster-centered indexes
		C=[k1,k2] # Save the cluster center value
		
		#Step 3, retrieve other cluster centers #
		for i in range(2,k):
			tempI=[]
			temp=[]
			for j in range(len(C)):
				
				#step 3.1 Closest point to the center of the cluster
				minI=0
				min_=99999
				index=0
				for d in dis[Ci[j]]:
					# Omit itself and the points already retrieved
					if d==0 or index in Ci:
						index+=1
						continue
					if d<min_:
						minI=index
						min_=d
					index+=1
				
				
				#step 3.2 Adding distance sizes and their indexes
				temp.append(dis[Ci[j]][minI])
				tempI.append(minI)
				
			#step 3.3 Select the next cluster center with the largest distance from the closest point to the selected cluster center.
			nextKI = tempI[temp.index(max(temp))]
			nextK = x[nextKI].reshape(1,cols)
			Ci.append(nextKI)
			C.append(nextK)
		
		#------------------------------------
		#updatelabel
		for i in range(rows):
			sample=x[i].reshape(1,cols)
			
			# Determine the closest cluster center
			minI=0
			min_=99999
			for j in range(k):
				disV = self.getDistance(sample,C[j])
				#print(dis)
				if disV<min_:
					minI=j
					min_=disV
			label[i]=labelCode[minI]
		
		#------------------------------------
		# Updating cluster centers
		for iter in range(maxIteration):
			for j in range(k):
				# Get data under the same category
				dataSet=x[label == labelCode[j]]
				if len(dataSet) == 0:
					continue
				
				# Calculate the average
				mean_ = dataSet.mean(axis=0).reshape(1,cols)
				
				# Updating cluster centers
				C[j]=mean_
				
			#------------------------------------
			#updatelabel
			for i in range(rows):
				sample=x[i].reshape(1,cols)
				
				# Determine the closest cluster center
				minI=0
				min_=99999
				for j in range(k):
					disV = self.getDistance(sample,C[j])
					#print(dis)
					if disV<min_:
						minI=j
						min_=disV
				label[i]=labelCode[minI]
			
			cp = self.CP(label,C,x)
			sp = self.SP(C)
			conCoe = self.contourCoefficient(label,C,x)
			
			if print_:
				print("Iteration %-4d CP = %-7.4f SP = %-7.4f coef = %-7.4f"%(iter,cp,sp,conCoe))
			
		return label,C
	
	def getDistance(self,x1,x2):
		"""
Compute the geometric distance between two vectors
"""
		return pow(((x1-x2)**2).sum(),0.5)
				
	def CP(self,label,c,x):
		"""
Closeness evaluation indicators
"""
		k = len(c)
		labelCode = (label)
		
		cp = 0
		for i in range(k):
			#Homocluster matrices
			x1 = x[label == i]
			
			if len(x1) == 0:continue
			
			# size
			rows, cols = x1.shape
			
			# root of the summed squared deviations from the cluster center
			sum_ = pow(((x1 - c[i])**2).sum(),0.5)
			
			cp += sum_/rows
		
		return cp/k
		
	def SP(self,c):
		"""
Indicators of interval evaluation
"""
		k = len(c)
		
		sp = 0
		for i in range(k):
			# sum each unordered pair of centers once
			for j in range(i+1,k):
				sp += self.getDistance(c[i],c[j])
		
		return 2*sp/(k**2-k)
		
	
	def calAi(self,v,vs):
		"""
Calculating ai in contour coefficients
v: a vector
vs: co-cluster vector of this vector
"""
		rows,cols = 
		ai = pow((vs - v)**2,0.5).sum()/rows
		
		return ai
		
	def calBi(self,v,vs,vsLabel):
		"""
Calculate bi in the contour coefficients
v: a vector
vs: non-homocluster vector of this vector
vsCode: label of the non-homocluster vector of this vector
"""
		ks = (vsLabel)
		k = len(ks)
		min_ = 99999
		
		for i in range(k):
			# vectors belonging to one of the other clusters
			x1 = vs[vsLabel == ks[i]]
			
			if len(x1) == 0:continue
			
			# average distance from v to that cluster
			disAver = self.calAi(v,x1)
			
			if disAver < min_:
				min_ = disAver
		
		return min_
		
	def contourCoefficient(self,label,c,x):
		"""
Contour coefficient
"""
		k = len(c)
		rows,cols = 
		S = (rows)
		
		for i in range(rows):
			# Sample clusters
			x1 = x[label == label[i]]
			
			# Non-homoclustered samples
			x2 = x[label != label[i]]
			label2 = label[label != label[i]]
			
			# Current Sample
			sample = x[i]
			
			# calculate a_i and b_i
			ai = self.calAi(sample,x1)
			bi = self.calBi(sample,x2,label2)
			
			s = (bi-ai)/max([ai,bi])
			
			S[i] = s
		
		return S.mean()
		
if __name__ == "__main__":
	#-------------------------
	# Directly define the number of clusters
	a=Algo()
	
	#Load data
	("data set/iris (2).data")
	
	# Run the algorithm
	label,c = a.kmean(x=a.x,k=3)
	kCode = np.unique(label)
	
	print(label)
	
	# Plot the samples colored by cluster
	for k in kCode:
		plt.scatter(a.x[label == k][:,0],a.x[label == k][:,2],alpha=0.6,label="sample %d"%k)
	
	# Plot the cluster centers
	for i in range(len(c)):
		plt.scatter(x=c[i][:,0],y=c[i][:,2],color="red",marker="*",label="center %d"%i)
	plt.legend()
	plt.show()
	
	
	#----------------------------
	#Exhaustive k method
	#exit()
	coef = []
	CP = []
	SP = []
	for k in range(2,10):
		label,c = a.kmean(x=a.x,k=k)
		coef.append(a.contourCoefficient(label,c,a.x))
		SP.append(a.SP(c))
		CP.append(a.CP(label,c,a.x))
		print("* When k = %d done *"%k)
	
	plt.subplot(131)
	plt.plot(range(2,10),SP,alpha=0.8,linestyle="--",color="red",label="SP")
	plt.xlabel("K")
	plt.legend()
	
	plt.subplot(132)
	plt.plot(range(2,10),coef,alpha=0.8,linestyle=":",color="green",label="coef")
	plt.xlabel("K")
	plt.legend()
	
	plt.subplot(133)
	plt.plot(range(2,10),CP,alpha=0.8,linestyle="-",color="blue",label="CP")
	plt.xlabel("K")
	plt.legend()
	plt.show()

4. Output results

4.1 Command line output

[Figure: command-line output]

Explanation: the first output is the cluster assignment of all samples. The sample order was not shuffled before the analysis, so samples of the same species appear as contiguous blocks; roughly speaking, directly setting k = 3 gives a decent partition. The later output comes from the exhaustive-k method, showing the corresponding evaluation indicators being computed for each k.

4.2 Visualization

[Figure: clustering visualization]

Explanation: different colors represent different clusters, the red stars mark the cluster centers, and the two plotted dimensions are column 1 and column 3 of the data.

4.3 Indicators from the exhaustive-k method

[Figure: evaluation indicators versus k]

Explanation: the first plot shows the spacing indicator SP, which measures the average distance between cluster centers. It fluctuates noticeably as k increases and reaches a minimum at k = 3, meaning the clusters are closest to each other there; ideally we would like this indicator to be large, indicating clear separation between classes. The second plot is the silhouette coefficient, which gradually decreases as k increases. The third is the compactness indicator, which measures the distance of samples from their cluster centers and fluctuates slightly as k grows. None of these indicators accurately points to k = 3 (the true number of clusters) as a good choice; perhaps they are simply not well suited to determining the number of clusters.

Looking at the silhouette coefficient alone: knowing that k = 3, a value closer to 1 is supposed to indicate a more reasonable clustering, which is clearly not what we observe. From the formula, when k = 1 the silhouette coefficient would be very close to 1; does that mean "everyone is one family, all men within the four seas are brothers"? Unlikely. At k = 2 the coefficient is 0.75; is a split into 2 clusters therefore better? Clearly not, given that we know the true classification. So: are the results of our statistical analysis really the same as reality?

First, consider k-means itself: it is an unsupervised algorithm with no prior labels on the data; it measures the similarity of samples and groups the most similar ones into one class, reflecting the idea that "things with closer attributes belong to the same kind", but it is limited by the choice of the initial centers and of k. Second, consider the evaluation indicators: they mainly measure the degree of intra-cluster cohesion and inter-cluster separation. As k grows, more samples tend to "go their own way", and SP, CP and the silhouette coefficient can all be seen to shrink, so by their very meaning these indicators cannot accurately reflect the true number of clusters.

Furthermore, human cognition has its limits: we only know what we know, and we cannot perceive objective laws directly; we can only observe whatever we are able to perceive and then summarize and infer from it. Our current knowledge is the gradual accumulation of what our predecessors built up with "measuring tools and magnifying glasses". What we think is the product of practice; tools and ideas are the crystallization of human activity, which means they have shortcomings, yet they are all we can bring to the table.

The main point I want to make is not to be too superstitious about indicators when analyzing data, especially data whose true distribution is unknown. If these indicators could accurately and consistently reflect the real data, we would already have conquered the universe and be predicting the future.

Back to the earlier question: are the results of our statistical analysis really the same as reality? For now, these measurements and analytical methods are all we can come up with, and as mankind progresses they will either evolve or become obsolete. These tools reflect reality only to a certain extent, perhaps very closely, perhaps very poorly... As for the final answer, before the objective laws are truly understood we will never know; we can only label it with the products of our human activity. Perhaps future generations, looking back at the present, will find it amusing.

This concludes this article on implementing the k-means algorithm in Python. For more on implementing k-means in Python, please search my earlier articles or continue browsing the related articles below, and I hope you will continue to support me!