1. Introduction
k-means is an unsupervised learning algorithm used mainly for cluster analysis. It treats a number of points in the data set as cluster centers, computes the distance of every sample to each center, and groups together the samples that are closest to the same center. The most important aspect of k-means is therefore the selection of the cluster centers.
The flow of the algorithm can be summarized briefly as follows (a minimal sketch of one iteration is given after the list):
- Choose the number of clusters k;
- Compute the distance of every sample to the selected cluster centers and assign each sample to its nearest center;
- Repeat for a fixed number of iterations, recomputing (updating) the cluster centers each time.
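To make the two steps above concrete, here is a minimal NumPy sketch of a single assignment-and-update iteration; the function name kmeans_step and its arguments are illustrative only and are not part of the implementation in section 3.

import numpy as np

def kmeans_step(x, centers):
    """One assignment + update iteration of k-means (illustrative sketch)."""
    # Assignment: distance of every sample to every center, pick the nearest
    dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)  # shape (n, k)
    labels = dists.argmin(axis=1)
    # Update: each center becomes the mean of the samples assigned to it
    new_centers = np.array([
        x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
        for j in range(len(centers))
    ])
    return labels, new_centers

Section 3 implements these same two steps explicitly with loops inside the kmean method.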
2. k-means algorithm process
Here the initial cluster centers are not chosen randomly, because random choices can give centers that (almost) overlap. Instead, the two samples that are farthest apart are taken as the first two centers; after that, for each already-chosen center, its nearest remaining sample is found, and the one of these nearest samples that lies farthest from its center becomes the next center, and so on until k centers have been selected. In addition, the cluster centers in the plot should only be updated when the new centers differ from the old ones. In practice the quality of k-means depends largely on the number of clusters and on the choice of the initial centers, and the algorithm is best suited to samples that naturally group into compact clusters.
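As a standalone sketch of this initialization strategy (separate from the class-based implementation in section 3), something like the following would do; farthest_pair_init is a made-up name and the sketch assumes k >= 2.

import numpy as np

def farthest_pair_init(x, k):
    """Illustrative sketch of the initialization described above (k >= 2)."""
    # Pairwise Euclidean distance matrix
    dis = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    # First two centers: the pair of samples with the largest distance
    i, j = np.unravel_index(dis.argmax(), dis.shape)
    chosen = [i, j]
    while len(chosen) < k:
        candidates = []
        for ci in chosen:
            # nearest sample to this center that is not already a center
            d = dis[ci].copy()
            d[chosen] = np.inf
            nearest = d.argmin()
            candidates.append((d[nearest], nearest))
        # next center: the nearest-sample that is farthest from its own center
        chosen.append(max(candidates)[1])
    return x[chosen]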
2.1 Selection of the number of clusters
- Graphical method: the most direct approach is still to plot the data and see directly from the plot how many groups the samples fall into. This works for low-dimensional (1 to 3 dimensional) data; high-dimensional data is hard to plot directly, but you can use feature extraction to select a few dimensions that reflect the differences between samples (those differences are hard to define precisely, since this is unsupervised learning; intuitively, you want a plot in which the samples aggregate into several groups separated by clear gaps), or reduce the data to low dimensions first and then plot it.
- Exhaustive-k method: define a series of candidate numbers of clusters, run the clustering for each, evaluate the results with a clustering metric, and select the number of clusters with the best score (a sketch follows this list);
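A minimal, runnable sketch of the exhaustive-k loop, shown here with scikit-learn's KMeans and silhouette_score purely for brevity; the article's own implementation in section 3 does the same loop with its own kmean and metric methods, and the file path "iris.data" is just an assumed local copy of the dataset.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Load only the four numeric columns of the Iris data (path is illustrative)
x = np.loadtxt("iris.data", delimiter=",", usecols=(0, 1, 2, 3))

best_k, best_score = None, -1.0
for k in range(2, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x)
    s = silhouette_score(x, km.labels_)  # higher is better
    if s > best_score:
        best_k, best_score = k, s
print("best k by silhouette:", best_k)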
2.2 Clustering evaluation metrics
There are many possible evaluation metrics; here I have chosen a few simple ones that are easy to implement.
2.2.1 Silhouette coefficient
Reference: /item/%E8%BD%AE%E5%BB%93%E7%B3%BB%E6%95%B0/17361607?fr=aladdin (Baidu Baike entry for 轮廓系数, the silhouette coefficient)
Supplementary note: if S_i is close to 0, the sample is likely to lie on the boundary between clusters; the closer S_i is to -1, the more likely the sample actually belongs to another cluster.
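Since the linked reference is not reproduced here: for sample $i$, let $a_i$ be the average distance from the sample to the other samples in its own cluster and $b_i$ the smallest average distance from the sample to the samples of any other cluster. The standard definition is then

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}, \qquad S = \frac{1}{n}\sum_{i=1}^{n} s_i,$$

i.e. the per-sample score $s_i$ averaged over all $n$ samples; the contourCoefficient method in the code below implements this scheme (with a simplified distance computation for $a_i$ and $b_i$).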
2.2.2 Compactness metric (CP)
This metric computes, for each cluster, the average distance CP_i of the samples in that cluster to the cluster center, and then averages the CP_i values over all clusters; smaller values indicate more compact clusters.
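Written out, with $\Omega_i$ the set of samples in cluster $i$, $c_i$ its center, and $k$ the number of clusters, a standard formulation is

$$\mathrm{CP}_i = \frac{1}{|\Omega_i|}\sum_{x \in \Omega_i} \lVert x - c_i \rVert, \qquad \mathrm{CP} = \frac{1}{k}\sum_{i=1}^{k} \mathrm{CP}_i.$$

The CP method in the code below computes a closely related quantity (the root of the summed squared deviations of the whole cluster from its center, divided by the cluster size).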
2.2.3 Spacing (separation) metric (SP)
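The spacing metric measures how far apart the clusters are: it is the average pairwise distance between the $k$ cluster centers, which is what the SP method in the code below computes; larger values indicate better-separated clusters:

$$\mathrm{SP} = \frac{2}{k^2 - k}\sum_{i=1}^{k-1}\sum_{j=i+1}^{k} \lVert c_i - c_j \rVert.$$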
3. Code
Dataset: the classic Iris dataset, which contains 150 samples from three different species, each described by four traits (sepal length, sepal width, petal length, petal width).
# Data sets 5.1,3.5,1.4,0.2,Iris-setosa 4.9,3.0,1.4,0.2,Iris-setosa 4.7,3.2,1.3,0.2,Iris-setosa 4.6,3.1,1.5,0.2,Iris-setosa 5.0,3.6,1.4,0.2,Iris-setosa 5.4,3.9,1.7,0.4,Iris-setosa 4.6,3.4,1.4,0.3,Iris-setosa 5.0,3.4,1.5,0.2,Iris-setosa 4.4,2.9,1.4,0.2,Iris-setosa 4.9,3.1,1.5,0.1,Iris-setosa 5.4,3.7,1.5,0.2,Iris-setosa 4.8,3.4,1.6,0.2,Iris-setosa 4.8,3.0,1.4,0.1,Iris-setosa 4.3,3.0,1.1,0.1,Iris-setosa 5.8,4.0,1.2,0.2,Iris-setosa 5.7,4.4,1.5,0.4,Iris-setosa 5.4,3.9,1.3,0.4,Iris-setosa 5.1,3.5,1.4,0.3,Iris-setosa 5.7,3.8,1.7,0.3,Iris-setosa 5.1,3.8,1.5,0.3,Iris-setosa 5.4,3.4,1.7,0.2,Iris-setosa 5.1,3.7,1.5,0.4,Iris-setosa 4.6,3.6,1.0,0.2,Iris-setosa 5.1,3.3,1.7,0.5,Iris-setosa 4.8,3.4,1.9,0.2,Iris-setosa 5.0,3.0,1.6,0.2,Iris-setosa 5.0,3.4,1.6,0.4,Iris-setosa 5.2,3.5,1.5,0.2,Iris-setosa 5.2,3.4,1.4,0.2,Iris-setosa 4.7,3.2,1.6,0.2,Iris-setosa 4.8,3.1,1.6,0.2,Iris-setosa 5.4,3.4,1.5,0.4,Iris-setosa 5.2,4.1,1.5,0.1,Iris-setosa 5.5,4.2,1.4,0.2,Iris-setosa 4.9,3.1,1.5,0.1,Iris-setosa 5.0,3.2,1.2,0.2,Iris-setosa 5.5,3.5,1.3,0.2,Iris-setosa 4.9,3.1,1.5,0.1,Iris-setosa 4.4,3.0,1.3,0.2,Iris-setosa 5.1,3.4,1.5,0.2,Iris-setosa 5.0,3.5,1.3,0.3,Iris-setosa 4.5,2.3,1.3,0.3,Iris-setosa 4.4,3.2,1.3,0.2,Iris-setosa 5.0,3.5,1.6,0.6,Iris-setosa 5.1,3.8,1.9,0.4,Iris-setosa 4.8,3.0,1.4,0.3,Iris-setosa 5.1,3.8,1.6,0.2,Iris-setosa 4.6,3.2,1.4,0.2,Iris-setosa 5.3,3.7,1.5,0.2,Iris-setosa 5.0,3.3,1.4,0.2,Iris-setosa 7.0,3.2,4.7,1.4,Iris-versicolor 6.4,3.2,4.5,1.5,Iris-versicolor 6.9,3.1,4.9,1.5,Iris-versicolor 5.5,2.3,4.0,1.3,Iris-versicolor 6.5,2.8,4.6,1.5,Iris-versicolor 5.7,2.8,4.5,1.3,Iris-versicolor 6.3,3.3,4.7,1.6,Iris-versicolor 4.9,2.4,3.3,1.0,Iris-versicolor 6.6,2.9,4.6,1.3,Iris-versicolor 5.2,2.7,3.9,1.4,Iris-versicolor 5.0,2.0,3.5,1.0,Iris-versicolor 5.9,3.0,4.2,1.5,Iris-versicolor 6.0,2.2,4.0,1.0,Iris-versicolor 6.1,2.9,4.7,1.4,Iris-versicolor 5.6,2.9,3.6,1.3,Iris-versicolor 6.7,3.1,4.4,1.4,Iris-versicolor 5.6,3.0,4.5,1.5,Iris-versicolor 5.8,2.7,4.1,1.0,Iris-versicolor 6.2,2.2,4.5,1.5,Iris-versicolor 5.6,2.5,3.9,1.1,Iris-versicolor 5.9,3.2,4.8,1.8,Iris-versicolor 6.1,2.8,4.0,1.3,Iris-versicolor 6.3,2.5,4.9,1.5,Iris-versicolor 6.1,2.8,4.7,1.2,Iris-versicolor 6.4,2.9,4.3,1.3,Iris-versicolor 6.6,3.0,4.4,1.4,Iris-versicolor 6.8,2.8,4.8,1.4,Iris-versicolor 6.7,3.0,5.0,1.7,Iris-versicolor 6.0,2.9,4.5,1.5,Iris-versicolor 5.7,2.6,3.5,1.0,Iris-versicolor 5.5,2.4,3.8,1.1,Iris-versicolor 5.5,2.4,3.7,1.0,Iris-versicolor 5.8,2.7,3.9,1.2,Iris-versicolor 6.0,2.7,5.1,1.6,Iris-versicolor 5.4,3.0,4.5,1.5,Iris-versicolor 6.0,3.4,4.5,1.6,Iris-versicolor 6.7,3.1,4.7,1.5,Iris-versicolor 6.3,2.3,4.4,1.3,Iris-versicolor 5.6,3.0,4.1,1.3,Iris-versicolor 5.5,2.5,4.0,1.3,Iris-versicolor 5.5,2.6,4.4,1.2,Iris-versicolor 6.1,3.0,4.6,1.4,Iris-versicolor 5.8,2.6,4.0,1.2,Iris-versicolor 5.0,2.3,3.3,1.0,Iris-versicolor 5.6,2.7,4.2,1.3,Iris-versicolor 5.7,3.0,4.2,1.2,Iris-versicolor 5.7,2.9,4.2,1.3,Iris-versicolor 6.2,2.9,4.3,1.3,Iris-versicolor 5.1,2.5,3.0,1.1,Iris-versicolor 5.7,2.8,4.1,1.3,Iris-versicolor 6.3,3.3,6.0,2.5,Iris-virginica 5.8,2.7,5.1,1.9,Iris-virginica 7.1,3.0,5.9,2.1,Iris-virginica 6.3,2.9,5.6,1.8,Iris-virginica 6.5,3.0,5.8,2.2,Iris-virginica 7.6,3.0,6.6,2.1,Iris-virginica 4.9,2.5,4.5,1.7,Iris-virginica 7.3,2.9,6.3,1.8,Iris-virginica 6.7,2.5,5.8,1.8,Iris-virginica 7.2,3.6,6.1,2.5,Iris-virginica 6.5,3.2,5.1,2.0,Iris-virginica 6.4,2.7,5.3,1.9,Iris-virginica 6.8,3.0,5.5,2.1,Iris-virginica 5.7,2.5,5.0,2.0,Iris-virginica 5.8,2.8,5.1,2.4,Iris-virginica 6.4,3.2,5.3,2.3,Iris-virginica 6.5,3.0,5.5,1.8,Iris-virginica 
7.7,3.8,6.7,2.2,Iris-virginica 7.7,2.6,6.9,2.3,Iris-virginica 6.0,2.2,5.0,1.5,Iris-virginica 6.9,3.2,5.7,2.3,Iris-virginica 5.6,2.8,4.9,2.0,Iris-virginica 7.7,2.8,6.7,2.0,Iris-virginica 6.3,2.7,4.9,1.8,Iris-virginica 6.7,3.3,5.7,2.1,Iris-virginica 7.2,3.2,6.0,1.8,Iris-virginica 6.2,2.8,4.8,1.8,Iris-virginica 6.1,3.0,4.9,1.8,Iris-virginica 6.4,2.8,5.6,2.1,Iris-virginica 7.2,3.0,5.8,1.6,Iris-virginica 7.4,2.8,6.1,1.9,Iris-virginica 7.9,3.8,6.4,2.0,Iris-virginica 6.4,2.8,5.6,2.2,Iris-virginica 6.3,2.8,5.1,1.5,Iris-virginica 6.1,2.6,5.6,1.4,Iris-virginica 7.7,3.0,6.1,2.3,Iris-virginica 6.3,3.4,5.6,2.4,Iris-virginica 6.4,3.1,5.5,1.8,Iris-virginica 6.0,3.0,4.8,1.8,Iris-virginica 6.9,3.1,5.4,2.1,Iris-virginica 6.7,3.1,5.6,2.4,Iris-virginica 6.9,3.1,5.1,2.3,Iris-virginica 5.8,2.7,5.1,1.9,Iris-virginica 6.8,3.2,5.9,2.3,Iris-virginica 6.7,3.3,5.7,2.5,Iris-virginica 6.7,3.0,5.2,2.3,Iris-virginica 6.3,2.5,5.0,1.9,Iris-virginica 6.5,3.0,5.2,2.0,Iris-virginica 6.2,3.4,5.4,2.3,Iris-virginica 5.9,3.0,5.1,1.8,Iris-virginica
#python3
#----------------------------
# author: little shark
# date: 2022/3/12
"""
Kmean Realization
"""
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import random


class Algo:
    def __init__(self):
        self.x = None
        self.y = None
        self.rows, self.cols = None, None
        self.dis = None
        self.dis_i = None
        self.rbf_i = None
        self.G_i = None
        self.GDis_i = True

    def loadData(self, file):
        """
        Read file
        """
        df = pd.read_csv(file, delimiter=",", header=None)
        self.x = df.iloc[:, 0:-1].values
        self.y = df.iloc[:, -1].values
        self.rows, self.cols = self.x.shape
        #print(self.x)

    def dataNorm(self):
        """
        Normalize data (min-max scaling per feature)
        """
        assert self.x is not None and self.y is not None, "First you must loadData!!!"

        #get min matrix
        min_ = self.x.min(axis=0).reshape(1, self.cols)

        #get max matrix
        max_ = self.x.max(axis=0).reshape(1, self.cols)

        #get range matrix
        range_ = max_ - min_

        #update
        self.x = (self.x - min_) / range_

    def euDistances(self, x):
        """
        Get Euclidean distance matrix
        """
        assert self.x is not None and self.y is not None, "First you must loadData!!!"

        # Input matrix size
        rows, cols = x.shape

        #define an empty matrix
        dis = np.zeros(rows**2).reshape(rows, rows)

        #fill dis matrix by calculating the distance of each pair of points
        for i in range(rows):
            for j in range(rows):
                if i == j:
                    dis[i][j] = 0
                    continue
                d = sum(pow(x[i] - x[j], 2))**0.5
                dis[i][j] = d
                dis[j][i] = d

        #keep the Euclidean distance matrix
        self.dis = dis.copy()
        self.dis_i = True
        return dis

    def kmean(self, k=3, maxIteration=100, x=None, print_=False):
        """
        kmean entity
        """
        assert self.x is not None and self.y is not None, "First you must loadData!!!"
        assert x is not None, "you must import a data to kmean"

        # Input matrix size
        rows, cols = x.shape

        # Labels representing the cluster of every sample
        label = np.zeros(rows)
        labelCode = [i for i in range(k)]

        # Distance matrix of the input matrix
        dis = self.euDistances(x)

        #------------------------------------------
        # Select initial cluster centers, number = k
        # The two farthest points are used as the first two centers
        max_ = 0
        maxI = 0
        maxJ = 0
        for i in range(rows):
            for j in range(rows):
                if dis[i][j] > max_:
                    max_ = dis[i][j]
                    maxI = i
                    maxJ = j
        k1I = maxI
        k2I = maxJ
        k1 = x[k1I].reshape(1, cols)
        k2 = x[k2I].reshape(1, cols)
        Ci = [k1I, k2I]  # Preserve cluster-center indexes
        C = [k1, k2]     # Save the cluster-center values

        #Step 3, retrieve the other cluster centers
        for i in range(2, k):
            tempI = []
            temp = []
            for j in range(len(C)):
                #step 3.1 closest point to this cluster center
                minI = 0
                min_ = 99999
                index = 0
                for d in dis[Ci[j]]:
                    # Omit itself and the points already retrieved
                    if d == 0 or index in Ci:
                        index += 1
                        continue
                    if d < min_:
                        minI = index
                        min_ = d
                    index += 1

                #step 3.2 record the distance and its index
                temp.append(dis[Ci[j]][minI])
                tempI.append(minI)

            #step 3.3 the next cluster center is the closest point
            #         with the largest distance to its selected center
            nextKI = tempI[temp.index(max(temp))]
            nextK = x[nextKI].reshape(1, cols)
            Ci.append(nextKI)
            C.append(nextK)

        #------------------------------------
        #update label
        for i in range(rows):
            sample = x[i].reshape(1, cols)

            # Determine the closest cluster center
            minI = 0
            min_ = 99999
            for j in range(k):
                disV = self.getDistance(sample, C[j])
                #print(disV)
                if disV < min_:
                    minI = j
                    min_ = disV
            label[i] = labelCode[minI]

        #------------------------------------
        # Update cluster centers
        for iter in range(maxIteration):
            for j in range(k):
                # Get the data under the same cluster
                dataSet = x[label == labelCode[j]]
                if len(dataSet) == 0:
                    continue

                # Calculate the average
                mean_ = dataSet.mean(axis=0).reshape(1, cols)

                # Update this cluster center
                C[j] = mean_

            #------------------------------------
            #update label
            for i in range(rows):
                sample = x[i].reshape(1, cols)

                # Determine the closest cluster center
                minI = 0
                min_ = 99999
                for j in range(k):
                    disV = self.getDistance(sample, C[j])
                    #print(disV)
                    if disV < min_:
                        minI = j
                        min_ = disV
                label[i] = labelCode[minI]

            cp = self.CP(label, C, x)
            sp = self.SP(C)
            conCoe = self.contourCoefficient(label, C, x)
            if print_:
                print("Iteration %-4d CP = %-7.4f SP = %-7.4f coef = %-7.4f" % (iter, cp, sp, conCoe))

        return label, C

    def getDistance(self, x1, x2):
        """
        Compute the Euclidean distance between two vectors
        """
        return pow(((x1 - x2)**2).sum(), 0.5)

    def CP(self, label, c, x):
        """
        Compactness (closeness) evaluation metric
        """
        k = len(c)
        labelCode = np.unique(label)
        cp = 0
        for i in range(k):
            #same-cluster matrix
            x1 = x[label == i]
            if len(x1) == 0: continue

            # size
            rows, cols = x1.shape

            # deviation of the cluster from its center, averaged over the cluster size
            sum_ = pow(((x1 - c[i])**2).sum(), 0.5)
            cp += sum_ / rows
        return cp / k

    def SP(self, c):
        """
        Separation (spacing) evaluation metric
        """
        k = len(c)
        sp = 0
        for i in range(k):
            for j in range(i + 1, k):  # each pair of centers counted once
                sp += self.getDistance(c[i], c[j])
        return 2 * sp / (k**2 - k)

    def calAi(self, v, vs):
        """
        Calculate ai in the silhouette coefficient
        v: a vector
        vs: the same-cluster vectors of this vector
        """
        rows, cols = vs.shape
        # average absolute deviation of v from the samples of this cluster
        ai = pow((vs - v)**2, 0.5).sum() / rows
        return ai

    def calBi(self, v, vs, vsLabel):
        """
        Calculate bi in the silhouette coefficient
        v: a vector
        vs: the non-same-cluster vectors of this vector
        vsLabel: labels of the non-same-cluster vectors
        """
        ks = np.unique(vsLabel)
        k = len(ks)
        min_ = 99999
        for i in range(k):
            # vectors belonging to one of the other clusters
            x1 = vs[vsLabel == ks[i]]
            if len(x1) == 0: continue

            # average distance to that cluster
            disAver = self.calAi(v, x1)
            if disAver < min_:
                min_ = disAver
        return min_

    def contourCoefficient(self, label, c, x):
        """
        Silhouette (contour) coefficient
        """
        k = len(c)
        rows, cols = x.shape
        S = np.zeros(rows)
        for i in range(rows):
            # samples of the same cluster
            x1 = x[label == label[i]]

            # samples of the other clusters
            x2 = x[label != label[i]]
            label2 = label[label != label[i]]

            # current sample
            sample = x[i]

            #ai, bi
            ai = self.calAi(sample, x1)
            bi = self.calBi(sample, x2, label2)
            s = (bi - ai) / max([ai, bi])
            S[i] = s
        return S.mean()


if __name__ == "__main__":
    #-------------------------
    # Directly define the number of clusters
    a = Algo()

    #Load data
    a.loadData("data set/iris (2).data")

    # Run the algorithm
    label, c = a.kmean(x=a.x, k=3)
    kCode = np.unique(label)
    print(label)

    #Draw the data, one colour per assigned cluster
    for k in kCode:
        plt.scatter(a.x[label == k][:, 0], a.x[label == k][:, 2], alpha=0.6, label="sample %d" % k)

    # Draw the cluster centers
    for i in range(len(c)):
        plt.scatter(x=c[i][:, 0], y=c[i][:, 2], color="red", marker="*", label="center %d" % i)
    plt.legend()
    plt.show()

    #----------------------------
    #Exhaustive k method
    #exit()
    coef = []
    CP = []
    SP = []
    for k in range(2, 10):
        label, c = a.kmean(x=a.x, k=k)
        CP.append(a.CP(label, c, a.x))
        SP.append(a.SP(c))
        coef.append(a.contourCoefficient(label, c, a.x))
        print("* When k = %d done *" % k)

    plt.subplot(131)
    plt.plot(range(2, 10), SP, alpha=0.8, linestyle="--", color="red", label="SP")
    plt.xlabel("K")
    plt.legend()

    plt.subplot(132)
    plt.plot(range(2, 10), coef, alpha=0.8, linestyle=":", color="green", label="coef")
    plt.xlabel("K")
    plt.legend()

    plt.subplot(133)
    plt.plot(range(2, 10), CP, alpha=0.8, linestyle="-", color="blue", label="CP")
    plt.xlabel("K")
    plt.legend()
    plt.show()
4. Output results
4.1 Command line output
Explanation: the first output is the cluster assignment of all samples. The samples were not shuffled before clustering, so samples of the same species appear as contiguous blocks, and roughly speaking the result with k = 3 defined directly is quite good. The later output comes from the exhaustive-k method, which runs the clustering for each candidate k and computes the corresponding evaluation metrics.
4.2 Visualization
Explanation: different colors represent the different clusters, the red stars mark the cluster centers, and the plot shows the first and third feature columns (sepal length and petal length).
4.3 Metrics from the exhaustive-k method
Explanation: the first figure shows the spacing metric SP, which measures the average distance between cluster centers. It fluctuates noticeably as k increases and has a minimum at k = 3, meaning the cluster centers are closest together there; ideally we would like this metric to be large, indicating well-separated clusters. The second figure shows the silhouette coefficient, which decreases gradually as k increases. The third shows the compactness metric CP, which measures the distance of the samples in each cluster from their center and fluctuates slightly as k increases. None of these metrics, on its own, clearly identifies k = 3 (the true number of clusters) as a good choice, so perhaps they are not well suited to determining the number of clusters here.
Looking only at the silhouette coefficient: given that we know k = 3, a value closer to 1 is supposed to mean a more reasonable clustering, which is clearly not what we see. From the formula, when k = 1 the silhouette coefficient would be very close to 1 - does that mean "everyone is one family, all within the four seas are brothers"? Unlikely. And at k = 2 the coefficient is 0.75 - is splitting into 2 clusters really better? Knowing the true classification, obviously not. So: are the results of our statistical analysis really the same as reality?
First, k-means is an unsupervised algorithm: there are no prior labels on the data, it only measures the similarity of the samples and groups the most similar ones together, reflecting the idea that "things with closer attributes belong to the same kind", but it is limited by the choice of the initial k and the initial centers. Second, the evaluation metrics mainly measure the degree of within-cluster cohesion and between-cluster separation; as k grows, more samples end up "going their own way", so SP, CP and the silhouette coefficient tend to keep shrinking, and by their very meaning they cannot accurately reveal the number of clusters.
Furthermore, there are the limits of human cognition. We only know what we know; we cannot perceive objective laws directly, we can only observe whatever we are able to perceive and infer the laws from that. Our current knowledge is accumulated bit by bit from our predecessors' work with measuring tools and magnifying glasses, and what we think is the product of practice; tools and ideas are the crystallization of human activity. In other words, they have their shortcomings, but they are the best "family possessions" we can bring out.
The main point I want to make is: do not put blind faith in metrics when analyzing data, especially data whose true distribution is unknown. If these metrics could accurately and consistently reflect the real data, we would have long since dominated the universe and predicted the future.
Back to the earlier question: are the results of our statistical analysis really the same as reality? For now, these measurements and analysis methods are the best we can come up with, and as humanity progresses they will either evolve or become obsolete. These tools reflect reality only to a certain extent - perhaps very closely, perhaps very poorly... Until the objective laws are truly understood we will never know for sure; we can only label these tools as products of human activity, and perhaps future generations, looking back at the present, will find this amusing.
That concludes this article on implementing the k-means algorithm in Python. For more on implementing k-means in Python, please search my earlier articles or browse the related articles below, and I hope you will continue to support me in the future!