
An in-depth analysis of the principles of the Python Kmeans algorithm

I. Overview

First, a brief introduction to unsupervised learning. In unsupervised learning, the training samples carry no label information, and the goal is to reveal the intrinsic properties and regularities of the data by learning from unlabeled samples. In layman's terms, it finds the underlying patterns of the data based on the data's own intrinsic properties. The most widely used family of algorithms in this category is "clustering".

Clustering algorithms can be used for data reduction, i.e., reducing the volume of the data while keeping it as complete as possible for subsequent processing. The results of clustering can also be applied or analyzed directly.

The Kmeans algorithm is one of the more basic clustering algorithms.

II. Starting with a sample

We now have the following points on the two-dimensional plane:

   x         y
1.658985    4.285136
-3.453687    3.424321
4.838138    -1.151539
-5.379713    -3.362104
0.972564    2.924086
-3.567919    1.531611
0.450614    -3.302219
-3.487105    -1.724432
2.668759    1.594842
-3.156485    3.191137
3.165506    -3.999838
-2.786837    -3.099354
4.208187    2.984927
-2.123337    2.943366
0.704199    -0.479481
-0.392370    -3.963704
2.831667    1.574018
-0.790153    3.343144
2.943496    -3.357075
...

Their distribution on the two-dimensional plane looks roughly like this:

[Figure: scatter plot of the sample points]

The dots appear to fall vaguely into 4 "clusters", so we can assume that the data is meant to be grouped into 4 "clusters". (Although we can "see" that 4 is a natural split, the data could in fact be divided differently, say into 5.) Here the "4 clusters" is something we can eyeball; in practice, the number should be determined by the machine, as described below.

After settling on the 4 "clusters", we need to find the center of each one. We can "see" the approximate centers, but the machine cannot. So how does the machine find them? The answer is through vector distance, also interpreted as vector similarity. There are many ways to measure this similarity, such as Euclidean distance, Manhattan distance, and Chebyshev distance; by computing such a distance, the machine can judge how close two points are.

Here we use the Euclidean distance, which is simply the straight-line distance between two points in space.
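For instance, the distance between the first and fifth sample points above works out like this (a quick check in plain Python):

import math

# Two of the sample points listed above
a = (1.658985, 4.285136)
b = (0.972564, 2.924086)

# Straight-line distance: sqrt((x1 - x2)^2 + (y1 - y2)^2)
dist = math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)
print(dist)  # roughly 1.524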

Knowing this, we can start to get the machine to calculate the 4 "clusters".

The main approach is to generate 4 random points and assume they are the centers of the 4 "clusters". Then compute the distance from every point in the plane to the 4 centers, and let each point take the nearest center as its own.

At this point we have completed the first step: all points in the plane are divided into 4 "clusters". But since the centers were chosen at random, these 4 clusters are obviously not the result we want. What should we do? The following:

Now that there are 4 clusters, compute a new centroid for each cluster from all the points belonging to it. The new centroid is better than the old one because it actually lies at the center of its cluster. Then repeat the first step with the new centroids, i.e., compute the distances again for all the points in the plane and redistribute them among the 4 new clusters. Keep iterating until the assignments stabilize and the error stops shrinking.

This is how the Kmeans algorithm works.

III. Analysis of knowledge points

3.1 Determining the number of "clusters"

The 4 clusters mentioned above are in fact the K in Kmeans, and the first thing to do is choose a K as the number of clusters. In the example above the choice was subjective, but in most cases we cannot tell intuitively which K is better. So what should we do?

We can start from a small K value and run a simplified version of the Kmeans algorithm (i.e., fewer iterations; not the best result, but the fastest). Compute the distance from each point to the center of the "cluster" it belongs to, then sum these distances; the sum is the error value.

Then increase the K value and compute the error again. For example, in the case above we can start with K=2 and compute the error of the Kmeans algorithm for K values from 2 to 7.
This produces a graph similar to the following:

[Figure: error versus K]

The Error here can be read as the error of the Kmeans result: as the data is divided into more and more "clusters", the error gets smaller and smaller.

But aren't more clusters always better? No. Too many clusters can make it harder to obtain the result we actually want, or to carry out the next step of processing.

So we usually choose the critical point where the rate of error reduction levels off, such as K=4 in the figure above.

Notice that after dividing into 4 clusters, increasing the number of clusters does not reduce the error by much. Choosing 4 clusters is also consistent with what we saw by eye.

3.2 Euclidean distance

Euclidean distance is a commonly adopted definition of distance: the true distance between two points in m-dimensional space. It is calculated as follows:

d(a, b) = sqrt( (a_1 - b_1)^2 + (a_2 - b_2)^2 + ... + (a_m - b_m)^2 )

This example lives in two-dimensional space, so the formula for this example is:

d = sqrt( (x_1 - x_2)^2 + (y_1 - y_2)^2 )

For instance, for the first and fifth sample points, d = sqrt((1.658985 - 0.972564)^2 + (4.285136 - 2.924086)^2) ≈ 1.524, matching the check in Section II.

IV. Code and results

First, the code that loads the data. It uses numpy to load the data into the matrix type.

import numpy as np

def loadDataSet(fileName):
  '''
  Load the dataset
  :param fileName: path to a tab-separated data file
  :return: the data as a numpy matrix
  '''
  # Initialize an empty list
  dataSet = []
  # Open the file
  fr = open(fileName)
  # Loop over all lines of the file
  for line in fr.readlines():
    # Split each row of data on tabs
    curLine = line.strip().split('\t')
    # Convert the values to floating-point type for later calculations
    fltLine = list(map(float, curLine))
    # Append the row to dataSet
    dataSet.append(fltLine)
  # Return the data as a matrix
  return np.mat(dataSet)
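If the sample points above are saved as tab-separated rows in a text file, loading them looks like this (the file name here is a hypothetical placeholder):

dataMat = loadDataSet('testSet.txt')  # hypothetical file name
print(np.shape(dataMat))  # (number of samples, 2)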

The next step is to generate k initial centroids, i.e., cluster centers. Here the k centroids are generated randomly within the bounds of the data.

def randCent(dataMat, k):
  '''
  Construct a set of k random centroids for the given dataset.
  The random centroids must lie within the bounds of the whole dataset, which is done by
  finding the minimum and maximum value of each dimension, then generating random numbers
  between 0 and 1.0 and scaling them by the range plus the minimum, to ensure the random
  points fall within the data's bounds.
  :param dataMat: dataset
  :param k: number of centroids
  :return: a (k, n) matrix of random centroids
  '''
  # Get the sample count and number of features
  m, n = np.shape(dataMat)
  # Initialize the centroids: create a (k, n) zero-filled matrix
  centroids = np.mat(np.zeros((k, n)))
  # Loop over the features
  for j in range(n):
    # Compute the minimum value of each column
    minJ = min(dataMat[:, j])
    # Compute the range of each column
    rangeJ = float(max(dataMat[:, j]) - minJ)
    # Generate the centroid coordinates for each column and assign them to centroids
    centroids[:, j] = minJ + rangeJ * np.random.rand(k, 1)
  # Return the centroids
  return centroids
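A quick sanity check of what randCent produces (not part of the algorithm itself; the file name is the same hypothetical placeholder):

dataMat = loadDataSet('testSet.txt')   # hypothetical file name
print(randCent(dataMat, 4))            # 4 random points...
print(dataMat.min(0), dataMat.max(0))  # ...inside these per-column bounds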

Euclidean distance calculation:

def distEclud(vecA, vecB):
  '''
  Euclidean distance calculation function
  :param vecA: first vector
  :param vecB: second vector
  :return: the Euclidean distance between vecA and vecB
  '''
  return np.sqrt(np.sum(np.power(vecA - vecB, 2)))
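A quick usage check; the result matches the hand computation from Section II, and numpy's built-in norm gives the same value:

vecA = np.mat([[1.658985, 4.285136]])
vecB = np.mat([[0.972564, 2.924086]])
print(distEclud(vecA, vecB))        # roughly 1.524
print(np.linalg.norm(vecA - vecB))  # same value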

The cost method runs a simplified kMeans, i.e., with fewer iterations, and computes the error (the distance from each point to the centroid of its cluster), which will be used later to evaluate the quality of the clustering.

def cost(dataMat, k, distMeas=distEclud, createCent=randCent, iterNum=300):
  '''
  Compute the amount of error; this method is used to judge which k is more suitable.
  It is in fact a simplified version of kMeans.
  :param dataMat: dataset
  :param k: number of clusters
  :param distMeas: distance function
  :param createCent: function that creates the initial centroids
  :param iterNum: iteration budget (decremented once per point update, so the loop
                  makes roughly iterNum / m passes over the data)
  :return: the total squared error after the iteration budget is used up
  '''
  # Get the sample count and number of features
  m, n = np.shape(dataMat)
  # Initialize a matrix to store the cluster assignment of each point.
  # clusterAssment has two columns: the first records the cluster index, the second stores
  # the error (the squared distance from the point to its cluster's centroid,
  # used later to evaluate the clustering).
  clusterAssment = np.mat(np.zeros((m, 2)))
  # Create k random centroids
  centroids = createCent(dataMat, k)
  while iterNum > 0:
    # Iterate over all the data and find the centroid closest to each point:
    # for every point, traverse all the centroids and compute the distance to each one
    for i in range(m):
      minDist = np.inf
      minIndex = -1
      for j in range(k):
        # Compute the distance from the data point to the centroid, using the distance
        # function given by the distMeas parameter (the default is distEclud)
        distJI = distMeas(centroids[j, :], dataMat[i, :])
        # If the distance is smaller than minDist, update minDist and the closest centroid's index
        if distJI < minDist:
          minDist = distJI
          minIndex = j
      # Record the assignment: the index of the closest centroid and the squared distance
      clusterAssment[i, :] = minIndex, minDist ** 2
      # Spend one unit of the iteration budget per point update
      iterNum -= 1
    # Iterate over all the centroids and update their values
    for cent in range(k):
      # Get all the points of the given cluster by filtering the data
      ptsInClust = dataMat[np.nonzero(clusterAssment[:, 0].A == cent)[0]]
      # Take the mean of those points; axis=0 averages along the columns of the matrix
      centroids[cent, :] = np.mean(ptsInClust, axis=0)
  # Return the total error after the iteration budget is exhausted
  return clusterAssment[:, 1].sum(0)[0, 0]
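With cost in hand, the elbow search described in section 3.1 becomes a short loop. A minimal sketch, assuming matplotlib is available and reusing the hypothetical file name from above:

import matplotlib.pyplot as plt

dataMat = loadDataSet('testSet.txt')  # hypothetical file name
ks = range(2, 8)
# Compute the simplified-Kmeans error for each candidate K
# (with unlucky random centroids a cluster can come up empty; rerun in that case)
errors = [cost(dataMat, k) for k in ks]

plt.plot(ks, errors, marker='o')
plt.xlabel('K')
plt.ylabel('Error')
plt.show()  # look for the elbow where the curve flattens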

Finally, the Kmeans algorithm can be invoked to perform the calculation.

def kMeans(dataMat, k, distMeas=distEclud, createCent=randCent):
  '''
  Create k centroids, then assign each point to the nearest centroid and recompute the centroids.
  This process is repeated until the cluster assignments of the points no longer change.
  :param dataMat: dataset
  :param k: number of clusters
  :param distMeas: distance function
  :param createCent: function that creates the initial centroids
  :return: the centroids and the cluster assignment of each point
  '''
  # Get the sample count and number of features
  m, n = np.shape(dataMat)
  # Initialize a matrix to store the cluster assignment of each point.
  # clusterAssment has two columns: the first records the cluster index, the second stores
  # the error (the squared distance from the point to its cluster's centroid,
  # used later to evaluate the clustering).
  clusterAssment = np.mat(np.zeros((m, 2)))
  # Create k random centroids
  centroids = createCent(dataMat, k)
  # Initialize the flag that decides whether to keep iterating; True means continue
  clusterChanged = True
  while clusterChanged:
    clusterChanged = False
    # Iterate over all the data and find the centroid closest to each point:
    # for every point, traverse all the centroids and compute the distance to each one
    for i in range(m):
      minDist = np.inf
      minIndex = -1
      for j in range(k):
        # Compute the distance from the data point to the centroid, using the distance
        # function given by the distMeas parameter (the default is distEclud)
        distJI = distMeas(centroids[j, :], dataMat[i, :])
        # If the distance is smaller than minDist, update minDist and the closest centroid's index
        if distJI < minDist:
          minDist = distJI
          minIndex = j
      # If any point's cluster assignment changes, raise the clusterChanged flag
      if clusterAssment[i, 0] != minIndex: clusterChanged = True
      # Record the assignment: the index of the closest centroid and the squared distance
      clusterAssment[i, :] = minIndex, minDist ** 2
    # Iterate over all the centroids and update their values
    for cent in range(k):
      # Get all the points of the given cluster by filtering the data
      ptsInClust = dataMat[np.nonzero(clusterAssment[:, 0].A == cent)[0]]
      # Take the mean of all those points; axis=0 averages along the columns of the matrix
      centroids[cent, :] = np.mean(ptsInClust, axis=0)
  # Return all the centroids and the point assignments
  return centroids, clusterAssment
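Finally, a minimal driver that runs kMeans and draws a plot like the ones below, with the centroids as red squares and one color per cluster (the file name is again a hypothetical placeholder):

import matplotlib.pyplot as plt

dataMat = loadDataSet('testSet.txt')  # hypothetical file name
k = 4
centroids, clusterAssment = kMeans(dataMat, k)

# Draw each cluster in its own color
for cent in range(k):
  ptsInClust = dataMat[np.nonzero(clusterAssment[:, 0].A == cent)[0]]
  plt.scatter(ptsInClust[:, 0].A.ravel(), ptsInClust[:, 1].A.ravel())
# Draw the centroids as red squares
plt.scatter(centroids[:, 0].A.ravel(), centroids[:, 1].A.ravel(), c='red', marker='s')
plt.show()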

How much does the choice of K affect the results? Let's take a look. Below are the results for K values from 2 to 6.
The red square in each figure is the centroid of a cluster, and the points belonging to each cluster are shown in different colors.

K = 2

K = 3

K = 4

K = 5

K = 6

That is all for this article.