
Principles and implementation of k-means and k-means++ in Python

preamble

The k-means algorithm is an unsupervised clustering algorithm and is relatively simple to implement. k-means++ can be understood as an enhanced version of k-means that initializes the centroids in a friendlier way than k-means does.

k-means principle

The steps for the implementation of k-means are as follows:

  • Randomly select k points from the sample as cluster centroids
  • For each sample point, compute its distance to each of the k cluster centers and assign it to the cluster whose center is nearest, until all sample points have been assigned (clustered into k classes)
  • Take the mean of each cluster, and use each of the k means as the new center of its cluster
  • Repeat steps 2 and 3 until the center positions no longer change, or change by less than a threshold (a minimal sketch of one iteration follows this list)
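
As a minimal sketch of steps 2 and 3 (assignment and update), one iteration can be written with NumPy as below; this assumes X is an (n_samples, n_features) array and centers is a (k, n_features) array, ignores empty clusters, and is only meant to convey the idea. The full step-by-step implementation comes later in the article.

import numpy as np

def kmeans_iteration(X, centers):
    # Step 2: assign each sample to its nearest center (Euclidean distance)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = np.argmin(dists, axis=1)
    # Step 3: recompute each center as the mean of the samples assigned to it
    new_centers = np.array([X[labels == j].mean(axis=0) for j in range(centers.shape[0])])
    return labels, new_centers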

Pros:

  • The principle is simple and relatively easy to implement
  • Faster convergence and better clustering results

Drawbacks:

  • The initial center points are selected at random, so poor initial values may be chosen.

Principles of k-means++

k-means++ is an enhanced version of k-means. It initially selects clustering centroids that are as widely dispersed as possible, which effectively reduces the number of iterations and speeds up the computation. The implementation steps are as follows:

  • A point from the sample is randomly selected as the center of clustering
  • Calculate the distance from each sample point to the cluster center that has been selected, denoted by D(X): the larger D(X) is, the greater the probability that it will be selected as the next cluster center
  • The next clustering center is selected using a roulette wheel approach (the larger D(X), the higher the probability of being selected as a clustering center)
  • Repeat steps 2 and 3 until k clustering centers have been selected (a short sketch of steps 2 and 3 follows this list)
  • After selecting the k clustering centers, the clusters are clustered using the standard k-means algorithm
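
As a rough sketch of steps 2 and 3, the snippet below computes D(X), the distance from each sample to its nearest already-selected center, and then draws the next center with probability proportional to D(X). The function name is an illustrative assumption, and np.random.choice stands in here for the roulette-wheel selection explained below.

import numpy as np

def choose_next_center(X, selected_centers):
    # Step 2: D(X) = distance from each sample to the nearest already-selected center
    dists = np.linalg.norm(X[:, None, :] - np.asarray(selected_centers)[None, :, :], axis=2)
    d = dists.min(axis=1)
    # Step 3: pick the next center with probability proportional to D(X)
    probs = d / d.sum()
    return X[np.random.choice(len(X), p=probs)]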

It has to be clarified here that some of the literature states that the point farthest from the already-selected cluster centers is chosen as the next center. This statement is not quite accurate. More precisely, the point farthest from the already-selected cluster centers has the highest probability of being selected as the next center, but it is not necessarily that point, because always taking the maximum behaves badly on special data (for example, when a single point lies far away from every cluster).

In general, the initialization should keep some randomness, because the data itself is random.

Despite the extra time spent computing the initial centers, k-means itself converges faster during the iterations, so the algorithm actually reduces the overall computation time.

The focus is now on how the roulette-wheel method is used to select the next clustering center. We illustrate how k-means++ selects the initial clustering centers with an example.

If there are 8 samples in the dataset, the distribution and the corresponding ordinal numbers are shown below:

[Figure: distribution of the 8 sample points and their ordinal numbers]

We start by selecting point 6 as the first clustering center using step 1 of k-means++, and then proceed to step 2, which calculates the distance D(X) from each sample point to the clustering center that has been selected, as follows:

[Table: D(X), P(X), and the cumulative probability Sum for each sample point]

  • D(X) is the distance of each sample point from the selected cluster center (i.e., the first cluster center)
  • P(X) is the probability that each sample is selected as the next clustering center
  • Sum is the cumulative sum of the probabilities P(X), which is used by the roulette-wheel method to select the second clustering center

Then perform the third step of k-means++: use the roulette-wheel method to select the next clustering center. The method is to generate a random number between 0 and 1, determine which interval it falls into, and the point whose interval contains it becomes the second clustering center.

In the above table, point 1's interval is [0, 0.2), point 2's interval is [0.2, 0.525), and point 4's interval is [0.65, 0.9).

From the table it can be seen intuitively that the probabilities of points 1, 2, 3, and 4 sum to 0.9, and these four points happen to be the four points farthest from the first initial clustering center (point 6). The second clustering center therefore falls on one of these four points with high probability, with point 2 having the highest probability of being selected as the next clustering center.
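
To make the interval lookup concrete, here is a small sketch of the roulette-wheel selection. The D(X) values below are hypothetical: they are chosen so that points 1-4 reproduce the intervals above, while the values for points 5, 7 and 8 are made up purely for illustration.

import numpy as np

def roulette_select(d_values):
    # Normalize D(X) into probabilities P(X) and build the cumulative sums (the "Sum" column)
    p = np.asarray(d_values, dtype=float)
    p = p / p.sum()
    cum = np.cumsum(p)
    # Spin the wheel: a random number in [0, 1) falls into exactly one interval
    r = np.random.random()
    return int(np.searchsorted(cum, r, side='right'))

# Hypothetical D(X) values for the 8 points (point 6 is the first center, so its D(X) is 0);
# points 1-4 reproduce the intervals above, the values for points 5, 7 and 8 are made up.
d = [4, 6.5, 2.5, 5, 1, 0, 0.5, 0.5]
print(roulette_select(d))  # returns a 0-based index; 1 (i.e., point 2) comes up most often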

k-means and k-means++ code implementation

Here the center points are stored as sample features (not indices) to make the calculations easier, while the members of each cluster (the points around a center) are stored as sample indices.

k-means implementation

# Define the Euclidean distance
import numpy as np
def get_distance(x1, x2):
    return np.sqrt(np.sum(np.square(x1 - x2)))
import random
# Define the center initialization function, the center point is chosen as the sample feature
def center_init(k, X):
    n_samples, n_features = np.shape(X)
    centers = np.zeros((k, n_features))
    selected_centers_index = []
    for i in range(k):
        # On each pass, randomly select one sample as a center, making sure centers are not duplicated
        sel_index = random.choice(list(set(range(n_samples)) - set(selected_centers_index)))
        centers[i] = X[sel_index]
        selected_centers_index.append(sel_index)
    return centers
# Determine which center a sample point is closest to; the index of that center is returned
## For example, if there are three center points, the return is 0, 1, 2 ##
def closest_center(sample, centers):
    closest_i = 0
    closest_dist = float('inf')
    for i, c in enumerate(centers):
        # Select the category to which the centroid with the smallest distance belongs based on the Euclidean distance judgment
        distance = get_distance(sample, c)
        if distance < closest_dist:
            closest_i = i
            closest_dist = distance
    return closest_i
# Define the process of constructing clusters
# Each cluster stores the indices of its samples, i.e., the clustering is done on sample indices for ease of operation
def create_clusters(centers, k, X):
    clusters = [[] for _ in range(k)]
    for sample_i, sample in enumerate(X):
        # Classify samples into the nearest category area
        center_i = closest_center(sample, centers)
        # Indexes where samples are stored
        clusters[center_i].append(sample_i)
    return clusters
# Calculate new centroids based on the clustering results from the previous step
def calculate_new_centers(clusters, k, X):
    n_samples, n_features = np.shape(X)
    centers = np.zeros((k, n_features))
    # With the current mean of each class sample as the new center point
    for i, cluster in enumerate(clusters):  # cluster is the index of each class after categorization
        new_center = np.mean(X[cluster], axis=0)  # Average by column
        centers[i] = new_center
    return centers
# Get the clustering category to which each sample belongs
def get_cluster_labels(clusters, X):
    y_pred = np.zeros(np.shape(X)[0])
    for cluster_i, cluster in enumerate(clusters):
        for sample_i in cluster:
            y_pred[sample_i] = cluster_i
            #print('Group the sample {} into {} classes'.format(sample_i,cluster_i))
    return y_pred
# Define the kmeans algorithm flow based on each of the above processes
def Mykmeans(X, k, max_iterations,init):
    # 1. Initialize the center point
    if init == 'kmeans':
        centers = center_init(k, X)
    else: centers = get_kmeansplus_centers(k, X)
    # Iterate over the solution
    for _ in range(max_iterations):
        # 2. Clustering based on current centroids
        clusters = create_clusters(centers, k, X)
        # Save the current center point
        pre_centers = centers
        # 3. Calculate new centroids based on clustering results
        new_centers = calculate_new_centers(clusters, k, X)
        # 4. Set the convergence condition as whether the center points change or not
        diff = new_centers - pre_centers
        # Update the centers for the next iteration
        centers = new_centers
        # If the center points have not changed, stop updating
        if np.all(diff == 0):
            break
    # Return the final clustering labels
    return get_cluster_labels(clusters, X)
# Test execution
X = np.array([[0,2],[0,0],[1,0],[5,0],[5,2]])
# Set the clustering categories to 2 and the maximum number of iterations to 10
labels = Mykmeans(X, k = 2, max_iterations = 10,init = 'kmeans')
# Print the label of the category to which each sample belongs
print("Final categorized results",labels)
## The output is [1. 1. 1. 0. 0.]
# Verify with sklearn
from sklearn.cluster import KMeans
X = np.array([[0,2],[0,0],[1,0],[5,0],[5,2]])
kmeans = KMeans(n_clusters=2,init = 'random').fit(X)
# The results may differ because of the randomness of the center initialization
print(kmeans.labels_)

k-means++ implementation

## Get the k-means++ center points
def get_kmeansplus_centers(k, X):
    n_samples, n_features = np.shape(X)
    # Randomly select the first clustering center
    init_one_center_i = random.choice(range(n_samples))
    centers = []
    centers.append(X[init_one_center_i])
    dists = [0 for _ in range(n_samples)]

    # Execution
    for _ in range(k-1):
        total = 0
        for sample_i,sample in enumerate(X):
            # Get the distance to the nearest already-selected center
            closest_i = closest_center(sample, centers)
            d = get_distance(centers[closest_i], sample)
            dists[sample_i] = d
            total += d
        total = total * np.random.random()

        for sample_i,d in enumerate(dists): # Roulette wheel method to select the next clustering center
            total -= d
            if total > 0:
                continue
            # Pick a new center point
            centers.append(X[sample_i])
            break
    return centers
X = np.array([[0,2],[0,0],[1,0],[5,0],[5,2]])
# Set the clustering categories to 2 and the maximum number of iterations to 10
labels = Mykmeans(X, k = 2, max_iterations = 10,init = 'kmeans++')
print("Final categorized results",labels)
## The output is [1. 1. 1. 0. 0.]
# Verify with sklearn
X = np.array([[0,2],[0,0],[1,0],[5,0],[5,2]])
kmeans = KMeans(n_clusters=2,init='k-means++').fit(X)
print(kmeans.labels_)

reference document

K-means and K-means++
K-means principles, optimization and applications

This concludes this article on the principles and implementation of k-means and k-means++ in Python. For more related content on k-means and k-means++ in Python, please search my previous articles or continue to browse the related articles below. I hope you will support me in the future!