SoFunction
Updated on 2024-11-18

Detailed implementation of K-means algorithm in Python

Introduction to the K-means algorithm

K-means is one of the more commonly used algorithms in machine learning. It is an unsupervised learning algorithm, often used for data clustering: you only need to specify the number of clusters, and it automatically groups the data into that many classes. Data within the same cluster have a high degree of similarity, while data in different clusters have a low degree of similarity.

The k-means algorithm takes as input the number of clusters k and a dataset containing n data objects, and outputs k clusters that minimize the within-cluster variance. Given k, it partitions the n data objects into k clusters so that the resulting clusters satisfy two criteria: similarity among objects in the same cluster is high, and similarity among objects in different clusters is low.

Core idea

Iteratively find a partition of the data into k clusters such that the overall error incurred by representing each sample with the mean of its cluster is minimized.

The resulting k clusters have the following property: each cluster is as compact as possible, while different clusters are as separate as possible.
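
Formally, this "compact within, separate between" criterion is the within-cluster sum of squared errors that k-means minimizes:

```latex
J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x
```

where C_i is the i-th cluster and mu_i is its center (the mean of its samples).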

The k-means algorithm is based on the least-sum-of-squared-errors criterion. Pros and cons of k-means:

Pros:

Simple principle
Fast
Scales well to large datasets

Drawbacks:

Need to specify the number of clusters K
Sensitive to outliers
Sensitive to initial values

The clustering process of K-means

The clustering process is similar to a gradient descent algorithm: it builds a cost function and iterates to make the cost function value smaller and smaller.

1. Choose suitable initial centers for the k clusters;
2. In each iteration, for every sample, compute its distance to each of the k centers and assign the sample to the cluster whose center is nearest;
3. Update each cluster's center, e.g. as the mean of its samples;
4. If none of the k cluster centers change after an update via steps (2) and (3), the iteration ends; otherwise, continue iterating.

The greatest advantage of the algorithm is its simplicity and speed. The key to the algorithm is the selection of the initial center and the distance formula.
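
The four steps above can be sketched in plain NumPy. This is a minimal illustration of the procedure (with random samples as initial centers), not the scikit-learn implementation:

```python
import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    """Plain k-means following the four steps above."""
    rng = np.random.RandomState(seed)
    # 1. pick k distinct samples as the initial centers
    centers = data[rng.choice(len(data), k, replace=False)]
    for _ in range(max_iter):
        # 2. assign every sample to its nearest center
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # 3. move each center to the mean of its assigned samples
        #    (an empty cluster keeps its old center)
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # 4. stop when the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

labels, centers = kmeans(np.random.rand(100, 2), k=3)
```

The distance formula here is squared Euclidean distance, which is what the least-sum-of-squared-errors criterion implies.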

K-means Example Demonstration

The parameters of the KMeans constructor in scikit-learn:

KMeans(
  n_clusters=8,
  init='k-means++',
  n_init=10,
  max_iter=300,
  tol=0.0001,
  precompute_distances='auto',
  verbose=0,
  random_state=None,
  copy_x=True,
  n_jobs=1,
  algorithm='auto'
)
n_clusters: number of clusters, i.e. how many classes you want the data grouped into.
init: method for choosing the initial cluster centers.
n_init: number of times the algorithm is run with different initial centers. To compensate for a poor initial choice, the algorithm by default initializes 10 sets of centers, runs to convergence for each, and returns the best result.
max_iter: maximum number of iterations (the k-means algorithm is implemented iteratively).
tol: tolerance, i.e. the condition under which the k-means stopping criterion is considered converged.
precompute_distances: whether to compute distances in advance; this parameter trades space against time. If True, the entire distance matrix is kept in memory; 'auto' falls back to False when n_samples * n_clusters exceeds about 12e6; with False, the core implementation computes distances on the fly in Cython.
verbose: verbosity level (I am not sure exactly what it controls; I usually leave the default anyway).
random_state: the random state used to generate the initial cluster centers.
copy_x: a flag for whether the input data may be modified. If True, the data is copied and the original is not modified. Many scikit-learn interfaces have this bool parameter, i.e. whether to copy the input data so that the user's data is left unchanged. This is clearer if you understand Python's memory model.
n_jobs: degree of parallelism.
algorithm: the k-means implementation to use, one of 'auto', 'full', and 'elkan', where 'full' denotes the classic EM-style implementation.

Although there are many parameters, all of them have default values, so we generally do not need to pass them in and can set them only as needed.

A code example is shown below

from sklearn.cluster import KMeans
import numpy as np

# Generate a 10x3 matrix of random samples
data = np.random.rand(10, 3)
print(data)
# Cluster into 4 classes
estimator = KMeans(n_clusters=4)
# fit_predict means fit + predict; the two can also be called separately
res = estimator.fit_predict(data)
# Predicted category labels
label_pred = estimator.labels_
# Cluster center of each category
centroids = estimator.cluster_centers_
# Sum of squared distances of samples to their nearest cluster center
inertia = estimator.inertia_

print(label_pred)
print(centroids)
print(inertia)

Code execution results (the values will vary, since the input data is random):
[0 2 1 0 2 2 0 3 2 0]

[[ 0.3028348  0.25183096 0.62493622]
 [ 0.88481287 0.70891813 0.79463764]
 [ 0.66821961 0.54817207 0.30197415]
 [ 0.11629904 0.85684903 0.7088385 ]]
 
0.570794546829
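
As a sanity check on what inertia_ means, it can be recomputed by hand as the sum of squared distances from each sample to its nearest cluster center. A small sketch (the fixed random_state is only there to make the run reproducible):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
data = rng.rand(10, 3)
est = KMeans(n_clusters=4, n_init=10, random_state=0).fit(data)

# Squared distance from every sample to every center,
# take the nearest center per sample, then sum
sq_dists = ((data[:, None, :] - est.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
manual_inertia = sq_dists.min(axis=1).sum()

print(manual_inertia, est.inertia_)  # the two values agree
```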

To describe the result more intuitively, let us plot a demonstration this time. Since a two-dimensional image is more intuitive, the data is adjusted to two dimensions, 100 points are selected and plotted, and the number of clusters is set to 3.

from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

data = np.random.rand(100, 2)
estimator = KMeans(n_clusters=3)
res = estimator.fit_predict(data)
label_pred = estimator.labels_
centroids = estimator.cluster_centers_
inertia = estimator.inertia_
# print(res)
print(label_pred)
print(centroids)
print(inertia)

# Plot each point in a color determined by its cluster label
for i in range(len(data)):
    if int(label_pred[i]) == 0:
        plt.scatter(data[i][0], data[i][1], color='red')
    if int(label_pred[i]) == 1:
        plt.scatter(data[i][0], data[i][1], color='black')
    if int(label_pred[i]) == 2:
        plt.scatter(data[i][0], data[i][1], color='blue')
plt.show()

You can see that the clustering effect is quite good. To test the clustering efficiency of k-means, the dimensionality was widened to 50 dimensions and the fitting time measured:

Data size    Data dimensions    Elapsed time
10,000       50                 4 s
100,000      50                 30 s
1,000,000    50                 4 min 13 s

For millions of data points, the fitting time is still acceptable, so the efficiency is quite good. Saving the model works as for other machine learning algorithms:

import joblib
joblib.dump(km, "model/km_model.m")
# Load it back later with:
km = joblib.load("model/km_model.m")

Summary

That is all of this article on the detailed implementation of the K-means algorithm in Python. I hope it helps. Interested readers can also refer to these articles on this site:

Python Implementation of Scheduling Algorithm Code Details

Python algorithm outputs all the arithmetic equations formed by arrays 1-9 that result in 100

Python Programming Implementation of Ant Colony Algorithm Details

If there are any deficiencies, feel free to leave a comment pointing them out. Thank you all for supporting this site!