Introduction to the K-means algorithm
K-means is one of the most commonly used algorithms in machine learning. It belongs to the unsupervised learning algorithms and is typically used for data clustering: you only need to specify the number of clusters, and it automatically partitions the data into that many classes, so that samples within the same cluster are highly similar while samples in different clusters are dissimilar.
The K-means algorithm takes as input the number of clusters k and a dataset containing n data objects, and outputs k clusters that satisfy a minimum-variance criterion. Given k, it divides the n data objects into k clusters so that the resulting clusters meet two criteria: high similarity among objects within the same cluster, and low similarity among objects in different clusters.
Core idea
The algorithm iteratively searches for a partition into k clusters such that, when the mean of each cluster is used to represent the samples of that cluster, the overall error is minimized.
The k clusters have the following characteristic: each cluster is internally as compact as possible, while the clusters are as separated from one another as possible.
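Concretely, the quantity being minimized is the within-cluster sum of squared errors (SSE). A tiny NumPy sketch (with made-up points and a fixed assignment, purely for illustration) computes it:

```python
import numpy as np

# Toy data: 6 two-dimensional points, assumed already assigned to 2 clusters
data = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                 [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Each cluster is represented by the mean of its members
centers = np.array([data[labels == k].mean(axis=0) for k in range(2)])

# Within-cluster sum of squared errors (what sklearn reports as inertia_)
sse = sum(np.sum((data[labels == k] - centers[k]) ** 2) for k in range(2))
print(sse)  # total SSE for this partition
```

K-means repeatedly re-assigns points and re-computes the means so that this value shrinks.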
The K-means algorithm is based on the minimum sum-of-squared-errors criterion.
Pros and cons of K-means:
Pros:
Simple principle
Fast
Scales well to large datasets
Drawbacks:
Need to specify the number of clusters K
Sensitivity to outliers
Sensitive to initial values
The clustering process of K-means
The clustering process is similar to a gradient-descent algorithm: it builds a cost function and iterates so that the value of the cost function becomes smaller and smaller:
1. Choose k initial cluster centers appropriately;
2. In each iteration, for every sample, compute its distance to each of the k centers and assign the sample to the class whose center is nearest;
3. Update the center of each class, for example by taking the mean of its members;
4. If none of the k cluster centers change after an update via steps (2) and (3), the iteration ends; otherwise, continue iterating.
The greatest advantages of the algorithm are its simplicity and speed. The keys to the algorithm are the selection of the initial centers and the distance formula.
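The four steps above can be sketched from scratch in NumPy. This is a minimal illustration of the procedure (the function name `kmeans_fit` and the toy data are made up here), not how scikit-learn implements it:

```python
import numpy as np

def kmeans_fit(data, k, max_iter=100, seed=0):
    """Minimal K-means sketch: returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k initial centers at random from the samples
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest center
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its members
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two well-separated random blobs should be recovered as two clusters
rng = np.random.default_rng(1)
pts = np.vstack([rng.random((20, 2)), rng.random((20, 2)) + 5])
centers, labels = kmeans_fit(pts, k=2)
print(np.round(centers, 2))
```

The distance used is the ordinary Euclidean distance; swapping in a different distance formula changes the behavior of the algorithm, which is why the text calls it one of the keys.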
K-means Example Demonstration
Some parameters of scikit-learn's KMeans in Python (this list is from an older scikit-learn release; a few of these parameters have since been deprecated or removed):
KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')
n_clusters: the number of clusters, i.e. how many classes you want to cluster the data into.
init: the method used to obtain the initial cluster centers.
n_init: the number of times the algorithm is run with different initial centers. To compensate for a poor initial choice, the algorithm by default initializes 10 sets of centers, runs to convergence for each, and returns the best result.
max_iter: the maximum number of iterations (the K-means algorithm is implemented iteratively).
tol: the tolerance, i.e. the condition for deciding that the algorithm has converged.
precompute_distances: whether to compute distances in advance. This parameter trades space against time: if True, the entire distance matrix is kept in memory; 'auto' falls back to False when features * samples exceeds 12e6; when False, the core computation is implemented in Cython.
verbose: the verbosity level (the default is usually left unchanged).
random_state: the random state used to generate the initial cluster centers.
copy_x: a flag for whether the input data may be modified. Many scikit-learn interfaces have this parameter: if True, the input data is copied and not modified in place, so the user's data is left intact. Understanding Python's memory model makes this clearer.
n_jobs: the degree of parallelism.
algorithm: the K-means implementation to use, one of 'auto', 'full', 'elkan', where 'full' denotes the classic EM-style implementation.
Although there are many parameters, all of them have default values, so we generally do not need to pass them in; they can be set as needed.
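As a sketch of how a few of these defaults might be overridden (the values below are arbitrary examples, not recommendations):

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(100, 3)  # 100 random samples in 3 dimensions

# Override a few defaults; the chosen values are arbitrary examples
km = KMeans(
    n_clusters=5,      # cluster into 5 classes instead of the default 8
    init='k-means++',  # smarter initial centers than purely random ones
    n_init=10,         # run 10 initializations and keep the best result
    max_iter=500,      # allow more iterations before giving up
    tol=1e-5,          # tighter convergence tolerance
    random_state=42,   # reproducible initial cluster centers
)
km.fit(data)
print(km.cluster_centers_.shape)  # (5, 3)
```

Deprecated parameters such as precompute_distances and n_jobs are deliberately left out here so the snippet also runs on newer scikit-learn versions.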
A code example is shown below
```python
import numpy as np
from sklearn.cluster import KMeans

# Generate a 10x3 matrix of random data
data = np.random.rand(10, 3)
print(data)

# Cluster into 4 classes
estimator = KMeans(n_clusters=4)
# fit_predict means fit + predict; the two can also be called separately
res = estimator.fit_predict(data)
# Predicted class label for each sample
label_pred = estimator.labels_
# Cluster center of each class
centroids = estimator.cluster_centers_
# Sum of squared distances of samples to their cluster centers
inertia = estimator.inertia_
print(label_pred)
print(centroids)
print(inertia)
```

Example output (the data is random, so your values will differ):

```
[0 2 1 0 2 2 0 3 2 0]
[[ 0.3028348   0.25183096  0.62493622]
 [ 0.88481287  0.70891813  0.79463764]
 [ 0.66821961  0.54817207  0.30197415]
 [ 0.11629904  0.85684903  0.7088385 ]]
0.570794546829
```
To describe the result more intuitively, this time we demonstrate it on a graph. Since a two-dimensional plot is more intuitive, the data is reduced to two dimensions; 100 points are drawn and clustered into 3 classes.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

data = np.random.rand(100, 2)
estimator = KMeans(n_clusters=3)
res = estimator.fit_predict(data)
label_pred = estimator.labels_
centroids = estimator.cluster_centers_
inertia = estimator.inertia_
print(label_pred)
print(centroids)
print(inertia)

# Plot each sample in the color of its cluster
for i in range(len(data)):
    if int(label_pred[i]) == 0:
        plt.scatter(data[i][0], data[i][1], color='red')
    if int(label_pred[i]) == 1:
        plt.scatter(data[i][0], data[i][1], color='black')
    if int(label_pred[i]) == 2:
        plt.scatter(data[i][0], data[i][1], color='blue')
plt.show()
```
You can see that the clustering effect is quite good. A test of the clustering efficiency of K-means was then conducted, widening the dimensionality to 50 dimensions:
| Data size | Elapsed time | Data dimensions |
|---|---|---|
| 10,000 | 4 s | 50 |
| 100,000 | 30 s | 50 |
| 1,000,000 | 4 min 13 s | 50 |
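A rough version of this efficiency test can be sketched as follows. Exact times depend heavily on hardware and the scikit-learn version, so treat the numbers in the table as indicative only; scale `n` up toward the table's sizes to reproduce it:

```python
import time
import numpy as np
from sklearn.cluster import KMeans

# Increase the sizes toward 10**5 and 10**6 to mirror the table above
for n in (1_000, 10_000):
    data = np.random.rand(n, 50)  # n samples, 50 dimensions
    start = time.time()
    KMeans(n_clusters=4, n_init=10, random_state=0).fit(data)
    print(n, "samples:", round(time.time() - start, 2), "s")
```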
For millions of samples the fitting time is still acceptable, so the efficiency is quite good. Saving the model is similar to other machine learning algorithms:
```python
import joblib  # in older scikit-learn: from sklearn.externals import joblib

joblib.dump(estimator, "model/km_model.m")  # persist the fitted KMeans model
```
Summary
The above is the full content of this article on implementing the K-means algorithm in Python in detail; I hope it helps you. Interested readers can also refer to these articles on this site:
Python Implementation of Scheduling Algorithm Code Details
Python algorithm outputs all the arithmetic equations formed by arrays 1-9 that result in 100
Python Programming Implementation of Ant Colony Algorithm Details
If there are any shortcomings, please feel free to leave a comment pointing them out. Thank you for supporting this site!