KMeans Clustering Analysis for Python Data Operations
1. Introduction

This article introduces cluster analysis through a simple application example, using Python and the KMeans algorithm.

Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are, in some sense, more similar to each other than to objects in other groups (clusters). It is a main task of exploratory data mining and a common technique for statistical data analysis, used in many fields including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.
2. General Application Scenarios
(1) Group categorization of target users: cluster the target population on variables selected for operational or commercial purposes, dividing it into several subgroups with distinctive characteristics, then apply refined, personalized operations and services to each subgroup to improve operational efficiency and commercial results.
(2) Value combination of different products: cluster numerous product categories on specific indicator variables, subdividing the product system into multi-dimensional product portfolios with different values and purposes, and formulate corresponding product development, operation, and service plans on that basis.
(3) Exploring and discovering isolated points and outliers: primarily a risk-control application; isolated points may carry a fraud risk.
3. Common clustering methods
Clustering algorithms can be divided into partition-based, hierarchical, density-based, grid-based, statistical, and model-based types. Typical algorithms include K-means (the classical clustering algorithm), DBSCAN, two-step clustering, BIRCH, spectral clustering, and so on.
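As a minimal sketch (an addition, not from the original article), several of these algorithm families are available in scikit-learn's sklearn.cluster module; the class names and parameters below are standard scikit-learn APIs:

from sklearn.cluster import KMeans, DBSCAN, Birch, SpectralClustering

# Partition-based: the number of clusters must be specified up front
kmeans = KMeans(n_clusters=3, random_state=0)
# Density-based: clusters are dense regions; no cluster count is needed
dbscan = DBSCAN(eps=0.5, min_samples=5)
# Hierarchical / CF-tree based
birch = Birch(n_clusters=3)
# Graph-based spectral clustering
spectral = SpectralClustering(n_clusters=3, random_state=0)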
4. KMeans clustering implementation
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn import metrics
import random

# Randomly generate a data set of 100 samples with 3 features each
feature = [[random.random(), random.random(), random.random()] for i in range(100)]
label = [random.randint(0, 2) for i in range(100)]
# Convert the data format
x_feature = np.array(feature)
# Train the clustering model
n_clusters = 3  # set the number of clusters
model_kmeans = KMeans(n_clusters=n_clusters, random_state=0)  # build the clustering model object
model_kmeans.fit(x_feature)  # train the clustering model
y_pre = model_kmeans.predict(x_feature)  # predict with the clustering model
y_pre
(Figure: the output of y_pre, one predicted cluster label per sample.)
5. Evaluation metrics for clustering
inertia_ is an attribute of the KMeans model object. It gives the sum of squared distances of samples to their closest cluster center and is used as an unsupervised evaluation metric when no true class labels are available. Smaller values are better: a smaller value indicates that the samples within each class are more concentrated, i.e., the intra-class distances are smaller.
# Sum of squared distances of samples to their closest cluster centers
inertias = model_kmeans.inertia_
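As an aside not in the original article, inertia is commonly used to choose n_clusters via the "elbow" heuristic; a minimal sketch, reusing x_feature from above:

# Fit KMeans for several cluster counts and watch where inertia stops
# dropping sharply (the "elbow")
for k in range(1, 8):
    m = KMeans(n_clusters=k, random_state=0).fit(x_feature)
    print(k, m.inertia_)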
adjusted_rand_s: Adjusted Rand Index (ARI). The Rand Index computes a similarity measure between two clusterings by considering all pairs of samples and counting the pairs that are assigned to the same or to different clusters in both the predicted and the true clustering. The Adjusted Rand Index corrects the Rand Index so that it is close to 0 for random labelings, independent of the number of samples and clusters. Its range is [-1, 1]: negative values indicate a bad result, and the closer the value is to 1, the better the clustering result matches the true labels.
# Adjusted Rand Index
adjusted_rand_s = metrics.adjusted_rand_score(label, y_pre)
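A tiny worked example (an addition, not from the original) showing that ARI is invariant to label permutation and at or below 0 for unrelated labelings:

# Identical partitions score 1.0 even though the label names are swapped
print(metrics.adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
# An unrelated labeling scores at or below 0 (here it is negative)
print(metrics.adjusted_rand_score([0, 0, 1, 1], [0, 1, 0, 1]))  # -0.5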
mutual_info_s: Mutual Information (MI). Mutual information measures the amount of information one random variable contains about another; here it serves as a similarity measure between two labelings of the same data, and its value is non-negative.
# Mutual information
mutual_info_s = metrics.mutual_info_score(label, y_pre)
adjusted_mutual_info_s: Adjusted Mutual Information (AMI). AMI is an adjustment of the mutual information score. It accounts for the fact that MI is generally higher for clusterings with a larger number of clusters, regardless of whether more information is actually shared, and corrects for this effect by adjusting for chance agreement between the clusterings. When two clusterings are identical (i.e., perfectly matched), AMI returns 1; for random (independent) labelings the expected AMI is about 0, and it can also be negative.
# Adjusted mutual information
adjusted_mutual_info_s = metrics.adjusted_mutual_info_score(label, y_pre)
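The following contrast (an illustrative addition) shows the effect AMI corrects for: fragmenting every point into its own cluster inflates raw MI but leaves AMI at 0:

true_labels = [0, 0, 1, 1, 2, 2]
singletons = [0, 1, 2, 3, 4, 5]  # every sample in its own cluster
print(metrics.mutual_info_score(true_labels, singletons))           # high, despite a useless clustering
print(metrics.adjusted_mutual_info_score(true_labels, singletons))  # 0.0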
homogeneity_s: Homogeneity score. A clustering result satisfies homogeneity if each cluster contains only data points that are members of a single class. Its range is [0, 1]; larger values mean the clustering result better matches the true labels.
# Homogeneity score
homogeneity_s = metrics.homogeneity_score(label, y_pre)
completeness_s: Completeness score. A clustering result satisfies completeness if all data points that are members of a given class are elements of the same cluster. Its range is [0, 1]; larger values mean the clustering result better matches the true labels.

# Completeness score
completeness_s = metrics.completeness_score(label, y_pre)
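A small contrast (added here for illustration): over-splitting a class keeps homogeneity perfect but lowers completeness:

true_labels = [0, 0, 1, 1]
over_split = [0, 1, 2, 2]  # class 0 is split across two clusters
print(metrics.homogeneity_score(true_labels, over_split))   # 1.0: every cluster is pure
print(metrics.completeness_score(true_labels, over_split))  # < 1.0: class 0 is scattered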
v_measure_s: the harmonic mean of homogeneity and completeness, v = 2 * (homogeneity * completeness) / (homogeneity + completeness). Its range is [0, 1]; larger values mean the clustering result better matches the true labels.
# V-measure score
v_measure_s = metrics.v_measure_score(label, y_pre)
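As a quick sanity check (an addition), the V-measure above can be reproduced from the two scores already computed:

# Harmonic mean of homogeneity and completeness matches v_measure_score
v_manual = 2 * homogeneity_s * completeness_s / (homogeneity_s + completeness_s)
print(abs(v_manual - v_measure_s) < 1e-9)  # True (assuming the scores are not both 0)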
silhouette_s: Silhouette coefficient, here the average silhouette coefficient over all samples. It is computed from each sample's mean intra-cluster distance and mean nearest-cluster distance, and it is an unsupervised evaluation metric. The best value is 1 and the worst is -1; values near 0 indicate overlapping clusters, and negative values usually indicate that samples have been assigned to the wrong cluster.
# Average silhouette coefficient
silhouette_s = metrics.silhouette_score(x_feature, y_pre, metric='euclidean')
calinski_harabaz_s: Calinski-Harabasz score, defined as the ratio of between-cluster dispersion to within-cluster dispersion; it is an unsupervised evaluation metric, and higher values indicate better-defined clusters.
# Calinski-Harabasz score
calinski_harabaz_s = metrics.calinski_harabasz_score(x_feature, y_pre)
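To view everything side by side (a convenience addition, reusing the variables computed above):

# Print all evaluation metrics together
print('inertia: %.2f' % inertias)
print('ARI: %.2f  MI: %.2f  AMI: %.2f' % (adjusted_rand_s, mutual_info_s, adjusted_mutual_info_s))
print('homogeneity: %.2f  completeness: %.2f  v-measure: %.2f' % (homogeneity_s, completeness_s, v_measure_s))
print('silhouette: %.2f  calinski-harabasz: %.2f' % (silhouette_s, calinski_harabaz_s))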
6. Clustering effect visualization
# Visualize the model results
centers = model_kmeans.cluster_centers_  # cluster centers
colors = ['#4EACC5', '#FF9C34', '#4E9A06']  # set a color for each cluster
plt.figure()  # create a canvas
for i in range(n_clusters):  # loop over the clusters
    index_sets = np.where(y_pre == i)  # find the index set of samples in the same cluster
    cluster = x_feature[index_sets]  # gather the samples of this cluster into a subset
    plt.scatter(cluster[:, 0], cluster[:, 1], c=colors[i], marker='.')  # plot the sample points of the cluster subset
    plt.plot(centers[i][0], centers[i][1], 'o', markerfacecolor=colors[i], markeredgecolor='k', markersize=6)  # plot the center of each cluster subset
plt.show()  # show the image
(Figure: scatter plot of the samples colored by cluster, with each cluster center marked.)
7. Predicting new data
# Apply the model to a new sample
new_X = [1, 3.6, 9.9]
cluster_label = model_kmeans.predict(np.array(new_X).reshape(1, -1))
print('The clustering prediction result is: %d' % cluster_label)
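predict also accepts several samples at once (a usage note added here); pass one row per sample:

# Predict a batch of new samples: one cluster label is returned per row
new_batch = np.array([[0.2, 0.5, 0.8], [0.9, 0.1, 0.4]])
print(model_kmeans.predict(new_batch))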
This concludes this article's summary of KMeans clustering analysis for Python data operations. For more on Python data operations, please search my previous articles or continue browsing the related articles below. I hope you will support me in the future!