
Summary of KMeans Cluster Analysis for Python Data-Based Operations


1. Introduction

This article introduces cluster analysis through a simple example of using KMeans in Python to perform cluster analysis.

Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are (in some sense) more similar to each other than to objects in other groups (clusters). It is a main task of exploratory data mining and a common technique for statistical data analysis, used in many fields including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.

2. General Application Scenarios

(1) Segmentation of target users: cluster the target population on variables chosen for operational or commercial purposes, dividing it into several subgroups with distinct characteristics, and then apply refined, personalized operations and services to these subgroups to improve operational efficiency and commercial results.

(2) Value mix of different products: cluster numerous product categories on specific indicator variables, subdividing the product system into multi-dimensional product portfolios with different values and purposes, and on that basis formulate the corresponding product development, operation, and service plans.

(3) Exploring and discovering isolated points and outliers: primarily a risk-control application; isolated points may carry a risk of fraud.

3. Common methods of clustering

Clustering algorithms can be grouped into partition-based, hierarchical, density-based, grid-based, statistical, and model-based types. Typical algorithms include K-means (the classic clustering algorithm), DBSCAN, two-step clustering, BIRCH, spectral clustering, and so on.
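
As a quick illustration (not part of the original example), several of these algorithms are available in scikit-learn behind a similar fit/predict-style interface; the parameter values below are arbitrary and only meant as a sketch:

from sklearn.cluster import KMeans, DBSCAN, Birch, SpectralClustering
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=0)  # toy data for the sketch
labels_kmeans = KMeans(n_clusters=3, random_state=0).fit_predict(X_demo)
labels_dbscan = DBSCAN(eps=0.8, min_samples=5).fit_predict(X_demo)
labels_birch = Birch(n_clusters=3).fit_predict(X_demo)
labels_spectral = SpectralClustering(n_clusters=3, random_state=0).fit_predict(X_demo)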

4. KMeans clustering implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn import metrics
import random

# Randomly generate 100 samples with 3 features each, plus random labels
# (the labels are used later as the "true" labels for the supervised metrics)
feature = [[random.random(), random.random(), random.random()] for i in range(100)]
label = [random.randint(0, 2) for i in range(100)]

# Convert the data to a NumPy array
x_feature = np.array(feature)

# Train the clustering model
n_clusters = 3  # Set the number of clusters
model_kmeans = KMeans(n_clusters=n_clusters, random_state=0)  # Build the clustering model object
model_kmeans.fit(x_feature)  # Fit the clustering model
y_pre = model_kmeans.predict(x_feature)  # Predict the cluster of each sample
y_pre

The resulting cluster labels are shown in the figure:

5. Assessment indicators for clustering

inertia_ is an attribute of the fitted KMeans model object. It gives the sum of squared distances of the samples to their nearest cluster center and can be used as an unsupervised evaluation metric when no ground-truth labels are available. Smaller is better: a smaller value means the samples are more tightly concentrated around their cluster centers, i.e., the within-cluster distances are smaller. A common way to use it to choose the number of clusters is sketched after the code below.

# Sum of squared distances of samples to their closest cluster center
inertias = model_kmeans.inertia_  
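
A common use of inertia_ in practice is the elbow method for choosing the number of clusters: fit KMeans for several values of k and look for the point where the decrease in inertia levels off. The sketch below reuses x_feature from above; the range of k is arbitrary:

inertia_list = []
k_range = range(1, 10)
for k in k_range:
    km = KMeans(n_clusters=k, random_state=0).fit(x_feature)  # fit a model for each k
    inertia_list.append(km.inertia_)
plt.plot(list(k_range), inertia_list, marker='o')  # look for the "elbow" in this curve
plt.xlabel('number of clusters k')
plt.ylabel('inertia')
plt.show()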

adjusted_rand_s: Adjusted Rand Index (ARI). The Rand Index measures the similarity between two clusterings by counting, over all pairs of samples, the pairs that are assigned to the same or to different clusters in both the predicted and the true clustering. The Adjusted Rand Index corrects the Rand Index for chance, so that its expected value is close to 0 regardless of the number of samples and clusters. Its range is [-1, 1]: negative values indicate a poor result, and the closer the value is to 1, the better the clustering matches the true labels. A small toy example follows the code below.

# Adjusted Rand Index
adjusted_rand_s = metrics.adjusted_rand_score(label, y_pre)  
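
As a small sanity check (toy labels made up for illustration), identical groupings score 1 even when the label names differ, while an unrelated grouping scores close to 0 and can be negative:

true_toy = [0, 0, 1, 1, 2, 2]
print(metrics.adjusted_rand_score(true_toy, [1, 1, 0, 0, 2, 2]))  # 1.0: same grouping, different names
print(metrics.adjusted_rand_score(true_toy, [0, 1, 2, 0, 1, 2]))  # near 0 (here negative): unrelated grouping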

mutual_info_s: Mutual Information (MI). Mutual information measures how much information one random variable carries about another; here it measures the agreement between two label assignments of the same data, and it is always non-negative.

# Mutual information
mutual_info_s = metrics.mutual_info_score(label, y_pre) 

adjusted_mutual_info_s: Adjusted Mutual Information (AMI), an adjustment of the mutual information score. It accounts for the fact that MI tends to be higher for clusterings with more clusters, regardless of whether more information is actually shared, and corrects for this by adjusting for chance agreement. When two clusterings are identical (perfectly matched), AMI returns 1; random (independent) labelings have an expected AMI of about 0, and the score can also be negative.

# Adjusted mutual information
adjusted_mutual_info_s = metrics.adjusted_mutual_info_score(label, y_pre)  

homogeneity_s: Homogeneity score. A clustering result satisfies homogeneity if every cluster contains only data points that are members of a single class. Its range is [0, 1]; larger values mean the clustering better matches the true labels.

# Homogeneity score
homogeneity_s = metrics.homogeneity_score(label, y_pre)  

completeness_s: Completeness score. A clustering result satisfies completeness if all data points that are members of a given class are assigned to the same cluster. Its range is [0, 1]; larger values mean the clustering better matches the true labels. A toy contrast of the two scores follows the code below.

# Completeness score
completeness_s = metrics.completeness_score(label, y_pre)  
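
The difference between the two scores is easiest to see on toy labels (made up for illustration): splitting one class across several clusters keeps homogeneity at 1 but lowers completeness, while merging everything into one cluster does the opposite:

print(metrics.homogeneity_score([0, 0, 1, 1], [0, 1, 2, 3]))   # 1.0: every cluster is pure
print(metrics.completeness_score([0, 0, 1, 1], [0, 1, 2, 3]))  # < 1.0: classes are split across clusters
print(metrics.homogeneity_score([0, 0, 1, 1], [0, 0, 0, 0]))   # 0.0: one cluster mixes both classes
print(metrics.completeness_score([0, 0, 1, 1], [0, 0, 0, 0]))  # 1.0: each class stays in one cluster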

v_measure_s: the harmonic mean of homogeneity and completeness, v = 2 * (homogeneity * completeness) / (homogeneity + completeness). Its range is [0, 1]; larger values mean the clustering better matches the true labels. A quick numerical check follows the code below.

v_measure_s = metrics.v_measure_score(label, y_pre)  
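
A quick check of this relationship, reusing the homogeneity_s and completeness_s values computed above (assuming their sum is nonzero):

v_manual = 2 * homogeneity_s * completeness_s / (homogeneity_s + completeness_s)
print(v_manual, v_measure_s)  # the two values should agree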

silhouette_s: Silhouette coefficient. It computes the mean silhouette coefficient over all samples, based on each sample's mean intra-cluster distance and mean nearest-cluster distance, and it is an unsupervised evaluation metric. Its best value is 1 and its worst value is -1; values near 0 indicate overlapping clusters, and negative values usually mean samples have been assigned to the wrong cluster. A sketch of using it to choose the number of clusters follows the code below.

# Mean silhouette coefficient
silhouette_s = metrics.silhouette_score(x_feature, y_pre, metric='euclidean')  
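
Like inertia, the mean silhouette coefficient is often used to pick the number of clusters: fit KMeans for several values of k and keep the k with the highest score. A rough sketch, again reusing x_feature (the range of k is arbitrary):

best_k, best_score = None, -1
for k in range(2, 8):  # silhouette needs at least 2 clusters
    labels_k = KMeans(n_clusters=k, random_state=0).fit_predict(x_feature)
    score_k = metrics.silhouette_score(x_feature, labels_k, metric='euclidean')
    if score_k > best_score:
        best_k, best_score = k, score_k
print(best_k, best_score)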

calinski_harabaz_s: the Calinski-Harabasz score, defined as the ratio of between-cluster dispersion to within-cluster dispersion; higher values indicate better separated clusters. It is an unsupervised evaluation metric.

# Calinski-Harabasz score
calinski_harabaz_s = metrics.calinski_harabasz_score(x_feature, y_pre)  

6. Clustering effect visualization

# Visualize the model results
centers = model_kmeans.cluster_centers_  # Cluster centers
colors = ['#4EACC5', '#FF9C34', '#4E9A06']  # Set a color for each cluster
plt.figure()  # Create a canvas
for i in range(n_clusters):  # Loop over the clusters
    index_sets = (y_pre == i)  # Boolean index of the samples in cluster i
    cluster = x_feature[index_sets]  # Subset of samples belonging to cluster i
    plt.scatter(cluster[:, 0], cluster[:, 1], c=colors[i], marker='.')  # Plot the samples in the cluster
    plt.plot(centers[i][0], centers[i][1], 'o', markerfacecolor=colors[i], markeredgecolor='k',
             markersize=6)  # Plot the center of the cluster
plt.show()  # Show the image

The resulting plot is shown in the figure:

7. Data prediction

# Apply the model to new data
new_X = [1, 3.6, 9.9]
cluster_label = model_kmeans.predict(np.array(new_X).reshape(1, -1))  # Predict the cluster of the new sample
print('The clustering prediction result is: %d' % cluster_label)

This concludes this article's summary of KMeans cluster analysis for Python data operations. For more on Python data operations, please search my previous articles or continue browsing the related articles below. I hope you will continue to support me!