
The Python data mining workflow and algorithms in detail

1. A brief description of the data mining process

Step 1: Data selection

Data can be obtained from raw business data, from publicly available datasets, or by writing a crawler.

Step 2: Data pre-processing

The raw data is very likely to be noisy, incomplete, or otherwise flawed, so it needs to be normalized, for example with min-max standardization, z-score standardization, or the modified (robust) z-score.
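As a minimal, illustrative sketch (the numbers below are made up), the three normalization methods can be written as follows:

import numpy as np

data = np.array([12.0, 15.0, 14.0, 10.0, 18.0, 200.0])  # made-up values; 200 is an outlier

# min-max standardization: scale values into [0, 1]
min_max = (data - data.min()) / (data.max() - data.min())

# z-score standardization: zero mean, unit variance
z_score = (data - data.mean()) / data.std()

# modified (robust) z-score: based on the median and the median absolute deviation,
# which makes it far less sensitive to outliers
median = np.median(data)
mad = np.median(np.abs(data - median))
modified_z = 0.6745 * (data - median) / mad

print(min_max, z_score, modified_z, sep="\n")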

Step 3: Feature extraction and data conversion

Extract features from the data and convert it into the form required by the chosen data mining algorithm. There are many data models; they are explained in detail later.

Step 4: Model Training

Choose a suitable data mining algorithm and train it on the data.

Step 5: Test the model + evaluate the effect

There are two mainstream approaches:

Ten-fold cross-validation: the dataset is randomly partitioned into ten equal parts; each round uses 9 parts as the training set and 1 part as the test set, repeated for 10 rounds. The key to ten-fold cross-validation is dividing the data into ten reasonably even parts.

N-fold cross-validation, where N equals the number of samples, is also known as the leave-one-out method: train on almost all the data, leave one sample out for testing, and repeat until every sample has been tested once. The advantage of the leave-one-out method is determinism: there is no randomness in the split.
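A minimal sketch of both evaluation approaches, assuming the iris dataset and a logistic regression model purely as placeholders:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# ten-fold cross validation: 9 parts for training, 1 part for testing, 10 rounds
scores_10 = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
print("10-fold mean accuracy:", scores_10.mean())

# leave-one-out: train on all samples except one, test on that one, repeat for every sample
scores_loo = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("leave-one-out mean accuracy:", scores_loo.mean())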

Step 6: Model Use

Predictions are made on the data using the trained model.

Step 7: Interpretation and evaluation

The data-mined information is analyzed and interpreted and applied to practical areas of work.

2. The main algorithm models explained, based on sklearn

1) Linear regression: we want all points to lie as close as possible to a straight line. Assume the form y = ax + b, then compute the sum of the distances (errors) from each data point to this line; the goal is to find the a and b that minimize this sum.

from sklearn.linear_model import LinearRegression
# Define linear regression models
model = LinearRegression(fit_intercept=True, normalize=False, 
    copy_X=True, n_jobs=1)
"""
Parameters
---
    fit_intercept: whether to compute the intercept; False means the model has no intercept
    normalize: ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.
    n_jobs: specify the number of jobs (threads)
"""

2) Logistic regression: an algorithm for binary (two-class) classification problems. You need to guess the approximate form of the decision function, e.g. whether it is linear or nonlinear.

As mentioned above, this assumes the dataset has a roughly linear decision boundary; different data require different kinds of boundaries.

from sklearn.linear_model import LogisticRegression
# Define the logistic regression model
model = LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, 
    fit_intercept=True, intercept_scaling=1, class_weight=None, 
    random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', 
    verbose=0, warm_start=False, n_jobs=1)
 
"""Parameters
---
    penalty: the regularization term to use (default: l2)
    dual: use False when n_samples > n_features (default)
    C: inverse of regularization strength; the smaller the value, the stronger the regularization
    n_jobs: specify the number of threads
    random_state: seed for the random number generator
    fit_intercept: whether to fit an intercept (constant term)
"""

3) Naive Bayes (NB): used to estimate the probability of an event. I once used this algorithm to build an opinion (sentiment) classifier: sentences are turned into a 0/1 word matrix, word frequencies are counted, and from those counts the emotional tone of a sentence is determined.

It is very efficient, but carries some probability of error.

from sklearn import naive_bayes
model = naive_bayes.GaussianNB() # Gaussian Bayes
model = naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)
model = naive_bayes.BernoulliNB(alpha=1.0, binarize=0.0, fit_prior=True, class_prior=None)
"""
MultinomialNB commonly used for text categorization problems
Parameters
---
    alpha: smoothing parameter
    fit_prior: whether to learn the class prior probabilities; False means a uniform prior is used
    class_prior: prior probabilities of the classes; if specified, the priors are not adjusted from the data
    binarize: threshold for binarization; if None, the input is assumed to consist of binary vectors
"""

4) Decision Tree (DT): a flowchart-like tree structure that uses branches to illustrate every possible outcome of a decision. Each internal node of the tree represents a test on a particular variable, and each branch is an outcome of that test.

from sklearn import tree 
model = tree.DecisionTreeClassifier(criterion='gini', max_depth=None, 
    min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, 
    max_features=None, random_state=None, max_leaf_nodes=None, 
    min_impurity_decrease=0.0, min_impurity_split=None,
    class_weight=None, presort=False)
"""Parameters
---
    criterion : feature selection criterion gini/entropy
    max_depth: maximum depth of the tree; None means split down as far as possible
    min_samples_split: minimum number of samples required to split an internal node
    min_samples_leaf: minimum number of samples required at a leaf node
    max_features: number of features to consider when looking for the best split
    max_leaf_nodes: grow a tree with at most this many leaf nodes, in best-first fashion
    min_impurity_decrease: a node will be split if the split results in a decrease in impurity greater than or equal to this value
"""

5) Support Vector Machine (SVM): determines whether two classes of data are linearly separable, i.e. whether a straight line can split them. The idea extends to three dimensions and to higher-dimensional feature spaces: in three dimensions a plane separates the data; in four or more dimensions we can no longer visualize it, yet a separating "plane" can still exist. Such a plane is called a hyperplane.

from sklearn.svm import SVC
model = SVC(C=1.0, kernel='rbf', gamma='auto')
"""Parameters
---
    C: penalty parameter for the error term C
    gamma: kernel coefficient (floating point). If gamma is 'auto', 1/n_features will be used instead.
"""

6) k Nearest Neighbor Algorithm KNN: An algorithm that uses a measure of the distance between different feature values to classify data.

Given a collection of labeled samples (the training set), for a new, unlabeled input we compute the distance between it and every sample, take the k nearest samples (usually k < 20), and assign the new data point the label that occurs most often among those k neighbors.

In other words, the k-nearest neighbor algorithm: given a training dataset, for a new input instance find the K instances in the training set nearest to it; the class to which the majority of those K instances belong becomes the class of the new instance. (It is similar to the everyday idea that the majority rules.) The classic illustration from Wikipedia, with a green point to be classified among red triangles and blue squares, makes this concrete:

If K = 3, the 3 nearest neighbors of the green point are 2 red triangles and 1 blue square; by majority vote, the green point is assigned to the red triangle class.

If K = 5, the 5 nearest neighbors of the green point are 2 red triangles and 3 blue squares; again by majority vote, the green point is assigned to the blue square class.

from sklearn import neighbors
# Define kNN models
model = neighbors.KNeighborsClassifier(n_neighbors=5, n_jobs=1)  # Classification
model = neighbors.KNeighborsRegressor(n_neighbors=5, n_jobs=1)   # Regression
"""Parameters
---
    n_neighbors: number of neighbors to use
    n_jobs: number of parallel tasks
"""

7) K-means clustering:

  • Define the target number of clusters k, e.g. k = 3
  • Randomly initialize k cluster centers (centroids)
  • Compute the Euclidean distance from each data point to the k cluster centers, and assign each point to the class of the cluster center with the smallest distance
  • For each class, recompute its cluster center
  • Repeat steps 3-4 until some stopping condition is reached (number of iterations, minimum change in error, etc.)

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
 
df = pd.DataFrame({"x": [25, 34, 22, 27, 33, 33, 31, 22, 35, 34, 67, 54, 57, 43, 50, 57, 59, 52, 65, 47, 49, 48, 35, 33, 44, 45, 38, 43, 51, 46],
                   "y": [79, 51, 53, 78, 59, 74, 73, 57, 69, 75, 51, 32, 40, 47, 53, 36, 35, 59, 59, 50, 25, 20, 14, 12, 20, 5,  29, 27, 8,  7]})
kmeans = KMeans(n_clusters=3).fit(df)
centroids = kmeans.cluster_centers_
# Print the cluster centers
print(type(centroids), centroids)
# Visualize the clustering result
fig, ax = plt.subplots()
ax.scatter(df['x'], df['y'], c=kmeans.labels_.astype(float), s=50, alpha=0.5)
ax.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50)
plt.show()
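Once fitted, the same KMeans object can also assign new points (the coordinates below are made up) to the nearest learned cluster:

print(kmeans.predict([[30, 70], [55, 40]]))  # cluster index of each new point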

Unlike KNN, K-means clustering is unsupervised learning.

Supervised learning knows in advance what it is trying to learn from the data, whereas unsupervised learning needs no predefined target; the algorithm itself discovers common features in the data. Comparing classification and clustering: classification knows the categories in advance, while clustering only groups objects into clusters based on similarity.

ps: In machine learning we constantly run into two kinds of problems: regression problems and classification problems. A classification problem divides the existing data into a number of classes and then assigns new data to one of those classes, while a regression problem fits the existing data to a function and uses that fitted function to predict new data. The difference lies in the type of output variable: regression produces quantitative output (predicts a continuous variable), while classification produces qualitative output (predicts a discrete variable).

3. Saving a trained model with joblib, which ships with sklearn

from sklearn.externals import joblib  # in newer sklearn versions, use: import joblib
 
# Save the model (the filename 'model.pkl' is just a placeholder)
joblib.dump(model, 'model.pkl')
# Load the model
model = joblib.load('model.pkl')


This concludes this article on Python data mining algorithms. For more related content, please search my previous articles; I hope you will continue to support me in the future!