Overview of the k-nearest neighbor algorithm
Simply put, the k-nearest neighbor algorithm classifies a sample by measuring the distances between its feature values and those of known samples.
k-nearest neighbor algorithm
Advantages: high accuracy, insensitivity to outliers, no assumptions about the input data.
Disadvantages: high computational complexity and high space complexity. Applicable data range: numerical and nominal values.
The k-nearest neighbor algorithm (kNN) works as follows. There exists a collection of sample data, known as the training sample set, in which every sample carries a label; that is, we know which class each sample in the set belongs to. When new, unlabeled data is input, each feature of the new data is compared with the corresponding features of the samples in the set, and the algorithm extracts the class labels of the most similar samples (the nearest neighbors). In general, we look only at the top k most similar samples in the data set, which is where the k in k-nearest neighbor comes from; usually k is an integer no larger than 20. Finally, the class that occurs most often among these k most similar samples is chosen as the class of the new data.
General flow of the k-nearest neighbor algorithm
Data collection: any method can be used.
Prepare the data: numeric values are needed for the distance calculation, preferably in a structured data format.
Analyzing data: any method can be used.
Training algorithm: this step does not apply to the k-nearest neighbor algorithm.
Test algorithm: calculate the error rate.
Using the algorithm: first input the sample data and structured output values, then run the k-nearest neighbor algorithm to determine which class the input data belongs to, and finally have the application perform whatever subsequent processing it needs on the computed class.
The following code walks through the kNN algorithm step by step:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

plt.rcParams['font.sans-serif']=['SimHei']  # Used to display Chinese labels properly

# Prepare the dataset
iris=datasets.load_iris()
X=iris.data
print('X:\n',X)
Y=iris.target
print('Y:\n',Y)

# We deal with a binary classification problem, so keep only the rows where Y is 0 or 1,
# and take the first two columns of X from those rows
x=X[Y<2,:2]
print('x:\n',x)
y=Y[Y<2]
print('y:\n',y)

# Points with target=0 are plotted in red, points with target=1 in green; the horizontal
# coordinate of each point is the first column and the vertical coordinate is the second column.
plt.scatter(x[y==0,0],x[y==0,1],color='red')
plt.scatter(x[y==1,0],x[y==1,1],color='green')
plt.scatter(5.6,3.2,color='blue')
x_1=np.array([5.6,3.2])
plt.title('Red points labeled 0, green points labeled 1, point to be predicted in blue')
plt.show()
As shown in the figure, we want to predict which class the blue point belongs to. We use the Euclidean distance formula to compute the distance between two feature vectors.
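For two feature vectors a = (a_1, ..., a_n) and b = (b_1, ..., b_n), the Euclidean distance is

$$ d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} $$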
After calculating the distances to all points, the data can be sorted from smallest to largest distance. We then count the class labels of the k closest points and return the label with the most votes as the class of the blue point.
# Compute the Euclidean distance from every sample to the point to be predicted
distances=[np.sqrt(np.sum((x_t-x_1)**2)) for x_t in x]
# Sort the distances and return the sorted indices
nearest=np.argsort(distances)
k=6
topk_y=[y[i] for i in nearest[:k]]
from collections import Counter
# Counter returns a dictionary-like tally of the labels
votes=Counter(topk_y)
print(votes)
# Return the label with the most votes
predict_y=votes.most_common(1)[0][0]
print(predict_y)
plt.show()
Counter({1: 4, 0: 2})
1
From the results we can see that, with k=6, four of the six points nearest to the blue point are green (label 1) and two are red (label 0), so the label of the blue point is predicted to be green.
The functionality we have just implemented in the code can be encapsulated into a class:
import numpy as np
from collections import Counter
from metrics import accuracy_score

class KNNClassifier:
    def __init__(self, k):
        assert k >= 1, 'k must be valid'
        self.k = k
        self._x_train = None
        self._y_train = None

    def fit(self, x_train, y_train):
        self._x_train = x_train
        self._y_train = y_train
        return self

    def _predict(self, x):
        # Euclidean distances from x to every training sample
        d = [np.sqrt(np.sum((x_i - x) ** 2)) for x_i in self._x_train]
        nearest = np.argsort(d)
        top_k = [self._y_train[i] for i in nearest[:self.k]]
        votes = Counter(top_k)
        return votes.most_common(1)[0][0]

    def predict(self, X_predict):
        y_predict = [self._predict(x1) for x1 in X_predict]
        return np.array(y_predict)

    def __repr__(self):
        return 'knn(k=%d):' % self.k

    def score(self, x_test, y_test):
        y_predict = self.predict(x_test)
        return sum(y_predict == y_test) / len(x_test)
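The class imports accuracy_score from a local metrics module that is not shown in this post; a minimal sketch of what such a helper might look like (the file name metrics.py and the exact signature are assumptions) is:

# metrics.py -- minimal sketch of the helper assumed by the snippets in this post
import numpy as np

def accuracy_score(y_true, y_predict):
    # fraction of predictions that match the true labels
    assert len(y_true) == len(y_predict), 'the sizes of y_true and y_predict must be equal'
    return np.sum(np.array(y_true) == np.array(y_predict)) / len(y_true)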
Model selection, dividing the training and test sets
model_selection.py
import numpy as np

def train_test_split(x, y, test_ratio=0.2, seed=None):
    if seed:
        np.random.seed(seed)
    # Generate a random permutation of the sample indices
    shuffled_indexes = np.random.permutation(len(x))
    print(shuffled_indexes)
    # The test set is 20% of the total sample by default
    test_size = int(test_ratio * len(x))
    test_indexes = shuffled_indexes[:test_size]
    train_indexes = shuffled_indexes[test_size:]
    x_test = x[test_indexes]
    y_test = y[test_indexes]
    x_train = x[train_indexes]
    y_train = y[train_indexes]
    return x_train, x_test, y_train, y_test

'''
train_test_split in sklearn:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=666)
'''
Below we measure the accuracy of the model in two different ways:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
x = iris.data
y = iris.target

# sklearn's built-in train_test_split and KNeighborsClassifier
x_train, x_test, y_train, y_test = train_test_split(x, y)
knn_classifier = KNeighborsClassifier(6)
knn_classifier.fit(x_train, y_train)
y_predict = knn_classifier.predict(x_test)
scores = knn_classifier.score(x_test, y_test)
print('acc:{}'.format(sum(y_predict == y_test) / len(x_test)), scores)

# Using our own train_test_split and KNNClassifier
from model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y)
from KNN import KNNClassifier
from metrics import accuracy_score
my_knn = KNNClassifier(k=6)
my_knn.fit(X_train, y_train)
y_predict = my_knn.predict(X_test)
print(accuracy_score(y_test, y_predict))
score = my_knn.score(X_test, y_test)
print(score)
After obtaining the accuracy, to further improve the accuracy on the test set we need to tune the model.
Hyperparameters: parameters that need to be set before the algorithm runs (find good hyperparameters through domain knowledge, empirical values, experimental search)
Model parameters: parameters learned during the algorithm
There are no model parameters in kNN; k is a typical hyperparameter of the algorithm, and we will use experimental search to find a good value for it.
Looking for the best k:
def main():
    from sklearn import datasets
    digits=datasets.load_digits()
    x=digits.data
    y=digits.target
    from sklearn.model_selection import train_test_split
    x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=666)
    from sklearn.neighbors import KNeighborsClassifier
    # Looking for the best k
    best_k=-1
    best_score=0
    for i in range(1,11):
        knn_clf=KNeighborsClassifier(n_neighbors=i)
        knn_clf.fit(x_train,y_train)
        scores=knn_clf.score(x_test,y_test)
        if scores>best_score:
            best_score=scores
            best_k=i
    print('Best k is :%d, best score is :%.4f'%(best_k,best_score))

if __name__ == '__main__':
    main()
The best k is 4 and the best score is 0.9917.
So are there any other hyperparameters?
Documentation in sklearn
sklearn.neighbors.KNeighborsClassifier
Parameters:

n_neighbors : int, optional (default = 5)
weights : str or callable, optional (default = 'uniform')
algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
leaf_size : int, optional (default = 30)
p : integer, optional (default = 2)
metric : string or callable, default 'minkowski'
metric_params : dict, optional (default = None)
n_jobs : int, optional (default = 1)
n_neighbors: defaults to 5, which is the value of k for k-NN, picking the k nearest points.
weights: defaults to uniform; the argument can be uniform, distance, or a user-defined function. uniform means equal weights: all neighboring points are weighted equally. distance means unequal weights: points that are closer have a greater influence than points that are farther away. A user-defined function takes an array of distances and returns an array of weights with the same shape (a short sketch of such a callable appears after this parameter list).
algorithm: the fast k-nearest neighbor search algorithm. The default is auto, which lets the library decide the appropriate search algorithm by itself. The user can also specify the search algorithm explicitly: ball_tree, kd_tree, or brute. brute is brute-force search, i.e., a linear scan, which is very time-consuming when the training set is large. kd_tree builds a KD tree, a binary-tree data structure that stores the data for fast retrieval; the tree is constructed with median splits, each node is a hyperrectangle, and it is efficient when the dimensionality is below about 20. ball_tree was invented to overcome the failure of the KD tree in high dimensions; it is constructed by splitting the sample space with balls of centroid C and radius r, and each node is a hypersphere.
leaf_size: defaults to 30; this is the leaf size of the constructed KD tree or ball tree. It affects the speed of tree construction and querying, as well as the memory required to store the tree. Choose an appropriate value according to the nature of the problem.
metric: used for distance metrics, the default metric is minkowski, which is the Euclidean distance (Euclidean metric) with p=2.
p: the parameter of the distance metric. In the sections above we used the Euclidean distance formula; besides that, there are other metrics such as the Manhattan distance. This parameter defaults to 2, which means the Euclidean distance formula is used by default. It can also be set to 1 to use the Manhattan distance formula instead.
metric_params: other key parameters of the distance formula, this can be ignored, just use the default None.
n_jobs: parallel processing setting. Defaults to 1, the number of parallel jobs for proximity point search. If -1, then all cores of the CPU are used for parallel work.
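Regarding the weights option above: as an illustration that is not from the original post (the Gaussian kernel width here is an arbitrary choice), a callable can be passed as weights; it receives an array of distances and must return an array of weights with the same shape:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def gaussian_weights(distances):
    # closer neighbors get exponentially larger weights
    return np.exp(-(distances ** 2) / 2.0)

# weights accepts 'uniform', 'distance', or a callable such as the one above;
# fit/predict would then use these weights when voting
knn_clf = KNeighborsClassifier(n_neighbors=5, weights=gaussian_weights)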
Considering the distance? Not considering the distance?
When k=3, as shown in the figure below, the blue points win the vote, so the green point is assigned to the blue class. But this way of classifying only considers which three points are closest to the green point and ignores how far away they are; as the figure shows, the red point is actually the closest to the green point. When we take distance into account, we attach a weight to each vote: the closer the neighbor, the greater the weight. Usually the reciprocal of the distance is used as the weight, as in the sketch below.
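A minimal sketch of this idea (the function name is mine, and it assumes no neighbor has a distance of exactly zero):

from collections import Counter
import numpy as np

def weighted_knn_vote(distances, labels, k=3):
    # each of the k nearest neighbors votes with weight 1/distance instead of a single vote
    nearest = np.argsort(distances)
    votes = Counter()
    for i in nearest[:k]:
        votes[labels[i]] += 1.0 / distances[i]
    return votes.most_common(1)[0][0]

# e.g. with the distances and labels y from the manual example above:
# predict_y = weighted_knn_vote(distances, y, k=3)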
In addition, distance-weighted kNN also resolves the problem of tied votes.
Finding the optimal value of the weights hyperparameter: ['uniform','distance']
def main():
    from sklearn import datasets
    digits=datasets.load_digits()
    x=digits.data
    y=digits.target
    from sklearn.model_selection import train_test_split
    x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=666)
    from sklearn.neighbors import KNeighborsClassifier
    # Looking for the best k and weights
    best_k=-1
    best_score=0
    best_method=''
    for method in ['uniform','distance']:
        for i in range(1,11):
            knn_clf=KNeighborsClassifier(n_neighbors=i,weights=method)
            knn_clf.fit(x_train,y_train)
            scores=knn_clf.score(x_test,y_test)
            if scores>best_score:
                best_score=scores
                best_k=i
                best_method=method
    print('Best k is :%d, best score is :%.4f, best method is %s'%(best_k,best_score,best_method))

if __name__ == '__main__':
    main()
More about the definition of distance can be captured by introducing another hyperparameter, p:
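The Minkowski distance between two feature vectors a and b is

$$ d(a, b) = \left( \sum_{i=1}^{n} |a_i - b_i|^p \right)^{1/p} $$

With p = 1 it reduces to the Manhattan distance, and with p = 2 to the Euclidean distance used earlier.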
Search for p corresponding to the optimal Minkowski distance:
def main():
    from sklearn import datasets
    digits=datasets.load_digits()
    x=digits.data
    y=digits.target
    from sklearn.model_selection import train_test_split
    x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=666)
    from sklearn.neighbors import KNeighborsClassifier
    # Looking for the best k and p
    best_k=-1
    best_score=0
    best_p=-1
    for i in range(1,11):
        for p in range(1,6):
            knn_clf=KNeighborsClassifier(n_neighbors=i,weights='distance',p=p)
            knn_clf.fit(x_train,y_train)
            scores=knn_clf.score(x_test,y_test)
            if scores>best_score:
                best_score=scores
                best_k=i
                best_p=p
    print('Best k is :%d, best score is :%.4f, best p is %d'%(best_k,best_score,best_p))

if __name__ == '__main__':
    main()
The best k is 3, the best score is 0.9889, and the best p is 2.
As we can see from the example above, there can be dependencies between hyperparameters; for instance, in the program above the hyperparameter p is only involved when weights='distance'. How can we list all the hyperparameters we care about at once and find the best combination in a single run? Grid search solves this problem, and sklearn encapsulates it as GridSearchCV.
The grid search provided by GridSearchCV exhaustively generates candidates from the parameter values specified in the param_grid argument. For example, consider the following param_grid:
param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
It specifies two grids to explore: one with a linear kernel and C values in [1, 10, 100, 1000], and one with an RBF kernel and the cross-product of C values in [1, 10, 100, 1000] and gamma values in [0.001, 0.0001].
In this example:
def main():
    import numpy as np
    from sklearn import datasets
    digits=datasets.load_digits()
    x=digits.data
    y=digits.target
    from sklearn.model_selection import train_test_split
    x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=666)
    from sklearn.neighbors import KNeighborsClassifier
    # Grid Search: define the parameters to be searched
    param_grid=[
        {
            'weights':['uniform'],
            'n_neighbors':[i for i in range(1,11)]
        },
        {
            'weights':['distance'],
            'n_neighbors':[i for i in range(1,11)],
            'p':[i for i in range(1,6)]
        }
    ]
    knn_clf=KNeighborsClassifier()
    from sklearn.model_selection import GridSearchCV
    # n_jobs: how many cores to use for parallel processing; -1 means use all available cores.
    # verbose outputs information during the search to help follow the search state.
    grid_search=GridSearchCV(knn_clf,param_grid,n_jobs=-1,verbose=1)
    grid_search.fit(x_train,y_train)
    # Return the best classifier found by the grid search
    print(grid_search.best_estimator_)
    # Return the parameters of the best classifier
    print(grid_search.best_params_)
    # Return the score of the best classifier
    print(grid_search.best_score_)
    knn_clf=grid_search.best_estimator_
    print(knn_clf.score(x_test,y_test))

if __name__ == '__main__':
    main()

The output:

Fitting 3 folds for each of 60 (10+50) candidates, totalling 180 fits
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 1.6s
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed: 30.6s finished
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=1, n_neighbors=3, p=3,
                     weights='distance')
{'n_neighbors': 3, 'weights': 'distance', 'p': 3}
0.985386221294
0.983333333333
Another very important concept when measuring distances is data normalization (feature scaling).
From the example above we can see that if discovery time is measured in days, the distance between samples is dominated by the discovery time, whereas if discovery time is measured in years, the distance is dominated by the tumor size. This tells us that computing distances directly, without processing the sample data, gives biased results.
Data normalization
Solution: Map all data to the same scale
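The two mappings used in the code below are min-max normalization, which maps the data into [0, 1], and mean-variance normalization (standardization), which maps it to zero mean and unit variance:

$$ x_{scale} = \frac{x - x_{min}}{x_{max} - x_{min}} \qquad\qquad x_{scale} = \frac{x - \mu}{\sigma} $$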
import numpy as np
import matplotlib.pyplot as plt

# Min-max normalization
x=np.random.randint(0,100,size=100)
print(x)
normal_x=(x-np.min(x))/(np.max(x)-np.min(x))
print(normal_x)

# Mean-variance normalization (standardization)
x2=np.random.randint(0,100,(50,2))
print(x2)
x2=np.array(x2,dtype=float)
x2[:,0]=(x2[:,0]-np.mean(x2[:,0]))/np.std(x2[:,0])
x2[:,1]=(x2[:,1]-np.mean(x2[:,1]))/np.std(x2[:,1])
plt.scatter(x2[:,0],x2[:,1])
plt.show()
x:[49 27 88 47 6 89 9 98 17 72 46 46 80 62 28 38 0 27 22 14 2 79 70 73 15 57 85 6 11 76 59 62 23 32 82 78 0 45 8 82 13 81 99 61 43 21 45 61 93 63 66 57 78 60 50 8 29 63 74 8 25 55 10 69 3 77 41 24 15 23 21 31 36 78 94 52 12 1 23 99 8 12 15 37 75 75 27 14 31 75 6 56 29 96 23 0 22 98 86 10] normal_x:[ 0.49494949 0.27272727 0.88888889 0.47474747 0.06060606 0.8989899 0.09090909 0.98989899 0.17171717 0.72727273 0.46464646 0.46464646 0.80808081 0.62626263 0.28282828 0.38383838 0. 0.27272727 0.22222222 0.14141414 0.02020202 0.7979798 0.70707071 0.73737374 0.15151515 0.57575758 0.85858586 0.06060606 0.11111111 0.76767677 0.5959596 0.62626263 0.23232323 0.32323232 0.82828283 0.78787879 0. 0.45454545 0.08080808 0.82828283 0.13131313 0.81818182 1. 0.61616162 0.43434343 0.21212121 0.45454545 0.61616162 0.93939394 0.63636364 0.66666667 0.57575758 0.78787879 0.60606061 0.50505051 0.08080808 0.29292929 0.63636364 0.74747475 0.08080808 0.25252525 0.55555556 0.1010101 0.6969697 0.03030303 0.77777778 0.41414141 0.24242424 0.15151515 0.23232323 0.21212121 0.31313131 0.36363636 0.78787879 0.94949495 0.52525253 0.12121212 0.01010101 0.23232323 1. 0.08080808 0.12121212 0.15151515 0.37373737 0.75757576 0.75757576 0.27272727 0.14141414 0.31313131 0.75757576 0.06060606 0.56565657 0.29292929 0.96969697 0.23232323 0. 0.22222222 0.98989899 0.86868687 0.1010101 ]
StandardScaler in sklearn:

# x, y are the features and labels loaded earlier
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=666)
# StandardScaler in sklearn
from sklearn.preprocessing import StandardScaler
standardscaler=StandardScaler()
standardscaler.fit(x_train)
# Mean of each feature
print(standardscaler.mean_)
# Scale (standard deviation) of each feature
print(standardscaler.scale_)
x_train_standard=standardscaler.transform(x_train)
x_test_standard=standardscaler.transform(x_test)
print(x_train_standard)
from sklearn.neighbors import KNeighborsClassifier
knn_clf=KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(x_train_standard,y_train)
scores=knn_clf.score(x_test_standard,y_test)
print(scores)
Summary:
In addition, the k-nearest neighbor algorithm can also solve regression problems, such as predicting house prices or grades, by predicting with the (possibly distance-weighted) mean of the k nearest neighbors' values.
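As a small illustration that is not part of the original post (the toy data below is made up), sklearn's KNeighborsRegressor predicts with the mean, optionally distance-weighted, of the k nearest neighbors' target values:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# toy one-dimensional data: y is roughly 2*x plus a little noise
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(40, 1), axis=0)
y = 2 * X.ravel() + 0.1 * rng.randn(40)

reg = KNeighborsRegressor(n_neighbors=5, weights='distance')
reg.fit(X, y)
print(reg.predict([[2.5]]))   # roughly 5.0: a weighted mean of the neighbors' targets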
That is all I have to share about the k-nearest neighbor algorithm (KNN) with sklearn and a Python implementation. I hope it gives you a useful reference, and I appreciate your continued support.