Overview of the k-nearest neighbor algorithm
Simply put, the k-nearest neighbor algorithm classifies a sample by measuring the distances between its feature values and those of known samples.
k-nearest neighbor algorithm
Advantages: high accuracy, insensitivity to outliers, no assumptions about the input data.
Disadvantages: high computational complexity and high space complexity. Applicable data range: numerical and nominal values.
The k-nearest neighbor algorithm (kNN) works as follows. There exists a collection of sample data, known as the training sample set, in which every sample carries a label; that is, we know which class each sample in the set belongs to. When new, unlabeled data is input, each feature of the new data is compared with the corresponding features of the samples in the set, and the algorithm extracts the class labels of the most similar samples (the nearest neighbors). In general, we look only at the top k most similar samples in the data set, which is where the k in k-nearest neighbor comes from; usually k is an integer no larger than 20. Finally, the class that occurs most often among these k most similar samples is chosen as the class of the new data.
General flow of the k-nearest neighbor algorithm
Data collection: any method can be used.
Prepare the data: numeric values are needed for the distance calculation, preferably in a structured data format.
Analyzing data: any method can be used.
Training algorithm: this step does not apply to the k-nearest neighbor algorithm.
Test algorithm: calculate the error rate.
Using the algorithm: first input the sample data and structured output values, then run the k-nearest neighbor algorithm to determine which class the input data belongs to, and finally have the application perform whatever subsequent processing it needs on the computed class.
The following code walks through the kNN algorithm step by step:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

plt.rcParams['font.sans-serif']=['SimHei']  # Used to display Chinese labels properly

# Prepare the dataset
iris=datasets.load_iris()
X=iris.data
print('X:\n',X)
Y=iris.target
print('Y:\n',Y)

# We deal with a binary classification problem, so keep only the rows where Y is 0 or 1,
# and take the first two columns of X from those rows
x=X[Y<2,:2]
print('x:\n',x)
y=Y[Y<2]
print('y:\n',y)

# Points with target=0 are plotted in red, points with target=1 in green; the horizontal
# coordinate of each point is the first column and the vertical coordinate is the second column.
plt.scatter(x[y==0,0],x[y==0,1],color='red')
plt.scatter(x[y==1,0],x[y==1,1],color='green')
plt.scatter(5.6,3.2,color='blue')
x_1=np.array([5.6,3.2])
plt.title('Red points labeled 0, green points labeled 1, point to be predicted in blue')
plt.show()
As shown in the figure, we want to predict which class the blue point belongs to. We use the Euclidean distance formula to compute the distance between two feature vectors.
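For two feature vectors a = (a_1, ..., a_n) and b = (b_1, ..., b_n), the Euclidean distance is

$$ d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} $$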
After calculating the distances to all points, the data can be sorted from smallest to largest distance. We then count the class labels of the k closest points and return the label with the most votes as the class of the blue point.
# Compute the Euclidean distance from every sample to the point to be predicted
distances=[np.sqrt(np.sum((x_t-x_1)**2)) for x_t in x]
# Sort the distances and return the sorted indices
nearest=np.argsort(distances)
k=6
topk_y=[y[i] for i in nearest[:k]]
from collections import Counter
# Counter returns a dictionary-like tally of the labels
votes=Counter(topk_y)
print(votes)
# Return the label with the most votes
predict_y=votes.most_common(1)[0][0]
print(predict_y)
plt.show()
Counter({1: 4, 0: 2})
1
From the results we can see that, with k=6, four of the six points nearest to the blue point are green (label 1) and two are red (label 0), so the label of the blue point is predicted to be green.
The functionality we have just implemented in the code can be encapsulated into a class:
import numpy as np
from collections import Counter
from metrics import accuracy_score

class KNNClassifier:
    def __init__(self, k):
        assert k >= 1, 'k must be valid'
        self.k = k
        self._x_train = None
        self._y_train = None

    def fit(self, x_train, y_train):
        self._x_train = x_train
        self._y_train = y_train
        return self

    def _predict(self, x):
        # Euclidean distances from x to every training sample
        d = [np.sqrt(np.sum((x_i - x) ** 2)) for x_i in self._x_train]
        nearest = np.argsort(d)
        top_k = [self._y_train[i] for i in nearest[:self.k]]
        votes = Counter(top_k)
        return votes.most_common(1)[0][0]

    def predict(self, X_predict):
        y_predict = [self._predict(x1) for x1 in X_predict]
        return np.array(y_predict)

    def __repr__(self):
        return 'knn(k=%d):' % self.k

    def score(self, x_test, y_test):
        y_predict = self.predict(x_test)
        return sum(y_predict == y_test) / len(x_test)
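The class imports accuracy_score from a local metrics module that is not shown in this post; a minimal sketch of what such a helper might look like (the file name metrics.py and the exact signature are assumptions) is:

# metrics.py -- minimal sketch of the helper assumed by the snippets in this post
import numpy as np

def accuracy_score(y_true, y_predict):
    # fraction of predictions that match the true labels
    assert len(y_true) == len(y_predict), 'the sizes of y_true and y_predict must be equal'
    return np.sum(np.array(y_true) == np.array(y_predict)) / len(y_true)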
Model selection, dividing the training and test sets
model_selection.py
import numpy as np

def train_test_split(x, y, test_ratio=0.2, seed=None):
    if seed:
        np.random.seed(seed)
    # Generate a random permutation of the sample indices
    shuffled_indexes = np.random.permutation(len(x))
    print(shuffled_indexes)
    # The test set is 20% of the total sample by default
    test_size = int(test_ratio * len(x))
    test_indexes = shuffled_indexes[:test_size]
    train_indexes = shuffled_indexes[test_size:]
    x_test = x[test_indexes]
    y_test = y[test_indexes]
    x_train = x[train_indexes]
    y_train = y[train_indexes]
    return x_train, x_test, y_train, y_test

'''
train_test_split in sklearn:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=666)
'''
Below we measure the accuracy of the model in two different ways:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
x = iris.data
y = iris.target

# sklearn's built-in train_test_split and KNeighborsClassifier
x_train, x_test, y_train, y_test = train_test_split(x, y)
knn_classifier = KNeighborsClassifier(6)
knn_classifier.fit(x_train, y_train)
y_predict = knn_classifier.predict(x_test)
scores = knn_classifier.score(x_test, y_test)
print('acc:{}'.format(sum(y_predict == y_test) / len(x_test)), scores)

# Using our own train_test_split and KNNClassifier
from model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y)
from KNN import KNNClassifier
from metrics import accuracy_score
my_knn = KNNClassifier(k=6)
my_knn.fit(X_train, y_train)
y_predict = my_knn.predict(X_test)
print(accuracy_score(y_test, y_predict))
score = my_knn.score(X_test, y_test)
print(score)
After obtaining the accuracy, to further improve the accuracy on the test set we need to tune the model.
Hyperparameters: parameters that need to be set before the algorithm runs (find good hyperparameters through domain knowledge, empirical values, experimental search)
Model parameters: parameters learned during the algorithm
There are no model parameters in kNN; k is a typical hyperparameter of the algorithm, and we will use experimental search to find a good value for it.
Looking for the best k:
def main():
    from sklearn import datasets
    digits=datasets.load_digits()
    x=digits.data
    y=digits.target
    from sklearn.model_selection import train_test_split
    x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=666)
    from sklearn.neighbors import KNeighborsClassifier
    # Looking for the best k
    best_k=-1
    best_score=0
    for i in range(1,11):
        knn_clf=KNeighborsClassifier(n_neighbors=i)
        knn_clf.fit(x_train,y_train)
        scores=knn_clf.score(x_test,y_test)
        if scores>best_score:
            best_score=scores
            best_k=i
    print('Best k is :%d, best score is :%.4f'%(best_k,best_score))

if __name__ == '__main__':
    main()
The best k is 4 and the best score is 0.9917.
So are there any other hyperparameters?
Documentation in sklearn
sklearn.neighbors.KNeighborsClassifier
Parameters:

n_neighbors : int, optional (default = 5)
weights : str or callable, optional (default = 'uniform')
algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
leaf_size : int, optional (default = 30)
p : integer, optional (default = 2)
metric : string or callable, default 'minkowski'
metric_params : dict, optional (default = None)
n_jobs : int, optional (default = 1)
n_neighbors: defaults to 5, which is the value of k for k-NN, picking the k nearest points.
weights: defaults to uniform; the argument can be uniform, distance, or a user-defined function. uniform means equal weights: all neighboring points are weighted equally. distance means unequal weights: points that are closer have a greater influence than points that are farther away. A user-defined function takes an array of distances and returns an array of weights with the same shape (a short sketch of such a callable appears after this parameter list).
algorithm: the fast k-nearest neighbor search algorithm. The default is auto, which lets the library decide the appropriate search algorithm by itself. The user can also specify the search algorithm explicitly: ball_tree, kd_tree, or brute. brute is brute-force search, i.e., a linear scan, which is very time-consuming when the training set is large. kd_tree builds a KD tree, a binary-tree data structure that stores the data for fast retrieval; the tree is constructed with median splits, each node is a hyperrectangle, and it is efficient when the dimensionality is below about 20. ball_tree was invented to overcome the failure of the KD tree in high dimensions; it is constructed by splitting the sample space with balls of centroid C and radius r, and each node is a hypersphere.
leaf_size: defaults to 30; this is the leaf size of the constructed KD tree or ball tree. It affects the speed of tree construction and querying, as well as the memory required to store the tree. Choose an appropriate value according to the nature of the problem.
metric: used for distance metrics, the default metric is minkowski, which is the Euclidean distance (Euclidean metric) with p=2.
p: the parameter of the distance metric. In the sections above we used the Euclidean distance formula; besides that, there are other metrics such as the Manhattan distance. This parameter defaults to 2, which means the Euclidean distance formula is used by default. It can also be set to 1 to use the Manhattan distance formula instead.
metric_params: other key parameters of the distance formula, this can be ignored, just use the default None.
n_jobs: parallel processing setting. Defaults to 1, the number of parallel jobs for proximity point search. If -1, then all cores of the CPU are used for parallel work.
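Regarding the weights option above: as an illustration that is not from the original post (the Gaussian kernel width here is an arbitrary choice), a callable can be passed as weights; it receives an array of distances and must return an array of weights with the same shape:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def gaussian_weights(distances):
    # closer neighbors get exponentially larger weights
    return np.exp(-(distances ** 2) / 2.0)

# weights accepts 'uniform', 'distance', or a callable such as the one above;
# fit/predict would then use these weights when voting
knn_clf = KNeighborsClassifier(n_neighbors=5, weights=gaussian_weights)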
Considering the distance? Not considering the distance?
When k=3, as shown in the figure below, the blue points win the vote, so the green point is assigned to the blue class. But this way of classifying only considers which three points are closest to the green point and ignores how far away they are; as the figure shows, the red point is actually the closest to the green point. When we take distance into account, we attach a weight to each vote: the closer the neighbor, the greater the weight. Usually the reciprocal of the distance is used as the weight, as in the sketch below.
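A minimal sketch of this idea (the function name is mine, and it assumes no neighbor has a distance of exactly zero):

from collections import Counter
import numpy as np

def weighted_knn_vote(distances, labels, k=3):
    # each of the k nearest neighbors votes with weight 1/distance instead of a single vote
    nearest = np.argsort(distances)
    votes = Counter()
    for i in nearest[:k]:
        votes[labels[i]] += 1.0 / distances[i]
    return votes.most_common(1)[0][0]

# e.g. with the distances and labels y from the manual example above:
# predict_y = weighted_knn_vote(distances, y, k=3)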
In addition, distance-weighted kNN also resolves the problem of tied votes.
Finding the optimal value of the weights hyperparameter: ['uniform','distance']
def main():
    from sklearn import datasets
    digits=datasets.load_digits()
    x=digits.data
    y=digits.target
    from sklearn.model_selection import train_test_split
    x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=666)
    from sklearn.neighbors import KNeighborsClassifier
    # Looking for the best k and weights
    best_k=-1
    best_score=0
    best_method=''
    for method in ['uniform','distance']:
        for i in range(1,11):
            knn_clf=KNeighborsClassifier(n_neighbors=i,weights=method)
            knn_clf.fit(x_train,y_train)
            scores=knn_clf.score(x_test,y_test)
            if scores>best_score:
                best_score=scores
                best_k=i
                best_method=method
    print('Best k is :%d, best score is :%.4f, best method is %s'%(best_k,best_score,best_method))

if __name__ == '__main__':
    main()
More about the definition of distance can be captured by introducing another hyperparameter, p:
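The Minkowski distance between two feature vectors a and b is

$$ d(a, b) = \left( \sum_{i=1}^{n} |a_i - b_i|^p \right)^{1/p} $$

With p = 1 it reduces to the Manhattan distance, and with p = 2 to the Euclidean distance used earlier.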
Search for p corresponding to the optimal Minkowski distance:
def main():
    from sklearn import datasets
    digits=datasets.load_digits()
    x=digits.data
    y=digits.target
    from sklearn.model_selection import train_test_split
    x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=666)
    from sklearn.neighbors import KNeighborsClassifier
    # Looking for the best k and p
    best_k=-1
    best_score=0
    best_p=-1
    for i in range(1,11):
        for p in range(1,6):
            knn_clf=KNeighborsClassifier(n_neighbors=i,weights='distance',p=p)
            knn_clf.fit(x_train,y_train)
            scores=knn_clf.score(x_test,y_test)
            if scores>best_score:
                best_score=scores
                best_k=i
                best_p=p
    print('Best k is :%d, best score is :%.4f, best p is %d'%(best_k,best_score,best_p))

if __name__ == '__main__':
    main()
The best k is 3, the best score is 0.9889, and the best p is 2.
As we can see from the example above, there can be dependencies between hyperparameters; for instance, in the program above the hyperparameter p is only involved when weights='distance'. How can we list all the hyperparameters we care about at once and find the best combination in a single run? Grid search solves this problem, and sklearn encapsulates it as GridSearchCV.
The grid search provided by GridSearchCV exhaustively generates candidates from the parameter values specified in the param_grid argument. For example, consider the following param_grid:
param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
It specifies two grids to explore: one with a linear kernel and C values in [1, 10, 100, 1000], and one with an RBF kernel and the cross-product of C values in [1, 10, 100, 1000] and gamma values in [0.001, 0.0001].
In this example:
def main():
    import numpy as np
    from sklearn import datasets
    digits=datasets.load_digits()
    x=digits.data
    y=digits.target
    from sklearn.model_selection import train_test_split
    x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=666)
    from sklearn.neighbors import KNeighborsClassifier
    # Grid Search: define the parameters to be searched
    param_grid=[
        {
            'weights':['uniform'],
            'n_neighbors':[i for i in range(1,11)]
        },
        {
            'weights':['distance'],
            'n_neighbors':[i for i in range(1,11)],
            'p':[i for i in range(1,6)]
        }
    ]
    knn_clf=KNeighborsClassifier()
    from sklearn.model_selection import GridSearchCV
    # n_jobs: how many cores to use for parallel processing; -1 means use all available cores.
    # verbose outputs information during the search to help follow the search state.
    grid_search=GridSearchCV(knn_clf,param_grid,n_jobs=-1,verbose=1)
    grid_search.fit(x_train,y_train)
    # Return the best classifier found by the grid search
    print(grid_search.best_estimator_)
    # Return the parameters of the best classifier
    print(grid_search.best_params_)
    # Return the score of the best classifier
    print(grid_search.best_score_)
    knn_clf=grid_search.best_estimator_
    print(knn_clf.score(x_test,y_test))

if __name__ == '__main__':
    main()

The output:

Fitting 3 folds for each of 60 (10+50) candidates, totalling 180 fits
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 1.6s
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed: 30.6s finished
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=1, n_neighbors=3, p=3,
                     weights='distance')
{'n_neighbors': 3, 'weights': 'distance', 'p': 3}
0.985386221294
0.983333333333
Another very important concept when measuring distances is data normalization (feature scaling).
From the example above we can see that if discovery time is measured in days, the distance between samples is dominated by the discovery time, whereas if discovery time is measured in years, the distance is dominated by the tumor size. This tells us that computing distances directly, without processing the sample data, gives biased results.
Data normalization
Solution: Map all data to the same scale
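The two mappings used in the code below are min-max normalization, which maps the data into [0, 1], and mean-variance normalization (standardization), which maps it to zero mean and unit variance:

$$ x_{scale} = \frac{x - x_{min}}{x_{max} - x_{min}} \qquad\qquad x_{scale} = \frac{x - \mu}{\sigma} $$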
import numpy as np
import matplotlib.pyplot as plt

# Min-max normalization
x=np.random.randint(0,100,size=100)
print(x)
normal_x=(x-np.min(x))/(np.max(x)-np.min(x))
print(normal_x)

# Mean-variance normalization (standardization)
x2=np.random.randint(0,100,(50,2))
print(x2)
x2=np.array(x2,dtype=float)
x2[:,0]=(x2[:,0]-np.mean(x2[:,0]))/np.std(x2[:,0])
x2[:,1]=(x2[:,1]-np.mean(x2[:,1]))/np.std(x2[:,1])
plt.scatter(x2[:,0],x2[:,1])
plt.show()
x:[49 27 88 47 6 89 9 98 17 72 46 46 80 62 28 38 0 27 22 14 2 79 70 73 15 57 85 6 11 76 59 62 23 32 82 78 0 45 8 82 13 81 99 61 43 21 45 61 93 63 66 57 78 60 50 8 29 63 74 8 25 55 10 69 3 77 41 24 15 23 21 31 36 78 94 52 12 1 23 99 8 12 15 37 75 75 27 14 31 75 6 56 29 96 23 0 22 98 86 10] normal_x:[ 0.49494949 0.27272727 0.88888889 0.47474747 0.06060606 0.8989899 0.09090909 0.98989899 0.17171717 0.72727273 0.46464646 0.46464646 0.80808081 0.62626263 0.28282828 0.38383838 0. 0.27272727 0.22222222 0.14141414 0.02020202 0.7979798 0.70707071 0.73737374 0.15151515 0.57575758 0.85858586 0.06060606 0.11111111 0.76767677 0.5959596 0.62626263 0.23232323 0.32323232 0.82828283 0.78787879 0. 0.45454545 0.08080808 0.82828283 0.13131313 0.81818182 1. 0.61616162 0.43434343 0.21212121 0.45454545 0.61616162 0.93939394 0.63636364 0.66666667 0.57575758 0.78787879 0.60606061 0.50505051 0.08080808 0.29292929 0.63636364 0.74747475 0.08080808 0.25252525 0.55555556 0.1010101 0.6969697 0.03030303 0.77777778 0.41414141 0.24242424 0.15151515 0.23232323 0.21212121 0.31313131 0.36363636 0.78787879 0.94949495 0.52525253 0.12121212 0.01010101 0.23232323 1. 0.08080808 0.12121212 0.15151515 0.37373737 0.75757576 0.75757576 0.27272727 0.14141414 0.31313131 0.75757576 0.06060606 0.56565657 0.29292929 0.96969697 0.23232323 0. 0.22222222 0.98989899 0.86868687 0.1010101 ]
StandardScaler in sklearn:

# x, y are the features and labels loaded earlier
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=666)
# StandardScaler in sklearn
from sklearn.preprocessing import StandardScaler
standardscaler=StandardScaler()
standardscaler.fit(x_train)
# Mean of each feature
print(standardscaler.mean_)
# Scale (standard deviation) of each feature
print(standardscaler.scale_)
x_train_standard=standardscaler.transform(x_train)
x_test_standard=standardscaler.transform(x_test)
print(x_train_standard)
from sklearn.neighbors import KNeighborsClassifier
knn_clf=KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(x_train_standard,y_train)
scores=knn_clf.score(x_test_standard,y_test)
print(scores)
Summary:
In addition, the k-nearest neighbor algorithm can also solve regression problems, such as predicting house prices or grades, by predicting with the (possibly distance-weighted) mean of the k nearest neighbors' values.
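As a small illustration that is not part of the original post (the toy data below is made up), sklearn's KNeighborsRegressor predicts with the mean, optionally distance-weighted, of the k nearest neighbors' target values:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# toy one-dimensional data: y is roughly 2*x plus a little noise
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(40, 1), axis=0)
y = 2 * X.ravel() + 0.1 * rng.randn(40)

reg = KNeighborsRegressor(n_neighbors=5, weights='distance')
reg.fit(X, y)
print(reg.predict([[2.5]]))   # roughly 5.0: a weighted mean of the neighbors' targets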
That is all I have to share about the k-nearest neighbor algorithm (KNN) with sklearn and a Python implementation. I hope it gives you a useful reference, and I appreciate your continued support.