Python Implementation of the K-Nearest Neighbor Algorithm for Machine Learning

Preface

Recently I started learning machine learning and found a book about it online called "Machine Learning in Action". As it happens, the algorithms in this book are implemented in Python, and I had just learned some Python basics, so for me this book was timely help, like a gift of charcoal in snowy weather. Now let's get to the substance.

What is the K-Nearest Neighbor Algorithm?

Simply put, the K-Nearest Neighbor algorithm classifies data by measuring the distance between feature values. It works as follows. There is a collection of sample data, also called the training sample set, and every item in it has a label; that is, we know which category each item in the sample set belongs to. When new, unlabeled data is input, each feature of the new data is compared with the corresponding feature of the data in the sample set, and the algorithm extracts the classification labels of the sample data whose features are most similar (the nearest neighbors). Generally, we only take the top k most similar items in the sample set, which is where the "K" in the name comes from. The category that occurs most often among those k items becomes the classification of the new data.

Q: Reader, would you classify the K-Nearest Neighbor algorithm as supervised or unsupervised learning?

Importing data using Python

From the working principle of the K-Nearest Neighbor algorithm, we can see that to classify data with it we need sample data at hand; without it, we cannot build the classification function. So our first step is to import the sample data set.

Create a module named kNN.py and write the following code:

 from numpy import *
 import operator
 
 def createDataSet():
   group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
   labels = ['A','A','B','B']
   return group, labels

In the code, we import two Python modules: the scientific computing package NumPy and the operator module. NumPy is a separate library that is not part of the Python development environment and is not installed by default in most versions of Python, so we need to install it separately.

Download Address:/projects/numpy/files/

There are many versions, here I chose numpy-1.7.0-win32-superpack-python2.
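Once NumPy is installed, a quick check like the one below can confirm that Python can find it. This is only a minimal sketch; the version string it prints depends on which release you installed.

```python
# Confirm that NumPy can be imported and report which version is installed.
import numpy as np

print(np.__version__)

# Build a tiny array to make sure the basic functionality works.
sample = np.array([[1.0, 1.1], [0, 0.1]])
print(sample.shape)  # (2, 2)
```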

Implementing the K-Nearest Neighbor Algorithm

The specific idea of the K-nearest neighbor algorithm is as follows:

(1) Calculate the distance between each point in the labeled data set and the current point

(2) Sort the points in increasing order of distance

(3) Select the k points with the smallest distance from the current point

(4) Count how often each category occurs among these k points

(5) Return the category that occurs most often among these k points as the predicted classification for the current point
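To make step (1) concrete, here is a minimal sketch of a distance calculation in plain Python. It assumes Euclidean (straight-line) distance, which is what the square-then-square-root arithmetic in this article's code computes; the helper name euclidean_distance is made up for illustration.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Distance from the point [0, 0] to two labeled sample points.
print(euclidean_distance([0, 0], [0, 0.1]))  # 0.1
print(euclidean_distance([0, 0], [3, 4]))    # 5.0
```

The smaller the distance, the more similar the two points are considered to be.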

The Python code implementing the K-Nearest Neighbor algorithm is given below:

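 # coding: utf-8
 from numpy import *
 import operator
 import kNN
 
 # Load the sample points and their labels from the kNN module
 group, labels = kNN.createDataSet()
 
 def classify(inX, dataSet, labels, k):
   # Number of training samples (rows in dataSet)
   dataSetSize = dataSet.shape[0]
   # Euclidean distance from inX to every training sample
   diffMat = tile(inX, (dataSetSize,1)) - dataSet
   sqDiffMat = diffMat**2
   sqDistances = sqDiffMat.sum(axis=1)
   distances = sqDistances**0.5
   # Indices of the samples, ordered from nearest to farthest
   sortedDistances = distances.argsort()
   # Count the labels of the k nearest samples
   classCount = {}
   for i in range(k):
     numOflabel = labels[sortedDistances[i]]
     classCount[numOflabel] = classCount.get(numOflabel,0) + 1
   # Sort the (label, count) pairs by count, largest first
   sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
   return sortedClassCount[0][0]
 
 my = classify([0,0], group, labels, 3)
 print(my)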

Running the code gives the following result:

The output is B, indicating that our new data point ([0,0]) belongs to category B.
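For comparison, here is a more compact sketch of the same idea. It is not the book's version: it relies on NumPy broadcasting instead of tile and on collections.Counter for the vote, and the name classify_compact is made up for illustration. The sample data is repeated so that the snippet runs on its own.

```python
from collections import Counter

import numpy as np

def classify_compact(in_x, data_set, labels, k):
    """Return the majority label among the k training samples nearest to in_x."""
    data_set = np.asarray(data_set, dtype=float)
    in_x = np.asarray(in_x, dtype=float)
    # Euclidean distance from in_x to every row of data_set (broadcasting).
    distances = np.sqrt(((data_set - in_x) ** 2).sum(axis=1))
    # Indices of the k nearest samples.
    nearest = distances.argsort()[:k]
    # Majority vote over the labels of those samples.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']

print(classify_compact([0, 0], group, labels, 3))      # B
print(classify_compact([1.0, 1.2], group, labels, 3))  # A
```

With k = 3, the point [0,0] has two B neighbors and one A neighbor, so it is classified as B; the point [1.0,1.2] has two A neighbors and one B neighbor, so it is classified as A.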

Code Details

I suspect many readers will find parts of the code above hard to follow, so next I will explain a few key points of this function, to help both readers and myself review the algorithm.

Arguments to the classify function:

inX: the input vector to classify
dataSet: the set of training samples
labels: the label vector
k: the number of nearest neighbors to consider

shape: an attribute of an array that describes the size of each of its dimensions; dataSet.shape[0] is the number of rows, i.e., the number of training samples.

tile(inX, (dataSetSize,1)): builds a two-dimensional array by repeating inX. dataSetSize is the number of times inX is repeated along the rows, and 1 means it is repeated once along the columns. The whole line then subtracts each element of dataSet from the corresponding element of this array, so the matrix subtraction is done in a single step. It is hard not to admire how simple and convenient this is!
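The following small sketch shows what tile produces for the sample data used in this article. It also shows, as an aside that is not part of the original code, that NumPy broadcasting gives the same difference matrix without calling tile at all. The variable names are chosen for illustration.

```python
import numpy as np

in_x = [0, 0]
data_set = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])

# Repeat in_x once per training sample: 4 rows, each a copy of in_x.
tiled = np.tile(in_x, (data_set.shape[0], 1))
print(tiled.shape)  # (4, 2)

# Element-wise difference between the repeated input and the samples.
diff_tiled = tiled - data_set

# Broadcasting stretches in_x across the rows automatically.
diff_broadcast = np.asarray(in_x) - data_set
print(np.array_equal(diff_tiled, diff_broadcast))  # True
```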

axis=1: with axis equal to 1, sum() adds up the numbers within each row of the matrix, giving one total per row; with axis equal to 0, it adds up the numbers within each column, giving one total per column.
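A tiny example makes the difference between the two axis values easy to see. The matrix m here is made up purely for illustration.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

# axis=1: one total per row (1+2+3 and 4+5+6).
row_sums = m.sum(axis=1)
print(row_sums.tolist())  # [6, 15]

# axis=0: one total per column (1+4, 2+5, 3+6).
col_sums = m.sum(axis=0)
print(col_sums.tolist())  # [5, 7, 9]
```

In classify, each row holds the squared differences for one training sample, so axis=1 yields one squared distance per sample.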

argsort(): returns the indices that would sort the array in non-descending order; it does not return the sorted values themselves.
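Here is a short illustration of what argsort returns. The distance values are rounded and made up for illustration.

```python
import numpy as np

distances = np.array([1.49, 1.41, 0.0, 0.1])

# Positions of the elements, ordered from smallest to largest value.
order = distances.argsort()
print(order.tolist())  # [2, 3, 1, 0]

# Indexing with those positions gives the sorted values.
print(distances[order].tolist())  # [0.0, 0.1, 1.41, 1.49]
```

Because the result holds positions rather than values, it can be used directly to look up the label of each nearest sample.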

classCount.get(numOflabel,0) + 1: I have to say this line of code is very beautiful. get() is the dictionary method for accessing an item: it looks up the item with key numOflabel and, if there is no such item, returns the default value 0. The result is then incremented by 1 and stored back under the same key. Doing all of this in a single line of Python is really concise and efficient.
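The counting idiom can be tried on its own, without NumPy. In this sketch the list nearest_labels stands in for the labels of the k nearest samples and is made up for illustration.

```python
import operator

nearest_labels = ['B', 'B', 'A']

class_count = {}
for label in nearest_labels:
    # get() returns 0 the first time a label is seen, so no key check is needed.
    class_count[label] = class_count.get(label, 0) + 1

print(sorted(class_count.items()))  # [('A', 1), ('B', 2)]

# Rank the (label, count) pairs by count, largest first, and take the winner.
ranked = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
print(ranked[0][0])  # B
```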

Closing Remarks

That's pretty much it for the principle and code implementation of the K-Nearest Neighbor (KNN) algorithm. The next task is to get more familiar with it and practice until I can write it from scratch without any reference.

That is the whole content of this article. I hope you enjoyed it.