I. Introduction
The k-Nearest Neighbor algorithm, also known as the KNN algorithm, is in principle the simplest algorithm among data-mining techniques.
How it works: given a training dataset whose instances carry known category labels, when a new, unlabeled instance arrives, find the k instances in the training set that are closest to it; if the majority of those k instances belong to some category, the new instance is assigned to that category. Put simply, the k points nearest to X vote on which category X belongs to: with k = 5, for example, if three of the five nearest neighbors of X are action movies, X is classified as an action movie.
II. Steps of the k-nearest neighbor algorithm
(1) Calculate the distance between each point in the labeled dataset and the current point;
(2) Sort the distances in ascending order;
(3) Select the k points closest to the current point;
(4) Count how often each category occurs among those k points;
(5) Return the most frequent category among the k points as the predicted category for the current point (a sketch of all five steps follows this list).
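Here is a minimal sketch of these five steps (the function and variable names are my own, not from the code below):

```python
import numpy as np

def knn_classify(new_point, features, labels, k):
    # (1) Euclidean distance from every labeled point to the new point
    dists = np.sqrt(((np.asarray(features) - np.asarray(new_point)) ** 2).sum(axis=1))
    # (2)+(3) indices of the k smallest distances
    nearest = dists.argsort()[:k]
    # (4) count how often each category occurs among those k neighbors
    votes = {}
    for i in nearest:
        votes[labels[i]] = votes.get(labels[i], 0) + 1
    # (5) return the most frequent category
    return max(votes, key=votes.get)
```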
III. Python Implementation
Task: given the labeled movies below, predict the genre (comedy, action, or romance) of a movie whose genre is unknown.
| # | Movie Title | Funny Shots | Hug Shots | Fight Shots | Movie Genre |
|---|---|---|---|---|---|
| 0 | Kung Fu Panda | 39 | 0 | 31 | comedy |
| 1 | Ip Man 3 | 3 | 2 | 65 | action |
| 2 | London Has Fallen | 2 | 3 | 55 | action |
| 3 | Surrogate Lover | 9 | 38 | 2 | romance |
| 4 | New Step by Step | 8 | 34 | 17 | romance |
| 5 | The Bourne Identity | 5 | 2 | 57 | action |
| 6 | Kung Fu Panda | 39 | 0 | 31 | comedy |
| 7 | The Mermaid | 21 | 17 | 5 | comedy |
| 8 | The Boss Baby | 45 | 2 | 9 | comedy |
| 9 | Detective Chinatown | 23 | 3 | 17 | ? |
The distance metric is the Euclidean distance:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
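For example, using all three shot counts as features, the distance between Detective Chinatown (23, 3, 17) and Kung Fu Panda (39, 0, 31) is $\sqrt{(39-23)^2 + (0-3)^2 + (31-17)^2} = \sqrt{461} \approx 21.5$.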
Build the dataset:
```python
import pandas as pd

rowdata = {
    'Movie Title': ['Kung Fu Panda', 'Ip Man 3', 'London Has Fallen',
                    'Surrogate Lover', 'New Step by Step', 'The Bourne Identity',
                    'Kung Fu Panda', 'The Mermaid', 'The Boss Baby'],
    'Funny Shots': [39, 3, 2, 9, 8, 5, 39, 21, 45],
    'Hug Shots': [0, 2, 3, 38, 34, 2, 0, 17, 2],
    'Fight Shots': [31, 65, 55, 2, 17, 57, 31, 5, 9],
    'Movie Genre': ['comedy', 'action', 'action', 'romance', 'romance',
                    'action', 'comedy', 'comedy', 'comedy'],
}
movie_data = pd.DataFrame(rowdata)
```
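A quick sanity check that the frame matches the table above:

```python
print(movie_data.shape)    # (9, 5): nine labeled movies, five columns
print(movie_data.iloc[0])  # first row: Kung Fu Panda, 39, 0, 31, comedy
```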
Calculate the distance between each point in the labeled dataset and the current point:
```python
# Feature values of 'Detective Chinatown', the movie to classify (row 9 of the table)
new_data = [23, 3, 17]
# Columns 1:4 are the three shot-count features; rows 0-8 are the labeled movies
dist = list((((movie_data.iloc[:9, 1:4] - new_data) ** 2).sum(1)) ** 0.5)
```
Sort the distances in ascending order, then select the k points with the smallest distances (too small a k overfits easily; more on that in a later article):
```python
k = 4
dist_l = pd.DataFrame({'dist': dist, 'labels': movie_data.iloc[:9, 4]})
dr = dist_l.sort_values(by='dist')[:k]
```
Count the frequency of each category among the first k points:
```python
re = dr.loc[:, 'labels'].value_counts()
```
Select the category with the highest frequency as the predicted category for the current point
```python
result = []
result.append(re.index[0])
result
```
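With the data above and k = 4, the four nearest neighbors of Detective Chinatown are The Mermaid, the two Kung Fu Panda rows, and The Boss Baby, all comedies, so `result` comes back as `['comedy']`.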
IV. Improving the match results of a dating site
```python
import pandas as pd

# Import the dataset (assuming the classic datingTestSet.txt from Machine Learning in Action)
datingTest = pd.read_table('datingTestSet.txt', header=None)
datingTest.head()

# Analyze the data (the %matplotlib line is Jupyter magic)
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

# Color-code the different labels
Colors = []
for i in range(datingTest.shape[0]):
    m = datingTest.iloc[i, -1]  # label column
    if m == 'didntLike':
        Colors.append('black')
    if m == 'smallDoses':
        Colors.append('orange')
    if m == 'largeDoses':
        Colors.append('red')

# Plot pairwise scatterplots of the features
plt.rcParams['font.sans-serif'] = ['SimHei']  # figure font (SimHei)
pl = plt.figure(figsize=(12, 8))  # create a canvas

fig1 = pl.add_subplot(221)  # 2x2 grid, first subplot
plt.scatter(datingTest.iloc[:, 1], datingTest.iloc[:, 2], marker='.', c=Colors)
plt.xlabel('Ratio of time spent playing video games')
plt.ylabel('Liters of ice cream consumed per week')

fig2 = pl.add_subplot(222)
plt.scatter(datingTest.iloc[:, 0], datingTest.iloc[:, 1], marker='.', c=Colors)
plt.xlabel('Frequent flyer miles per year')
plt.ylabel('Ratio of time spent playing video games')

fig3 = pl.add_subplot(223)
plt.scatter(datingTest.iloc[:, 0], datingTest.iloc[:, 2], marker='.', c=Colors)
plt.xlabel('Frequent flyer miles per year')
plt.ylabel('Liters of ice cream consumed per week')
plt.show()

# Normalize the data
def minmax(dataSet):
    minDf = dataSet.min()
    maxDf = dataSet.max()
    normSet = (dataSet - minDf) / (maxDf - minDf)
    return normSet

datingT = pd.concat([minmax(datingTest.iloc[:, :3]), datingTest.iloc[:, 3]], axis=1)
datingT.head()

# Split the training and test sets
def randSplit(dataSet, rate=0.9):
    n = dataSet.shape[0]
    m = int(n * rate)
    train = dataSet.iloc[:m, :]
    test = dataSet.iloc[m:, :]
    test.index = range(test.shape[0])
    return train, test

train, test = randSplit(datingT)

# Classifier test code for the dating site
def datingClass(train, test, k):
    n = train.shape[1] - 1  # subtract the label column
    m = test.shape[0]       # number of test rows
    result = []
    for i in range(m):
        dist = list((((train.iloc[:, :n] - test.iloc[i, :n]) ** 2).sum(1)) ** 0.5)
        dist_l = pd.DataFrame({'dist': dist, 'labels': train.iloc[:, n]})
        dr = dist_l.sort_values(by='dist')[:k]
        re = dr.loc[:, 'labels'].value_counts()
        result.append(re.index[0])
    result = pd.Series(result)
    test['predict'] = result  # add a prediction column
    acc = (test.iloc[:, -1] == test.iloc[:, -2]).mean()
    print(f'The model prediction accuracy is {acc}')
    return test

datingClass(train, test, 5)  # accuracy: 95%
```
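One step worth calling out: without normalization, frequent-flyer miles (tens of thousands) would swamp the other two features in the Euclidean distance. The `minmax` function rescales every feature to [0, 1]:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

For example, 40,000 miles in a column whose values range from 0 to 100,000 normalizes to 0.4.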
V. Handwritten digit recognition
```python
import os
import pandas as pd

# Get the labeled training set
def get_train():
    path = 'digits/trainingDigits'
    trainingFileList = os.listdir(path)
    train = pd.DataFrame()
    img = []     # each 32x32 image of 0s and 1s, flattened into one string
    labels = []  # label parsed from the file name
    for i in range(len(trainingFileList)):
        filename = trainingFileList[i]
        txt = pd.read_csv(f'digits/trainingDigits/{filename}', header=None)  # 32 lines
        num = ''
        for j in range(txt.shape[0]):  # concatenate the 32 lines into one
            num += txt.iloc[j, :]
        img.append(num[0])
        filelable = filename.split('_')[0]  # e.g. '3_12.txt' is labeled '3'
        labels.append(filelable)
    train['img'] = img
    train['labels'] = labels
    return train

train = get_train()

# Get the labeled test set
def get_test():
    path = 'digits/testDigits'
    testFileList = os.listdir(path)
    test = pd.DataFrame()
    img = []     # flattened image strings
    labels = []  # labels from the file names
    for i in range(len(testFileList)):
        filename = testFileList[i]
        txt = pd.read_csv(f'digits/testDigits/{filename}', header=None)  # 32 lines
        num = ''
        for j in range(txt.shape[0]):
            num += txt.iloc[j, :]
        img.append(num[0])
        filelable = filename.split('_')[0]
        labels.append(filelable)
    test['img'] = img
    test['labels'] = labels
    return test

test = get_test()

# Classifier test code for handwritten digits
from Levenshtein import hamming

def handwritingClass(train, test, k):
    n = train.shape[0]
    m = test.shape[0]
    result = []
    for i in range(m):
        dist = []
        for j in range(n):
            d = hamming(train.iloc[j, 0], test.iloc[i, 0])  # keep as int so sorting is numeric
            dist.append(d)
        dist_l = pd.DataFrame({'dist': dist, 'labels': train.iloc[:, 1]})
        dr = dist_l.sort_values(by='dist')[:k]
        re = dr.loc[:, 'labels'].value_counts()
        result.append(re.index[0])
    result = pd.Series(result)
    test['predict'] = result
    acc = (test.iloc[:, -1] == test.iloc[:, -2]).mean()
    print(f'The model prediction accuracy is {acc}')
    return test

handwritingClass(train, test, 3)  # accuracy: 97.8%
```
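Note that this classifier measures similarity with the Hamming distance, the number of positions at which two equal-length strings differ, instead of the Euclidean distance. A quick illustration, using the same `Levenshtein` package as above:

```python
from Levenshtein import hamming

# 2 of the 8 positions differ, so the distance is 2
print(hamming('10110100', '10010110'))  # 2
```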
VI. Algorithm advantages and disadvantages
Advantages
(1) Simple, easy to implement and understand; high accuracy; mature theory; can be used for both classification and regression;
(2) Works with both numeric and nominal data;
(3) Makes no assumptions about the input data distribution;
(4) Well suited to classifying rare events.
Disadvantages
(1) High computational complexity and high space complexity;
(2) The amount of computation is large, so KNN is generally unsuitable when the dataset is very large; yet the number of samples cannot be too small either, or misclassification becomes likely;
(3) Sensitive to sample imbalance (when some categories have many samples and others few, the large categories dominate the vote);
(4) Relatively poor interpretability; it does not reveal the intrinsic meaning of the data.
This concludes the article on sample Python implementations of the k-nearest neighbor algorithm. For more on KNN in Python, please search my previous articles or browse the related articles below. I hope you will continue to support me!