
Sample code for Python implementation of the K-nearest neighbor algorithm

I. Introduction

The K-Nearest Neighbor (KNN) algorithm is, in principle, one of the simplest algorithms in data mining.

How it works: given a training dataset whose instances have known category labels, when new unlabeled data arrives, find the k instances in the training set that are closest to the new data; if the majority of those k instances belong to a certain category, the new data is assigned to that category. Put simply, the k points nearest to X vote on which category X belongs to.

II. Steps of the k-nearest neighbor algorithm

(1) Calculate the distance between each point in the dataset of known categories and the current point;

(2) Sort in increasing order of distance;

(3) Select the k points with the smallest distance from the current point;

(4) Count how often each category occurs among those k points;

(5) Return the category with the highest frequency among those k points as the predicted category for the current point (a minimal code sketch of all five steps follows).
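These five steps map almost line for line onto Python. Below is a minimal, self-contained sketch; the toy points, labels, and query point are made up purely for illustration and are not taken from the datasets used later in this article:

from collections import Counter

def knn_predict(train_points, train_labels, x, k):
    # (1) distance from every labeled point to x (Euclidean)
    dists = [sum((a - b) ** 2 for a, b in zip(p, x)) ** 0.5 for p in train_points]
    # (2)(3) sort by distance and keep the indices of the k nearest points
    nearest = sorted(range(len(dists)), key=lambda i: dists[i])[:k]
    # (4)(5) count the labels of those k points and return the most frequent one
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two features per point
points = [(1, 101), (5, 89), (108, 5), (115, 8)]
labels = ['romance', 'romance', 'action', 'action']
print(knn_predict(points, labels, (24, 67), k=3))  # -> romance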

III. Python Implementation

Task: determine whether a movie is a comedy, an action movie, or a romance from its counts of funny shots, hug shots, and fight shots.

   Movie Title          Funny Shots  Hug Shots  Fight Shots  Movie Type
0  Kung Fu Panda                 39          0           31  comedy
1  Ip Man 3                       3          2           65  action
2  London Has Fallen              2          3           55  action
3  Surrogate Lover                9         38            2  romance
4  New Step by Step               8         34           17  romance
5  The Bourne Identity            5          2           57  action
6  Kung Fu Panda                 39          0           31  comedy
7  The Mermaid                   21         17            5  comedy
8  Baby in Charge                45          2            9  comedy
9  Detective Chinatown           23          3           17  ? (to be predicted)

Distances are measured with the Euclidean distance: d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2).
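For example, the distance between Detective Chinatown (23, 3, 17) and Kung Fu Panda (39, 0, 31) in the table above is sqrt((39 - 23)^2 + (0 - 3)^2 + (31 - 17)^2) = sqrt(256 + 9 + 196) = sqrt(461) ≈ 21.5.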

Build the dataset

import pandas as pd

rowdata = {
    "Movie Title": ['Kung Fu Panda', 'Ip Man 3', 'London Has Fallen', 'Surrogate Lover', 'New Step by Step', 'The Bourne Identity', 'Kung Fu Panda', 'The Mermaid', 'Baby in Charge'],
    "Funny Shots": [39, 3, 2, 9, 8, 5, 39, 21, 45],
    "Hug Shots": [0, 2, 3, 38, 34, 2, 0, 17, 2],
    "Fight Shots": [31, 65, 55, 2, 17, 57, 31, 5, 9],
    "Movie Type": ["comedy", "action", "action", "romance", "romance", "action", "comedy", "comedy", "comedy"]
}
movie_data = pd.DataFrame(rowdata)

Calculate the distance between each point in the known-category dataset and the current point

# Row 9 of the table, Detective Chinatown, is the new unlabeled point
new_data = [23, 3, 17]
dist = list((((movie_data.iloc[:, 1:4] - new_data) ** 2).sum(1)) ** 0.5)

Sort the distances in ascending order, then select the k points nearest to the current point (the choice of k matters: too small a k overfits easily; more on this in a later column).

k = 4
dist_l = pd.DataFrame({'dist': dist, 'labels': movie_data.iloc[:, 4]})
dr = dist_l.sort_values(by='dist')[:k]

Count how often each category occurs among the first k points

re = dr.loc[:, 'labels'].value_counts()
re.index[0]

Select the most frequent category as the predicted category for the current point

result = []
result.append(re.index[0])
result  # -> ['comedy']
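As a sanity check, the same prediction can be reproduced with scikit-learn's KNeighborsClassifier. This is only a cross-check, assuming scikit-learn is installed; it is not used elsewhere in this article:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(movie_data.iloc[:, 1:4].values, movie_data.iloc[:, 4])
print(knn.predict([[23, 3, 17]]))  # -> ['comedy']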

IV. Determining the effectiveness of dating-site matches

# Import the dataset (the filename was lost in formatting; 'datingTestSet.txt' is assumed here)
datingTest = pd.read_table('datingTestSet.txt', header=None)
datingTest.head()

# Analyze the data
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

# Color-code the three labels
Colors = []
for i in range(datingTest.shape[0]):
    m = datingTest.iloc[i, -1]  # label of row i
    if m == 'didntLike':
        Colors.append('black')
    if m == 'smallDoses':
        Colors.append('orange')
    if m == 'largeDoses':
        Colors.append('red')

# Plot pairwise scatter plots of the features
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese labels
pl = plt.figure(figsize=(12, 8))  # create a canvas

fig1 = pl.add_subplot(221)  # 2x2 grid of subplots, first position
plt.scatter(datingTest.iloc[:, 1], datingTest.iloc[:, 2], marker='.', c=Colors)
plt.xlabel('Ratio of time spent playing video games')
plt.ylabel('Liters of ice cream consumed per week')

fig2 = pl.add_subplot(222)
plt.scatter(datingTest.iloc[:, 0], datingTest.iloc[:, 1], marker='.', c=Colors)
plt.xlabel('Frequent flyer miles per year')
plt.ylabel('Ratio of time spent playing video games')

fig3 = pl.add_subplot(223)
plt.scatter(datingTest.iloc[:, 0], datingTest.iloc[:, 2], marker='.', c=Colors)
plt.xlabel('Frequent flyer miles per year')
plt.ylabel('Liters of ice cream consumed per week')
plt.show()


# Normalize the data: the features are on very different scales, and without
# scaling, the largest-valued feature would dominate the Euclidean distance
def minmax(dataSet):
    minDf = dataSet.min()
    maxDf = dataSet.max()
    normSet = (dataSet - minDf) / (maxDf - minDf)
    return normSet

datingT = pd.concat([minmax(datingTest.iloc[:, :3]), datingTest.iloc[:, 3]], axis=1)
datingT.head()
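A quick check of minmax on a toy frame (values made up for illustration) shows each column being scaled to [0, 1] independently:

demo = pd.DataFrame({'a': [0, 5, 10], 'b': [100, 150, 200]})
print(minmax(demo))
#      a    b
# 0  0.0  0.0
# 1  0.5  0.5
# 2  1.0  1.0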

# Split the training and test sets (the file's rows are assumed to be in random
# order, so taking the first 90% of rows works as a random split)
def randSplit(dataSet, rate=0.9):
    n = dataSet.shape[0]
    m = int(n * rate)
    train = dataSet.iloc[:m, :]
    test = dataSet.iloc[m:, :]
    test.index = range(test.shape[0])  # reset the test index to start from 0
    return train, test

train, test = randSplit(datingT)


# Classifier test code for the dating site
def datingClass(train, test, k):
    n = train.shape[1] - 1  # number of feature columns (subtract the label column)
    m = test.shape[0]       # number of test rows
    result = []
    for i in range(m):
        # Euclidean distance from test row i to every training row
        dist = list((((train.iloc[:, :n] - test.iloc[i, :n]) ** 2).sum(1)) ** 0.5)
        dist_l = pd.DataFrame({'dist': dist, 'labels': train.iloc[:, n]})
        dr = dist_l.sort_values(by='dist')[:k]
        re = dr.loc[:, 'labels'].value_counts()
        result.append(re.index[0])
    result = pd.Series(result)
    test['predict'] = result  # add a prediction column
    acc = (test.iloc[:, -1] == test.iloc[:, -2]).mean()
    print(f'The model prediction accuracy is {acc}')
    return test


datingClass(train, test, 5)  # accuracy is about 95%

V. Handwritten digit recognition

import os


# Build the labeled training set: each file in digits/trainingDigits is a
# 32x32 grid of 0/1 characters, and the digit label is the part of the file
# name before the underscore
def get_train():
    path = 'digits/trainingDigits'
    trainingFileList = os.listdir(path)
    train = pd.DataFrame()
    img = []     # each 32x32 image flattened into a single 0/1 string
    labels = []  # the digit label parsed from the file name
    for i in range(len(trainingFileList)):
        filename = trainingFileList[i]
        txt = pd.read_csv(f'digits/trainingDigits/{filename}', header=None)  # 32 rows
        num = ''
        # concatenate the 32 rows into one string
        for j in range(txt.shape[0]):
            num += txt.iloc[j, :]
        img.append(num[0])
        filelable = filename.split('_')[0]
        labels.append(filelable)
    train['img'] = img
    train['labels'] = labels
    return train

train = get_train()



# Build the labeled test set in the same way
def get_test():
    path = 'digits/testDigits'
    testFileList = os.listdir(path)
    test = pd.DataFrame()
    img = []     # each 32x32 image flattened into a single 0/1 string
    labels = []  # the digit label parsed from the file name
    for i in range(len(testFileList)):
        filename = testFileList[i]
        txt = pd.read_csv(f'digits/testDigits/{filename}', header=None)  # 32 rows
        num = ''
        # concatenate the 32 rows into one string
        for j in range(txt.shape[0]):
            num += txt.iloc[j, :]
        img.append(num[0])
        filelable = filename.split('_')[0]
        labels.append(filelable)
    test['img'] = img
    test['labels'] = labels
    return test

test = get_test()

# Classifier test code for handwritten digits
from Levenshtein import hamming

def handwritingClass(train, test, k):
    n = train.shape[0]  # number of training images
    m = test.shape[0]   # number of test images
    result = []
    for i in range(m):
        dist = []
        for j in range(n):
            # Hamming distance: how many positions of the two 0/1 strings differ
            # (kept as an int so that sort_values sorts numerically)
            d = hamming(train.iloc[j, 0], test.iloc[i, 0])
            dist.append(d)
        dist_l = pd.DataFrame({'dist': dist, 'labels': train.iloc[:, 1]})
        dr = dist_l.sort_values(by='dist')[:k]
        re = dr.loc[:, 'labels'].value_counts()
        result.append(re.index[0])
    result = pd.Series(result)
    test['predict'] = result
    acc = (test.iloc[:, -1] == test.iloc[:, -2]).mean()
    print(f'The model prediction accuracy is {acc}')
    return test

handwritingClass(train, test, 3)  # accuracy is about 97.8%
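If the python-Levenshtein package is not available, the Hamming distance between two equal-length 0/1 strings can be computed directly; the following is a minimal drop-in replacement sketch:

def hamming(a, b):
    # count the positions at which the two equal-length strings differ
    assert len(a) == len(b)
    return sum(c1 != c2 for c1, c2 in zip(a, b))

print(hamming('0110', '0011'))  # -> 2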

VI. Algorithm advantages and disadvantages

Advantages

(1) Simple and easy to use, easy to understand, with high accuracy and a mature theory; usable for both classification and regression;

(2) Can be used for both numerical and discrete data;

(3) Makes no assumptions about the input data;

(4) Suitable for categorizing rare events.

Disadvantages

(1) High computational complexity and high space complexity;

(2) The amount of computation is large, so it is generally unsuitable for very large datasets; at the same time, the samples per class cannot be too few, or misclassification becomes likely;

(3) Sensitive to the sample-imbalance problem (i.e., some categories have many samples while others have few);

(4) Relatively poor interpretability; it does not reveal the intrinsic meaning of the data.

This concludes this article's sample code for a Python implementation of the K-nearest neighbor algorithm. For more on the K-nearest neighbor algorithm in Python, please search my previous articles or continue browsing the related articles below. I hope you will support me in the future!