
Python Machine Learning Case Study Tutorial - Implementation of K Nearest Neighbor Algorithm

K Nearest Neighbor is a classification algorithm, and its intuition is the simplest to explain: he who stays near vermilion turns red, and he who stays near ink turns black. To judge what kind of person someone is, look at what kind of friends they keep. Of course, this also involves deciding which aspects make two people similar (the object's features), what counts as being close, for example eating or shopping together (the distance function), how to judge the target person from how good their friends are (the classification algorithm), whether friends of different quality and different degrees of closeness should count differently (distance weighting), how many friends to look at (the value of k), and whether the degree of excellence can be expressed as a score (a probability distribution).

K Nearest Neighbor Concept:

It works on the following principle: there is a collection of sample data, the training sample set, and every item in it has a label, i.e., we know which class each item in the sample set belongs to. When new, unlabelled data is input, each feature of the new data is compared with the corresponding feature of the items in the sample set, and the algorithm extracts the class labels of the most similar (nearest-neighbor) samples. In general we consider only the top k most similar items in the sample dataset, which is where the k in the k-nearest neighbor algorithm comes from; usually k is an integer no larger than 20. Finally, the class that occurs most often among these k most similar items is chosen as the class of the new data.
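As a quick illustration of this majority-vote idea, here is a minimal sketch with made-up 2-D points and labels (knn_classify and its data are illustrative and separate from the wine example that follows):

from collections import Counter

# Minimal sketch of majority-vote k nearest neighbors on made-up data
def knn_classify(samples,labels,new_point,k=3):
 # Squared Euclidean distance from new_point to every labelled sample
 dists=sorted((sum((a-b)**2 for a,b in zip(s,new_point)),lab) for s,lab in zip(samples,labels))
 # Vote among the labels of the k nearest samples
 nearest=[lab for d,lab in dists[:k]]
 return Counter(nearest).most_common(1)[0][0]

samples=[(1,1),(1,2),(8,8),(9,8)]
labels=['red','red','black','black']
print(knn_classify(samples,labels,(2,1))) # 'red', because its nearest neighbors are mostly 'red'

The rest of this article uses the same idea for regression: instead of voting, we average the prices of the nearest neighbors.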

Today we will use the k-nearest neighbor algorithm to build a price model for wine.

Constructing data sets

Construct a sample dataset of wines. The price of a wine depends heavily on its rating and age.

from random import random,randint
import math

# Price modeling based on grade and age
def wineprice(rating,age):
 peak_age=rating-50

 # Base price determined by the rating
 price=rating/2
 if age>peak_age:
  # Past the peak year, the price falls to zero over the following 10 years
  price=price*(5-(age-peak_age)/2)
 else:
  # Approaching the peak year, the price rises to 5 times the base price
  price=price*(5*((age+1)/peak_age))
 if price<0: price=0
 return price
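To get a feel for the model, here is a quick illustrative check (the ages are chosen arbitrarily): a wine rated 95 has peak_age = 45, so the modelled price rises until that age and falls afterwards.

# Illustrative check of the pricing model at a few arbitrary ages
for a in (5,25,45,50,60):
 print(a,round(wineprice(95.0,a),2))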

# Generate a batch of simulated data to serve as the sample dataset
def wineset1():
 rows=[]
 for i in range(300):
  # Randomly generated ages and grades
  rating=random()*50+50
  age=random()*50

  # Get a reference price
  price=wineprice(rating,age)

  # Add some noise
  price*=(random()*0.2+0.9)

  # Add to the data set
  rows.append({'input':(rating,age),'result':price})
 return rows

Distance between data

To use k nearest neighbors, we first need to find those nearest neighbors, which in turn requires a measure of the distance between data points. Here we use the Euclidean distance.

# Define the distance between two items using the Euclidean distance
def euclidean(v1,v2):
 d=0.0
 for i in range(len(v1)):
  d+=(v1[i]-v2[i])**2
 return math.sqrt(d)

Get the k sample data with the closest distance to the new data

# Calculate the distance between the item to be predicted and every item in the original dataset. data is the original dataset, vec1 is the item to predict
def getdistances(data,vec1):
 distancelist=[]

 # Iterate over every item in the dataset
 for i in range(len(data)):
  vec2=data[i]['input']

  # Add the distance and the item's index to the distance list
  distancelist.append((euclidean(vec1,vec2),i))

 # Sort by distance
 distancelist.sort()
 return distancelist # Return the sorted distance list

Predict the attributes of the new data based on the k nearest sample data

1. Simple average

# Average the first k results with the smallest distance value
def knnestimate(data,vec1,k=5):
 # Get sorted distance values
 dlist=getdistances(data,vec1)
 avg=0.0

 # Average the first k results
 for i in range(k):
  idx=dlist[i][1]
  avg+=data[idx]['result']
 avg=avg/k
 return avg

2. Weighted average

If a plain average is used, some of the first k nearest neighbors may still be quite far away, yet they count just as much as the very close ones. In that case the simple mean of the first k samples is biased, so we use a weighted average instead, assigning smaller weights to more distant neighbors and larger weights to closer ones.

So how exactly do you assign weights based on the distance between data? Here three decreasing functions are used as weight assignment methods.

2.1. Use the inverse function to assign weights to the nearest neighbors.

def inverseweight(dist,num=1.0,const=0.1):
 return num/(dist+const)

2.2. Use a subtraction function to assign weights to the nearest neighbors. Note that when all of the nearest distances are greater than const, every weight is 0, so this method cannot be used.

def subtractweight(dist,const=1.0):
 if dist>const:
  return 0
 else:
  return const-dist

2.3. Use a Gaussian function to assign weights based on distance.

def gaussian(dist,sigma=5.0):
 return math.e**(-dist**2/(2*sigma**2))
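To see how the three weight functions behave, here is a small illustrative comparison at a few arbitrary distances; all three decrease as the distance grows, but at different rates.

# Illustrative comparison of the three weighting functions
for d in (0.1,1.0,5.0,20.0):
 print(d,inverseweight(d),subtractweight(d),gaussian(d))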

With the weight assignment method in place, the weighted average can be calculated below.

# Weighted average of the first k results with the smallest distance value
def weightedknn(data,vec1,k=5,weightf=gaussian):
 # Get the distance value
 dlist=getdistances(data,vec1)
 avg=0.0
 totalweight=0.0

 # To get a weighted average
 for i in range(k):
  dist=dlist[i][0]
  idx=dlist[i][1]
  weight=weightf(dist)
  avg+=weight*data[idx]['result']
  totalweight+=weight
 if totalweight==0: return 0
 avg=avg/totalweight
 return avg

First test

The functions above complete the k-nearest-neighbor prediction for new data; let's test them below.

if __name__=='__main__': # Only run this function when executing the current module
 data = wineset1()  # Create the first dataset
 result=knnestimate(data,(95.0,3.0)) # Predictions based on nearest neighbor direct averaging
 print(result)

 result=weightedknn(data,(95.0,3.0),weightf=inverseweight) # Use the inverse function as a weight distribution method for weighted averaging
 print(result)
 result = weightedknn(data, (95.0, 3.0), weightf=subtractweight) # Use the subtraction function as a weight distribution method for weighted averaging
 print(result)
 result = weightedknn(data, (95.0, 3.0), weightf=gaussian) # Use Gaussian function for weight distribution method for weighted average
 print(result)

Cross-validation

Cross-validation is used to evaluate how good an algorithm or its parameters are. For example, we defined three ways of assigning weights above; which one is better? Cross-validation can tell us.

Randomly select 95% of the sample dataset as a training set and 5% as new data, make predictions on the new data and compare them with known results to see the effect of the algorithm.

To implement cross-validation, we first need a function that divides the sample dataset into two subsets: a training set and the new (test) data.

# Split the data. test is the fraction to use as the test dataset; the rest becomes training data
def dividedata(data,test=0.05):
 trainset=[]
 testset=[]
 for row in data:
  if random()<test:
   testset.append(row)
  else:
   trainset.append(row)
 return trainset,testset

We also need to be able to apply an algorithm to the test data and measure the error between the predicted and true results.

# Measure the error of the predictions an algorithm makes on a test dataset. algf is the algorithm function, trainset the training dataset, testset the test dataset
def testalgorithm(algf,trainset,testset):
 error=0.0
 for row in testset:
  guess=algf(trainset,row['input']) # This step should be consistent with the format of the sample data, the list of elements for a dictionary, each dictionary contains input and result attributes. And the function parameters are lists and tuples
  error+=(row['result']-guess)**2
  #print row['result'],guess
 #print error/len(testset)
 return error/len(testset)

With the data-splitting and error-measuring functions in place, we can run several such experiments on the original dataset and average the errors.

# Combine data splitting and error statistics. Split the dataset multiple times and verify that the algorithm is good or bad
def crossvalidate(algf,data,trials=100,test=0.1):
 error=0.0
 for i in range(trials):
  trainset,testset=dividedata(data,test)
  error+=testalgorithm(algf,trainset,testset)
 return error/trials

Cross-validation testing

if __name__=='__main__': # Only run this function when executing the current module
 data = wineset1()  #Creating the first datasets
 print(data)
 error = crossvalidate(knnestimate,data) # Evaluate the simple-average algorithm
 print('Mean error:'+str(error))

 def knn3(d,v): return knnestimate(d,v,k=3) # Define a small wrapper function. d is the dataset list, v is the input tuple
 error = crossvalidate(knn3, data)   # Evaluate the simple-average algorithm with k=3
 print('Mean error:' + str(error))

 def knninverse(d,v): return weightedknn(d,v,weightf=inverseweight) # Define a small wrapper function. d is the dataset list, v is the input tuple
 error = crossvalidate(knninverse, data)   # Evaluate weighted knn with the inverse function as the weight-assignment method
 print('Mean error:' + str(error))

Variables of different types, different value ranges, and useless variables

The attributes in the sample data may differ in type and in range of values. For example, the wine data above might include the rating (0-100), the age of the wine (0-50) and the bottle size (three volumes: 375.0 ml, 750.0 ml and 1500.0 ml), and the sample data we obtain may even contain useless attributes, such as the number of the production line the wine came from (an integer between 1 and 20). When computing distances between samples, variation in attributes with a large value range overwhelms variation in attributes with a small range, to the point where the small-range attributes are effectively ignored. Variation in useless attributes also inflates the distance between data points.

So we need to scale the attributes of the sample data to suitable ranges, and we need to be able to remove useless attributes.

Constructing a new data set

# Construct new datasets to model problems with different types of variables
def wineset2():
 rows=[]
 for i in range(300):
  rating=random()*50+50 # Wine rating
  age=random()*50   # Wine age
  aisle=float(randint(1,20)) # Aisle number (irrelevant attribute)
  bottlesize=[375.0,750.0,1500.0][randint(0,2)] # Bottle size
  price=wineprice(rating,age) # Wine price
  price*=(bottlesize/750)
  price*=(random()*0.2+0.9)
  rows.append({'input':(rating,age,aisle,bottlesize),'result':price})
 return rows

Implementing the ability to scale the value of an attribute proportionally

# Scale the attributes proportionally. scale is the scaling ratio of the value of each attribute.
def rescale(data,scale):
 scaleddata=[]
 for row in data:
  scaled=[scale[i]*row['input'][i] for i in range(len(scale))]
  scaleddata.append({'input':scaled,'result':row['result']})
 return scaleddata
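As an illustration (the scale values here are made up, not the optimized ones found below), scaling an attribute by 0 removes it entirely, which is how a useless attribute such as the aisle number can be dropped:

# Illustrative scaling: multiply rating and age by 10, drop the aisle number (x0), shrink the bottle size (x0.5)
sdata=rescale(wineset2(),[10,10,0,0.5])
print(sdata[0])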

That leaves one final question: by how much should each attribute be scaled? This is an optimization problem, and we can search for a good solution with optimization techniques. The cost function to minimize is the error between the predicted and true results after scaling: the smaller the error, the better. The error is computed with the same crossvalidate function used for cross-validation above.

Construct the cost function below

# Generate the cost function (as a closure over algf and data)
def createcostfunction(algf,data):
 def costf(scale):
  sdata=rescale(data,scale)
  return crossvalidate(algf,sdata,trials=10)
 return costf

weightdomain=[(0,10)]*4  # The search range for each scale factor is 0-10 (you can set this yourself); it is used by the optimization algorithm

For optimization techniques, see https:///article/
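The optimization module itself is not listed in this article. If you do not have it, a minimal simulated-annealing sketch like the one below can stand in; the name annealingoptimize, the (domain, costf, step) signature and the temperature schedule are assumptions chosen to match the call in the test code.

# Minimal simulated-annealing sketch (assumed stand-in for the optimization module).
# domain is a list of (min,max) pairs, one per scale factor; costf is the function to minimize.
def annealingoptimize(domain,costf,T=10000.0,cool=0.95,step=1):
 # Start from a random solution inside the domain
 vec=[float(randint(lo,hi)) for lo,hi in domain]
 while T>0.1:
  # Nudge one randomly chosen dimension by up to +/- step, clamped to the domain
  i=randint(0,len(domain)-1)
  newvec=vec[:]
  newvec[i]=min(max(newvec[i]+randint(-step,step),domain[i][0]),domain[i][1])

  # Always accept improvements; accept worse solutions with a temperature-dependent probability
  ea,eb=costf(vec),costf(newvec)
  if eb<ea or random()<math.e**(-(eb-ea)/T):
   vec=newvec
  T=T*cool
 return vec

With this stand-in you can call annealingoptimize(weightdomain,costf,step=2) directly instead of going through the optimization module.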

Test code

if __name__=='__main__': # Only run this function when executing the current module
 #======== scaling optimization===
 data = wineset2() # Batch 2 dataset created
 print(data)
 import optimization
 costf=createcostfunction(knnestimate,data)  #Creating a cost function
 result = optimization.annealingoptimize(weightdomain,costf,step=2) # Search for the best scaling with a simulated-annealing optimizer (annealingoptimize is assumed to be the name provided by the optimization module)
 print(result)

Asymmetric distributions

When the sample dataset contains multiple underlying distributions, we also want the output to reflect that rather than being a single value. We can represent this with probabilities, outputting each possible outcome together with its probability of occurring.

For example, some of the wine may have been bought from a discount store, a fact that is not recorded in the sample dataset, so the prices in the sample follow two different distributions.

Constructing data sets

def wineset3():
 rows=wineset1()
 for row in rows:
  if random()<0.5:
   # The wine was bought from a discount store
   row['result']*=0.6
 return rows

To express the result as probabilities, we need to be able to calculate the probability that the result falls within a specified range.

# Calculate the probability. data sample dataset, vec1 prediction data, low, high range of results, weightf is a function that assigns weights based on distance
def probguess(data,vec1,low,high,k=5,weightf=gaussian):
 dlist=getdistances(data,vec1) # Get a list of distances
 nweight=0.0
 tweight=0.0

 for i in range(k):
  dist=dlist[i][0] # Distance
  idx=dlist[i][1]  # Index
  weight=weightf(dist) # Weight
  v=data[idx]['result'] # Actual result

  # Is the current data point within the specified range?
  if v>=low and v<=high:
   nweight+=weight # Sum of the weights of the specified range
  tweight+=weight  # Sum of total weights
 if tweight==0: return 0

 # Probability is equal to the weight values that lie within the specified range divided by all weight values
 return nweight/tweight
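For example (the query point and price ranges here are made up), we can ask for the probability that a wine rated 99 and aged 20 falls within a given price range; a range wide enough to cover all the nearest neighbors' prices gives a probability of 1.

# Illustrative probability queries on the third dataset
data=wineset3()
print(probguess(data,(99.0,20.0),40,80))  # probability the price lies between 40 and 80
print(probguess(data,(99.0,20.0),0,1000)) # a range covering all prices, so the probability is 1.0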

For results with multiple possible outputs, expressed as values and probabilities, we can visualize them with a cumulative probability distribution plot or a probability density plot.

Plotting cumulative probability distributions

from pylab import *

# Plot the cumulative probability distribution. data is the sample dataset, vec1 the data to predict, high the upper end of the price range, k the number of nearest neighbors, weightf the weight-assignment function
def cumulativegraph(data,vec1,high,k=5,weightf=gaussian):
 t1=arange(0.0,high,0.1)
 cprob=array([probguess(data,vec1,0,v,k,weightf) for v in t1]) # Probability of different outcomes resulting from the prediction
 plot(t1,cprob)
 show()

Plotting probability densities

# Plotting probability densities
def probabilitygraph(data,vec1,high,k=5,weightf=gaussian,ss=5.0):
 # Establish a range of values representing prices
 t1=arange(0.0,high,0.1)

 # Get the probability of each small price interval across the whole range of values
 probs=[probguess(data,vec1,v,v+0.1,k,weightf) for v in t1]

 # Smooth the probability values by adding the Gaussian calculation of the nearest neighbor probabilities
 smoothed=[]
 for i in range(len(probs)):
  sv=0.0
  for j in range(0,len(probs)):
   dist=abs(i-j)*0.1
   weight=gaussian(dist,sigma=ss)
   sv+=weight*probs[j]
  smoothed.append(sv)
 smoothed=array(smoothed)

 plot(t1,smoothed)
 show()

Test code

if __name__=='__main__': # Only run this function when executing the current module

 data = wineset3() # Batch 3 dataset created
 print(data)
 cumulativegraph(data,(1,1),120) # Plot the cumulative probability distribution
 probabilitygraph(data,(1,1),6) # Plotting probability densities

This is the whole content of this article.