K-Nearest Neighbors is a classification algorithm, and its intuition is the easiest to explain: as the saying goes, "stay near vermilion and you turn red, stay near ink and you turn black." To judge what kind of person someone is, look at what kind of friends they have. Of course, other questions are involved: in which respects are they close to their friends (the object's features)? What counts as "close" - eating together, or shopping together (the distance function)? How do we judge whether the target person is good based on how good their friends are (the classification algorithm)? Should friends of different quality and different closeness be treated differently (distance weights)? How many friends should we look at (the value of k)? And can the degree of quality be expressed as a score (a probability distribution)?
K Nearest Neighbor Concept:
It works on the principle that there exists a collection of sample data, also known as the training sample set, and there exists a label for each data in the sample set, i.e., we know the correspondence between each data in the sample set and the classification to which it belongs. After inputting new data without labels, each feature of the new data is compared with the corresponding feature of the data in the sample set, and then the algorithm extracts the classification labels of the most similar data (nearest neighbors) of the sample. In general, we select only the top k most similar data in the sample data set, which is where k comes from in the k-nearest neighbor algorithm, and usually k is an integer not larger than 20. Finally, the classification with the highest number of occurrences among the k most similar data is chosen as the classification for the new data.
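As a quick illustration of the idea (this toy classifier and its made-up points are not part of the wine model built below), a few lines of Python capture the "majority vote among the k nearest samples" step:

from collections import Counter

# A toy k-nearest-neighbor classifier: majority vote among the k closest samples
def knn_classify(samples, labels, new_point, k=3):
    # Squared Euclidean distance from new_point to every sample, paired with its label
    dists = sorted((sum((a-b)**2 for a, b in zip(s, new_point)), lbl)
                   for s, lbl in zip(samples, labels))
    # Take the labels of the k nearest samples and return the most common one
    top_k = [lbl for _, lbl in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Made-up 2D points: class 'A' clusters near (1,1), class 'B' near (8,8)
print(knn_classify([(1,1),(1,2),(8,8),(9,8)], ['A','A','B','B'], (2,1)))   # -> 'A'

The rest of this article builds a regression variant of the same idea: instead of voting on a class label, we average the prices of the nearest neighbors.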
Today we will use the k-nearest neighbor algorithm to build a price model for wine.
Constructing data sets
Construct a sample dataset of wines. The price of white wine is highly dependent on grade and age.
from random import random,randint
import math

# Price model based on grade and age
def wineprice(rating,age):
    peak_age=rating-50
    # Price based on grade
    price=rating/2
    if age>peak_age:
        # Past its "peak age", the wine goes bad over the following 5 years
        price=price*(5-(age-peak_age)/2)
    else:
        # The price rises to 5 times its original value as it approaches its "peak age"
        price=price*(5*((age+1)/peak_age))
    if price<0: price=0
    return price

# Generate a batch of simulated data to serve as the sample dataset
def wineset1():
    rows=[]
    for i in range(300):
        # Randomly generate an age and a grade
        rating=random()*50+50
        age=random()*50
        # Get a reference price
        price=wineprice(rating,age)
        # Add some noise
        price*=(random()*0.2+0.9)
        # Add to the dataset
        rows.append({'input':(rating,age),'result':price})
    return rows
Distance between data
Using k nearest neighbors, it is first necessary to know those nearest neighbors, which also requires knowing the distance between the data. We use the Euclidean distance as the distance between the data.
# Define the distance between data points using the Euclidean distance
def euclidean(v1,v2):
    d=0.0
    for i in range(len(v1)):
        d+=(v1[i]-v2[i])**2
    return math.sqrt(d)
Get the k sample data with the closest distance to the new data
# Calculate the distance between a given prediction item and every item in the original dataset.
# data is the original dataset, vec1 the prediction item
def getdistances(data,vec1):
    distancelist=[]
    # Iterate over every item in the dataset
    for i in range(len(data)):
        vec2=data[i]['input']
        # Add the distance to the distance list
        distancelist.append((euclidean(vec1,vec2),i))
    # Sort by distance
    distancelist.sort()
    # Return the distance list
    return distancelist
Predict the attributes of the new data based on the k nearest sample data
1. Simple average
# Average the results of the k items with the smallest distances
def knnestimate(data,vec1,k=5):
    # Get the sorted distance values
    dlist=getdistances(data,vec1)
    avg=0.0
    # Average the first k results
    for i in range(k):
        idx=dlist[i][1]
        avg+=data[idx]['result']
    avg=avg/k
    return avg
2. Weighted average
If we simply average, some of the first k nearest neighbors may actually be quite far away, yet they still count among the k nearest. In that case the plain mean of the k sample values will be biased, so we use a weighted average instead, assigning smaller weights to more distant neighbors and larger weights to closer ones.
So how exactly do you assign weights based on the distance between data? Here three decreasing functions are used as weight assignment methods.
2.1. Use an inverse function to assign weights to the nearest neighbors.
def inverseweight(dist,num=1.0,const=0.1):
    return num/(dist+const)
2.2. Use a subtraction function to assign weights to the nearest neighbors. This fails when all of the nearest distances are greater than const, because every weight becomes 0.
def subtractweight(dist,const=1.0):
    if dist>const:
        return 0
    else:
        return const-dist
2.3. Use a Gaussian function to assign weights based on distance.
def gaussian(dist,sigma=5.0):
    return math.e**(-dist**2/(2*sigma**2))
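To get a feel for how these three functions compare, a short sketch (the sample distances are arbitrary) prints their weights side by side: the inverse function is very large near zero, the subtraction function drops to 0 beyond const, and the Gaussian decays smoothly but never reaches 0.

# Compare the three weighting functions at a few arbitrary distances
for d in (0.1, 0.5, 1.0, 5.0, 20.0):
    print(d, inverseweight(d), subtractweight(d), gaussian(d))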
With the weight assignment method in place, the weighted average can be calculated below.
# Weighted average of the results of the k items with the smallest distances
def weightedknn(data,vec1,k=5,weightf=gaussian):
    # Get the distance values
    dlist=getdistances(data,vec1)
    avg=0.0
    totalweight=0.0
    # Compute the weighted average
    for i in range(k):
        dist=dlist[i][0]
        idx=dlist[i][1]
        weight=weightf(dist)
        avg+=weight*data[idx]['result']
        totalweight+=weight
    if totalweight==0: return 0
    avg=avg/totalweight
    return avg
First test
The above completes the functionality for new data prediction using k nearest neighbors, and we test it below.
if __name__=='__main__':    # Only run when executing the current module
    data = wineset1()       # Create the first dataset
    result=knnestimate(data,(95.0,3.0))   # Predict using the plain average of the nearest neighbors
    print(result)
    result=weightedknn(data,(95.0,3.0),weightf=inverseweight)   # Weighted average using the inverse function
    print(result)
    result = weightedknn(data, (95.0, 3.0), weightf=subtractweight)   # Weighted average using the subtraction function
    print(result)
    result = weightedknn(data, (95.0, 3.0), weightf=gaussian)   # Weighted average using the Gaussian function
    print(result)
Cross-validation
Cross-validation is used to check whether an algorithm or its parameters are any good. For example, of the three weighting methods above, which one is best? Cross-validation lets us find out.
Randomly select 95% of the sample dataset as a training set and 5% as new data, make predictions on the new data and compare them with known results to see the effect of the algorithm.
To implement cross-validation, we first need a function that splits the sample dataset into two subsets: the training set and the new data (the test set).
# Split the data. test is the fraction used as the test set; the rest becomes training data
def dividedata(data,test=0.05):
    trainset=[]
    testset=[]
    for row in data:
        if random()<test:
            testset.append(row)
        else:
            trainset.append(row)
    return trainset,testset
We also need a function that applies an algorithm to the test set and measures the error between the predicted and the true results.
# Use the test set to measure the error of the predictions made by an algorithm, giving a single assessment of that algorithm.
# algf is the algorithm function, trainset the training dataset, testset the prediction (test) dataset
def testalgorithm(algf,trainset,testset):
    error=0.0
    for row in testset:
        # algf must match the format of the sample data: a list of dictionaries, each with 'input' and 'result' keys,
        # and it takes a list (the dataset) and a tuple (the input) as arguments
        guess=algf(trainset,row['input'])
        error+=(row['result']-guess)**2
        #print row['result'],guess
    #print error/len(testset)
    return error/len(testset)
With the data-splitting and error-measurement functions in place, we can run several such experiments on the original dataset and average the errors.
# Combine data splitting and error measurement. Split the dataset multiple times and average the error to judge the algorithm
def crossvalidate(algf,data,trials=100,test=0.1):
    error=0.0
    for i in range(trials):
        trainset,testset=dividedata(data,test)
        error+=testalgorithm(algf,trainset,testset)
    return error/trials
Cross-validation testing
if __name__=='__main__':    # Only run when executing the current module
    data = wineset1()       # Create the first dataset
    print(data)
    error = crossvalidate(knnestimate,data)   # Evaluate the plain-average algorithm (default k=5)
    print('Mean error:'+str(error))

    def knn3(d,v): return knnestimate(d,v,k=3)   # Wrapper function: d is a list, v a tuple
    error = crossvalidate(knn3, data)            # Evaluate the plain-average algorithm with k=3
    print('Mean error:' + str(error))

    def knninverse(d,v): return weightedknn(d,v,weightf=inverseweight)   # Wrapper function: d is a list, v a tuple
    error = crossvalidate(knninverse, data)      # Evaluate the weighted average using the inverse function
    print('Mean error:' + str(error))
Variables of different types, different value ranges, and useless variables
The attributes in the sample data may differ in type and in value range. For example, the wine attributes above could include the grade (0-100), the age of the wine (0-50), and the bottle size (one of three volumes: 375.0 ml, 750.0 ml, and 1500.0 ml); the sample data we obtain may even contain useless attributes, such as the wine's aisle number (an integer between 1 and 20). When computing distances between samples, changes in attributes with a large value range swamp changes in attributes with a small value range, to the point where the small-range attributes are effectively ignored. Changes in useless attributes also inflate the distance between data points.
So it is necessary to scale the attributes of the sample data to a suitable range and to be able to remove invalid attributes.
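To see the problem concretely, here is a hypothetical comparison using the euclidean function above; the feature tuples follow the (rating, age, aisle, bottlesize) layout of the wineset2 dataset constructed below.

# Hypothetical feature tuples: (rating, age, aisle, bottlesize)
a = (95.0, 3.0, 10.0, 750.0)
b = (55.0, 3.0, 10.0, 750.0)    # very different grade, same bottle size
c = (95.0, 3.0, 10.0, 1500.0)   # same grade, different bottle size
print(euclidean(a, b))   # 40.0  - a large, meaningful difference in grade
print(euclidean(a, c))   # 750.0 - bottle size alone dwarfs the grade difference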
Constructing a new data set
# Construct a new dataset to simulate the problem of variables of different types
def wineset2():
    rows=[]
    for i in range(300):
        rating=random()*50+50                           # Wine grade
        age=random()*50                                 # Wine age
        aisle=float(randint(1,20))                      # Aisle number (irrelevant attribute)
        bottlesize=[375.0,750.0,1500.0][randint(0,2)]   # Bottle size
        price=wineprice(rating,age)                     # Wine price
        price*=(bottlesize/750)
        price*=(random()*0.2+0.9)
        rows.append({'input':(rating,age,aisle,bottlesize),'result':price})
    return rows
Implementing the ability to scale the value of an attribute proportionally
# Scale the attributes proportionally. scale holds the scaling factor for each attribute
def rescale(data,scale):
    scaleddata=[]
    for row in data:
        scaled=[scale[i]*row['input'][i] for i in range(len(scale))]
        scaleddata.append({'input':scaled,'result':row['result']})
    return scaleddata
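Before turning to optimization, one way to sanity-check rescale is to try a hand-picked scale and compare the cross-validated error with and without it. The factors below are only a guess for illustration: keep grade and age, drop the irrelevant aisle number, shrink the bottle size.

data = wineset2()
# Hand-picked scaling factors (a guess for illustration): grade x10, age x10, aisle x0, bottle size x0.5
sdata = rescale(data, [10, 10, 0, 0.5])
print(crossvalidate(knnestimate, data))    # error with the raw attributes
print(crossvalidate(knnestimate, sdata))   # error after manual rescaling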
That leaves one final question: by how much should each attribute be scaled? This is an optimization problem, and we can find a good solution with optimization techniques. The cost function to minimize is the error between the predicted and true results after scaling - the smaller, the better - and it is computed with the same crossvalidate function used above.
Construct the cost function below
# Generate the cost function (a closure)
def createcostfunction(algf,data):
    def costf(scale):
        sdata=rescale(data,scale)
        return crossvalidate(algf,sdata,trials=10)
    return costf

# Value range (0-10) for each of the 4 scaling factors; adjustable, used by the optimization algorithm
weightdomain=[(0,10)]*4
For optimization techniques, see https:///article/
Test code
if __name__=='__main__':    # Only run when executing the current module
    #======== Scaling optimization ========
    data = wineset2()       # Create the second dataset
    print(data)
    import optimization
    costf=createcostfunction(knnestimate,data)   # Create the cost function
    # Use the simulated annealing optimizer (annealingoptimize from the optimization module of the earlier article) to find the best scaling factors
    result = optimization.annealingoptimize(weightdomain,costf,step=2)
    print(result)
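Continuing the test code above, and assuming the optimizer returns a list of four scaling factors (one per attribute), a natural follow-up is to rescale the data with the result and check whether the cross-validated error actually drops:

# result is assumed to be the list of scaling factors found by the optimizer above
sdata = rescale(data, result)
print(crossvalidate(knnestimate, data))    # error on the unscaled data
print(crossvalidate(knnestimate, sdata))   # error on the rescaled data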
Asymmetric distributions
When the sample dataset contains multiple underlying distributions, we also want the output to be non-unique. We can express this with probabilistic results, reporting each possible outcome together with its probability of occurring.
For example, it is possible for wine to be purchased from a discount store, a characteristic that is not recorded in the sample data set. So there are two forms of distribution of prices in the sample data.
Constructing data sets
def wineset3():
    rows=wineset1()
    for row in rows:
        if random()<0.5:
            # The wine was bought from a discount store
            row['result']*=0.6
    return rows
If results are to be reported as probabilities, we need to be able to compute the probability that the result falls within a specified range.
# Compute a probability. data is the sample dataset, vec1 the prediction input, low and high the result range,
# weightf the function that assigns weights based on distance
def probguess(data,vec1,low,high,k=5,weightf=gaussian):
    dlist=getdistances(data,vec1)   # Get the distance list
    nweight=0.0
    tweight=0.0
    for i in range(k):
        dist=dlist[i][0]          # Distance
        idx=dlist[i][1]           # Index
        weight=weightf(dist)      # Weight
        v=data[idx]['result']     # True result
        # Is the current data point within the specified range?
        if v>=low and v<=high:
            nweight+=weight       # Sum of the weights within the range
        tweight+=weight           # Sum of all weights
    if tweight==0: return 0
    # The probability equals the weight within the specified range divided by the total weight
    return nweight/tweight
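As a quick, informal check of probguess (the prediction input and the price ranges below are chosen arbitrarily), probabilities over disjoint ranges that together cover all plausible prices should add up to roughly 1:

data = wineset3()
# The prediction input (99.0, 20.0) and the ranges are arbitrary choices for illustration
print(probguess(data, (99.0, 20.0), 40, 80))
print(probguess(data, (99.0, 20.0), 80, 120))
print(probguess(data, (99.0, 20.0), 120, 1000))
print(probguess(data, (99.0, 20.0), 30, 120))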
For results with multiple possible outcomes, expressed as values with probabilities, we can visualize them with a cumulative probability distribution plot or a probability density plot.
Plotting cumulative probability distributions
from pylab import *

# Plot the cumulative probability distribution. data is the sample dataset, vec1 the prediction input,
# high the upper end of the range, k the number of nearest neighbors, weightf the weighting function
def cumulativegraph(data,vec1,high,k=5,weightf=gaussian):
    t1=arange(0.0,high,0.1)
    # Cumulative probability of the predicted result falling below each value
    cprob=array([probguess(data,vec1,0,v,k,weightf) for v in t1])
    plot(t1,cprob)
    show()
Plotting probability densities
# Plot the probability density
def probabilitygraph(data,vec1,high,k=5,weightf=gaussian,ss=5.0):
    # Build a range of values representing prices
    t1=arange(0.0,high,0.1)
    # Get the probability for each small interval across the whole range
    probs=[probguess(data,vec1,v,v+0.1,k,weightf) for v in t1]
    # Smooth the probabilities by adding Gaussian-weighted contributions from neighboring points
    smoothed=[]
    for i in range(len(probs)):
        sv=0.0
        for j in range(0,len(probs)):
            dist=abs(i-j)*0.1
            weight=gaussian(dist,sigma=ss)
            sv+=weight*probs[j]
        smoothed.append(sv)
    smoothed=array(smoothed)
    plot(t1,smoothed)
    show()
Test code
if __name__=='__main__':    # Only run when executing the current module
    data = wineset3()       # Create the third dataset
    print(data)
    cumulativegraph(data,(1,1),120)   # Plot the cumulative probability distribution
    probabilitygraph(data,(1,1),6)    # Plot the probability density
This is the whole content of this article.