Python string similarity
Use the difflib module to compare the similarity of two strings or pieces of text.
First import the difflib module
import difflib
Example:
Str = 'Shanghai Center Tower'
s1 = 'Mansion'
s2 = 'Shanghai Center'
s3 = 'Shanghai Center Building'
print(difflib.SequenceMatcher(None, Str, s1).quick_ratio())
print(difflib.SequenceMatcher(None, Str, s2).quick_ratio())
print(difflib.SequenceMatcher(None, Str, s3).quick_ratio())
# Output:
# 0.5
# 0.8
# 0.8333333333333334
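Note that quick_ratio() returns a fast upper bound on the exact similarity; if precise values matter, SequenceMatcher also provides ratio(). A minimal sketch of the difference:

import difflib

matcher = difflib.SequenceMatcher(None, 'Shanghai Center Tower', 'Shanghai Center')
print(matcher.ratio())        # exact similarity in [0, 1]
print(matcher.quick_ratio())  # cheaper upper bound on ratio()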
Python Similarity Evaluation
Distance" is often used when evaluating similarity:
1. Cosine distance
I have used the cosine distance myself when calculating the similarity of pictures.
That's right: we are not studying geometry, so how does the cosine of an angle come into this? Don't worry. In geometry, the cosine of the angle measures the difference in direction between two vectors, and machine learning borrows this concept to measure the difference between sample vectors.
(1) The cosine of the angle between vector A(x1, y1) and vector B(x2, y2) in two dimensions:

cosθ = (x1·x2 + y1·y2) / (√(x1² + y1²) · √(x2² + y2²))

(2) Similarly, for two n-dimensional sample points a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n), the degree of similarity between them can be measured with the same notion of the cosine of the angle. To wit:

cosθ = (a · b) / (‖a‖ ‖b‖) = (Σₖ x1k·x2k) / (√(Σₖ x1k²) · √(Σₖ x2k²)), where k runs from 1 to n
The cosine of the angle is in the range [-1,1]. The larger the angle cosine is, the smaller the angle between the two vectors is, and the smaller the angle cosine is, the larger the angle between the two vectors is. When the directions of the two vectors coincide, the angle cosine takes the maximum value of 1, and when the directions of the two vectors are completely opposite, the angle cosine takes the minimum value of -1.
import numpy as np

# Cosine similarity (method 1): plain NumPy
def cosin_distance2(vector1, vector2):
    user_item_matric = np.vstack((vector1, vector2))
    sim = user_item_matric.dot(user_item_matric.T)
    norms = np.array([np.sqrt(np.diagonal(sim))])
    user_similarity = (sim / norms / norms.T)[0][1]
    return user_similarity

data = np.load("data/all_features.npy")
#sim = cosin_distance(data[22], data[828])
sim = cosin_distance2(data[22], data[828])
print(sim)

# Cosine similarity (method 2): sklearn's cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([1, 2, 8, 4, 6])
a1 = np.argsort(a)
user_tag_matric = np.vstack((a, a1))
user_similarity = cosine_similarity(user_tag_matric)
print(user_similarity[0][1])

# Cosine similarity (method 3): sklearn's pairwise_distances
from sklearn.metrics.pairwise import pairwise_distances

a = np.array([1, 2, 8, 4, 6])
a1 = np.argsort(a)
user_tag_matric = np.vstack((a, a1))
user_similarity = pairwise_distances(user_tag_matric, metric='cosine')
print(1 - user_similarity[0][1])
One point to note: with metric='cosine', pairwise_distances returns the cosine distance, i.e. 1 - (cosine similarity), which is why the last line above prints 1 - user_similarity[0][1].
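SciPy exposes the same quantity directly as scipy.spatial.distance.cosine, which likewise returns the cosine distance rather than the similarity. A minimal sketch (the second vector is made up for illustration):

import numpy as np
from scipy.spatial import distance

a = np.array([1, 2, 8, 4, 6])
b = np.array([2, 3, 1, 5, 8])  # hypothetical second vector for illustration

# distance.cosine returns the cosine distance, i.e. 1 - cosine similarity
print(1 - distance.cosine(a, b))  # recover the similarity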
2. Euclidean distance
The Euclidean distance is one of the easiest distance calculations to understand; it comes from the formula for the distance between two points in Euclidean space: d(a, b) = √(Σᵢ (aᵢ - bᵢ)²).
# 1) given two data points, calculate the Euclidean distance between them
import math

def get_distance(data1, data2):
    points = zip(data1, data2)
    diffs_squared_distance = [pow(a - b, 2) for (a, b) in points]
    return math.sqrt(sum(diffs_squared_distance))
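With NumPy the same distance is a one-liner via np.linalg.norm; a quick cross-check against get_distance above (the sample vectors are made up):

import numpy as np

a, b = [0, 3, 4, 5], [7, 6, 3, -1]  # sample vectors for illustration
print(get_distance(a, b))                         # pure-Python version above
print(np.linalg.norm(np.array(a) - np.array(b)))  # same value with NumPy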
3. Manhattan distance
You can guess from the name how this distance is calculated. Imagine you have to drive from one intersection to another in Manhattan. Is the driving distance the straight-line distance between the two points? Obviously not, unless you can drive through buildings. The actual driving distance is this "Manhattan distance", which is where the name comes from; the Manhattan distance is also known as the city block distance.
import numpy as np

def Manhattan(vec1, vec2):
    npvec1, npvec2 = np.array(vec1), np.array(vec2)
    return np.abs(npvec1 - npvec2).sum()
# Manhattan_Distance
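SciPy ships the same metric under the name cityblock; a quick cross-check (sample vectors made up for illustration):

from scipy.spatial import distance

a, b = [1, 2, 8, 4, 6], [2, 3, 1, 5, 8]  # sample vectors for illustration
print(Manhattan(a, b))           # version defined above
print(distance.cityblock(a, b))  # SciPy's equivalent metric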
4. Chebyshev distance
Ever played chess? The king can move to any of the 8 adjacent squares in one move. What is the minimum number of moves the king needs to go from square (x1, y1) to square (x2, y2)? Try it yourself: you will find that the minimum number of steps is always max(|x2 - x1|, |y2 - y1|). There is a similar distance measure called the Chebyshev distance.
import numpy as np

def Chebyshev(vec1, vec2):
    npvec1, npvec2 = np.array(vec1), np.array(vec2)
    return np.abs(npvec1 - npvec2).max()
# Chebyshev_Distance
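SciPy also provides this metric as chebyshev; a quick cross-check (sample vectors made up for illustration):

from scipy.spatial import distance

a, b = [1, 2, 8, 4, 6], [2, 3, 1, 5, 8]  # sample vectors for illustration
print(Chebyshev(a, b))           # version defined above
print(distance.chebyshev(a, b))  # SciPy's equivalent metric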
5. Minkowski distance
The Minkowski distance is not a single distance but a whole family of distances defined by a parameter p: p = 1 gives the Manhattan distance, p = 2 the Euclidean distance, and as p → ∞ it approaches the Chebyshev distance. A quick sanity check follows the code below.
#!/usr/bin/env python
from math import *
from decimal import Decimal

def nth_root(value, n_root):
    root_value = 1 / float(n_root)
    return round(Decimal(value) ** Decimal(root_value), 3)

def minkowski_distance(x, y, p_value):
    return nth_root(sum(pow(abs(a - b), p_value) for a, b in zip(x, y)), p_value)

print(minkowski_distance([0, 3, 4, 5], [7, 6, 3, -1], 3))
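The promised sanity check: for these vectors, p = 1 should reproduce the Manhattan distance (17) and p = 2 the Euclidean distance (√95 ≈ 9.747):

x, y = [0, 3, 4, 5], [7, 6, 3, -1]

print(minkowski_distance(x, y, 1))  # 17.000, the Manhattan distance
print(minkowski_distance(x, y, 2))  # 9.747, the Euclidean distance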
6. Standardized Euclidean distance
The standardized Euclidean distance is an improvement that addresses a shortcoming of the simple Euclidean distance: the components of the data often have very different distributions. The idea is to first "standardize" each component so that it has the same mean and variance (e.g. zero mean and unit variance), and only then compute the Euclidean distance.
import numpy as np

def Standardized_Euclidean(vec1, vec2, v):
    from scipy import spatial
    npvec = np.array([np.array(vec1), np.array(vec2)])
    return spatial.distance.pdist(npvec, 'seuclidean', V=v)
# Standardized Euclidean distance
# /jinzhichaoshuiping/article/details/51019473
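To make the definition concrete, the 'seuclidean' metric can be reproduced by hand: each squared component difference is divided by that component's variance before summing. A minimal sketch, assuming the variances come from some reference dataset (the numbers here are made up):

import numpy as np

def standardized_euclidean(vec1, vec2, variances):
    # variances: per-component variances estimated from a reference dataset
    v1, v2 = np.array(vec1), np.array(vec2)
    return np.sqrt((((v1 - v2) ** 2) / np.array(variances)).sum())

# hypothetical vectors and variances for illustration
print(standardized_euclidean([1, 2, 3], [4, 5, 6], [0.5, 1.0, 2.0]))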
7. Mahalanobis distance
import numpy as np
import math

def Mahalanobis(vec1, vec2):
    npvec1, npvec2 = np.array(vec1), np.array(vec2)
    npvec = np.array([npvec1, npvec2])
    sub = npvec.T[0] - npvec.T[1]
    inv_sub = np.linalg.inv(np.cov(npvec1, npvec2))
    return math.sqrt(np.dot(inv_sub, sub).dot(sub.T))
# MahalanobisDistance
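In practice the covariance matrix should be estimated from an entire dataset rather than from just the two vectors being compared. SciPy's mahalanobis takes the inverse covariance matrix explicitly; a sketch with made-up data:

import numpy as np
from scipy.spatial import distance

# hypothetical dataset: rows are observations, columns are features
data = np.array([[1.0, 2.0], [2.0, 1.5], [3.0, 3.5], [4.0, 3.0]])
VI = np.linalg.inv(np.cov(data.T))  # inverse covariance of the whole dataset

# Mahalanobis distance between the first two observations
print(distance.mahalanobis(data[0], data[1], VI))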
8. Edit Distance
def Edit_distance_str(str1, str2):
    import Levenshtein
    edit_distance_distance = Levenshtein.distance(str1, str2)
    similarity = 1 - (edit_distance_distance / max(len(str1), len(str2)))
    return {'Distance': edit_distance_distance, 'Similarity': similarity}
# Levenshtein distance
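The Levenshtein module is a third-party package (python-Levenshtein on PyPI). If it is not available, the same distance can be computed with the classic dynamic-programming recurrence; a minimal sketch:

def edit_distance(str1, str2):
    # dp[i][j] = edit distance between str1[:i] and str2[:j]
    m, n = len(str1), len(str2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all of str1[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all of str2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if str1[i - 1] == str2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

print(edit_distance('Shanghai Center', 'Shanghai Center Tower'))  # 6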
Except for the edit distance, which takes two strings, the distance functions above take two arrays of the same dimension as input.
The above is based on my personal experience. I hope it can give you a useful reference, and I hope you will continue to support me.