Python string similarity
Use the difflib module to compare the similarity of two strings or pieces of text.
First import the difflib module
import difflib
Example:
Str = 'Shanghai Center Tower'
s1 = 'Mansion'
s2 = 'Shanghai Center'
s3 = 'Shanghai Center Building'
print(difflib.SequenceMatcher(None, Str, s1).quick_ratio())
print(difflib.SequenceMatcher(None, Str, s2).quick_ratio())
print(difflib.SequenceMatcher(None, Str, s3).quick_ratio())
# Output:
# 0.5
# 0.8
# 0.8333333333333334
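Note that quick_ratio() returns a fast upper bound on the exact similarity; if precise values matter, SequenceMatcher also provides ratio(). A minimal sketch of the difference:

import difflib

matcher = difflib.SequenceMatcher(None, 'Shanghai Center Tower', 'Shanghai Center')
print(matcher.ratio())        # exact similarity in [0, 1]
print(matcher.quick_ratio())  # cheaper upper bound on ratio()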
Python Similarity Evaluation
Distance" is often used when evaluating similarity:
1. Cosine distance
I have used the cosine distance myself when calculating the similarity of pictures.
That's right: we are not studying geometry, so how does the cosine of an angle come into this? Don't worry. In geometry, the cosine of the angle measures the difference in direction between two vectors, and machine learning borrows this concept to measure the difference between sample vectors.
(1) The cosine of the angle between vector A(x1, y1) and vector B(x2, y2) in two dimensions:

cosθ = (x1·x2 + y1·y2) / (√(x1² + y1²) · √(x2² + y2²))

(2) Similarly, for two n-dimensional sample points a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n), the degree of similarity between them can be measured with the same notion of the cosine of the angle. To wit:

cosθ = (a · b) / (‖a‖ ‖b‖) = (Σₖ x1k·x2k) / (√(Σₖ x1k²) · √(Σₖ x2k²)), where k runs from 1 to n
The cosine of the angle is in the range [-1,1]. The larger the angle cosine is, the smaller the angle between the two vectors is, and the smaller the angle cosine is, the larger the angle between the two vectors is. When the directions of the two vectors coincide, the angle cosine takes the maximum value of 1, and when the directions of the two vectors are completely opposite, the angle cosine takes the minimum value of -1.
import numpy as np

# Cosine similarity (method 1): plain NumPy
def cosin_distance2(vector1, vector2):
    user_item_matric = np.vstack((vector1, vector2))
    sim = user_item_matric.dot(user_item_matric.T)
    norms = np.array([np.sqrt(np.diagonal(sim))])
    user_similarity = (sim / norms / norms.T)[0][1]
    return user_similarity

data = np.load("data/all_features.npy")
#sim = cosin_distance(data[22], data[828])
sim = cosin_distance2(data[22], data[828])
print(sim)

# Cosine similarity (method 2): sklearn's cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([1, 2, 8, 4, 6])
a1 = np.argsort(a)
user_tag_matric = np.vstack((a, a1))
user_similarity = cosine_similarity(user_tag_matric)
print(user_similarity[0][1])

# Cosine similarity (method 3): sklearn's pairwise_distances
from sklearn.metrics.pairwise import pairwise_distances

a = np.array([1, 2, 8, 4, 6])
a1 = np.argsort(a)
user_tag_matric = np.vstack((a, a1))
user_similarity = pairwise_distances(user_tag_matric, metric='cosine')
print(1 - user_similarity[0][1])
One point to note: with metric='cosine', pairwise_distances returns the cosine distance, i.e. 1 - (cosine similarity), which is why the last line above prints 1 - user_similarity[0][1].
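SciPy exposes the same quantity directly as scipy.spatial.distance.cosine, which likewise returns the cosine distance rather than the similarity. A minimal sketch (the second vector is made up for illustration):

import numpy as np
from scipy.spatial import distance

a = np.array([1, 2, 8, 4, 6])
b = np.array([2, 3, 1, 5, 8])  # hypothetical second vector for illustration

# distance.cosine returns the cosine distance, i.e. 1 - cosine similarity
print(1 - distance.cosine(a, b))  # recover the similarity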
2. Euclidean distance
The Euclidean distance is one of the easiest distance calculations to understand; it comes from the formula for the distance between two points in Euclidean space: d(a, b) = √(Σᵢ (aᵢ - bᵢ)²).
# 1) given two data points, calculate the Euclidean distance between them
import math

def get_distance(data1, data2):
    points = zip(data1, data2)
    diffs_squared_distance = [pow(a - b, 2) for (a, b) in points]
    return math.sqrt(sum(diffs_squared_distance))
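With NumPy the same distance is a one-liner via np.linalg.norm; a quick cross-check against get_distance above (the sample vectors are made up):

import numpy as np

a, b = [0, 3, 4, 5], [7, 6, 3, -1]  # sample vectors for illustration
print(get_distance(a, b))                         # pure-Python version above
print(np.linalg.norm(np.array(a) - np.array(b)))  # same value with NumPy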
3. Manhattan distance
You can guess from the name how this distance is calculated. Imagine you have to drive from one intersection to another in Manhattan. Is the driving distance the straight-line distance between the two points? Obviously not, unless you can drive through buildings. The actual driving distance is this "Manhattan distance", which is where the name comes from; the Manhattan distance is also known as the city block distance.
import numpy as np

def Manhattan(vec1, vec2):
    npvec1, npvec2 = np.array(vec1), np.array(vec2)
    return np.abs(npvec1 - npvec2).sum()
# Manhattan_Distance
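SciPy ships the same metric under the name cityblock; a quick cross-check (sample vectors made up for illustration):

from scipy.spatial import distance

a, b = [1, 2, 8, 4, 6], [2, 3, 1, 5, 8]  # sample vectors for illustration
print(Manhattan(a, b))           # version defined above
print(distance.cityblock(a, b))  # SciPy's equivalent metric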
4. Chebyshev distance
Ever played chess? The king can move to any of the 8 adjacent squares in one move. What is the minimum number of moves the king needs to go from square (x1, y1) to square (x2, y2)? Try it yourself: you will find that the minimum number of steps is always max(|x2 - x1|, |y2 - y1|). There is a similar distance measure called the Chebyshev distance.
import numpy as np

def Chebyshev(vec1, vec2):
    npvec1, npvec2 = np.array(vec1), np.array(vec2)
    return np.abs(npvec1 - npvec2).max()
# Chebyshev_Distance
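SciPy also provides this metric as chebyshev; a quick cross-check (sample vectors made up for illustration):

from scipy.spatial import distance

a, b = [1, 2, 8, 4, 6], [2, 3, 1, 5, 8]  # sample vectors for illustration
print(Chebyshev(a, b))           # version defined above
print(distance.chebyshev(a, b))  # SciPy's equivalent metric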
5. Minkowski distance
The Minkowski distance is not a single distance but a whole family of distances defined by a parameter p: p = 1 gives the Manhattan distance, p = 2 the Euclidean distance, and as p → ∞ it approaches the Chebyshev distance. A quick sanity check follows the code below.
#!/usr/bin/env python
from math import *
from decimal import Decimal

def nth_root(value, n_root):
    root_value = 1 / float(n_root)
    return round(Decimal(value) ** Decimal(root_value), 3)

def minkowski_distance(x, y, p_value):
    return nth_root(sum(pow(abs(a - b), p_value) for a, b in zip(x, y)), p_value)

print(minkowski_distance([0, 3, 4, 5], [7, 6, 3, -1], 3))
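The promised sanity check: for these vectors, p = 1 should reproduce the Manhattan distance (17) and p = 2 the Euclidean distance (√95 ≈ 9.747):

x, y = [0, 3, 4, 5], [7, 6, 3, -1]

print(minkowski_distance(x, y, 1))  # 17.000, the Manhattan distance
print(minkowski_distance(x, y, 2))  # 9.747, the Euclidean distance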
6. Standardized Euclidean distance
The standardized Euclidean distance is an improvement that addresses a shortcoming of the simple Euclidean distance: the components of the data often have very different distributions. The idea is to first "standardize" each component so that it has the same mean and variance (e.g. zero mean and unit variance), and only then compute the Euclidean distance.
import numpy as np

def Standardized_Euclidean(vec1, vec2, v):
    from scipy import spatial
    npvec = np.array([np.array(vec1), np.array(vec2)])
    return spatial.distance.pdist(npvec, 'seuclidean', V=v)
# Standardized Euclidean distance
# /jinzhichaoshuiping/article/details/51019473
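To make the definition concrete, the 'seuclidean' metric can be reproduced by hand: each squared component difference is divided by that component's variance before summing. A minimal sketch, assuming the variances come from some reference dataset (the numbers here are made up):

import numpy as np

def standardized_euclidean(vec1, vec2, variances):
    # variances: per-component variances estimated from a reference dataset
    v1, v2 = np.array(vec1), np.array(vec2)
    return np.sqrt((((v1 - v2) ** 2) / np.array(variances)).sum())

# hypothetical vectors and variances for illustration
print(standardized_euclidean([1, 2, 3], [4, 5, 6], [0.5, 1.0, 2.0]))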
7. Mahalanobis distance
import numpy as np
import math

def Mahalanobis(vec1, vec2):
    npvec1, npvec2 = np.array(vec1), np.array(vec2)
    npvec = np.array([npvec1, npvec2])
    sub = npvec.T[0] - npvec.T[1]
    inv_sub = np.linalg.inv(np.cov(npvec1, npvec2))
    return math.sqrt(np.dot(inv_sub, sub).dot(sub.T))
# MahalanobisDistance
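In practice the covariance matrix should be estimated from an entire dataset rather than from just the two vectors being compared. SciPy's mahalanobis takes the inverse covariance matrix explicitly; a sketch with made-up data:

import numpy as np
from scipy.spatial import distance

# hypothetical dataset: rows are observations, columns are features
data = np.array([[1.0, 2.0], [2.0, 1.5], [3.0, 3.5], [4.0, 3.0]])
VI = np.linalg.inv(np.cov(data.T))  # inverse covariance of the whole dataset

# Mahalanobis distance between the first two observations
print(distance.mahalanobis(data[0], data[1], VI))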
8. Edit Distance
def Edit_distance_str(str1, str2):
    import Levenshtein
    edit_distance_distance = Levenshtein.distance(str1, str2)
    similarity = 1 - (edit_distance_distance / max(len(str1), len(str2)))
    return {'Distance': edit_distance_distance, 'Similarity': similarity}
# Levenshtein distance
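The Levenshtein module is a third-party package (python-Levenshtein on PyPI). If it is not available, the same distance can be computed with the classic dynamic-programming recurrence; a minimal sketch:

def edit_distance(str1, str2):
    # dp[i][j] = edit distance between str1[:i] and str2[:j]
    m, n = len(str1), len(str2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all of str1[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all of str2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if str1[i - 1] == str2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

print(edit_distance('Shanghai Center', 'Shanghai Center Tower'))  # 6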
Except for the edit distance, which takes two strings, the distance functions above take two arrays of the same dimension as input.
The above is based on my personal experience. I hope it can give you a useful reference, and I hope you will continue to support me.