SoFunction
Updated on 2024-11-10

Text data similarity measures in Python

Edit Distance

The edit distance, also known as the Levenshtein distance, is the minimum number of insertions, deletions, and substitutions needed to convert one string into another. For example, converting 'dad' to 'bad' requires a single substitution, so the edit distance is 1.

The nltk.metrics.distance.edit_distance function implements the edit distance.

from nltk.metrics.distance import edit_distance

str1 = 'bad'
str2 = 'dad'
print(edit_distance(str1, str2))
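
To make the counting concrete, here is a minimal pure-Python sketch of the usual dynamic-programming approach to the Levenshtein distance. The function name levenshtein is made up for this illustration; this is not NLTK's own code, only one common way to compute the same quantity.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # previous[j] is the distance between the characters of a processed so far
    # (before the current one) and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning i characters into an empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


print(levenshtein('bad', 'dad'))  # 1
```

Only two rows of the table are kept at a time, so memory grows with the length of the second string rather than with the product of both lengths.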

N-gram similarity

An n-gram is a consecutive sequence of n tokens in the text, and the n-grams of a text are all such sequences. Concretely, n-grams look like this:

import nltk

# Show the bigrams (2-grams) here
text1 = 'Chief Executive Officer'

# Bigrams should also match at the beginning and end of the text, so pad_right and pad_left are used
ceo_bigrams = nltk.bigrams(text1.split(), pad_right=True, pad_left=True)

print(list(ceo_bigrams))
[(None, 'Chief'), ('Chief', 'Executive'), 
('Executive', 'Officer'), ('Officer', None)]
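
The padding behaviour shown in this output can be sketched without NLTK. The helper below is hypothetical (padded_ngrams is not an NLTK function) and assumes None as the padding symbol, matching the output above.

```python
def padded_ngrams(tokens, n, pad_symbol=None):
    """Return every run of n consecutive tokens, padding both ends of the text."""
    # n - 1 padding symbols on each side let the first and last tokens
    # appear in every position of an n-gram.
    padding = [pad_symbol] * (n - 1)
    padded = padding + list(tokens) + padding
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


print(padded_ngrams('Chief Executive Officer'.split(), 2))
# [(None, 'Chief'), ('Chief', 'Executive'), ('Executive', 'Officer'), ('Officer', None)]
```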

Computing bigram similarity

import nltk

# Count the bigrams that two texts have in common
def bigram_distance(text1, text2):
  # Bigrams should also match at the beginning and end of the text, so pad_right and pad_left are used.
  text1_bigrams = nltk.bigrams(text1.split(), pad_right=True, pad_left=True)
  
  text2_bigrams = nltk.bigrams(text2.split(), pad_right=True, pad_left=True)
  
  # The length of the intersection
  distance = len(set(text1_bigrams).intersection(set(text2_bigrams)))
  
  return distance


text1 = 'Chief Executive Officer is manager'

text2 = 'Chief Technology Officer is technology manager'

print(bigram_distance(text1, text2)) # The similarity is 3

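Jaccard similarity

The Jaccard similarity of two sets is the size of their intersection divided by the size of their union: |set 1 ∩ set 2| / |set 1 ∪ set 2|. The Jaccard distance is 1 minus this value, so 0 means the sets are identical and 1 means they share no elements.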

Implementation

from nltk.metrics.distance import jaccard_distance

# Here each text is represented as a set of single characters
set1 = set(['a','b','c','d','a'])
set2 = set(['a','b','e','g','a'])

print(jaccard_distance(set1, set2))

0.6666666666666666
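
The same number can be obtained directly from the set definition. The sketch below uses only built-in set operations; the function names are invented for the example, and treating two empty sets as identical is an assumption made here to avoid dividing by zero.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(union)


def jaccard_dist(a, b):
    """Jaccard distance: 1 minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


set1 = set(['a', 'b', 'c', 'd', 'a'])
set2 = set(['a', 'b', 'e', 'g', 'a'])

# The sets share 2 of 6 distinct elements, so the distance is about 0.667.
print(jaccard_dist(set1, set2))
```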

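MASI distance

The MASI distance is a weighted version of the Jaccard measure: the score is adjusted by a weight that depends on how the two sets overlap (identical, one contained in the other, or only partially overlapping). In the output below it comes out smaller than the Jaccard distance. The value depends on the NLTK version: the 0.22 shown here matches (1 - Jaccard similarity) x 0.33, while newer releases appear to compute 1 - Jaccard similarity x 0.33, which gives about 0.89 for the same sets.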

from nltk.metrics.distance import jaccard_distance, masi_distance

# Here each text is represented as a set of single characters
set1 = set(['a','b','c','d','a'])
set2 = set(['a','b','e','g','a'])

print(jaccard_distance(set1, set2))
print(masi_distance(set1, set2))

0.6666666666666666
0.22000000000000003
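
As a rough sketch of how the weighting works, the code below assumes the overlap weights 1, 0.67, 0.33, and 0 that are commonly used for MASI. It then shows two ways of combining the weight with the Jaccard value, because which one a given NLTK release uses is an assumption that should be checked against its source. All function names here are hypothetical.

```python
def masi_weight(a, b):
    """Weight based on how the two sets overlap (assumed values)."""
    shared = len(a & b)
    if a == b:
        return 1.0   # identical sets
    if shared == min(len(a), len(b)):
        return 0.67  # one set is contained in the other
    if shared > 0:
        return 0.33  # partial overlap, neither contains the other
    return 0.0       # no overlap at all


def masi_variants(a, b):
    """Return (weighted distance, 1 - weighted similarity) for comparison."""
    union = len(a | b)
    jaccard_sim = len(a & b) / union if union else 1.0
    m = masi_weight(a, b)
    weighted_distance = (1 - jaccard_sim) * m  # about 0.22 for the sets below
    weighted_similarity = 1 - jaccard_sim * m  # about 0.89 for the sets below
    return weighted_distance, weighted_similarity


set1 = set(['a', 'b', 'c', 'd', 'a'])
set2 = set(['a', 'b', 'e', 'g', 'a'])
print(masi_variants(set1, set2))
```

The first value reproduces the output printed above; the second is what the alternative combination would give for the same sets.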

Cosine similarity

NLTK provides an implementation of cosine distance. As an example, suppose we have the following word space:

word_space = ['w1', 'w2', 'w3', 'w4']

text1 = 'w1 w2 w1 w4 w1'
text2 = 'w1 w3 w2'

# Count how many times each word of word_space occurs in each text, in word_space order

text1_vector = [3,1,0,1]
text2_vector = [1,1,1,0]

[3,1,0,1] means that w1 occurs 3 times, w2 occurs 1 time, w3 occurs 0 times, and w4 occurs 1 time.
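
The vectors do not have to be written out by hand. A small sketch of the counting step, assuming whitespace tokenization and a hypothetical helper named count_vector:

```python
from collections import Counter


def count_vector(text, word_space):
    """Count how often each word of word_space occurs in text, in word_space order."""
    counts = Counter(text.split())
    # Counter returns 0 for words that never occur, so no special case is needed.
    return [counts[word] for word in word_space]


word_space = ['w1', 'w2', 'w3', 'w4']

print(count_vector('w1 w2 w1 w4 w1', word_space))  # [3, 1, 0, 1]
print(count_vector('w1 w3 w2', word_space))        # [1, 1, 1, 0]
```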

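The code below computes the cosine distance between text1 and text2. Note that cosine_distance returns 1 minus the cosine similarity, so smaller values mean the texts are more similar.

from nltk.cluster.util import cosine_distance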

text1_vector = [3,1,0,1]
text2_vector = [1,1,1,0]

print(cosine_distance(text1_vector,text2_vector))

0.303689376177
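
For reference, the same value can be reproduced with the standard library alone. This is a minimal sketch of the formula 1 - (u · v) / (|u| |v|), using a made-up function name; it assumes neither vector is all zeros, since that would cause a division by zero.

```python
import math


def cosine_dist(u, v):
    """1 minus the cosine of the angle between vectors u and v."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    # Assumes both norms are non-zero.
    return 1.0 - dot / (norm_u * norm_v)


text1_vector = [3, 1, 0, 1]
text2_vector = [1, 1, 1, 0]

# Dot product is 4 and the norms multiply to sqrt(33), giving about 0.3037.
print(cosine_dist(text1_vector, text2_vector))
```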
