Edit Distance
The edit distance, also known as the Levenshtein distance, is the minimum number of single-character insertions, deletions, and substitutions needed to convert one string into another. For example, converting 'dad' to 'bad' requires one substitution, so the edit distance is 1.
NLTK implements the edit distance as the edit_distance function in nltk.metrics.distance.
from nltk.metrics.distance import edit_distance

str1 = 'bad'
str2 = 'dad'
print(edit_distance(str1, str2))  # prints 1
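To make the definition concrete, here is a minimal sketch of the classic dynamic-programming approach to the Levenshtein distance. The function name `levenshtein` is hypothetical and chosen for illustration; it is not NLTK's implementation, only a plain-Python version of the same idea.

```python
def levenshtein(source: str, target: str) -> int:
    """Return the minimum number of single-character edits between two strings."""
    # previous[j] is the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        # Turning source[:i] into the empty string takes i deletions.
        current = [i]
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,                      # delete source_char
                current[j - 1] + 1,                   # insert target_char
                previous[j - 1] + substitution_cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(levenshtein('bad', 'dad'))         # 1
print(levenshtein('kitten', 'sitting'))  # 3
```

Only two rows of the table are kept at any time, so memory use grows with the length of the target string rather than with the product of both lengths.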
N-gram similarity
An n-gram representation of a text is the list of all consecutive sequences of n tokens in that text. For example, the bigrams (2-grams) of a short phrase look like this:
import nltk

# Show the bigrams (2-grams) of a phrase
text1 = 'Chief Executive Officer'
# Pad both ends with pad_right and pad_left so that the first and last
# words also take part in matching
ceo_bigrams = nltk.bigrams(text1.split(), pad_right=True, pad_left=True)
print(list(ceo_bigrams))

[(None, 'Chief'), ('Chief', 'Executive'), ('Executive', 'Officer'), ('Officer', None)]
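The same sliding-window idea works for any n. The sketch below is a plain-Python illustration, not NLTK's code; the helper name `padded_ngrams` and its `pad_symbol` parameter are assumptions made for this example. It pads each end with n - 1 symbols, which for n = 2 gives the same bigrams as the output above.

```python
def padded_ngrams(tokens, n, pad_symbol=None):
    """Return all consecutive n-token windows, padding both ends with pad_symbol."""
    padding = [pad_symbol] * (n - 1)
    padded = padding + list(tokens) + padding
    # Slide a window of width n over the padded token list.
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


tokens = 'Chief Executive Officer'.split()
print(padded_ngrams(tokens, 2))  # bigrams, including the padded first and last pairs
print(padded_ngrams(tokens, 3))  # trigrams, padded with two symbols on each side
```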
Bigram similarity computation
import nltk

# Count the bigrams (2-grams) that two texts share
def bigram_distance(text1, text2):
    # Pad both ends with pad_right and pad_left so that the first and last
    # words also take part in matching
    text1_bigrams = nltk.bigrams(text1.split(), pad_right=True, pad_left=True)
    text2_bigrams = nltk.bigrams(text2.split(), pad_right=True, pad_left=True)
    # The size of the intersection; despite the function name, a larger
    # value means the texts are more similar
    distance = len(set(text1_bigrams).intersection(set(text2_bigrams)))
    return distance

text1 = 'Chief Executive Officer is manager'
text2 = 'Chief Technology Officer is technology manager'
print(bigram_distance(text1, text2))  # The similarity is 3
Jaccard similarity
The Jaccard similarity of two sets is the size of their intersection divided by the size of their union, |A ∩ B| / |A ∪ B|. The Jaccard distance, which is what NLTK's jaccard_distance returns, is one minus that value.
Implementation
from nltk.metrics.distance import jaccard_distance

# Here each text is represented as a set of single characters
set1 = set(['a', 'b', 'c', 'd', 'a'])
set2 = set(['a', 'b', 'e', 'g', 'a'])
print(jaccard_distance(set1, set2))
0.6666666666666666
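The result can be checked by hand: the two sets share 2 elements ('a' and 'b') out of 6 distinct elements in total, so the similarity is 2/6 and the distance is 4/6. The sketch below spells this out in plain Python. The function names are hypothetical, and treating two empty sets as identical is an assumption of this example.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(union)


def jaccard_dist(a: set, b: set) -> float:
    """Share of the union's elements that the two sets do not have in common."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are at distance 0
    return (len(union) - len(a & b)) / len(union)


set1 = set(['a', 'b', 'c', 'd', 'a'])
set2 = set(['a', 'b', 'e', 'g', 'a'])
print(jaccard_similarity(set1, set2))  # 2 shared elements out of 6, about 0.333
print(jaccard_dist(set1, set2))        # 4 unshared elements out of 6, about 0.667
```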
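MASI distance
The MASI (Measuring Agreement on Set-valued Items) distance is a weighted version of the Jaccard measure: the score is adjusted by a factor that depends on how the two sets overlap (identical, one a subset of the other, partially overlapping, or disjoint). In the output below, produced with the NLTK version used for this article, the MASI value for two partially overlapping sets is smaller than their Jaccard distance. Other NLTK versions may apply the weighting differently and print a different value.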
from nltk.metrics.distance import jaccard_distance, masi_distance

# Here each text is represented as a set of single characters
set1 = set(['a', 'b', 'c', 'd', 'a'])
set2 = set(['a', 'b', 'e', 'g', 'a'])
print(jaccard_distance(set1, set2))
print(masi_distance(set1, set2))
0.6666666666666666
0.22000000000000003
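For comparison, here is a sketch of MASI as it is commonly defined: the Jaccard similarity multiplied by a monotonicity weight of 1 for identical sets, 2/3 when one set is a subset of the other, 1/3 for partial overlap, and 0 for disjoint sets, with the distance taken as one minus that product. This is an assumption about the definition, not NLTK's code, and the function name is hypothetical. Under this definition the example sets come out at about 0.89 rather than the 0.22 shown above, which suggests the printed value came from a different formula in the NLTK version that was used.

```python
def masi_distance_sketch(a: set, b: set) -> float:
    """One minus (Jaccard similarity times a monotonicity weight)."""
    intersection = a & b
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are at distance 0
    if a == b:
        weight = 1.0      # identical sets
    elif a <= b or b <= a:
        weight = 2 / 3    # one set is a subset of the other
    elif intersection:
        weight = 1 / 3    # partial overlap, neither set contains the other
    else:
        weight = 0.0      # disjoint sets
    return 1.0 - (len(intersection) / len(union)) * weight


set1 = set(['a', 'b', 'c', 'd', 'a'])
set2 = set(['a', 'b', 'e', 'g', 'a'])
print(masi_distance_sketch(set1, set2))  # 1 - (2/6) * (1/3), about 0.889
```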
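Cosine similarity
NLTK also provides cosine_distance, which returns one minus the cosine similarity of two vectors. To use it, each text must first be turned into a vector. Suppose we have the following word space and two texts: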
word_space = ['w1', 'w2', 'w3', 'w4']
text1 = 'w1 w2 w1 w4 w1'
text2 = 'w1 w3 w2'
# Count how often each word of word_space occurs in the text,
# in word_space order
text1_vector = [3, 1, 0, 1]
text2_vector = [1, 1, 1, 0]
[3,1,0,1] means that w1 occurs 3 times, w2 occurs 1 time, w3 occurs 0 times, and w4 occurs 1 time.
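The counting step can be automated. The sketch below uses `collections.Counter` from the standard library and splits on whitespace. The helper name `count_vector` is hypothetical, and simple whitespace tokenization is an assumption that fits this toy example.

```python
from collections import Counter


def count_vector(text, word_space):
    """Count how often each word of word_space occurs in text, in word_space order."""
    counts = Counter(text.split())
    # A Counter returns 0 for words that never occur, so no special case is needed.
    return [counts[word] for word in word_space]


word_space = ['w1', 'w2', 'w3', 'w4']
print(count_vector('w1 w2 w1 w4 w1', word_space))  # [3, 1, 0, 1]
print(count_vector('w1 w3 w2', word_space))        # [1, 1, 1, 0]
```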
The code below computes the cosine distance between text1 and text2:
from nltk.cluster.util import cosine_distance

text1_vector = [3, 1, 0, 1]
text2_vector = [1, 1, 1, 0]
print(cosine_distance(text1_vector, text2_vector))
0.303689376177
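The value can be reproduced by hand. The dot product of the two vectors is 3·1 + 1·1 + 0·1 + 1·0 = 4, their lengths are √11 and √3, so the cosine similarity is 4/√33 ≈ 0.696 and the distance is about 0.304. A minimal plain-Python sketch, with a hypothetical function name and using only the standard library:

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError('cosine similarity is undefined for a zero vector')
    return dot / (norm_u * norm_v)


text1_vector = [3, 1, 0, 1]
text2_vector = [1, 1, 1, 0]
similarity = cosine_similarity(text1_vector, text2_vector)
print(round(similarity, 4))      # 0.6963
print(round(1 - similarity, 4))  # 0.3037, the cosine distance
```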