Understanding Similarity Measures for Text Analysis
Check the article here: Understanding Similarity Measures for Text Analysis
The aim of this repository is analyze how similar two words or terms are to each other, Distance metrics are the most common techniques to use for mesure similarities between words, there is an area called Document similarity that is responsible for searching similarities between sentences or paragraphs of text, this document only focuses on individual words or terms. It is worth mentioning that these techniques could be used for spell-checking, autocomplete sentences or correct words or senteces.
Distance Metric of similarity
There are a lot of Distance metrics that you can use to compute and mesure similarities between entities of text. The measure d is called distance metric of similarity if and only if satisfies the next coditions:
- The distance between any two words say X and Y, must be always non-negative
- The distance between two words should always be zero if and only if they are identical
- The distance measure should always be symmetric, the distance between X and Y muts be equals to the distance between Y and X
- The distance measure should satisfy the triangle inequality property
- Due to the conditions mentioned above, the Kullback-leibler is a distance measure which violates the third condition and it can not be used for measure similarity between words.
all similarity distance measures of similarity are not distance metrics of similarity.
Feature engineering
Before start the computation of the distance metrics to check how close two words are, we need to convert the text to numeric values by mapping each letter to a unique numeric value.
Let’s imagine the following group of words:
words = ['Similar', 'similares', 'iguales', 'parecidas', 'similaridad', 'isimral', 'similat', 'somiral']
Converting to a vector of numerical values using Character vectorization
def __vectorize(self):
vectors = [np.array(list(w.lower())) for w in self.words]
vectors = [np.array([ord(c) for c in w]) for w in vectors]
return vectors
Converting to a vector of numerical values using Bag of character vectorization:
def __bag_characters_vectorize(self):
u = np.unique(np.hstack([list(w.lower()) for w in self.words ]))
w_c = [{ c1: c2 for c1, c2 in np.stack(np.unique(list(w.lower()), return_counts = True), axis= 1)}
for w in self.words]
bag = [np.array([int(wt.get(char, 0)) for char in u]) for wt in w_c]
return list(u), bag
Hamming Distance
It is the distance measured between two words assuming both has the same length or it is defined as the number of positions that have symbols between two words of equal length.
def __hamming_distance(self):
if self.u.shape != self.v.shape:
return 'Incorrect lengths'
elif self.norm:
return (self.u!=self.v).mean()
else:
return (self.u!=self.v).sum()
Manhattan Distance
Is similar to the Hamming distance, without counting the number of characters not matching and subtracting the difference between each pair of characters at each position of the two words.
def __manhattan_distance(self):
if self.u.shape != self.v.shape:
return 'Incorrect lengths'
elif self.norm:
return abs(self.u - self.v).mean()
else:
return abs(self.u - self.v).sum()
Euclidean Distance
It is the shortest straight path distance between two points.
def __euclidean_distance(self):
if self.u.shape != self.v.shape:
return 'Incorrect lengths'
return np.sqrt(np.sum(np.square(self.u - self.v)))
Levenshtein Distance or Wagner-Fischer algorithm
It is somewhat similar to the Hamming distance, is used to measure the distance between two sequence of words based on their differences. In Short, is the minimum number of editions needed (deletions, substitutions or aditions) to convert one word to the other, the length of both words need not be equal.
def __levenshtein_distance(self):
u = self.u.lower()
v = self.v.lower()
if u == v: return 0
elif len(u) == 0: return len(v)
elif len(v) == 0: return len(u)
matrix = []
d_u = [0] * (len(v) + 1)
d_v = [0] * (len(v) + 1)
for i in range(len(d_u)):
d_u[i] = i
for i in range(len(u)):
d_v[0] = i + 1
for j in range(len(v)):
cost = 0 if u[i] == v[j] else 1
d_v[j + 1] = min(d_v[j] + 1, d_u[j + 1] + 1, d_u[j] + cost)
for j in range(len(d_u)):
d_u[j] = d_v[j]
matrix.append(copy.copy(d_v))
distance = d_v[len(v)]
matrix = np.array(matrix)
matrix = matrix.T
matrix = matrix[1:,]
matrix = pd.DataFrame(data = matrix,
index= list(v),
columns= list(u))
Cosine Distance and Cosine similarity
Remember that:
The similarity between two words decreases when the distance between its vectors increases The similarity between two words increases when the distance between its vectors decreases.
If we have two words represented as a vectors using its numerical representation(Bag of character vectorization), the cosine similarity will give us the measure of the cosine of the angle between the vectors.
- Angle close to 0 means that the cosine similarity is close to 1, both words are very similar.
- Angle close to 90 means that the cosine similarity is close to 0, both words are not similar.
- Angle close to 180 means that cosine similarity is close to -1, both words unrelated.
def __cosine_distance_similarity(self):
distance = 1.0 - (np.dot(self.u, self.v)/
(np.sqrt(sum(np.square(self.u))) *
np.sqrt(sum(np.square(self.v))))
)
return distance
Summary
- Green is the first option, Yellow is the second option and Orange is the third option
- Hamming, agree with the first option, agree with the second option
- Manhattan, agree with the first option, not agree with the second and third option
- Euclidean, agree with the first option , not agree with the second option
- Levenshtein, agree with the first option and agree with the second option
- Cosine, not agree with the first option
Contributing and Feedback
Any ideas or feedback about this repository?. Help me to improve it.
Authors
- Created by Ramses Alexander Coraspe Valdez
- Created on 2021
License
This project is licensed under the terms of the Apache License.