Skip to the content.

Understanding Similarity Measures for Text Analysis

Check the article here: Understanding Similarity Measures for Text Analysis

The aim of this repository is analyze how similar two words or terms are to each other, Distance metrics are the most common techniques to use for mesure similarities between words, there is an area called Document similarity that is responsible for searching similarities between sentences or paragraphs of text, this document only focuses on individual words or terms. It is worth mentioning that these techniques could be used for spell-checking, autocomplete sentences or correct words or senteces.

image

Distance Metric of similarity

There are a lot of Distance metrics that you can use to compute and mesure similarities between entities of text. The measure d is called distance metric of similarity if and only if satisfies the next coditions:

all similarity distance measures of similarity are not distance metrics of similarity.

Feature engineering

Before start the computation of the distance metrics to check how close two words are, we need to convert the text to numeric values by mapping each letter to a unique numeric value.

Let’s imagine the following group of words:

words = ['Similar', 'similares', 'iguales', 'parecidas', 'similaridad', 'isimral', 'similat', 'somiral']

Converting to a vector of numerical values using Character vectorization

    def __vectorize(self):

        vectors = [np.array(list(w.lower())) for w in self.words]
        vectors = [np.array([ord(c) for c in w]) for w in vectors]
        return vectors

image

Converting to a vector of numerical values using Bag of character vectorization:

    def __bag_characters_vectorize(self):

        u = np.unique(np.hstack([list(w.lower()) for w in self.words ]))
        w_c = [{ c1: c2 for c1, c2 in np.stack(np.unique(list(w.lower()), return_counts = True), axis= 1)}
               for w in self.words]
        bag = [np.array([int(wt.get(char, 0)) for char in u]) for wt in w_c]

        return list(u), bag

image

Hamming Distance

It is the distance measured between two words assuming both has the same length or it is defined as the number of positions that have symbols between two words of equal length.

image

    def __hamming_distance(self):

        if self.u.shape != self.v.shape:
            return 'Incorrect lengths'
        elif self.norm:
            return (self.u!=self.v).mean()            
        else:
            return (self.u!=self.v).sum()

image

Manhattan Distance

Is similar to the Hamming distance, without counting the number of characters not matching and subtracting the difference between each pair of characters at each position of the two words.

image

    def __manhattan_distance(self):

        if self.u.shape != self.v.shape:
            return 'Incorrect lengths'
        elif self.norm:
            return abs(self.u - self.v).mean()            
        else:
            return abs(self.u - self.v).sum()

image

Euclidean Distance

It is the shortest straight path distance between two points.

image

    def __euclidean_distance(self):

        if self.u.shape != self.v.shape:
            return 'Incorrect lengths'
        return np.sqrt(np.sum(np.square(self.u - self.v)))

image

Levenshtein Distance or Wagner-Fischer algorithm

It is somewhat similar to the Hamming distance, is used to measure the distance between two sequence of words based on their differences. In Short, is the minimum number of editions needed (deletions, substitutions or aditions) to convert one word to the other, the length of both words need not be equal.

image

    def __levenshtein_distance(self):

        u = self.u.lower()
        v = self.v.lower()

        if u == v: return 0
        elif len(u) == 0: return len(v)
        elif len(v) == 0: return len(u)

        matrix = []

        d_u = [0] * (len(v) + 1)
        d_v = [0] * (len(v) + 1)

        for i in range(len(d_u)):
            d_u[i] = i
        
        for i in range(len(u)):
            d_v[0] = i + 1
            for j in range(len(v)):
                cost = 0 if u[i] == v[j] else 1
                d_v[j + 1] =  min(d_v[j] + 1, d_u[j + 1] + 1, d_u[j] + cost)
            for j in range(len(d_u)):
                d_u[j] = d_v[j]
            matrix.append(copy.copy(d_v))
        distance = d_v[len(v)]
        matrix = np.array(matrix)
        matrix = matrix.T
        matrix = matrix[1:,]
        matrix = pd.DataFrame(data = matrix,
                               index= list(v),
                               columns= list(u))

image

image

Cosine Distance and Cosine similarity

Remember that:

The similarity between two words decreases when the distance between its vectors increases The similarity between two words increases when the distance between its vectors decreases.

If we have two words represented as a vectors using its numerical representation(Bag of character vectorization), the cosine similarity will give us the measure of the cosine of the angle between the vectors.

image


    def __cosine_distance_similarity(self):

        distance = 1.0 - (np.dot(self.u, self.v)/
                          (np.sqrt(sum(np.square(self.u))) * 
                           np.sqrt(sum(np.square(self.v))))
                          )
        return distance

image

Summary

image

Contributing and Feedback

Any ideas or feedback about this repository?. Help me to improve it.

Authors

License

This project is licensed under the terms of the Apache License.