Utility Mill

Text_Similarity

See how similar two pieces of text are



Instructions / Discussion

This utility returns a numeric score between 0 and 1, where 0 means least similar and 1 means identical. As a rule of thumb, a value over 0.6 means the sequences are close matches (at least for the seqmatch method).

How the Compare Methods Work

Seq Match

Python's difflib documentation describes the formula for this score as 2.0*M / T where T is the total number of elements in both sequences, and M is the number of matches.
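As a minimal sketch of that formula, the snippet below computes the score with difflib's SequenceMatcher and then recomputes 2.0*M / T by hand from the matching blocks. The function names seqmatch and seqmatch_by_hand are illustrative only and are not taken from this utility's source.

```python
from difflib import SequenceMatcher


def seqmatch(a, b):
    """Similarity score as reported by difflib (2.0*M / T)."""
    return SequenceMatcher(None, a, b).ratio()


def seqmatch_by_hand(a, b):
    """Recompute 2.0*M / T from the matching blocks difflib finds."""
    matcher = SequenceMatcher(None, a, b)
    # M: total number of elements covered by matching blocks.
    matches = sum(block.size for block in matcher.get_matching_blocks())
    # T: total number of elements in both sequences.
    total = len(a) + len(b)
    # difflib treats two empty sequences as identical.
    return 2.0 * matches / total if total else 1.0


print(seqmatch("abcd", "bcde"))          # 0.75
print(seqmatch_by_hand("abcd", "bcde"))  # 0.75
```

Here 'abcd' and 'bcde' share the block 'bcd', so M = 3 and T = 8, which gives 2.0*3 / 8 = 0.75.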

Jaro

As described in 'An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 U.S. Decennial Census' by William E. Winkler and Yves Thibaudeau.

Winkler

Based on the 'jaro' string comparator, but adjusts the score according to whether the first few characters of the two strings are the same.
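The sketch below shows the commonly published form of the Jaro comparator and the Winkler prefix adjustment. It is not copied from this utility's code, which may differ in detail. The names jaro and winkler, the prefix weight of 0.1, and the four-character prefix limit are assumptions based on the usual textbook formulation.

```python
def jaro(a, b):
    """Jaro similarity in the commonly published form (a sketch)."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters in order; out-of-order pairs are transpositions.
    a_seq = [c for c, m in zip(a, a_matched) if m]
    b_seq = [c for c, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a)
            + matches / len(b)
            + (matches - transpositions) / matches) / 3.0


def winkler(a, b, prefix_weight=0.1, max_prefix=4):
    """Jaro score boosted for a shared prefix (assumed defaults)."""
    score = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:max_prefix], b[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1.0 - score)


print(round(jaro("MARTHA", "MARHTA"), 4))     # 0.9444
print(round(winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

For 'MARTHA' and 'MARHTA' all six characters match with one transposition, so the Jaro score is about 0.944. The shared prefix 'MAR' raises the Winkler score to about 0.961.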

Bigram

Bigrams are two-character sub-strings contained in a string. For example, 'peter' contains the bigrams: pe,et,te,er. This routine counts the number of common bigrams and divides by the average number of bigrams. The resulting number is returned.
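A minimal sketch of the bigram score described above follows. The function names are illustrative. Two details are assumptions, since the description does not state them: repeated bigrams are counted with multiplicity, and a string with fewer than two characters scores 0.

```python
from collections import Counter


def bigrams(s):
    """Return the two-character sub-strings of s, in order."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def bigram_similarity(a, b):
    """Common bigrams divided by the average number of bigrams."""
    bg_a, bg_b = bigrams(a), bigrams(b)
    if not bg_a or not bg_b:
        # Assumption: strings too short to contain a bigram score 0.
        return 0.0
    # Counter intersection keeps the smaller count of each shared bigram.
    common = sum((Counter(bg_a) & Counter(bg_b)).values())
    average = (len(bg_a) + len(bg_b)) / 2.0
    return common / average


print(bigrams("peter"))                    # ['pe', 'et', 'te', 'er']
print(bigram_similarity("peter", "pete"))  # 3 common / 3.5 average
```

'peter' has four bigrams and 'pete' has three, all of which are shared, so the score is 3 / 3.5, or roughly 0.857.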

Edit Distance

The edit distance is the minimal number of insertions, deletions and substitutions needed to make two strings equal.
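The definition above can be sketched with the standard dynamic-programming approach, keeping only one previous row of the table. The helper edit_similarity is an assumption: the text does not say how the raw distance is turned into a score between 0 and 1, so dividing by the longer string's length is just one plausible normalisation.

```python
def edit_distance(a, b):
    """Minimal number of insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(a, b):
    """Hypothetical normalisation of the distance to the range 0..1."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


print(edit_distance("kitten", "sitting"))  # 3
```

Turning 'kitten' into 'sitting' takes two substitutions and one insertion, giving a distance of 3.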

The functions in this utility come from the Pyfdupes project.

Utility Mill is another wonderful Blended Technologies project.

copyright, owned and operated by Blended Technologies LLC.