This utility returns a numeric score between 0 and 1, where 0 means least similar and 1 means identical. As a rule of thumb, a value over 0.6 means the sequences are close matches (at least for the seqmatch method).
How the Compare Methods Work
Seq Match
Python's difflib documentation describes the formula for this score as 2.0*M / T where T is the total number of elements in both sequences, and M is the number of matches.
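The formula can be tried directly with Python's standard difflib module. This is a minimal sketch: the wrapper name seqmatch_score is hypothetical, and it assumes no junk filtering or other preprocessing, since none is described here.

```python
from difflib import SequenceMatcher


def seqmatch_score(a, b):
    """Return difflib's ratio, 2.0*M / T, for two sequences.

    Hypothetical wrapper name; passing None as the first argument
    means no elements are treated as junk (an assumption).
    """
    return SequenceMatcher(None, a, b).ratio()


# "abcd" and "bcde" share the block "bcd": M = 3, T = 8, so 2.0*3/8 = 0.75
print(seqmatch_score("abcd", "bcde"))
```

Identical inputs give 1.0, and inputs with nothing in common give 0.0.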
Jaro
As described in 'An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 U.S. Decennial Census' by William E. Winkler and Yves Thibaudeau.
Winkler
Based on the 'jaro' string comparator, but modifies it according to whether the first few characters are the same or not.
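To make the relationship between the two comparators concrete, here is a sketch of the commonly published textbook formulation of Jaro and of the Winkler prefix adjustment. It is an assumption that this matches the utility's behaviour: the function names, the prefix scale of 0.1 and the prefix cap of 4 characters are the usual textbook defaults, not values taken from this page, and the comparator in the cited paper includes further adjustments that are not reproduced here.

```python
def jaro(a, b):
    """Textbook Jaro similarity (assumed formulation, see lead-in)."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # half a transposition each.
    a_seq = [a[i] for i in range(len(a)) if a_matched[i]]
    b_seq = [b[j] for j in range(len(b)) if b_matched[j]]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a)
            + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a, b, prefix_scale=0.1, max_prefix=4):
    """Boost the Jaro score when the first few characters agree.

    prefix_scale and max_prefix are assumed textbook defaults.
    """
    score = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:max_prefix], b[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


print(round(jaro("MARTHA", "MARHTA"), 4))          # 0.9444
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

In this formulation the shared prefix 'MAR' is what lifts the second score above the first; strings that differ in their first character get no boost at all.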
Bigram
Bigrams are two-character sub-strings contained in a string. For example, 'peter' contains the bigrams: pe,et,te,er. This routine counts the number of common bigrams and divides by the average number of bigrams. The resulting number is returned.
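The bigram count can be sketched as follows. The function names are hypothetical, and two details are assumptions because the description does not specify them: repeated bigrams are counted as a multiset, and strings too short to contain any bigram score 0.

```python
from collections import Counter


def bigrams(s):
    """Return the list of two-character substrings of s, in order."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def bigram_score(a, b):
    """Common bigrams divided by the average number of bigrams."""
    a_grams = bigrams(a)
    b_grams = bigrams(b)
    if not a_grams or not b_grams:
        return 0.0  # assumption: no bigrams means no similarity
    # Multiset intersection: a repeated bigram counts once per shared copy.
    common = sum((Counter(a_grams) & Counter(b_grams)).values())
    average = (len(a_grams) + len(b_grams)) / 2
    return common / average


print(bigrams("peter"))  # ['pe', 'et', 'te', 'er']
print(bigram_score("peter", "pete"))
```

Here 'peter' and 'pete' share three bigrams (pe, et, te) out of an average of 3.5, so the score is 3 / 3.5.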
Edit Distance
The edit distance is the minimal number of insertions, deletions and substitutions needed to make two strings equal.
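A minimal sketch of this definition using the standard dynamic-programming approach, with each insertion, deletion and substitution costing 1. The function names are hypothetical. How the utility turns the distance into a score between 0 and 1 is not stated, so the normalization in edit_similarity (dividing by the longer string's length) is purely an assumption for illustration.

```python
def edit_distance(a, b):
    """Minimal number of insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Assumed normalization to the 0..1 range (not from the source)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(a, b) / longest


print(edit_distance("kitten", "sitting"))  # 3
```

Turning 'kitten' into 'sitting' takes two substitutions and one insertion, hence a distance of 3.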
The functions in this utility come from the Pyfdupes project.
copyright, owned and operated by Blended Technologies LLC.