A challenging problem in analyzing text-based datasets is that a same record may be represented in multiple ways throughout the dataset. Variations can result from inconsistent naming conventions, spelling mistakes, etc. which makes it difficult to compare text data using ‘VLOOKUP’ & ‘MATCH’ functions. To match text datasets we can use something called Fuzzy Logic.
What is Fuzzy Logic?
Fuzzy logic is an approach to computing based on “degrees of truth” rather than the usual “true or false” (1 or 0) Boolean logic on which the modern computer is based.
Fuzzy logic includes 0 and 1 as extreme cases of truth (or “the state of matters” or “fact”) but also includes the various states of truth in between so that, for example, the result of a comparison between two things could be not “tall” or “short” but “.38 of tallness.”
Fuzzy Logic in Excel
Analytics software like Idea, ACL have inbuilt Fuzzy lookup logic. For Excel Microsoft has provided a free add-in called Fuzzy Lookup Add-in for Excel.
The Fuzzy Lookup add-in provided by Microsoft works on the principle of Jaccard Similarity Index. The Jaccard index also known as Jaccard similarity coefficient was developed by Paul Jaccard.
In Jaccard Similarity index, you take two sample sets. You count the attributes shared by both samples (call this Z). You count the attributes observed in only one sample (call this X). You then count the attributes observed in only the other sample (call this Y). Finally, you can calculate the similarity index by using the three counts in this operation.
Jaccard Similarity Index = Z / (Z+X+Y)
Fuzzy lookup add-in can be used to match records in two different data sets or can be used to find duplicates within the same dataset.
Using the Fuzzy Lookup add-in:
- Convert your data in to a table (Alt+N+T or Ctrl+T or Insert>Table)
- Go in a new sheet as Fuzzy Lookup add-in overwrites the data
- Click the Fuzzy Lookup button on ribbon
In the Fuzzy Lookup dialogue box:
- Select the two tables you want to match
- Select the columns that holds the data you want to match
- Click the join button to drop your two columns down to the Match Columns section
- Select all the columns you want included in your output. Be sure to include the FuzzyLookupSimilarity column
- Define how many matches you want per record. If you keep this to 1, the output will only show exact duplicates with Similarity Index = 1. Keep it around 10
- Define the Similarity Threshold. This is the Jaccard Index number. Any match with an index of less than 0.70 is typically not that accurate
- Press the Go button to get the result
Fuzzy Lookup also has algorithms and built-in dictionaries that allow for the automatic correction of spelling mistakes, word merge scenarios, string split scenarios, and string prefix corrections. All of these automatic transformation can be turned on and off via the Fuzzy Lookup Configure dialog box. You can add custom Transformation rules by pointing the TransformationRowsetName property to your own Excel Table of rules.
You can download the sample file, to practice Fuzzy Lookup in Excel.
Also you can see this blog post by datapig Text Match & Fuzzy Lookup for more detailed information on Jaccard Similarity index and Fuzzy Lookup add-in.