Fuzzy Logic: Analyzing Text Datasets

A challenging problem in analyzing text-based datasets is that a same record may be represented in multiple ways throughout the dataset. Variations can result from inconsistent naming conventions, spelling mistakes, etc. which makes it difficult to compare text data using ‘VLOOKUP’ & ‘MATCH’ functions. To match text datasets we can use something called Fuzzy Logic.

What is Fuzzy Logic?

Fuzzy logic is an approach to computing based on “degrees of truth” rather than the usual “true or false” (1 or 0) Boolean logic on which the modern computer is based.

Fuzzy logic includes 0 and 1 as extreme cases of truth (or “the state of matters” or “fact”) but also includes the various states of truth in between so that, for example, the result of a comparison between two things could be not “tall” or “short” but “.38 of tallness.”

Fuzzy Logic in Excel

Analytics software like Idea, ACL have inbuilt Fuzzy lookup logic. For Excel Microsoft has provided a free add-in called Fuzzy Lookup Add-in for Excel.

The Fuzzy Lookup add-in provided by Microsoft works on the principle of Jaccard Similarity Index. The Jaccard index also known as Jaccard similarity coefficient was developed by Paul Jaccard.

In Jaccard Similarity index, you take two sample sets. You count the attributes shared by both samples (call this Z). You count the attributes observed in only one sample (call this X). You then count the attributes observed in only the other sample (call this Y). Finally, you can calculate the similarity index by using the three counts in this operation.

Jaccard Similarity Index = Z / (Z+X+Y)

Fuzzy lookup add-in can be used to match records in two different data sets or can be used to find duplicates within the same dataset.

Using the Fuzzy Lookup add-in:
  • Convert your data in to a table (Alt+N+T or Ctrl+T or Insert>Table)
  • Go in a new sheet as Fuzzy Lookup add-in overwrites the data
  • Click the Fuzzy Lookup button on ribbon

In the Fuzzy Lookup dialogue box:

  1. Select the two tables you want to match
  2. Select the columns that holds the data you want to match
  3. Click the join button to drop your two columns down to the Match Columns section
  4. Select all the columns you want included in your output. Be sure to include the FuzzyLookupSimilarity column
  5. Define how many matches you want per record. If you keep this to 1, the output will only show exact duplicates with Similarity Index = 1. Keep it around 10
  6. Define the Similarity Threshold. This is the Jaccard Index number. Any match with an index of less than 0.70 is typically not that accurateFuzzy Lookup add-in settings
  7. Press the Go button to get the resultFuzzy lookup addin output

Fuzzy Lookup also has algorithms and built-in dictionaries that allow for the automatic correction of spelling mistakes, word merge scenarios, string split scenarios, and string prefix corrections. All of these automatic transformation can be turned on and off via the Fuzzy Lookup Configure dialog box. You can add custom Transformation rules by pointing the TransformationRowsetName property to your own Excel Table of rules.Fuzzy Lookup add-in advanced settings

You can download the sample file, to practice Fuzzy Lookup in Excel.
Also you can see this blog post by datapig Text Match & Fuzzy Lookup for more detailed information on Jaccard Similarity index and Fuzzy Lookup add-in.

Advertisements

2 thoughts on “Fuzzy Logic: Analyzing Text Datasets

  1. ITauditSecurity September 7, 2017 / 3:06 am

    Audit Monk,
    Great post. I use Excel’s Fuzzy Lookup and I suggest all auditors give it a try. It looks more complicated than it is. The only problem is that when you get up to 30,000 records, it starts to slow down.

    I always suggest that you play around with the settings. Different data will provide more accurate comparisons than others. I think it’s better to use a lower setting like .70 as a Similarity Threshold than a higher one. I like to sort the results by the threshold and find the point at which the accuracy is acceptable (say .78); then I filter out anything below that, and I have the results I’m looking for.

    Liked by 1 person

    • AuditMonk September 9, 2017 / 5:35 pm

      I recently came across this function in excel and have not used it that much. And Completely agree with you on that we should play around with settings since we never know which setting may give us what in different datasets.

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s