Home > General > Finding text similarities with fuzzy hashes – Duplicate code for example

Finding text similarities with fuzzy hashes – Duplicate code for example

How would an email server go about identifying spam email ? The problem is an interesting one. The challenges towards identifying spam are…

1. Scaling any solution to thousands of emails
2. Identifying spam even when there are small changes to the spam content
3. Reducing false positives

One solution could be to identify a hash for the spam message and compare the hash with the hash of a new message. The problem with this approach is that minute changes in a message can result in a different hash. A fuzzy hash (Context triggered piecewise hash (CTPH)) solves this by calculating hashes based on a trigger point in the text. Hash values are calculated for pieces of the text, delimited by a trigger. For example the trigger for the following text could be ‘a’ and ‘or’

why would a lazy sunday be greeted with sleep. or was I wrong? You were not sleeping ?

Here are the MD5 sums for this text, split by the delimiters / triggers

why would = 3cbca4bc4bba85fd54f384867ff4fd3e

lazy sunday be greeted with sleep. = e007174b7ef850ccc4b68ae1db98d1fe

was I wrong? You were not sleeping ? = 17d36b21dc2dfe530ea677075519a265


Summing up these hashes gives a final hash. If this text were considered spam, and minor changes occur to the spam, the individual hashes can be compared to arrive at a ‘match score’.

why would a lazy sunday be greeted with sleep. or was I wrong? You were not sleeping ? Did you get the jIAgra pills I sent ?

why would = 3cbca4bc4bba85fd54f384867ff4fd3e

lazy sunday be greeted with sleep. = e007174b7ef850ccc4b68ae1db98d1fe

was I wrong? You were not sleeping ? Did you get the jIAgra pills I sent ? = 556ce7bf1c804330863e4d39755d5c58

Only one of the hashes has changed. This would give a good score. The ssdeep library calculates fuzzy hashes for you. Comparing similar spam messages is like comparing similar source code. Developers love to copy one piece of code from Project A and paste it into Project B. Eclipse plugins like ‘Google code pro analytix’ try to track similar code. I am not aware if the plugin uses fuzzy hashes to make this comparison, but it is a possibility.

The fuzzy hash concept can be be extended to solve other problems

  • A DDOS tool issues the same type of request across different clients. If a fuzzy hash is applied to one request and it is identified as spam, the same can be done across all other requests. If the same request originates from the IP again, the fuzzy hash calculation is not necessary. Simply ban the IP for a few days.
  • Environments that are affected by a common problem probably log the same error message. A network timeout could affect say 5 servers. Comparing the error message likeness across the server logs can help determine the problem quicker
  • CTPH is also used in forensics.

The next time you encounter a problem that involves ‘text similarity’ give fuzzy hashes a thought.





Categories: General Tags: ,
  1. Shubhashish
    April 25th, 2011 at 18:13 | #1

    Nice article, specially to know about “Fuzzy hash”.

  2. May 1st, 2013 at 06:49 | #2

    Bad news, everyone! Unfortunately, Matt Groening best series (you heard me) is going into that dark knight… again. Entertainment Weekly reports that Comedy Central has decided not to renew the show after the final set of 13 episodes starts airing in June

  3. October 5th, 2013 at 02:04 | #3

    nike id 納期

  4. October 19th, 2013 at 07:44 | #4

    You’re so cool! I do not believe I’ve truly read
    something like that before. So wonderful to find
    somebody with some original thoughts on this topic. Seriously..
    thanks for starting this up. This website is something that is required on the web, someone with a bit of originality!

  1. No trackbacks yet.