Finding text similarities with fuzzy hashes – Duplicate code for example

How would an email server go about identifying spam email? The problem is an interesting one. The challenges in identifying spam are:

1. Scaling any solution to thousands of emails
2. Identifying spam even when there are small changes to the spam content
3. Reducing false positives

One solution could be to compute a hash of a known spam message and compare it with the hash of each new message. The problem with this approach is that a minute change in a message results in a completely different hash. A fuzzy hash, also called a context triggered piecewise hash (CTPH), solves this by calculating hash values for pieces of the text, where each piece is delimited by a trigger point. For example, the triggers for the following text could be the words ‘a’ and ‘or’:

why would a lazy sunday be greeted with sleep. or was I wrong? You were not sleeping ?

Here are the MD5 sums for the pieces of this text, split at the triggers:

why would = 3cbca4bc4bba85fd54f384867ff4fd3e

lazy sunday be greeted with sleep. = e007174b7ef850ccc4b68ae1db98d1fe

was I wrong? You were not sleeping ? = 17d36b21dc2dfe530ea677075519a265
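The split above can be sketched in a few lines of Python. This is only an illustration of the idea: the fixed trigger words come from the example, while the function names `split_on_triggers` and `piece_hashes` are made up for this sketch. A real CTPH implementation derives its trigger points from a rolling hash rather than a word list, and the digests printed here depend on exactly which whitespace ends up in each piece, so they may not match the values listed above.

```python
import hashlib
import re

# Trigger words taken from the example above (an assumption of this sketch;
# real CTPH tools compute trigger points from a rolling hash).
TRIGGERS = ("a", "or")

def split_on_triggers(text, triggers=TRIGGERS):
    """Split text into pieces wherever a trigger appears as a whole word."""
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in triggers) + r")\b"
    pieces = re.split(pattern, text)
    # Drop the surrounding whitespace and any empty pieces.
    return [p.strip() for p in pieces if p.strip()]

def piece_hashes(text):
    """Return (piece, MD5 hex digest) pairs, one per piece."""
    return [(p, hashlib.md5(p.encode("utf-8")).hexdigest())
            for p in split_on_triggers(text)]

message = ("why would a lazy sunday be greeted with sleep. "
           "or was I wrong? You were not sleeping ?")

for piece, digest in piece_hashes(message):
    print(f"{piece} = {digest}")
```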

Combining these piece hashes gives the final fuzzy hash. If this text were considered spam and the spam later changed slightly, the individual piece hashes can be compared to arrive at a ‘match score’. Take this variant of the message:

why would a lazy sunday be greeted with sleep. or was I wrong? You were not sleeping ? Did you get the jIAgra pills I sent ?

why would = 3cbca4bc4bba85fd54f384867ff4fd3e

lazy sunday be greeted with sleep. = e007174b7ef850ccc4b68ae1db98d1fe

was I wrong? You were not sleeping ? Did you get the jIAgra pills I sent ? = 556ce7bf1c804330863e4d39755d5c58

Only one of the three hashes has changed, so the two messages would still score as a close match. The ssdeep library calculates fuzzy hashes for you.

Comparing similar spam messages is like comparing similar source code. Developers love to copy a piece of code from Project A and paste it into Project B. Eclipse plugins like ‘Google CodePro AnalytiX’ try to track similar code. I do not know whether the plugin uses fuzzy hashes to make this comparison, but it is a possibility.
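The match score for the two spam messages above could be computed along these lines. The scoring rule used here (the fraction of the known message's piece hashes that also appear in the candidate) is an assumption made for illustration, and `piece_digests` and `match_score` are hypothetical names. ssdeep uses its own signature format and its own comparison, so treat this as a sketch of the idea rather than of that library.

```python
import hashlib
import re

def piece_digests(text, triggers=("a", "or")):
    """MD5 digest of each piece, splitting at whole-word triggers."""
    pattern = r"\b(?:" + "|".join(map(re.escape, triggers)) + r")\b"
    pieces = [p.strip() for p in re.split(pattern, text) if p.strip()]
    return [hashlib.md5(p.encode("utf-8")).hexdigest() for p in pieces]

def match_score(known, candidate):
    """Fraction of the known message's piece hashes found in the candidate."""
    known_hashes = piece_digests(known)
    if not known_hashes:
        return 0.0
    candidate_hashes = set(piece_digests(candidate))
    shared = sum(1 for h in known_hashes if h in candidate_hashes)
    return shared / len(known_hashes)

spam = ("why would a lazy sunday be greeted with sleep. "
        "or was I wrong? You were not sleeping ?")
variant = spam + " Did you get the jIAgra pills I sent ?"

# Two of the three pieces are unchanged, so the score is 2/3.
score = match_score(spam, variant)
print(f"match score: {score:.2f}")
```

A plain hash of the whole message would report no similarity at all for these two texts, which is the gap the piecewise approach fills.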

The fuzzy hash concept can be extended to solve other problems:

  • A DDoS tool issues the same type of request from many different clients. If a fuzzy hash is applied to one request and it is identified as spam, the other requests can be matched against it in the same way. If the same request originates from the same IP again, the fuzzy hash calculation is not necessary. Simply ban the IP for a few days.
  • Environments that are affected by a common problem probably log the same error message. A network timeout could affect, say, 5 servers. Comparing how alike the error messages are across the server logs can help pinpoint the problem more quickly.
  • CTPH is also used in forensics.
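To make the log-comparison idea above more concrete, here is a rough sketch of a CTPH-style signature that does not rely on fixed trigger words. It slides a small window over the bytes, treats a position as a trigger when the window sum hits a chosen remainder, and emits one character per piece. The window size, the modulus, the use of MD5 for the piece hash, the `difflib` comparison and the log lines are all assumptions made up for this example; this is a simplified stand-in, not the algorithm ssdeep implements.

```python
import hashlib
from difflib import SequenceMatcher

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def signature(text, window=4, modulus=8):
    """Simplified CTPH-style signature: one character per piece."""
    data = text.encode("utf-8")
    chars, start = [], 0
    for i in range(len(data)):
        # Toy rolling value: the sum of the last `window` bytes.
        rolling = sum(data[max(0, i - window + 1):i + 1])
        if rolling % modulus == modulus - 1:  # trigger point reached
            piece = data[start:i + 1]
            chars.append(ALPHABET[hashlib.md5(piece).digest()[0] % 64])
            start = i + 1
    if start < len(data):  # hash whatever is left after the last trigger
        chars.append(ALPHABET[hashlib.md5(data[start:]).digest()[0] % 64])
    return "".join(chars)

def similarity(a, b):
    """Similarity of two texts in [0, 1], based on their signatures."""
    return SequenceMatcher(None, signature(a), signature(b)).ratio()

# Hypothetical log lines from three servers.
logs = {
    "web01": "ERROR connection to db.internal:5432 timed out after 30s",
    "web02": "ERROR connection to db.internal:5432 timed out after 31s",
    "web03": "WARN disk usage on /var at 91 percent",
}

reference = logs["web01"]
for server, line in logs.items():
    print(server, round(similarity(reference, line), 2))
```

Because each trigger depends only on the few bytes before it, a change in one place only disturbs the signature characters for the pieces around that change, and the rest of the signature stays comparable.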

The next time you encounter a problem that involves ‘text similarity’, give fuzzy hashes a thought.

Categories: General
  1. Shubhashish
    April 25th, 2011 at 18:13 | #1

    Nice article, specially to know about “Fuzzy hash”.
