
Finding text similarities with fuzzy hashes – Duplicate code for example

How would an email server go about identifying spam? The problem is an interesting one. The challenges in identifying spam are:

1. Scaling any solution to thousands of emails
2. Identifying spam even when there are small changes to the spam content
3. Reducing false positives

One solution is to compute a hash of a known spam message and compare it with the hash of each new message. The problem with this approach is that a minute change in a message results in a completely different hash. A fuzzy hash, or context triggered piecewise hash (CTPH), solves this by splitting the text at trigger points and calculating a hash for each piece. For example, the triggers for the following text could be the words ‘a’ and ‘or’:

why would a lazy sunday be greeted with sleep. or was I wrong? You were not sleeping ?

Here are the MD5 sums for the pieces of this text, split at the triggers:

why would = 3cbca4bc4bba85fd54f384867ff4fd3e

lazy sunday be greeted with sleep. = e007174b7ef850ccc4b68ae1db98d1fe

was I wrong? You were not sleeping ? = 17d36b21dc2dfe530ea677075519a265
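The splitting and hashing above can be sketched in a few lines of Python. This is only an illustration of the idea, not how ssdeep or any real CTPH implementation picks its trigger points. The function names are made up for this example, and matching the triggers as whole words is an assumption. The digests it prints depend on exactly how whitespace around each piece is handled, so they may not match the values listed above.

```python
import hashlib
import re

# Triggers from the example above; matched as whole words (an assumption).
TRIGGERS = ("a", "or")


def split_on_triggers(text, triggers=TRIGGERS):
    """Split text at each whole-word trigger and drop the triggers themselves."""
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in triggers) + r")\b"
    pieces = (piece.strip() for piece in re.split(pattern, text))
    return [piece for piece in pieces if piece]


def piece_hashes(text, triggers=TRIGGERS):
    """Return (piece, MD5 hex digest) pairs, one per piece of the text."""
    return [
        (piece, hashlib.md5(piece.encode("utf-8")).hexdigest())
        for piece in split_on_triggers(text, triggers)
    ]


message = (
    "why would a lazy sunday be greeted with sleep. "
    "or was I wrong? You were not sleeping ?"
)

for piece, digest in piece_hashes(message):
    print(f"{piece} = {digest}")
```

The message splits into the same three pieces shown above, and each piece gets its own digest.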

Combining these piece hashes gives the final hash. If this text is known spam and a variant with minor changes arrives, the individual piece hashes can be compared to arrive at a ‘match score’. For example, here is the same message with one sentence appended:

why would a lazy sunday be greeted with sleep. or was I wrong? You were not sleeping ? Did you get the jIAgra pills I sent ?

why would = 3cbca4bc4bba85fd54f384867ff4fd3e

lazy sunday be greeted with sleep. = e007174b7ef850ccc4b68ae1db98d1fe

was I wrong? You were not sleeping ? Did you get the jIAgra pills I sent ? = 556ce7bf1c804330863e4d39755d5c58

Only the last of the three hashes has changed, so the variant still scores a strong match against the original. The ssdeep library calculates fuzzy hashes for you.

Comparing similar spam messages is like comparing similar source code. Developers love to copy a piece of code from Project A and paste it into Project B. Eclipse plugins such as ‘Google CodePro AnalytiX’ try to track similar code. I do not know whether the plugin uses fuzzy hashes to make this comparison, but it is a possibility.
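Returning to the spam example, here is a rough sketch of one way to compute a ‘match score’: the fraction of the known message's piece hashes that also appear in the new message. The scoring rule and the function names are assumptions made for illustration. ssdeep computes its own score differently.

```python
import hashlib
import re


def piece_digests(text, triggers=("a", "or")):
    """MD5 digests of the pieces of text, split at whole-word triggers."""
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in triggers) + r")\b"
    pieces = (piece.strip() for piece in re.split(pattern, text))
    return [hashlib.md5(p.encode("utf-8")).hexdigest() for p in pieces if p]


def match_score(known, candidate):
    """Fraction (0.0 to 1.0) of the known text's piece hashes found in candidate."""
    known_digests = piece_digests(known)
    if not known_digests:
        return 0.0
    candidate_digests = set(piece_digests(candidate))
    hits = sum(1 for digest in known_digests if digest in candidate_digests)
    return hits / len(known_digests)


spam = (
    "why would a lazy sunday be greeted with sleep. "
    "or was I wrong? You were not sleeping ?"
)
variant = spam + " Did you get the jIAgra pills I sent ?"

print(match_score(spam, variant))
```

Two of the three piece hashes survive the change, so the variant scores about 0.67. A real filter would compare this score to a threshold chosen to keep false positives low.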

The fuzzy hash concept can be extended to solve other problems:

  • A DDoS tool issues the same type of request from many different clients. Once the fuzzy hash of one request has been flagged as malicious, requests from other clients can be matched against it. If the same request comes from an already flagged IP again, the fuzzy hash calculation is not necessary: simply ban the IP for a few days.
  • Environments that are affected by a common problem probably log the same error message. A network timeout could affect, say, 5 servers. Comparing the likeness of error messages across the server logs can help identify the problem more quickly.
  • CTPH is also used in forensics.
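The shortcut in the first bullet, skipping the fuzzy hash for an IP that is already banned, could look roughly like this. Everything here is hypothetical: `BanList`, `handle_request` and the `looks_malicious` callback (a stand-in for whatever fuzzy-hash comparison is used) are names invented for this sketch, and the three-day ban is an arbitrary reading of ‘a few days’.

```python
import time

BAN_SECONDS = 3 * 24 * 60 * 60  # "a few days"; three is an arbitrary choice


class BanList:
    """Remembers banned IPs until their ban expires."""

    def __init__(self, ban_seconds=BAN_SECONDS, clock=time.time):
        self._expiry = {}  # ip -> time at which the ban ends
        self._ban_seconds = ban_seconds
        self._clock = clock

    def ban(self, ip):
        self._expiry[ip] = self._clock() + self._ban_seconds

    def is_banned(self, ip):
        expires = self._expiry.get(ip)
        if expires is None:
            return False
        if self._clock() >= expires:
            del self._expiry[ip]  # ban has run out
            return False
        return True


def handle_request(ip, body, ban_list, looks_malicious):
    """Return True if the request should be served, False if it is rejected."""
    if ban_list.is_banned(ip):
        return False  # cheap path: no fuzzy hash needed
    if looks_malicious(body):  # expensive path: fuzzy-hash comparison
        ban_list.ban(ip)
        return False
    return True


bans = BanList()
is_attack = lambda body: "flood" in body  # placeholder for a fuzzy-hash match
print(handle_request("203.0.113.7", "GET /flood", bans, is_attack))
print(handle_request("203.0.113.7", "GET /index", bans, is_attack))
```

Both calls are rejected. The second one never reaches the detector because the IP is already on the ban list.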

The next time you encounter a problem that involves ‘text similarity’, give fuzzy hashes a thought.
