Finding text similarities with fuzzy hashes – Duplicate code for example
How would an email server go about identifying spam email? The problem is an interesting one. The challenges in identifying spam are:
1. Scaling any solution to thousands of emails
2. Identifying spam even when there are small changes to the spam content
3. Reducing false positives
One solution could be to compute a hash of a known spam message and compare it with the hash of each new message. The problem with this approach is that even a minute change in a message produces a different hash. A fuzzy hash, also known as a context triggered piecewise hash (CTPH), solves this by calculating hashes based on trigger points in the text. Hash values are calculated for pieces of the text, delimited by the triggers, so a small change alters only the hash of the piece that contains it. For example, the triggers for the following text could be ‘a’ and ‘or’:
why would a lazy sunday be greeted with sleep. or was I wrong? You were not sleeping ?
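To make the idea concrete, here is a minimal Python sketch of piecewise hashing on the sentence above. It is only an illustration, not a real CTPH implementation: it assumes the triggers ‘a’ and ‘or’ are matched as whole words, and the helper names (split_on_triggers, toy_fuzzy_hash, similarity) are made up for this example. Tools such as ssdeep find trigger points with a rolling hash instead of fixed words.

```python
import hashlib

# Assumption for this sketch: triggers are matched as whole words.
TRIGGERS = {"a", "or"}


def split_on_triggers(text, triggers=TRIGGERS):
    """Split text into pieces; a piece ends right after each trigger word."""
    pieces, current = [], []
    for word in text.split():
        current.append(word)
        if word.lower() in triggers:
            pieces.append(" ".join(current))
            current = []
    if current:  # keep whatever follows the last trigger
        pieces.append(" ".join(current))
    return pieces


def piece_hash(piece):
    """Short hash of one piece: first 8 hex digits of its SHA-256 digest."""
    return hashlib.sha256(piece.encode("utf-8")).hexdigest()[:8]


def toy_fuzzy_hash(text):
    """Signature of a text: the list of hashes of its pieces, in order."""
    return [piece_hash(p) for p in split_on_triggers(text)]


def similarity(sig_a, sig_b):
    """Share of piece hashes common to both signatures (Jaccard index)."""
    a, b = set(sig_a), set(sig_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = ("why would a lazy sunday be greeted with sleep. "
            "or was I wrong? You were not sleeping ?")
# Same message with a single word changed ("sunday" -> "monday").
modified = original.replace("sunday", "monday")

# A plain hash of the whole message changes completely.
full_original = hashlib.sha256(original.encode("utf-8")).hexdigest()
full_modified = hashlib.sha256(modified.encode("utf-8")).hexdigest()
print("whole-message hashes equal:", full_original == full_modified)

# The piecewise signatures differ only in the piece that was edited.
print(split_on_triggers(original))
print(toy_fuzzy_hash(original))
print(toy_fuzzy_hash(modified))
print("similarity:", similarity(toy_fuzzy_hash(original), toy_fuzzy_hash(modified)))
```

The sentence splits into three pieces. Changing one word touches only the middle piece, so the two signatures still share their first and last hashes, while the whole-message hashes have nothing in common. A spam filter built on this idea would compare signatures and flag a message once the overlap passes some threshold.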