Security Aspects of Fuzzy Hashing
Loading...
Files
Date
2011-07-21
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
Hash functions are widely spread in computer science and used to map arbitrary large data to bit
strings of a fixed length called a fingerprint. Cryptographic hash functions like SHA-1 or MD5 are
established in various fields, but have one `drawback' due to several security requirements: if one
bit of the input is changed, this results in a completely different fingerprint.
In computer forensics cryptographic hashing represents the backbone in identifying files in sets of
known-to-be-good (known as whitelisting) or known-to-be-bad files (known as blacklisting). Therefore,
most investigators use hash databases, which are either created by themselves or maintained
by institutions. Once they are in possession of such a database, the processing is straightforward:
take a storage medium, compute all hash values, and perform look ups in the database.
An active adversary, however, can destroy this straightforward proceeding by changing one
bit of each file. Conditioned by the pseudo-randomness of the cryptographic hash function, this
leads to completely different hash values. A 'similarity hash function' for files, i.e. a fuzzy-hashing
function, would be the perfect counterpart for this attack.
In 2006, Jesse Kornblum presented an implementation for this idea in [Kor06] named context
triggered piecewise hashing (CTPH). Since then, he has been quoted numerous times (e.g. [Hur09])
and his approach has been improved with respect to performance (e.g. [SLC09]). However, there
is no security analysis in terms of the reliability of the results.
In our research, we address some security aspects of CTPH:
1. Even if two files have absolutely different content, we show how to efficiently achieve high
match scores of the respective CTPH values of both files. This approach may be used to
circumvent CTPH whitelisting.
2. We show how to efficiently manipulate a file by a fixed number of modification. The CTPH
of the manipulated version of the file differs significantly from the CTPH of the original,
although both versions are perceptually identical. This approach may be used to circumvent
CTPH blacklisting.
Besides these two anti-forensic issues, we also discovered weaknesses in the randomness of Kornblum's
pseudo-random function. We showed how to improve Kornblum's pseudo-random function
with respect to efficiency and randomness.