It’s just data

Electronic Forgery

Bruce Schneier: It's time for us all to migrate away from SHA-1.

What triggered this?  A collision in MD5.  I am not an expert in cryptography, but I do find this subject fascinating.  Here is what I have pieced together so far:

What does a collision mean?  One thing it does not mean is that somebody can reverse engineer 186 megabyte Debian image knowing only a 128 bit hash.

What it does, however, mean is that somebody may be able to generate another file with the same hash.  If they can then persuade you to download it, the results would appear authentic.  If able to be done in real time, this represents a man in the middle attack.

There are three hurdles, however, to overcome.

First, the attacker would have be able to generate a collision at will.  This is considerably harder than being able to find a collision - something known as the birthday paradox.  If you gather a random collection of 23 people, you have a 50% chance of finding two people with the same birthday.  However it takes gathering a total of 253 people before you have a 50% chance of finding somebody with your birthday.  Most people find these numbers to be a bit counterintuitive, but the fact is that it is much easier to find a collision than to generate a collision at will.

Second, the collision would have to be something more than a random series of bits in order to be dangerous.  This presumably will require the ability to insert without detection a number of bits roughly the size of the key (128 or 160).  This would not be a problem for a download the size of a Linux distribution, but would be a bit more noticeable in a typical PGP signed message.

Third, the collision found dealt with MD5.  We are talking about SHA-1.  All other things being equal, if you were able to construct a computer which could crack MD5 in a second, the same computer would take over 136 years to crack SHA-1.  And we don't yet have a computer that can crack MD5 in a second, we only have a collision that was found after an immense amount of searching.

So, given all this, why the concern?  The concern is not that these hashes are vulnerable to a brute force attack, but that a shortcut can be found.  What kind of shortcut?  Here's a trivial example: one can determine that the number 123,456,789 is divisible by 9 by a simple technique: adding up the digits.  1+2+3+4+5+6+7+8+9=45 and 45 is divisible by 9, so 123,456,789 is divisible by 9.  This shortcut only works for 3 and 9, but it is much quicker than doing division.  Other numbers have other shortcuts, some seemingly nearly as hard as doing the division outright. These are merely examples of shortcuts, ones that are not likely to help with the cracking of MD5 or SHA-1, but if an appropriate shortcut can be found, all bets are off.

That's what people are concerned about.  The prevailing feeling seems to be that this is only a matter of time for MD5, but that SHA1 might last a little longer.