Insecure SHA-1 imperils Internet security, PKI, and most password systems

safe 'n green by Robert S. Donovan (cc) (from flickr)

I suppose it's inevitable but surprising nonetheless. A recent article in MIT Technology Review, Faster computation will damage the Internet's integrity, indicates that by 2018 SHA-1 will be crackable by any determined large organization. Just a few years later, perhaps by 2021, a much smaller organization will have the computational power to crack SHA-1 hash codes.

What’s a hash?

Cryptographic hash functions like SHA-1 are designed such that, when a string of characters is hashed, they generate a binary value with a couple of great properties:

  • Irreversibility – given a "hash_value" generated by hashing "text_string", there is no practical way to determine what "text_string" was from the hash_value alone.
  • Uniqueness – given two different text strings, "text_string1" and "text_string2", they should generate two different hash values, "hash_value1" and "hash_value2".
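
To make those properties concrete, here is a minimal sketch using Python's standard hashlib module (the input strings are just illustrative):

```python
import hashlib

# Hashing the same text always yields the same 160-bit digest (40 hex characters),
# but nothing in the digest reveals the original text.
text_string = b"the quick brown fox"
hash_value = hashlib.sha1(text_string).hexdigest()
print(hash_value)

# Changing even one character produces a completely different digest.
print(hashlib.sha1(b"the quick brown fix").hexdigest())
```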

Although hash functions are designed to be irreversible, that doesn't mean they can't be broken by a brute force attack. For example, if one were to hash every possible text string, sooner or later one would come up with a "text_string1" that hashes to "hash_value1".
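
A toy illustration of that brute-force idea (the target and candidate list here are invented; a real attack grinds through an enormous dictionary or keyspace):

```python
import hashlib

# Suppose all we have is the hash value; we guess text strings until one matches.
target_hash = hashlib.sha1(b"letmein").hexdigest()

candidate_strings = [b"password", b"123456", b"qwerty", b"letmein"]
for guess in candidate_strings:
    if hashlib.sha1(guess).hexdigest() == target_hash:
        print("recovered text_string:", guess.decode())
        break
```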

But perhaps even more serious, the SHA-1 algorithm is prone to hash collisions, which breaks the uniqueness property above. That is, there are multiple "text_string1"s that hash to the same "hash_value1".

All this wouldn't be much of a problem except that, with Moore's law in force and continuing for the next six years or so, we will have processing power in chips capable of mounting a brute force attack against SHA-1 to find text strings that match any specific hash value.

So what’s the big deal?

Well, it turns out that the SHA-1 algorithm underpins almost all secure data transmission today. That is, most public-key infrastructure (PKI) depends on SHA-1 to sign digital certificates. And although that's pretty bad, what's even worse is that Secure Sockets Layer/Transport Layer Security (SSL/TLS), used by "https://" websites the world over, also depends on SHA-1 to exchange the key information used to encrypt/decrypt secure Internet transactions.

On top of all that, many of today's password-protected systems use SHA-1 to hash passwords, and instead of storing actual passwords in plain text in their password files, they store only the SHA-1 hashes of the passwords. As such, by 2021, anyone who can read the hashed password file could retrieve any password in plain text.
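
A sketch of that practice (the usernames and passwords are invented, and real systems would at least salt the hashes): the file holds only digests, so whoever steals it must run a brute-force search like the one above against each entry, which is exactly what becomes feasible once SHA-1 is cheap to attack.

```python
import hashlib

# The "password file" stores only SHA-1 digests, never the plain-text passwords.
password_file = {
    "alice": hashlib.sha1(b"correct horse").hexdigest(),
    "bob":   hashlib.sha1(b"hunter2").hexdigest(),
}

def check_login(user, submitted_password):
    # Hash what the user typed and compare it to the stored digest.
    digest = hashlib.sha1(submitted_password.encode()).hexdigest()
    return password_file.get(user) == digest

print(check_login("alice", "correct horse"))   # True
print(check_login("alice", "wrong guess"))     # False
```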

What all this means is that by 2018 for some, and by 2021 or thereabouts for just about anybody else, today's secure Internet traffic, PKI, and most system passwords will no longer be secure.

What needs to be done

It turns out that the NSA knew about the failings of SHA-1 quite a while ago, and as such, NIST released SHA-2 as a new hash algorithm and its functional replacement. Probably just in time, this month NIST announced a winner for a new SHA-3 algorithm as a functional replacement for SHA-2.

This may take a while. What needs to be done is to invalidate all digital certificates that use SHA-1 and generate new ones using SHA-2 or SHA-3. And of course, TLS and SSL Internet functionality all has to be re-coded to recognize and use SHA-2 or SHA-3 instead of SHA-1.

Finally, for most of those password systems, users will need to log in again and have their password hashes changed over from SHA-1 to SHA-2 or SHA-3.
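
One way that changeover could work, sketched with Python's hashlib and invented names: verify the old SHA-1 digest at the user's next login, then store a SHA-2 (here SHA-256) digest in its place.

```python
import hashlib

old_store = {"alice": hashlib.sha1(b"correct horse").hexdigest()}   # legacy SHA-1 digests
new_store = {}                                                      # upgraded SHA-256 digests

def login_and_migrate(user, password):
    pw = password.encode()
    # If the legacy SHA-1 digest still verifies, re-hash with SHA-256 and retire it.
    if user in old_store and hashlib.sha1(pw).hexdigest() == old_store[user]:
        new_store[user] = hashlib.sha256(pw).hexdigest()
        del old_store[user]
        return True
    # Otherwise check against the upgraded store.
    return new_store.get(user) == hashlib.sha256(pw).hexdigest()

print(login_and_migrate("alice", "correct horse"))   # True -> hash upgraded
print("alice" in old_store, "alice" in new_store)    # False True
```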

Naturally, in order to use SHA-2 or SHA-3, many systems may need to be upgraded to later levels of code. Seems like Y2K all over again, only this time it's security that's going to crash. It's good to be in the consulting business, again.

~~~~

But the real problem, IMHO, is Moore's law. If it continues to double processing power/transistor density every two years or so, how long before SHA-2 or SHA-3 succumbs to the same sorts of brute force attacks? Given that, we appear destined to change hashing, encryption, and other security algorithms every decade or so until Moore's law slows down or, god forbid, stops altogether.

Comments?


Describing Dedupe

Hard Disk 4 by Alpha six (cc) (from flickr)

Deduplication is a mechanism to reduce the amount of data stored on disk for backup, archive, or even primary storage. On any storage, data is often duplicated, and any system that eliminates storing duplicate data will utilize storage more efficiently.

Essentially, deduplication systems identify duplicate data and store only one copy of such data. They use pointers to incorporate the duplicate data at the right point in the data stream. Such services can be provided at the source, at the target, or even at the storage subsystem/NAS system level.

The easiest way to understand deduplication is to view a data stream as a book, which consists of two parts: a table of contents and the actual chapters of text (or data). The stream's table of contents provides chapter titles but, more importantly (to us), identifies a page number for each chapter. A deduplicated data stream looks like a book where chapters can be duplicated within the same book or even across books, and the table of contents can point to any book's chapter when duplicated. A deduplication service inputs the data stream, searches for duplicate chapters and deletes them, and updates the table of contents accordingly.

There's more to this, of course. For example, chapters or duplicate data segments must be tagged with how often they are referenced, so that such data is not lost when modified. Also, one way to determine whether data is duplicated is to take one or more hashes of it and compare these to the hashes of other data, but to work quickly, the data hashes must be kept in a searchable index.
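
A minimal sketch of those mechanics in Python (fixed-size chunking, SHA-256 hashing, and in-memory dictionaries are simplifying assumptions; real products use variable-size chunking and persistent, searchable indexes):

```python
import hashlib

CHUNK_SIZE = 4096        # assumed fixed-size chunks, for simplicity

chunk_store = {}         # hash -> the one stored copy of that chunk
ref_counts  = {}         # hash -> how many times the chunk is referenced

def dedupe_stream(data):
    """Store unique chunks and return the stream's 'table of contents' (chunk hashes)."""
    toc = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        h = hashlib.sha256(chunk).hexdigest()
        if h not in chunk_store:             # only unique chunks are stored
            chunk_store[h] = chunk
        ref_counts[h] = ref_counts.get(h, 0) + 1
        toc.append(h)
    return toc

def reconstitute(toc):
    """Rebuild the original stream by following the table of contents' pointers."""
    return b"".join(chunk_store[h] for h in toc)

data = b"A" * 8192 + b"B" * 8192 + b"A" * 8192       # duplicated 'chapters'
toc = dedupe_stream(data)
assert reconstitute(toc) == data
print(len(data), sum(len(c) for c in chunk_store.values()))   # 24576 raw vs. 8192 stored
```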

Types of deduplication

  • Source deduplication involves a repository, a client application, and an operation that copies client data to the repository.  Client software chunks the data, hashes the data chunks, and sends these hashes over to the repository.  On the receiving end, the repository determines which hashes are duplicates and then tells the client to send only the unique data (see the sketch after this list).  The repository stores the unique data chunks and the data stream's table of contents.
  • Target deduplication involves performing deduplication inline, in-parallel, or post-processing by chunking the data stream as it's received, hashing the chunks, determining which chunks are unique, and storing only the unique data.  Inline refers to doing such processing while receiving data at the target system, before the data is stored on disk.  In-parallel refers to doing a portion of this processing while receiving data, i.e., portions of the data stream are deduplicated while other portions are still being received.  Post-processing refers to data that is completely staged to disk before being deduplicated later.
  • Storage subsystem/NAS system deduplication looks a lot like post-processing, target deduplication.  For NAS systems, deduplication looks at a file of data after it is closed.  For general storage subsystems, the process looks at blocks of data after they are written.  Whether either system detects duplicate data below these levels is implementation dependent.
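
A sketch of that source-dedupe exchange (the class and method names are invented): the client sends chunk hashes first, the repository replies with the hashes it has never seen, and only those chunks cross the wire.

```python
import hashlib

class Repository:
    def __init__(self):
        self.chunks = {}                       # hash -> stored chunk data

    def filter_unknown(self, hashes):
        # Tell the client which hashes the repository has never seen.
        return [h for h in hashes if h not in self.chunks]

    def store(self, new_chunks):               # {hash: data} for unique chunks only
        self.chunks.update(new_chunks)

def client_backup(repo, chunks):
    hashes = [hashlib.sha256(c).hexdigest() for c in chunks]
    needed = set(repo.filter_unknown(hashes))
    repo.store({h: c for h, c in zip(hashes, chunks) if h in needed})
    return hashes                              # the stream's table of contents

repo = Repository()
client_backup(repo, [b"chunk-A", b"chunk-B"])
client_backup(repo, [b"chunk-A", b"chunk-C"])  # only chunk-C is actually sent and stored
print(len(repo.chunks))                        # 3 unique chunks held at the repository
```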

Deduplication overhead

Deduplication processes generate most of their overhead while deduplicating the data stream, essentially during or after the data is written, which is the reason that target deduplication has so many options: some optimize ingestion while others optimize storage use. There is very little additional overhead for reconstituting (or un-deduplicating) the data for read back, as retrieving the unique and/or duplicated data segments can be done quickly. There may be some minor performance loss because of the lack of sequentiality, but that only impacts data throughput, and not by much.

Where dedupe makes sense

Deduplication was first implemented for backup data streams, because any backup regimen that takes full backups on a monthly or even weekly basis will duplicate lots of data. For example, if one takes a full backup of 100TB every week and, let's say, new unique data created each week is ~15%, then at week 0, 100TB is stored for both the deduplicated and un-deduplicated versions; at week 1 it takes 115TB to store the deduplicated data but 200TB for the non-deduplicated data; at week 2 it takes ~132TB to store the deduplicated data but 300TB for the non-deduplicated data, and so on. As each full backup completes, it takes another 100TB of un-deduplicated storage but significantly less deduplicated storage. After 8 full backups, the un-deduplicated storage would require 800TB but only ~265TB for the deduplicated storage.
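
The arithmetic behind those numbers, as a quick check (assuming the deduplicated store grows only by the ~15% of new unique data each week):

```python
full_backup_tb = 100     # size of each weekly full backup
growth = 0.15            # fraction of new unique data per week

dedup_tb, raw_tb = float(full_backup_tb), full_backup_tb   # week 0
for week in range(1, 8):                                   # weeks 1-7: 8 fulls in total
    dedup_tb *= (1 + growth)      # only the new unique data adds to the dedupe store
    raw_tb += full_backup_tb      # every full adds another 100TB un-deduplicated
    print(week, round(dedup_tb), raw_tb)
# week 1: 115 vs 200 ... week 7: ~266 (the ~265TB cited above) vs 800
```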

Deduplication can also work for secondary or even primary storage. Most IT shops with 1000's of users duplicate lots of data. For example, interim files are sent from one employee to another for review, reports are sent out en masse to teams, emails are blasted to all employees, etc. Consequently, any storage (sub)system that can deduplicate data will utilize backend storage more efficiently.

Full disclosure: I have worked for many deduplication vendors in the past.