The hackers were from Russia and Ukraine and used an “SQL injection” attack, with malware to cover their tracks. SQL injection appends SQL commands to the end of an entry field; the application then interprets the input as a valid SQL command, which can then be used to dump an SQL database.
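The mechanics are easy to see in a toy example. Here's a minimal Python sketch, using an in-memory SQLite database and a made-up "cards" table (all names purely illustrative), of how appended SQL rewrites a query:

```python
import sqlite3

# Toy database with a hypothetical "cards" table -- names are illustrative only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cards (owner TEXT, number TEXT)")
db.executemany("INSERT INTO cards VALUES (?, ?)",
               [("alice", "4111-1111"), ("bob", "4222-2222")])

def lookup_vulnerable(owner):
    # Vulnerable: user input is pasted directly into the SQL text.
    query = "SELECT number FROM cards WHERE owner = '%s'" % owner
    return db.execute(query).fetchall()

# Normal use returns one row...
print(lookup_vulnerable("alice"))          # one row

# ...but injected input rewrites the WHERE clause and dumps every row.
print(lookup_vulnerable("x' OR '1'='1"))   # every row in the table
```

The injected quote closes the string literal early, and the `OR '1'='1'` clause is true for every row, which is exactly the "dump the database" effect described above.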
This indictment documents the largest data breach in US judicial history. However, Verizon’s 2013 Data Breach Investigation Report (DBIR) indicates that there were 621 confirmed data breaches in 2012, which compromised 44 million records, and over the nine-year history collected in the VERIS Community Database more than 1.1 billion records have been compromised. So it’s hard to tell whether this is a world record or just a US one. Small consolation to the customers and the institutions that lost the information.
Data security to the rescue?
In the data storage industry we talk a lot about encryption of data-in-flight and data-at-rest. It’s unclear to me whether data storage encryption services would have done anything to help mitigate this major data breach, as the perpetrators gained SQL command access to a database, which would normally have plain-text access to the data.
However, there are other threats where data storage encryption can help. Just a couple of years ago:
A commercial bank’s backup tapes, containing over 1 million bank records with Social Security numbers and other sensitive data, were lost or stolen.
A government laptop was stolen containing over 28 million discharged veterans’ Social Security numbers.
These are just two examples, but I am sure there were more where proper data-at-rest encryption would have saved the data from being breached.
Data encryption is not enough
Nevertheless, data encryption is only one layer in the multi-faceted, multi-layered security perimeter that needs to be in place to reduce, and perhaps someday eliminate, the risk of losing confidential customer information.
Apparently, SQL injection can be defeated by properly filtering or strongly typing all user input fields. I’m not exactly sure how hard this would be to do, but if it could have saved the security of 160 million credit cards and defeated one of the top ten web application vulnerabilities, it should have been a high priority on somebody’s to-do list.
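For what it's worth, the standard fix is to bind user input as data rather than pasting it into the SQL text. A minimal sketch in Python/SQLite (table and names hypothetical, as before):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cards (owner TEXT, number TEXT)")
db.executemany("INSERT INTO cards VALUES (?, ?)",
               [("alice", "4111-1111"), ("bob", "4222-2222")])

def lookup_safe(owner):
    # The ? placeholder binds the input as a value, never as SQL text,
    # so quotes inside the input cannot rewrite the statement.
    return db.execute("SELECT number FROM cards WHERE owner = ?",
                      (owner,)).fetchall()

print(lookup_safe("alice"))           # one row
print(lookup_safe("x' OR '1'='1"))    # no rows: the quotes no longer parse as SQL
```

The injected string is simply treated as an (unlikely) owner name, which matches nothing.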
In the last few weeks both Sepaton and NEC have announced new data deduplication appliance hardware and, for Sepaton at least, new functionality. Both of these vendors compete against solutions from EMC Data Domain, IBM ProtecTIER, HP StoreOnce and others.
Sepaton v7.0 Enterprise Data Protection
From Sepaton’s point of view, data growth is exploding with little increase in organizational budgets, and system environments are becoming more complex, with data risks expanding, not shrinking. In order to address these challenges, Sepaton has introduced a new version of their hardware appliance with new functionality to help address the rising data risks.
Their new S2100-ES3 Series 2925 Enterprise Data Protection Platform, with the latest Sepaton software, now supports up to 80 TB/hour of cluster data ingest (presumably with Symantec OST) and up to 2.0 PB of raw storage in an 8-node cluster. The new appliances support 4-8Gbps FC and 2-10GbE host ports per node and are based on HP DL380p Gen8 servers with dual 8-core 2.9GHz Intel Xeon E5-2690 processors, 128 GB of DRAM and a new high-performance compression card from EXAR. With the bigger capacity and faster throughput, enterprise customers can now support large backup data streams with fewer appliances, reducing complexity and maintenance/licensing fees. S2100-ES3 platforms can scale from 2 to 8 nodes in a single cluster.
The new appliance supports data-at-rest encryption for customer data security as well as data compression, both of which are hardware-based, so there is no performance penalty. Data encryption is an optional licensed feature and uses OASIS KMIP 1.0/1.1 to integrate with RSA, Thales and other KMIP-compliant enterprise key management solutions.
NEC HYDRAstor Gen 4
With Gen4, HYDRAstor introduces a new Hybrid Node, which contains both the logic of accelerator nodes and the capacity of storage nodes in one 2U rack-mounted server. Before the hybrid node, similar capacity and accessibility would have required 4U of rack space: 2U for an accelerator node and another 2U for a storage node.
The HS8-4000 HN supports 4.9TB/hr standard, or 5.6TB/hr per node with NetBackup OST IO Express ingest, and holds twelve 4TB 3.5in SATA drives, for up to 48TB of raw capacity. NEC has also introduced the HS8-4000 SN, which consists of just the 48TB of additional storage capacity. Gen4 is the first use of 4TB drives we have seen anywhere and quadruples raw capacity per node over the Gen3 storage nodes. HYDRAstor clusters can scale from 2 to 165 nodes, and performance scales linearly with the number of cluster nodes.
With the new HS8-4000 systems, maximum capacity for a 165 node cluster is now 7.9PB raw and supports up to 920.7 TB/hr (almost a PB/hr, need to recalibrate my units) with an all 165-HS8-4000 HN node cluster. Of course, how many customers need a PB/hr of backup ingest is another question. Let alone, 7.9PB of raw storage which of course gets deduplicated to an effective capacity of over 100PBs of backup data (or 0.1EB, units change again).
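The arithmetic is easy enough to check with a quick back-of-envelope calculation using the per-node figures quoted above (the per-node ingest figure is approximate, so the total lands slightly above NEC's quoted 920.7 TB/hr):

```python
nodes = 165
raw_per_node_tb = 48     # twelve 4TB drives per HS8-4000 node
ingest_per_node = 5.6    # TB/hr with NetBackup OST IO Express, per the spec above

print(nodes * raw_per_node_tb)            # 7920 TB, i.e. ~7.9 PB raw
print(round(nodes * ingest_per_node, 1))  # ~924 TB/hr, in line with the quoted 920.7
```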
NEC has also introduced a new low-end appliance, the HS3-410, for remote/branch office environments, with 3.2TB/hr ingest and up to 24TB of raw storage. It is only available as a single-node system.
Maybe Facebook could use a 0.1EB backup repository?
I suppose it’s inevitable but surprising nonetheless. A recent article in MIT Technology Review, “Faster computation will damage the Internet’s integrity,” indicates that by 2018 SHA-1 will be crackable by any determined large organization. Just a few years later, perhaps by 2021, a much smaller organization will have the computational power to crack SHA-1 hash codes.
What’s a hash?
Cryptographic hash functions like SHA-1 are designed such that, when a string of characters is hashed, they generate a binary value with a couple of great properties:
Irreversibility – given a “hash_value” generated by hashing “text_string”, there is no way to determine what “text_string” was from the hash_value alone.
Uniqueness – given two different text strings, “text_string1” and “text_string2”, they should generate two distinct hash values, “hash_value1” and “hash_value2”.
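Both properties are easy to see with Python's hashlib (the input strings here are just placeholders):

```python
import hashlib

h1 = hashlib.sha1(b"text_string1").hexdigest()
h2 = hashlib.sha1(b"text_string2").hexdigest()

print(h1)        # a 160-bit (40 hex digit) value; nothing about it reveals the input
print(h1 != h2)  # True: even near-identical inputs hash to completely different values
```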
Although hash functions are designed to be irreversible that doesn’t mean that they couldn’t be broken via a brute force attack. For example, if one were to try every known text string, sooner or later one would come up with a “text_string1” that hashes to “hash_value1”.
But perhaps even more serious, the SHA-1 algorithm is prone to hash collisions, which means it fails the uniqueness property above. That is, there are multiple “text_string1”s that hash to the same “hash_value1”.
All this wouldn’t be much of a problem except that, with Moore’s law in force and continuing for the next 6 years or so, we will have processing power in chips capable of mounting a brute force attack against SHA-1 to find text strings that match any specific hash value.
On top of all that, many of today’s secure systems use SHA-1 to hash passwords, and instead of storing actual passwords in plain text in their password files, they store only the SHA-1 hash of the passwords. As such, by 2021, anyone who can read the hashed password file could retrieve any password in plain text.
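This is why a readable file of unsalted hashes is so dangerous. A toy Python sketch of a dictionary attack against a hypothetical leaked SHA-1 password file (the user name and candidate passwords are invented for illustration):

```python
import hashlib

# Hypothetical leaked password file: user -> unsalted SHA-1 hash, no plain text stored.
password_file = {"alice": hashlib.sha1(b"letmein").hexdigest()}

# The attacker simply hashes candidate strings until one matches the stored value.
cracked = None
for guess in [b"123456", b"password", b"letmein", b"qwerty"]:
    if hashlib.sha1(guess).hexdigest() == password_file["alice"]:
        cracked = guess.decode()
        break

print(cracked)  # letmein -- the "irreversible" hash gave up the password
```

Scale the candidate list up to every plausible string, and enough compute turns this loop into the brute force attack the article warns about.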
What all this means is that, by 2018 for some and 2021 or thereabouts for just about anybody else, today’s secure internet traffic, PKI and most system passwords will no longer be secure.
What needs to be done
It turns out that the NSA knew about the failings of SHA-1 quite a while ago, and as such, NIST released SHA-2 as a new hash algorithm and its functional replacement. Probably just in time, this month NIST announced a winner for a new SHA-3 algorithm as a functional replacement for SHA-2.
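From a programmer's perspective, the switch is mostly a matter of calling a different digest; Python's hashlib, for instance, already exposes all three families (the SHA-3 call assumes Python 3.6 or later):

```python
import hashlib

msg = b"abc"
print(hashlib.sha1(msg).hexdigest())      # legacy SHA-1, 160-bit digest
print(hashlib.sha256(msg).hexdigest())    # SHA-2 family, 256-bit digest
print(hashlib.sha3_256(msg).hexdigest())  # SHA-3 (Keccak) winner, 256-bit digest
```

The hard part, as described below, isn't computing the new digests; it's re-issuing every certificate and re-coding every protocol that has SHA-1 baked in.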
This may take a while. What needs to be done is to have all digital certificates that use SHA-1 invalidated, with new ones generated using SHA-2 or SHA-3. And of course, TLS and SSL Internet functionality all has to be re-coded to recognize and use SHA-2 or SHA-3 instead of SHA-1.
Finally, for most of those password systems, users will need to re-login and have their password hashes changed over from SHA-1 to SHA-2 or SHA-3.
Naturally, in order to use SHA-2 or SHA-3, many systems may need to be upgraded to later levels of code. Seems like Y2K all over again, only this time it’s security that’s going to crash. It’s good to be in the consulting business, again.
But the real problem IMHO is Moore’s law. If it continues to double processing power/transistor density every two years or so, how long before SHA-2 or SHA-3 succumb to the same sorts of brute force attacks? Given that, we appear destined to change hashing, encryption and other security algorithms every decade or so until Moore’s law slows down or, god forbid, stops altogether.
This concern was mainly aired by one cloud provider, but they mentioned that any US company would need to provide the same access to data located anywhere.
I suppose, living in the US, this sort of access should not be a concern for me, but somehow this struck a chord. Does this mean that anything I store in the cloud, search on the internet, or publish to social media is essentially available to any government entity that deems it important to access? Yes, probably so.
The Fourth Amendment to the US Constitution established the right of individuals not to be subject to “unreasonable search and seizure of property”. One could readily extend the definition of property to data. However, somewhere in case law this provision has been modified to imply that such rights only apply to property for which a person has a reasonable expectation of privacy.
Data property rights outside your office
So where does that leave data property rights?
Social media – seems to me that you waive any property rights to the data you submit to social media the moment you hit enter. For example, in Twitter, any tweets you create are broadcast to all your followers, and anybody searching on tweet text can see them (unless you restrict your tweets). Places like Facebook, Flickr, YouTube and other social media provide a service where updates are broadcast automatically to anyone searching on that information, unless you lock it down and secure access to only a limited set of “friends”. But in the most common case, data in social media is public information (although perhaps owned by the social media company).
Cloud data – privacy rights may or may not exist in the cloud; it depends on what you store there. Let’s say you start backing up your laptop/desktop to the cloud. Such data is in a format that is likely proprietary to the particular backup application you use, but that doesn’t give you any reasonable expectation of privacy, because those formats are known to the US company that created them. As such, plain-text data placed in the cloud probably has no expectation of privacy. Encrypted data is another story, however.
Establishing reasonable expectations of privacy
So what can someone do today to establish “expectations of privacy”?
Abandon social media. If you can’t do that, be very careful of the data you expose there.
Abandon cloud storage. If you can’t do that, encrypt your data before it moves or is copied to the cloud. But you must understand who owns the encryption keys and where they reside. If the cloud provider owns the encryption keys and they can be found in the cloud, then a reasonable expectation of privacy is not present. To really secure data, encrypt it yourself with an application not associated with the cloud service, with key phrases known only to you and stored only outside the cloud. Given all that, one can assume a “reasonable expectation of privacy”.
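The principle, encrypt locally and keep the key off the cloud, can be sketched with a toy one-time pad in pure Python. To be clear, this is an illustration of the key-ownership idea only; a real deployment would use a vetted AES implementation rather than this sketch:

```python
import secrets

def encrypt(plaintext):
    # One-time pad: a random key as long as the data, used exactly once.
    key = secrets.token_bytes(len(plaintext))
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, key))
    return ciphertext, key

def decrypt(ciphertext, key):
    # XOR with the same key recovers the plaintext.
    return bytes(c ^ k for c, k in zip(ciphertext, key))

data = b"backup of my laptop"
ciphertext, key = encrypt(data)          # only the ciphertext goes to the cloud...
assert decrypt(ciphertext, key) == data  # ...the key never leaves your premises
```

As long as the key stays on your premises, whoever holds the cloud copy (or is compelled to hand it over) has only noise.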
Yes, both of these approaches are painful. Yes, they make using such facilities more complex and time consuming, but it’s the only way to establish privacy rights for your data.
Being an active user of Twitter and blogging, I have no reasonable expectation of privacy for this data but that doesn’t mean I relinquish the rest of my data to unrestrained access.
For some time now I have been considering cloud backup but have been reluctant to let my data leave my control. Such fears now seem to have a factual component to them. Nonetheless, cloud data can be private and secure, but only if one safeguards the data before it leaves the premises.
Although we have discussed securing data in the cloud before, we have not discussed IT data security in general. I count at least six different places one can secure IT data-at-rest today. In most cases, one has some sort of system to provide encryption/decryption services and some way to have encryption keys generated, stored, and securely retrieved by this system. All these systems use symmetric key cryptography, where the same key is used for both encryption and decryption. Approaches to IT data-at-rest security include data encryption performed as follows:
Drive level encryption
For tape transports, drive level encryption has been around since LTO-4, and previously with other proprietary tape formats. For disk, data encryption capabilities have been around for a long time in the consumer space and lately have been introduced into enterprise storage as well.
Encryption key management is critical to securing any drive level encryption. Key management can be supplied either externally by some sort of standalone key management software/appliance or internally from the tape library or disk subsystem controller itself.
The reasons for tape drive encryption are fairly substantial: tapes in transit can be lost or stolen. Similarly, disks can be replaced or stolen from enterprise storage subsystems and as such are subject to the same security concerns as tape volumes. As drive encryption is typically performed by special-purpose hardware, it can operate with almost no overhead and thus little impact on storage performance.
Disk subsystem-based encryption
Although there are only a few current implementations of this capability, data encryption/decryption could easily be done entirely at the subsystem level, with key management available externally or internally to the subsystem. Most likely this would be a software cryptographic solution, but hardware could also be supplied to encrypt/decrypt data. With a software implementation, the impact on storage performance (especially on read-back) might be considerable.
A couple of years ago, EMC, HDS and others added “secure data erasure” for disks or subsystems going out of service. However, this does nothing for operating data-at-rest security.
Network-based encryption
Both Cisco and Brocade offer data security services in the SAN or storage network facilities. Such capabilities encrypt and decrypt data going to or from LUNs and/or tape drives. Key management can be supplied externally as well as internally to the networking equipment. Both Cisco and Brocade SAN encryption services are hardware encryption solutions and as such operate at line speed with high throughput.
Appliance-based encryption
In the past, a number of companies offered appliance or standalone hardware-based encryption, which places the data security appliance in the data path somewhere between the host and its storage devices. Such solutions have been falling behind or have recently been replaced by network-based encryption solutions, but they still have a significant installed base. Key management can be supplied internal to the appliance or externally. All appliance-based encryption solutions have dedicated hardware for encryption/decryption of data.
HBA-based encryption
Last month EMC announced a new capability for their CLARiiON storage which operates in conjunction with Emulex HBAs to offer hardware HBA-based encryption for data. This solution is interesting in that it’s almost a host-based hardware solution and should have little to no impact on storage performance. Key management is supplied external to the HBA.
Host-based encryption
Host encryption has been available in the consumer and enterprise space for a number of years, and such services have seen much success with laptop data. Host-based services are available from operating system vendors or special-purpose applications. In the consumer space, products such as PGP (recently purchased by Symantec) have been available for over a decade; similar capabilities exist in the enterprise space via special-purpose “secure” file systems and other applications. Most host-based cryptographic systems use software-based algorithms, although hardware host-based services are available in the mainframe System z environment via cryptographic co-processors, and in the latest Intel processors via their instruction set extensions for AES encryption support.
Other data-at-rest security considerations
From a performance perspective, hardware encryption has the least impact, but it’s very expensive. In addition, drive level encryption is probably the most scalable: the more drives you have, the more encryption throughput can be supported. Next come the appliance or network-based encryption solutions, which can be scaled by purchasing more appliances or encryption blades/switches.
In contrast, software-based services perform the worst but are easiest to deploy. Most consumer O/Ss support data encryption with a simple configuration change. Software solutions are the least expensive as well, because there is no hardware to purchase. Software-based solutions can also be scaled, but only by adding more servers/subsystems.
In any event, key management cannot be overlooked for any data-at-rest security solution. Given the strength of modern day encryption algorithms, the loss of a data key is equivalent to the loss of all data encrypted with that key. So when considering key management, one should look for support of key archives, redundant key managers, key hierarchies and other advanced characteristics that make key access continuously available and disaster proof.
Data security is certainly feasible with any of these solutions. But performance, availability and ease of management must be understood before seriously committing to any data-at-rest security regimen.
A recent article from MIT’s Technology Review discussed cloud security (“Security in the Ether”). Most of the article was on how many cloud servers are vulnerable to a particular hack that can uncover private data in server memory/cache. But a good portion of the article was on how to secure data in the cloud and the article discussed a couple of new ideas (to me at least):
Securing cloud data access by using a key hierarchy – in this way a particular file/table/row could have a hierarchy of keys and thus, could have one master key for the whole datum and subset keys which would provide access to segments of the datum. As such, the patient could hold the master key to their electronic health records while their physicians held subset keys that would allow them to access diagnostic results and other information needed to treat the patient.
Securing cloud data search by encrypting metadata – in this way a search key could be encrypted and the search then executed in the cloud against the encrypted metadata. For this to work, metadata and search keys would need to be encrypted deterministically, so that they always encrypt to the same cipher text, which could be done with an MD5 hash. Not sure how this might help sorting, but it’s certainly a step in the right direction, as searches could be performed completely securely while using cloud resources. Search results could then be easily delivered back to the end user for decryption and use.
Securing cloud data manipulation by using “ideal lattice” calculations on encrypted data – in this way mathematical manipulations of encrypted data are possible and can be extracted from the cloud for decryption and use. As such, data queries using arithmetic functions such as summing a column of cloud data, can be completely secured and the resultant summation delivered outside the cloud. How this works is beyond me and the mathematics are said to be a bit cumbersome but, it’s still early and may someday become a viable approach.
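The second idea, deterministic encryption of search metadata, can be sketched with a keyed hash. The article mentions MD5; the sketch below substitutes HMAC-SHA256, a keyed variant of the same deterministic-tag idea, and the secret, keywords and record names are all invented for illustration:

```python
import hashlib
import hmac

SECRET = b"key known only to the data owner"  # hypothetical; never stored in the cloud

def tag(keyword):
    # Deterministic: the same keyword always yields the same opaque tag,
    # so the cloud can match tags without ever seeing the plaintext keyword.
    return hmac.new(SECRET, keyword.encode(), hashlib.sha256).hexdigest()

# Index built client-side, then shipped to the cloud as opaque tag -> record pairs.
cloud_index = {tag("cholesterol"): "record-17", tag("x-ray"): "record-42"}

# A search is just another tag computation; the cloud sees only cipher text.
print(cloud_index.get(tag("x-ray")))   # record-42
print(cloud_index.get(tag("biopsy")))  # None: keyword not indexed
```

The trade-off is exactly the one the article implies: determinism is what makes matching possible, but it also leaks which (encrypted) keywords repeat.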
It seems to me most of this goes way beyond the data archive I would envision for the cloud. With such encryption techniques one could conceivably host one’s data center applications in the cloud and/or use the cloud to serve as data storage for all applications. While this may be the ultimate goal for the cloud it still seems a way off.
So what mathematical functions can be accomplished using an “ideal lattice”?
Depositing data into the cloud seems a little like a Chinese laundry to me – you deposit data in the cloud and receive a ticket or token used to retrieve the data. Today’s cloud data security depends entirely on this token.
Threats to the token
If one only looks at external security threats, two issues with token use seem apparent:
Brute force cracking of any token is possible. I can envision a set of cloud storage users using their current storage tokens as seeds to identify other, alternate tokens. Such an attack could easily generate token synonyms, which may or may not be valid. Detecting a brute force attack from a single source could be easily accomplished, but distributing the attack across thousands of compromised PCs would make it much harder to detect.
Tokens could be intercepted in the clear. Cloud data often may need to be accessed in locations outside the data center of origin. This would require sending tokens to others. These data tokens could inadvertently be sent in the clear and as such, intercepted.
Probably other external exposures beyond these two exist as well but these will suffice.
Securing cloud data-at-rest
Given the potential external and internal threats to data tokens, securing such data can eliminate any data loss from token exposure. I see at least three approaches to securing data in the cloud.
The data dispersal approach – Cleversafe’s product splits a data stream into byte segments and disperses these segments across storage locations. Their approach is built around Reed-Solomon logic and results in no one location having any recognizable portion of the data. It requires multiple sites, or multiple systems at one site, but essentially secures data by segmenting it. The advantage of this approach is that it’s fast and automatic; the disadvantage is that it’s only supported by Cleversafe.
The software data encryption approach – there are plenty of software packages out there, such as GnuPG (GNU Privacy Guard) or PGP, which can be used to encrypt your data prior to sending it to the cloud. It’s a sort of brute force, software-only approach, but its advantage is that it can be used with any cloud storage provider. Its disadvantages are that it’s slow, processor intensive, and key management is sporadic.
The hardware data encryption approach – there are also plenty of data encryption appliances and/or hardware options out there which can be used to encrypt data. Some of these are available at the FC switch level, some are standalone appliances, and some exist at the storage subsystem level. The problem with most of these is that they only apply to FC storage and are not readily usable with cloud storage (unless the provider uses FC storage as its backing store). The advantages are that it’s fast and key management is generally built into the product.
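The dispersal idea behind the first approach can be illustrated with a toy XOR secret-sharing scheme. To be clear, this is not Cleversafe's Reed-Solomon algorithm (which can also tolerate lost segments, unlike this sketch); it just shows how dispersed segments can individually look like random noise:

```python
import secrets
from functools import reduce

def disperse(data, n=3):
    # Toy XOR secret sharing: n-1 random shares plus one XOR "parity" share.
    # Any n-1 shares are statistically indistinguishable from random bytes.
    shares = [secrets.token_bytes(len(data)) for _ in range(n - 1)]
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(data, *shares))
    return shares + [parity]

def reassemble(shares):
    # XOR of all shares, byte by byte, recovers the original data.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*shares))

shares = disperse(b"customer record")  # store each share at a different site
print(reassemble(shares))              # b'customer record'
```

Unlike reverse engineering a fixed dispersal algorithm, recovering data here requires actually obtaining every share, which is the property that makes dispersal attractive as a security layer.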
One disadvantage to any of the encryption approaches is that now one needs the encryption keys and the token to access the data. Yet one more thing to protect.
Nothing says that hardware data encryption couldn’t also work for data flowing to the cloud, but it would have to support IP plus the cloud-specific REST interface. Such support would depend on cloud storage provider market share, but perhaps some cloud vendor could fund a security appliance vendor to support their interface directly, providing a cloud data security option.
The software approach suffers from performance problems but supports anybody’s cloud storage. It might be useful if cloud storage providers started offering hooks into GnuPG or PGP to directly encrypt cloud data. Most REST interfaces require some programming to use anyway, and it’s not too much of a stretch to program encryption into that.
I like the data dispersal approach, but most argue that security is not guaranteed, since reverse engineering the dispersal algorithm would allow one to reconstruct the data stream. The more serious problem is that it only applies to Cleversafe storage; perhaps the dispersal algorithm should be open sourced (which it already is) and/or standardized.
There are possibly other approaches which I have missed here but these can easily be used to secure cloud data-at-rest. Possibly adding more security around the data token could also help alleviate this concern. Thoughts?
Post disclosure: I am not currently working with Cleversafe, any data security appliance provider, or cloud storage provider.