Bringing compute to storage

Researchers at MIT (see Storage system for ‘big data’ dramatically speeds access to information) have come up with a novel storage cluster using FPGAs and flash chips to create a new form of database machine.

In their system, an FPGA provides limited computational offload/acceleration along with flash controller functionality for a set of flash chips. They call their system the BlueDBM or Blue Database Machine.

Their storage device is used as a PCIe flash card in a host PC. But in their implementation, the PCIe flash cards are interconnected via an FPGA serial link. This creates a distributed controller across all the PCIe flash cards in the host servers and allows any host PC to access any flash card's data at high speed.

They claim that node-to-node access latencies are on the order of 60-80 microseconds and that their distributed controller can sustain 70% of theoretical system bandwidth. Performance testing of their prototype 4-node system shows that it's an order of magnitude faster than Microsoft Research's CORFU (Cluster of Raw Flash Units).

Why FPGAs?

There are two novel aspects to their system: 1) the computational offload capabilities provided by the FPGA in front of the flash, and 2) their implementation of a distributed controller across the storage nodes using the FPGA serial network.

Both of these characteristics depend on the FPGA. Using FPGAs also keeps system cost down, and the FPGAs had a readily available, internally supported serial link that could be used.

But by using an FPGA, the computational capabilities are more limited, and re-configuring (re-programming) the storage cluster's compute capabilities takes more time. If they had used a more general purpose CPU in front of the flash chips, they could support a much richer computational offload next to the storage chips. For example, in their prototype the FPGAs supported 'word-counting' offload functionality.

Nonetheless, as most flash storage these days already has a fairly sophisticated controller, it's not much of a stretch to bump this compute power up to something a bit more programmable and make its functionality more available via APIs. I suppose to gain equivalent performance this would need to use PCIe flash cards.
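
To make this concrete, here's a purely hypothetical sketch (in Python) of what a more programmable near-storage offload API might look like. None of these names come from the BlueDBM prototype; they're only meant to illustrate pushing a 'word-counting' style kernel down to a controller sitting in front of the flash so that only the result crosses the PCIe bus:

    # Hypothetical sketch of a programmable near-storage offload API.
    # These names are invented for illustration, not the BlueDBM interface.

    class NearStorageDevice:
        """Models a flash card whose controller can run simple kernels."""

        def __init__(self, blocks):
            self.blocks = blocks  # raw data blocks resident on flash

        def offload(self, kernel, *args):
            # A real device would ship the kernel (or select a pre-loaded
            # FPGA bitstream) and return only the result, instead of moving
            # raw blocks over PCIe to the host.
            result = 0
            for block in self.blocks:
                result += kernel(block, *args)
            return result

    def word_count_kernel(block, word):
        """Counts occurrences of a word in one block of text data."""
        return block.count(word)

    # Host-side usage: only the final count crosses the PCIe bus.
    device = NearStorageDevice([b"error warning error", b"ok error"])
    print(device.offload(word_count_kernel, b"error"))  # -> 3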

Where they would get the internal card-to-card serial link with general purpose CPUs is a concern, which brings up another question.

The distributed controller gives them what exactly?

I believe that with a serial link based distributed controller they don’t need a full networking stack to access the PCIe flash storage on other nodes. This should save both access time and compute power.

In follow on work, the MIT researchers plan to implement a Linux based, distributed file system across the BlueDBM. This should give them a more normal storage stack for their system. How this may interact with the computational offload capabilities is another question.

I would have to say the reduction in access latency is what they were after with the distributed controller and they seem to have achieved it, as noted above. I suppose something similar could be done with multiple PCIe cards in the same host but with the potential to grow from 4 to 20 nodes, the BlueDBM starts to look more interesting.

What sort of application could use such a device?

They talked about performing near real-time analysis of scientific data or modeling all the particles in a simulation of the universe. But just about any application that requires extremely low access times with only limited data services could potentially take advantage of their storage system. High Frequency Trading comes to mind.

As for big data applications, I haven’t heard of any big data deployments that use SSDs for basic storage let alone PCIe flash cards. I don’t believe there’s going to be a lot of big data analytics that has need for this fast a storage system.

~~~~

Utilizing excess compute power in a storage controller has been an ongoing dream for a long time. Aside from running VMs and a couple of other specialized services such as A-V scanning, not much of this type of functionality has ever been released for use inside a storage controller. With software defined storage coming online, it may not even make that much sense anymore.

MIT research’s BlueDBM solution is somewhat novel but unless they can more easily generalize the computational offload it doesn’t seem as if it will become a very popular way to go for analytics applications.

As for their reduction in access latencies, that might have some legs if they can put more storage capacity behind it and continue to support similar access latencies. But they will need to provide a more standard access method to it. The distributed Linux file system might be just the ticket to get this into the market.

Comments?

Photo Credits: Lightening by Jolene

DS3, the BlackPearl and the way forward for … tape

Spectra Logic Summit 2013: Nathan Thompson, CEO, talking about Spectra Logic's history

Just got back from an analyst summit with Spectra Logic. They announced a new interface to tape called Deep Simple Storage Service (DS3) and an appliance that implements this interface named the BlackPearl. The intent is to broaden the use of tape to include today's web services application environments.

The main problems addressed by the new interface are: how do you map an essentially sequential, high throughput, long latency-to-first-byte, removable media device onto an essentially small-file, get-and-put environment? And is there a market for such a service? I think Spectra Logic has answered the first question and is about to embark on a journey to answer the second.

The new interface – it’s all about simplifying tape

The DS3 interface answers the first set of questions. With DS3, Spectra Logic has extended Amazon's S3 interface to expose some of the sequentiality and removability of tape to the object storage world.

As you should recall, Amazon S3 is a RESTful, web interface that uses HTTP type GET and PUT commands to move data to and from the S3 storage service.  The data you are moving is considered an object and the object name or identifier is unique across the storage service. When you “PUT” an object you get to add key-value pairs of information called meta-data to the object. When you “GET” an object you retrieve the data from the storage service. The other thing one needs to be aware of is that you get and put objects into “BUCKET”s.
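
For those who haven't used it, a minimal sketch of this PUT/GET-with-metadata model, using Amazon's Python SDK (boto3, which post-dates this post), might look something like the following; the bucket and key names are made up:

    # Minimal S3 object PUT/GET sketch using boto3; bucket/key are illustrative.
    import boto3

    s3 = boto3.client("s3")

    # PUT an object, attaching key-value metadata alongside the data.
    s3.put_object(
        Bucket="example-archive-bucket",
        Key="reports/2013/q3-summary.txt",
        Body=b"quarterly summary text ...",
        Metadata={"department": "finance", "retention": "7y"},
    )

    # GET the object back; the response carries both the data and the metadata.
    resp = s3.get_object(Bucket="example-archive-bucket",
                         Key="reports/2013/q3-summary.txt")
    print(resp["Metadata"])          # {'department': 'finance', 'retention': '7y'}
    print(resp["Body"].read()[:20])  # first bytes of the object data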

With DS3, Spectra Logic has added essentially four new commands to the S3 protocol, which are:

  • Bulk Put – this provides a list of objects that one wants to “PUT” into a DS3 storage service; the response from the DS3 storage service is an ordered list of which objects to PUT in sequence and to which DS3 storage server node (essentially an IP address) to send the data.
  • Bulk Get – this supplies a list of objects that one wants to GET from a DS3 storage service; the response is an ordered list of the sequence in which to get those objects and the node address to use for each object get.
  • Export Bucket – this identifies a BUCKET that you wish to remove from a DS3 storage service. Presumably the response would be where the bucket can be found, the number of pieces of media to expect, and some identification of the media serial numbers that constitute the bucket on the DS3 storage service.
  • Import Bucket – this identifies a new bucket which will be imported into a DS3 storage service and supplies some necessary information such as how many pieces of media to expect and the serial numbers of the media. Presumably the response will be a location which can be used to import the media.
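
To give a feel for the flow (and only that), here's a rough, purely illustrative Python sketch of a bulk PUT sequence. The real DS3 wire format, URLs and SDK calls are defined on Spectra Logic's developer portal; the endpoint, paths and JSON shape below are invented just to show the two-step "get an ordered plan, then send the data in that order" idea:

    # Illustrative-only sketch of a DS3-style bulk PUT flow; all URLs and
    # payload shapes here are made up, not the actual DS3 protocol.
    import requests

    DS3_ENDPOINT = "https://ds3.example.com"   # hypothetical DS3 server
    BUCKET = "video-archive"

    # Step 1: tell the DS3 service which objects we intend to PUT.
    objects = [{"name": "ep01.mov"}, {"name": "ep02.mov"}]
    plan = requests.put(f"{DS3_ENDPOINT}/bulk/{BUCKET}?operation=start_bulk_put",
                        json={"objects": objects}).json()

    # Step 2: the response is an ordered plan saying which object to send
    # next and which storage node (IP address) should receive it.
    for step in plan["ordered_objects"]:
        with open(step["name"], "rb") as f:
            requests.put(f"http://{step['node']}/{BUCKET}/{step['name']}", data=f)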

With these four simple commands and an appropriate DS3 client, DS3 server and DS3 storage backend, one has everything needed to support a removable media object store. I could see real value for export/import like this on the “rare occasion” when a cloud service provider goes out of business.

The DS3 interface will be publicly available and the intent is to supply both Spectra Logic developed clients as well as ISV/partner developed DS3 clients, so as to provide removable media object stores for all sorts of other applications.

Spectra’s is providing developer tools and documentation so that anyone can write a DS3 client. To that end, the DS3 developer portal is up (couldn’t find a link this AM but will update this post when I find it) and available free of charge to anyone today (believe you need to register to gain access to the doc.). They have a DS3 server simulator that DS3 client developers can use to test out and validate their client software. They also have a try & buy service for client developers.

Essentially, the combination of DS3 clients, DS3 servers and DS3 backend storage create a really deep archive for object data. It’s not intended for primary or secondary storage access but it’s big, cheap, and power/space efficient storage that can be very effective if used for archive data.

BlackPearl, the first DS3 Server

Their second announcement is the first implementation of a DS3 server, which Spectra Logic calls BlackPearl™. The BlackPearl connects to one or more Spectra Logic tape libraries as a backend store, which together essentially provide a DS3 object storage archive. The DS3 server talks to DS3 clients on the front end. BlackPearl uses SAS or FC connected tape transports, which can be any transport currently supported by Spectra Logic tape libraries, including IBM TS1140, LTO-4, -5 and -6.

In addition to BlackPearl, Spectra Logic is releasing the first DS3 client for Hadoop. In this case, the DS3 client implements a new version of the Hadoop DistCp (distributed copy) command which can be used to create a copy of an HDFS directory tree onto a DS3 storage service.

Current BlackPearl hardware is a standard 2U server with four 400GB SSDs inside, which act as a sort of speed-matching buffer between the object interface and the SAS/FC tape interface.

We only saw a configuration with one BlackPearl in operation (GA of BlackPearl is expected this December). But the plan is to support multiple BlackPearl appliances to talk with the same DS3 backend storage. In that case, there will be a shared database and (tape) resource scheduler across all the appliances in the cluster.

Yes, but what about the market?

It’s a gutsy move for someone like Spectra Logic to define a new open interface to deep storage. The fact that the appliance exists outside the tape library itself and could potentially support any removable media offers interesting architectural capabilities. The current (beta) implementation lacked some sophistication but the expectation is that much of this will be resolved by GA or over time through incremental enhancements.

Pricing is appealing. A configuration of BlackPearl appliance(s) with a Spectra Logic T950 tape library using LTO drives, supporting an uncompressed data store of ~2.4PB of archive data, has a purchase price of ~$0.10/GB. This compares especially well with current Amazon Glacier pricing of $0.01/GB/month: for the price of about 10 months of Glacier storage you could own your own DS3 storage service.
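
A quick back-of-the-envelope check of that comparison, using the figures above (decimal GB and list prices; real quotes will obviously vary):

    # Back-of-envelope check of the DS3 vs. Glacier comparison above.
    capacity_gb = 2.4e6                 # ~2.4PB of uncompressed LTO capacity
    ds3_purchase = capacity_gb * 0.10   # ~$0.10/GB purchase price
    glacier_month = capacity_gb * 0.01  # $0.01/GB/month

    print(f"DS3/T950 purchase: ${ds3_purchase:,.0f}")               # ~$240,000
    print(f"Glacier per month: ${glacier_month:,.0f}")              # ~$24,000
    print(f"Breakeven: ~{ds3_purchase / glacier_month:.0f} months") # ~10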

Larger capacities are even cheaper: a BlackPearl with a T950 using TS1140 tape drives, supporting 6.4PB, comes in at $0.09/GB. Other configurations are available; in general, bigger configurations are cheaper on a $/GB basis and smaller ones more expensive. The configurations are spec'ed by Spectra Logic to include all the media, tape drives and BlackPearl systems needed to support an archive object store.

As for markets, Spectra Logic already has beta interest from a large well known web services customer and a number of media & entertainment customers.

In the long run, Spectra Logic believes that if they can simplify access to tape for the workload it's well qualified to support (deep archive), this will enable new applications to take advantage of tape that weren't even dreamed of before. By opening up an object store interface to tape, anyone currently using S3 is a potential customer.

Amazon announced earlier this year that they have over 2 trillion objects in their S3. And as far as I can tell (see my post Who's the next winner in storage?) they are growing with no end in sight.

~~~~

Comments?

 

DR preparedness in real time

As many may have seen there has been serious flooding throughout the front range of Colorado.  At the moment the flooding hasn’t impacted our homes or offices but there’s nothing like a recent, nearby disaster to focus one’s thoughts on how prepared we are to handle a similar situation.

 

What we did when serious flooding became a possibility

As I thought about what I should be doing last night, with flooding in nearby counties, I moved my computers, printer, and some other stuff from the basement office to an upstairs area in case of basement flooding. I also moved my “Time Machine” backup disk upstairs, which holds the iMac's backups (hourly for the last 24 hours, daily for the past month, and weekly backups [for as many weeks as can be held on a 2TB disk drive]). I have often depended on Time Machine backups to recover files I inadvertently overwrote, so it's good to have around.

I also charged up all our mobiles, laptops & iPads and made sure software and email were as up-to-date as possible. I packed my laptop & iPad, my most recent monthly and weekly backups and some other recent work printouts into my backpack and left it upstairs, ready to go at a moment's notice.

The next day post-mortem

This morning, with less panic and more time to think, I realized the printer was probably the least of my concerns, but the internet and telecommunications gear (phones & headset) should probably have been moved upstairs as well.

Although we have multiple mobile phones, (AT&T) reception is poor in the office and home. It would have been pretty difficult to conduct business here with the mobile alone if we needed to.  I use a cable provider for business phones but also have a land line for our home. So I (technically) have triple backup for telecom, although to use the mobile effectively, we would have had to leave the office.

Internet access

Internet is another matter though. We also use cable for internet and the modem that supplies office internet connects to a cable close to where it enters the house/office. All this is downstairs, in the basement. The modem is powered using basement plugs (although it does have a battery as well) and there’s a hard ethernet link between the cable modem and an Airport Express base station (also downstairs) which provides WiFi to the house and LAN for the house iMacs/PCs.

Considering what I could do to make this a bit more flood tolerant, I should probably have moved the cable modem and Airport Express upstairs, connecting them to the TV cable and powering them from upstairs outlets. Airport Express WiFi would have provided sufficient Internet access to work, but with the modem upstairs, connecting an ethernet cable to a desktop would also have been a possibility.

I do have the hotspot/tethering option for my mobile phone but, as discussed above, reception is not that great. As such, it may not have sufficed for the household, let alone a work computer.

Internet is available at our local library and at many nearby coffee shops.  So, worst case was to take my laptop and head to a coffee shop/library that still had power/WiFi and camp out all day, for potentially multiple days.

I could probably do better with Internet access. With the WiFi and tethering capabilities available on cellular iPads these days, if I purchased one for the office with a suitable data plan, I could use the iPad as another hotspot, independent of my mobile. Of course, I would probably go with a different carrier so that reception issues could be minimized (hoping that where one carrier [AT&T] is poor the other [Verizon?] would be fine).

Data availability

Data access outside of the Time Machine disk and the various hard drive backups was another item I considered this morning. I have monthly hard-drive backups, normally kept in a safety deposit box at a local bank.

The bank is in the same flood/fire plain that I am in, but they tell me it's floodproof, fireproof and earthquake proof. Call me paranoid, but I didn't see any fire suppression equipment visible in the vault. The vault door, although a large quantity of steel and other metals, didn't seem to have waterproof seals surrounding it. As for earthquakes, concrete walls and a steel door don't make it earthquake proof. But then again, I am paranoid; it would probably survive much better than anything in our home/office.

Also, I keep weekly encrypted backups in the house, alternating between two hard disk drives and keep the most recent upstairs. So between the weeklies, monthlies, and Time Machine I have three distinct tiers of data backups. Of course, the latest monthly was sitting in the house waiting to be moved to the safety deposit box – not good news.

I also have a (manual) copy of work data on the laptop, current as of the last hard drive backup (also at home). So every single current copy of my three tiers of backups was in the home/office.

I could do better. Looking at Dropbox and Box at about $100/year for 100GB (Box is ~40% cheaper), I could keep our important work and home data on cloud storage and have access to it from any Internet-accessible location (including mobile devices) with proper security credentials. I'm not sure how long it would take to seed this backup: we have about 20GB of family and work office documents and probably another 120GB or so of photos that I'd want to keep around, or about 140GB of info in total. This could provide 5-way redundancy, with Time Machine, the weekly hard drive and monthly hard drive backups, Box/Dropbox as a fourth (office and home) backup, and the laptop as a fifth (office only) backup. Seems like cheap insurance at the moment.

The other thing that Box/DropBox would do for me is provide a synch service with my laptop, so that files changed on either device would synch to the cloud and then be copied to all other devices. This would replace my current 4th tier of (work) backups with a more current, cloud backup. It would also eliminate the manual copy process performed during every backup to keep my laptop up to date.

I have some data security concerns with using cloud storage, but according to Dropbox they use Amazon S3 for their storage and AES-256 data encryption so that others can’t read your data. They use SSL to transfer data to the cloud.

Where all the keys are held is another matter and with all the hullabaloo with NSA, anything on the internet can be provided to the gov’t with a proper request. But the same could be said for my home computer and all my backups.

There are plenty of other solutions here, Google drive and Microsoft’s SkyDrive to name just a few. But from what I have heard Dropbox is best, especially if you have a large number of files.

The major downsides (besides the cost) are that powering up your system can take longer while Dropbox scans for out-of-synch files, and the time it takes to seed your Dropbox account. This is all dependent on your internet access, but according to a trusted source Dropbox seeding starts with the smallest files and works up to the larger ones over time. So there is a good likelihood your office files (outside of PPT) might make it to the cloud sooner than your larger media, databases, and other large files. I figure we have about ~140GB to be copied to the cloud. I promise to update the post with the time it took to copy this data to the cloud.
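
For what it's worth, a rough seeding-time estimate for our ~140GB looks like the following; the uplink speeds are assumptions, not measurements of my actual connection:

    # Rough cloud-seeding time estimate for ~140GB at assumed upload speeds.
    data_gb = 140
    for uplink_mbps in (5, 10, 25):     # assumed uplink speeds, not measured
        seconds = data_gb * 8e9 / (uplink_mbps * 1e6)
        print(f"{uplink_mbps:>3} Mbps up -> ~{seconds / 3600:.0f} hours "
              f"(~{seconds / 86400:.1f} days)")
    # e.g. 5 Mbps -> ~62 hours (~2.6 days); 25 Mbps -> ~12 hours (~0.5 days)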

Power and other emergency preparedness

Power is yet another concern. I have not taken the leap to purchase a generator for the home/office, but I now think this unwise. Although power has gotten a lot more reliable in our home/office over the years, there's still a real possibility of a disruption. The areas with serious flooding all around us are having power blackouts this morning and there's no telling when their power might come back on. So a generator purchase is definitely in my future.

Listening to the news today, there was talk of emergency personnel notifying people that they had 30 minutes to evacuate their houses.  So, next time there is a flood/fire warning in the area I think I will take some time to pack up more than my laptop. Perhaps some other stuff like clothing and medicines that will help us survive and continue to work.

Food and water are also serious considerations. In Florida, for hurricane preparedness, they suggest filling up your bathtubs with water or having 1 gallon of water per person per day set aside in case of emergency – didn't do this last night but should have. Florida's family emergency preparedness plan also suggests enough water for 5-7 days.

I think we have enough dry food around the house to sustain us for a number of days (maybe not 7 though). If we consider what's in the freezer and fridge, that probably goes up to a couple of weeks or so, assuming we can keep it cold.

Cooking food is another concern. We have propane and camp stoves, which would provide a rudimentary ability to cook outdoors if necessary, as well as an old charcoal grill and a bag of charcoal in our car-camping stuff. That should suffice for a couple of days but probably not a week.

As for important documents they are in that safety deposit box in our flood plain. (May need to rethink that). Wills and other stuff are also in the hands of appropriate family members and lawyers so that’s taken care of.

Another item on their list of things to have for a hurricane is flashlights and fresh batteries. These are all available in our camping stuff but would be difficult to access at a moment's notice. So a couple of rechargeable flashlights that were easier to access might be a reasonable investment. The Florida plan further suggests you have a battery operated radio. I happen to have an old one upstairs with the batteries removed – I just need to make sure to have some fresh batteries around someplace.

They don’t mention gassing up your car. But we do that as a matter of course anytime harsh weather is forecast.

I think this is about it for now. Probably other stuff I didn’t think of. I have a few fresh fire extinguishers around the home/office but have no pumps. May need to add that to the list…

~~~~

Comments?

Photo Credits: September 12 [2013], around 4:30pm [Water in Smiley Creek – Boulder Flood]

 

 

Who’s the next winner in data storage?

Strange Clouds by michaelroper (cc) (from Flickr)

“The future is already here – just not evenly distributed”, W. Gibson

It starts as it always does, outside the enterprise data center: in the lines of business, in the development teams, in the small business organizations that don't know any better but still have an unquenchable need for data storage.

It’s essentially an Innovator’s Dilemma situation. The upstarts are coming into the market at the lower end, lower margin side of the business that the major vendors don’t seem to care about, don’t service very well and are ignoring to their peril.

Yes, it doesn’t offer all the data services that the big guns (EMC, Dell, HDS, IBM, and NetApp) have. It doesn’t offer the data availability and reliability that enterprise data centers have come to demand from their storage. And it doesn’t have the performance of major enterprise data storage systems.

But what it does offer is lower CapEx, unlimited scalability, and data storage that is much easier to manage and adopt, albeit using a new protocol. It does have some inherent, hard-to-get-around problems, not the least of which are speed of data ingest/egress, highly variable latency and eventual consistency. There are other problems which are more easily solvable with work, but the three listed above are intrinsic to the solution and need to be dealt with systematically.

And the winner is …

It has to be cloud storage providers, and the big elephant in the room has to be Amazon. I know there’s a lot of hype surrounding AWS S3 and EC2 but you must admit that they are growing, doubling year over year. Yes, they are starting from a much lower capacity point and yes, they are essentially providing “rentable” data storage space with limited or even non-existent storage services. But they are opening up whole new ways to consume storage that never existed before. And therein lies their advantage and threat to the major storage players today, unless they act to counter this upstart.

On AWS’s EC2 website there must be 4 dozen different applications that can be fired up in a matter of a click or two. When I checked out S3, you only need to sign up and identify a bucket name to start depositing data (files, objects). After that, you are charged for the storage used, data transfer out (data in is free), and the number of HTTP GETs, PUTs, and other requests that are done on a per month basis. The first 5GB is free and comes with a judicious amount of gets, puts, and outbound data transfer bandwidth.

… but how can they attack the enterprise?

Aside from the three systemic weaknesses identified above, for enterprise customers they seem to lack enterprise security, advanced data services and high availability storage. Yes, NetApp’s Amazon Direct addresses some of the issues by placing enterprise owned, secured and highly available storage to be accessed by EC2 applications. But to really take over and make a dent in enterprise storage sales, Amazon needs something with enterprise class data services, availability and security with an on premises storage gateway that uses and consumes cloud storage, i.e., a cloud storage gateway. That way they can meet or exceed enterprise latency and services requirements at something that approximates S3 storage costs.

We have talked about cloud storage gateways before but none offer this level of storage service. An enterprise class S3 gateway would need to support all storage protocols, especially block (FC, FCoE, & iSCSI) and file (NFS & CIFS/SMB). It would need enterprise data services, such as read-writeable snapshots, thin provisioning, data deduplication/compression, and data mirroring/replication (synch and asynch). It would need to support standard management configuration capabilities, like VMware vCenter, Microsoft System Center, and SMI-S. It would need to mask the inherent variable latency of cloud storage through memory, SSD and hard disk data caching/tiering. It would need to conceal the eventual consistency nature of cloud storage (see link above). And it would need to provide iron-clad data security for cloud storage.
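
To illustrate just one of those requirements, here's a toy sketch of how a gateway might mask cloud latency with a local write-back cache: writes are acknowledged from local (SSD-backed) cache and destaged to object storage in the background. A real gateway would need crash safety, consistency handling, dedupe, snapshots and much more; this only shows the core idea:

    # Toy write-back gateway cache: acknowledge writes locally, destage to
    # the cloud asynchronously. Not production code, just the core idea.
    import queue, threading

    class WriteBackGateway:
        def __init__(self, cloud_put):
            self.cache = {}              # block address -> data (SSD in practice)
            self.dirty = queue.Queue()   # blocks awaiting destage to the cloud
            self.cloud_put = cloud_put   # e.g. a wrapper around S3 put_object
            threading.Thread(target=self._destager, daemon=True).start()

        def write(self, lba, data):
            self.cache[lba] = data       # local hit: low, predictable latency
            self.dirty.put(lba)          # destage later, off the critical path
            return "ack"                 # acknowledged before any cloud round trip

        def read(self, lba):
            return self.cache.get(lba)   # a miss would fetch from the cloud

        def _destager(self):
            while True:
                lba = self.dirty.get()
                self.cloud_put(lba, self.cache[lba])   # slow, variable-latency path

    # Usage: gw = WriteBackGateway(lambda lba, d: None); gw.write(7, b"data")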

It would also need to be enterprise hardened, highly available and highly reliable. That means dually redundant, highly serviceable hardware FRUs, concurrent code load, and multiple controllers with multiple, independent, high speed links to the internet. Today's highly-available data storage requires multi-path storage networks, multiple independent power sources and resilient cooling, so adding multiple independent, high-speed internet links to use Amazon S3 in the enterprise is not out of the question. In addition to the highly available and serviceable storage gateway capabilities described above, it would need to supply high data integrity and reliability.

Who could build such a gateway?

I would say any of the major and some of the minor data storage players could easily do an S3 gateway if they desired. There are a couple of gateway startups (see link above) that have made a stab at it but none have it quite down pat or to the extent needed by the enterprise.

However, the problem with standalone gateways from other, non-Amazon vendors is that they could easily support other cloud storage platforms and most do. This is great for gateway suppliers but bad for Amazon’s market share.

So, I believe Amazon has to invest in its own storage gateway if they want to go after the enterprise. Of course, when they create an enterprise cloud storage gateway they will piss off all the other gateway providers and will signal their intention to target the enterprise storage market.

So who is the next winner in data storage? I have to believe it's going to be, and already is, Amazon. Even if they don't go after the enterprise, which I feel is the major prize, they have already carved out an unbreachable market share in a new way to implement and use storage. But when (not if) they go after the enterprise, they will threaten every major storage player.

Yes but what about others?

Arguably, Microsoft Azure is in a better position than Amazon to go after the enterprise. Since their acquisition of StorSimple last year, they already have a gateway that, with help, could be just what they need to provide enterprise class storage services using Azure. And they already have access to the enterprise, and already have the services, distribution and go-to-market capabilities that address enterprise needs and requirements. Maybe they have it all, but they are not yet at the scale of Amazon. Could they go after this? Certainly, but will they?

Google is the other major unknown. They certainly have the capability to go after enterprise cloud storage if they want. They already have Google Cloud Storage, which is priced under Amazon’s S3 and provides similar services as far as I can tell. But they have even farther to go to get to the scale of Amazon. And they have less of the marketing, selling and service capabilities that are required to be an enterprise player. So I think they are the least likely of the big three cloud providers to be successful here.

There are many other players in cloud services that could make a play for enterprise cloud storage and emerge out of the pack, namely Rackspace, Savvis, Terremark and others. I suppose DropBox, Box and the other file sharing/collaboration providers might also be able to take a shot at it, if they wanted. But I am not sure any of them have enterprise storage on their radar just yet.

And I wouldn’t leave out the current major storage, networking and server players as they all could potentially go after enterprise cloud storage if they wanted to. And some are partly there already.

Comments?

 


Bringing Internet to rural Africa using TV

Read an article the other day from the BBC, TV white space connecting rural Africa, about how radio spectrum designed for TV is being used to bring Internet access to rural Africa.

The group promoting TV for Internet connectivity is the 4Afrika Initiative from Microsoft.  Their stated intent is to engage in the economic development of Africa to improve its global competitiveness.

Why TV?

Apparently, the TV spectrum has a number of attributes that make it very useful for providing Internet connectivity. In the article they talked about 400MHz being very resilient: it propagates well around natural obstructions, through walls, and over long distances.

Although these days, Africa has plenty of undersea cables connecting it to the rest of the world, getting fiber connectivity to rural Africa has been too costly to date.  So if the last mile (or in the case of rural Africa, 100km) problem can be solved then Internet access can be available to all communities.

But the main problem is that this spectrum is usually licensed to TV stations. On the other hand, Africa probably has plenty of TV spectrum not currently being used for active broadcasting, especially across rural Africa.  As such, using this “white space” in TV signals to provide Internet access is a great alternative use of the spectrum.

With a solar powered base station, libraries, schools, healthcare centers, government offices, etc. in rural Africa can now be connected to the Internet. Presently, many of these rural African locations have no electricity and no telephone lines whatsoever.

Providing internet access to such locations will enable e-learning, more informed access to agricultural markets as well as a plethora of advanced communications technologies currently absent from their villages.

Why Microsoft?

Microsoft has been actively engaged in Africa for over 20 years now. And more storage vendors have started listing Africa as a blossoming market for their gear, as the region upgrades its IT and telecommunications infrastructure. Microsoft has an interesting graphic on their involvement in Africa over the past two decades (see 4Africa Infographic).

We have discussed the emergence of mobile and cloud as leap-frog technologies propelling Africa, and especially Kenya, into the information economy (please see the Mobile health (mHealth) takes off in Kenya and Is cloud a leapfrog technology posts). But Internet access is even broader than just mobile or cloud and is certainly complementary to (and for cloud, a necessary infrastructure for) both these technologies.

Africa, welcome to the Information Economy…

Comments?

Photo Credits: DIY antenna (bottlenet) by robin.elaine

EU vs. US on data protection

Prison Planet by AZRainman (cc) (from Flickr)

Last year I was at SNW talking to a storage admin from a large, international company who mentioned how data protection policies in the EU were forcing them to limit where data gets copied and replicated. Some of their problem was due to different countries having dissimilar legislation regarding data privacy and protection.

However, their real concern was how to effectively and automatically sanitize this information. It seems they would like to analyze it offshore but still adhere to each EU country's data protection legislation.

Recently, there have been more discussions in the EU about data protection requirements (see the NY Times post Consumer Data Protection Laws, an Ocean Apart and the Ars Technica post Proposed EU data protection reform could start a “trade war”). It seems EU proposals are becoming even more at odds with the current US data protection environment.

Compartmentalized US data privacy

In the US, data protection seems much more compartmentalized and decentralized. We have data protection for health care information, video rentals, credit reports, etc. Each with their own provisions and protection regime.

This allows companies in different markets pretty much internal control over what they do with customer information but tightly regulates what happens with the data as it moves outside that environment.

Within such a data protection regime, an internet company can gather all the information they want about a person's interactions with their web services and thereby better target services and advertising for the user.

EU’s broader data protection regime

In contrast, EU countries have a much broader regime in place that covers any and all personal information.  The EU wants to ultimately control how much information can be gathered by a company about what a person does online and provide an expunge on demand capability directly to the individual.

The EU’s proposed new rules would standardize data privacy rules across the 27-country region and would also strengthen them in the process. Doing so would make it much harder to personalize services, and the presumption is that the internet companies trying to do so would not make as much revenue in the EU because of this.

Although US companies and government officials have been lobbying heavily to change the new proposals, it appears to be backfiring and causing a backlash. The EU considers the US position to be biased toward commerce and commercial interests, whereas the US considers the EU position to be biased more toward the individual.

US data privacy is evolving

On this side of the Atlantic, the privacy tide may be rising as well. The President has recently proposed a “Consumer Privacy Bill Of Rights” which would enshrine some of the same privacy rights present in the EU proposals. For instance, such a regime would include rights for individuals to see any and all information companies have on them, rights to correct such information and rights to limit how much information companies collect on individuals.

This all sounds a lot closer to what the EU currently has and where they seem to want to go.

However, how this plays out in Congress and what ultimately emerges as data protection and privacy legislation is another matter. But for the moment it seems that governments on both sides of the Atlantic are pushing for more data protection not less.

Comments?

 

Enterprise file synch

Strange Clouds by michaelroper (cc) (from Flickr)

Last fall at SNW in San Jose there were a few vendors touting enterprise file synchronization services, each having a slightly different version of the requirements. The one that comes most readily to mind is Egnyte, which supports file synchronization across a hybrid cloud (public cloud and network storage) and which we discussed in our Fall SNWUSA wrap-up post last year.

The problem with BYOD

With bring your own device (BYOD), corporate end users are quickly abandoning any pretense of IT control and turning to consumer class file synchronization services to help synch files across the desktop, laptop and mobile devices they haul around. But the problem with these solutions, such as DropBox, Box, OxygenCloud and others, is that they are really outside of IT's control.

Which is why there’s a real need today for enterprise class file synchronization solutions that exhibit the ease of use and setup available from consumer file synch systems but offer IT security, compliance and control over the data that's being moved into the cloud and across corporate and end user devices.

EMC Syncplicity and EMC on premises storage

Last week EMC announced an enterprise version of their recently acquired Syncplicity software that supports on-premises Isilon or Atmos storage, EMC’s own cloud storage offering.

In previous versions of Syncplicity storage was based in the cloud and used Amazon Web Services (AWS) for cloud orchestration and AWS S3 for cloud storage. With the latest release, EMC adds on premises storage to host user file synchronization services that can span mobile devices, laptops and end user desktops.

New Syncplicity users must download desktop client software to support file synchronization, or mobile apps for mobile device synchronization. After that, it's a simple matter of identifying which, if any, directories and/or files are to be synchronized with the cloud and/or shared with others.

However, with the Business (read enterprise) edition one also gets the Security and Compliance console, which supports access control to define users and devices that can synchronize or share data, enforce data retention policies, remote wipe corporate data, and native support for single sign-on services. In addition, one gets centralized user and group management services to grant, change, and revoke user and group access to data. One also obtains enterprise security with AES-256 data-at-rest encryption, separate key-manager and data-storage data centers, quadruple replication of data for high disaster fault tolerance, and SAS70 Type II compliant data centers.

If the client wants to use on premises storage, they would also need to deploy a VM virtual appliance somewhere in the data center to act as the gateway to file synchronization service requests. The file synch server would also presumably need access to the on premises storage and it’s unclear if the virtual appliance is in-band or out-of-band (see discussion on Egnyte’s solution options below).

Egnyte’s solution

Egnyte comes as a software-only solution building a file server in the cloud for end user storage. It also includes an Egnyte app for mobile hardware and the ever present web file browser. Desktop file access is provided via mapped drives which access the Egnyte cloud file server gateway running as a virtual appliance.

One major difference between Syncplicity and Egnyte is that Egnyte offers a combination of both cloud and on premises storage but you cannot have just on premises storage. Syncplicity only offers one or the other storage for file data, i.e., file synchronization data can only be in the cloud or on local on premises storage but cannot be in both locations.

The other major difference is that Egnyte operates with just about anybody’s NAS storage such as EMC, IBM, and HDS for the on premises file storage.  It operates as an in-band, software appliance solution that traps file activity going to your on premises storage. In this case, one would need to start using a new location or directory for data to be synchronized or shared.

But for NetApp storage only (today), they utilize ONTAP APIs to offer out-of-band file synchronization solutions.  This means that you can keep NetApp data where it resides and just enable synchronization/shareability services for the NetApp file data in current directory locations.

Egnyte promises enterprise class data security with AD, LDAP and/or SSO user authentication, AES-256 data encryption and their own secure data centers.  No mention of separate key security in their literature.

As for cloud backend storage, Egnyte has its own public cloud or supports other cloud storage providers such as AWS S3, Microsoft Azure, NetApp Storage Grid and HP Public Cloud.

There’s more to Egnyte’s solution than just file synchronization and sharing, but that’s the subject of today’s post. Perhaps we can cover the rest at more length in a future post if there's interest.

File synchronization, cloud storage’s killer app?

The nice thing about these capabilities is that IT staff can now regain control over what is and isn't synched and shared across multiple devices. Up until now all this was happening outside the data center and external to IT control.

From Egnyte’s perspective, they are seeing more and more enterprises wanting data both on premises for performance and compliance as well as in cloud storage for ubiquitous access. They feel it's both a shareability demand between an enterprise's far-flung team members (and potentially client/customer personnel) and a need to access, edit and propagate silo'd corporate information using the new mobile devices that everyone has these days.

In any event, Enterprise file synchronization and sharing is emerging as one of the killer apps for cloud storage.  Up to this point cloud gateways made sense for SME backup or disaster recovery solutions but IMO, didn’t really take off beyond that space.  But if you can package a robust and secure file sharing and synchronization solution around cloud storage then you just might have something that enterprise customers are clamoring for.

~~~~

Comments?

Shingled magnetic recording disks

A couple of weeks ago I attended a day of the SNIA Storage Developers Conference (SDC) where Garth Gibson of Carnegie Mellon University's Parallel Data Lab (CMU PDL) and Panasas was giving a talk on what they are up to at CMU's storage lab. His talk at the conference was on shingled magnetic recording (SMR) disks. We have discussed this topic before in our posts Sequential only disks?! and Disk trends revisited. SMR may require a re-thinking of how we currently access disk storage.

Recall that shingled magnetic recording uses a write head that overwrites multiple tracks at a time (see graphic above), with one track being properly written and the adjacent (inward) tracks being overwritten. As the head moves to the next track, that track can be properly written but more adjacent (inward) tracks are overwritten, etc. In this fashion data can be written sequentially, on overlapping write passes.  In contrast, read heads can be much narrower and are able to read a single track.

In my post, I assumed that this would mean the new shingled magnetic recording disks would need to be accessed sequentially, not unlike tape. Such a change would require a massive software rewrite to only write data sequentially. I had suggested this could potentially work if one were to add some SSD or other NVRAM to the device to help manage the mapping of the data to the disk. Possibly that, plus a very sophisticated drive controller not unlike SSD wear leveling today, could handle mapping a physically sequentially accessed disk to a virtually randomly accessed storage protocol.

Garth’s approach to the SMR dilemma

Garth and his team of researchers are taking another tack at the problem. In his view there are multiple groups of tracks on an SMR disk (zones or bands). Each band can be written either sequentially or randomly, but all bands can be read randomly. One can break up the disk to include sections of multiple shingled bands that are sequentially written, and fewer non-shingled bands that can be randomly written. Of course there would be a gap between the shingled bands in order not to overwrite adjacent bands. And there would also be gaps between the randomly written tracks in a non-shingled partition to allow for the wider track writing that occurs with the SMR write head.

His pitch at the conference dealt with some characteristics of such a multi-band disk device, such as:

  • How to determine the density for a device that has multiple bands of both shingled write data and randomly written data.
  • How big or small a shingled band should be in order to support “normal” small block and randomly accessed file IO.
  • How many randomly written tracks or what the capacity of the non-shingled bands would need to be to support “normal” file IO activity.

For maximum areal density one would want large shingled bands. There are other interesting considerations that were not as obvious, but I won't go into them here.
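
One simple way to see the density tradeoff: if each shingled band of B narrow tracks needs a guard gap (assumed here to be a fixed G tracks wide) so the next band isn't overwritten, the usable fraction of the surface is roughly B/(B+G), which grows toward 1 as bands get bigger. A simplified model, with my own assumed gap width:

    # Simplified band-sizing model: usable track fraction = B / (B + G).
    def usable_fraction(band_tracks, gap_tracks=4):   # gap width is an assumption
        return band_tracks / (band_tracks + gap_tracks)

    for b in (16, 64, 256, 1024):
        print(f"{b:5d}-track bands -> ~{usable_fraction(b):.1%} of tracks usable")
    # Larger bands amortize the gap better (80% -> 99.6% here), but make
    # random updates (read-modify-rewrite of a whole band) more expensive.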

SCSI protocol changes for SMR disks

The other, more interesting section of Garth's talk was on recently proposed T10 and T13 changes to support SMR disks with shingled and non-shingled partitions, and on what needs to be done to support SMR devices.

The SCSI protocol changes being considered to support SMR devices include:

  • A new write cursor for shingled write bands that indicates the next LBA to be written.  The write cursor starts out at a relative band address of 0 and as each LBA is written consecutively in the band it’s incremented by one.
  • A write cursor can be reset (to zero) indicating that the band has been erased
  • Each drive maintains the band map and current cursor position within each band and this can be requested by SCSI drivers to understand the configuration of the drive.

Probably other changes are required as well but these seem sufficient to flesh out the problem.
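
To make the proposed semantics concrete, here's a toy model of a shingled band with a write cursor (my own sketch, not the actual T10/T13 wording): writes must land at the band's cursor, which advances one LBA per write and only returns to zero when the band is erased, while reads within the band stay random:

    # Toy model of a shingled band with a write cursor; illustrative only.
    class ShingledBand:
        def __init__(self, size_lbas):
            self.size = size_lbas
            self.cursor = 0                  # next relative LBA to be written
            self.data = [None] * size_lbas

        def write(self, rel_lba, block):
            if rel_lba != self.cursor:
                raise IOError("writes must be sequential at the band cursor")
            self.data[rel_lba] = block
            self.cursor += 1                 # cursor advances with each write

        def read(self, rel_lba):
            return self.data[rel_lba]        # reads can be random within the band

        def reset(self):
            self.cursor = 0                  # band has been "erased"
            self.data = [None] * self.size

    band = ShingledBand(256)
    band.write(0, b"first"); band.write(1, b"second")   # OK, in order
    # band.write(5, b"oops")  # would raise: out-of-order write into the band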

SMR device software support

Garth and his team implemented an SMR device, emulated in software using real, randomly accessed devices. They then implemented an SMR device driver that used the proposed standards changes and, finally, implemented a ShingledFS file system that uses this emulated SMR disk, to see how it would work. (See their report on Shingled Magnetic Recording for Big Data Applications for more information.)

The CMU team implemented a log structured file system for the ShingledFS that only wrote data to the emulated SMR disk shingled partition sequentially, except for mapping and meta-data information which was written and updated randomly in a non-shingled partition.

You may recall that a log structured file system is essentially written as a sequential stream of data (not unlike a log). But additional mapping is required that indicates where file data is located in the log, which allows for random access to the file data.
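
A minimal sketch of that idea: data is appended to a sequential log (a good match for a shingled band), while a small, randomly updated map records where each piece of file data lives so reads can still be random. ShingledFS keeps this kind of metadata in a non-shingled partition; the details below are simplified:

    # Minimal log-structured store: sequential data log plus a random-access map.
    class LogStructuredStore:
        def __init__(self):
            self.log = []      # append-only data log (a shingled band in practice)
            self.index = {}    # (file, offset) -> position in the log

        def write(self, file, offset, block):
            self.index[(file, offset)] = len(self.log)   # small, random map update
            self.log.append(block)                       # data itself is appended

        def read(self, file, offset):
            return self.log[self.index[(file, offset)]]  # random read via the map

    fs = LogStructuredStore()
    fs.write("a.txt", 0, b"hello")
    fs.write("b.txt", 0, b"world")
    fs.write("a.txt", 0, b"HELLO")   # an overwrite just appends; the map moves
    print(fs.read("a.txt", 0))       # b'HELLO'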

In their report and at the conference, Garth presented some benchmark results for a big data application called Terasort (essentially Teragen, Terasort and Teravalidate), which seems to use Hadoop to sort a large body of data. I'm not sure I can replicate this information here, but suffice it to say that at the moment the emulated SMR device with ShingledFS did not beat a base EXT3 or FUSE setup using the same hardware for these applications.

Now, the CMU project was done by a bunch of smart researchers, but it's still relatively new and not necessarily that optimized. Thus, there's probably some room for improvement in the ShingledFS and maybe even the emulated SMR device and/or the SMR device driver.

At the moment, Garth and his team seem to believe that SMR devices are certainly feasible and would take only modest changes to the SCSI protocols to support. As for file system support, there is plenty of history surrounding log structured file systems, so these are certainly doable, but they would probably require extensive development to be implemented in the various OSes that would support an SMR device. The device driver changes don't seem to be as significant.

~~~~

It certainly looks like there are going to be SMR devices in our future. It's just a question of whether they will ever be as widely supported as the randomly accessed disk devices we know and love today. Possibly, this could all sit behind a storage subsystem that makes the technology available as networked storage capacity, and over time maybe SMR devices could be supported by more standard OS device drivers and file systems. Nevertheless, to keep capacity and areal density on their current growth trajectory, SMR disks are coming; it's just a matter of time.

Comments?

Image: (c) 2012 Hitachi Global Storage Technologies, from IEEE SCV Magnetics Society presentation by Roger Wood