At the USENIX ATC conference a couple of weeks ago, a group of researchers presented their Blockstack global name space and storage system, which is built on top of the Bitcoin blockchain. Their paper was titled “Blockstack: A global naming and storage system secured by blockchain” (see pg. 181-194 in the USENIX ATC’16 proceedings).
Bitcoin blockchain simplified
Blockchains like Bitcoin’s have a number of interesting properties, including a completely distributed understanding of current state, based on hashing and an append-only log of transactions.
Blockchain nodes all participate in validating the current block of transactions and some nodes (deemed “miners” in Bitcoin) supply new blocks of transactions for validation.
All blockchain transactions are sent to every node, where the blockchain software timestamps each transaction and accumulates them in an ordered, append-only log (the “block“), which is then hashed. Each new block also contains the hash of the previous block, which is the “chain” in blockchain.
The miner’s block is then compared against the non-miner nodes’ blocks (their hashes are compared) and if they are equal, everyone reaches consensus (agrees) that the transaction block is valid. Then the next miner supplies a new block of transactions, and the process repeats. (See Wikipedia’s article for more info.)
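To make the hash chaining a bit more concrete, here’s a minimal Python sketch of the idea. It is nothing like Bitcoin’s actual implementation (which hashes binary block headers, uses Merkle trees, proof of work, etc.), just an illustration of how each block commits to its predecessor:

```python
import hashlib
import json
import time

def hash_block(block):
    # Hash the block's canonical JSON form; Bitcoin double-SHA-256s a binary
    # block header, this sketch just hashes once for illustration.
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def new_block(transactions, prev_block):
    # Each block records a timestamp, its transactions, and the hash of the
    # previous block -- the "chain" in blockchain.
    return {
        "timestamp": time.time(),
        "transactions": transactions,
        "prev_hash": hash_block(prev_block) if prev_block else "0" * 64,
    }

# Build a tiny three-block chain.
genesis = new_block(["genesis"], None)
b1 = new_block(["Alice pays Bob 1 coin"], genesis)
b2 = new_block(["Bob pays Carol 2 coins"], b1)

# Tampering with an earlier block changes its hash and breaks the link.
assert b2["prev_hash"] == hash_block(b1)
genesis["transactions"] = ["tampered"]
assert b1["prev_hash"] != hash_block(genesis)
```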
safe ‘n green by Robert S. Donovan (cc) (from flickr)
Was reading an article the other day from TechCrunch that said Servers need to die to save the Internet. This article talked about a startup called MaidSafe which is attempting to re-architect/re-implement/replace the Internet into a Peer-2-Peer, mesh network and storage service which they call the SAFE (Secure Access for Everyone) network. By doing so, they hope to eliminate the need for network servers and storage.
Sometime in the past I wrote a blog post about Peer-2-Peer cloud storage (see Free P2P Cloud Storage and Computing if interested). But it seems MaidSafe has taken this to a more extreme level. By the way the acronym MAID used in their name stands for Massive Array of Internet Disks, sound familiar?
Crypto currency eco-system
The article talks about MaidSafe’s SAFE network ultimately replacing the Internet but at the start it seems more to be a way to deploy secure, P2P cloud storage. One interesting aspect of the MaidSafe system is that you can dedicate a portion of your Internet connected computers’ storage, computing and bandwidth to the network and get paid for it. Assuming you dedicate more resources than you actually use to the network you will be paid safecoins for this service.
For example, users that wish to participate in the SAFE network’s data storage service run a Vault application and indicate how much internal storage to devote to the service. They will be compensated with safecoins when someone retrieves data from their vault.
Safecoins are a new Bitcoin-like internet currency. Currently one safecoin is worth about $0.02, but there was a time when Bitcoins were worth a similar amount. The MaidSafe organization states that there is a limit to the number of safecoins that can ever be produced (4.3 billion), so there’s obviously a point when they will become more valuable if MaidSafe and their SAFE network become successful over time. Also, earned safecoins can be used to pay for other MaidSafe network services as they become available.
Application developers can code their safecoin wallet-ids directly into their apps and have the SAFE network automatically pay them for application/service use. This should make it much easier for app developers to make money off their creations, as they will no longer have to rely on advertising or provide different product levels (say, a free simple-user version and a paid expert version) to make money from their apps. I suppose in a similar fashion this could apply to information providers on the SAFE network. An information warehouse could charge safecoins for document downloads or online access.
All data objects are encrypted, split and randomly distributed across the SAFE network
The SAFE network encrypts and splits any data up and then randomly distributes these data splits uniformly across their network of nodes. The data is also encrypted in transit across the Internet using rUDPs (reliable UDPs) and SAFE doesn’t use standard DNS services. Makes me wonder how SAFE or Internet network nodes know where rUDP packets need to go next without DNS but I’m no networking expert. Apparently by encrypting rUDPs and not using DNS, SAFE network traffic should not be prone to deep packet inspection nor be easy to filter out (except of course if you block all rUDP traffic). The fact that all SAFE network traffic is encrypted also makes it much harder for intelligence agencies to eavesdrop on any conversations that occur.
The SAFE network depends on a decentralized PKI to authenticate and supply encryption keys. All SAFE network data is either encrypted by clients or cryptographically signed by the clients and as such, can be cryptographically validated at network endpoints.
Each data chunk is replicated on, at a minimum, 4 different SAFE network nodes, which provides resilience in case a network node goes down or offline. Each data object could potentially be split up into 100s to 1000s of data chunks. Also, each data object has its own encryption key, derived from the data itself, which is never stored with the data chunks. Again, this provides even better security, but the question becomes where all this metadata (data object encryption keys, chunk locations, PKI keys, node IP locations, etc.) gets stored, how it is secured, and how it is protected from loss. If they are playing the game right, all this is just another data object which is encrypted, split and randomly distributed, but some entity needs to know how to get to the metadata root element to find it all in case of a network outage.
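To illustrate the general idea, here’s my own toy sketch of the scheme described above: derive a key from the data itself, split the object into chunks, and scatter each chunk to a handful of nodes. This is not MaidSafe’s actual self-encryption algorithm or API, and the XOR “encryption” is only a stand-in for real cryptography:

```python
import hashlib

CHUNK_SIZE = 1 * 1024 * 1024  # chunk size here is arbitrary, for illustration only

def content_key(data: bytes) -> bytes:
    # The object's key is derived from the data itself and, per the description
    # above, would never be stored alongside the chunks.
    return hashlib.sha256(data).digest()

def toy_encrypt(chunk: bytes, key: bytes) -> bytes:
    # Stand-in for real encryption: XOR with a SHA-256-derived keystream.
    keystream = bytearray()
    counter = 0
    while len(keystream) < len(chunk):
        keystream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(b ^ k for b, k in zip(chunk, keystream))

def store_object(data: bytes, nodes: list, replicas: int = 4) -> dict:
    key = content_key(data)
    chunk_map = []
    for i in range(0, len(data), CHUNK_SIZE):
        cipher = toy_encrypt(data[i:i + CHUNK_SIZE], key)
        chunk_id = hashlib.sha256(cipher).hexdigest()
        # Pick 'replicas' pseudo-random nodes for this chunk.
        start = int(chunk_id, 16) % len(nodes)
        placed = [nodes[(start + r) % len(nodes)] for r in range(replicas)]
        chunk_map.append({"chunk_id": chunk_id, "nodes": placed})
    # The metadata (key + chunk map) is exactly what must itself be protected.
    return {"key": key.hex(), "chunks": chunk_map}

print(store_object(b"hello SAFE network" * 1000, [f"node-{n}" for n in range(16)]))
```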
Supposedly, MaidSafe can detect within 20 msec whether a node is no longer available and reconfigure the whole network. This probably means that each SAFE network node and endpoint is responsible for some network transaction/activity every 10-20 msec, such as a SAFE network heartbeat to say it is still alive.
It’s unclear to me whether the encryption key(s) used for rUDPs and the encryption key used for the data object are one and the same, functionally related, or completely independent? And how a “decentralized PKI” and “self authentication” works is beyond me but they published a paper on it, if interested.
For-profit open source business model
MaidSafe code is completely Open Source (available at MaidSafe GitHub) and their APIs are freely available to anyone and require no API key. They also have multiple approved and pending patents which have been provided free to the world for use, which they use in a defensive capacity.
MaidSafe says it will take a 5% cut of all safecoin transactions over the SAFE network. And as the network grows their revenue should grow commensurately. The money will be used to maintain the core network software and MaidSafe said that their 5% cut will be shared with developers that help develop/fix the core SAFE network code.
They are hoping to have multiple development groups maintaining the code. They currently have some across Europe and in California in the US. But this is just a start.
They are just now coming out of stealth and have recently received a $6M USD investment (by auctioning off MaidSafeCoins, a progenitor of safecoins), but they have been in operation, architecting, designing and developing the core code, for 8+ years, which probably qualifies them as the longest running startup on the planet.
Replacing the Internet
MaidSafe believes that the Internet as currently designed is too dependent on server farms to hold pages and other data. Having a single place where network data is held is inherently less secure than having data spread out uniformly and randomly across multiple nodes. Also, the fact that most network traffic is in plain text (un-encrypted) means anyone in the network data path can examine and potentially filter out data packets.
I am not sure how the SAFE network can be used to replace the Internet, but then I’m no networking expert. For example, from my perspective, SAFE is dependent on current Internet infrastructure to store and forward rUDPs along its trunk lines and network end-paths. I don’t see how SAFE can replace this current Internet infrastructure, especially with nodes only present at the endpoints of the network.
I suppose as applications and other services start to make use of SAFE network core capabilities, maybe the SAFE network can become more like a mesh network and less dependent on the hub-and-spoke Internet we have today. As a mesh network, node endpoints can store and forward packets themselves to locally accessible neighbors and only go out over Internet hubs/trunk lines when they have to go beyond the local network link.
Moreover, the SAFE network can make any Internet infrastructure less vulnerable to filtering and spying. Also, it’s clear that SAFE applications are no longer executing in data center servers somewhere but rather are actually executing on endpoint nodes of the SAFE network. This has a number of advantages, namely:
SAFE applications are less susceptible to denial of service attacks because they can execute on many nodes.
SAFE applications are inherently more resilient because they operate across multiple nodes all the time.
SAFE applications support faster execution because the applications could potentially be executing closer to the user and could potentially have many more instances running throughout the SAFE network.
Still all of this doesn’t replace the Internet hub and spoke architecture we have today but it does replace application server farms, CDNs, cloud storage data centers and probably another half dozen Internet infrastructure/services I don’t know anything about.
Yes, I can see how MaidSafe and its SAFE network can change the Internet as we know and love it today and make it much more secure and resilient.
Not sure how having all SAFE data being encrypted will work with search engines and other web-crawlers but maybe if you want the data searchable, you just cryptographically sign it. This could be both a good and a bad thing for the world.
Nonetheless, you have to give the MaidSafe group a lot of kudos/congrats for taking on securing the Internet and making it much more resilient. They have an active blog and forum that discuss the technology and what’s happening with it, and I encourage anyone interested in the technology to visit their website to learn more.
Here’s my thoughts on SNWUSA which occurred this past week in the Long Beach Convention Center.
First, it was a great location. I saw a number of users I haven’t seen at SNWUSA ever before, some of which I have known for years from other (non-storage) venues.
Second, the exhibit hall was scantily populated. There were no major storage vendors at the show at all. Gold sponsors included NEC, Riverbed, & Sepaton, representing the largest exhibitors present. Making up the next (Contributing) tier were Western Digital, Toshiba, the Active Archive Alliance, and the LTO consortium, with a smattering of smaller companies. Finally, there were another 12 vendors with kiosks around the floor, the largest being Veeam Software.
I suspect VMWorld Europe happening the same time in Barcelona might have had something to do with the sparse exhibit floor but the trend has been present for the past few shows.
That being said there were still a few surprises in store, at least for me. Two of the most interesting ones were:
Coho Data, who came out of stealth with a scale-out, RAIN (Redundant Array of Independent Nodes) based storage cluster with customer data distributed and mirrored across nodes, plus software defined networking. They currently support NFS for VMware, with a management UI reminiscent of iOS 7 sans touch support. The product comes as a series of nodes with SSDs, disk storage and SDN. The SDN allows Coho Data to relocate front-end (client) connections to where the customer data lies. The distributed, mirrored backend storage provides redundancy in the case of a node/disk failure, at which time the system understands what data is now at risk and rebuilds the now mirror-less data onto other nodes. It reminds me a lot of Bycast/Archivas-like architectures, with SDN and NFS support added. I suppose the reason they are supporting VMware VMDKs is that the files are fairly large and thus easier to supply.
CloudPhysics was not exhibiting, but they sponsored a break and as such were there talking with analysts and the press about their product. Their product installs as a VMware VM service and propagates VMware management agents to ESX servers, which then pipe information back to their app about how your VMware environment is running: how VMs are performing, how your network and storage are performing for the running VMs, etc. This data is then sent to the cloud, where it’s anonymized. In the cloud, customers can use apps (called Cards) to analyze this data, which can help them understand problem areas, predict what configuration changes can do for them, show them how VMs are performing, etc. It is essentially logging all this information to the cloud and providing ways to analyze the data to optimize your VMware environment.
Coming in just behind these two was Jeda Networks with their Software Defined Storage Network (SDSN). They use commodity (OpenFlow compatible) 10GbE switches to support a software FCoE storage SAN. Jeda Networks says that over the past two years most 10GbE switch hardware has started to support DCB in hardware, and with that in place, plus OpenFlow compatibility, they can provide an SDSN on top of them just by emulating a control layer for FCoE switches. Of course, one would still need FCoE storage and CNAs, but with those in place one could use much cheaper switches to support FCoE.
CloudPhysics has a subscription based pricing model which offers three tiers:
Free, where you get their vApp, the management agents and a defined set of free Card Apps at no cost;
Standard level, where you get all the above plus a set of Card Apps which provide more VMware manageability, for $50/ESX server/month; and
Enterprise level, where you get all the above plus all the Card Apps presently available, for $150/ESX server/month.
Jeda Networks and Coho Data are still developing their pricing and had none they were willing to disclose.
One of the CloudPhysics Card apps could predict how certain VMs would benefit from host based (PCIe or SSD) IO caching. They had a chart which showed working set inflection points for (I think) one VM running an OLTP application. I have asked for this chart to discuss further in a future post. Although CloudPhysics has the data to produce such a chart, the application itself just shows three potential break points, where say adding 500MB, 2000MB or 10000MB of SSD cache could speed up application performance by 10%, 30% or 50%, respectively (numbers here are made up for example purposes and not off the chart they showed me).
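I don’t know how CloudPhysics actually computes these predictions, but one simple way to estimate hit ratio versus cache size from an IO trace is to replay the trace through an LRU cache simulation at a few candidate sizes. With a real trace the hit ratio jumps whenever the cache grows past a working set size, which is where those inflection points come from. The trace below is purely synthetic, just to make the sketch runnable:

```python
import random
from collections import OrderedDict

def lru_hit_ratio(trace, cache_blocks):
    # Simulate an LRU cache with 'cache_blocks' slots over a block-access trace.
    cache = OrderedDict()
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)
        else:
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)   # evict the least recently used block
    return hits / len(trace)

# Synthetic trace: a hot working set of ~8,000 blocks gets 90% of the
# accesses, the rest go to a large cold region.
random.seed(42)
trace = [random.randrange(8_000) if random.random() < 0.9
         else random.randrange(8_000, 1_000_000)
         for _ in range(300_000)]

# Assuming 4KB blocks, these sizes are roughly 4MB, 40MB and 400MB of cache.
for cache_blocks in (1_000, 10_000, 100_000):
    print(cache_blocks, "blocks ->", round(lru_hit_ratio(trace, cache_blocks), 2), "hit ratio")
```

The hit ratio barely moves between the two larger sizes but jumps once the cache is big enough to hold the hot working set, which is exactly the kind of break point the Card app surfaces.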
A few other companies made announcements at the show. For example, Sepaton announced their new VirtuoSO, a scale-out hybrid deduplication appliance.
That’s about it. I would have to say that SNW needs to rethink their business model, the frequency of their shows, or what they are trying to do at their conferences. However, on the plus side, most of the users I talked with came away with a lot of information and thought the show was worthwhile, and I came away with a few surprises.
(c) 2012 Silverton Consulting, Inc. All rights reserved
With all this talk of software defined networking and server virtualization, where does storage virtualization stand? I blogged about some problems with storage virtualization a week or so ago in my post Storage Utilization is broke, and this post takes it to the next level. Also, I was at a financial analyst conference this week in Vail where I heard Mark Lewis, of Tekrocket but formerly of EMC, discuss the need for a data hypervisor to provide software defined storage.
I now believe what we really need for true storage virtualization is a renewed focus on data hypervisor functionality. The data hypervisor would need both a control plane and a data plane in order to function properly. Ideally the control plane would set up the interface and routing for the data plane hardware and the server and/or backend storage would be none the wiser.
DMs everywhere
I envision a scenario where a customer’s application data is packaged with a data hypervisor which runs on commodity data switch hardware with data plane and control plane software running on it, sort of creating (virtual) data machines, or DMs.
All enterprise and nowadays most midrange storage provide most of the functionality of a storage control plane such as defining units of storage, setting up physical to logical storage mapping, incorporating monitoring, and management of the physical storage layer, etc. So control planes are pervasive in today’s storage but proprietary.
In addition most storage systems have data plane functionality which operates to connect a host IO request to the actual data which resides in backend storage or internal cache. But again although data planes are everywhere in storage today they are all proprietary to a specific vendor’s storage system.
Data switch needed
But in order to utilize a data hypervisor and create a more general purpose control plane layer, we need a more generic data plane layer that operates on commodity hardware. This is different from today’s SAN storage switches or DCB switches, but similar in some ways.
The data switch/data plane layer would take routing instructions from the control plane layer and direct each server IO request to the proper storage unit. Somewhere in this world view, probably at the data plane level, it would also introduce data protection services like RAID or other erasure coding schemes, point-in-time copy/clone services, replication services and other advanced storage features needed by enterprise storage today.
Also it would need to provide some automated storage movement across and within tiers of physical storage and it would connect server storage interfaces at the front end to storage interfaces at the backend. Not unlike SAN or DCB switches but with much more advanced functionality.
Ideally data switch storage interfaces could attach to dedicated JBOD, Flash arrays as well as systems using DAS storage. In addition, it would be nice if the data switch could talk to real storage arrays on SAN, IP/SANs or NFS&CIFS/SMB storage systems.
The other thing one would like out of a data switch is support for a universal translator that would map one protocol to another, such as iSCSI to SAS, NFS to FC, or FC to NFS and any other combination, depending on the needs of the server and the storage in the configuration.
Now if the data switch were built on top of commodity x86 hardware and software, with the data switch as just a specialized application, that would create the underpinnings for a true data hypervisor with control and data planes that are independent and could use anybody’s storage.
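To make the separation a bit more concrete, here’s a toy Python sketch of the split I have in mind (my own illustration with made-up names, not anyone’s product API): the control plane provisions volumes and publishes a routing table, and the data plane does nothing but dispatch IO according to that table:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Volume:
    # Control-plane metadata for one logical unit of storage.
    name: str
    backends: List[str]          # physical targets, e.g. "jbod-3:/dev/sdb"
    protection: str = "mirror"   # mirror, raid6, erasure-coded, ...

class ControlPlane:
    """Sets up volumes and hands the data plane a routing table."""
    def __init__(self):
        self.volumes: Dict[str, Volume] = {}

    def provision(self, name, backends, protection="mirror"):
        self.volumes[name] = Volume(name, backends, protection)

    def routing_table(self):
        # The data plane only ever sees this table, not the policy logic.
        return {v.name: v.backends for v in self.volumes.values()}

class DataPlane:
    """Dispatches IO requests to backends using the control plane's table."""
    def __init__(self, table, send: Callable[[str, bytes], None]):
        self.table = table
        self.send = send

    def write(self, volume, offset, data):
        # Mirrored write: send the IO to every backend copy for this volume.
        for backend in self.table[volume]:
            self.send(backend, f"WRITE {volume}@{offset} {len(data)}B".encode())

# Hypothetical usage
cp = ControlPlane()
cp.provision("vm-datastore-1", ["jbod-3:/dev/sdb", "flash-1:/dev/nvme0"])
dp = DataPlane(cp.routing_table(), send=lambda target, msg: print(target, msg.decode()))
dp.write("vm-datastore-1", 4096, b"x" * 8192)
```

The point of the split is that the backends, protection scheme or placement could all change in the control plane without the servers or the data plane dispatch logic being any the wiser.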
Data hypervisor
Assuming all this were available then we would have true storage virtualization. With these capabilities, storage could be repurposed on the fly, added to, subtracted from, and in general be a fungible commodity not unlike server processing MIPs under VMware or Hyper-V.
Application data would then need to be packaged into a data machine which would offer all the host services required to support host data access. The data hypervisor would handle the linkages required to interface with the control and data layers.
Applications could be configured to utilize available storage with ease, and storage could grow, shrink or move to accommodate the required workload just as easily as VMs can be deployed today.
How we get there
Aside from the VMware, Citrix and Microsoft thrusts towards virtual storage, there are plenty of storage virtualization solutions that can control most backend enterprise SAN storage. However, the problem with these solutions is that, in general, they execute only on a specific vendor’s hardware and don’t necessarily talk to DAS or JBOD storage.
In addition, not all of the current generation storage virtualization solutions are unified. That is, most of them today only talk FC, FCoE or iSCSI and don’t support NFS or CIFS/SMB.
These don’t appear to be insurmountable obstacles and with proper allocation of R&D funding, could all be solved.
However, more problematic is that none of these solutions operate on commodity hardware or commodity software.
The hardware is probably the easiest to deal with. Today many enterprise storage systems are built on top of x86 processor storage controllers, albeit sometimes with specialized packaging for redundancy and high availability.
The harder problem may be commodity software. Although the genesis for a few storage virtualization systems might have come from BSD or other “commodity” operating systems, they have been modified over the years to the point that they no longer resemble anything that can run on standard off-the-shelf operating systems.
Then again some storage virtualization systems started out with special home grown hardware and software. As such, converting these over to something more commodity oriented would be a major transition.
But the challenge is how to get there from here, and whether anyone would want to take this on. The other problem is that the value-add that storage vendors currently supply would be somewhat eroded, not unlike what happened to proprietary Unix systems with the advent of VMware.
But this will not take place overnight, and the company that takes this on and makes a go of it could have a significant software monopoly that would be hard to crack.
Perhaps it will take a startup to do this but I believe the main enterprise storage vendors are best positioned to take this on.
Code Name "Thumper" by richardmasoner (cc) (from Flickr)
An announcement this week by VMware on their vSphere 5 Virtual Storage Appliance has brought back the concept of shared DAS (see vSphere 5 storage announcements).
Over the years, there have been a few products, such as Seanodes and Condor Storage (may not exist now) that have tried to make a market out of sharing DAS across a cluster of servers.
Arguably, Hadoop HDFS (see Hadoop – part 1), Amazon S3/cloud storage services and most scale out NAS systems all support similar capabilities. Such systems consist of a number of servers with direct attached storage, accessible by other servers or the Internet as one large, contiguous storage/file system address space.
Why share DAS? The simple fact is that DAS is cheap, its capacity is increasing, and it’s ubiquitous.
Shared DAS system capabilities
VMware has limited their DAS virtual storage appliance to a 3 ESX node environment, possibly for lots of reasons. But there is no such restriction for Seanodes’ Exanode clusters.
On the other hand, VMware has specifically targeted SMB data centers for this facility. In contrast, Seanodes has focused on both HPC and SMB markets for their shared internal storage which provides support for a virtual SAN on Linux, VMware ESX, and Windows Server operating systems.
Although VMware Virtual Storage Appliance and Seanodes do provide rudimentary SAN storage services, they do not supply advanced capabilities of enterprise storage such as point-in-time copies, replication, data reduction, etc.
But some of these facilities are available outside these systems. For example, VMware with vSphere 5 will support a host based replication service and has had software based snapshots for some time now. Also, similar services exist or can be purchased for Windows and presumably Linux. And cloud storage providers have offered a smattering of these capabilities in their offerings from the start.
Performance?
Although distributed DAS storage has the potential for high performance, it seems to me that these systems should perform worse than an equivalent amount of processing power and storage in a dedicated storage array. But my biases might be showing.
On the other hand, Hadoop and scale-out NAS systems are capable of screaming performance when put together properly. Recent SPECsfs2008 results for the EMC Isilon scale-out NAS system have demonstrated very high performance, and Hadoop’s claim to fame is high performance analytics. But you have to throw a lot of nodes at the problem.
—–
In the end, all it takes is software. Virtualizing servers, sharing DAS, and implementing advanced storage features, any of these can be done within software alone.
However, service levels, high availability and fault tolerance requirements have historically necessitated a physical separation between storage and compute services. Nonetheless, if you really need screaming application performance and software based fault tolerance/high availability will suffice, then distributed DAS systems with co-located applications like Hadoop or some scale out NAS systems are the only game in town.
BIGData is creating quite a storm around IT these days and at the bottom of big data is an Apache open source project called Hadoop.
In addition, over the last month or so at least three large storage vendors have announced tie-ins with Hadoop, namely EMC (new distribution and product offering), IBM ($100M in research) and NetApp (new product offering).
What is Hadoop and why is it important
OK, lots of money, time and effort are going into deploying and supporting Hadoop on storage vendor product lines. But why Hadoop?
Essentially, Hadoop is a batch processing system for a cluster of nodes that provides the underpinnings of most BIGData analytic activities, because it bundles the two sets of functionality most needed to deal with large unstructured datasets. Specifically:
Distributed file system – Hadoop’s Distributed File System (HDFS) operates using one (or two) meta-data servers (NameNode) with any number of data server nodes (DataNode). These are essentially software packages running on server hardware which support file system services through a client file access API. HDFS supports a WORM file system that splits file data up into segments (64-128MB each), distributes segments to DataNodes where they are stored on local (or networked) storage, and keeps track of everything. HDFS provides a global name space for its file system, uses replication to ensure data availability, and provides widely distributed access to file data.
MapReduce processing – Hadoop’s MapReduce functionality is used to parse unstructured files into some sort of structure (the map function) which can then be sorted, analyzed and summarized (the reduce function) into some worthy output. MapReduce uses HDFS to access file segments and to store reduced results. MapReduce jobs are deployed over a master (JobTracker) and slave (TaskTracker) set of nodes. The JobTracker schedules jobs and allocates activities to TaskTracker nodes, which execute the map and reduce processes requested. These activities execute on the same nodes that are used for HDFS (a minimal word-count sketch follows below).
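The canonical MapReduce example is word count. Here’s a minimal sketch using Hadoop Streaming, which lets you write the map and reduce steps as ordinary scripts reading stdin and writing stdout (production MapReduce jobs are more typically written in Java against the Hadoop API):

```python
#!/usr/bin/env python
# mapper.py -- emits (word, 1) for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python
# reducer.py -- input arrives sorted by key, so counts can be summed per word
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")
```

You would then submit these via the Hadoop Streaming jar (its path varies by distribution), something like: hadoop jar hadoop-streaming.jar -input /books -output /wordcounts -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py. Hadoop handles splitting the input across DataNodes, sorting the mapper output by key, and running the reducers across the cluster.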
Hadoop explained
It just so happens that HDFS is optimized for sequential read access and can handle files that are 100TBs or larger. By throwing more data nodes at the problem, throughput can be scaled up enormously so that a 100TB file could literally be read in minutes to hours rather than weeks.
Similarly, MapReduce parcels out the processing steps required to analyze this massive amount of data onto these same DataNodes, thus distributing the processing of the data, reducing time to process this data to minutes.
HDFS can also be “rack aware” and, as such, will try to allocate file segment replicas to different racks where possible. In addition, replication can be defined on a per-file basis and normally uses 3 replicas of each file segment.
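Roughly speaking, the default placement policy puts one replica near the writer and the remaining replicas on a different rack. Here’s a toy sketch of that kind of policy (my own illustration, not HDFS’s actual placement code):

```python
import random

def place_replicas(writer_node, topology, replicas=3):
    """Pick nodes for a block's replicas, spreading them across racks.

    Roughly mimics the policy described above: one replica on the writer's
    node, the others on a single remote rack, on different nodes.
    """
    local_rack = topology[writer_node]
    remote_nodes = [n for n, rack in topology.items() if rack != local_rack]
    chosen = [writer_node]
    # Put the remaining replicas on one remote rack, on distinct nodes.
    remote_rack = topology[random.choice(remote_nodes)]
    same_remote_rack = [n for n, rack in topology.items()
                        if rack == remote_rack and n not in chosen]
    chosen += random.sample(same_remote_rack, min(replicas - 1, len(same_remote_rack)))
    return chosen

# Hypothetical cluster: DataNode -> rack
topology = {f"dn{i}": f"rack{i % 3}" for i in range(9)}
print(place_replicas("dn0", topology))
```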
Another characteristic of Hadoop is that it uses data locality to run MapReduce tasks on nodes close to where the file data resides. In this fashion, networking bandwidth requirements are reduced and performance approximates local data access.
MapReduce programming is somewhat new, unique and complex, and was an outgrowth of Google’s MapReduce process. As such, a number of other Apache open source projects have sprouted up, namely Cassandra, Chukwa, HBase, Hive, Mahout, and Pig to name just a few, that provide easier ways to automatically generate MapReduce programs. I will try to provide more information on these and other popular projects in a follow-on post.
Hadoop fault tolerance
When an HDFS DataNode fails, the NameNode detects the missing heartbeat signal and, once detected, will recreate all the missing replicas that resided on the failed DataNode.
Similarly, the MapReduce JobTracker node can detect that a TaskTracker has failed and allocate this work to another node in the cluster. In this fashion, work will complete even in the face of node failures.
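The mechanism is simple enough to sketch (again, just an illustration, not Hadoop’s code): track the last heartbeat seen from each DataNode, declare a node dead after a timeout, and queue up re-replication for any block that lost a replica:

```python
import time

HEARTBEAT_TIMEOUT = 10 * 60   # roughly HDFS's default dead-node interval; illustrative only

def find_dead_nodes(last_heartbeat, now=None):
    # last_heartbeat: node -> timestamp of its most recent heartbeat
    now = now or time.time()
    return [n for n, ts in last_heartbeat.items() if now - ts > HEARTBEAT_TIMEOUT]

def blocks_to_rereplicate(block_map, dead_nodes):
    # block_map: block id -> nodes holding a replica; return surviving replica
    # locations for every block that lost a copy on a dead node.
    dead = set(dead_nodes)
    return {blk: [n for n in nodes if n not in dead]
            for blk, nodes in block_map.items()
            if any(n in dead for n in nodes)}

# Hypothetical state
now = time.time()
last_heartbeat = {"dn1": now - 5, "dn2": now - 5, "dn3": now - 3600}
block_map = {"blk_1": ["dn1", "dn2", "dn3"], "blk_2": ["dn1", "dn2"]}

dead = find_dead_nodes(last_heartbeat, now)
print("dead:", dead)                                  # ['dn3']
print("re-replicate from:", blocks_to_rereplicate(block_map, dead))
```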
Hadoop distributions, support and who uses them
Alas, as with any open source project, having a distribution that can be trusted and supported takes much of the risk out of using it, and Hadoop is no exception. Probably the most popular distribution comes from Cloudera, which contains all the above named projects and more, and provides support. Recently EMC announced that they will supply their own distribution and support of Hadoop as well. Amazon and other cloud computing providers also support Hadoop on their clusters but use other distributions (mostly Cloudera).
As to who uses Hadoop, it seems just about everyone on the web today is a Hadoop user, from Amazon to Yahoo, including eBay, Facebook, Google and Twitter to highlight just a few popular ones. There is a list on the Apache Hadoop website which provides more detail if interested. The list indicates some of the Hadoop configurations and shows anywhere from an 18-node cluster to over 4500 nodes with multiple PBs of data storage. Most of the big players are also active participants in the various open source projects around Hadoop, and much of the code came from these organizations.
—-
I have been listening to the buzz on Hadoop for the last month and finally decided it was time I understood what it was. This is my first attempt – hopefully, more to follow.
Sometime this week EMC announced a new generation of Isilon NearLine storage which now includes HGST 3TB SATA disk drives. With the new capacity the multi-node (144) Isilon cluster using the 108NL nodes can support 15PB of file data in a single file system.
Some of the booths along the walk to the solutions pavilion highlight EMC innovation winners. Two that caught my interest included:
Constellation computing – not quite sure how to define this, but it’s distributed computing along with distributed data creation. The intent is to move the data processing to the source of the data creation and keep the data there. This might be very useful for applications that have many data sources and where data processing capabilities can be moved out to the nodes where the data was created. Seems highly scalable, but may depend on the ability to carve up the processing to work on the local data. I can see where compression, encryption, indexing and some statistical summarization could be done at the data creation site before it’s sent elsewhere. Sort of like a sensor mesh with processing nodes attached to the sensors, configured as a sensor-processing grid. Only one thing concerned me: there didn’t seem to be any central repository or control to this computing environment. That is probably what they intended, as the distributed solution is more adaptable and more scalable than a centrally controlled environment.
Developing world healthcare cloud – seemed to be all about delivering healthcare to the bottom of the pyramid. They won EMC’s social innovation award and are working with a group in Rwanda to try to provide better healthcare to remote villages. It’s built around OpenMRS as a backend medical record archive hosted on EMC DC powered Iomega NAS storage and uses Google’s OpenDataKit to work with the data on mobile and laptop devices. They showed a mobile phone which could be used to create, record and retrieve healthcare information (OpenMRS records) remotely and upload it sometime later when in range of a cell tower. The solution also supports the download of a portion of the medical center’s health record database (e.g., a “cohort” slice, think a village’s healthcare records) onto a laptop, usable offline by a healthcare provider to update and record patient health changes onsite and remotely. Pulling all the technology together and delivering this as an application stack usable on mobile and laptop devices with minimal IT sophistication, storage and remote/mobile access are where the challenges lie.
Went to Sanjay’s (EMC’s CIO) keynote on EMC IT’s journey to IT-as-a-Service. As you can imagine it makes extensive use of VMware’s vSphere, vCloud, and vShield capabilities primarily in a private cloud infrastructure but they seem agnostic to a build-it or buy-it approach. EMC is about 75% virtualized today, and are starting to see significant and tangible OpEx and energy savings. They designed their North Carolina data center around the vCloud architecture and now are offering business users self service portals to provision VMs and business services…
Only caught the first section of BJ’s (President of BRS) keynote, but he said recent analyst data (IDC, I think) showed that EMC was the overall leader (>64% market share) in purpose built backup appliances (Data Domain, Disk Library, Avamar data stores, etc.). Too bad I had to step out, but he looked like he was on a roll.
The Large Hadron Collider/ATLAS at CERN by Image Editor (cc) (from flickr)
15PB a year: that’s what CERN produces from their 6 experiments. How this data is subsequently accessed and replicated around the world is an interesting tale of multi-tier data grids.
When an experiment is run at CERN, the data is captured locally in what’s called Tier 0. After some initial processing, this data is stored on tape at Tier 0 (CERN) and then replicated to over 10 Tier 1 locations around the world which then become a second permanent repository for all CERN experiment data. Tier 2 centers can request data from Tier 1 locations and process the data and return results to Tier 1 for permanent storage. Tier 3 data centers can request data and processing time from Tier 2 centers to analyze the CERN data.
Each experiment has its own set of Tier 1 data centers that store its results. According to the latest technical description I could find, Tier 0 (at CERN) and most Tier 1 data centers provide a permanent tape storage repository for experimental data, fronted by a disk cache. Tier 2 centers can have similar resources but are not expected to be a permanent repository for data.
Each Tier 1 data center has its own hierarchical management system (HMS) or mass storage system (MSS) based on any number of software packages such as HPSS, CASTOR, Enstore, dCache, DPM, etc., most of which are open source products. But regardless of the HMS/MSS implementation, they all provide a set of generic storage management services based on the Storage Resource Manager (SRM), as defined by a consortium of research centers, and a set of file transfer protocols defined by yet another set of standards from Globus or gLite.
Each Tier 1 data center manages its own storage element (SE). Each experiment’s storage element has disk storage, optionally with tape storage (using one or more of the above disk caching packages), and provides authentication/security, provides file transport, and maintains catalogs and local databases. These catalogs and local databases index the data sets or files available on the grid for each experiment.
Data stored in the grid are considered read-only and can never be modified. Users that need to process this data read it from Tier 1 data centers, process it, and create new data which is then stored in the grid. New data to be placed in the grid must be registered to the LCG file catalogue and transferred to a storage element to be replicated throughout the grid.
CERN data grid file access
“Files in the Grid can be referred to by different names: Grid Unique IDentifier (GUID), Logical File Name (LFN), Storage URL (SURL) and Transport URL (TURL). While the GUIDs and LFNs identify a file irrespective of its location, the SURLs and TURLs contain information about where a physical replica is located, and how it can be accessed.” (taken from the gLite user guide).
GUIDs look like guid:<unique_string>. Files are given a unique GUID when created and it can never be changed. The unique string portion of the GUID is typically a combination of MAC address and time-stamp and is unique across the grid.
LFNs look like lfn:<unique_string>. Files can have many different LFNs, all pointing or linking to the same data. LFN unique strings typically follow Unix-like conventions for file links.
SURLs look like srm:<se_hostname>/path and provide a way to access data located at a storage element. SURLs are transformed to TURLs. SURLs are immutable and are unique to a storage element.
TURLs look like <protocol>://<se_hostname>:<port>/path and are obtained dynamically from a storage element. TURLs can have any format after the // that uniquely identifies the file to the storage element but typically they have a se_hostname, port and file path.
GUIDs and LFNs are used to look up a data set in the global LCG file catalogue. After file lookup, a set of site-specific replicas is returned (via SURLs), which are used to request file transfer/access from a nearby storage element. The storage element accepts the file’s SURL and assigns a TURL, which can then be used to transfer the data to wherever it’s needed. TURLs can specify any file transfer protocol supported across the grid (a sketch of this lookup flow follows the protocol list below).
CERN data grid file transfer protocols supported for transfer and access currently include:
GSIFTP – a grid security interface enabled subset of the GRIDFTP interface as defined by Globus
gsidcap – a grid security interface enabled feature of dCache for ftp access.
rfio – remote file I/O supported by DPM. There is both a secure and an insecure version of rfio.
file access – local file access protocols used to access the file data locally at the storage element.
While all storage elements provide the GSIFTP protocol, the other protocols supported depend on the underlying HMS/MSS system implemented by the storage element for each experiment. Most experiments use one type of MSS throughout their world wide storage elements and as such, offer the same file transfer protocols throughout the world.
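Putting the naming pieces together, the lookup flow goes GUID/LFN, to SURLs (one per replica), to a TURL handed out by the chosen storage element. Here’s a toy sketch of that flow; the catalogue contents, hostnames and helper functions below are all made up, and the real LCG file catalogue and SRM interfaces are considerably more involved:

```python
# Hypothetical catalogue/SRM stand-ins -- only illustrates the resolution flow above.

lfc = {   # LFN or GUID -> list of SURLs (site-specific replicas)
    "lfn:/grid/atlas/run1234/events.root": [
        "srm://se01.cern.ch/atlas/run1234/events.root",
        "srm://se.bnl.gov/atlas/run1234/events.root",
    ],
}

def list_replicas(lfn_or_guid):
    # Look the file up in the (toy) LCG file catalogue.
    return lfc[lfn_or_guid]

def surl_to_turl(surl, protocol="gsiftp"):
    # A real storage element assigns the TURL dynamically; here we just
    # rewrite the SURL with the requested transfer protocol and port.
    host_and_path = surl[len("srm://"):]
    host, path = host_and_path.split("/", 1)
    return f"{protocol}://{host}:2811/{path}"

surls = list_replicas("lfn:/grid/atlas/run1234/events.root")
turl = surl_to_turl(surls[0])    # pick a nearby storage element's replica
print(turl)                      # gsiftp://se01.cern.ch:2811/atlas/run1234/events.root
```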
If all this sounds confusing, it is. Imagine 15PB a year of data replicated to over 10 Tier 1 data centers, which can then be securely processed by over 160 Tier 2 data centers around the world. All this supports literally thousands of scientists who have access to every byte of data created by CERN experiments and the scientists that post-process this data.
Just exactly how this data is replicated to the Tier 1 data centers and how a scientist processes such data must be the subject for other posts.