Latest Publications

Caching DaaD for federated data centers

Internet Splat Map by jurvetson (cc) (from flickr)

Internet Splat Map by jurvetson (cc) (from flickr)

Today, I attended a webinar where Pat Gelsinger, President of Information Infrastructure at EMC discussed their concept for a new product based on the Yotta Yotta technology they acquired a few years back.  Yotta Yotta’s product was a distributed, coherent caching appliance that had FC front end ports, an Infiniband appliance internal network and both FC and WAN backend links.

What one did with Yotta Yotta nodes was place them in front of your block storage, connect them together via infiniband locally and via a WAN technology (of your choice, then) and then you could access any data behind the appliances from any attached location.  They also provided very quick transferring of bulk data between remote nodes. So, their technology allowed for very rapid data transmission over standard WAN interfaces/distances and provided a distributed cache across those very same distances to the data behind the appliances.

I like caching appliances as much as anyone but they had become prominent only in the late 70’s and early 80’s mostly because caching technology was hard to do with the storage subsystems of the day, but they went away a long time ago.  Nowadays, you can barely purchase a lone disk drive without a cache in them.  So what’s different.

Introducing DaaD

Today we have SSDs and much cheaper processing power.  I wrote about new caching appliances like DataRam’s XcelaSAN  in a Cache appliances rise from the dead post I did after last years SNW.  But EMC’s going after a slightly broader domain – the world.  The caching appliance that EMC is discussing is really intended to support distributed data access, or as I like to call it,  Data-at-a-Distance (DaaD).

How can this work?  Data is stored on subsystems at various locations around the world.  A DaaD appliance is inserted in front of each of these and connected over the WAN. Some or all of that data is now re-configured (at block or more likely LUN level) to be accessible at distance from each DaaD data center.  As each data center reads and writes data from/to their remote brethern, some portion of that data is cached locally in the DaaD appliance and the rest is only available by going to the remote site (with considerably higher latency).

This works moderately well for well behaved, read intensive workloads where 80% of the IO is to 20% of the data (most of which is cached locally).  But block writes present a particularly nasty problem as any data write has to be propagated to all cache copies before acknowledged.

It’s possible write propagation could be done via invalidating the data in cache (so any subsequent read would need to re-access the data from the original host).  Nevertheless, to even know which DaaD nodes have a cached copy of a particular block, one needs to maintain a dictionary of all globally identifiable blocks held in any DaaD cache node at every moment in time.  Any such table would change often, will necessarily need to be updated very carefully, deadlock free and atomically with non-failable transactions – therein lies one of the technological hurdles.  Doing this quickly without impacting performance is another hurdle.

So simple enough, EMC takes Yotta Yotta’s technology, updates it for todays processors, networking, and storage, and releases it as a data center federation enabler. So, what can one do with a federated data center, well that’s another question and it involves Vmotion, and must be a subject for a future post …

R&D effectiveness

A recent Gizmodo blog post compared a decade of R&D at Sony, Microsoft and Apple.  There were some interesting charts but mostly it showed that R&D as a percent of revenue, fluctuates from year to year and R&D spend has been rising for all the companies (although at different rates).

R&D Effectiveness, (C) 2010 Silverton Consulting, All Rights Reserved

R&D Effectiveness, (C) 2010 Silverton Consulting, All Rights Reserved

Overall from a percentage of Revenue basis, Microsoft wins, spending ~15% of revenue on R&D over the past decade, Apple loses, spending only ~4% on R&D and Sony is right in the middle at spending ~7% on R&D.  Yet viewing the impact on corporate revenue R&D spending had significantly different impacts on each company than what pure % R&D spending would indicate.

How can one measure R&D effectiveness.

  • Number of patents – this is often used as an indicator, but unclear how this correlates to business success.  Patents can be licensed but only if they prove important to other companies. However, patent counts can be gauged early on during the R&D activities rather than much later when a product reaches the market.
  • Number of projects – by projects we mean an idea from research taken into development.  Such projectst may or may not make it out to market.  At one level this can be a leading indicator of “research” effectiveness, as this means an idea was deemed at least of commercial interest.  A problem with this is that not all projects get released to the market or become commercially viable.
  • Number of products – by products, we mean something sold to customers.  At least such a measure reflects that the total R&D effort was deemed worthy enough to take to market.  How successful such a product is still to be determined.
  • Revenue of products – product revenue seems easy enough but often can be hard to allocate properly.  Looking at the iPhone, do we count just handset revenues or include application and cell service revenues. But assuming one can properly allocate revenue sources to R&D efforts, one can come up with a revenue from R&D spending.  The main problem with revenue generated from R&D ratios are all the other non-R&D factors confound it, e.g., marketing, manufacturing, competition, etc.
  • Profitability of products – product profitability is even messier than revenue when it comes to confoundability.  But ultimately profitability of R&D efforts may be the best factor to use as any product that’s truly effective should generate the most profits.

There are probably other R&D effectiveness factors that could be considered but these will suffice for now.

How did they do?

Returning to the Gizmodo discussion, their post didn’t include any patent counts, project counts (only visibly internally), product counts, or profitability measures but they did show revenue for each company.  From a purely Revenue impact one would have to say that Apple’s R&D was a clear winner with Microsoft a clear second.  Although we would have to say that Apple started from considerable smaller revenue than Sony or Microsoft but Apple’s ~$148B of revenue in 2005 was only small in comparison to other giants.  We all know the success of the iPhone and iPod but they also stumbled on the Apple TV.

Why did they do so well?

What then makes Apple do so good?  We have talked before about an elusive quality we called visionary leadership.  Certainly Bill Gates is as technically astute as Steve Jobs and there can be no denying that their respective marketing machines are evenly matched.  But both Microsoft and Apple were certainly led by more technical individuals than Sony over the last decade.   Both Microsoft and Apple have had significant revenue increases over the past ten years, that parallel one another while Sony, in comparison, has remained relatively flat.

I would say both Microsoft and Apple results show that “visionary leadership” has a certain portion of technicality to it that can’t be denied.  Moreover, I think that if one looked at Sony under Akio Morita, HP under Bill Hewlett and Dave Packard or many other large companies today, one could conclude that technical excellence is a significant component of visionary leadership.  All these companies highest revenue growth came under leadership which had significant technical knowledge.  There’s more to visionary leadership then technicality alone but it seems at least foundational.

I still owe a post on just what constitute’s visionary leadership, but I seem to be surrounding it rather than attacking it directly.

WD’s new SiliconEdge Blue SSD data write spec

Western Digital's Silicon Edge Blue SSD SATA drive (from their website)

Western Digital's SiliconEdge Blue SSD SATA drive (from their website)

Western Digital (WD) announced their first SSD drive for the desktop/laptop market space today.  Their drive offers the typical256, 128, and 64GB capacity points over a SATA interface.  Performance looks ok at 5K random read or write IO/s with sustained transfers at 250 and 140MB/s for read and write respectively.  But what caught my eye was a new specification I hadn’t seen before indicating Maximum GB written per day of 17.5, 35 and 70GB/d for their drives using WD’s Operational Lifespan – LifeEST(tm) definition.

I couldn’t find anywhere that said which NAND technology was used in the device but it likely uses MLC NAND.  In a prior posting we discussed a Toshiba study that said a “typical” laptop user writes about 2.4GB/d and a “heavy” laptop user writes about 9.2GB/d.  This data would indicate that WD’s new 64GB drive can handle almost 2X the defined “heavy” user workload for laptops and their other drives would handle it just fine.  A data write rate for desktop work, as far as I can tell, has not been published, but presumably it would be greater than laptop users.

From my perspective more information on the drives underlying NAND technology, on what a LifeEST specification actually means, and a specification as to how much NAND storage was actually present would be nice, but these are all personal nits.  All that aside, I applaud WD for standing up and saying what data write rate their drives can support.  This needs to be a standard part of any SSD specification sheet and I look forward to seeing more information like this coming from other vendors as well.

Describing Dedupe

Hard Disk 4 by Alpha six (cc) (from flickr)

Hard Disk 4 by Alpha six (cc) (from flickr)

Deduplication is a mechanism to reduce the amount of data stored on disk for backup, archive or even primary storage.  For any storage, data is often duplicated and any system that eliminates storing duplicate data will be more utilize storage more efficiently.

Essentially, deduplication systems identify duplicate data and only store one copy of such data.  It uses pointers to incorporate the duplicate data at the right point in the data stream. Such services can be provided at the source, at the target, or even at the storage subsystem/NAS system level.

The easiest way to understand deduplication is to view a data stream as a book and as such, it can consist of two parts a table of contents and actual chapters of text (or data).  The stream’s table of contents provides chapter titles but more importantly (to us), identifies a page number for the chapter.  A deduplicated data stream looks like a book where chapters can be duplicated within the same book or even across books, and the table of contents can point to any book’s chapter when duplicated. A deduplication service inputs the data stream, searches for duplicate chapters and deletes them, and updates the table of contents accordingly.

There’s more to this of course.  For example, chapters or duplicate data segments must be tagged with how often they are duplicated  so that such data is not lost when modified.  Also, one way to determine if data is duplicated is to take one or more hashes and compare this to other data hashes, but to work quickly, data hashes must be kept in a searchable index.

Types of deduplication

  • Source deduplication involves a repository, a client application, and an operation which copies client data to the repository.  Client software chunks the data, hashes the data chunks, and sends these hashes over to the repository.  On the receiving end, the repository determines which hashes are duplicates and then tells the client to send only the unique data.  The repository stores the unique data chunks and the data stream’s table of contents.
  • Target deduplication involves performing deduplication inline, in-parallel, or post-processing by chunking the data stream as it’s recieved, hashing the chunks, determining which chunks are unique, and storing only the unique data.  Inline refers to doing such processing while receiving data at the target system, before the data is stored on disk.  In-parallel refers to doing a portion of this processing while receiving data, i.e., portions of the data stream will be deduplicated while other portions are being received.  Post-processing refers to data that is completely staged to disk before being deduplicated later.
  • Storage subsystem/NAS system deduplication looks a lot like post-processing, target deduplication.  For NAS systems, deduplicaiot looks at a file of data after it is closed. For general storage subsystems the process looks at blocks of data after they are written.  Whether either system detects duplicate data below these levels is implementation dependent.

Deduplication overhead

Deduplication processes generate most overhead while deduplicating the data stream, essentially during or after the data is written, which is the reason that target deduplication has so many options, some optimize ingestion while others optimize storage use. There is very little additonal overhead for re-constituting (or un-deduplicating) the data for read back as retrieving the unique and/or duplicated data segments can be done quickly.  There may be some minor performance loss because of lack of  sequentiality but that only impacts data throughput and not that much.

Where dedupe makes sense

Deduplication was first implemented for backup data streams.  Because any backup that takes full backups on a monthly or even weekly basis will duplicate lot’s of data.  For example, if one takes a full backup of 100TBs every week and lets say new unique data created each week is ~15%, then at week 0, 100TB of data is stored both for the deduplicated and undeduplicated data versions; at week 1 it takes 115TB to store the deduplicated data but 200TB for the non-deduplicated data; at week 2 it takes ~132TB to store deduplicated data but 300TB for the non-deduplicated data, etc.  As each full backup completes it takes another 100TB of un-deduplicated storage but significantly less deduplicated storage.  After 8 full backups the un-deduplicated storage would require 8ooTB but only ~265TB for deduplicated storage.

Deduplication can also work for secondary or even primary storage.  Most IT shops with 1000’s of users, duplicate lot’s of data.  For example, interim files are sent from one employee to another for review, reports are sent out en-mass to teams, emails are blasted to all employees, etc.  Consequently, any storage (sub)system that can deduplicate data would more efficiently utilize backend storage.

Full disclosure, I have worked for many deduplication vendors in the past.

Cleversafe’s new hardware

Cleversafe new dsNet(tm) Rack (from Cleversafe.com)

Cleversafe new dsNet(tm) Rack (from Cleversafe.com)

Yesterday, Cleversafe announced new Slicestor(r) 2100 and 2200 hardware using 2TB SATA drives. The standard 2100 1U package supports 8TB of raw data and the 2200 new 2U package supports 24TB of data. In addition, a new Accesser(r) 2100 supports 8GB of ECC RAM, and 2 GigE or 10GbE ports for data access.

In addition to the new server hardware, Cleversafe also announced an integrated rack with up to 18 Slicestor 2200s, 2 Accessors 2100s, 1 Omnience (management node), 48-port ethernet switch, and PDUs. This new rack configuration comes pre-cabled and can easily be installed to support an immediate 432TB raw capacity. It’s expected that customers with multiple sites could order 1 or more racks to support a quick installation of Cleversafe storage services.

Cleversafe currently offers iSCSI block services, direct object storage interface and file services interfaces (over iSCSI).  They are finding some success in the media and entertainment space as well as federal and state government data centers.

The federal and state government agencies seem especially interested in Cleversafe for its data security capabilities.  They offer cloud data security via their SecureSlice(tm) technology which encrypts data slices and uses key masking to obscure the key.  With SecureSlice, the only way to decrypt the data is to have enough slices to reconstitute the data.

Also the new Accesser and Slicestor server hardware now uses a drive on motherboard flash unit to hold operating system/Cleversafe software. This allows data drives to only hold customer data and reduces Accesser power requirements while also improving both Slicestor and Accesser reliability.

In a previous post we discussed EMC’s Atmos’s GeoProtect capabilities and although they are not quite at the sophistication of Cleversafe, EMC does offer a sort of data dispersion across sites/racks.  However, it appears that GeoProtect is currently limited to two distinct configurations.  In contrast, Cleversafe allows the user to select the number of Slicestor’s to store data and the threshold required to reconstitute the data.  Doing this allows the user to almost dial up or down the availability and reliability they want for their data.

Cleversafe performs well enough to saturate a single Accesser GigE iSCSI link.  Accessers maintain a sort of preferred routing table which indicates which Slicestors currently have the best performance. By accessing the quickest Slicestors first to reconstitute data, performance can be optimized.  Specifically, for the typical multi-site Cleversafe implementation, knowing current Slicestor to Accesser performance can improve data reconstitution performance considerably.

Full disclosure, I have done work for Cleversafe in the past.

Google vs. National Information Exchange Model

Information Exchange Package Documents (IEPI) lifecycle from www.niem.gov

Information Exchange Package Documents (IEPI) lifecycle from www.niem.gov

Wouldn’t the National information exchange be better served by deferring the National Information Exchange Model (NIEM) and instead implementing some sort of Google-like search of federal, state, and municipal text data records.  Most federal, state and local data resides in sophisticated databases using their information management tools but such tools all seem to support ways to create a PDF, DOC, or other text output for their information records.   Once in text form, such data could easily be indexed by Google or other search engines, and thus, searched by any term in the text record.

Now this could never completely replace NIEM, e.g., it could never offer even “close-to” real-time information sharing.  But true real-time sharing would be impossible even with NIEM.  And whereas NIEM is still under discussion today (years after its initial draft) and will no doubt require even more time to fully  implement, text based search could be available today with minimal cost and effort.

What would be missing from a text based search scheme vs. NIEM:

  • “Near” realtime sharing of information
  • Security constraints on information being shared
  • Contextual information surrounding data records,
  • Semantic information explaining data fields

Text based information sharing in operation

How would something like a Google type text search work to share government information.  As discussed above government information management tools would need to convert data records into text.  This could be a PDF, text file, DOC file, PPT, and more formats could be supported in the future.

Once text versions of data records were available, it would need to be uploaded to a (federally hosted) special website where a search engine could scan and index it.  Indexing such a repository would be no more complex than doing the same for the web today.  Even so it will take time to scan and index the data.  Until this is done, searching the data will not be available.  However, Google and others can scan web pages in seconds and often scan websites daily so the delay may be as little as minutes to days after data upload.

Securing text based search data

Search security could be accomplished in any number of ways, e.g., with different levels of websites or directories established at each security level.   Assuming one used different websites then Google or another search engine could be directed to search any security level site at your level and below for information you requested. This may take some effort to implement but even today one can restrict a Google search to a set of websites.  It’s conceivable that some script could be developed to invoke a search request based on your security level to restrict search results.

Gaining participation

Once the upload websites/repositories are up and running, getting federal, state and local government to place data into those repositories may take some persuasion.  Federal funding can be used as one means to enforce compliance.  Bootstrapping data loading into the searchable repository can help insure initial usage and once that is established hopefully, ease of access and search effectiveness, can help insure it’s continued use.

Interim path to NIEM

One loses all contextual and most semantic information when converting a database record into text format but that can’t be helped.   What one gains by doing this is an almost immediate searchable repository of information.

For example, Google can be licensed to operate on internal sites for a fair but high fee and we’re sure Microsoft is willing to do the same for Bing/Fast.  Setting up a website to do the uploads can take an hour or so by using something like WordPress and file management plugins like FileBase but other alternatives exist.

Would this support the traffic for the entire nation’s information repository, probably not.  However, it would be an quick and easy proof of concept which could go a long way to getting information exchange started. Nonetheless, I wouldn’t underestimate the speed and efficiency of WordPress as it supports a number of highly active websites/blogs.  Over time such a WordPress website could be optimized, if necessary, to support even higher performance.

As this takes off, perhaps the need for NIEM becomes less time sensitive and will allow it to take a more reasoned approach.  Also as the web and search engines start to become more semantically aware perhaps the need for NIEM becomes less so.  Even so, there may ultimately need to be something like NIEM to facilitate increased security, real-time search, database context and semantics.

In the mean time, a more primitive textual search mechanism such as described above could be up and available for download within a day or so. True, it wouldn’t provide real time search, wouldn’t provide everything NIEM could do, but it could provide viable, actionable information exchange today.

I am probably over simplifying the complexity to provide true information sharing but such a capability could go a long way to help integrate governmental information sharing needed to support national security.

Atmos GeoProtect vs RAID

The Night Lights of Europe (as seen from space) by woodleywonderworks (cc) (from flickr)

The Night Lights of Europe (as seen from space) by woodleywonderworks (cc) (from flickr)

Yesterday, twitterland was buzzing about EMC’s latest enhancement to their Atmos Cloud Storage platform called GeoProtect.  This new capability improves cloud data protection by supporting erasure code data protection rather than just pure object replication.

Erasure coding has been used for over a decade in storage and some of the common algorithms are Reed-Solomon, Cauchy Reed-Soloman, EVENODD coding, etc.  All these algorithms provide a way for splitting up customer data into data instances and parity (encoding) to allow some number of data or parity instances to be erased (or lost) while still providing customer data.  For example, a R-S encoding scheme we used in the past (called RAID 6+) had 13 data fragments and 2 parity fragments.  Such an encoding scheme supported the simultaneous failure of any two drives and could still supply (reconstruct) customer data.

But how does RAID differ from something like GeoProtect.

  • RAID is typically within a storage array and not across storage arrays
  • RAID is typically limited to a small number of alternative configurations of data disks and parity disks which cannot be altered in the field, and
  • Currently, RAID typically doesn’t support more than two disk failures while still being able to recover customer data (see Are RAIDs days numbered?)

As I understand it GeoProtect currently supports only two different encoding schemes which can provide for different levels of data instance failures while still protecting customer data.  And with GeoProtect you are protecting data across Atmos nodes and potentially across different geographic locations not just within storage arrays.  Also, with Atmos this is all policy driven and data that comes into the system can use any object replication policy or either of the two GeoProtect policies supported today.

Although the nice thing about R-S encoding is that it doesn’t have to be fixed to two different encoding schemes.  And as it’s all software, new coding schemes could easily be released over time, possibly someday being entirely something a user could dial up or down at their whim.

But this would seem much more like what Cleversafe has been offering in their SliceStor product.  With Cleversafe the user can specify exactly how much redundancy they want to support and the system takes care of everything else.  In addition, Cleversafe has implemented a more fine grained approach (with many more fragments) and data and parity are intermingled in each stored fragment.

It’s not a big stretch for Atmos to go from two GeoProtect configurations to four or more.  Unclear to me what the right number would be but once you get past 3 or so, it might be easier to just code a generic R-S routine that can handle any configuration the customer wants but I may be oversimplifying the mathematics here.

Nonetheless, in future versions of Atmos I wouldn’t be surprised if it’s possible that through policy management the way data is protected could change over time. Specifically, while data is being frequently accessed, one could use object replication or less compressed encoding to speed up access but once access frequency diminishes (or time passes), data can then protected with more storage efficient encoding schemes which would reduce the data footprint in the cloud while still offering similar resiliency to data loss.

Full disclosure I have worked for Cleversafe in the past and although I am currently working with EMC, I have had no work from EMC’s Atmos team.

ESRP 1K to 5Kmbox performance – chart of the month

ESRP 1001 to 5000 mailboxes, database transfers/second/spindle

ESRP 1001 to 5000 mailboxes, database transfers/second/spindle

One astute reader of our performance reports pointed out that some ESRP results could be skewed by the number of drives that are used during a run.  So, we included a database transfers per spindle chart in our latest Exchange Solution Review Program (ESRP) report on 1001 to 5000 mailboxes in our latest newsletter.  The chart shown here is reproduced from that report and shows the number of overall database transfers attained (total of read and write) for the top 10 storage subsystems reporting in the latest ESRP results.

This cut of the system performance shows a number of diverse systems:

  • Some storage systems had 20 disk drives and others had 4.
  • Some of these systems were FC storage (2), some were SAS attached storage (3), but most were iSCSI storage.
  • Mailbox counts supported by these subsystems ranged from 1400 to 5000 mailboxes.

What’s not shown is the speed of the disk spindles. Also none of these systems are using SSD or NAND cards to help sustain their respective workloads.

A couple of surprises here:

  • iSCSI systems should have shown up much worse than FC storage. True, the number 1 system (NetApp FAS2040) is FC while the numbers 2&3 are iSCSI, the differences are not that great.  It would seem that protocol overhead is not a large determinant in spindle performance for ESRP workloads.
  • The number of drives used also doesn’t seem to matter much.  The FAS2040 had 12 spindles while the AX4-5i only had 4.  Although this cut of the data should minimize drive count variability, one would think that more drives would result in higher overall performance for all drives.
  • Such performance approaches drive limits of just what a 15Krpm drive can sustain.  No doubt some of this is helped by system caching, but no amount of cache can hold all the database write and read data for the duration of a Jetstress run.  It’s still pretty impressive, considering typical 15Krpm drives (e.g., Seagate 15K.6) can probably do ~172 random 64Kbyte block IOs/second. The NetApp FAS2040 hit almost 182 database transfers/second/spindle, perhaps not 64Kbyte blocks and maybe not completely random, but impressive nonetheless.

The other nice thing about this metric, is that it doesn’t correlate that well with any other ESRP metrics we track, such as aggregate database transfers, database latencies, database backup throughput etc. So it seems to measure a completely different dimension of Exchange performance.

The full ESRP report went out to our newsletter subscribers last month and a copy of the report will be up on the dispatches page of our website later this month. However, you can get this information now and subscribe to future newsletters to receive future full reports even earlier, just email us at SubscribeNews@SilvertonConsulting.com?Subject=Subscribe_to_Newsletter.

As always, we welcome any suggestions on how to improve our analysis of ESRP or any of our other storage system performance results.  This new chart was a result of one such suggestion.

SNIA Tech Center Grand Opening – Mystery Storage Contest

SNIA Technology Center Grand Opening Event - mingling before the show

SNIA Technology Center Grand Opening Event - mingling before the show

Yesterday in Colorado Springs SNIA held a grand opening for their new Technology Center.  They have moved their tech center about a half mile closer to Pikes Peak.

The new center has less data center floor space than the old one but according to Wayne Adams/EMC and Chairman of SNIA Board, this is a better fit for what SNIA is doing today.  These days SNIA doesn’t do as many plugfests requiring equipment to all be co-located on the same data center floor.  Most of SNIA’s plugfests today are done over the web, remotely, across continent wide distances.

SNIA’s new tech center is being leased from LSI which occupies the other half of the building.  If you were familiar with the old tech center it was leased from HP which resided in the other half of the old tech center.  This seems to work well for SNIA in the past, providing access to a large number of technical experts which can be called on to help out when needed.

A couple of things I didn’t realize about SNIA:

  • They have been in Colorado Springs since 2001
  • They have only 14 USA employees but over 4000 volunteers
  • They host a world wide IO trace library which member companies contribute and can access
  • All the storage equipment in their data center is provided for free by vendor/member companies.
  • SNIA certification training is one of the top 10 certifications as judged by an independent agency

Took a tour of their technology display area highlighting SNIA initiatives:

SNIA Green Storage Initiative display station

SNIA Green Storage Initiative display station

  • Green Storage Initiative display had a technician working on getting the power meter working properly.  One of only two analyzers I saw in their data centers.  The green storage initiative is all about the energy consumption of storage.
  • Solid State Storage Initiative display had a presentation on the SSSI activities and white papers which they have produced on this technology.
  • XAM initiative section had a couple of people talking about the importance of XAM to compliance activities and storage archives.
  • FCIA section had a talk on FC and its impact on storage present and futures.

Other initiatives were on display as well but I spent less time studying them.  In the conference room and 2 training rooms, SNIA had presentations on their Certification activity and storage training opportunities.  Howie Goldstein (HGAI Associates) was in one of the training rooms talking about education he provides through SNIA.

SNIA Tech Center Computer Lab 1

SNIA Tech Center Computer Lab 1

The new tech center has two computer labs.  Lab 1 seemed to have just about every vendors storage hardware.  As you can see from the photo, each storage subsystem was dedicated to SNIA initiative activities.  Didn’t see a lot of servers using this storage but they were probably located in computer lab 2.  In the picture one can see EMC, HDS, APC, and at the end, 3PAR storage.  On the other side of the aisle (not shown) was HP, NetApp, PillarData, and more HDS storage (and probably missed one or two more).

Don’t recall the SAN switch hardware but it wouldn’t surprise me to include a representative selection of all the vendors.  There was more switch hardware in Lab 2 and there we could make easily make out (McData or now) Brocade switch hardware.

SNIA Tech Center Computer Lab 2 switching hw

SNIA Tech Center Computer Lab 2 switching hw

Computer Lab 2 seemed to have most of the server hardware and more storage.  But both labs looked pretty clean from my perspective, probably due to all the press and grand opening celebration.  It ought to look more lived in/worked in over time.  I always like labs to be a bit more chaotic (see my price of quality post for a look at a busy HP EVA lab).

You would think with all SNIAs focus on plugfests there would be a lot more hardware analyzers floating around the two labs.  But outside of the power meter in the Green Storage Initiative display the only other analyzer like equipment I saw was a lone workstation behind some of the storage in lab 2.  It was way too clean to actually be in use. There ought to be post-it notes all over it, cables hanging all around it, with manuals and other documentation underneath it (but maybe I am a little old school docs should all be online nowadays).  Also, I didn’t see one white board in either of the labs, also a clear sign of early life.

SNIA Tech Center Lab 2 Lone portable workstation

SNIA Tech Center Lab 2 Lone portable workstation

We didn’t get to see much of the office space but it looked like plenty of windows and decent sized offices. Not sure how many people would be assigned to each but they have to put the volunteers and employees someplace.

Mystery Storage Contest – 1

And now for a new contest. See if you can determine the storage I am showing in the photo below.  Please submit your choice via comment(s) to this post and be sure to supply a valid email address.

Contest participant(s) will all receive a subscription to my (free) monthly Storage Intelligence email newsletter.  One winner will be chosen at random from all correct entries and will earn a free coupon code for 30% off any Silverton Consulting Briefings purchased through the web (once I figure out how to do this).

Correct entries must supply a valid email and identify the storage vendor and product model depicted in the picture.  Bonus points will be awarded for anyone who can tell the raw capacity of the subsystem in the picture.

SNIA volunteers, SNIA employees, SNIA member company employees and family members of any of these are not allowed to submit answers.  The contest will be closed 90 days after this post is published.  (And would someone from SNIA Technology Center please call me at 720-221-7270 and provide the identification and raw capacity of the storage subsystem depicted below in Computer Lab 2 on the NorthWest Wall.)

SNIA Technology Center - Mystery Storage 1

SNIA Technology Center - Mystery Storage 1

Remember our first Mystery Storage Contest closes in 90 days.

Also if you would like to submit an entry picture for future mystery storage contests please indicate so in your comment and I will be happy to contact you directly.

15PB a year created by CERN

The Large Hadron Collider/ATLAS at CERN by Image Editor (cc) (from flickr)

The Large Hadron Collider/ATLAS at CERN by Image Editor (cc) (from flickr)

That’s what CERN produces from their 6 experiments each year.  How this data is subsequently accessed and replicated around the world is an interesting tale in multi-tier data grids.

When an experiment is run at CERN, the data is captured locally in what’s called Tier 0.  After some initial processing, this data is stored on tape at Tier 0 (CERN) and then replicated to over 10 Tier 1 locations around the world which then become a second permanent repository for all CERN experiment data.  Tier 2 centers can request data from Tier 1 locations and process the data and return results to Tier 1 for permanent storage.  Tier 3 data centers can request data and processing time from Tier 2 centers to analyze the CERN data.

Each experiment has it’s own set of Tier 1 data centers that store its results.  According to the latest technical description I could find, the Tier 0 (at CERN) and most Tier 1 data centers provide a tape storage permanent repository for experimental data frontended by a disk cache.  Tier 2 can have similar resources but are not expected to be a permanent repository for data.

Each Tier 1 data center has it’s own hierarchical management system (HMS) or mass storage system (MSS) based on any number of software packages such as HPSS, CASTOR, Enstore, dCache, DPM, etc., most of which are open source products.  But regardless of the HMS/MSS implementation they all a set of generic storage management services based on the Storage Resource Manager (SRM) as defined by a consortium of research centers and provide a set of file transfer protocols defined by yet another set of standards by Globus or gLite.

Each Tier 1 data center manages their own storage element (SE).  Each experiment storage element has disk storage with optionally tape storage (using one or more of the above disk caching packages) and provides authentication/security, provides file transport,  and maintains catalogs and local databases.  These catalogs and local databases index the data sets or files available on the grid for each experiment.

Data stored in the grid are considered read-only and can never be modified.  It is intended for users that need to process this data read it from Tier 1 data centers, process the data and create new data which is then stored in the grid. New data added Data to be placed in the grid must be registered to the LCG file catalogue and transferred to a storage element to be replicated throughout the grid.

CERN data grid file access

“Files in the Grid can be referred to by different names: Grid Unique IDentifier (GUID), Logical File Name (LFN), Storage URL (SURL) and Transport URL (TURL). While the GUIDs and LFNs identify a file irrespective of its location, the SURLs and TURLs contain information about where a physical replica is located, and how it can be accessed.” (taken from the gLite user guide).

  • GUIDs look like guid:<unique_string> files are given an unique GUID when created and can never be changed.  The unique string portion of the GUID is typically a combination of MAC address and time-stamp and is unique across the grid.
  • LFNs look like lfn:<unique_string> files can have many different LFNs all pointing or linking to the same data.  LFN unique strings typically follow unix-like conventions for file links.
  • SURLs look like srm:<se_hostname>/path and provide a way to access data located at a storage element.  SURLs are transformed to TURLs.  SURLs are immutable and are unique to a storage element.
  • TURLs look like <protocol>://<se_hostname>:<port>/path and are obtained dynamically from a storage element.  TURLs can have any format after the // that uniquely identifies the file to the storage element but typically they have a se_hostname, port and file path.

GUIDs and LFNs are used to lookup a data set in the global LCG file catalogue .  After file lookup a set of site specific replicas are returned (via SURLs) which are used to request file transfer/access from a nearby storage element.  The storage element accepts the file’s SURL and assigns a TURL which can then be used to transfer the data to wherever it’s needed.  TURLs can specify any file transfer protocol supported across the grid

CERN data grid file transfer protocols supported for transfer and access currently include:

  • GSIFTP – a grid security interface enabled subset of the GRIDFTP interface as defined by Globus
  • gsidcap – a grid security interface enabled feature of dCache for ftp access.
  • rfio – remote file I/O supported by DPM. There is both a secure and an insecure version of rfio.
  • file access – local file access protocols used to access the file data locally at the storage element.

While all storage elements provide the GSIFTP protocol, the other protocols supported depend on the underlying HMS/MSS system implemented by the storage element for each experiment.  Most experiments use one type of MSS throughout their  world wide storage elements and as such, offer the same file transfer protocols throughout the world.

If all this sounds confusing, it is.  Imagine 15PB a year of data replicated to over 10 Tier 1 data centers which can then be securely processed by over 160 Tier 2 data centers around the world.  All this supports literally thousands of scientist who have access to every byte of data created by CERN experiments and scientists that post process this data.

Just exactly how this data is replicated to the Tier 1 data centers and how a scientists processes such data must be the subject for other posts.