Technology selection and trusted information sources

Biblioteca José Vasconcelos / Vasconcelos Library by * CliNKer * (from flickr) (cc)

A rather comprehensive selection of papers on Information Overload was compiled for the recent IEEE Engineering Management Review (EMR, vol. 38, no. 1, March 2010).  Among the many excellent papers was one that seemed especially relevant to many of my readers: Managing Technology Information Overload; Which Sources of Knowledge are Best? by C.J. Rhoads, on the faculty at Kutztown University and with ETM Associates.

Rhoads surveyed top decision makers in businesses listed in Chamber of Commerce and newspaper directories regarding information technology (IT) use and selection decisions.  Industries surveyed included Education, Healthcare, Manufacturing, Media & Publishing, Non-Profit, Retail, and, most commonly, Services.  In all, 584 responses were received. (More information on the research can be found in the article.)

Many questions were asked, but the two items of most interest to me were:

  • Who in an organization was involved in technical decisions?
  • What source did those people most trust to help them decide?

Who decides?

It turns out that ” … the person in the technology experienced role was involved in the decision only 19% of the time.”  According to Rhoads’ research, the CEO was most involved, at 51% of the time.  As Rhoads explains, this could be because the research covered a statistically representative sample of businesses, a high percentage of which were “… on the smaller side.” I suppose most sales organizations would agree wholeheartedly with this result.  Nonetheless, such minimization of technical insight makes the information these people use to make technical decisions even more important.

What do they trust?

Rhoads selected five information sources to discover which were most used and most trusted by IT decision makers: “top” consulting firms (such as Gartner, Giga, Meta, Forrester, and others), friends & family, publication and web resources, vendors, and local consulting firms.  Rhoads’ survey results revealed, by a statistically significant margin, that people making IT decisions trusted local consulting companies more often than any of the other sources.  Once again, the size of the companies surveyed may bias against the use of top consultancies, given their relatively high expense. Nevertheless, even local consultants aren’t as inexpensive as some of the other sources of information. (Almost makes me glad that I represent a small, LOCAL consulting company.)

In addition to the above results, Rhoads’ study classified the IT use effectiveness of the organizations surveyed.  As a result, Rhoads was also able to determine which information sources influenced “savvy”, “blossoming”, “base”, and “unversed” users of IT.  The survey found that savvy users were most influenced by local consultancies and that both savvy and blossoming IT users were influenced second-most by publication and web resources.  (Makes me also glad to be a blogger.)

There is a lot more interesting material in the article, and I found at least two other papers in the EMR compendium worth reading.

SPC-1 Results IOPS vs. Capacity – chart of the month

SPC-1* IOPS vs. Capacity, (c) 2010 Silverton Consulting, All Rights Reserved

This chart is from SCI’s report last month on recent Storage Performance Council (SPC) benchmark results. There were a couple of new entries this quarter, but we decided to introduce this new chart as well.

This is a bubble scatter plot of SPC-1(TM) (online transaction workload) results. Only storage subsystems that cost less than $100/GB are included, to introduce some fairness.

  • Bubble size is a function of the total cost of the subsystem
  • Horizontal axis is subsystem capacity in GB
  • Vertical axis is peak SPC-1 IOPS(TM)

We also decided to show a linear regression line and equation to better analyze the data. As shown in the chart, there is a pretty good correlation between capacity and IOPS (R**2 of ~0.8). The equation parameters can be read from the chart, and the fit looks pretty tight from a visual perspective.
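For readers who want to reproduce that kind of fit themselves, here is a minimal sketch using NumPy; the capacity/IOPS pairs below are made-up placeholders, not actual SPC-1 results (those are in the chart itself).

```python
# Hypothetical illustration of the linear fit shown in the chart; the
# capacity/IOPS pairs below are made up, not actual SPC-1 results.
import numpy as np

capacity_gb = np.array([2_000, 10_000, 25_000, 50_000, 90_000], dtype=float)
peak_iops   = np.array([12_000, 45_000, 95_000, 180_000, 320_000], dtype=float)

# Least-squares fit: IOPS ~= slope * capacity + intercept
slope, intercept = np.polyfit(capacity_gb, peak_iops, 1)

# Coefficient of determination (R**2) for the fit
predicted = slope * capacity_gb + intercept
ss_res = np.sum((peak_iops - predicted) ** 2)
ss_tot = np.sum((peak_iops - peak_iops.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"IOPS ~ {slope:.2f} * GB + {intercept:.0f}, R**2 = {r_squared:.2f}")
```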

The one significant outlier here, at ~250K IOPS, is the TMS RAMSAN, which uses SSD technology. The two large bubbles at the top right are two IBM SVC 5.1 runs at similar backend capacity; the top SVC run had 6 nodes and the bottom only 4.

As always, a number of caveats to this:

  • Not all subsystems on the market today are benchmarked with SPC-1
  • The pricing cap eliminated high priced storage from this analysis
  • SPC-1 IOPS may or may not be representative of your workloads.

Nevertheless, most storage professionals come to realize that having more disks often results in better performance. This is confounded by the RAID type used, disk drive performance, and cache size. However, the nice thing about SPC-1 runs is that most (nearly all) use RAID 1, have the largest cache size that makes sense, and use the best performing disk drives (or SSDs). The conclusion could not be more certain – the more RAID 1 capacity one has, the higher the number of IOPS one can attain from a given subsystem.

The full SPC report went out to our newsletter subscribers last month, and a copy of the report will be up on the dispatches page of our website later this month. However, you can get this information now and receive future full reports even earlier by subscribing to our newsletter: just email us at SubscribeNews@SilvertonConsulting.com?Subject=Subscribe_to_Newsletter.

As always, we welcome any suggestions on how to improve our analysis of SPC or any of our other storage system performance results. This new chart was a result of one such suggestion.

Cloud Storage Gateways Surface

Who says there are no clouds today by akakumo (cc) (from Flickr)

One problem holding back general purpose cloud storage has been the lack of a “standard” way to get data in and out of the cloud.  Most cloud storage providers supply a REST interface, an object file interface or other proprietary ways to use their facilities.  The problem with this is that they all require some form of code development on the part of the cloud storage customer in order to make use of these interfaces.

It would be much easier if cloud storage could just talk iSCSI, FCoE, FC, NFS, CIFS, FTP, and other standard access protocols.  Then any data center could use the cloud with a NIC/HBA/CNA and just configure the cloud storage as a bunch of LUNs or file systems/mount points/shares.  FCoE or FC would probably be difficult to use due to timeouts or other QoS (quality of service) issues, but iSCSI and the file-level protocols should be able to support cloud storage access without such concerns.

So which cloud storage providers support these protocols today?  Nirvanix supplies CloudNAS, used to access their facilities via NFS, CIFS and FTP; ParaScale supports NFS and FTP; while Amazon S3 and Rackspace CloudFiles do not seem to support any of these interfaces.  There are probably other general purpose cloud storage providers I am missing here, but these will suffice for now.   Wouldn’t it be better if some independent vendor supplied one way to talk to all of these storage environments?

How can gateways help?

For one example, Nasuni recently emerged from stealth mode, releasing a beta version of a cloud storage gateway that supports file access to a number of providers. Currently, Nasuni supports the CIFS file protocol as a front end for Amazon S3, IronMountain ASP, Nirvanix, and, coming soon, Rackspace CloudFiles.

However, Nasuni is more than just a file protocol converter for cloud storage.  It also supplies a data cache, file snapshot services, data compression/encryption, and other cloud storage management tools. Specifically,

  • Cloud data cache – their gateway maintains a disk cache of frequently accessed data that can be accessed directly without having to go out to the cloud storage.  File data is chunked by the gateway and flushed out of cache to the backend provider. How such a disk cache is maintained coherently across multiple gateway nodes was not discussed.
  • File snapshot services – their gateway supports a point-in-time copy of file data used for backup and other purposes.  The snapshot is created on a time schedule and provides an incremental backup of cloud file data.  Presumably these snapshot chunks are also stored in the cloud.
  • Data compression/encryption services – their gateway compresses file chunks and then encrypts them before sending them to the cloud.  Encryption keys can optionally be maintained by the customer or be automatically maintained by the gateway.
  • Cloud storage management services – the gateway configures the cloud storage services needed to define volumes, monitors cloud and network performance and provides a single bill for all cloud storage used by the customer.

By chunking the files and caching them, data read from the cloud should be accessible much faster than normal cloud file access.  Also by providing a form of snapshot, cloud data should be easier to backup and subsequently restore. Although Nasuni’s website didn’t provide much information on the snapshot service, such capabilities have been around for a long time and found very useful in other storage systems.
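To make the chunk-and-cache flow concrete, here is a rough sketch of what a generic gateway write path might look like. This is my own illustration under stated assumptions – the chunk size, the upload_chunk() call, and the key handling are all hypothetical – and not Nasuni’s actual implementation.

```python
# Sketch of a generic cloud-gateway write path: chunk, cache locally, then
# compress + encrypt and push to the provider.  Purely illustrative -- not
# Nasuni's implementation; upload_chunk() stands in for whatever provider
# API (e.g., a REST PUT) is actually used.
import hashlib
import zlib
from cryptography.fernet import Fernet

CHUNK_SIZE = 1 << 20              # assume 1MB chunks; real gateways pick their own
key = Fernet.generate_key()       # in practice, customer- or gateway-managed keys
cipher = Fernet(key)
local_cache = {}                  # chunk-id -> plaintext chunk (the "disk cache")

def upload_chunk(chunk_id: str, blob: bytes) -> None:
    """Hypothetical provider upload call."""
    pass

def write_file(path: str) -> list[str]:
    """Chunk a file, cache the chunks, and flush compressed+encrypted copies."""
    manifest = []                              # the file's "table of contents"
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            chunk_id = hashlib.sha256(chunk).hexdigest()
            local_cache[chunk_id] = chunk      # later reads served from cache
            blob = cipher.encrypt(zlib.compress(chunk))
            upload_chunk(chunk_id, blob)
            manifest.append(chunk_id)
    return manifest
```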

Nasuni is provided as a software-only solution. Once installed and activated on your server hardware, it’s billed as a service, charged on top of any cloud storage you use.  You sign up for supported cloud storage providers through Nasuni’s service portal.

How well all this works is open for discussion.  We have discussed caching appliances before, both from EMC and others.  Two issues have emerged from those discussions: maintaining cache coherence across nodes is non-trivial, and the economics of a caching appliance are subject to some debate.  However, cloud gateways are more than just caching appliances, and as a way of advancing cloud storage adoption, such gateways can only help.

Full disclosure: I currently do no business with Nasuni.

Caching DaaD for federated data centers

Internet Splat Map by jurvetson (cc) (from flickr)

Today, I attended a webinar where Pat Gelsinger, President of Information Infrastructure at EMC, discussed their concept for a new product based on the Yotta Yotta technology they acquired a few years back.  Yotta Yotta’s product was a distributed, coherent caching appliance that had FC front-end ports, an InfiniBand internal network, and both FC and WAN backend links.

What one did with Yotta Yotta nodes was place them in front of your block storage and connect them together, locally via InfiniBand and remotely via a WAN technology of your choice (at the time); you could then access any data behind the appliances from any attached location.  They also provided very rapid transfer of bulk data between remote nodes. In short, their technology allowed very rapid data transmission over standard WAN interfaces/distances and provided a distributed cache, across those same distances, for the data behind the appliances.

I like caching appliances as much as anyone, but they became prominent in the late ’70s and early ’80s mostly because caching was hard to do in the storage subsystems of the day, and they went away a long time ago.  Nowadays you can barely purchase a lone disk drive without a cache in it.  So what’s different now?

Introducing DaaD

Today we have SSDs and much cheaper processing power.  I wrote about new caching appliances like DataRam‘s XcelaSAN in a Cache appliances rise from the dead post I did after last year’s SNW.  But EMC is going after a slightly broader domain – the world.  The caching appliance EMC is discussing is really intended to support distributed data access, or as I like to call it, Data-at-a-Distance (DaaD).

How can this work?  Data is stored on subsystems at various locations around the world.  A DaaD appliance is inserted in front of each of these and connected over the WAN. Some or all of that data is then re-configured (at block or, more likely, LUN level) to be accessible at a distance from each DaaD data center.  As each data center reads and writes data from/to its remote brethren, some portion of that data is cached locally in the DaaD appliance and the rest is only available by going to the remote site (with considerably higher latency).

This works moderately well for well-behaved, read-intensive workloads where 80% of the IO is to 20% of the data (most of which is cached locally).  But block writes present a particularly nasty problem, as any data write has to be propagated to all cached copies before being acknowledged.

It’s possible write propagation could be done by invalidating the data in cache (so any subsequent read would need to re-access the data from the original host).  Nevertheless, even to know which DaaD nodes hold a cached copy of a particular block, one needs to maintain a directory of all globally identifiable blocks held in any DaaD cache node at every moment in time.  Any such table would change often and would need to be updated very carefully – atomically, deadlock-free, and with non-failable transactions – and therein lies one of the technological hurdles.  Doing this quickly, without impacting performance, is another.
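To make that hurdle concrete, here is a toy sketch of a directory that tracks which nodes cache which blocks and invalidates remote copies on a write. It is my own illustration, not EMC’s or Yotta Yotta’s design, and it deliberately leaves out the locking, atomicity, and failure handling that are the actual hard parts.

```python
# Toy sketch of directory-based invalidation for a distributed block cache.
# Illustration only -- not EMC's/Yotta Yotta's design; ignores locking,
# failures, and atomic update, which are exactly the hard parts.
from collections import defaultdict

class GlobalBlockDirectory:
    def __init__(self):
        # block address -> set of node ids currently holding a cached copy
        self.holders = defaultdict(set)

    def record_read(self, block: int, node: str) -> None:
        """A node read the block and now caches a copy."""
        self.holders[block].add(node)

    def on_write(self, block: int, writer: str) -> set:
        """Invalidate every other node's cached copy before the write is acked."""
        to_invalidate = self.holders[block] - {writer}
        # In a real system each invalidation is a WAN round trip that must
        # complete (or be reliably queued) before the write is acknowledged.
        self.holders[block] = {writer}
        return to_invalidate

directory = GlobalBlockDirectory()
directory.record_read(block=42, node="denver")
directory.record_read(block=42, node="london")
print(directory.on_write(block=42, writer="denver"))   # -> {'london'}
```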

So, simple enough: EMC takes Yotta Yotta’s technology, updates it for today’s processors, networking, and storage, and releases it as a data center federation enabler. What can one do with a federated data center? Well, that’s another question; it involves Vmotion, and it must be a subject for a future post …

R&D effectiveness

A recent Gizmodo blog post compared a decade of R&D at Sony, Microsoft, and Apple.  There were some interesting charts, but mostly it showed that R&D as a percent of revenue fluctuates from year to year and that R&D spend has been rising for all three companies (although at different rates).

R&D Effectiveness, (C) 2010 Silverton Consulting, All Rights Reserved

Overall, on a percentage-of-revenue basis, Microsoft wins, spending ~15% of revenue on R&D over the past decade; Apple loses, spending only ~4%; and Sony is right in the middle, spending ~7%.  Yet viewed by its impact on corporate revenue, R&D spending had significantly different results at each company than pure % R&D spending would indicate.

How can one measure R&D effectiveness?

  • Number of patents – this is often used as an indicator, but it’s unclear how it correlates with business success.  Patents can be licensed, but only if they prove important to other companies. However, patent counts can be gauged early on, during R&D activities, rather than much later when a product reaches the market.
  • Number of projects – by projects we mean ideas from research taken into development.  Such projects may or may not make it out to market.  At one level this can be a leading indicator of “research” effectiveness, as it means an idea was deemed at least of commercial interest.  The problem is that not all projects get released to the market or become commercially viable.
  • Number of products – by products, we mean something sold to customers.  At least such a measure reflects that the total R&D effort was deemed worthy enough to take to market.  How successful such a product is remains to be determined.
  • Revenue of products – product revenue seems easy enough but can often be hard to allocate properly.  Looking at the iPhone, do we count just handset revenues, or include application and cell service revenues as well? Assuming one can properly allocate revenue sources to R&D efforts, one can come up with a revenue-from-R&D ratio.  The main problem with such ratios is that all the other non-R&D factors – e.g., marketing, manufacturing, competition – confound them.
  • Profitability of products – product profitability is even messier than revenue when it comes to confounding factors.  But ultimately, profitability of R&D efforts may be the best factor to use, as any product that’s truly effective should generate the most profit.

There are probably other R&D effectiveness factors that could be considered but these will suffice for now.
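As a trivial illustration of the revenue-based measure, one could compute revenue growth per R&D dollar over a period; the figures below are made up for the sketch, not any company’s actual financials.

```python
# Toy illustration of a revenue-based R&D effectiveness ratio.
# All figures are hypothetical, not actual Apple/Microsoft/Sony financials.
companies = {
    # name: (R&D spend over the period in $B, revenue growth over the same period in $B)
    "A": (5.0, 40.0),
    "B": (60.0, 30.0),
    "C": (35.0, 5.0),
}

for name, (rd_spend, revenue_growth) in companies.items():
    ratio = revenue_growth / rd_spend
    print(f"{name}: ${ratio:.2f} of revenue growth per R&D dollar")
```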

How did they do?

Returning to the Gizmodo discussion: their post didn’t include any patent counts, project counts (only visible internally), product counts, or profitability measures, but it did show revenue for each company.  From a purely revenue-impact perspective, one would have to say that Apple’s R&D was the clear winner, with Microsoft a clear second.  Admittedly, Apple started from considerably smaller revenue than Sony or Microsoft, but Apple’s ~$13.9B of revenue in 2005 was only small in comparison to such giants.  We all know the success of the iPhone and iPod, but they also stumbled with the Apple TV.

Why did they do so well?

What, then, makes Apple do so well?  We have talked before about an elusive quality we called visionary leadership.  Certainly Bill Gates is as technically astute as Steve Jobs, and there can be no denying that their respective marketing machines are evenly matched.  But both Microsoft and Apple were certainly led by more technical individuals than Sony over the last decade.   Both Microsoft and Apple have had significant revenue increases over the past ten years that parallel one another, while Sony, in comparison, has remained relatively flat.

I would say both Microsoft’s and Apple’s results show that “visionary leadership” has a technical component that can’t be denied.  Moreover, I think that if one looked at Sony under Akio Morita, HP under Bill Hewlett and Dave Packard, or many other large companies today, one could conclude that technical excellence is a significant component of visionary leadership.  All these companies’ highest revenue growth came under leadership with significant technical knowledge.  There’s more to visionary leadership than technicality alone, but it seems at least foundational.

I still owe a post on just what constitutes visionary leadership, but I seem to be surrounding it rather than attacking it directly.

WD’s new SiliconEdge Blue SSD data write spec

Western Digital's SiliconEdge Blue SSD SATA drive (from their website)

Western Digital (WD) announced their first SSD drive for the desktop/laptop market space today.  The drive offers the typical 256, 128, and 64GB capacity points over a SATA interface.  Performance looks OK at 5K random read or write IO/s, with sustained transfers at 250 and 140MB/s for read and write respectively.  But what caught my eye was a new specification I hadn’t seen before, indicating a maximum GB written per day of 17.5, 35, and 70GB/d for their drives, using WD’s Operational Lifespan – LifeEST(tm) definition.

I couldn’t find anywhere that said which NAND technology is used in the device, but it likely uses MLC NAND.  In a prior post we discussed a Toshiba study that said a “typical” laptop user writes about 2.4GB/d and a “heavy” laptop user writes about 9.2GB/d.  That data would indicate that WD’s new 64GB drive can handle almost 2X the defined “heavy” user workload for laptops, and their other drives would handle it just fine.  A data write rate for desktop work, as far as I can tell, has not been published, but presumably it would be greater than for laptop users.
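Putting the figures quoted above together gives a quick back-of-the-envelope comparison. The sketch assumes the 17.5/35/70GB/d limits map to the 64/128/256GB drives in that order, which the ~2X heavy-user headroom on the 64GB drive suggests but WD’s spec sheet would need to confirm.

```python
# Back-of-the-envelope check of WD's quoted daily-write limits against the
# Toshiba usage figures cited above.  The capacity -> GB/d mapping is an
# assumption inferred from the ~2X heavy-user headroom on the 64GB drive.
lifeest_gb_per_day = {64: 17.5, 128: 35.0, 256: 70.0}   # capacity GB -> max GB/day
typical_user_gb_per_day = 2.4
heavy_user_gb_per_day = 9.2

for capacity, limit in lifeest_gb_per_day.items():
    heavy_headroom = limit / heavy_user_gb_per_day
    typical_headroom = limit / typical_user_gb_per_day
    print(f"{capacity}GB drive: {heavy_headroom:.1f}x the 'heavy' laptop workload, "
          f"{typical_headroom:.1f}x the 'typical' one")
```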

From my perspective, more information on the drive’s underlying NAND technology, on what a LifeEST specification actually means, and on how much NAND storage is actually present would be nice, but these are all personal nits.  All that aside, I applaud WD for standing up and saying what data write rate their drives can support.  This needs to be a standard part of any SSD specification sheet, and I look forward to seeing more information like this from other vendors as well.

Describing Dedupe

Hard Disk 4 by Alpha six (cc) (from flickr)

Deduplication is a mechanism to reduce the amount of data stored on disk for backup, archive, or even primary storage.  In any storage, data is often duplicated, and any system that eliminates storing duplicate data will utilize storage more efficiently.

Essentially, deduplication systems identify duplicate data and store only one copy of it, using pointers to incorporate the duplicate data at the right point in the data stream. Such services can be provided at the source, at the target, or even at the storage subsystem/NAS system level.

The easiest way to understand deduplication is to view a data stream as a book: it consists of two parts, a table of contents and the actual chapters of text (or data).  The stream’s table of contents provides chapter titles but, more importantly (to us), identifies a page number for each chapter.  A deduplicated data stream looks like a book where chapters can be duplicated within the same book or even across books, and the table of contents can point to any book’s chapter when duplicated. A deduplication service takes in the data stream, searches for duplicate chapters, deletes them, and updates the table of contents accordingly.

There’s more to this, of course.  For example, chapters (duplicate data segments) must be tagged with how often they are duplicated so that such data is not lost when modified.  Also, one way to determine whether data is duplicated is to take one or more hashes of it and compare them to other data’s hashes; but to work quickly, those hashes must be kept in a searchable index.
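Here is a minimal sketch of that idea – fixed-size chunks, a hash per chunk, and a searchable index so duplicates are stored only once. Real products use variable-size chunking and far more scalable indexes, so treat this as illustration only.

```python
# Minimal illustration of hash-indexed deduplication: each unique chunk is
# stored once, and the "table of contents" records which chunk goes where.
# Real products use variable-size chunking and far more scalable indexes.
import hashlib

CHUNK_SIZE = 4096
chunk_store = {}            # hash -> unique chunk data

def dedupe(stream: bytes) -> list:
    table_of_contents = []  # ordered list of chunk hashes for this stream
    for i in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in chunk_store:       # only unique chunks consume space
            chunk_store[digest] = chunk
        table_of_contents.append(digest)
    return table_of_contents

def rehydrate(table_of_contents: list) -> bytes:
    """Re-constitute the original stream from its table of contents."""
    return b"".join(chunk_store[d] for d in table_of_contents)

data = b"abcd" * 4096                       # highly duplicated data
toc = dedupe(data)
assert rehydrate(toc) == data
print(f"{len(toc)} chunks referenced, {len(chunk_store)} stored uniquely")
```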

Types of deduplication

  • Source deduplication involves a repository, a client application, and an operation which copies client data to the repository.  Client software chunks the data, hashes the data chunks, and sends these hashes over to the repository.  On the receiving end, the repository determines which hashes are duplicates and then tells the client to send only the unique data.  The repository stores the unique data chunks and the data stream’s table of contents.
  • Target deduplication involves performing deduplication inline, in-parallel, or as post-processing, by chunking the data stream as it’s received, hashing the chunks, determining which chunks are unique, and storing only the unique data.  Inline refers to doing this processing while receiving data at the target system, before the data is stored on disk.  In-parallel refers to doing a portion of this processing while receiving data, i.e., portions of the data stream are deduplicated while other portions are still being received.  Post-processing refers to data that is completely staged to disk before being deduplicated later.
  • Storage subsystem/NAS system deduplication looks a lot like post-processing target deduplication.  NAS systems deduplicate a file of data after it is closed; general storage subsystems look at blocks of data after they are written. Whether either system detects duplicate data below these levels is implementation dependent.

Deduplication overhead

Deduplication processes generate most of their overhead while deduplicating the data stream, essentially during or after the data is written, which is the reason target deduplication has so many options – some optimize ingestion while others optimize storage use. There is very little additional overhead in re-constituting (or un-deduplicating) the data for read back, as retrieving the unique and/or duplicated data segments can be done quickly.  There may be some minor performance loss because of reduced sequentiality, but that only impacts data throughput, and not by much.

Where dedupe makes sense

Deduplication was first implemented for backup data streams, because any backup regime that takes full backups on a monthly or even weekly basis duplicates lots of data.  For example, if one takes a full backup of 100TB every week and new unique data created each week is ~15%, then at week 0, 100TB is stored for both the deduplicated and un-deduplicated versions; at week 1 it takes 115TB to store the deduplicated data but 200TB for the non-deduplicated data; at week 2 it takes ~132TB to store the deduplicated data but 300TB for the non-deduplicated data; and so on.  As each full backup completes, it takes another 100TB of un-deduplicated storage but significantly less deduplicated storage.  After 8 full backups the un-deduplicated storage would require 800TB, but only ~265TB would be needed for deduplicated storage.
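That arithmetic is easy to reproduce; here is a small sketch of the same example (100TB weekly fulls, ~15% new unique data per week), compounding the unique-data growth week over week.

```python
# Reproduces the full-backup example above: 100TB weekly fulls, ~15% new
# unique data per week, compared over 8 weekly full backups.
full_backup_tb = 100.0
new_data_rate = 0.15

deduped_tb = full_backup_tb
for week in range(8):
    undeduped_tb = full_backup_tb * (week + 1)
    print(f"week {week}: deduplicated ~{deduped_tb:.0f}TB, "
          f"un-deduplicated {undeduped_tb:.0f}TB")
    deduped_tb *= 1 + new_data_rate      # next week adds ~15% new unique data
```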

Deduplication can also work for secondary or even primary storage.  Most IT shops with 1000s of users duplicate lots of data.  For example, interim files are sent from one employee to another for review, reports are sent en masse to teams, emails are blasted to all employees, etc.  Consequently, any storage (sub)system that can deduplicate data would utilize backend storage more efficiently.

Full disclosure: I have worked for many deduplication vendors in the past.