Better storage through hardware

Apple's Xserve (from Apple.com)

Chuck Hollis from EMC wrote a post last week, Storage is software, about how hardware parts are becoming so commoditized and so highly functional that future storage differentiation will only come from software. I commented that hardware differentiation is also becoming much easier with FPGAs and their ilk. Chuck replied that this may be so, but asked whether anyone would pay the cost of such differentiation.

My reply deserves a longer discussion. Chuck mentioned Apple as one company differentiating successfully in hardware, but thought that this would not apply to storage.

Better storage through hardware differentiation

I am a big fan of Apple, so it's hard for me to see why something similar could not apply to storage. IMHO, what Apple has done better than the rest is to reconstruct the user experience, in its totality, from one of frustration to one of delight.

Can such a thing be done for storage and, if so, will it sell? I believe the answer to both questions is yes.

Will such a new storage product necessarily require hardware/FPGA development as much as software/firmware development?  Again, yes.

Will anyone create this “better” storage? No easy answers here.

Developing better storage

Such a task involves a complete remaking, from the ground up, of the storage product from the user/admin experience perspective. But the hard part is that the O/Ss and virtualization systems, not the storage, govern/control much of the storage user/admin experience. As such, much of this functionality will necessarily be done in software, not hardware.

However, that doesn't mean that hardware differentiation can't help. For example, consider storage interfaces. Today, it's not unusual to have 6 or more interfaces for a storage system. But it's hard for me to see why this couldn't be better served with two 10GbE ports, two 8Gb FC ports, and WiFi as an alternate admin interface. In a similar fashion, look at internal storage interfaces. It's hard for me to see any absolute requirement for cabling here. Ditto for power cabling. And all this just improves the out-of-the-box experience.

Could something similar be done for the normal storage configuration, monitoring, and protection activities? Most certainly. Even so, much of this differentiation would come via software/firmware and the O/S APIs being used. However, perhaps some small portion can make use of hardware/packaging differentiation.

I like to think that "I will know it when I see it". But when someone can take storage out of a box and "install, use and protect it" on any O/S or virtualization environment with nothing more than a poster of 5 to 7 blocks as a guide, such "Apple-like" storage will have arrived.

Until then, storage admins will need training, a “storage admin” will be part of a job description, and storage will be something “we do” rather than something “we use”.

So my final answer to Chuck is: will anyone do it? I don't know.

What do you think?

PC-as-a-Service (PCaaS) using VDI

IBM PC Computer by Mess of Pottage (cc) (from Flickr)

Last year at VMworld, VMware was saying that 2010 was the year for VDI (virtual desktop infrastructure); last week NetApp said that most large NY banks they talked with were looking at implementing VDI; and prior to that, HP StorageWorks announced a new VDI reference platform that could support ~1600 VDI images. It seems that VDI is gaining some serious interest.

While VDI works well for large organizations, there doesn't seem to be any similar solution for consumers. The typical consumer today usually runs downlevel OSs, anti-virus, office applications, etc., and has neither the time nor the inclination to update such software. These consumers would be considerably better served by something like PCaaS, if such a thing existed.

PCaaS

Essentially, PCaaS would be a VDI-like service offering, using standard VDI tools or something similar: a lightweight client kernel with use of locally attached resources (printers, USB sticks, scanners, etc.), but running applications that are hosted elsewhere. PCaaS could provide all the latest O/S and application versions and provide enterprise-class reliability, support and backup/restore services.

Broadband

One potential problem with PCaaS is the need for reliable broadband to the home. Just like other cloud services, without broadband, none of this will work.

Possibly this could be circumvented if a PCaaS viewer browser application were available (like VMware's Viewer). With this in place, PCaaS could be supplied at any internet-enabled location supporting browser access. Such a browser-based service may not support the same rich menu of local resources as a normal PCaaS client, but it would probably suffice when needed. The other nice thing about a viewer is that smart phones, iPads and other always-on, web-enabled devices supporting standard browsers could provide PCaaS services from anywhere mobile data or WiFi were available.

PCaaS business model

As for businesses that could bring PC-as-a-Service to life, I see many potential providers:

  • Any current PC hardware vendor/supplier may want to supply PCaaS, as it may defer/reduce consumer hardware purchases or, rather, move such purchases from consumers to PCaaS providers.
  • Many SMB hosting providers could easily offer such a service.
  • Many local IT support services could deliver better and potentially less expensive services to their customers by offering PCaaS.
  • Any web hosting company would have the networking, server infrastructure and technical know-how to easily provide PCaaS.

This list ignores any new entrants that would see this as a significant opportunity.

Google, Microsoft and others seem to be taking small steps in this direction in a piecemeal fashion, with cloud-enabled office/email applications. However, in my view what the consumer really wants is a complete PC, not just some select group of office applications.

As described above, PCaaS would bring enterprise-level IT desktop services to the consumer marketplace. Any substantive business in PCaaS would free up untold numbers of technically astute individuals providing unpaid, on-call support to millions, perhaps billions, of technically challenged consumers.

Now if someone would just come out with Mac-as-a-Service, I could retire from supporting my family’s Apple desktops & laptops…

Building a green data center

Diversity in the Ecological Soup by jurvetson (cc) (from Flickr)

At NetApp's Analyst Days last week, David Robbins, CTO of Information Technology, reported on a new, highly efficient Global Dynamic Lab (GDL) data center they built in Raleigh, North Carolina. NetApp predicts this new data center will have a power usage effectiveness (PUE) ratio of 1.2. Most data centers today do well if they can attain a PUE of 2.0.

Recall that PUE is the ratio of all power required by the data center (which includes such things as IT power, chillers, fans, UPS, transformers, humidifiers, lights, etc.) over just IT power (for racks, storage, servers, and networking gear). A PUE of 2 says that as much power is used to power and cool the rest of the data center as is used by the IT equipment itself. An EPA report on Server and Data Center Efficiency said that data centers could reach a PUE of 1.4 if they used the state-of-the-art techniques outlined in the report. A PUE of 1.2 is a dramatic improvement in data center power efficiency, cutting non-IT power in half even relative to that state-of-the-art figure.
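To make the ratio concrete, here is a minimal sketch of the PUE arithmetic, assuming a hypothetical 1MW IT load (the numbers are illustrative, not NetApp's):

```python
def pue(it_power_kw, non_it_power_kw):
    """PUE = total facility power / IT equipment power."""
    return (it_power_kw + non_it_power_kw) / it_power_kw

it_load = 1000.0                 # assumed 1MW of IT load (racks, servers, storage, network)

print(pue(it_load, 1000.0))      # 2.0 -> non-IT power equals IT power (typical data center)
print(pue(it_load, 400.0))       # 1.4 -> the EPA "state of the art" target
print(pue(it_load, 200.0))       # 1.2 -> GDL's prediction, half the non-IT power of the 1.4 case
```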

NetApp used many innovations to reach this power effectiveness at GDL. The most important ones were:

  • Cooling at higher temperatures which allowed for the use of ambient air
  • Cold-room, warm aisle layout which allowed finer control over cooling delivery to the racks
  • Top-down cooling which used physics to reduce fan load.

GDL was designed to accommodate the higher rack power densities coming from today's technology. GDL supports an average of 12kW per rack and can handle a peak load of 42kW per rack. In addition, GDL uses 52U tall racks, which helps reduce data center footprint. Such high-powered, high-density racks require rethinking data center cooling.

Cooling at higher temperatures

Probably the most significant factor that improved PUE was planning for the use of much warmer air temperatures. By using warmer air (70-80°F/21.1-26.7°C), much of the cooling could now be based on ambient air rather than chilled air. NetApp estimates that it can use ambient air 75% of the year in Raleigh, a fairly warm and humid location. As such, GDL chiller use drops significantly, generating major energy savings from the number 2 power consumer in most data centers.

Also, NetApp is able to use ambient air for partial cooling for much of the rest of the year, in conjunction with chillers. Air handlers were purchased that could use outside air, chillers, or a combination of the two. GDL chillers also operate more efficiently at the higher temperatures, reducing power requirements yet again.

Given the temperature rise of typical IT equipment cooling of ~20-25°F/11.1-13.9°C, one potential problem is that the warm aisles can exceed 100°F/37.8°C, which is about the upper limit for human comfort. Fortunately, by detecting lighting use in the hot aisles, GDL can increase cold room cooling to bring temperatures in adjacent hot aisles down to a more comfortable level when humans are present.

One other significant advantage of using warmer temperatures is that warmer air is easier to move than colder air. This provides savings by allowing lower-powered fans to cool the data center.

Cold rooms-warm aisles

GDL built cold rooms on the front side of the racks and a relatively open warm aisle on the other side. Such a design provides uniform cooling from the top to the bottom of a rack. With a more open-air design, hot air often accumulates and is trapped at the top of the rack, which requires more cooling to compensate. By sealing the cold room, GDL ensures more even cooling of the rack and, thus, more efficient use of cooling.

Another advantage of cold rooms-warm aisles is that cooling can be regulated by the pressure differential between the two aisles rather than by flow control or spot temperature sensors. Such regulation allows GDL to reduce the air supply to match rack requirements and thereby avoid the excess cooling required by more open designs using flow or temperature sensors.
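As a rough illustration of pressure-based regulation, here is a minimal control-loop sketch; the setpoint, gain and the `read_pressure_pa`/`set_fan_speed` functions are hypothetical stand-ins, not GDL's actual building controls:

```python
import time

TARGET_DIFF_PA = 5.0     # assumed: keep the cold room ~5 Pa above the warm aisle
GAIN = 2.0               # assumed proportional gain, fan % per Pa of error

def read_pressure_pa(location: str) -> float:
    """Hypothetical sensor read; a real system would poll BMS pressure sensors."""
    raise NotImplementedError

def set_fan_speed(pct: float) -> None:
    """Hypothetical air-handler interface."""
    raise NotImplementedError

def regulate(cycles: int = 1000, fan_pct: float = 50.0) -> None:
    for _ in range(cycles):
        diff = read_pressure_pa("cold_room") - read_pressure_pa("warm_aisle")
        # If the differential sags, racks are pulling more air than is supplied,
        # so speed the fans up; if it's too high, slow them down and save power.
        fan_pct += GAIN * (TARGET_DIFF_PA - diff)
        fan_pct = max(0.0, min(100.0, fan_pct))
        set_fan_speed(fan_pct)
        time.sleep(30)
```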

Top down cooling

I run into this every day at my office: cool air is dense and flows downward; hot air is light and flows upward. NetApp designed GDL to have air handlers on top of the computer room rather than elsewhere. This eliminates much of the ductwork, which often reduces air flow efficiency and requires increased fan power to compensate. Also, by piping the cooling in from above, physics helps get that cold air to the racked equipment that needs it. As for the hot aisles, warm air naturally rises to the air return above the aisles and can then be vented to the outside, mixed with outside ambient air, or chilled before it's returned to the cold room.

For normal data centers cooled from below, fan power must be increased to move the cool air up to the top of the rack. GDL's top-down cooling reduces fan power requirements substantially compared to below-the-floor cooling.

—-

There were other approaches that helped GDL reduce power use, such as using hot air for office heating, but these seemed to be the main ones. Much of this was presented at NetApp's Analyst Days last week. Robbins has written a white paper that goes into much more detail on GDL's PUE savings and other benefits that accrued to NetApp when they built this data center.

One nice surprise was the capital cost savings generated by GDL's power-efficient data center design. This was also detailed in the white paper, but at the time this post was published the paper was not yet available.

Now that summer’s here in the north, I think I want a cold room-warm aisle for my office…

One iPad per Child (OipC)

OLPC XO Beta1 (from wikipedia.org)

I started thinking today that the iPad, with some modifications, could be used to provide universal computing and information services to the world's poor as a One iPad per Child (OipC). Such a solution could easily replace the One Laptop Per Child (OLPC) that exists today with a more commercially viable product.

From my perspective, only a few additions would make the current iPad ideal for universal OipC use. Specifically, I would suggest we add:

  • Solar battery charger – perhaps the back could be replaced with a solar panel to charge the battery.  Or maybe the front could be reconfigured to incorporate a solar charger underneath or within its touch panel screen.
  • Mesh WiFi – rather than being a standard WiFi target, it would be more useful for the OipC to support a mesh-based WiFi system. Such a mesh WiFi could route internet request packets/data from one OipC to another until a base station is encountered that provides a broadband portal for the mesh (a minimal routing sketch follows this list).
  • Open source free applications – it would be nice if more open office applications were ported to the new OipC so that free office tools could be used to create content.
  • External storage  – software support for NFS or CIFS over WiFi would allow for a more sophisticated computing environment and together with the mesh WiFi would allow a central storage repository for all activities.
  • Camera – for photos and video interaction/feedback.
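As mentioned in the mesh WiFi bullet above, here is a minimal flooding-style relay sketch; the `Node` class, its forwarding rules and the node names are hypothetical simplifications, not a real mesh protocol such as 802.11s:

```python
class Node:
    """A hypothetical OipC mesh node that relays packets toward a base station."""
    def __init__(self, name, is_base_station=False):
        self.name = name
        self.is_base_station = is_base_station
        self.neighbors = []      # other Nodes within WiFi range
        self.seen = set()        # packet ids already handled, to stop loops

    def receive(self, packet_id, payload, hops=0, max_hops=8):
        if packet_id in self.seen or hops > max_hops:
            return               # drop duplicates and over-traveled packets
        self.seen.add(packet_id)
        if self.is_base_station:
            print(f"{self.name}: forwarding {payload!r} to the internet")
            return
        for peer in self.neighbors:              # naive flood toward any base station
            peer.receive(packet_id, payload, hops + 1, max_hops)

# Tiny example: one OipC relays through another to the village market's portal.
a, b, base = Node("oipc-a"), Node("oipc-b"), Node("market", is_base_station=True)
a.neighbors, b.neighbors = [b], [a, base]
a.receive("pkt-1", "GET http://example.org/lesson1")
```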

    iPad (from wikipedia.org)

Probably other changes would be needed, but these will suffice for discussion purposes. With such a device and reasonable access to broadband, the world's poor could easily have most of the information and computing capabilities of the richest nations. They would have access to the Internet and, as such, could participate in remote K-12 education as well as obtain free courseware from university internet sites. They would have access to online news, internet calling/instant messaging, and free email services that could connect them to the rest of the world.

I believe most of the OipC hardware changes could be viable additions to the current iPad, with the possible exception of the mesh WiFi. But there might be a way to make a mesh WiFi that is software configurable with only modest hardware changes (using software-defined radio transceivers).

Using the current iPad

Of course, the present iPad without change could be used to support all this, if one were to add some external hardware/software:

  • An external solar panel charging system – multiple solar charging stations for car batteries exist today which are used in remote areas.  If one were to wire up a cigarette lighter and purchase a car charger for the iPad this would suffice as a charging station. Perhaps such a system could be centralized in remote areas and people could pay a small fee to charge their iPads.
  • A remote WiFi hot spot – there are many ways to supply WiFi hot spots to rural areas. I heard that at one time Ireland was providing broadband to rural areas by using local pubs as hot spots. Perhaps a local market could be wired/radio-connected to support village WiFi.
  • A camera – buy a cheap digital camera and the iPad camera connection kit.  This lacks real time video streaming but it could provide just about everything else.
  • Apps and storage – software apps could be produced by anyone. Converting OpenOffice to work on an iPad doesn't appear that daunting, aside from finding someone with the desire to do it. External iPad storage can be provided today via cloud storage applications. Supplying native NFS or CIFS support that other apps could use would be more difficult, but could be provided if there were a market.

The nice thing about the iPad is that it's a monolithic, complete unit. Other than power, there are minimal buttons, moving parts, or external components present. Such simplified componentry should make it more easily usable in all sorts of environments. I'm not sure how rugged the current iPad is or how well it would hold up in rural areas without shelter, but this could easily be gauged and changes made to improve its survivability.

OipC costs

Having the mesh, solar charger, and camera all onboard the OipC would make it easier to deploy but certainly not cheaper. The current 16GB iPad's parts and labor come in around US$260 (from livescience). The additional parts to support the onboard camera, WiFi mesh and solar charger would drive costs up, but perhaps not significantly. For example, adding the iPhone's 3MP camera to the iPad might cost about US$10, and a 3GS transceiver (as a WiFi mesh substitute) would cost an additional US$3 (both from theappleblog).

As for the solar panel battery charger, I have no idea, but a 10W standalone solar panel can be had from Amazon for $80. Granted, it doesn't include all the parts needed to convert power to something the iPad can use, and it's big, 10″ by 17″. This is not optimal and would need to be cut roughly in half (both physically and cost-wise) to better fit the OipC back or front panel.

Such a device might be a worthy successor to the OLPC at roughly double that device's price of US$150 per laptop. Packaging all these capabilities in the OipC might bring some economies of scale that could potentially bring its price down some more.
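Pulling those rough numbers together (all of them estimates quoted above, not a real bill of materials), a minimal back-of-the-envelope calculation looks like this:

```python
# Back-of-the-envelope OipC cost estimate using the figures cited above.
ipad_16gb_parts_and_labor = 260      # US$, livescience estimate
camera = 10                          # iPhone 3MP camera module estimate
mesh_radio = 3                       # 3GS transceiver as a WiFi-mesh stand-in
solar_panel = 80 / 2                 # half of an $80 10W panel, cut down to fit

oipc_estimate = ipad_16gb_parts_and_labor + camera + mesh_radio + solar_panel
olpc_price = 150

print(f"OipC parts estimate: ${oipc_estimate:.0f}")            # ~$313
print(f"Roughly {oipc_estimate / olpc_price:.1f}x the OLPC")    # ~2.1x
```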

Can the OipC replace the OLPC?

One obvious advantage the OipC would have over the OLPC is that it would be based on a commercial device. If one were to use the iPad as it exists today with the external hardware discussed above, it would be a purely commercial device. As such, future applications should be more forthcoming, hardware advances should be automatically incorporated into the latest products, and a commercial market would exist to supply and support the products. All this should result in better, more current software and hardware technology being deployed to third-world users.

Some disadvantages of the OipC vs. the OLPC include the lack of a physical keyboard, of an open source operating system with access to all open source software, and of USB ports. Of course, all the software and courseware specifically designed for the OLPC would also not work on the OipC. The open source O/S and the USB ports are probably the most serious omissions; the iPad has a number of external keyboard options that can be purchased if needed.

Now as to how to supply broadband to rural hot spots around the 3rd world, we must leave this for a future post…

Describing Dedupe

Hard Disk 4 by Alpha six (cc) (from flickr)

Deduplication is a mechanism to reduce the amount of data stored on disk for backup, archive or even primary storage. On any storage, data is often duplicated, and any system that eliminates storing duplicate data will utilize storage more efficiently.

Essentially, deduplication systems identify duplicate data and only store one copy of such data.  It uses pointers to incorporate the duplicate data at the right point in the data stream. Such services can be provided at the source, at the target, or even at the storage subsystem/NAS system level.

The easiest way to understand deduplication is to view a data stream as a book, which consists of two parts: a table of contents and the actual chapters of text (or data). The stream's table of contents provides chapter titles but, more importantly (to us), identifies a page number for each chapter. A deduplicated data stream looks like a book where chapters can be duplicated within the same book or even across books, and the table of contents can point to any book's chapter when duplicated. A deduplication service inputs the data stream, searches for duplicate chapters and deletes them, and updates the table of contents accordingly.

There's more to this, of course. For example, chapters or duplicate data segments must be tagged with how often they are duplicated so that such data is not lost when modified. Also, one way to determine whether data is duplicated is to take one or more hashes of it and compare them to other data hashes, but to work quickly, data hashes must be kept in a searchable index.
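A minimal sketch of the chunk-hash-index idea might look like the following (fixed-size chunks and SHA-256 hashes are assumptions for illustration; real products vary in how they chunk, hash and index):

```python
import hashlib

CHUNK_SIZE = 4096        # assumed fixed-size chunks; many products chunk variably

chunk_store = {}         # hash -> unique chunk data (the "chapters")
ref_counts = {}          # hash -> how many streams reference this chunk

def dedupe(stream: bytes):
    """Return the 'table of contents' (list of chunk hashes) for a stream."""
    toc = []
    for i in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[i:i + CHUNK_SIZE]
        h = hashlib.sha256(chunk).hexdigest()
        if h not in chunk_store:             # store only unique chunks
            chunk_store[h] = chunk
        ref_counts[h] = ref_counts.get(h, 0) + 1
        toc.append(h)
    return toc

def reconstitute(toc) -> bytes:
    """Rebuild the original stream from its table of contents."""
    return b"".join(chunk_store[h] for h in toc)

data = b"hello world " * 2000
assert reconstitute(dedupe(data)) == data
```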

Types of deduplication

  • Source deduplication involves a repository, a client application, and an operation that copies client data to the repository.  Client software chunks the data, hashes the data chunks, and sends these hashes over to the repository.  On the receiving end, the repository determines which hashes are duplicates and then tells the client to send only the unique data.  The repository stores the unique data chunks and the data stream's table of contents (a minimal sketch of this exchange follows the list).
  • Target deduplication involves performing deduplication inline, in-parallel, or post-processing by chunking the data stream as it's received, hashing the chunks, determining which chunks are unique, and storing only the unique data.  Inline refers to doing such processing while receiving data at the target system, before the data is stored on disk.  In-parallel refers to doing a portion of this processing while receiving data, i.e., portions of the data stream are deduplicated while other portions are being received.  Post-processing refers to data that is completely staged to disk before being deduplicated later.
  • Storage subsystem/NAS system deduplication looks a lot like post-processing, target deduplication.  For NAS systems, deduplication looks at a file of data after it is closed.  For general storage subsystems, the process looks at blocks of data after they are written.  Whether either system detects duplicate data below these levels is implementation dependent.
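As referenced in the source deduplication bullet, here is a minimal sketch of that client/repository exchange; the class and function names are illustrative only, not any vendor's protocol:

```python
import hashlib

class Repository:
    """Hypothetical dedupe target: keeps unique chunks plus each stream's TOC."""
    def __init__(self):
        self.chunks = {}                     # hash -> chunk data
        self.tocs = {}                       # stream name -> list of hashes

    def which_are_missing(self, hashes):
        """Tell the client which chunk hashes have never been seen before."""
        return {h for h in hashes if h not in self.chunks}

    def store(self, name, toc, new_chunks):
        self.chunks.update(new_chunks)       # only unique data arrives here
        self.tocs[name] = toc

def client_backup(repo, name, data: bytes, chunk_size=4096):
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    toc = [hashlib.sha256(c).hexdigest() for c in chunks]
    missing = repo.which_are_missing(toc)            # only hashes cross the wire first
    payload = {h: c for h, c in zip(toc, chunks) if h in missing}
    repo.store(name, toc, payload)                   # then only the unique chunks

repo = Repository()
client_backup(repo, "monday-full", b"same old data " * 1000)
client_backup(repo, "tuesday-full", b"same old data " * 1000)   # sends almost nothing new
```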

Deduplication overhead

Deduplication processes generate most of their overhead while deduplicating the data stream, essentially during or after the data is written, which is the reason target deduplication has so many options: some optimize ingestion while others optimize storage use. There is very little additional overhead for re-constituting (or un-deduplicating) the data for read back, as retrieving the unique and/or duplicated data segments can be done quickly. There may be some minor performance loss because of the lack of sequentiality, but that only impacts data throughput, and not by much.

Where dedupe makes sense

Deduplication was first implemented for backup data streams, because any backup regimen that takes full backups on a monthly or even weekly basis will duplicate lots of data. For example, if one takes a full backup of 100TB every week and, let's say, new unique data created each week is ~15%, then at week 0, 100TB of data is stored for both the deduplicated and un-deduplicated versions; at week 1 it takes 115TB to store the deduplicated data but 200TB for the non-deduplicated data; at week 2 it takes ~132TB to store deduplicated data but 300TB for the non-deduplicated data; etc. As each full backup completes, it takes another 100TB of un-deduplicated storage but significantly less deduplicated storage. After 8 full backups, the un-deduplicated storage would require 800TB but the deduplicated storage only ~265TB.
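The arithmetic behind those numbers is simple compounding; a minimal sketch, assuming 15% new unique data each week as stated above:

```python
# Weekly 100TB full backups with ~15% new unique data per week.
full_backup_tb = 100
weekly_growth = 0.15

deduped = float(full_backup_tb)
for week in range(8):                     # 8 full backups: weeks 0 through 7
    undeduped = full_backup_tb * (week + 1)
    print(f"week {week}: ~{deduped:.0f}TB deduplicated vs {undeduped}TB un-deduplicated")
    deduped *= 1 + weekly_growth          # only the new unique data grows the dedupe store

# week 0: 100 vs 100, week 1: 115 vs 200, week 2: ~132 vs 300, ...
# week 7 (the 8th full backup): ~266TB vs 800TB
```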

Deduplication can also work for secondary or even primary storage. Most IT shops with 1,000s of users duplicate lots of data. For example, interim files are sent from one employee to another for review, reports are sent out en masse to teams, emails are blasted to all employees, etc. Consequently, any storage (sub)system that can deduplicate data would utilize backend storage more efficiently.

Full disclosure, I have worked for many deduplication vendors in the past.

Google vs. National Information Exchange Model

Information Exchange Package Documents (IEPI) lifecycle from www.niem.gov

Wouldn't national information exchange be better served by deferring the National Information Exchange Model (NIEM) and instead implementing some sort of Google-like search of federal, state, and municipal text data records? Most federal, state and local data resides in sophisticated databases managed with their own information management tools, but such tools all seem to support ways to create a PDF, DOC, or other text output for their information records. Once in text form, such data could easily be indexed by Google or other search engines and thus searched by any term in the text record.

Now, this could never completely replace NIEM, e.g., it could never offer even "close-to" real-time information sharing. But true real-time sharing would be impossible even with NIEM. And whereas NIEM is still under discussion today (years after its initial draft) and will no doubt require even more time to fully implement, text-based search could be available today with minimal cost and effort.

What would be missing from a text based search scheme vs. NIEM:

  • “Near” realtime sharing of information
  • Security constraints on information being shared
  • Contextual information surrounding data records
  • Semantic information explaining data fields

Text based information sharing in operation

How would something like a Google-type text search work to share government information? As discussed above, government information management tools would need to convert data records into text. This could be a PDF, text file, DOC file, or PPT, and more formats could be supported in the future.

Once text versions of data records were available, they would need to be uploaded to a (federally hosted) special website where a search engine could scan and index them. Indexing such a repository would be no more complex than doing the same for the web today. Even so, it will take time to scan and index the data, and until this is done, searching the data will not be available. However, Google and others can scan web pages in seconds and often scan websites daily, so the delay may be as little as minutes to days after data upload.

Securing text based search data

Search security could be accomplished in any number of ways, e.g., with different websites or directories established at each security level. Assuming one used different websites, Google or another search engine could be directed to search only the sites at your security level and below for the information you requested. This may take some effort to implement, but even today one can restrict a Google search to a set of websites. It's conceivable that a script could be developed to invoke a search request restricted to results appropriate to your security level.
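To show how such a script might restrict results, here is a minimal sketch that builds a site-restricted query string; the domain names and clearance levels are purely hypothetical:

```python
from urllib.parse import quote_plus

# Hypothetical mapping of security levels to upload sites, lowest level first.
LEVEL_SITES = [
    "public.example.gov",        # level 0
    "sensitive.example.gov",     # level 1
    "secret.example.gov",        # level 2
]

def restricted_query(terms: str, clearance: int) -> str:
    """Build a search URL limited to sites at or below the user's clearance."""
    sites = LEVEL_SITES[: clearance + 1]
    site_filter = " OR ".join(f"site:{s}" for s in sites)
    return "https://www.google.com/search?q=" + quote_plus(f"{terms} ({site_filter})")

print(restricted_query("2009 incident report", clearance=1))
```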

Gaining participation

Once the upload websites/repositories are up and running, getting federal, state and local governments to place data into them may take some persuasion. Federal funding can be used as one means to enforce compliance. Bootstrapping data loading into the searchable repository can help ensure initial usage, and once that is established, ease of access and search effectiveness can hopefully ensure its continued use.

Interim path to NIEM

One loses all contextual and most semantic information when converting a database record into text format but that can’t be helped.   What one gains by doing this is an almost immediate searchable repository of information.

For example, Google can be licensed to operate on internal sites for a fair, but high, fee, and we're sure Microsoft is willing to do the same for Bing/Fast. Setting up a website to do the uploads could take an hour or so using something like WordPress and file-management plugins like FileBase, though other alternatives exist.

Would this support the traffic for the entire nation's information repository? Probably not. However, it would be a quick and easy proof of concept that could go a long way toward getting information exchange started. Nonetheless, I wouldn't underestimate the speed and efficiency of WordPress, as it supports a number of highly active websites/blogs. Over time such a WordPress website could be optimized, if necessary, to support even higher performance.

As this takes off, perhaps the need for NIEM becomes less time sensitive and will allow it to take a more reasoned approach.  Also as the web and search engines start to become more semantically aware perhaps the need for NIEM becomes less so.  Even so, there may ultimately need to be something like NIEM to facilitate increased security, real-time search, database context and semantics.

In the meantime, a more primitive textual search mechanism such as the one described above could be up and available within a day or so. True, it wouldn't provide real-time search and wouldn't provide everything NIEM could do, but it could provide viable, actionable information exchange today.

I am probably oversimplifying the complexity of providing true information sharing, but such a capability could go a long way toward integrating the governmental information sharing needed to support national security.

Free P2P-Cloud Storage and Computing Services?

FFT_graph from Seti@home

What would happen if somebody came up with a peer-to-peer cloud (P2P-Cloud) storage or computing service? I see this as:

  • Operating a little like Napster/Gnutella where many people come together and share out their storage/computing resources.
  • It could operate in a centralized or decentralized fashion
  • It would allow access to data/computing resources from anywhere on the internet

Everyone joining the P2P-Cloud would need to set aside computing and/or storage resources they were willing to devote to the cloud. By doing so, they would gain access to an equivalent amount (minus overhead) of other nodes' computing and storage resources to use as they see fit.

P2P-Cloud Storage

For cloud storage the P2P-Cloud would create a common cloud data repository spread across all nodes in the network:

  • Data would be distributed across the network in such a way as to allow reconstruction within any reasonable time frame and to tolerate any reasonable number of node outages without loss of data.
  • Data would be encrypted before being sent to the cloud rendering the data unreadable without the key.
  • Data would NOT necessarily be shared, but would be hosted on other users' systems.

As such, if I were to offer up 100GB of storage to the P2P-Cloud, I would get at least 100GB (less overhead) of protected storage elsewhere on the cloud to use as I see fit. Some percentage of this would be lost to administration, say 1-3%, and to redundancy protection, say ~25%, but the remaining ~72GB of off-site storage could be very useful for DR purposes.
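A minimal sketch of that capacity arithmetic, using the overhead assumptions stated above:

```python
def usable_p2p_storage(offered_gb, admin_overhead=0.03, redundancy_overhead=0.25):
    """Usable off-site capacity after administration and redundancy overhead."""
    return offered_gb * (1 - admin_overhead - redundancy_overhead)

print(usable_p2p_storage(100))   # -> 72.0 GB usable from 100GB offered
```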

P2P-Cloud storage would provide a reliable, secure, distributed file repository that could be easily accessed from any internet location. At a minimum, the service would be free and equivalent to what someone supplies (less overhead) to the P2P-Cloud storage service. If storage needs exceeded your commitment, more cloud storage could be provided at a modest cost to the consumer. Such fees would be shared by all the participants offering excess [= offered – (consumed + overhead)] storage to the cloud.

P2P-Cloud Computing

Cloud computing is definitely more complex, but generally follows the SETI@home/BOINC model:

  • P2P-Cloud computing suppliers would agree to use something like a “new screensaver” which would perform computation while generating a viable screensaver.
  • Whenever the screensaver was invoked, it would start execution on the last assigned processing unit.  Intermediate work results would need to be saved and, when a unit completed, the answer could be sent to the requester and a new processing unit assigned (see the work-unit sketch after this list).
  • Processing units would be assigned by the P2P-Cloud computing consumer, would be timeout-able and re-assignable at will.
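As referenced above, a minimal sketch of the screensaver work-unit loop might look like the following; the scheduler calls, checkpoint file and toy workload are hypothetical, loosely patterned on BOINC rather than its actual API:

```python
import json, os

CHECKPOINT = "workunit.checkpoint"   # intermediate results survive screensaver exits

def fetch_unit():
    """Hypothetical: ask the consumer's scheduler for the next processing unit."""
    return {"id": "unit-42", "input": list(range(1000))}

def report_result(unit_id, result):
    """Hypothetical: send the finished answer back to the requester."""
    print(f"{unit_id} complete: {result}")

def compute_one_step(state, step=100):
    """Do a small slice of work (here, a toy partial sum) and record progress."""
    data, done = state["unit"]["input"], state["progress"]
    state["partial"] = state.get("partial", 0) + sum(data[done:done + step])
    state["progress"] = done + step
    return state

def screensaver_tick():
    """One pass: resume the last unit (or fetch a new one), work a little, checkpoint."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            state = json.load(f)                     # resume the last assigned unit
    else:
        state = {"unit": fetch_unit(), "progress": 0}
    state = compute_one_step(state)
    with open(CHECKPOINT, "w") as f:
        json.dump(state, f)                          # save intermediate results
    if state["progress"] >= len(state["unit"]["input"]):
        report_result(state["unit"]["id"], state["partial"])
        os.remove(CHECKPOINT)                        # next tick gets a new unit

for _ in range(12):                                  # simulate several screensaver passes
    screensaver_tick()
```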

Computing users won't gain much if the computing time they consume is <= the computing time they offer (less overhead).  However, the computing time offset may be worth something, i.e., computing time now might be more valuable than computing time tonight, which may offer a slight margin of value to help get this off the ground.  As such, P2P-Cloud computing suppliers would need to be able to specify when computing resources might be mostly available, along with their type, quality and quantity.

It's unclear how to secure the processing unit, and this makes legal issues more prevalent.  That may not be much of a problem, as a complex distributed computing task makes little sense in isolation. But the (il)legality of some data processing activities could conceivably put the provider in a precarious position. (Somebody from the legal profession would need to clarify all this, but I would think that some Amazon EC2-like licensing might offer safe harbor here.)

P2P-Cloud computing services wouldn’t necessarily be amenable to the more normal, non-distributed or linear computing tasks but one could view these as just a primitive version of distributed computing tasks.  In either case, any data needed for computation would need to be sent along with the computing software to be run on a distributed node.  Whether it’s worth the effort is something for the users to debate.

BOINC can provide a useful model here.  Also, the Condor(R) project at U. of Wisconsin/Madison can provide a similar framework for scheduling the work of a “less distributed” computing task model.  In my mind, both types of services ultimately need to be provided.

To generate more compute servers, the SETI@Home and similar BOINC projects rely on doing good deeds.  As such, if you can make your computing task  do something of value to most users then maybe that’s enough. In that case, I would suggest joining up as a BOINC project. For the rest of us, doing more mundane data processing, just offering our compute services to the P2P-Cloud will need to suffice.

Starting up the P2P-Cloud

Bootstrapping the P2P-Cloud might take some effort, but once going it should be self-sustaining (assuming no centralized infrastructure).  I envision an open source solution, taking off from the work done on Napster & Gnutella and/or BOINC & Condor.

I believe the P2P-Cloud Storage service would be the easiest to get started.  BOINC and SETI@home (list of active Boinc projects) have been around a lot longer than cloud storage but their existence suggests that with the right incentives, even the P2P-Cloud Computing service can make sense.

Backup is for (E)discovery too

Electronic Discovery Reference Model (from EDRM.net)

There has been lots of talk in the twitterverse and elsewhere on how "backup is used for restore and archive is for e-discovery", but I beg to differ.

If one were to take the time to review the EDRM (Electronic Discovery Reference Model) and analyze what happens during actual e-discovery processes, one would see that nothing is outside the domain of court discovery requests. Backups have held and always will hold discoverable data, just as online and user desktop/laptop storage do. In contrast, archives are not necessarily a primary source of discoverable data.

In my view, any data not in archive is, by definition, online or on user desktop/laptop storage. Once online, data is most likely being backed up periodically and will show up in backups long before it's moved to archive. Data deletions and other modifications can often be reconstructed from backups much better than from archive (with the possible exception of records management systems). Also, reconstructing data proliferation, such as who had a copy of what data when, is often crucial to court proceedings and normally can only be reconstructed from backups.

Archives have a number of purposes, but the primary one is to move data that doesn't change off company storage and out of its backup stream. Another popular reason for archive is to satisfy compliance regimens that require companies to hold data for periods of time, such as those mandated by the SEC, HIPAA, SOX, and others. For example, SEC brokerage records must be held long after an account goes inactive, HIPAA health records must be held long after a hospital visit, and SOX requires corporate records to be held long after corporate transactions transpire. Such records are more for compliance and/or customer back-history request purposes than e-discovery, but here again any data stored by the corporation is discoverable.

So I believe it's wrong to say that backup is only for restore and archive is only for discovery. Information anywhere within a company is discoverable. However, I would venture to say that a majority of e-discovery data comes from backups rather than elsewhere.

Now, as for using backups for restore,…