Coming data bubble or explosion?

World population by Arenamontanus (cc) (from Flickr)

I was at another conference the other day where someone showed a chart that said the world will create 35ZB (10**21) of data and content in 2020 from 800EB (10**18) in 2009.

Every time I see something like this I cringe. Yes, lots of data is being created today, but what does that tell us about corporate data growth? Not much, I'd wager.

Data bubble

That being said, I have a couple of questions I would ask of the people who estimated this:

  • How much is personal data and how much is corporate data?
  • Did you factor in how entertainment data growth rates will change over time?

These two questions are crucial.

Entertainment dominates data growth

Just as personal entertainment is becoming the major consumer of national bandwidth (see study [requires login]), it’s clear to me that the majority of the data being created today is for personal consumption/entertainment – video, music, and image files.

I look at my own office: our corporate data (office files, PDFs, text, etc.) represents ~14% of the data we keep, while images, music, video, and audio take up the remainder of our data footprint. Is this data growing? Yes, faster than I would like, but the corporate data is only averaging ~30% YoY growth while our overall data footprint is growing ~116% YoY. [As I interrupt this activity to load up another 3.3GB of photos and videos from our camera.]

Moreover, although some media content is of significant external interest to select companies today (media and entertainment, social media photo/video sharing sites, mapping/satellite, healthcare, etc.), most corporations don't deal with lots of video, music or audio data. Thus, I see ~30% as a more realistic growth rate for corporate data than 116%.

Will entertainment data growth flatten?

Will we see a drop in entertainment data growth rates over time? Undoubtedly.

Two factors will reduce the growth of this data.

  1. What happens to entertainment data recording formats? I believe media recording formats are starting to level out. The issue here is one of fidelity to nature, in terms of how closely a digital representation matches reality as we perceive it. For example, most digital projection systems in movie theaters today run from ~2 to 8TB per feature-length motion picture, which seems to indicate that at some point further gains in fidelity (more pixels per frame) may not be worth it. Similar considerations will ultimately slow the growth of other media encoding formats.
  2. When will everyone who can create content be doing so? Recent data indicates that more than 2B people will be on the internet this year, or ~28% of the world's population. At some point we must reach saturation on internet penetration, and when that happens data growth rates should also start to level out. For argument's sake, assume the 800EB figure for 2009 was correct and that there were 1.5B internet users in 2009. That works out to roughly 533EB per billion internet users, or ~0.5TB per internet user, which seems high but is certainly doable (see the sketch just after this list).
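
A quick back-of-the-envelope check of that per-user figure (just a sketch; the 800EB and 1.5B user numbers are the rough estimates cited above):

```python
# Rough per-user data footprint, using the estimates cited above.
total_2009_eb = 800           # estimated world data/content created in 2009, in EB
internet_users_2009 = 1.5e9   # assumed internet users in 2009

eb_per_billion_users = total_2009_eb / (internet_users_2009 / 1e9)
tb_per_user = (total_2009_eb * 1e6) / internet_users_2009   # 1 EB = 1,000,000 TB

print(f"~{eb_per_billion_users:.0f} EB per billion users")  # ~533 EB
print(f"~{tb_per_user:.2f} TB per internet user")           # ~0.53 TB
```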

Once these two factors level off, we should see world data and content growth rates plummet.  Nonetheless, internet user population growth could be driving data growth rates for some time to come.

Data explosion

The scary part is that 35ZB represents only a ~41% compound annual growth rate over the period, measured against the baseline 800EB of data and content created in 2009.
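
For what it's worth, here's how that implied growth rate falls out of the two endpoints (a sketch only, using the 800EB and 35ZB figures above):

```python
# Implied compound annual growth rate (CAGR) from 800 EB in 2009 to 35 ZB in 2020.
start_eb, end_eb = 800, 35_000   # 35 ZB = 35,000 EB
years = 2020 - 2009              # 11 years

cagr = (end_eb / start_eb) ** (1 / years) - 1
print(f"implied CAGR ~{cagr:.0%}")   # ~41%
```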

But I must assume this estimate doesn't factor in much growth in the number of people creating digital content; otherwise these numbers should go up substantially. In the last week, I ran across someone who said there would be 6B internet users by the end of the decade (can't seem to recall where, but it was a TEDx video). I find that a little hard to believe, but it was based on the assumption that most people will have smart phones with cellular data plans by then. If that's the case, 35ZB seems awfully short of the mark.

A previous post blows this discussion completely away with just one application (see Yottabytes by 2015 for the NSA; a Yottabyte (YB) is 10**24 bytes of data), and I had already discussed an Exabyte-a-day and 3.3 Exabytes-a-day in prior posts. [Note, those YB by 2015 are all audio (phone) recordings, but if we start using Skype video, FaceTime and other video communications technologies, can Nonabytes (10**27 bytes) be far behind… BOOM!]

—-

I started out thinking that 35ZB by 2020 wasn't pertinent to corporate considerations and figured things had to flatten out, then convinced myself the estimate wasn't large enough to accommodate internet user growth, and finally recalled prior posts that put all this into even more perspective.

Comments?

1 on 1 auctions vs. person years of A/R time

1918 Farm Auction by dok1 (cc) (from Flickr)

I have had this conversation before (and have blogged about it with Crowdsourcing business analyst …): lots of time and effort (person years?) gets devoted to scheduling one-on-one meetings between analyst firms and corporate executives. I may be repeating my earlier post, but the problem persists and I see an obviously easier way to solve it.

Auction off 1 on 1 time slots

By doing this, the company puts the burden on the analyst community: give every firm some amount of "analyst bucks" (A$) and then auction off executive meeting slots. In this way the crowd of analysts would determine who best meets with whom (putting crowdsourcing to work).

Consider today’s solution:

  • Send out a list of topics to be discussed at the meeting,
  • Have the analyst firm select their top 3 or 5 topics, and
  • Have analyst relations sift the requests and executive availability to schedule the meetings.

For analyst events with 100s of analyst firms, 20 or more executives, and 10 or more time slots, the scheduling activity can become quite complex and time consuming.

I understand a corporation's need to make the most effective use of analysts' and executive management's time, but what better way to make this determination than to let the (analyst) market decide?

How an executive 1 on 1 auction could work

The way I see it is to hold some sort of Dutch or Japanese auction (see the Wikipedia article on auctions) where all the analyst firm representatives attend a WebEx session and bid for 1 on 1 time slots with various executives. In this fashion the company could have the whole schedule laid out in a single day, with the only effort involved being identifying executives and time slots and supplying A$s to analyst firms.
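
To make the mechanics concrete, here's a toy sketch of slot allocation under A$ budgets. It is only a sketch, not the actual Dutch/Japanese bidding rounds, and the firm names, budgets and bids are invented:

```python
# Toy sketch of allocating executive time slots by auction. It simplifies the
# Dutch/Japanese formats down to: each slot goes to the highest remaining bid,
# and the winning bid is deducted from that firm's A$ budget. Firm names,
# budgets and bids are made up for illustration.

budgets = {"FirmA": 100, "FirmB": 60, "FirmC": 40}   # A$ granted to each firm

bids = {  # slot: {firm: A$ offered}
    "CEO 9:00":  {"FirmA": 50, "FirmB": 45, "FirmC": 20},
    "CTO 9:30":  {"FirmA": 40, "FirmB": 30, "FirmC": 35},
    "CFO 10:00": {"FirmA": 30, "FirmB": 25, "FirmC": 10},
}

schedule = {}
for slot, slot_bids in bids.items():
    # Only consider bids a firm can still afford.
    affordable = {firm: bid for firm, bid in slot_bids.items() if bid <= budgets[firm]}
    if not affordable:
        continue  # slot goes unsold; it could be released or re-auctioned
    winner = max(affordable, key=affordable.get)
    budgets[winner] -= affordable[winner]
    schedule[slot] = winner

print(schedule)  # e.g. {'CEO 9:00': 'FirmA', 'CTO 9:30': 'FirmA', 'CFO 10:00': 'FirmB'}
print(budgets)   # remaining A$ per firm
```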

It doesn't even need to be that sophisticated and potentially could be done on eBay, with real money supplied by the company (usable only for bidding on executive time slot auctions) and donated to charity when the process is finished. There are any number of ways to do this on the quick and cheap. However, eBay may be a bit too public; doing this over a conference call with WebEx would probably suffice just as well and could be totally private.

Of course with this approach, the company may find that there are some executives in higher demand than others. If such is the case, perhaps a secondary auction could be held with more of their time slots. Ditto for executives whose time slots are not in demand – they could be released from providing time for 1 on 1 meetings.

In my prior post I mentioned the option that the corporation might want more control over who meets whom. In that case, allocating some A$s to the corporate executives (or A/R as their proxy) to augment analyst firm bids might do the trick. Of course, providing those firms more A$s would also give them preferential access. Obviously, this wouldn't provide as much absolute control as spending person years of effort on 1 on 1 scheduling, but it would provide a quick and relatively easy solution to the problem for both the analyst firms and analyst relations.

But how much to grant to each analyst firm?

The critical question is the amount of A$s to provide each firm. This might take some thought, but there is an easy solution: just use last year's analyst spend as the amount of A$s to provide each firm. Another option is to provide some base level of analyst bucks to any firm invited to attend and then add more based on the prior year's spend.

A possibly less appealing approach (to me at least) is to give each analyst firm an amount proportional to its annual revenue, regardless of company spending with the firm. But perhaps some combination of the above, say

1/3 base amount for any invitee + 1/3 proportional to annual spend + 1/3 proportional to annual firm revenue = A$s

would work.
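
As a rough sketch of how that blended allocation might be computed (the pool size and per-firm figures below are made up for illustration):

```python
# Sketch of the blended A$ allocation above: one third of the pool split evenly
# across invitees, one third proportional to prior-year spend with each firm,
# one third proportional to each firm's annual revenue. All figures are made up.

total_pool = 3000  # total A$ to hand out

firms = {            # firm: (prior-year spend with us, firm annual revenue)
    "FirmA": (200_000, 1_500_000_000),
    "FirmB": (50_000,    300_000_000),
    "FirmC": (0,          40_000_000),
}

n = len(firms)
spend_total = sum(spend for spend, _ in firms.values())
rev_total = sum(rev for _, rev in firms.values())

allocation = {}
for firm, (spend, revenue) in firms.items():
    base = (total_pool / 3) / n
    spend_part = (total_pool / 3) * (spend / spend_total) if spend_total else 0
    rev_part = (total_pool / 3) * (revenue / rev_total) if rev_total else 0
    allocation[firm] = round(base + spend_part + rev_part)

print(allocation)   # roughly {'FirmA': 1949, 'FirmB': 696, 'FirmC': 355}
```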

In my previous post I suggested so many A$s per analyst; as such, bigger firms with more analysts would get more than firms with fewer analysts. But the formula described above makes more sense to me.

Information provided to facilitate the 1 on 1’s auction

In order for the auction to work well, analyst firms would need to know more about the executives whose time is being auctioned off. Aside from that, just a schedule of the available time slots would allow the auction to work. That said, some idea of the company's org chart and where each executive fits in would be very useful to facilitate the auction.

—-

That's it, pretty simple: set up a conference call, send out executive information and an org chart, allocate analyst bucks and let the bidding begin.

Auctioning off Lot-132: 30 minutes of Ray Lucchesi’s time …, let the bidding begin.

Comments?

SNIA CDMI plugfest for cloud storage and cloud data services

Plug by Samuel M. Livingston (cc) (from Flickr)

Was invited to the SNIA tech center to witness the CDMI (Cloud Data Management Interface) plugfest going on down in Colorado Springs.

It was somewhat subdued. I always imagine racks of servers with people crawling all over them with logic analyzers, laptops and other electronic probing equipment. But alas, software plugfests are generally just a bunch of people with laptops and ethernet/wifi connections, all sitting around a big conference table.

The team was working to define an errata sheet for CDMI v1.0 to be completed prior to ISO submission for official standardization.

What’s CDMI?

CDMI is an interface standard for clients talking to cloud storage servers and provides a standardized way to access all such services. With CDMI you can create a cloud storage container, define its attributes, and deposit and retrieve data objects within that container. Mezeo had announced support for CDMI v1.0 a couple of weeks ago at SNW in Santa Clara.

CDMI provides for attributes to be defined at the cloud storage server, container or data object level, such as:

  • standard redundancy degree (number of mirrors, RAID protection),
  • immediate redundancy (synchronous),
  • infrastructure redundancy (across the same storage or different storage),
  • data dispersion (physical distance between replicas),
  • geographical constraints (where the data can be stored),
  • retention hold (how soon it can be deleted/modified),
  • encryption,
  • data hashing (having the server provide a hash used to validate end-to-end data integrity),
  • latency and throughput characteristics,
  • sanitization level (secure erasure),
  • RPO, and
  • RTO.
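
To give a flavor of what this looks like on the wire, here's a rough sketch of creating a container and depositing an object using the Python requests package. The endpoint, credentials and metadata values are hypothetical, and the exact data system metadata names should be checked against the CDMI v1.0 spec.

```python
# Rough sketch of talking CDMI v1.0 over HTTP with the Python requests package.
# The endpoint, credentials and metadata values are hypothetical; check the CDMI
# spec for the authoritative header and data system metadata names.
import requests

BASE = "https://cloud.example.com/cdmi"            # hypothetical CDMI server
HDRS = {"X-CDMI-Specification-Version": "1.0"}
AUTH = ("user", "password")                        # whatever the provider requires

# Create a container, asking for two redundant copies kept in the US
# (illustrative data system metadata).
requests.put(
    f"{BASE}/backups/",
    auth=AUTH,
    headers={**HDRS, "Content-Type": "application/cdmi-container",
             "Accept": "application/cdmi-container"},
    json={"metadata": {"cdmi_data_redundancy": "2",
                       "cdmi_geographic_placement": ["US"]}},
)

# Deposit a data object into the container.
requests.put(
    f"{BASE}/backups/hello.txt",
    auth=AUTH,
    headers={**HDRS, "Content-Type": "application/cdmi-object",
             "Accept": "application/cdmi-object"},
    json={"mimetype": "text/plain", "value": "hello, cloud"},
)

# Read it back.
obj = requests.get(f"{BASE}/backups/hello.txt", auth=AUTH, headers=HDRS).json()
print(obj["value"])
```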

A CDMI client is free to implement compression and/or deduplication as well as other storage efficiency features on top of CDMI server characteristics. There's probably something I'm missing here, but it seems pretty complete at first glance.

SNIA has defined a reference implementation of a CDMI v1.0 server [and I think a client as well] which can be downloaded from their CDMI website. [After filling out the "information on me" page, SNIA sent me an email with the download information, but I could only recognize the CDMI server in the download, not the client (although it could have been there). The CDMI v1.0 specification is freely available as well.] The reference implementation can be used to test your own CDMI clients if you wish. It is Java based and apparently runs on Linux systems, but shouldn't be too hard to run elsewhere (one CDMI server at the plugfest was running on a Mac laptop).

Plugfest participants

There were a number of people from both big and small organizations at SNIA's plugfest.

Mark Carlson from Oracle was there and seemed to be leading the activity. He said I was free to attend but that he couldn't say anything about what was and wasn't working. Didn't have the heart to tell him I couldn't tell what was working or not from my limited time there. But everything seemed to be working just fine.

Carlson said that SNIA's CDMI reference implementation had been downloaded 164 times, with the majority of the downloads coming from China, the USA, and India, in that order. But he said there were people in just about every geo looking at it. He also said this was the first annual CDMI plugfest, although they had CDMI v0.8 running at other shows (e.g., SNIA SDC) before.

David Slik, from NetApp's Vancouver Technology Center, was there showing off his demo CDMI Ajax client and laptop CDMI server. He used the Ajax client to access all the CDMI capabilities of the cloud data object he was presenting and to display the object's binary contents. Then he showed me that the exact same data object (file) could easily be accessed by typing the proper URL into any browser; it turned out the binary was a GIF file.

The other thing Slik showed me was a cloud data object created by a cron job that referenced a satellite image website and deposited the data directly into cloud storage, entirely at the server level. Slik said that CDMI also specifies a cloud-storage-to-cloud-storage protocol which could be used to move cloud data from one cloud storage provider to another without having to retrieve the data back to the user. Such a capability would be ideal for exporting user data from one cloud provider and importing it to another over the providers' high speed backbones, rather than transmitting the data to and from the user's client.
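
As a rough sketch of what a server-side copy request might look like (the URLs are hypothetical, and a true cross-provider move would also require the two clouds to honor each other's credentials):

```python
# Rough sketch: create a new CDMI data object by naming a source to copy from,
# so the bytes never round-trip through the client. URLs are hypothetical.
import requests

requests.put(
    "https://cloud-b.example.com/cdmi/images/snapshot.gif",
    auth=("user", "password"),
    headers={"X-CDMI-Specification-Version": "1.0",
             "Content-Type": "application/cdmi-object",
             "Accept": "application/cdmi-object"},
    json={"copy": "https://cloud-a.example.com/cdmi/images/snapshot.gif"},
)
```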

Slik was also instrumental in SNIA's XAM interface standard for archive storage. He said that CDMI is much more lightweight than XAM, as there is no requirement for a runtime library whatsoever; it depends only on standard HTTP as the underlying protocol. From his viewpoint CDMI is almost XAM 2.0.

Gary Mazzaferro from AlloyCloud was talking as if CDMI would eventually take over not just cloud storage management but local data management as well. He called CDMI a strategic standard that could potentially be implemented in OSs, hypervisors and even embedded systems to provide a standardized interface for all data management, cloud or local storage. When I asked what happens to SMI-S in this future, he said they would co-exist as independent but cooperative management schemes for local storage.

Not sure how far this goes. I asked if he envisioned a bootable CDMI driver. He said yes, a BIOS CDMI driver is something that will come once CDMI is more widely adopted.

Other people I talked with at the plugfest consider CDMI the new web file services protocol, akin to NFS as the LAN file services protocol. In comparison, they see Amazon S3 as similar to CIFS (SMB1 & SMB2): a proprietary cloud storage protocol, but one that will also be widely adopted and available.

There were a few people from startups at the plugfest, working on various client and server implementations. Not sure they wanted to be identified or for me to mention what they were working on. Suffice it to say, the potential for CDMI is pretty hot at the moment, as is cloud storage in general.

But what about cloud data consistency?

I had to ask how the CDMI standard deals with eventual consistency – it doesn't. The crowd chimed in: relaxed consistency is inherent in any distributed service. Any distributed service really has three characteristics, Consistency, Availability and Partition tolerance (CAP); you can elect to have any two of these, but must give up the third. Sort of like the Heisenberg uncertainty principle applied to data.

They all said that consistency is mainly a CDMI client issue, outside the purview of the standard and associated with server SLAs, replication characteristics and other data attributes. As such, CDMI does not define any specification for eventual consistency.

Slik did say that the standard guarantees that if you modify an object and then request a copy of it from the same location during the same internet session, you get the copy you last modified. Seems like long odds in my experience. It's unclear how CDMI, with relaxed consistency, can ever take the place of primary storage in the data center, but maybe it's not intended to.

—–

Nonetheless, what I saw was impressive: cloud storage from multiple vendors all being accessed from the same client, using the same protocol. And if that wasn't simple enough for you, just use your browser.

If CDMI can become popular it certainly has the potential to be the new web file system.

Comments?

 

Information commerce – part 2

3d personal printer by juhansonin (cc) (from Flickr)

I wrote a post a while back about how interplanetary commerce could be stimulated through the use of information commerce (see my Information based inter-planetary commerce post).  Last week I saw an article in the Economist magazine that discussed new 3D-printers used to create products with just the design information needed to describe a part or product.  Although this is only one type of information commerce, cultivating such capabilities can be one step to the future information commerce I envisioned.

3D Printers Today

3D printers grew up from the 2D inkjet printers of the last century. It turns out that if 2D printers can precisely spray ink on a surface, similar technology can build up a 3D structure one plane at a time. After each layer is created, a laser, infrared light or some other technique is used to set the material into its proper form, and then the part is incrementally lowered so that the next layer can be created.

Such devices use a form of additive manufacturing which adds material to the exact design specifications necessary to create one part. In contrast, normal part manufacturing activities such as those using a lathe are subtractive manufacturing activities, i.e., they take a block of material and chip away anything that doesn’t belong in the final part design.

3D printers started out making cheap, short-lived plastic parts but recently, using titanium powders, have been used to create extremely long-lived metal aircraft parts, and nowadays they can create just about any short- or long-lived plastic part imaginable. A few limitations persist: the size of the printer determines the size of the part or product, and 3D printers that can create multi-material parts are fairly limited.

Another problem is the economics of 3D printing a part, both in time and cost. Volume production using subtractive manufacturing is probably still a viable alternative, i.e., if you need to manufacture 1000 or more of the same part, it probably still makes sense to use standard manufacturing techniques. However, the boundary where it makes economic sense to 3D print a part rather than machine it on a lathe is gradually moving upward. Moreover, as more multi-material 3D printers come online, the economics of volume product manufacturing (not just single parts) will cause a sea change in product construction.

Information based, intra-planetary commerce

The Economist article discussed some implications of the sophisticated 3D printers coming in the near future. Specifically, with 3D printers, manufacturing can be done locally rather than shipping parts and products from one country to another. All one needs to do is transmit the product design to wherever it needs to be produced and sold. The article argued this would eliminate most of the cost advantages that low-wage countries enjoy today in manufacturing parts and products.

The other implication of newer 3D printers is that product customization becomes much easier. I envision clothing, furnishings, and other goods literally tailor-made for an individual through the proper use of design-rule-checking CAD software together with local, sophisticated 3D printers. How Joe Consumer fires up a CAD program and tailors their product is another matter. But with 3D printers coming online, sophisticated, CAD-knowledgeable users could almost do this today.

—-

In the end, the information needed to create a part or a product will be the key intellectual property.  It’s already been happening for years now but the dawn of 3D printers will accelerate this trend even more.

Also, 3D printers will expand information commerce, joining the information activities already provided by finance, research/science, media, and other information purveyors around the planet today. Anything that makes information more a part of everyday commerce can be beneficial whenever we ultimately begin to move off this world to the next planet – let alone when I want to move to Tahiti…

Comments?

Personal medical record archive

MRI of my brain after surgery for Oligodendroglioma tumor by L_Family (cc) (From Flickr)

I was reading a book the other day that suggested sometime in the near future we will all have a personal medical record archive. Such an archive would be a formal record of every visit to a healthcare provider, with every X-ray, MRI, CT scan, doctor's note, blood analysis, etc. that's ever done to a person.

Such data would be our personal record of our life’s medical history usable by any future medical provider and accessible by us.

Who owns medical records?

Healthcare is unusual. In any other discipline, like accounting, you provide information to the discipline expert and you get all the information you could possibly want back, to store, send to the IRS, or do with as you want. If you decide to pitch it, you can pretty much request a copy (at your cost) of anything for a certain number of years after the information was created.

But in medicine, X-rays are owned and kept by the medical provider, same with MRIs, CT scans, etc., and you hardly ever get a copy. Occasionally, if the physician deems it useful for explanatory reasons, you might get a grainy copy of an X-ray that shows a break or something, but other than that and possibly some therapeutic instructions, typically nothing.

Getting doctors' notes is another question entirely. They're mostly text records in some sort of database online to the medical practice. But mainly what we get as patients is a verbal diagnosis to take in and mull over.

Personal experience with medical records

I worked for an enlightened company a while back that had its own onsite medical practice providing all sorts of healthcare to employees. Over time, new management decided this service was not profitable and terminated it. As they were winding down the operation, they offered to send patient medical information to any new healthcare provider or to us. Not having a new provider, I asked them to send it to me.

A couple of weeks later, a big brown manila envelope was delivered. Inside was a rather large, multi-page printout of notes taken by every medical provider I had visited throughout my tenure with this facility. What was missing from this assemblage were the lab reports, X-rays and other ancillary data taken in conjunction with those office visits. I must say the notes were comprehensive, if somewhat laden with medical terminology, but they were all there to see.

Printouts were not very useful to me and probably wouldn't be to any follow-on medical group caring for me. However, the lack of X-rays, blood work, etc. might be a serious deficiency for any follow-on treatment. But as far as I was concerned, it was the first time any medical entity had even offered me information like this.

Making personal medical records useable, complete, and retrievable

To take this to the next level, and provide something useful for patients and follow-on healthcare providers, we need some sort of standardization of medical records across the healthcare industry. This doesn't seem that hard, given where we are today. Standards for most medical data already exist, specifically:

  • DICOM, or Digital Imaging and Communications in Medicine – a standard file format used to digitally record X-rays, MRIs, CT scans and more. Most digital medical imaging technology out there today (except for ultrasound) optionally records information in DICOM format. There just so happens to be an open source DICOM viewer that anyone can use to view these sorts of files, and reading them programmatically is easy too (see the sketch after this list).
  • Ultrasound imaging – typically rendered and viewed as a sort of movie and often used for soft tissue imaging and prenatal care. I don't know for sure, but I cannot find any standard like DICOM for ultrasound images. However, if they are truly movies, perhaps HD movie files would suffice as a standard ultrasound imaging format.
  • Audiograms, blood chemistry analysis, etc. – provided by many technicians or labs and could all be easily represented as PDFs, scanned images, or JPEG/MPEG recordings. Doctors or healthcare providers often discuss salient items from these reports that bear on the patient's condition. Such affiliated notes could all be kept in an associated text file, or even a recording of the doctor discussing the results, that somehow references the other artifact ("Blood chemistry analysis done on 2/14/2007 indicates …").
  • Other doctor/healthcare provider notes – I find that every time I visit a healthcare provider these days, they either take copious notes on WiFi-connected laptops, record verbal notes to a voice recorder for later transcription, or some combination of the two. Any such information could be provided as standard RTF (text) files or MPEG recordings and viewed as is.
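
As an aside on how approachable the DICOM format already is, here's a minimal sketch of reading a file's metadata with the open source pydicom package; the filename scan.dcm is just a placeholder.

```python
# Minimal sketch: read a DICOM file's metadata with the open source pydicom
# package (pip install pydicom). "scan.dcm" is a placeholder filename.
import pydicom

ds = pydicom.dcmread("scan.dcm")
print(ds.Modality)           # e.g. "MR" or "CT"
print(ds.StudyDate)          # when the study was performed
print(ds.PatientName)        # who it belongs to
print(ds.pixel_array.shape)  # the image itself (requires numpy)
```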

How patients can access medical data

Most voice recordings or text notes could easily be emailed to the patient.  As for DICOM images, ultrasound movies, etc., they could all be readily provided on DVDs or other removable media sent to the patient.

Another, possibly better, alternative is to have all this data uploaded to a healthcare-provider-designated URL and stored in a medical record cloud someplace, allowing patient access for viewing, downloading and/or copying. I envision something akin to a photo sharing site, uploadable by any healthcare provider but accessible for downloads by any authorized user/patient.

Medical information security

Any patient data stored in such a medical record cloud would need to be secured, and possibly encrypted with a healthcare-provider-supplied passcode which the patient could use for downloading/decrypting. There are plenty of open source cryptographic tools which would suffice to encrypt this data (see GNU Privacy Guard for instance).
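
As a minimal sketch of what that could look like with GnuPG's symmetric mode (filenames are placeholders, and gpg prompts for the passcode):

```python
# Minimal sketch: symmetric encryption of a records archive with GnuPG, driven
# from Python. Filenames are placeholders; gpg prompts for the passcode.
import subprocess

# Healthcare provider side: encrypt the archive before uploading it.
subprocess.run(
    ["gpg", "--symmetric", "--cipher-algo", "AES256", "medical_records.zip"],
    check=True,
)  # produces medical_records.zip.gpg

# Patient side: decrypt the downloaded archive with the supplied passcode.
subprocess.run(
    ["gpg", "--output", "medical_records.zip", "--decrypt", "medical_records.zip.gpg"],
    check=True,
)
```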

As for access passwords, possibly some form of public key cryptography would suffice, but it need not be that sophisticated. I prefer open source tools for these security mechanisms, as they would be readily available to the patient or any follow-on medical provider needing to access and decrypt the data.

Medical information retention period

The patient would have a certain amount of time to download these files. I lean towards months, just to ensure it's done in a timely fashion, but maybe it should be longer; something on the order of 7 years after a patient's last visit might work. This would allow the patient sufficient time to retrieve the data and supply it to any follow-on medical provider or store it in their own personal medical record archive. There are plenty of cloud storage providers, I know, that would be willing to store such data at a fair, but high, price for any period of time desired.

Medical information access credentials

All the patient would need is an email and/or possibly a letter that provides the access URL, access password and encryption passcode for the files. Possibly such information could be provided in plaintext, appended to whatever bill is cut for the visit, which is sure to find its way to the patient or some financially responsible guardian/parent.

How do we get there?

Bootstrapping this personal medical record archive shouldn't be that hard. As I understand it, Electronic Medical Record (EMR) legislation in the US and elsewhere has provisions stating that any patient has a legal right to copies of any medical record a healthcare provider has for them. If this is true, all we need do is institute some additional legislation requiring healthcare providers to make those records available in a standard format, in a publicly accessible place, access controlled/encrypted via a password/passcode, downloadable by the patient, with the access credentials provided to the patient in a standard form. Once that is done, we have all the pieces needed to create the personal medical record archive I envision here.

—-

While such legislation may take some time, one thing we could all do now, at least in the US, is request access to all the medical records/information that are legally ours already. Once healthcare providers start getting inundated with requests for this data, they might figure out that having some easy, standardized way to provide it makes sense. Then the healthcare organizations could get together and work to finalize a better solution and whatever legislation is needed to provide this in a standard way. I would think university hospitals could lead this endeavor and show us how it could be done.

Am I missing anything here?

Data Science!!

perspective by anomalous4 (cc) (from Flickr)

Ran across a web posting yesterday providing information on a University of Illinois summer program in Data Science.  I had never encountered the term before so I was intrigued.  When I first saw the article I immediately thought of data analytics but data science should be much broader than that.

What exactly is a data scientist?  I suppose someone who studies what can be learned from data but also what happens throughout data lifecycles.

Data science is like biology

I look to biology for an example. A biologist studies all sorts of activity and interactions, from what happens in a single-cell organism, to plants, to animal kingdoms. Biologists create taxonomies which organize all biological entities, past and present. They study current and past food webs, ecosystems, and species. They work in an environment of scientific study where results are openly discussed and repeatable. In peer reviewed journals, they document everything from how a cell interacts within an organism, to how an organism interacts with its ecosystem, to whole ecosystem lifecycles. I fondly remember my high school biology class discussing DNA, the life of a cell, biological taxonomy and dissection.

Where are these counterparts in Data Science?  Not sure but for starters let’s call someone who does data science an informatist.

Data ecosystems

What constitutes a data ecosystem in data science? Perhaps an informatist would study the IT infrastructure(s) where a datum is created, stored, and analyzed. Such infrastructure (especially with cloud) may span data centers, companies, and even the whole world. Then again, migratory birds cover large distances, across multiple ecosystems, and are still valid subjects for biologists.

So where a datum exists, where and when it moves throughout its lifecycle, and how it interacts with other data is a proper subject for data ecosystem study. I suppose my life's study of storage could properly be called the study of data ecosystems.

Data taxonomy

Next, what's a reasonable way for an informatist to organize data, akin to a biological taxonomy with domain, kingdom, phylum, class, order, family, genus, and species (see Wikipedia)? It seems to me that the applications that create and access the data represent a rational way to organize it. However, my first thought was that structured vs. unstructured data could be the defining first-level breakdown (maybe Phylum). Order could be general application type, such as email, ERP, office documents, etc. Family could be application domain, genus could be application version, and species could be application data type. So an Exchange 2010 email would be Order=EMAILus, Family=EXCHANGius, Genus=E2010ius, and Species=MESSAGius.
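
Just to make the analogy concrete, here's a sketch of that classification as a simple record type, using the whimsical names above (whether email counts as structured or unstructured is arguable):

```python
# A sketch of the whimsical data taxonomy above as a simple record type.
from dataclasses import dataclass

@dataclass
class DataTaxon:
    phylum: str   # structured vs. unstructured
    order: str    # general application type (email, ERP, office documents, ...)
    family: str   # application domain
    genus: str    # application version
    species: str  # application data type

exchange_2010_email = DataTaxon(
    phylum="Unstructured",   # arguably semi-structured
    order="EMAILus",
    family="EXCHANGius",
    genus="E2010ius",
    species="MESSAGius",
)
print(exchange_2010_email)
```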

I think higher classifications need to consider things such as oral history, hand-copied manuscripts, movable-type printed documents, IT, etc. at the Kingdom level. Maybe Domain would be things such as the biological domain, information domain, physical domain, etc. (Although where oral history ultimately fits is another question.)

When first thinking of higher taxonomical designations I immediately went to the O/S, but now I think of an O/S as part of the ecological niche where data temporarily resides.

—-

I could go on; there are probably hundreds if not thousands of other aspects of data science that need to be discussed: the data lifecycle, the data cell, information use webs, etc.

Another surprise is how well the study of biology fits the study of data science.  Counterparts to biology seem to exist everywhere I look.  At some deep level, biology is information, wet-ware perhaps, but information nonetheless.  It seems to me that the use of biology to guide our elaboration of data science can be very useful.

Comments?

Information based inter-planetary commerce

NASA Blue Marble 2007 West by NASA Goddard Photo (cc) (from flickr)

[Long post, over 4 minute reading]

For a while now, I have been considering whether an inter-planetary economy would be possible. Given that faster-than-light travel is impossible for any material substance, I believe the only viable foundation for any inter-planetary economy must be information transfers. But can this sustain a multi-planet economy?

Worldwide information market today

How much money is made from information in the world today? Just looking at published information sold for money, we can come up with a lower bound estimate. For example, USA book publishing was expected to generate $23.9 Bn (in 2009), magazine revenue was $9.8 Bn (in 2008), and movie industry revenues were $9.6 Bn (in 2007). If this represents just 25% (the USA's proportion of world GDP) of similar worldwide revenues, these activities would have generated ~$134.8 Bn last year. Add to that worldwide software industry revenues of $303.8 Bn (in 2008), plus movie, TV, newspaper and radio revenues, and such potentially pure informational transactions probably generate about $500 Bn worldwide, or ~0.9% of world GDP. Although most certainly dwarfed by the rest of the world's economy, and ignoring private information sales, this still represents a sizeable worldwide niche that could be addressed by any exo-earth colonies.

How would business work between parties light years apart?

Although we might colonize Mars for purely scientific reasons, any real, self-sufficient exo-earth colonies would likely be 10 or more light years away from Earth. So most likely any business transaction with such colonies would be a long duration transaction. It would probably begin with a one way informational transfer for some period of time, until the colonial expedition's debt was paid off. The exo-earth colony could provide basic planetary fauna, flora, colonial development and other information which depicts the colony's home planet and its developmental history. No doubt such information alone would be invaluable scientifically and would suffice to pay off the original expedition costs by itself.

But assuming such data was payment for the expedition, and that this information stream was expected to continue indefinitely as a commitment by the colonists, what reason would Earth have to continue supplying more information to the colony, or vice versa? One would have to believe that the colony would take off on its own, developing unique expertise in scientific, technological or other intellectual pursuits that would be of value to Earth and, as such, provide a surplus in its information account. Once such information starts flowing from the colony, it's reasonable that Earth should start payment in the other direction, comprising more information about developments here that would be valuable to its distant colony.

Now, any inter-planetary informational transaction would take possibly 20 to 50 years or more, with a 10 to 25 year wait before the first exo-planetary byte of data was received. Such transactions would need to be done with a great deal of trust that the other side would deliver its part of the informational bargain. However, if one side were to cease informational transfers, it's entirely reasonable that the other side, after the necessary 10 to 25 year delay, could cease such transfers as well. Enforcement mechanisms such as these could keep the information flowing bi-directionally and establish a foundation for economic trust.

Would 50 year old information be useful today?

Given technological change, knowing about scientific discoveries that happened fifty years ago doesn't seem valuable at first. However, any colony would have resources and material capabilities significantly different from Earth's, and such differences would dictate pursuing paths other than those that would necessarily occur on Earth. Such independent developments, sharing a common but dated knowledge base, could easily generate information that either side would value.

Also, long duration transactions have existed in the past, and seem somewhat similar to the expeditions funded in the 15th through 18th centuries by the nation-states of that time. Most such expeditions took years to complete and very rarely returned material value, but often returned with information about the routes, cultures, and places found along the way. These expeditions all started out under government funding, but over time, as such expeditions became less risky, stock companies arose which provided commercial, non-governmental funding. Shareholders would profit from any positive returns from the journey. Today, one can see some similarities to this in venture capital activities.

Other parallels to exo-earth colonies can be seen in the more economically closed societies on Earth. Any country/culture which allows information to flow in but limited material to flow out would look similar to a planet light years away. Take Japan, for example. For centuries Japan gathered information, culture and technology, first from China and then from the rest of the world, and had a strong societal culture that assimilated such information and expanded on it. Early on, substantive business transactions in material goods were nowhere near as significant. But over time, as Japan became more integrated into global society, it was able to materially advance the world economy.

Inter-planetary commerce will ultimately depend on long duration transactions and informational transfers. As shown above, long duration transactions have had a lengthy and profitable history on Earth. Additionally, information-assimilating societies have existed in Earth's past that significantly advanced world society, in both informational and material goods, and the informational return alone was significant. As such, we believe inter-planetary commerce based on information transfers alone can prove profitable for governments and farsighted/long-lived institutions or individuals.

There are plenty of implications to all of this, not the least of which is whether SETI will find other civilizations if those civilizations only talk to their business partners. But I will leave that for another post.