Ran across a web posting yesterday providing information on a University of Illinois summer program in Data Science. I had never encountered the term before, so I was intrigued. When I first saw the article I immediately thought of data analytics, but data science should be much broader than that.
What exactly is a data scientist? I suppose it's someone who studies not only what can be learned from data but also what happens throughout data lifecycles.
Data science is like biology
I look to biology for an example. A biologist studies all sorts of activity and interactions, from what happens in a single-celled organism to the plant and animal kingdoms. They create taxonomies which organize all biological entities, past and present. They study current and past food webs, ecosystems, and species. They work in an environment of scientific study where results are openly discussed and repeatable. In peer-reviewed journals, they document everything from how a cell interacts within an organism, to how an organism interacts with its ecosystem, to whole ecosystem lifecycles. I fondly remember my biology class in high school talking about DNA, the life of a cell, biological taxonomy and dissection.
Where are these counterparts in Data Science? Not sure, but for starters let’s call someone who does data science an informatist.
What constitutes a data ecosystem in data science? Perhaps an informatist would study the IT infrastructure(s) where a datum is created, stored, and analyzed. Such infrastructure (especially with cloud) may span data centers, companies, and even the whole world. Nonetheless, migratory birds can cover large distances, across multiple ecosystems and are still valid subjects for biologists.
So where a datum exists, where and when it’s moved throughout its lifecycle, and how it interacts with other datums are all proper subjects for data ecosystem study. I suppose my life’s study of storage could properly be called the study of data ecosystems.
Next, what’s a reasonable way for an informatist to organize data, like a biological taxonomy with domain, kingdom, phylum, class, order, family, genus, and species (see Wikipedia)? Seems to me that the applications that create and access the data represent a rational way to organize it. However, my first thought on this was structured or unstructured data as the defining first-level breakdown (maybe Phylum). Order could be general application type such as email, ERP, office documents, etc. Family could be application domain, genus could be application version, and species could be application data type. So something like an Exchange 2010 email would be Order=EMAILus, Family=EXCHANGius, Genus=E2010ius, and Species=MESSAGius.
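For the programmers reading, here is a rough sketch of how that classification might look as a data structure. It is only an illustration: the DataTaxon class, its method names, and the E2007ius genus are made up for this example, and phylum is left optional because I haven't settled which phylum an email belongs to.

```python
from dataclasses import dataclass
from typing import Optional

# Ranks used in the post, broadest first. Nothing has been assigned to
# "class" yet, so it is left out of this sketch.
RANKS = ("phylum", "order", "family", "genus", "species")


@dataclass(frozen=True)
class DataTaxon:
    """One kind of datum's place in the (hypothetical) data taxonomy."""
    order: str                     # general application type, e.g. email
    family: str                    # application domain
    genus: str                     # application version
    species: str                   # application data type
    phylum: Optional[str] = None   # structured vs. unstructured, if decided

    def lineage(self) -> str:
        """Render the assigned ranks as 'Rank=Value' pairs, broadest first."""
        parts = []
        for rank in RANKS:
            value = getattr(self, rank)
            if value is not None:
                parts.append(f"{rank.capitalize()}={value}")
        return ", ".join(parts)

    def deepest_common_rank(self, other: "DataTaxon") -> Optional[str]:
        """Return the narrowest rank at which two taxa still agree, if any."""
        common = None
        for rank in RANKS:
            mine, theirs = getattr(self, rank), getattr(other, rank)
            if mine is None and theirs is None:
                continue  # rank not assigned on either side, skip it
            if mine != theirs:
                break     # lineages diverge here
            common = rank
        return common


# The Exchange 2010 email example from the post.
exchange_2010 = DataTaxon(order="EMAILus", family="EXCHANGius",
                          genus="E2010ius", species="MESSAGius")

# A hypothetical sibling (assumed naming) to show where two lineages split.
exchange_2007 = DataTaxon(order="EMAILus", family="EXCHANGius",
                          genus="E2007ius", species="MESSAGius")

print(exchange_2010.lineage())
# Order=EMAILus, Family=EXCHANGius, Genus=E2010ius, Species=MESSAGius
print(exchange_2010.deepest_common_rank(exchange_2007))
# family
```

Walking the ranks from broadest to narrowest and stopping at the first mismatch mirrors how a biologist would say two species share a family but not a genus.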
I think higher classifications such as kingdom and domain need to take in more than IT. At the Kingdom level, that means things such as oral history, hand-copied manuscripts, movable-type printed documents, IT, etc. Maybe Domain would be such things as biological domain, information domain, physical domain, etc. Although where oral-h
When first thinking of higher taxonomical designations I immediately thought of the O/S, but now I think of an O/S as part of the ecological niche where data temporarily resides.
I could go on. There are probably hundreds if not thousands of other characteristics of data science that need to be discussed – data lifecycle, the data cell, information use webs, etc.
Another surprise is how well the study of biology fits the study of data science. Counterparts to biology seem to exist everywhere I look. At some deep level, biology is information, wet-ware perhaps, but information nonetheless. It seems to me that the use of biology to guide our elaboration of data science can be very useful.