I have been reading a book recently, called Uncharted: Big data as a lens on human culture by Erez Aiden and Jean-Baptiste Michel that discusses the use of Google’s Ngram search engine which counts phrases (Ngrams) used in all the books they have digitized. Ngram phrases are charted against other Ngrams and plotted in real time.
It’s an interesting concept and one example they use is “United States are” vs. “United States is” a 3-Ngram which shows that the singular version of the phrase which has often been attributed to emerge immediately after the Civil War actually was in use prior to the Civil War and really didn’t take off until 1880’s, 15 years after the end of the Civil War.
I haven’t finished the book yet but it got me to thinking. The authors petitioned Google to gain access to the Ngram data which led to their original research. But then they convinced Google after their original research time was up to release the information to the general public. Great for them but it’s only a one time event and happened to work this time with luck and persistance.
The world needs more data
But there’s plenty of other information or data out there where we could use to learn an awful lot about human social interaction and other attributes about the world that are buried away in corporate databases. Yes, sometimes this information is made public (like Google), or made available for specific research (see my post on using mobile phone data to understand people mobility in an urban environment) but these are special situations. Once the research is over, the data is typically no longer available to the general public and getting future or past data outside the research boundaries requires yet another research proposal.
And yet books and magazines are universally available for a fair price to anyone and are available in most research libraries as a general public good for free. Why should electronic data be any different?
Social and mobile dta as a public good
What I would propose is that the Library of Congress and other research libraries around the world have access to all corporate data that documents interaction between humans, humans and the environment, humanity and society, etc. This data would be freely available to anyone with library access and could be used to provide information for research activities that have yet to be envisioned.
Hopefully all of this data would be released, free of charge (or for some nominal fee) to these institutions after some period of time has elapsed. For example, if we were talking about Twitter feeds, Facebook feeds, Instagram feeds, etc. the data would be provided from say 7 years back on a reoccurring yearly or quarterly basis. Not sure if the delay time should be 7, 10 or 15 years, but after some judicious period of time, the data would be released and made publicly available.
There are a number of other considerations:
- Anonymity – somehow any information about a person’s identity, actual location, or other potentially identifying characteristics would need to be removed from all the data. I realize this may reduce the value of the data to future researchers but it must be done. I also realize that this may not be an easy thing to accomplish and that is why the data could potentially be sold for a fair price to research libraries. Perhaps after 35 to 100 years or so the identifying information could be re-incorporated into the original data set but I think this highly unlikely.
- Accessibility – somehow the data would need to have an easily accessible and understandable description that would enable any researcher to understand the underlying format of the data. This description should probably be in XML format or some other universal description language. At a minimum this would need to include meta-data descriptions of the structure of the data, with all the tables, rows and fields completely described. This could be in SQL format or just XML but needs to be made available. Also the data release itself would then need to be available in a database or in flat file formats that could be uploaded by the research libraries and then accessed by researchers. I would expect that this would use some sort of open source database/file service tools such as MySQL or other database engines. These database’s represent the counterpart to book shelves in today’s libraries and has to be universally accessible and forever available.
- Identifyability – somehow the data releases would need to be universally identifiable, not unlike the ISBN scheme currently in use for books and magazines and ISRC scheme used for recordings. This would allow researchers to uniquely refer to any data set that is used to underpin their research. This would also allow the world’s research libraries to insure that they purchase and maintain all the data that becomes available by using some sort of master worldwide catalog that would hold pointers to all this data that is currently being held in research institutions. Such a catalog entry would represent additional meta-data for the data release and would represent a counterpart to a online library card catalog.
- Legality – somehow any data release would need to respect any local Data Privacy and Protection laws of the country where the data resides. This could potentially limit the data that is generated in one country, say Germany to be held in that country only. I would think this could be easily accomplished as long as that country would be willing to host all its data in its research institutions.
I am probably forgetting a dozen more considerations but this covers most of it.
How to get companies to release their data
One that quickly comes to mind is how to compel companies to release their data in a timely fashion. I believe that data such as this is inherently valuable to a company but that its corporate value starts to diminish over time and after some time goes to 0.
However, the value to the world of such data encounters an inverse curve. That is, the longer away we are from a specific time period when that data was created, the more value it has for future research endeavors. Just consider what current researchers do with letters, books and magazine articles from the past when they are researching a specific time period in history.
But we need to act now. We are already over 7 years into the Facebook era and mobile phones have been around for decades now. We have probably already lost much of the mobile phone tracking information from the 80’s, 90’s, 00’s and may already be losing the data from the early ’10’s. Some social networks have already risen and gone into a long eclipse where historical data is probably their lowest concern. There is nothing that compels organizations to keep this data around, today.
Types of data to release
Obviously, any social networking data, mobile phone data, or email/chat/texting data should all be available to the world after 7 or more years. Also the private photo libraries, video feeds, audio recordings, etc. should also be released if not already readily available. Less clear to me are utility data, such as smart power meter readings, water consumption readings, traffic tollway activity, etc.
I would say that one standard to use might be if there is any current research activity based on private, corporate data, then that data should ultimately become available to the world. The downside to this is that companies may be more reluctant to grant such research if this is a criteria to release data.
But maybe the researchers themselves should be able to submit requests for data releases and that way it wouldn’t matter if the companies declined or not.
There is no way, anyone could possibly identify all the data that future researchers would need. So I would err on the side to be more inclusive rather than less inclusive in identifying classes of data to be released.
The dawn of Psychohistory
The Uncharted book above seems to me to represent a first step to realizing a science of Psychohistory as envisioned in Asimov’s Foundation Trilogy. It’s unclear whether this will ever be a true, quantified scientific endeavor but with appropriate data releases, readily available for research, perhaps someday in the future we can help create the science of Psychohistory. In the mean time, through the use of judicious, periodic data releases and appropriate research, we can certainly better understand how the world works and just maybe, improve its internal workings for everyone on the planet.