All that AI DL training data comes from us

Read a couple of articles the past few weeks that highlighted something that not many of us are aware of, most of the data used to train AI deep learning (DL) models comes from us.

That is through our ignorance or tacit acceptation of licenses for apps that we use every day and for just walking around/interacting with the world.

The article in Atlantic, The AI supply chain runs on ignorance, talks about Ever, a picture sharing app (like Flickr), where users opted in to its facial recognition software to tag people in pictures. Ever also used that (tagged by machine or person) data to train its facial recognition software which it sells to government agencies throughout the world.

The second article, in Engadget , Colorado College students were secretly used to train AI facial recognition (software), talks about a group using a telephoto security camera than was pointed at a high traffic area on campus. The data obtained was used to help train an AI DL model to identify facial characteristics from far away.

The article went on to say that gathering photos from people in public places is not against the law. The study was also cleared by the school. The database was not released until after the students graduated but it did have information about the time and date the photos were taken.

But that’s nothing…

The same thing applies to video sharing and photo animation models, podcasting and text speaking models, blogging and written word generation models, etc. All this data is just lying around the web, freely available for any AI DL data engineer to grab and use to train their models. The article which included the image below talks about a new dataset of millions of webpages.

From an OpenAI paper on better language models showing the accuracy of some AI DL models “trained on a new dataset of millions of webpages called WebText.”

,Google photo search is scanning the web and has access to any photo posted to use for training data. Facebook, IG, and others have millions of photos that people are posting online every day, many of which are tagged, with information identifying people in the photos. I’m sure some where there’s a clause in a license agreement that says your photos, when posted on our app, no longer belong to you alone.

As security cameras become more pervasive, camera data will readily be used to train even more advanced facial recognition models without your say so, approval or even appreciation that it is happening. And this is in the first world, with data privacy and identity security protections paramount, imagine how the rest of the world’s data will be used.

With AI DL models, it’s all about the data. Yes much of it is messy and has to be cleaned up, massaged and sometimes annotated to be useful for DL training. But the origins of that training data are typically not disclosed to the AI data engineers nor the people that created it.

We all thought China would have a lead in AI DL because of their unfettered access to data, but the west has its own way to gain unconstrained access to vast amounts of data. And we are living through it today.

Yes AI DL models have the potential to drastically help the world, humanity and government do good things better. But a dark side to AI DL models also exist to help bad actors, organizations and even some government agencies do evil.

Caveat usor (May the user beware)



Photo Credit(s): “Still Watching You” by jhcrow is licensed under CC BY-NC 2.0 

Computational Photography Homework 1 Results.” by kscottz is licensed under CC BY-NC 2.0 

From Language models are unsupervised multi-task learners OpenAI research paper

IT in space

Read an article last week about all the startup activity that’s taking place in space systems and infrastructure (see: As rocket companies proliferate … new tech emerges leading to a new space race). This is a consequence of cheap(er) launch systems from SpaceX, Blue Origin, Rocket Lab and others.

SpaceBelt, storage in space

One startup that caught my eye was SpaceBelt from Cloud Constellation Corporation, that’s planning to put PB (4X library of congress) of data storage in a constellation of LEO satellites.

The LEO storage pool will be populated by multiple nodes (satellites) with a set of geo-synchronous access points to the LEO storage pool. Customers use ground based secure terminals to talk with geosynchronous access satellites which communicate to the LEO storage nodes to access data.

Their main selling points appear to be data security and availability. The only way to access the data is through secured satellite downlinks/uplinks and then you only get to the geo-synchronous satellites. From there, those satellites access the LEO storage cloud directly. Customers can’t access the storage cloud without going through the geo-synchronous layer first and the secured terminals.

The problem with terrestrial data is that it is prone to security threats as well as natural disasters which take out a data center or a region. But with all your data residing in a space cloud, such concerns shouldn’t be a problem. (However, gaining access to your ground stations is a whole different story.

AWS and Lockheed-Martin supply new ground station service

The other company of interest is not a startup but a link up between Amazon and Lockheed Martin (see: Amazon-Lockheed Martin …) that supplies a new cloud based, satellite ground station as a service offering. The new service will use Lockheed Martin ground stations.

Currently, the service is limited to S-Band and attennas located in Denver, but plans are to expand to X-Band and locations throughout the world. The plan is to have ground stations located close to AWS data centers, so data center customers can have high speed, access to satellite data.

There are other startups in the ground station as a service space, but none with the resources of Amazon-Lockheed. All of this competition is just getting off the ground, but a few have been leasing idle ground station resources to customers. The AWS service already has a few big customers, like DigitalGlobe.

One thing we have learned, is that the appeal of cloud services is as much about the ecosystem that surrounds it, as the service offering itself. So having satellite ground stations as a service is good, but having these services, tied directly into other public cloud computing infrastructure, is much much better. Google, Microsoft, IBM are you listening?

Data centers in space

Why stop at storage? Wouldn’t it be better to support both storage and computation in space. That way access latencies wouldn’t be a concern. When terrestrial disasters occur, it’s not just data at risk. Ditto, for security threats.

Having whole data centers, would represent a whole new stratum of cloud computing. Also, now IT could implement space native applications.

If Microsoft can run a data center under the oceans, I see no reason they couldn’t do so in orbit. Especially when human flight returns to NASA/SpaceX. Just imagine admins and service techs as astronauts.

And yet, security and availability aren’t the only threats one has to deal with. What happens to the space cloud when war breaks out and satellite killers are set loose.

Yes, space infrastructure is not subject to terrestrial disasters or internet based security risks, but there are other problems besides those and war that exist such as solar storms and space debris clouds. .

In the end, it’s important to have multiple, non-overlapping risk profiles for your IT infrastructure. That is each IT deployment, may be subject to one set of risks but those sets are disjoint with another IT deployment option. IT in space, that is subject to solar storms, space debris, and satellite killers is a nice complement to terrestrial cloud data centers, subject to natural disasters, internet security risks, and other earth-based, man made disasters.

On the other hand, a large, solar storm like the 1859 one, could knock every data system on the world or in orbit, out. As for under the sea, it probably depends on how deep it was submerged!!

Photo Credit(s): Screen shots from SpaceBelt youtube video (c) SpaceBelt

Screens shot from AWS Ground Station as a Service sign up page (c) Amazon-Lockheed

Screen shots from Microsoft’s Under the sea news feature (c) Microsoft

The wizardry of StorMagic

We talked with Hans O’Sullivan, CEO and Chris Farey, CTO of StorMagic during Storage Field Days 6 (SFD6, view videos of their session) a couple of weeks back and they presented some interesting technology, at least to me.

Their SvSAN, software defined storage  solution has been around since 2009, and was originally intended to provide shared storage for SMB environments but was changed in 2011 to focus more on remote offices/branch offices (ROBO) for larger customers.

What makes the SvSAN such an appealing solution is that it’s a software-only storage solution that can use a minimum of 2 servers to provide a high availability, shared block storage cluster which can all be managed from one central site. Their SvSAN installs as a virtual storage appliance that runs as a virtual machine under a hypervisor and you can assign it to manage as much or as little of the direct access or SAN attached storage available to the server.

SvSAN customers

As of last count they had 30K licenses, in 64 countries, across 6 continents, were managing over 57PB of data, and had one (large retail) customer with over 2000 sites managed from one central location.  They had pictures of one customer in their presentation which judging by the color was obvious who it was but they couldn’t actually say.

One customer with a 1000’s of sites had prior storage that was causing 100’s of store outages a year, each of which averaged 6 hours to recover which cost them $6K each. Failure cost could be much larger and much longer, if there was a data loss.  They obviously needed a much more reliable storage system and wanted to reduce their cost of maintenance. Turning to SvSAN saved them lot’s of $s and time and eliminated their maintenance downtime.

Their largest vertical is retail but StorMagic does well in most ROBO environments which have limited IT staff, and limited data requirements. Other verticals they mentioned included defense (they specifically mentioned the German Army who have a parachute deployable, all-SSD SvSAN storage/data center), manufacturing (with small remote factories), government with numerous sites around the world, financial services (banks with many remote offices), restaurant and hotel chains, large energy companies, wind farms, etc.  Hans mentioned one a large wind farm operator that said their “field” data centers were so remote it took 6 days to get someone out to them to solve a problem but they needed 600GBs of shared storage to manage the complex.

SvSAN architecture

SvSAN uses synchronous mirroring between pairs of servers so that the data is constantly available in both servers of a pair. Presumably the amount of storage available to the SvSAN VSA’s running in the two servers have to be similar in capacity and performance.

An SvSAN cluster can grow by adding pairs of servers or by adding storage to an already present SvSAN cluster. One can have as many pairs of servers in an SvSAN local cluster as you want (probably some maximum here but I can’t recall what they said). The cluster interconnect is 1GbE or 10GbE. Most (~90%) of SvSAN implementations are under 2TB of data but their largest single clustered configuration is 200TB.

SvSAN supplies iSCSI storage services and runs inside a Linux virtual machine. But SvSAN can support both bare metal as well as virtualized server environments.

All the storage within a server that is assigned to SvSAN is pooled together and carved out as iSCSI virtual disks.  SvSAN can make use of raid controller with JBODs, DAS or even SAN storage, anything that is accessible to a virtual machine can be configured as part of SvSAN’s storage pool.

Servers that are accessing the shared iSCSI storage may access either of the servers in a synchronous mirrored pair. As it’s a synchronous mirror, any writes written to one of the servers is automatically mirrored to the other side before an acknowledgement is sent back to the host. Synchronous mirroring depends on multi-pathing software at the host.

As in any solution that supports active-active read-write access there is a need for a Quorum service to be hosted somewhere in the environment. Hopefully, at some location distinct from where a problem could potentially occur, but it doesn’t have to be. In StorMagic’s case this could reside on any physical server, even in the same environment. The Quorum service is there to decide which of the two copies is “more” current when there is some sort of split brain scenario. That is when the two servers in a synchronized pair lose communication with one another. At that point the Quorum service declares one dead and the other active and from that point on all IO activity must be done through the active SvSAN server. The Quorum service can also run on Linux or Windows and remotely or locally. Any configuration changes will need to be communicated to the Quorum service.

They have a bare metal recovery solution. Specifically, when one server fails, customers can ship out another server with a matching configuration to be installed in the remote site. When the new server comes up, it auto-configures it’s storage and networking by using the currently active server in the environment and starts a resynchronization process with that server. Which all means it can be brought up into a high availability mode with almost no IT support other than what it takes to power the server and connect some networking ports. This was made for ROBO!

Code upgrades can be done by taking one of the pair of servers down and loading the new code and resynching it’s data. Then once resynch completes you can do the same with the other server.

They support a fast-resynch service for when one of the pair goes down for any reason. At that point the active server starts tracking any changes that occur in a journal and when the other server comes up it just resends the changes that have occurred since the last time it was up.

SvSAN has support for SSDs and just released an SSD write back caching feature to help improve disk write speeds. They also support an all SSD configuration for harsh environments.

StorMagic also offers an option for non-mirrored disk but I can’t imagine why anyone would use it.

They can dynamically move one mirrored iSCSI volume from one pair of servers to another, without disrupting application activity.

Minimum hardware configuration requires a single core server but can use as many cores that you can give it. StorMagic commented that a single core maxes out at 50-60K IOPS but you can always just add more cores to the solution.

The SvSAN cluster can be managed in VMware vCenter or Microsoft System Center (MSSC) and it maintains statistics which help monitor the storage clusters in the remote office environments.

They also have a scripted recipe to help bring up multiple duplicate remote sites where local staff only need to plug in minimal networking and some storage information and they are ready to go.

SvSAN pricing and other information

Their product lists for 2 servers and 2TB of data storage is $2K and they have standard license options for 4, 8, and 16TB across a server pair after which it’s unlimited amounts of storage for the same price of $10K. This doesn’t include hardware or physical data storage this is just for the SvSAN software and management.

They offer a free 60 day evaluation license on their website (see link above).

There was a lot of twitter traffic and onsite discussion as to how this compared to HP’s StorVirtual VSA solution. The contention was that StorVirtual required more nodes but there was no-one from HP there to dispute this.

Didn’t hear much about snapshot, thin provisioning, remote replication, deduplication or encryption. But for ROBO office environments, that are typically under 2TB most of these features are probably overkill, especially when there’s no permanent onsite IT staff to support the local storage environment.


I had talked with StorMagic previously at one or more of the storage/IT conferences we have attended at the past and had relegated them to SMB  storage solutions. But after talking with them at SFD6, their solution became quite clearer. All of the sophisticated functionality they have developed together with their software only solution, seems to be  very appealing solution for these ROBO environments.




Protest intensity, world news database and big data – chart of the month

Read an article the other day on the analysis of the Arab Spring (Did the Arab Spring really spark a wave of global protests, in Foreign Policy) using a Google Ideas sponsored project, the GDELT ProjectTime domain run chart showing protest intensity every month for the last 30 years, with running average (Global Database of Events, Language and Tone) file of  events extracted from worldwide media sources.  The GDELT database uses sophisticated language processing to extract “event” data from news media streams and supplies this information in database form.  The database can be analyzed  to identify  trends in world events and possibly to better understand what led up to events that occur on our planet.

GDELT Project

The GDELT database records over 300 categories of events that are geo-referenced to city/mountaintop and time-referenced. The event data dates back to 1979.  The GDELT data captures 60 attributes of any event that occurs, generating a giant spreadsheet of event information with location, time, parties, and myriad other attributes all identified, and cross-referenceable.

Besides the extensive spreadsheet of world event attribute data the GDELT project also supplies a knowledge graph oriented view of its event data. The GDELT knowledge graph “compiles a list of every person, organization, company, location and over 230 themes and emotions from every news report” that can then be used to create network diagrams/graphs to be better able to visualize interactions between events. 

For example see the Global Conversation in Foreign Policy, for a network diagram of every person mentioned in the news during 6 months of 2013.  You can zoom in or out to see how people identified in news reports are connected during the six months. So if you we’re interested, in let’s say the Syrian civil war, one could easily see at a glance any news item that mentioned Syria or was located in Syria since 1979 to now. Wow!

Arab Spring and Worldwide Protest

Getting back to the chart-of-the-month, the graphic above shows the “protest intensity” by month for the last 30 years with a running average charted in black using GDELT data.  (It’s better seen in the FP article/link above or just click on it for an expanded view. ).

One can see from the chart that there was a significant increase in protest activity after January 2011, which corresponds to the beginning of the Arab Spring.  But the amazing inference from the chart above is that this increase has continued ever since. This shows that the Arab Spring has had a lasting contribution that has significantly increased worldwide protest activity.

This is just one example of the types of research available with the GDELT data.


I have talked in the past about how (telecom, social media and other) organizations should deposit their corporate/interaction data sets in some public repository for the better good of humanity so that any researcher could use it (see my Data of the world, lay down your chains post for more on this). The GDELT Project is Google Ideas doing this on a larger scale than I ever thought feasible. Way to go.


 Image credits: (c) 2014, All Rights Reserved



Replacing the Internet?

safe 'n green by Robert S. Donovan (cc) (from flickr)
safe ‘n green by Robert S. Donovan (cc) (from flickr)

Was reading an article the other day from TechCrunch that said Servers need to die to save the Internet. This article talked about a startup called MaidSafe which is attempting to re-architect/re-implement/replace the Internet into a Peer-2-Peer, mesh network and storage service which they call the SAFE (Secure Access for Everyone) network. By doing so, they hope to eliminate the need for network servers and storage.

Sometime in the past I wrote a blog post about Peer-2-Peer cloud storage (see Free P2P Cloud Storage and Computing if  interested). But it seems MaidSafe has taken this to a more extreme level. By the way the acronym MAID used in their name stands for Massive Array of Internet Disks, sound familiar?

Crypto currency eco-system

The article talks about MaidSafe’s SAFE network ultimately replacing the Internet but at the start it seems more to be a way to deploy secure, P2P cloud storage.  One interesting aspect of the MaidSafe system is that you can dedicate a portion of your Internet connected computers’ storage, computing and bandwidth to the network and get paid for it. Assuming you dedicate more resources than you actually use to the network you will be paid safecoins for this service.

For example, users that wish to participate in the SAFE network’s data storage service run a Vault application and indicate how much internal storage to devote to the service. They will be compensated with safecoins when someone retrieves data from their vault.

Safecoins are a new BitCoin like internet currency. Currently one safecoin is worth about $0.02 but there was a time when BitCoins were worth a similar amount. MaidSafe organization states that there will be a limit to the number of safecoins that can ever be produced (4.3Billion) so there’s obviously a point when they will become more valuable if MaidSafe and their SAFE network becomes successful over time. Also, earned safecoins can be used to pay for other MaidSafe network services as they become available.

Application developers can code their safecoin wallet-ids directly into their apps and have the SAFE network automatically pay them for application/service use.  This should make it much easier for App developers to make money off their creations, as they will no longer have to use advertising support, or provide differenct levels of product such as free-simple user/paid-expert use types of support to make money from Apps.  I suppose in a similar fashion this could apply to information providers on the SAFE network. An information warehouse could charge safecoins for document downloads or online access.

All data objects are encrypted, split and randomly distributed across the SAFE network

The SAFE network encrypts and splits any data up and then randomly distributes these data splits uniformly across their network of nodes. The data is also encrypted in transit across the Internet using rUDPs (reliable UDPs) and SAFE doesn’t use standard DNS services. Makes me wonder how SAFE or Internet network nodes know where rUDP packets need to go next without DNS but I’m no networking expert. Apparently by encrypting rUDPs and not using DNS, SAFE network traffic should not be prone to deep packet inspection nor be easy to filter out (except of course if you block all rUDP traffic).  The fact that all SAFE network traffic is encrypted also makes it much harder for intelligence agencies to eavesdrop on any conversations that occur.

The SAFE network depends on a decentralized PKI to authenticate and supply encryption keys. All SAFE network data is either encrypted by clients or cryptographically signed by the clients and as such, can be cryptographically validated at network endpoints.

The each data chunk is replicated on, at a minimum, 4 different SAFE network nodes which provides resilience in case a network node goes down/offline. Each data object could potentially be split up into 100s to 1000s of data chunks. Also each data object has it’s own encryption key, dependent on the data itself which is never stored with the data chunks. Again this provides even better security but the question becomes where does all this metadata (data object encryption key, chunk locations, PKI keys, node IP locations, etc.) get stored, how is it secured, and how is it protected from loss. If they are playing the game right, all this is just another data object which is encrypted, split and randomly distributed but some entity needs to know how to get to the meta-data root element to find it all in case of a network outage.

Supposedly, MaidSafe can detect within 20msec. if a node is no longer available and reconfigure the whole network. This probably means that each SAFE network node and endpoint is responsible for some network transaction/activity every 10-20msec, such as a SAFE network heartbeat to say it is still alive.

It’s unclear to me whether the encryption key(s) used for rUDPs and the encryption key used for the data object are one and the same, functionally related, or completely independent? And how a “decentralized PKI”  and “self authentication” works is beyond me but they published a paper on it, if interested.

For-profit open source business model

MaidSafe code is completely Open Source (available at MaidSafe GitHub) and their APIs are freely available to anyone and require no API key. They also have multiple approved and pending patents which have been provided free to the world for use, which they use in a defensive capacity.

MaidSafe says it will take a 5% cut of all safecoin transactions over the SAFE network. And as the network grows their revenue should grow commensurately. The money will be used to maintain the core network software and  MaidSafe said that their 5% cut will be shared with developers that help develop/fix the core SAFE network code.

They are hoping to have multiple development groups maintaining the code. They currently have some across Europe and in California in the US. But this is just a start.

They are just now coming out of stealth, have recently received $6M USD investment (by auctioning off MaidSafeCoins a progenitor of safecoins) but have been in operation now, architecting/designing/developing the core code now for 8+ years now, which probably qualifies them for the longest running startup on the planet.

Replacing the Internet

MaidSafe believes that the Internet as currently designed is too dependent on server farms to hold pages and other data. By having a single place where network data is held, it’s inherently less secure than by having data spread out, uniformly/randomly across a multiple nodes. Also the fact that most network traffic is in plain text (un-encrypted) means anyone in the network data path can examine and potentially filter out data packets.

I am not sure how the SAFE network can be used to replace the Internet but then I’m no networking expert. For example, from my perspective, SAFE is dependent on current Internet infrastructure to store and forward rUDPs on along its trunk lines and network end-paths. I don’t see how SAFE can replace this current Internet infrastructure especially with nodes only present at the endpoints of the network.

I suppose as applications and other services start to make use of SAFE network core capabilities, maybe the SAFE network can become more like a mesh network and less dependent on the current hub and spoke current Internet we have today.  As a mesh network, node endpoints can store and forward packets themselves to locally accessed neighbors and only go out on Internet hubs/trunk lines when they have to go beyond the local network link.

Moreover, the SAFE can make any Internet infrastructure less vulnerable to filtering and spying. Also, it’s clear that SAFE applications are no longer executing in data center servers somewhere but rather are actually executing on end-point nodes of the SAFE network. This has a number of advantages, namely:

  • SAFE applications are less susceptible to denial of service attacks because they can execute on many nodes.
  • SAFE applications are inherently more resilient because the operate across multiple nodes all the time.
  • SAFE applications support faster execution because the applications could potentially be executing closer to the user and could potentially have many more instances running throughout the SAFE network.

Still all of this doesn’t replace the Internet hub and spoke architecture we have today but it does replace application server farms, CDNs, cloud storage data centers and probably another half dozen Internet infrastructure/services I don’t know anything about.

Yes, I can see how MaidSafe and its SAFE network can change the Internet as we know and love it today and make it much more secure and resilient.

Not sure how having all SAFE data being encrypted will work with search engines and other web-crawlers but maybe if you want the data searchable, you just cryptographically sign it. This could be both a good and a bad thing for the world.

Nonetheless, you have to give the MaidSafe group a lot of kudos/congrats for taking on securing the Internet and making it much more resilient. They have an active blog and forum that discusses the technology and what’s happening to it and I encourage anyone interested more in the technology to visit their website to learn more



Computational Anthropology & Archeology

7068119915_732dd1ef63_zRead an article this week from Technology Review on The Emerging Science of Computational Anthropology. It was about the use of raw social media feeds to study the patterns of human behavior and how they change over time. In this article, they had come up with some heuristics that could be used to identify when people are local to an area and when they are visiting or new to an area.

Also, this past week there was an article in the Economist about Mining for Tweets of Gold about the startup DataMinr that uses raw twitter feeds to supply information about what’s going on in the world today. Apparently DataMinr is used by quite a few financial firms, news outlets, and others and has a good reputation for discovering news items that have not been reported yet. DataMinr is just one of a number of commercial entities doing this sort of analysis on Twitter data.

A couple of weeks ago I wrote a blog post on Free Social and Mobile Data as a Public Good. In that post I indicated that social and mobile data should be published, periodically in an open format, so that any researcher could examine it around the world.

Computational Anthropology

Anthropology is the comparative study of human culture and condition, both past and present. Their are many branches to the study of  Anthropology including but not limited to physical/biological, social/cultural, archeology and linguistic anthropologies. Using social media/mobile data to understand human behavior, development and culture would fit into the social/cultural branch of anthropology.

I have also previously written about some recent Computational Anthropological research (although I didn’t call it that), please see my Cheap phones + big data = better world and Mobile phone metadata underpins a new science posts. The fact is that mobile phone metadata can be used to create a detailed and deep understanding of a societies mobility.  A better understanding of human mobility in a region can be used to create more effective mass transit, more efficient  road networks, transportation and reduce pollution/energy use, among other things.

Social media can be used in a similar manner but it’s more than just location information, and some of it is about how people describe events and how they interact through text and media technologies. One research paper discussed how tweets could be used to detect earthquakes in real time (see: Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors).

Although the location information provided by mobile phone data is more important to governments and transportation officials, it appears as if social media data is more important to organizations seeking news, events, or sentiment trending analysis.

Sources of the data today

Recently, Twitter announced that it would make its data available to a handful of research organizations (see: Twitter releasing trove of user data …).

On the other hand Facebook and LinkedIn seems a bit more restrictive in allowing access to their data. They have a few data scientists on staff but if you want access to their data you have to apply for it and only a few are accepted.

Although Google, Twitter, Facebook, LinkedIn and Telecoms represent the lions share of social/mobile data out there today, there are plenty of others sources of information that could potentially be useful that come to mind. Notwithstanding the NSA, currently there is limited research accessibility to the actual texts of mobile phone texts/messaging, and god forbid, emails.  Although privacy concerns are high, I believe ultimately this needs to change.

Imagine if some researchers had access to all the texts of a high school student body. Yes much of it would be worthless but some of it would tell a very revealing story about teenage relationships, interest and culture among other things. And having this sort of information over time could reveal the history of teenage cultural change. Much of this would have been previously available through magazines but today texts would represent a much more granular level of this information.

Computational Archeology

Archeology is just anthropology from a historical perspective, i.e, it is the study of the history of cultures, societies and life.  Computational Archeology would apply to the history of the use of computers, social media, telecommunications, Internet/WWW, etc.

There are only few resources that are widely available for this data such as the Internet Archive. But much of the history of WWW, social media, telecom, etc. use is in current and defunct organizations that aside from Twitter, continue to be very stingy with their data.

Over time all such data will be lost or become inaccessible unless something is done to make it available to research organizations. I believe sooner or later humanity will wise up to the loss of this treasure trove of information and create some sort of historical archive for this data and require companies to supply this data over time.


Photo Credit(s): State of the Linked Open Data Cloud (LOD), September 2011 by Duncan Hull

Releasing social and mobile data as a public good

I have been reading a book recently, called Uncharted: Big data as a lens on human culture by Erez Aiden and Jean-Baptiste Michel that discusses the use of Google’s Ngram search engine which counts phrases (Ngrams) used in all the books they have digitized. Ngram phrases are charted against other Ngrams and plotted in real time.

It’s an interesting concept and one example they use is “United States are” vs. “United States is” a 3-Ngram which shows that the singular version of the phrase which has often been attributed to emerge immediately after the Civil War actually was in use prior to the Civil War and really didn’t take off until 1880’s, 15 years after the end of the Civil War.

I haven’t finished the book yet but it got me to thinking. The authors petitioned Google to gain access to the Ngram data which led to their original research. But then they convinced Google after their original research time was up to release the information to the general public. Great for them but it’s only a one time event and happened to work this time with luck and persistance.

The world needs more data

But there’s plenty of other information or data out there where we could use to learn an awful lot about human social interaction and other attributes about the world that are buried away in corporate databases. Yes, sometimes this information is made public (like Google), or made available for specific research (see my post on using mobile phone data to understand people mobility in an urban environment) but these are special situations. Once the research is over, the data is typically no longer available to the general public and getting future or past data outside the research boundaries requires yet another research proposal.

And yet books and magazines are universally available for a fair price to anyone and are available in most research libraries as a general public good for free.  Why should electronic data be any different?

Social and mobile dta as a public good

What I would propose is that the Library of Congress and other research libraries around the world have access to all corporate data that documents interaction between humans, humans and the environment, humanity and society, etc.  This data would be freely available to anyone with library access and could be used to provide information for research activities that have yet to be envisioned.

Hopefully all of this data would be released, free of charge (or for some nominal fee) to these institutions after some period of time has elapsed. For example, if we were talking about Twitter feeds, Facebook feeds, Instagram feeds, etc. the data would be provided from say 7 years back on a reoccurring yearly or quarterly basis. Not sure if the delay time should be 7, 10 or 15 years, but after some judicious period of time, the data would be released and made publicly available.

There are a number of other considerations:

  • Anonymity – somehow any information about a person’s identity, actual location, or other potentially identifying characteristics would need to be removed from all the data.  I realize this may reduce the value of the data to future researchers but it must be done. I also realize that this may not be an easy thing to accomplish and that is why the data could potentially be sold for a fair price to research libraries. Perhaps after 35 to 100 years or so the identifying information could be re-incorporated into the original data set but I think this highly unlikely.
  • Accessibility – somehow the data would need to have an easily accessible and understandable description that would enable any researcher to understand the underlying format of the data. This description should probably be in XML format or some other universal description language. At a minimum this would need to include meta-data descriptions of the structure of the data, with all the tables, rows and fields completely described. This could be in SQL format or just XML but needs to be made available. Also the data release itself would then need to be available in a database or in flat file formats that could be uploaded by the research libraries and then accessed by researchers. I would expect that this would use some sort of open source database/file service tools such as MySQL or other database engines. These database’s represent the counterpart to book shelves in today’s libraries and has to be universally accessible and forever available.
  • Identifyability – somehow the data releases would need to be universally identifiable, not unlike the ISBN scheme currently in use for books and magazines and ISRC scheme used for recordings. This would allow researchers to uniquely refer to any data set that is used to underpin their research. This would also allow the world’s research libraries to insure that they purchase and maintain all the data that becomes available by using some sort of master worldwide catalog that would hold pointers to all this data that is currently being held in research institutions. Such a catalog entry would represent additional meta-data for the data release and would represent a counterpart to a online library card catalog.
  • Legality – somehow any data release would need to respect any local Data Privacy and Protection laws of the country where the data resides. This could potentially limit the data that is generated in one country, say Germany to be held in that country only. I would think this could be easily accomplished as long as that country would be willing to host all its data in its research institutions.

I am probably forgetting a dozen more considerations but this covers most of it.

How to get companies to release their data

One that quickly comes to mind is how to compel companies to release their data in a timely fashion. I believe that data such as this is inherently valuable to a company but that its corporate value starts to diminish over time and after some time goes to 0.

However, the value to the world of such data encounters an inverse curve. That is, the longer away we are from a specific time period when that data was created, the more value it has for future research endeavors. Just consider what current researchers do with letters, books and magazine articles from the past when they are researching a specific time period in history.

But we need to act now. We are already over 7 years into the Facebook era and mobile phones have been around for decades now. We have probably already lost much of the mobile phone tracking information from the 80’s, 90’s, 00’s and may already be losing the data from the early ’10’s. Some social networks have already risen and gone into a long eclipse where historical data is probably their lowest concern. There is nothing that compels organizations to keep this data around, today.

Types of data to release

Obviously, any social networking data, mobile phone data, or email/chat/texting data should all be available to the world after 7 or more years.  Also the private photo libraries, video feeds, audio recordings, etc. should also be released if not already readily available. Less clear to me are utility data, such as smart power meter readings, water consumption readings, traffic tollway activity, etc.

I would say that one standard to use might be if there is any current research activity based on private, corporate data, then that data should ultimately become available to the world. The downside to this is that companies may be more reluctant to grant such research if this is a criteria to release data.

But maybe the researchers themselves should be able to submit requests for data releases and that way it wouldn’t matter if the companies declined or not.

There is no way, anyone could possibly identify all the data that future researchers would need. So I would err on the side to be more inclusive rather than less inclusive in identifying classes of data to be released.

The dawn of Psychohistory

The Uncharted book above seems to me to represent a first step to realizing a science of Psychohistory as envisioned in Asimov’s Foundation Trilogy. It’s unclear whether this will ever be a true, quantified scientific endeavor but with appropriate data releases, readily available for research, perhaps someday in the future we can help create the science of Psychohistory. In the mean time, through the use of judicious, periodic data releases and appropriate research, we can certainly better understand how the world works and just maybe, improve its internal workings for everyone on the planet.


Picture Credit(s): Amazon and Wikipedia 

Data of the world, lay down your chains

Prison Planet by AZRainman (cc) (from Flickr)
Prison Planet by AZRainman (cc) (from Flickr)

GitHub, that open source free repository of software, is taking on a new role, this time as a repository for municipal data sets. At least that’s what a recent article on the website (see Catch my Diff: GitHub’s New Feature Means Big Things for Open Data) after GitHub announced new changes in its .GeoJSON support (see Diffable, more customizable maps)

The article talks about the fact that maps in Github (using .GeoJSON data) can be now DIFFed, that is see at a glance what changes have been made to it. In the one example in the article (easier to see in GitHub) you can see how one Chicago congressional district has changed over time.

Unbeknownst to me, GitHub started becoming a repository for geographical data. That is any .GeoJson data file can be now be saved as a repository on GitHub and can be rendered as a map using desktop or web based tools. With the latest changes at GitHub, now one can see changes that are made to a .GeoJSON file as two or more views of a map or properties of map elements.

Of course all the other things one can do with GitHub repositories are also available, such as FORK, PULL, PUSH, etc. All this functionality was developed to support software coding but can apply equally well to .GeoJSON data files. Because .GeoJSON data files look just like source code (really more like .XML, but close enough).

So why maps as source code data?

Municipalities have started to use GitHub to host their Open Data initiatives. For example Digital Chicago has started converting some of their internal datasets into .GeoJSON data files and loading them up on GitHub for anyone to see, fork, modify, etc.

I was easily able to login and fork one of the data sets. But there’s a little matter of pushing your committed changes to the project owner that needs to happen before you can modify the original dataset.

Also I was able to render the .GeoJSON data into a viewable map by just clicking on a commit file (I suppose this is a web service). The ReadME file has instructions for doing this on your desktop outside of a web browser for R, Ruby and Python.

In any case, having the data online, editable and commitable would allow anyone with GitHub account to augment the data to make it better and more comprehensive. Of course with the data now online, any application could make use of it to offer services based on the data.

I guess that’s what Open Data movement is all about, make government, previously proprietary data freely available in a standardized format, and add tools to view and modify it, in the hope that businesses see a way to make use of it in new ways. As such, In  the data should become more visible and more useful to the world and the cities that are supporting it.

If you want to learn more about Project Open Data see the blog post from last year on or the GitHub Project [wiki] pages.