What’s wrong with SPECsfs2008?

I have been analyzing SPECsfs results for almost 7 years now, and I feel it’s time to discuss some of the problems with SPECsfs2008 that should be fixed in the next SPECsfs20xx, whenever that comes out.


First and foremost, for CIFS, SMB 1 is no longer pertinent to today’s data center. The Microsoft world has mostly moved on to SMB 2 and is currently migrating to SMB 3. There were plenty of performance fixes in last year’s SMB 3.0 release which would be useful to test with current storage systems. But I would even be somewhat happy with SMB 2 if that’s all I can hope for.

My friends at Microsoft would consider me remiss if I didn’t mention that since SMB 2 they no longer call it CIFS and have moved to SMB. SPECsfs should follow this trend. I have tried to use CIFS/SMB in my blog posts/dispatches as a step in this direction mainly because SPEC continues to use CIFS and Microsoft wants me to use SMB.

In my continuing quest to better compare different protocol performance, I believe it would be useful to ensure that the same file size distributions are used for both CIFS and NFS benchmarks. Although the current Users Guide discusses some file size information for NFS, it is silent when it comes to CIFS. I have been assuming that they were the same for lack of information, but this is worth confirming in the documentation.

Finally for CIFS, it would be very useful if there could be a closer approximation of the same amount of data transfers that are done for NFS.  This is a nit but when I compare CIFS to NFS storage system results there is a slight advantage to NFS because NFS’s workload definition doesn’t do as much reading as CIFS. In contrast, CIFS has slightly less file data write activity than the NFS benchmark workload. Having them be exactly the same would help in any (unsanctioned) comparisons.


As for NFSv3, although NFSv4 has been out for more than 3 years now, it has taken a long time to be widely adopted. However, these days there seems to be more client and storage support coming online every day and maybe this would be a good time to move on to NFSv4.

The current NFS workloads, while great for normal file server activities, have not kept pace with much of how NFS is used today, especially in virtualized environments. As far as I can tell, under VMware, NFS data stores don’t do a lot of meta-data operations and do an awful lot more data transfers than normal file servers do. Similar concerns apply to NFS used for Oracle or other databases. It’s unclear how one could incorporate a more data intensive workload mix into the standard SPECsfs NFS benchmark, but it’s worthy of some thought. Perhaps we could create a SPECvms20xx benchmark that would test these types of more data intensive workloads.

For both NFSv3 and CIFS benchmarks

Both the NFSv3 and CIFS benchmarks typically report [throughput] ops/sec. These are a mix of all the meta-data activities and the data transfer activities. However, I think many storage customers and users would like a finer view of system performance.

I have often been asked just how many files a storage system can actually support. This depends of course on the workload and file size distributions, but SPECsfs already defines these. As a storage performance expert, I would also like to know how much data transfer a storage system can support, in MB/sec read and written. I believe both of these metrics can be extracted from the current benchmark programs with a little additional effort. There are probably another half dozen metrics that would be useful; maybe we could sit down and have an open discussion of what these might be.

Also, the world has changed significantly over the last 6 years, and SSDs and flash have become much more prevalent. Some of your standard configuration tables could be better laid out to ensure that readers understand just how much DRAM, flash, SSDs and disk drives are in a configuration.

Beyond file NAS

Going beyond SPECsfs, there is a whole new class of storage, namely object storage, for which there are no benchmarks available. I would think now that Amazon S3 and OpenStack Swift are well defined and available, maybe a new set of SPECobj20xx benchmarks would be warranted. I believe with the adoption of software defined data centers, object storage may become the storage of choice over the next decade or so. If that’s the case, then having a benchmark to measure object storage performance would help in its adoption, much like the original SPECsfs did for NFS.

Then there’s the whole realm of server SAN or (hyper-)converged storage which uses DAS inside a cluster of compute servers to support block and file services. Not sure exactly where this belongs but NFS is typically the first protocol of choice for these systems and having some sort of benchmark configuration that supports converged storage would help adoption of this new type of storage as well.

I think that’s about it for now, but there’s probably a whole bunch more that I am missing here.


Posted in CIFS/SMB throughput, Clustered storage, desktop virtualization, File Storage, NFS throughput, SPECsfs, SPECsfs2008, Storage, Storage performance, Storage virtualization, System effectiveness

Latest SPC-2 performance results – chart of the month

Spider chart top 10 SPC-2 MB/second broken out by workload LFP, LDQ and VOD

In the figure above you can see one of the charts from our latest performance dispatch on SPC-1 and SPC-2 benchmark results. The above chart shows SPC-2 throughput results sorted in aggregate MB/sec order, with all three workloads broken out for more information.

Just last quarter I was saying it didn’t appear as if any all-flash system could do well on SPC-2’s throughput-intensive workloads. Well, I was wrong (again): with an aggregate MBPS™ of ~33.5GB/sec, Kaminario’s all-flash K2 took the SPC-2 MBPS results to a whole different level, almost doubling the nearest competitor in this category (Oracle ZFS ZS3-4).

OK, Howard Marks (deepstorage.net), my GreyBeardsOnStorage podcast co-host and long-time friend, had warned me that SSDs had the throughput to be winners at SPC-2, but that they would probably cost too much to be viable. I didn’t believe him at the time; how wrong could I be.

As for cost, both Howard and I misjudged this one. The K2 came in at just under $1M USD, whereas the #2 Oracle system was under $400K. But there were five other top 10 SPC-2 MBPS systems over $1M, so the K2 all-flash system price was about average for the top 10.

OK, if cost and high throughput aren’t the problem, why haven’t we seen more all-flash system SPC-2 benchmarks? I tend to think that most flash systems are optimized for OLTP-like update activity and not sequential throughput. The K2 is obviously one exception. But I think we need to go a little deeper into the numbers to understand just what it was doing so well.

The details

The LFP (large file processing) reported MBPS metric is the average of 1MB and 256KB data transfer sizes, streaming activity with 100% write, 100% read and 50%:50% read-write. In K2’s detailed SPC-2 report, one can see that for the 100% write workload the K2 was averaging ~26GB/sec, while for the 100% read workload the K2 was averaging ~38GB/sec and for the 50:50 read:write workload ~32GB/sec.

On the other hand, the LDQ workload appears to be entirely sequential read-only, but the report shows that it is made up of two workloads, one using 1MB data transfers and the other using 64KB data transfers, with various numbers of streams fired up to generate stress. The surprising item for K2’s LDQ run is that it did much better on the 64KB data streams than the 1MB data streams, an average of 41GB/sec vs. 32GB/sec. This probably says something about an internal flash data transfer bottleneck at large data transfers someplace in the architecture.

The VOD workload also appears to be sequential, read-only. The report doesn’t indicate a data transfer size, but K2’s actual results, averaging ~31GB/sec, would seem to indicate it was on the order of 1MB.

So what we can tell is that K2’s SSD write throughput is worse than reads (~1/3rd worse) and relatively smaller sequential reads are better than relatively larger sequential reads (~1/4 better). But I must add that even at the relatively “slower write throughput”, the K2 would still have beaten the next best disk-only storage system by ~10GB/sec.
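The fractions quoted above can be checked with a few lines of arithmetic. This is only a sketch using the approximate GB/sec figures cited in this post; the variable names are mine and do not come from the SPC-2 report.

```python
# Approximate K2 throughput figures (GB/sec) as quoted in this post.
lfp = {"write": 26.0, "read": 38.0, "mixed_50_50": 32.0}
ldq = {"64KB": 41.0, "1MB": 32.0}

# Simple mean of the three quoted LFP figures.
lfp_average = sum(lfp.values()) / len(lfp)

# How much worse writes are than reads, as a fraction of read throughput.
write_penalty = (lfp["read"] - lfp["write"]) / lfp["read"]

# How much better 64KB reads are than 1MB reads, as a fraction of the 1MB rate.
small_read_gain = (ldq["64KB"] - ldq["1MB"]) / ldq["1MB"]

print(f"LFP average:     {lfp_average:.1f} GB/sec")
print(f"Write penalty:   {write_penalty:.0%}")
print(f"Small-read gain: {small_read_gain:.0%}")
```

Writes come out roughly 32% slower than reads (about a third), and 64KB reads roughly 28% faster than 1MB reads (about a quarter).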

Where are the other all-flash SPC-2 benchmarks?

Prior to K2 there was only one other all-flash system (TMS RamSan-630) submission for SPC-2. I suspect that writing 26 GB/sec. to an all-flash system would be hazardous to its health and maybe other all-flash storage system vendors don’t want to encourage this type of activity.

Just for the record, the K2 SPC-2 result has been submitted for “review” (as of 18Mar2014) and may be modified before it is finally “accepted”. However, the review process typically doesn’t impact performance results as much as other report items. So, officially, we will need to await final acceptance before we can truly believe these numbers.



The complete SPC performance report went out in SCI’s February 2014 newsletter. But a copy of the report will be posted on our dispatches page sometime next quarter (if all goes well). However, you can get the latest storage performance analysis now and subscribe to future free newsletters by just using the signup form above right.

Even more performance information and OLTP, Email and Throughput ChampionCharts for Enterprise, Mid-range and SMB class storage systems are also available in SCI’s SAN Buying Guide, available for purchase from our website.

As always, we welcome any suggestions or comments on how to improve our SPC performance reports or any of our other storage performance analyses.

Posted in Block Storage, Disk storage, FC, LDQ, LFP, MPBS, SPC-2, SSD storage, Storage, Storage performance, Storage Performance Council, VOD

Securing synch & share data-at-rest


Snowden at SXSW said last week that it’s up to the vendors to encrypt customer data. I think he was talking mostly about data-in-flight, but there’s just as big an exposure for data-at-rest, maybe more so because then all the data is available at one sitting.

iMessage security

A couple of weeks ago there was a TechCrunch article (see Apple Explains Exactly How Secure iMessage Really Is or see the Apple IOS Security document) about Apple’s iMessage security.

The documents said that Apple iMessage uses public key encryption, where every iOS/OS X device generates two pairs of public and private keys (one for encrypting messages and one for signing), which are used to protect the data while it is transmitted through Apple’s iMessage service. The iMessage app running on the sending device encrypts the data with every destination device’s public key before it’s saved on the iMessage server cloud; each destination device can then decrypt it with its private key whenever it receives the message.

It’s a bit more complex for longer messages and attachments, but the gist is that this data is encrypted with a random key at the device and is saved in encrypted form while residing on iMessage servers. This random key and the attachment’s URI are then encrypted with the destination devices’ public keys and stored on the iMessage servers. Once the destination device retrieves the message with an attachment, it has the location and the random key to decrypt the attachment.

According to Apple’s documentation, when you start an iMessage you identify the recipient; the app retrieves the public keys for all of the recipient’s devices, and then it encrypts the message (with each destination device’s public message key) and signs the message (with the originating device’s private signing key). This way Apple’s servers never see the plain text message and never hold the decryption keys.
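To make the attachment flow concrete, here is a toy sketch of the idea: encrypt once with a random key, wrap that key for each destination device, and sign with the sender's key. Everything in it is an assumption for illustration. It uses unpadded textbook RSA over Mersenne primes and a SHA-256 keystream as a stand-in for AES, which is not what Apple does and is not secure; the device names and the `server` dictionary are made up.

```python
import hashlib
import secrets

# TOY ONLY: textbook RSA over Mersenne primes, with no padding. Not secure.
E = 65537

def make_keypair(p, q):
    """Build a (public, private) textbook-RSA key pair from primes p and q."""
    n = p * q
    d = pow(E, -1, (p - 1) * (q - 1))  # modular inverse (Python 3.8+)
    return (n, E), (n, d)

def xor_stream(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher (SHA-256 counter keystream); the same call decrypts."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[offset:offset + 32], pad))
    return bytes(out)

def digest_int(data: bytes, n: int) -> int:
    """SHA-256 of data, reduced so it fits under the modulus n."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % n

# Hypothetical keys: the sender's signing pair and two recipient devices.
sender_pub, sender_priv = make_keypair(2**521 - 1, 2**607 - 1)
devices = {"phone": make_keypair(2**61 - 1, 2**127 - 1),
           "laptop": make_keypair(2**89 - 1, 2**107 - 1)}

# Sender side: encrypt once with a random key, wrap that key per device, sign.
attachment = b"quarterly numbers, do not forward"
content_key = secrets.token_bytes(16)
ciphertext = xor_stream(content_key, attachment)
wrapped = {name: pow(int.from_bytes(content_key, "big"), pub[1], pub[0])
           for name, (pub, _priv) in devices.items()}
signature = pow(digest_int(ciphertext, sender_priv[0]), sender_priv[1], sender_priv[0])

# What the server stores: ciphertext, wrapped keys and a signature, no plaintext.
server = {"ciphertext": ciphertext, "wrapped": wrapped, "signature": signature}

def receive(name):
    """Recipient side: verify the sender's signature, unwrap the key, decrypt."""
    n, d = devices[name][1]
    expected = digest_int(server["ciphertext"], sender_pub[0])
    assert pow(server["signature"], sender_pub[1], sender_pub[0]) == expected
    key = pow(server["wrapped"][name], d, n).to_bytes(16, "big")
    return xor_stream(key, server["ciphertext"])

print(receive("phone") == attachment, receive("laptop") == attachment)
```

The point of the sketch is what the server ends up holding: only ciphertext, per-device wrapped keys and a signature. Without a device's private key, none of it can be turned back into the attachment.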

Synch & share data security today

As mentioned in prior posts, I am now a Dropbox user and utilize this service to synch various iOS and OS X device file data, which means a copy of all this synch data is sitting on Dropbox (AWS S3) servers someplace (possibly multiple places) in the cloud.

Dropbox data-at-rest security is explained in their How secure is Dropbox document. Essentially they use SSL for data-in-flight security and AES-256 encryption with a random key for data-at-rest security.

This probably makes it easier to support multiple devices and perhaps data sharing, because Dropbox only needs to encrypt/save the data once and can decrypt the data on its servers before sending it through (SSL encrypted, of course) to other devices.

The only problem is that Dropbox holds all the encryption keys for all the data that sits on its servers. I (and possibly the rest of the tech community) would much prefer that the data be encrypted at the customer’s devices and never decrypted again except at other customer devices. This would be true end-to-end data security for sync&share.

As far as I know from a data-at-rest security perspective Box looks about the same, so does EMC’s Syncplicity, Oxygen Cloud, and probably all the others. There are some subtle differences about how and where the keys are kept and how many security domains exist in each service, but in the end, the service holds the keys to all data that is encrypted on their storage cloud.

Public key cryptography to the rescue

I think we could do better and public key cryptography should show us the way. I suppose it would probably be easiest to follow the iMessage approach and just encrypt all the data with each device’s public key at the time you create/update the data and send it to the service but,

  • That would further delay the transfer of new and updated data to the synch service, also further delaying its availability at other devices linked to the login.
  • That would cause the storage requirement for your sync&share data to be multiplied by the number of devices you wish to synch with.
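The two drawbacks in the list above are easy to put numbers on. The sketch below compares storing one encrypted copy per device against storing a single encrypted copy plus a small wrapped key per device. The 10GB and 4-device figures and the 256-byte wrapped key size are made-up assumptions.

```python
def per_device_copies(data_gb: float, devices: int) -> dict:
    """Every device gets its own copy, encrypted with its own public key."""
    return {"stored_gb": data_gb * devices, "uploaded_gb": data_gb * devices}

def shared_key(data_gb: float, devices: int, wrapped_key_bytes: int = 256) -> dict:
    """One encrypted copy, plus one small wrapped key per device."""
    overhead_gb = devices * wrapped_key_bytes / 1e9
    return {"stored_gb": data_gb + overhead_gb, "uploaded_gb": data_gb + overhead_gb}

# Hypothetical user: 10GB of synch data across 4 devices.
copies = per_device_copies(10, 4)
single = shared_key(10, 4)
print(f"per-device copies: {copies['stored_gb']:.0f} GB stored and uploaded")
print(f"single shared key: {single['stored_gb']:.6f} GB stored and uploaded")
```

With per-device copies, both storage and upload traffic grow linearly with the number of devices, while the shared-key overhead is a few hundred bytes per device.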

Synch data-at-rest security

If we take on the synch side of the discussion first, maybe it would be easiest. For example, if new public and private key pairs for encryption and signing were assigned to each new device at login to the service, then the service could retain a directory of the devices’ public keys for data encryption and signing.

The first device to log in to a synch service with a new user-id would assign a single encryption key for all data to be shared by all devices that could use this login. As other devices log into the service, the primary device sends the single service encryption key, encrypted using the target device’s public key, and signs the message with its own private signing key. Actually, any device in the service ring could do this, but the primary device could be used to authenticate the new device’s login credentials. Each device’s synch service would have a list of all the public keys for all the devices in the “synch” region.

As data is created or updated, two segments are created for each file: the AES-256 encrypted data package, using the “synch” region’s random encryption key, and the signature package, signed by the device doing the creation/update of the file. Any device could authenticate the signature package at the time it receives a file, as could the service. But ONLY the devices with the AES-256 encryption key would have access to the plain text version of the data.
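Here is a toy walk-through of that proposal: a primary device creates the region key, hands it to a joining device wrapped in that device's public key, and every file becomes an encrypted segment plus a signature segment. All of it is a sketch under loud assumptions. It uses unpadded textbook RSA on Mersenne primes and an XOR keystream in place of AES-256, none of which is secure, and the device names and `directory` structure are hypothetical.

```python
import hashlib
import secrets

# TOY ONLY: unpadded textbook RSA on Mersenne primes. Insecure; for illustration.
E = 65537

def keypair(p, q):
    """Return (public, private) keys as (n, e) and (n, d)."""
    n = p * q
    return (n, E), (n, pow(E, -1, (p - 1) * (q - 1)))  # needs Python 3.8+

def digest(data: bytes, n: int) -> int:
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % n

def sign(priv, data: bytes) -> int:
    n, d = priv
    return pow(digest(data, n), d, n)

def verify(pub, data: bytes, sig: int) -> bool:
    n, e = pub
    return pow(sig, e, n) == digest(data, n)

def cipher(key: bytes, data: bytes) -> bytes:
    """Toy XOR keystream standing in for AES-256; the same call decrypts."""
    stream = b"".join(hashlib.sha256(key + i.to_bytes(4, "big")).digest()
                      for i in range(len(data) // 32 + 1))
    return bytes(a ^ b for a, b in zip(data, stream))

# Hypothetical devices. For brevity the primary only gets a signing pair and
# the tablet only an encryption pair; in the scheme above each device has both.
primary_sign_pub, primary_sign_priv = keypair(2**89 - 1, 2**127 - 1)
tablet_enc_pub, tablet_enc_priv = keypair(2**521 - 1, 2**607 - 1)
directory = {"primary": primary_sign_pub, "tablet": tablet_enc_pub}  # held by the service

# 1. The first device to log in creates the region's random 256-bit key.
region_key = secrets.token_bytes(32)

# 2. When the tablet joins, the primary wraps the key for it and signs the result.
tablet_n, tablet_e = directory["tablet"]
wrapped = pow(int.from_bytes(region_key, "big"), tablet_e, tablet_n)
join_msg = wrapped.to_bytes((wrapped.bit_length() + 7) // 8, "big")
join_sig = sign(primary_sign_priv, join_msg)

# 3. The tablet checks the signature, then unwraps the key with its private key.
joined = verify(directory["primary"], join_msg, join_sig)
n, d = tablet_enc_priv
tablet_key = pow(int.from_bytes(join_msg, "big"), d, n).to_bytes(32, "big")

# 4. Each file becomes two segments: encrypted data plus a signature.
body = cipher(region_key, b"meeting notes")
body_sig = sign(primary_sign_priv, body)

# The service has no region key: it can authenticate the package but not read it.
service_accepts = verify(directory["primary"], body, body_sig)
# A device in the region can also decrypt.
plaintext = cipher(tablet_key, body)
```

In this sketch the service only ever sees `directory`, `join_msg`, `join_sig`, `body` and `body_sig`. The region key exists in the clear only on the devices.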

There are some potential holes in this process. First, the service could still intercept the random encryption key at the primary device when it’s created, or could retrieve it anytime later at its leisure using the app running on the device. This same exposure exists for the iMessage app running on iOS/OS X devices; the private keys in this instance could be sent to another party at any time. We would need to depend on service guarantees to not do this.

Share data-at-rest security

For Apple’s iMessage attachment security the data is kept in the cloud encrypted by a random key but the key and the URI are sent to the devices when they receive the original message. I suppose this could just as easily work for a file share service but the sharing activity might require a share service app running in the target device to create public-private key pairs and access the file.

Yes this leaves any “shared” data keys being held by the service but it can’t be helped. The data is being shared with others so maybe having it be a little more accessible to prying eyes would be acceptable.


I still prefer the iMessage approach, having multiple copies of the shared data, each encrypted with a different device’s public key. It’s simpler this way, a bit more verifiable, and doesn’t need as much out-of-channel communication (to send keys to other devices).

Yes, it would cost more to store any amount of data and would take longer to transmit, but I feel we would all be willing to accept these extra constraints as long as the service guaranteed that private keys were only kept on devices that have logged into the service.

Data-at-rest and -in-flight security is becoming more important these days. Especially since Snowden’s exposure of what’s happening to web data. I love the great convenience of sync&share services, I just wish that the encryption keys weren’t so vulnerable…


Photo Credits: Prison Planet by AZRainman

Posted in Data, Data security, data services, Distributed computing, File Storage, Information economy

Two dimensional magnetic recording (TDMR)

A head assembly on a Seagate disk drive by Robert Scoble (cc) (from flickr)


I attended a Rocky Mountain IEEE Magnetics Society meeting a couple of weeks ago where Jonathan Coker, HGST’s Chief Architect and an IEEE Magnetics Society Distinguished Lecturer, was discussing HGST’s research into TDMR heads.

It seems that disk track density is getting so high, track pitch is becoming so small, that the magnetic read heads have become wider than the actual data track width.  Because of this, read heads are starting to pick up more inter-track noise and it’s getting more difficult to obtain a decent signal to noise ratio (SNR) off of a high-density disk platter with a single read head.

TDMR read heads can be used to counteract this extraneous noise by using multiple read heads per data track and as such, help to create a better signal to noise ratio during read back.

What are TDMR heads?

TDMR heads are any configuration of multiple read heads used in reading a single data track. There seemed to be two popular configurations of HGST’s TDMR heads:

  • In-series, where one head is directly behind another head. This provides double the signal for the same (relative) amount of random (electronic) noise.
  • In-parallel (side by side), where three heads were configured in-parallel across the data track and the two inter-track bands. That is, one head was configured directly over the data track with portions spanning the inter-track gap to each side, one head was half way across the data track and the next higher track, and a third head was placed half way across the data track and the next lower track.

At first, the in-series configuration seemed to make the most sense to me. You could conceivably average the two signals coming off the heads and be able to filter out the random noise.  However, the “random noise” seemed to be mostly coming from the inter-track zone and this wasn’t as much random electronics noise as random magnetic noise, coming off of the disk platter, between the data tracks.
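A quick simulation shows why averaging two in-series heads helps with electronic noise but not with inter-track noise. This is a sketch under my own assumptions: Gaussian noise, arbitrary noise levels, and a model where both heads see exactly the same magnetic noise from the platter but independent electronic noise. The constant data signal is left out because it does not change the noise spread.

```python
import random
import statistics

random.seed(42)      # fixed seed so repeated runs give the same numbers
SAMPLES = 50_000

def noise_reduction(shared_sd: float, private_sd: float) -> float:
    """Ratio of single-head noise to two-head-average noise (simulated)."""
    single, averaged = [], []
    for _ in range(SAMPLES):
        shared = random.gauss(0.0, shared_sd)       # inter-track noise seen by both heads
        a = shared + random.gauss(0.0, private_sd)  # head 1 adds its own electronic noise
        b = shared + random.gauss(0.0, private_sd)  # head 2 adds its own electronic noise
        single.append(a)
        averaged.append((a + b) / 2)
    return statistics.pstdev(single) / statistics.pstdev(averaged)

electronic_only = noise_reduction(shared_sd=0.0, private_sd=0.5)
mostly_magnetic = noise_reduction(shared_sd=0.5, private_sd=0.1)
print(f"electronic noise only: {electronic_only:.2f}x less noise")
print(f"mostly magnetic noise: {mostly_magnetic:.2f}x less noise")
```

When the noise is purely electronic, averaging cuts it by about the square root of two. When most of it comes off the platter and is common to both heads, the gain all but disappears.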

In-parallel wins the SNR race

So, much of the discussion was on the in-parallel configuration. The researcher had a number of simulated magnetic recordings which were then read by simulated, in parallel, tripartite read heads.  The idea here was that the information read from each of the side band heads that included inter-track noise could be used as noise information to filter the middle head’s data track reading. In this way they could effectively increase the SNR across the three signals, and thus, get a better data signal from the data track.
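The filtering idea can be sketched with a toy linear model. The coupling numbers below are invented for illustration and are not HGST's: the centre head picks up 30% of each inter-track band, and each side head reads half data track and half inter-track band. Under those assumptions the side heads' readings can be subtracted from the centre reading to cancel the magnetic noise.

```python
import random
import statistics

random.seed(7)
SAMPLES = 20_000
LEAK = 0.3         # hypothetical share of each inter-track band seen by the centre head
SIDE_SHARE = 0.5   # hypothetical: side heads sit half on the track, half on the band

def electronic():
    """Small independent electronic noise added by each head."""
    return random.gauss(0.0, 0.05)

centre_errors, combined_errors = [], []
for _ in range(SAMPLES):
    signal = random.choice((-1.0, 1.0))   # the recorded bit
    upper = random.gauss(0.0, 1.0)        # magnetic noise in the upper band
    lower = random.gauss(0.0, 1.0)        # magnetic noise in the lower band

    centre = signal + LEAK * (upper + lower) + electronic()
    side_up = SIDE_SHARE * signal + (1 - SIDE_SHARE) * upper + electronic()
    side_down = SIDE_SHARE * signal + (1 - SIDE_SHARE) * lower + electronic()

    # Scale the side readings so their noise matches what leaks into the centre
    # head, subtract, then rescale the data signal that is left over.
    k = LEAK / (1 - SIDE_SHARE)
    estimate = (centre - k * (side_up + side_down)) / (1 - 2 * k * SIDE_SHARE)

    centre_errors.append(centre - signal)
    combined_errors.append(estimate - signal)

centre_rms = statistics.pstdev(centre_errors)
combined_rms = statistics.pstdev(combined_errors)
print(f"centre head alone: {centre_rms:.3f}  three heads combined: {combined_rms:.3f}")
```

The subtraction removes the inter-track noise but amplifies the electronic noise, so the trick only pays off when the magnetic noise dominates, which is the situation described above.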

Originally, TDMR was going to be the technology that was needed to get the disk industry to 100Tb/sqin. But, what they are finding at HGST and elsewhere, is even today, at “only” ~5Tb/sqin (HGST helium drives), there seems to be an increasing need to help reduce noise coming from read heads.

Disk density increase has been slowing lately but is still on a march to double density every 2 years or so. As such, a 1TB platter today will be a 2TB platter in 2 years and a 4TB platter in 4 years, etc. TDMR heads may be just the thing that gets the industry to that 4TB platter (20Tb/sqin) in 4 years.
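The doubling arithmetic is simple enough to sketch, taking the post's own figures (doubling every 2 years, ~5Tb/sqin today) as given. The function names are mine.

```python
import math

def density_after(years: float, current: float, doubling_years: float = 2.0) -> float:
    """Density (or capacity) after some years of steady doubling."""
    return current * 2 ** (years / doubling_years)

def years_to_reach(target: float, current: float, doubling_years: float = 2.0) -> float:
    """Years needed to grow from current to target at the same doubling rate."""
    return doubling_years * math.log2(target / current)

print(density_after(4, 1.0))              # 4.0 (TB per platter, starting from 1TB)
print(density_after(4, 5.0))              # 20.0 (Tb/sqin, starting from 5)
print(round(years_to_reach(100, 5), 1))   # 8.6 (years from 5 to 100 Tb/sqin)
```

At that pace, 100Tb/sqin is a bit more than four doublings away, or somewhat under nine years.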

The only problem is what’s going to get them to 100Tb/sqin now?



Posted in Data density, Disk storage, Scenario planning, Storage, System effectiveness

Cloud based database startups are heating up

IBM recently agreed to purchase Cloudant, an online database service using a NoSQL database called CouchDB. Apparently this is an attempt by IBM to take on Amazon and others that support cloud based services using a NoSQL database backend to store massive amounts of data.

In other news, Dassault Systems, a provider of 3D and other design tools, has invested $14.2M in NuoDB, a cloud-based NewSQL compliant database service provider. Apparently Dassault intends to start offering its design software as a service, using NuoDB as a backend database.

We have discussed NewSQL and NoSQL databases before (see NewSQL and the curse of old SQL databases post) and there are plenty available today. So, why the sudden interest in cloud based database services? I tend to think there are a couple of different trends playing out here.

IBM playing catchup

In the IBM case, there’s just so much data going to the cloud these days that IBM has to have a hand in it if it wants to continue to be a major IT service organization. Amazon and others are blazing this trail and IBM has to get on board or be left behind.

The NoSQL, or non-relational, database model allows for different types of data structuring than the standard tables/rows of traditional RDBMS databases. Specifically, NoSQL databases are very useful for data that can be organized in a tree (directed graph), graph (non-directed graph?) or key=value pairs. This latter item is very useful for Hadoop, MapReduce and other big data analytics applications. Doing this in the cloud just makes sense, as the data can be both gathered and analyzed in the cloud without having anything more than the results of the analysis sent back to a requesting party.

IBM doesn’t necessarily need a SQL database as it already has DB2. IBM already has a cloud-based DB2 service that can be implemented by public or private cloud organizations.  But they have no cloud based NoSQL service today and having one today can make a lot of sense if IBM wants to branch out to more cloud service offerings.

Dassault is broadening their market

As for the cloud based NuoDB NewSQL database, not all data fits the tree, graph, key=value pair structuring of NoSQL databases. Many traditional applications that use databases today revolve around SQL services and would be hard pressed to move off RDBMS.

Also, one ongoing problem with NoSQL databases is that they don’t really support ACID transaction processing and as such, often compromise on data consistency in order to support highly parallelizable activities. In contrast, a SQL database supports rigid transaction consistency and is just the thing for moving something like a traditional OLTP processing application to the cloud.
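For readers who haven't run into it, here is what that rigid transaction consistency looks like in practice. The sketch uses SQLite from Python's standard library purely as a convenient stand-in for an ACID SQL database. It says nothing about how NuoDB works, and the table and account names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts ("
             "name TEXT PRIMARY KEY, "
             "balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money atomically: both updates are committed, or neither is."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))

failed = transfer(conn, "alice", "bob", 500)   # would overdraw alice
after_failure = balances(conn)
ok = transfer(conn, "alice", "bob", 30)
after_success = balances(conn)
print(failed, after_failure)
print(ok, after_success)
```

The failed transfer credits bob first and then trips the balance check on alice, yet bob's credit is rolled back with it. That all-or-nothing behavior is what an OLTP application counts on.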

I would guess that how NuoDB handles the high throughput needed by its cloud service partners, while still providing ACID transaction consistency, is part of its secret sauce.

What’s behind at least some of this interest may just be the internet of things (IoT)

The other thing that seems to be driving a lot of the interest in cloud based databases is the IoT. As more and more devices become internet connected, they will start to generate massive amounts of data. The only way to capture and analyze this data effectively today is with NoSQL and NewSQL database services. By hosting these services in the cloud, analyzing/processing/reporting on this tsunami of data becomes much, much easier.

Storing and analyzing all this IoT data should make for an interesting decade or so as the internet of things gets built out across the world. Cisco’s CEO, John Chambers, recently said that the IoT market will be worth $19T and will have 50B internet connected devices by 2020. Seems a bit of a stretch, seeing as how Cisco itself estimated (June 2013) that 10B devices would be attached to the internet by the middle of last year, but who am I to disagree.
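To see how much of a stretch, one can work out the growth rate those two device counts imply. This is only a back-of-the-envelope sketch. I am assuming about 6.5 years between mid-2013 and 2020, and the function name is mine.

```python
def implied_annual_growth(start: float, end: float, years: float) -> float:
    """Compound annual growth rate needed to get from start to end."""
    return (end / start) ** (1 / years) - 1

growth = implied_annual_growth(10e9, 50e9, 6.5)
print(f"{growth:.0%} per year")
```

That works out to roughly 28% more connected devices every year, sustained for six and a half years.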

There’s much more to be written about the IoT and its impact on data storage, but that will need to wait for another time… stay tuned.


Photo Credit(s): database 2 by Tim Morgan 


Posted in Cloud services, data access, Data analytics, Information economy, System effectiveness

Holograms, not just for storage anymore

A recent article I read (Holograms put storage capacity in a spin) discusses a novel approach to holographic data storage, this time using magnetic spin waves to encode holographic information on magnetic memory.

It turns out holograms can be made with any wave-like phenomenon, and optical holograms aren’t the only way to go. Magnetic (spin?) waves can also be used to create and read holograms.

These holograms are made in magnetic semiconductor material rather than photographic material. And because magnetic spin waves have much shorter wavelengths than light, there is the potential for even greater densities than corresponding optical holographic storage.

A new memory emerges

The device is called a Magnonic Holographic Memory and it seems to work by applying spin waves through a magnetic substrate and reading (sensing) the resulting interference patterns below the device.

According to the paper, the device is theoretically capable of reading the magnetic (spin) state of hundreds of thousands of nano-magnetic bits in parallel. (Let’s see, at 8 bits per byte that would be tens of KB, up to about 100KB, of information in parallel.) I would guess this must have something to do with the holographic nature of the read out.

I haven’t the foggiest notion how all this works, but it seems to be a fallout of some earlier spintronics work the researchers were doing. The paper showed a set of three holograms read out of a grid. And the prototype device seems to require a grid (almost core-like) of magnetic material on top of the substrate, which is the write head. It’s not clear if there was a duplicate of this grid below the material to read the spin waves, but something had to be there.

The researchers indicated some future directions to shrink the device, primarily by shrinking what appears to be the write head and maybe the read heads even further. It’s also not clear what the magnetic substrate being read/written is, or whether it can be shrunk any further.

The researchers said that although spin wave holographics cannot compete with optical holographic storage in terms of propagation delays and seem to be noisier, spin wave holographics do appear to be much more appropriate for nm-scale direct integration with electronic circuits.

Is this a new generation of solid state storage?

Photo Credits: Spinning Top by RusselStreet

Posted in Holographic storage, Optical storage, SSD storage

Taking wind power to new “heights”

Not sure how I found this (I think Reddit technology), but an OffGrid World article describes a new wind tower configuration that looks very odd but seems to work up to 6X(?!) better than most others that are deployed today. The new wind tower is being built by SheerWind.

The tower has what looks to be a set of funnels around the top (check out the picture) that capture the wind and vent it downward into a long duct out along the ground.

Inside the ground duct lies a Venturi effect tube section which constricts and speeds up the wind coming in from the top. After the Venturi effect component lies a small wind turbine power unit.

By placing the wind turbine on the ground, it becomes a lot more accessible. Also, the wind turbine blades can be much smaller due to the increase in wind speed. Finally, the tower is not nearly as tall as current wind turbine towers.

SheerWind says that the new wind tower can generate electricity from wind speeds as little as 2MPH and will generate, on average, 3.14X more electricity from the same wind power as standard turbines do. These statistics are from some field data they published from their testing.

Still, the thing is huge; the down duct exit funnel has a diameter of twice the height of the man standing next to it.

A couple of potential improvements

Here are a few thoughts that came to me on some improvements to the tower configuration. One thing I noticed in the field results data is that the turbine speed (wind speed?) seems to be somewhat faster in certain directions than others at the same intake funnel wind speed.  From this I would surmise that wind flowing in the direction of the ground duct works better than wind in the opposite direction.

As such, I would suggest that they do away with the long ground duct altogether and just place the Venturi section and the wind turbine somewhere in the down ducting. This would eliminate one curve, which should at least boost effective wind speed at the turbine and should eliminate any direction sensitivity in the turbine speeds.

Below the turbine I would have a sort of reverse funnel with a cone at the bottom of it to push the air out along the ground in all directions.

Also, as the wind speed ratio (wind speed at the turbine relative to incoming wind speed) averages out to be a factor of 1.8, I would think a second turbine downstream from the first, with perhaps two blades, could extract some more useful power from the air stream before it’s dumped into the atmosphere.
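That 1.8 speed ratio matters more than it might seem, because the power carried by moving air goes up with the cube of its speed. The sketch below uses the standard kinetic power formula for air passing through a rotor disc. It ignores duct losses and turbine efficiency limits, and the intake still has to gather a correspondingly larger area of wind, so treat it as rough arithmetic rather than a model of SheerWind's tower.

```python
import math

AIR_DENSITY = 1.225  # kg/m^3, standard sea-level air

def wind_power_watts(speed_ms: float, rotor_diameter_m: float) -> float:
    """Kinetic power of the air passing through the rotor disc: 0.5 * rho * A * v^3."""
    area = math.pi * (rotor_diameter_m / 2) ** 2
    return 0.5 * AIR_DENSITY * area * speed_ms ** 3

speed_ratio = 1.8                             # turbine speed / intake speed, from the field data
power_ratio = speed_ratio ** 3                # same rotor, faster air
diameter_ratio = 1 / math.sqrt(power_ratio)   # rotor size needed for the same power

print(f"power through the same rotor: {power_ratio:.2f}x")
print(f"rotor diameter for equal power: {diameter_ratio:.0%} of the original")
```

A 1.8X speed-up means about 5.8X the power through the same rotor, which is in the neighborhood of the "up to 6X" figure, and it would let a rotor of roughly 41% of the diameter see the same power. That fits the smaller blades mentioned above.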

Finally, as wind speed is often different depending on the height off the ground, I might consider lifting or lowering the top of the tower (the funnel section) to supply the optimum wind speed available. You’d need the down duct and its support superstructure to be expandable or contractable, and you would want to lessen the weight of the flexible part of the top of the tower. You could do this with an array of electronic wind sensors and servo control logic and motors, which would take wind speed samples at various heights and cause the servo motors to raise or lower the tower height. But I believe letting the wind power passively move the top up or down would be more effective in the long run and potentially cheaper to boot. How this would work I have no idea.

Also, I would be interested to understand the exit wind speed after all the above; there may still be some energy to be gained from the airstream.


Photo Credit(s): From SheerWind.com’s Website, © 2014 SheerWind

Posted in Energy efficiency, Strategic Inflection Points, System effectiveness

Data of the world, lay down your chains

Prison Planet by AZRainman (cc) (from Flickr)


GitHub, that free repository of open source software, is taking on a new role, this time as a repository for municipal data sets. At least that’s what a recent article on the Atlantic.com website reported (see Catch my Diff: GitHub’s New Feature Means Big Things for Open Data) after GitHub announced new changes in its .GeoJSON support (see Diffable, more customizable maps).

The article talks about the fact that maps in GitHub (using .GeoJSON data) can now be DIFFed, that is, you can see at a glance what changes have been made to them. In the one example in the article (easier to see in GitHub), you can see how one Chicago congressional district has changed over time.

Unbeknownst to me, GitHub had started becoming a repository for geographical data. That is, any .GeoJSON data file can now be saved in a repository on GitHub and can be rendered as a map using desktop or web based tools. With the latest changes at GitHub, one can now see changes that are made to a .GeoJSON file as two or more views of a map or properties of map elements.
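Underneath the map rendering, a .GeoJSON file is just text, so an ordinary text diff already shows what changed. The sketch below builds two versions of a made-up district (the name, coordinates and populations are all invented) and diffs them with Python's standard library. It illustrates the idea only and is not how GitHub renders its visual map diffs.

```python
import difflib
import json

def feature_collection(population: int) -> dict:
    """A minimal GeoJSON FeatureCollection holding one made-up polygon."""
    return {
        "type": "FeatureCollection",
        "features": [{
            "type": "Feature",
            "properties": {"name": "Example District", "population": population},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[-87.70, 41.80], [-87.60, 41.80],
                                 [-87.60, 41.90], [-87.70, 41.90],
                                 [-87.70, 41.80]]],
            },
        }],
    }

def as_lines(doc: dict) -> list:
    """Serialize with stable key order so only real changes show up in the diff."""
    return json.dumps(doc, indent=2, sort_keys=True).splitlines()

diff = list(difflib.unified_diff(as_lines(feature_collection(700000)),
                                 as_lines(feature_collection(712000)),
                                 fromfile="before.geojson", tofile="after.geojson",
                                 lineterm=""))
print("\n".join(diff))
```

Only the population line shows up as removed and re-added. A changed boundary would show up the same way, as changed coordinate lines.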

Of course all the other things one can do with GitHub repositories are also available, such as FORK, PULL, PUSH, etc. All this functionality was developed to support software coding but can apply equally well to .GeoJSON data files, because .GeoJSON data files are plain text that looks just like source code (really structured JSON, much like .XML, but close enough).

So why maps as source code data?

Municipalities have started to use GitHub to host their Open Data initiatives. For example Digital Chicago has started converting some of their internal datasets into .GeoJSON data files and loading them up on GitHub for anyone to see, fork, modify, etc.

I was easily able to log in and fork one of the data sets. But there’s a little matter of pushing your committed changes to the project owner that needs to happen before you can modify the original dataset.

Also, I was able to render the .GeoJSON data into a viewable map by just clicking on a commit file (I suppose this is a web service). The README file has instructions for doing this on your desktop outside of a web browser for R, Ruby and Python.

In any case, having the data online, editable and committable would allow anyone with a GitHub account to augment the data to make it better and more comprehensive. Of course, with the data now online, any application could make use of it to offer services based on the data.

I guess that’s what the Open Data movement is all about: make previously proprietary government data freely available in a standardized format, and add tools to view and modify it, in the hope that businesses see a way to make use of it in new ways. As such, the data should become more visible and more useful to the world and the cities that are supporting it.

If you want to learn more about Project Open Data see the blog post from last year on Whitehouse.gov or the GitHub Project [wiki] pages.


Posted in Crowdsourcing, Data availability, Data readability, data services, Information economy