Scratch file use in HPC @ORNL, a statistical analysis

Attended SC17 (the Supercomputing Conference) this past week and received a copy of the accompanying research proceedings. There are a number of interesting papers in the proceedings, and one that caught my eye was Scientific User Behavior and Data Sharing Trends in a Petascale File System by Seung-Hwan Lim, et al. of Oak Ridge National Laboratory (ORNL), which examines how files are used at the Oak Ridge Leadership Computing Facility (OLCF).

The paper statistically describes the use of scratch files in a multi-PB Lustre file system at OLCF from January 2015 to August 2016. The OLCF supports over 32PB of storage with a peak aggregate bandwidth of over 1TB/s. Spider II (the current Lustre file system) consists of 288 Lustre Object Storage Servers, all interconnected and connected to all of the supercomputing clusters via an InfiniBand network. Spider II supports all scratch storage requirements for active/queued jobs on Titan (#4 on the Top 500 list of supercomputer clusters worldwide) and the other clusters at ORNL.

ORNL uses an HPSS (High Performance Storage System) archive for permanent storage, but uses the Spider II file system for all scratch files generated and used during supercomputing application runs. ORNL expects Spider III (2018-2023) to host 10 billion files.

Scratch files are purged from Spider II after 90 days of no access. The paper is based on metadata captured during the scratch purging process over 500 days of activity.
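Purging by last access time is simple enough to sketch. The following is only a minimal illustration of the policy described above, not OLCF's actual Lustre purge tooling (which scans file system metadata at far larger scale); the scratch path is hypothetical.

```python
import os
import time

PURGE_AGE_DAYS = 90
SCRATCH_ROOT = "/lustre/scratch"   # hypothetical scratch mount point

def purge_candidates(root: str, age_days: int = PURGE_AGE_DAYS):
    """Yield paths whose last access time is older than age_days."""
    cutoff = time.time() - age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_atime < cutoff:
                    yield path
            except OSError:
                continue   # file disappeared or is unreadable; skip it

if __name__ == "__main__":
    for path in purge_candidates(SCRATCH_ROOT):
        print(path)
```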

The paper displays a number of statistics and metrics on the use of Spider II:

  • Less than 3% of projects have a directory depth >15; the maximum directory depth recorded was 432, but most projects have a shallow (<10) directory depth.
  • A project typically has 10X the files that an individual researcher has: the median file count per researcher is 2,000 files, while the median project has 20,000 files.
  • Storage system performance is actively managed by many projects. For instance, 20 out of 35 science domains manually managed their Lustre cluster configuration to improve throughput.
  • File count continues to grow and reached a peak of 1B files during the time being analyzed.
  • On average, only 3% of files were accessed read-only, 10% of files were updated (read-write), and 76% of files were untouched during any given week. However, median and maximum file age were 138 and 214 days respectively, which means these scratch files can continue to be accessed over the course of 200+ days. (A rough sketch of how such a weekly access classification might be computed from file timestamps follows this list.)
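Here is a minimal sketch of how a weekly read-only/updated/untouched classification could be derived from ordinary file timestamps. It is only a stand-in for the paper's analysis, which worked from Lustre purge metadata rather than a live tree walk; the path is hypothetical.

```python
import os
import time

WEEK_SECONDS = 7 * 86400

def classify_weekly_access(root: str):
    """Bucket files into read-only, updated, or untouched for the past week,
    based on atime/mtime."""
    window_start = time.time() - WEEK_SECONDS
    counts = {"readonly": 0, "updated": 0, "untouched": 0}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            try:
                st = os.stat(os.path.join(dirpath, name))
            except OSError:
                continue
            if st.st_mtime >= window_start:
                counts["updated"] += 1      # written (read-write) this week
            elif st.st_atime >= window_start:
                counts["readonly"] += 1     # read but not modified this week
            else:
                counts["untouched"] += 1    # no access at all this week
    total = sum(counts.values()) or 1
    return {k: (v, round(100.0 * v / total, 1)) for k, v in counts.items()}

print(classify_weekly_access("/lustre/scratch"))   # hypothetical path
```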

There was more information in the paper, but one item missing was any statistics on scratch file size distribution, which is a concern.

Nonetheless, it paints an interesting picture of scratch file use in today's HPC application/supercluster environments.

Comments?

(Storage QoM 16-001): Will we see NVM Express (NVMe) drives GA’d in enterprise storage over the next year

First, let me state that QoM stands for Question of the Month. Doing these forecasts can be a lot of work, and rather than focusing my whole blog on weekly forecast questions and answers, I would like to do something else as well. So, from now on we are doing only one new forecast a month.

So for the first question of 2016, we will forecast whether NVMe SSDs will be GA’d in enterprise storage over the next year.

NVM Express (NVMe) is the new PCIe interface for SSD storage. Wikipedia has a nice description of NVMe. As discussed there, NVMe was designed for the higher performance and enhanced parallelism that come with the PCI Express (PCIe) bus. The current version of the NVMe spec is 1.2a (available here).
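For the curious, on a Linux host with the nvme driver loaded you can see attached NVMe controllers under sysfs. A minimal sketch (Linux-specific, and assuming the usual /sys/class/nvme attribute files are present):

```python
import glob
import os

def list_nvme_controllers():
    """List NVMe controllers visible under Linux sysfs."""
    controllers = []
    for ctrl in sorted(glob.glob("/sys/class/nvme/nvme*")):
        info = {"name": os.path.basename(ctrl)}
        for attr in ("model", "serial", "firmware_rev"):
            path = os.path.join(ctrl, attr)
            if os.path.exists(path):
                with open(path) as f:
                    info[attr] = f.read().strip()
        controllers.append(info)
    return controllers

if __name__ == "__main__":
    for c in list_nvme_controllers():
        print(c)
```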

GA means generally available for purchase by any customer.

Enterprise storage systems refers to mid-range and enterprise class storage systems from major AND non-major storage vendors, which includes startups.

Over the next year means by 19 January 2017.

Special thanks to Kaycee Lai (@mrdedupe), Primary Data, for suggesting this month's question.

Current and updates to previous forecasts

 

Update on QoW 15-001 (3DX) forecast:

News out today indicates that 3DX (3D XPoint non-volatile memory) samples may be available soon, but it could take another 12 to 18 months to get it into production. 3DX manufacturing is more challenging than current planar NAND technology and uses about 100 new materials, many of which are currently single sourced. Our 3DX forecast already built in the potential for a 6-month delay in reaching production; the news above says the delay could be worse than expected. As such, I feel even more strongly that there is less of a possibility of 3DX shipping in storage systems by next December. So I would update my forecast for QoW 15-001 to NO with a 0.75 probability at this time.

So current forecasts for QoW 15-001 are:

A) YES with 0.85 probability; and

B) NO with 0.75 probability

Current QoW 15-002 (3D TLC) forecast

We have 3 active participants, current forecasts are:

A) Yes with 0.95 probability;

B) No with 0.53 probability; and

C) Yes with 1.0 probability

Current QoW 15-003 (SMR disk) forecast

We have 1 active participant, current forecast is:

A) Yes with 0.85 probability

 

Primary data’s path to better data storage presented at SFD8

A couple of weeks ago we met with Primary Data (Lance Smith, CEO; David Flynn, CTO; and Kaycee Lai, SVP Product & Sales), who were presenting at Storage Field Day 8 (SFD8, videos of their sessions available here). Primary Data emerged from stealth late last year and has ~$60M in funding. They also have Steve Wozniak (of Apple fame) as Chief Scientist, but he wasn't at the SFD8 session 🙁

Primary Data seems out to change the world. At first I thought this was just another form of storage virtualization but they are laser focused on data virtualization or what they call data mobility. It differs from pure storage virtualization by being outside the data path.  (I have written about data virtualization before as well as the data hypervisor a long time ago). Nowadays they seem to be using the tag line of data in motion.

Why move data?

David has a theory about the proliferation of storage startups: the spectrum between capacity and performance has grown immense over time, which has provided an opening for a number of companies to address these widening needs.

David believes that caching, whether in the storage system or in the servers, is an attempt to address this issue by “loaning” data from the storage silo to the cache, in effect supplying a lower cost $/IOPS for that data. Similar considerations are apparent at the other end of the spectrum, where customers use archive or backup services to take advantage of much cheaper $/GB storage.

However, given the difficulty of moving data around in present day storage environments, customer data has become essentially immobile. Primary Data is trying to bring about a data mobility revolution and allow data to move over this spectrum of performance and capacity of storage with ease. Doing so easily, will provide significant benefits as customers can more fully take advantage of the various levels of performance and capacity in their data center storage environments.

Primary Data architecture

Primary Data provides data mobility by using their metadata service, called the DataSphere appliance, and their client software running on host servers, called the Data Portal. Their offering is best explained in three layers:

  • Data virtualization layer – provides continuity of identity and continuity of access across multiple physical storage systems. That is the same data (identity continuity) can be accessed wherever it resides (access continuity) by server applications. Such access and identity must transcend access protocols and interfaces. The Data Portal client software intercepts the server data activity and does control plane activity using the DataSphere appliance and performs IO directly using the physical storage.
  • Objective based data management – supplies a data affinity service. That is, data can have a temporary location relationship with physical storage depending on the current performance (R:W, IOPS, bandwidth, latency) and protection (durability, availability, disaster recoverability, security, copy-ability, version-ability) characteristics of the data. These data objectives are matched to the capabilities or service catalog of the storage infrastructure, and data objectives can change over time (a toy sketch of such objective matching appears after this list).
  • Analytics in the loop – detects the performance and other characteristics of the storage and data in real-time. That is, by monitoring the storage IO activity Primary Data can determine the actual performance attributes of the storage. Similarly, by monitoring the application's IO characteristics over time, the system can determine the performance objectives of its data. The system also takes advantage of SMI-S to define some of the other characteristics of the storage systems.
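To make the objective-matching idea concrete, here is a toy sketch of matching per-object objectives against a catalog of storage capabilities. The class names, tiers and numbers are all invented for illustration; Primary Data's actual policy engine is not public.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class StorageTier:
    name: str
    max_iops: int        # IOPS the tier can sustain
    latency_ms: float    # typical access latency
    durability: int      # e.g., number of nines

@dataclass
class DataObjectives:
    min_iops: int
    max_latency_ms: float
    min_durability: int

def place(obj: DataObjectives, catalog: List[StorageTier]) -> Optional[StorageTier]:
    """Pick the first tier whose capabilities satisfy the object's objectives."""
    for tier in catalog:
        if (tier.max_iops >= obj.min_iops
                and tier.latency_ms <= obj.max_latency_ms
                and tier.durability >= obj.min_durability):
            return tier
    return None

catalog = [
    StorageTier("all-flash NAS", max_iops=500_000, latency_ms=0.5, durability=5),
    StorageTier("hybrid NAS",    max_iops=50_000,  latency_ms=5.0, durability=5),
    StorageTier("cloud object",  max_iops=1_000,   latency_ms=50.0, durability=11),
]

print(place(DataObjectives(min_iops=20_000, max_latency_ms=10.0, min_durability=4), catalog))
```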

How does Primary Data work?

Primary Data has taken advantage of the parallel NFS extensions (pNFS) in NFSv4 to externalize and separate the storage control plane from the IO data plane. This works well for native Linux, where the main developer of the Linux NFS file system stack is on their payroll.

In Windows, they put a filter driver in front of SMB to split the control plane off from the data IO plane. Something similar is done for VMware ESX servers, but in that case a software-defined Data Portal goes along with the DataSphere service client so the split can be done entirely within the same ESX server. Another alternative is to use the Data Portal appliance as a storage virtualization service, but then the IO data path goes through the portal.
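A minimal sketch of what a control-plane/data-plane split looks like from the client's point of view, loosely in the spirit of pNFS layouts. The names (MetadataService, Layout, read_object) are invented for illustration; this is not Primary Data's or the Linux pNFS client's actual interface.

```python
from dataclasses import dataclass

@dataclass
class Layout:
    data_server: str   # where the bytes actually live
    object_path: str   # path/identifier on that server

class MetadataService:
    """Control plane: knows where every object resides, performs no data IO."""
    def __init__(self):
        self.catalog = {"results.dat": Layout("nfs://dataserver1", "/vol0/results.dat")}

    def get_layout(self, name: str) -> Layout:
        return self.catalog[name]

def read_object(mds: MetadataService, name: str) -> bytes:
    """Data plane: after one control-plane round trip for the layout,
    the client talks straight to the storage that holds the data."""
    layout = mds.get_layout(name)
    # ... open layout.data_server + layout.object_path directly and read ...
    return b""   # placeholder; the real IO path bypasses the metadata service

mds = MetadataService()
print(mds.get_layout("results.dat"))     # control-plane lookup
print(read_object(mds, "results.dat"))   # data-plane access (stubbed out here)
```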

According to their datasheet, they currently support data virtualization services for NetApp cDOT and 7-mode, EMC Isilon OneFS 7.2, and Nexenta 4.x & 5.0, but plan on more.

They are not quite GA yet, but are close.

Comments?


The data is the hybrid cloud

I have been at the NetApp Insight 2015 conference for the past two days and have been struck by one common theme. They have been talking since the get-go about the Data Fabric and how Clustered Data ONTAP (cDOT) is the foundation of the NetApp Data Fabric, which spans on premises, private cloud, off premises public cloud and everything in between.

But the truth of the matter is that it's the data that really needs to span all these domains. Hybrid cloud really needs data movement everywhere; NetApp cDOT is just the enabler that makes moving the data around much easier.

NetApp cDOT data services

From a cDOT perspective, NetApp has available today:

  • Cloud ONTAP – a software defined ONTAP storage service executing in the cloud, operating on cloud server provider hardware using DAS storage and providing ONTAP data services for your private cloud resident data.
  • ONTAP Edge – similar to Cloud ONTAP, but operating on premises with customer commodity server & DAS hardware and providing ONTAP data services.
  • NetApp Private Storage (NPS) – NetApp storage systems operating in a “near cloud” environment, directly connected to cloud service providers, that supply low latency/high IOPS NetApp storage services to cloud compute applications.
  • NetApp cDOT on premises storage hardware – NetApp storage hardware with All Flash FAS as well as normal disk-only and hybrid FAS storage hardware supplying ONTAP data services to on premises applications.

NetApp Data Fabric

NetApp's Data Fabric is built on top of ONTAP data services and allows a customer to use any of the above storage instances to host their private data. Which is great in and of itself, but when you realize that a customer can also move their data from any one of these ONTAP storage instances to any other storage instance, that's when you see the power of the Data Fabric.

The Data Fabric depends mostly on storage efficient ONTAP SnapMirror data replication and ONTAP data cloning capabilities. These services can be used to replicate ONTAP data (LUNs/volumes) from one cDOT storage instance to another and then use ONTAP data cloning services to create accessible copies of this data at the new location. This could be on premises to near cloud, to public cloud or back again, all within the confines of ONTAP data services.
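For what it's worth, the mirror-then-clone workflow reduces to two steps. The sketch below only models the shape of that workflow; the class and method names are invented and bear no relation to the actual ONTAP CLI or API.

```python
class OntapInstance:
    """Toy stand-in for a cDOT storage instance (on premises, NPS or Cloud ONTAP)."""
    def __init__(self, name):
        self.name = name
        self.volumes = {}   # volume name -> contents (stand-in for a volume)

    def snapmirror_from(self, source: "OntapInstance", volume: str) -> str:
        """Step 1: replicate a space-efficient, read-only mirror of source's volume here."""
        mirror = volume + "_mirror"
        self.volumes[mirror] = source.volumes[volume]
        return mirror

    def clone(self, volume: str) -> str:
        """Step 2: create a writable, space-efficient clone of the mirrored volume."""
        clone = volume + "_clone"
        self.volumes[clone] = self.volumes[volume]
        return clone

on_prem = OntapInstance("on-prem FAS")
nps = OntapInstance("NPS near-cloud")
on_prem.volumes["sqldb_backup"] = "app-consistent SQL backup"

mirror = nps.snapmirror_from(on_prem, "sqldb_backup")
clone = nps.clone(mirror)
print(f"Azure/AWS applications can now mount {clone} on {nps.name}")
```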

Data Fabric in action

Now I like the concept, but they also showed an impressive demo of using cDOT and AltaVault (NetApp's solution acquired last year from Riverbed, formerly their SteelStore backup appliance) to perform an application consistent backup of a SQL database. But once they had this, it went a little crazy.

They SnapMirrored this data from the on premises storage to a near cloud NPS storage instance, then cloned the data from the mirrors, and after that fired up applications running in Azure to process the data. Then they shut down the Azure application and fired up a similar application in AWS using the exact same NPS hosted data. Of course, they then SnapMirrored the same backup data (I think from the original on premises storage) to Cloud ONTAP, just to show it could be done there as well.

Ok, I get it: you can replicate (mirror) data from any cDOT storage instance (whether on premises, at a remote site, at near cloud NPS, or in the cloud using Cloud ONTAP). Once there, you can clone this data and use it with applications running in any environment that has access to this data instance (such as AWS, Azure and cloud service providers).

And I like the fact that all this can be accomplished with NetApp's SnapCenter software. And I especially like the fact that the clones don't take up any extra space and the replication mirroring is done in a quick, space efficient (read deduped) manner.

But having to set up a replication or mirror association between cDOT on premises and cDOT at NPS or Cloud ONTAP, and then having to clone the volumes at the target side, seems superfluous. What I really want to do is just copy or move the data and have it show up at the target site without the mirror association in the middle. It's almost like what I want is a CLONE that operates across cDOT storage instances wherever they reside.

Well, I'm an analyst and don't have to implement any of this (thank god). But what NetApp seems to have done is use their current tools and ONTAP data service capabilities to allow customer data to move anywhere it needs to be, in a customer controlled, space efficient, private and secure manner. Once hosted at the new site, applications have access to this data and customers still have all the ONTAP data services they had on premises, but now in cloud and near cloud locations.

Seems pretty impressive to me for all of a customer's ONTAP data. But when you combine the Data Fabric with Foreign LUN Import (importing non-NetApp data into ONTAP storage) and FlexArray (storage virtualization under ONTAP), you can see how the Data Fabric can apply to non-NetApp storage instances as well, and then it becomes really interesting.

~~~~

There was a company that once said that “The Network is the Computer” but today, I think a better tag line is “The Data is the Hybrid Cloud”.

Comments?

 

Nanterro emerges from stealth with CNT based NRAM

Nanterro just came out of stealth this week and bagged $31.5M in a Series E funding round. Apparently, Nanterro has been developing a new form of non-volatile RAM (NRAM), based on carbon nanotubes (CNT), which seems to work like an old T-bar switch, only at the nanometer scale and using CNT for the wiring.

They were founded in 2001 and are finally ready to emerge from stealth. Nanterro already has 175+ issued patents, with another 200 patents pending. The NRAM is already in production at 7 CMOS fabs and they are sampling 4Mb NRAM chips to a number of customers.

NRAM vs. NAND

Performance of the NRAM is on a par with DRAM (~100 times faster than NAND), it can be configured in 3D, and it supports MLC (multi-bits per cell) configurations. NRAM also supports orders of magnitude more accesses (I assume they mean writes) and retains data much longer than NAND.

The only question is capacity: with shipping NAND on the order of 200Gb per chip, a 4Mb NRAM chip is roughly 50,000X (about 2**15.6) behind NAND. Nanterro claims that their CNT-NRAM CMOS process can be scaled down to <5nm, which is one or two generations below the current NAND scale factor, and assuming they can pack as many bits in the same area, they should be able to compete well with NAND. They claim that their NRAM technology is capable of terabit capacities (assumed to be at the 5nm node).

The other nice thing is that Nanterro says the new NRAM uses less power than DRAM, which means that in addition to attaining higher capacities and DRAM-like access times, it will also reduce power consumption.

It seems a natural for mobile applications. The press release claims it has already been tested in space and that there are customers looking at the technology for automobiles. The company claims the total addressable market is ~$170B USD, which probably includes DRAM and NAND together.

CNT in CMOS chips?

Key to Nanterro's technology was incorporating CNT into CMOS processes, so that chips can be manufactured on current fab lines. This is probably just the start of the use of CNT in electronic chips, but it's a use that could potentially pay for the technology development many times over. CNT has a number of characteristics that would be beneficial to other electronic circuitry beyond NRAM.

How quickly they can ramp the capacity up from 4Mb seems to be a significant factor, which is no doubt why they went out for Series E funding.

So we have another new non-volatile memory technology. On the other hand, these guys seem to be a long way out of the lab, with something that works today and the potential to go all the way down to 5nm.

It should be interesting, as the other NV technologies start to emerge, to see which one generates sufficient market traction to succeed in the long run. Especially as NAND doesn't seem to be slowing down much.

Comments?

Picture Credits: Wikimedia.com

Transporter, a private Dropbox in a tower

Move over Dropbox, Box and all you synch&share wannabees, there's a new synch and share in town.

At SFD7 last month, we were visiting with Connected Data, where CEO Geoff Barrell was telling us all about what was wrong with today's cloud storage solutions. In front of all the participants was this strange, blue glowing device. As it turns out, Connected Data's main product is the File Transporter, which is a private file synch and share solution.

All the participants were given a new, 1TB Transporter system to take home. It was an interesting sight to see a dozen of these Transporter towers sitting in front of all the bloggers.

I quickly established a new account, installed the software, and activated the client service. I must admit, I took it upon myself to “claim” just about all of the Transporter towers while the other bloggers were still paying attention to the presentation. Sigh, they later made me give back (unclaim) all but mine, but for a minute there I had about 10TB of synch and share space at my disposal.

Transporters rule

So what is it? The Transporter is both a device and an Internet service, where you own the storage and networking hardware.

The home-office version comes as a 1 or 2TB 2.5” hard drive, in a tower configuration that plugs into a base module. The base module runs a secured version of Linux and their synch and share control software.

As the tower powers on, it connects to the Internet and invokes the Transporter control service. This service identifies the node and who owns it, and provides access to the storage on the Transporter to all desktops, laptops, and mobile applications that have access to it.

At initiation of the client service on a desktop/laptop, it creates (by default) a new Transporter directory (folder). Files placed in this directory are automatically synched to the Transporter tower and then synchronized to any and all online client devices that have claimed the tower.

Apparently you can have multiple towers that are claimed to the same account. I personally tested up to 10 ;/ and it didn’t appear as if there was any substantive limit beyond that but I’m sure there’s some maximum count somewhere.

A couple of nice things about the tower: it's yours, so you can move it to any location you want. That means you could take it with you to your hotel or other remote offices and have a local synch point.

Also, initial synchronization can take place over your local network, so it can occur as fast as your LAN can handle it. I remember the first time I up-synched 40GB to DropBox: it seemed to take weeks to complete, and the down-synch to my laptop took less time, but still days. With the tower on my local network, I can synch my data much faster and then take the tower with me to my other office location and have a local synch datastore. (I may have to start taking mine to conferences. Howard (@deepstorage.net, co-host on our GreyBeards on Storage podcast) had his operating in all the subsequent SFD7 sessions.)

The Transporter also allows sharing of data. Steve immediately started sharing all the presentations on his Transporter service so the bloggers could access the data in real time.

They call the Transporter a private cloud but in my view, it’s more a private synch and share service.

Transporter heritage

The Transporter people were all familiar to the SFD crowd, as they were formerly with Drobo, which was at a previous SFD session (see SFD1). And like Drobo, you can install any 2.5″ disk drive in your Transporter and it will work.

There are workgroup and business class versions of the Transporter storage system. The workgroup versions are desktop configurations (they look very much like a Drobo box) that support up to 8TB or 12TB, for 15 or 30 users respectively. They also have two business class, rack mounted appliances with up to 12TB or 24TB each that support 75 or 150 users each. The business class solution has onboard SSDs for metadata acceleration. Like the Transporter tower, the workgroup and business class appliances are bring your own disk drives.

Connected Data’s presentation

Geoff's discussion (see the SFD7 video) was a tour of the cloud storage business model. His view was that most of these companies are losing money. In fact, even Amazon S3/Glacier appears to be bleeding money, although this may not stop Amazon. Of course, DropBox and other synch and share services all depend on cloud storage for their datastores. So the lack of a viable, profitable business model threatens all of these services in the long run.

But the business model is different when a customer owns the storage. Here the customer bears the actual storage cost. The only things that Connected Data provides are the client software and the Internet service that runs it. Pricing for the 1TB and 2TB Transporters with disk drives is $150 and $240, respectively.

Having a Transporter

One thing I don't like is the lack of data-at-rest encryption. They use TLS for data transfers across your LAN and the Internet. The nice thing about having possession of the actual storage is that you can move it around, but the downside is that you may move it to less secure environments (like conference hotel rooms). And as with any disk storage, someone can walk up to the device and steal the disk. Whether the data would be easily recognizable is another question, but having it encrypted would put that question to rest. There's some indication on the Transporter support site that encryption may be coming for the business class solution, but nothing was said about the Transporter tower.

On the Mac, the Transporter folder has the shared folders as direct links (real sub-folders), but the local data is under a Transporter Library soft link. It turns out to be a hidden file (“.Transporter Library”) under the Transporter folder. When you Control-click on this file you are given the option to view deleted files. You can do this with shared files as well.

One problem with synch and share services is that once someone in your collaboration group deletes some shared files, they are gone (over time) from all other group users, even if some of them wanted to keep them. Transporter makes it a bit easier to view these files and save them elsewhere. But I assume at some point they have to be purged to free up space.

When I first installed the Transporter, it showed up as a network node under my Finder's shared servers. But the latest desktop version (3.1.17) has removed this.

Also, some of the bloggers complained about seeing files “in flux” or duplicates of the shared files with unusual suffixes appended to them, such as “filename124224_f367b3b1-63fa-4d29-8d7b-a534e0323389.jpg”. Enrico (@ESignoretti) opened a support ticket on this; it has supposedly been fixed in the latest desktop version, as this was a temporary filename used only during upload that should have been deleted/renamed after the upload completed. I just uploaded 22MB across about 40 files and didn't see any of this.

I really want encryption, as I wanted one Transporter in a remote office and another in the home office, with everything synched locally, and then I would hand carry the remote one to the other location. But without encryption this isn't going to work for me. So I guess I will limit myself to just one and move it around to wherever I want my data to go.

Here are some of the other blog posts by SFD7 participants on Transporter:

Storage field day 7 – day 2 – Connected Data by Dan Firth (@PenguinPunk)

File Transporter, private Synch&Share made easy by Enrico Signoretti (@ESignoretti)

Transporter – Storage Field Day 7 preview by Keith Townsend (@VirtualizedGeek)

Comments?

Data virtualization surfaces

There’s a new storage startup out of stealth, called Primary Data and it’s implementing data (note, not storage) virtualization.

They already have $60M in funding and some pretty high-powered talent from Fusion IO, namely David Flynn, Rick White and Steve Wozniak (the 'Woz', also of Apple fame).

There have been a number of attempts at creating a virtualization layer for data, notably EMC ViPR (see my post ViPR virtues, vexations but no storage virtualization), but Primary Data is taking a different tack to the problem.

Data virtualization explained

[Graphic: Data hypervisor, software defined storage, data plane, control plane. (c) 2012 Silverton Consulting, Inc. All rights reserved]

Essentially they want to separate the data plane from the control plane (See my Data Hypervisor post and comments for another view on this).

  • The data plane consists of those storage system activities that actually perform IO, i.e., reads and writes.
  • The control plane is those storage system activities that do everything else that has to be done by a storage system, including provisioning, monitoring, and managing the storage.

Separating the data plane from the control plane offers a number of advantages. EMC ViPR does this, but its data plane is either standard storage systems like VMAX, VNX, Isilon, etc., or software defined storage solutions. Primary Data wants to do it all.

Their metadata or control plane engine is called the Data Director. It holds information about the data objects stored in the Primary Data system, runs a data policy management engine and handles data migration.

Primary Data relies on purpose-built Data Hypervisor (client) software that talks to Data Directors to understand where data objects reside and how to go about accessing them. But once the metadata information is transferred to the client software, IO activity can go directly between the host and the storage system in a protocol independent fashion.

[The graphic above is from my prior post and I assumed the data hypervisor (DH) would be co-located with the data but Primary Data has rightly implemented this as a separate layer in host software.]

Data Hypervisor protocol independence?

As I understand it, this means that customers could use file storage, object storage or block storage to support any application requirement. This also means that file data (objects) could be migrated to block storage and still be accessed as file data. The converse is also true: block data (objects) could be migrated to file storage and still be accessed as block data. Add object, DAS, PCIe flash and cloud storage to the mix and you can see where they are headed.

All data in Primary Data's system are object encapsulated, and all data objects are catalogued within a single, global namespace that spans file, block, object and cloud storage repositories.

Data objects can reside on Primary storage systems, external non-Primary data aware file or block storage systems, DAS, PCIe Flash, and even cloud storage.
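A toy sketch of what a single global namespace over mixed repositories implies: the object keeps one identity while its physical location changes underneath it. The catalog, ids and locators below are invented for illustration and are not Primary Data's actual implementation.

```python
# object id -> (repository type, physical locator)
namespace = {
    "obj-0001": ("file",   "nfs://filer1/vol0/genome.dat"),
    "obj-0002": ("block",  "iscsi://array2/lun17"),
    "obj-0003": ("object", "s3://bucket-a/sim-output"),
}

def locate(object_id: str):
    """Continuity of identity: callers always ask by object id,
    regardless of where the bytes currently live."""
    return namespace[object_id]

def migrate(object_id: str, new_repo: str, new_locator: str) -> None:
    """Continuity of access: migration only updates the catalog entry;
    the object id seen by applications never changes."""
    namespace[object_id] = (new_repo, new_locator)

migrate("obj-0001", "cloud", "s3://archive-bucket/genome.dat")
print(locate("obj-0001"))
```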

How does Data Virtualization compare to Storage Virtualization?

There are a number of differences:

  1. Most storage virtualization solutions are in the middle of the data path and because of this have to be fairly significant, highly fault-tolerant solutions.
  2. Most storage virtualization solutions don’t have a separate and distinct meta-data engine.
  3. Most storage virtualization systems don’t require any special (data hypervisor) software running on hosts or clients.
  4. Most storage virtualization systems don’t support protocol independent access to data storage.
  5. Most storage virtualization systems don’t support DAS or server based, PCIe flash for permanent storage. (Yes this is not supported in the first release but the intent is to support this soon.)
  6. Most storage virtualization systems support internal storage that resides directly inside the storage virtualization system hardware.
  7. Most storage virtualization systems support an internal DRAM cache layer which is used to speed up IO to internal and external storage and is in addition to any caching done at the external storage system level.
  8. Most storage virtualization systems only support external block storage.

There are a few similarities as well:

  1. They both manage data migration in a non-disruptive fashion.
  2. They both support automated policy management over data placement, data protection, data performance, and other QoS attributes.
  3. They both support multiple vendors of external storage.
  4. They both can support different host access protocols.

Data Virtualization Policy Management

A policy engine runs in the Data Directors and provides SLAs for data objects. This would include performance attributes, protection attributes, security requirements and cost requirements.  Presumably, policy specifications for data protection would include RAID level, erasure coding level and geographic dispersion.

In Primary Data, backup becomes nothing more than object snapshots with different protection characteristics, like an offsite full copy. Moreover, data object migration can be handled completely outboard, without causing data access disruption, on an automated policy basis.

Primary Data first release

Primary Data will be initially deployed as an integrated data virtualization solution which includes an all flash NAS storage system and a standard NAS system. Over time, Primary Data will add non-Primary Data external storage and internal storage (DAS, SSD, PCIe Flash).

The Data Policy Engine and Data Migrator functionality will be charged for as separate software solutions. Data Directors are sold in pairs (active-passive) and can be non-disruptively upgraded. Storage (directors?) is also sold separately.

Data Hypervisor (client) software is available for most flavors of Linux and OpenStack, and is coming for ESX. Windows SMB support is not yet split (control plane/data plane), but Primary Data does support SMB. I believe the Data Hypervisor software will also be released in an upcoming version of the Linux kernel.

They are currently in testing. There is no official date for GA, but they did say they would announce pricing in 2015.

~~~~

Comments?

Disclosure: We have done work for Primary Data over the past year.

Photo Credits:

  1. Screen shot of beta test system supplied by Primary Data
  2. Graphic created by SCI for prior Data Hypervisor post

Protest intensity, world news database and big data – chart of the month

Read an article the other day on the analysis of the Arab Spring (Did the Arab Spring really spark a wave of global protests, in Foreign Policy) using a Google Ideas sponsored project, the GDELT Project (Global Database of Events, Language and Tone), a file of events extracted from worldwide media sources. [Chart: time domain run chart showing protest intensity every month for the last 30 years, with running average.] The GDELT database uses sophisticated language processing to extract “event” data from news media streams and supplies this information in database form. The database can be analyzed to identify trends in world events and possibly to better understand what led up to events that occur on our planet.

GDELT Project

The GDELT database records over 300 categories of events, each geo-referenced down to the city or mountaintop and time-referenced. The event data dates back to 1979. GDELT captures some 60 attributes of any event that occurs, generating a giant spreadsheet of event information with location, time, parties, and myriad other attributes all identified and cross-referenceable.
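As an aside, the protest-intensity counts behind a chart like the one above can be reproduced from the raw event files in a few lines. A minimal sketch, assuming a tab-delimited GDELT 1.0 event export with a header row containing (at least) MonthYear and EventRootCode columns (CAMEO root code 14 covers protest events); the filename is hypothetical and the exact file layout should be checked against the GDELT documentation.

```python
import pandas as pd

events = pd.read_csv("gdelt_events.tsv", sep="\t", dtype=str)

# CAMEO event root code 14 = protest events
protests = events[events["EventRootCode"] == "14"]
monthly = protests.groupby("MonthYear").size().sort_index()

# A simple 12-month running average, akin to the black line in the chart
smoothed = monthly.rolling(window=12, min_periods=1).mean()

print(monthly.tail())
print(smoothed.tail())
```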

Besides the extensive spreadsheet of world event attribute data the GDELT project also supplies a knowledge graph oriented view of its event data. The GDELT knowledge graph “compiles a list of every person, organization, company, location and over 230 themes and emotions from every news report” that can then be used to create network diagrams/graphs to be better able to visualize interactions between events. 

For example, see the Global Conversation in Foreign Policy for a network diagram of every person mentioned in the news during 6 months of 2013. You can zoom in or out to see how people identified in news reports are connected during those six months. So if you were interested in, let's say, the Syrian civil war, you could easily see at a glance any news item that mentioned Syria or was located in Syria, from 1979 to now. Wow!

Arab Spring and Worldwide Protest

Getting back to the chart-of-the-month, the graphic above shows the “protest intensity” by month for the last 30 years, with a running average charted in black, using GDELT data. (It's better seen in the FP article linked above, or just click on it for an expanded view.)

One can see from the chart that there was a significant increase in protest activity after January 2011, which corresponds to the beginning of the Arab Spring. But the amazing inference from the chart is that this increase has continued ever since. This shows that the Arab Spring has had a lasting effect, significantly increasing worldwide protest activity.

This is just one example of the types of research available with the GDELT data.

~~~~

I have talked in the past about how (telecom, social media and other) organizations should deposit their corporate/interaction data sets in some public repository, for the greater good of humanity, so that any researcher could use them (see my Data of the world, lay down your chains post for more on this). The GDELT Project is Google Ideas doing this on a larger scale than I ever thought feasible. Way to go.

Comments?

 Image credits: (c) 2014 ForeignPolicy.com, All Rights Reserved