Data processing logistics

IBM System/370 Model 145 by jovike (cc) (from Flickr)

Chuck Hollis wrote a great post on “information logistics” as a new paradigm that IT centers have to consider as they deploy applications around the globe and into the cloud.  The problem is that there’s lots of data to move around in order to make all this work.

Supercomputing’s Solution

Big data/supercomputing groups have been thinking about this problem for a long time and have some solutions that might help, but it all harkens back to batch processing and JCL (job control language) of the last century.  In my comment to Chuck’s post I mentioned the University of Wisconsin’s Condor® Project, which can be used to schedule data transmission and data processing across distributed server nodes in a network.  But there are others, namely the Globus Toolkit 4 (GT4), which creates a data grid to support collaborative research on petabytes of data and is currently used by CERN for LHC data, by the EU for their data grid, and by others.  We have discussed Condor in our Free Cloud Storage and Cloud Computing post and GT4 in our 15PB a year created by CERN post.

These supercomputing projects were designed to move data around so that analysis could be done locally with results shared within the community.  However, at least with GT4, they replicate data at a number of nodes, which may not be storage efficient but does provide quicker access for data analysis.  At CERN, there is a hierarchy of nodes that participate in a GT4 data grid, and data is replicated between tiers and within peer nodes just to have better access to it.

In olden days, …

With JCL someone would code up a sequence of batch steps, each of which could be conditional on previous steps, that would manipulate data into some transient and, at the end, final form.  Sometimes JCL would invoke another job (another set of JCL) for a follow-on step if everything in this job worked as planned.  The JCL would wait in a queue until the data and execution resources were available for it.

This could mean mounting removable media, creating disk storage “datasets”, or waiting until other jobs were done with the datasets being needed.  Jobs would execute in a priority sequence, and scheduling options could include using different hosts (servers) that would coordinate to provide job execution services.   For all I know, z/OS still supports JCL for batch processing, but it’s been a long time since I have used JCL.
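For flavor, here’s a minimal sketch, in Python rather than JCL, of that conditional-step pattern: each step hands back a condition code, and later steps only run while the codes stay at or below a threshold.  The step names and codes are made up for illustration.

```python
# Minimal sketch of JCL-style conditional batch steps (hypothetical step names).
# Each step returns a condition code; later steps run only if earlier codes allow it.

def extract():      # step 1: pull raw data into a transient dataset
    return 0        # condition code 0 = success

def transform():    # step 2: manipulate the transient dataset
    return 0

def load():         # step 3: write the final-form dataset
    return 0

def run_job(steps, max_ok_code=4):
    """Run steps in order; abandon the job once a step's code exceeds the threshold."""
    for name, step in steps:
        code = step()
        print(f"step {name} ended with condition code {code}")
        if code > max_ok_code:
            print(f"abandoning job after {name}")
            return code
    return 0

if __name__ == "__main__":
    run_job([("EXTRACT", extract), ("TRANSFORM", transform), ("LOAD", load)])
```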

Cloud computing and storage services

Where does that bring us today?  Cloud computing and cloud storage are bringing this execution paradigm back into vogue.  But instead of batch jobs, we are talking about virtual machines, web applications, or anything else that can be packaged up and run generically on anybody’s hardware and storage.

The only problem is that there are only application-specific ways to control these execution activities.  I am thinking here of web services that hand off request handling to any web server that happens to have cycles to support it.  Similarly, database machines seem capable of handing off queries to any database server that has idle ergs to process with.  There are myriad others like this, but they all seem specific to one application domain.  Nothing exists that is generic or can cross many application domains.

That’s where something like Condor, GT4 or, god forbid, JCL can make some sense.  In essence, all of these approaches are application independent.  Because of that, they can be used by any number of applications to take advantage of cloud computing and cloud storage services.

Just had to get this out.  Chuck’s post had me thinking about JCL again and there had to be another solution.

15PB a year created by CERN

The Large Hadron Collider/ATLAS at CERN by Image Editor (cc) (from Flickr)

That’s what CERN produces from their 6 experiments each year.  How this data is subsequently accessed and replicated around the world is an interesting tale in multi-tier data grids.

When an experiment is run at CERN, the data is captured locally in what’s called Tier 0.  After some initial processing, this data is stored on tape at Tier 0 (CERN) and then replicated to over 10 Tier 1 locations around the world which then become a second permanent repository for all CERN experiment data.  Tier 2 centers can request data from Tier 1 locations and process the data and return results to Tier 1 for permanent storage.  Tier 3 data centers can request data and processing time from Tier 2 centers to analyze the CERN data.
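To make this tiered flow a little more concrete, here’s a rough sketch of the data movement in Python.  The class, site names, and methods are my own simplification, not any actual grid software.

```python
# Rough sketch of the multi-tier replication flow described above.
# Class and method names are illustrative, not any actual grid API.

class Tier:
    def __init__(self, name, level, parent=None):
        self.name = name
        self.level = level          # 0 = CERN, 1 = permanent replicas, 2/3 = analysis
        self.parent = parent
        self.datasets = set()

    def store(self, dataset):
        self.datasets.add(dataset)

# Tier 0 captures and archives experiment data, then replicates to Tier 1 sites.
tier0 = Tier("CERN", 0)
tier1_sites = [Tier(f"T1-site-{i}", 1, parent=tier0) for i in range(11)]

def capture_and_replicate(dataset):
    tier0.store(dataset)                 # initial processing + tape at Tier 0
    for site in tier1_sites:             # second permanent copy at each Tier 1
        site.store(dataset)

def tier2_analysis(tier2_name, dataset, tier1):
    """Tier 2 pulls data from a Tier 1, processes it, returns results for storage."""
    assert dataset in tier1.datasets
    results = f"{dataset}-results-from-{tier2_name}"
    tier1.store(results)                 # results go back to Tier 1 permanent storage
    return results

capture_and_replicate("ATLAS-run-0001")
tier2_analysis("T2-site-A", "ATLAS-run-0001", tier1_sites[0])
```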

Each experiment has its own set of Tier 1 data centers that store its results.  According to the latest technical description I could find, the Tier 0 (at CERN) and most Tier 1 data centers provide a permanent tape repository for experimental data, fronted by a disk cache.  Tier 2 centers can have similar resources but are not expected to be a permanent repository for data.

Each Tier 1 data center has its own hierarchical management system (HMS) or mass storage system (MSS) based on any number of software packages such as HPSS, CASTOR, Enstore, dCache, DPM, etc., most of which are open source products.  But regardless of the HMS/MSS implementation, they all provide a set of generic storage management services based on the Storage Resource Manager (SRM), as defined by a consortium of research centers, and a set of file transfer protocols defined by yet another set of standards from Globus or gLite.

Each Tier 1 data center manages its own storage element (SE).  Each experiment’s storage element has disk storage, optionally with tape storage (using one or more of the above HMS/MSS packages), and provides authentication/security, provides file transport, and maintains catalogs and local databases.  These catalogs and local databases index the data sets or files available on the grid for each experiment.
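In other words, a storage element is really a thin set of services (authentication, a local catalog, transfer setup) sitting in front of the disk and tape back end.  Here’s an illustrative abstraction of those responsibilities; it is not the real SRM API, just a sketch with made-up names.

```python
# Illustrative abstraction of a storage element's responsibilities (not the real SRM API).

class StorageElement:
    def __init__(self, hostname):
        self.hostname = hostname
        self.catalog = {}            # local catalog: SURL -> metadata about the replica

    def authenticate(self, credential):
        """Check the user's grid credential before allowing any transfer."""
        return credential is not None   # placeholder for real certificate checks

    def register_replica(self, surl, size_bytes, on_tape=False):
        """Record a replica in the SE's local catalog (and note whether it sits on tape)."""
        self.catalog[surl] = {"size": size_bytes, "on_tape": on_tape}

    def prepare_transfer(self, surl, protocol="gsiftp", port=2811):
        """Hand back a transport URL (TURL) the client can use to move the file."""
        if surl not in self.catalog:
            raise KeyError(f"no replica for {surl} at {self.hostname}")
        path = surl.split(self.hostname, 1)[-1]
        return f"{protocol}://{self.hostname}:{port}{path}"

se = StorageElement("se.example.org")
se.register_replica("srm:se.example.org/atlas/run0001/events.root", 2_000_000_000, on_tape=True)
print(se.prepare_transfer("srm:se.example.org/atlas/run0001/events.root"))
```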

Data stored in the grid are considered read-only and can never be modified.  Users who need to process this data read it from Tier 1 data centers, process it, and create new data which is then stored in the grid.  New data to be placed in the grid must be registered in the LCG file catalogue and transferred to a storage element before being replicated throughout the grid.
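A minimal sketch of that add-new-data sequence might look like the following; the function names, catalogue structure, and paths are hypothetical (the real gLite/LCG tooling wraps all of this in its own commands).

```python
import uuid

# Hypothetical sketch of adding a new file to the grid: every name and structure
# here is illustrative only; real deployments use the gLite/LCG tooling for this.

lcg_file_catalogue = {}   # GUID -> {"lfns": [...], "surls": [...]}

def register_new_data(local_path, lfn, storage_element):
    guid = f"guid:{uuid.uuid4()}"                 # new files get a GUID at creation
    surl = f"srm:{storage_element}{lfn.split(':', 1)[1]}"   # SURL format as described below
    # 1. transfer the file from local_path to the storage element (omitted here)
    # 2. register the GUID, LFN and replica location in the LCG file catalogue
    lcg_file_catalogue[guid] = {"lfns": [lfn], "surls": [surl]}
    return guid

guid = register_new_data("/tmp/new-results.root",
                         "lfn:/grid/myexp/new-results.root",
                         "se.example.org")
```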

CERN data grid file access

“Files in the Grid can be referred to by different names: Grid Unique IDentifier (GUID), Logical File Name (LFN), Storage URL (SURL) and Transport URL (TURL). While the GUIDs and LFNs identify a file irrespective of its location, the SURLs and TURLs contain information about where a physical replica is located, and how it can be accessed.” (taken from the gLite user guide).

  • GUIDs look like guid:<unique_string>.  Files are given a unique GUID when created, and it can never be changed.  The unique string portion of the GUID is typically a combination of MAC address and time-stamp and is unique across the grid.
  • LFNs look like lfn:<unique_string>.  Files can have many different LFNs, all pointing or linking to the same data.  LFN unique strings typically follow unix-like conventions for file links.
  • SURLs look like srm:<se_hostname>/path and provide a way to access data located at a storage element.  SURLs are transformed to TURLs.  SURLs are immutable and are unique to a storage element.
  • TURLs look like <protocol>://<se_hostname>:<port>/path and are obtained dynamically from a storage element.  TURLs can have any format after the // that uniquely identifies the file to the storage element, but typically they include an se_hostname, port and file path.

GUIDs and LFNs are used to look up a data set in the global LCG file catalogue.  After file lookup, a set of site-specific replicas is returned (via SURLs), which are used to request file transfer/access from a nearby storage element.  The storage element accepts the file’s SURL and assigns a TURL, which can then be used to transfer the data to wherever it’s needed.  TURLs can specify any file transfer protocol supported across the grid.
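Putting the naming scheme and the lookup flow together, the read path looks roughly like this sketch; the catalogue contents, hostnames, and function names are all made up for illustration.

```python
# Toy model of the LFN/GUID -> SURL -> TURL resolution described above.
# Catalogue contents, hostnames and function names are invented for illustration.

lcg_file_catalogue = {
    "lfn:/grid/atlas/run0001/events.root": [
        "srm:se1.t1-site-a.org/atlas/run0001/events.root",
        "srm:se2.t1-site-b.org/atlas/run0001/events.root",
    ],
}

def lookup_replicas(lfn_or_guid):
    """LFNs (or GUIDs) are location independent; the catalogue maps them to SURLs."""
    return lcg_file_catalogue[lfn_or_guid]

def request_turl(surl, protocol="gsiftp", port=2811):
    """Ask the storage element holding this SURL for a transport URL (TURL)."""
    se_hostname, path = surl[len("srm:"):].split("/", 1)
    return f"{protocol}://{se_hostname}:{port}/{path}"

# Pick a nearby replica (here, just the first one) and get a TURL to fetch it.
surls = lookup_replicas("lfn:/grid/atlas/run0001/events.root")
turl = request_turl(surls[0])
print(turl)   # gsiftp://se1.t1-site-a.org:2811/atlas/run0001/events.root
```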

The file transfer and access protocols currently supported by the CERN data grid include:

  • GSIFTP – a grid security interface enabled subset of the GRIDFTP interface as defined by Globus
  • gsidcap – a grid security interface (GSI) enabled version of dCache’s native file access protocol (dcap).
  • rfio – remote file I/O supported by DPM. There is both a secure and an insecure version of rfio.
  • file access – local file access protocols used to access the file data locally at the storage element.

While all storage elements provide the GSIFTP protocol, the other protocols supported depend on the underlying HMS/MSS system implemented by the storage element for each experiment.  Most experiments use one type of MSS throughout their worldwide storage elements and, as such, offer the same file transfer protocols throughout the world.
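So a client effectively falls back to GSIFTP as the lowest common denominator and uses anything fancier only if the SE’s HMS/MSS offers it.  A purely illustrative sketch of that choice, with made-up hostnames:

```python
# Illustrative protocol selection: GSIFTP is the common baseline, other protocols
# depend on the storage element's underlying HMS/MSS (hostnames here are made up).

SE_PROTOCOLS = {
    "se.dcache-site.org": ["gsiftp", "gsidcap"],   # dCache-backed SE
    "se.dpm-site.org":    ["gsiftp", "rfio"],      # DPM-backed SE
}

def pick_protocol(se_hostname, preferred=("gsidcap", "rfio", "gsiftp")):
    available = SE_PROTOCOLS.get(se_hostname, ["gsiftp"])   # every SE speaks GSIFTP
    for proto in preferred:
        if proto in available:
            return proto
    return "gsiftp"

print(pick_protocol("se.dcache-site.org"))   # gsidcap
print(pick_protocol("se.unknown-site.org"))  # gsiftp
```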

If all this sounds confusing, it is.  Imagine 15PB a year of data replicated to over 10 Tier 1 data centers, which can then be securely processed by over 160 Tier 2 data centers around the world.  All this supports literally thousands of scientists who have access to every byte of data created by CERN experiments and by the scientists that post-process this data.

Just exactly how this data is replicated to the Tier 1 data centers and how a scientist processes such data will have to be the subject of other posts.