Data processing logistics

IBM System/370 Model 145 By jovike (cc) (from Flickr)

Chuck Hollis wrote a great post on “information logistics”, a new paradigm IT centers have to consider as they deploy applications around the globe and into the cloud.  The problem is that there’s a lot of data to move around to make all this work.

Supercomputing’s Solution

Big data/supercomputing groups have been thinking about this problem for a long time and have some solutions that might help, but it all harkens back to batch processing and JCL (job control language) of the last century.  In my comment to Chuck’s post I mentioned the University of Wisconsin’s Condor® Project, which can be used to schedule data transmission and data processing across distributed server nodes in a network.  There are others as well, notably the Globus Toolkit 4 (GT4), which creates a data grid to support collaborative research on petabytes of data and is currently used by CERN for LHC data, by the EU for their data grid, and by others.  We have discussed Condor in our Free Cloud Storage and Cloud Computing post and GT4 in our 15PB a year created by CERN post.
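For a flavor of how a Condor job gets described, here is a minimal sketch using HTCondor’s Python bindings; the executable and file names are invented, and a classic plain-text submit description file would express the same thing:

```python
import htcondor  # HTCondor's Python bindings (the "htcondor" package on PyPI)

# Describe one job: ship the input chunk to whatever pool node is free,
# run the (hypothetical) analysis program there, and pull the results
# back when the job exits.
job = htcondor.Submit({
    "executable":              "analyze",          # hypothetical analysis binary
    "arguments":               "chunk_0042.dat",
    "transfer_input_files":    "chunk_0042.dat",
    "should_transfer_files":   "YES",
    "when_to_transfer_output": "ON_EXIT",
    "output":                  "chunk_0042.out",
    "error":                   "chunk_0042.err",
    "log":                     "analysis.log",
})

schedd = htcondor.Schedd()   # the local submit/queue daemon
schedd.submit(job)           # the job waits in the queue until a slot frees up
```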

These supercomputing projects were designed to move data around so that analysis could be done locally and results shared within the community.  However, at least with GT4, data is replicated at a number of nodes, which may not be storage efficient but does provide quicker access for analysis.  At CERN, a hierarchy of nodes participates in the GT4 data grid, and data is replicated between tiers and among peer nodes purely to provide better access to it.

In olden days, …

With JCL someone would code up a sequence of batch steps, each of which could be conditional on previous steps, that would manipulate data through transient forms into its final form.  Sometimes the JCL would invoke another job (another set of JCL) as a follow-on step if everything in this job worked as planned.  The JCL would wait in a queue until the data and execution resources it needed were available.

This could mean mounting removable media, creating disk storage “datasets”, or waiting until other jobs were done with the datasets being needed.  Jobs would execute in priority sequence, and scheduling options could include using different hosts (servers) that coordinated to provide job execution services.   For all I know, z/OS still supports JCL for batch processing, but it’s been a long time since I have used it.
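To make that concrete, here is a loose Python analogue of the pattern (not real JCL): a job is an ordered list of steps, a step can be skipped unless an earlier step ended cleanly (roughly JCL’s COND= parameter), and an intermediate “dataset” is just a temporary file handed from one step to the next.  The step names, programs, and file names are invented.

```python
import os
import subprocess
import tempfile

return_codes = {}  # step name -> return code, roughly JCL's RC for each step

def run_step(name, argv, only_if_rc_zero=None):
    """Run one batch step, skipping it unless the named prior step ended with RC=0."""
    if only_if_rc_zero is not None and return_codes.get(only_if_rc_zero) != 0:
        print(f"{name}: skipped (conditional on {only_if_rc_zero})")
        return
    rc = subprocess.call(argv)
    return_codes[name] = rc
    print(f"{name}: RC={rc}")

# Transient "dataset": created by STEP1, consumed by STEP2, then deleted.
fd, temp_ds = tempfile.mkstemp()
os.close(fd)

run_step("STEP1", ["sort", "-o", temp_ds, "raw_input.txt"])
run_step("STEP2", ["cp", temp_ds, "final_report.txt"], only_if_rc_zero="STEP1")

os.unlink(temp_ds)  # roughly DISP=(OLD,DELETE) on the transient dataset
```

If STEP1 fails (say, raw_input.txt is missing), STEP2 never runs, which is exactly the conditional-step behavior described above.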

Cloud computing and storage services

Where does that bring us today?  Cloud computing and cloud storage are bringing this execution paradigm back into vogue.  But instead of batch jobs, we are talking about virtual machines, web applications, or anything else that can be packaged up and run generically on anybody’s hardware and storage.

The only problem is that there are only application-specific ways to control these execution activities.  I am thinking here of web services that hand off request handling to any web server that happens to have cycles to support it.  Similarly, database machines seem capable of handing off queries to any database server that has idle ergs to process with.  There are myriad others like this, but they all seem specific to one application domain.  Nothing exists that is generic or can cross many application domains.

That’s where something like Condor, GT4 or, god forbid, JCL can make some sense.  In essence, all of these approaches are application independent.  Because of that, they can be used by any number of applications to take advantage of cloud computing and cloud storage services.
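As a back-of-the-envelope illustration (all names and fields here are invented), an application-independent job description really only needs to say where the data lives, what opaque payload to run, and how urgent it is; a scheduler can then stage the data and hand the work to whichever node has spare capacity, without knowing whether the payload is a web tier, a database query, or a VM image:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    priority: int                     # lower number = more urgent, as in a batch queue
    name: str = field(compare=False)
    inputs: list = field(compare=False, default_factory=list)   # datasets to stage in
    payload: list = field(compare=False, default_factory=list)  # opaque command to run

@dataclass
class Node:
    name: str
    free_slots: int

def dispatch(jobs, nodes):
    """Stage each job's data to, and run it on, the node with the most spare capacity."""
    heapq.heapify(jobs)               # jobs come off the queue in priority order
    while jobs:
        job = heapq.heappop(jobs)
        node = max(nodes, key=lambda n: n.free_slots)
        if node.free_slots == 0:
            heapq.heappush(jobs, job) # nothing free anywhere; leave the rest queued
            break
        for ds in job.inputs:
            print(f"staging {ds} -> {node.name}")
        print(f"running {job.payload} on {node.name}")
        node.free_slots -= 1

dispatch(
    jobs=[Job(1, "nightly-etl", ["sales.db"], ["run_etl.sh"]),
          Job(2, "web-tier",    ["site.tar"], ["httpd", "-D", "FOREGROUND"])],
    nodes=[Node("cloud-a", free_slots=2), Node("cloud-b", free_slots=1)],
)
```

Nothing in that sketch cares what the payload actually is, which is the whole appeal of the Condor/GT4/JCL style of scheduling.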

Just had to get this out.  Chuck’s post had me thinking about JCL again and there had to be another solution.