Archeology meets Big Data

Polynya off the Antarctic Coast by NASA Earth Observatory (cc) (From Flickr)

Read an article yesterday about the use of LIDAR (light detection and ranging, Wikipedia) to map the remains of a pre-Columbian civilization in Mexico, the little-known Purepecha empire, peers of the Aztecs.

The original study (see LIDAR at Angamuco) cited in the piece above was a result of the Legacies of Resilience project sponsored by Colorado State University (CSU) and goes into some detail about the data processing and archeological use of the LIDAR maps.

Why LIDAR?

LIDAR sends a laser pulse from an airplane/satellite to the ground and measures how long it takes for the reflection to return to the receiver. With that information and “some” data processing, those measurements can be converted into X, Y & Z coordinates, i.e., a detailed map of the ground.
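To make that “some” data processing slightly more concrete, here is a minimal sketch of the basic time-of-flight arithmetic (the function name and sample timing are illustrative only; real pipelines also fold in the aircraft’s GPS/IMU position and scan angle to georeference each return):

```python
# Minimal sketch of LIDAR time-of-flight ranging (illustrative only; a real
# pipeline also folds in the aircraft's GPS/IMU position and scan angle to
# turn each range into georeferenced X, Y & Z coordinates).
C = 299_792_458.0  # speed of light in m/s

def pulse_range_m(round_trip_time_s: float) -> float:
    """Sensor-to-ground distance: the pulse travels out and back, so halve it."""
    return C * round_trip_time_s / 2.0

# A return arriving ~6.67 microseconds after the pulse left is ~1,000 m away.
print(round(pulse_range_m(6.67e-6)))  # 1000
```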

The archeologists in the study used LIDAR to create a detailed map of the empire’s main city at a resolution of +/- 0.25m (~10in). They mapped ~207 square kilometers (~80 square miles) at this level of detail. In 4 days of airplane LIDAR mapping, they were able to gather more information about the area than they had accumulated over 25 years of field work. Seems like digital archeology was just born.

So how much data?

I wanted to find out just how much data this was, but neither the article nor the study told me anything about the size of the LIDAR map. However, assuming this is a flat area, which it wasn’t, and assuming the +/-0.25m resolution represents a point every 625 sq cm, the area mapped above should represent a minimum of ~3.3 billion points in a LIDAR point cloud.

Another paper I found (see Evaluation of MapReduce for Gridding LIDAR Data) said that a LIDAR “grid point” (containing X, Y & Z coordinates) takes 52 bytes of data.

Given the above, I estimate the 207sqkm LIDAR grid point cloud represents a minimum of ~172GB of data. There are LIDAR compression tools available, but even at a 50% reduction, it’s still ~86GB for the same 207sqkm.
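Here’s the back-of-the-envelope arithmetic behind those numbers as a small Python snippet (the assumptions, flat terrain, one point per 0.25m cell, 52 bytes per point, 50% compression, are mine from above, not the study’s):

```python
# Back-of-the-envelope check of the estimates above. Assumptions: one grid
# point per 0.25 m x 0.25 m cell (flat terrain), 52 bytes per point (per the
# SDSC gridding paper), and a 50% compression ratio.
area_km2 = 207
cell_m = 0.25
bytes_per_point = 52

points = area_km2 * 1_000_000 / cell_m**2        # ~3.3 billion grid points
raw_gb = points * bytes_per_point / 1e9          # ~172 GB uncompressed
compressed_gb = raw_gb * 0.5                     # ~86 GB at 50% reduction

print(f"{points / 1e9:.1f}B points, {raw_gb:.0f}GB raw, {compressed_gb:.0f}GB compressed")
```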

My understanding is that the raw LIDAR data would be even bigger than this, and the study applied a number of filters to the LIDAR map data to extract different types of features, which of course would take even more space. And that’s just one ancient city complex.

With all the above, the size of the LIDAR raw data, grid point fields, and multiple filtered views is approaching significance (in storage terms). Moving and processing all this data must also be a problem. As evidence, the flights for the LIDAR runs over Angamuco, Mexico occurred in January 2011, and they were able to analyze the data sometime that summer, ~6 months later. Seems a bit long from my perspective; maybe the data processing/analysis could use some help.

Indiana Jones meets Hadoop

That was the main subject of the second paper mentioned above, done by researchers at the San Diego Supercomputer Center (SDSC). They essentially ran a benchmark comparing MapReduce/Hadoop on a relatively small cluster of 4 to 8 commodity nodes against an HPC cluster (28 Sun x4600 M2 servers, using 8-processor, quad-core nodes, with anywhere from 256GB to 512GB [on only 8 nodes] of DRAM) running a C++ implementation of the algorithm.

The results of their benchmarks were that the HPC cluster beat the Hadoop cluster only when all of the LIDAR data could fit in memory (on a DRAM-per-core basis); after that, the Hadoop cluster performed just as well in elapsed wall-clock time. Of course, from a cost perspective, the Hadoop cluster was much more economical.

The 8-node Hadoop cluster was able to “grid” a 150M-point LIDAR-derived point cloud at 0.25m resolution in just a bit over 10 minutes. Now this processing step is just one of the many steps in LIDAR data analysis, but it’s probably indicative of similar activity occurring earlier and later down the (data) line.
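For a sense of what that “gridding” step does, here is a minimal sketch in plain Python: bin each (x, y, z) return into a 0.25m cell and reduce the elevations per cell (here, a simple mean). This is not the SDSC paper’s Hadoop or C++ code, just the same map/reduce shape on a toy scale:

```python
# Toy "gridding" of a point cloud: map each (x, y, z) point to a 0.25 m cell,
# then reduce the z values per cell to a single elevation (mean here).
from collections import defaultdict
from typing import Iterable, Tuple

CELL = 0.25  # grid resolution in meters

def grid_points(points: Iterable[Tuple[float, float, float]]):
    """Map: point -> cell key; Reduce: average elevation per cell."""
    sums = defaultdict(lambda: [0.0, 0])          # cell -> [sum of z, count]
    for x, y, z in points:
        key = (int(x // CELL), int(y // CELL))    # the "map" step
        acc = sums[key]
        acc[0] += z
        acc[1] += 1
    return {key: s / n for key, (s, n) in sums.items()}  # the "reduce" step

# Toy usage: three returns, two of which land in the same 0.25 m cell.
print(grid_points([(0.10, 0.10, 101.0), (0.20, 0.05, 103.0), (0.60, 0.10, 99.0)]))
# {(0, 0): 102.0, (2, 0): 99.0}
```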

~~~~

Let’s see: at 172GB per 207sqkm, and with the earth’s surface at ~510Msqkm, a similar-resolution LIDAR grid point cloud of the entire earth’s surface would be about 0.5EB (Exabytes, 10**18 bytes). It’s just great to be in the storage business.
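Spelling out that extrapolation, with the same assumed density of 172GB of gridded points per 207sqkm:

```python
# Whole-earth extrapolation of the gridded point-cloud size, assuming the
# same density as above: 172 GB per 207 sq km of surface.
gb_per_207_km2 = 172
earth_surface_km2 = 510_000_000

total_bytes = earth_surface_km2 / 207 * gb_per_207_km2 * 1e9
print(f"{total_bytes / 1e18:.2f} EB")  # ~0.42 EB, i.e. roughly half an Exabyte
```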

 

2 thoughts on “Archeology meets Big Data”

  1. Hey, just got a pingback for this. My name is Christopher Fisher, director of the Legacies of Resilience project. The article above is slightly confusing the Angamuco research with the long-term archaeological project at Caracol, Belize, which is also using LiDAR. We acquired data for roughly 8km2 of the ancient city of Angamuco. This equates to a spreadsheet with roughly 20 million rows of X, Y, Z, and other data – not sure how many actual bytes that is, as I've never calculated it. The total flying time was ~45 minutes, which saved ~10 years or so of on-the-ground surveying on an academic schedule – pretty good deal IMO.

    We got the data after the flight in I think early March, 2011, and it took two months of sustained effort to filter the point cloud so that we could begin to field-check the results in late May, 2011. This is on top of classes, field preparations, and other academic responsibilities. At the moment Legacies is a small project with limited resources. I thought that this was quick (LOL).

    Computing power, storage, access for multiple users globally, software licenses, etc. are a major issue for us at the moment, as the data immediately outstripped our existing resources when they were delivered. And yes, we could use a full-time staff, better computing power, storage, etc.!!

    I'd be happy to respond to other questions or comments. Thanks for the post!!

    Chris

    1. Chris, thanks for the comment. The ArsTechnica article seemed to imply it was 80 sqmi of area, not 8 sqkm, and it's that larger figure I used for the storage calculations. Nonetheless, I would be curious to know all the major steps in transforming the “raw” LIDAR data into the maps in your article and what the intermediate and final data set sizes were. Given my understanding, I would estimate your 20M-row grid point cloud to be at least a GB or more. As for the time frame, flight data in early March and field-checking it in May is much faster than what I read in the original article and would be pretty good, given all the other work that you had going on at the time.

       Being both in Colorado, I would love to get up to Fort Collins sometime to see what your team is doing. Sorry about any confusion and thanks again for your clarification.

       Ray
