Archeology meets Big Data

Polynya off the Antarctic Coast by NASA Earth Observatory (cc) (From Flickr)
Polynya off the Antarctic Coast by NASA Earth Observatory (cc) (From Flickr)

Read an article yesterday about the use of LIDAR (light detection and ranging, Wikipedia) to map the residues of an pre-columbian civilization in Central America, the little know Purepecha empire, peers of the Aztecs.

The original study (seeLIDAR at Angamuco) cited in the piece above was a result of the Legacies of Resilience project sponsored by Colorado State University (CSU) and goes into some detail about the data processing and archeological use of the LIDAR maps.


LIDAR sends a laser pulse from an airplane/satellite to the ground and measures how long it takes to reflect back to the receiver. With that information and “some” data processing, these measurements can be converted to an X, Y, & Z coordinate system or detailed map of the ground.

The archeologists in the study used LIDAR to create a detailed map of the empire’s main city at a resolution of +/- 0.25m (~10in). They mapped about ~207 square kilometers (80 square miles) at this level of detail. In 4 days of airplane LIDAR mapping, they were able to gather more information about the area then they were able to accumulate over 25 years of field work. Seems like digital archeology was just born.

So how much data?

I wanted to find out just how much data this was but neither the article or the study told me anything about the size of the LIDAR map. However, assuming this is a flat area, which it wasn’t, and assuming the +/-.25m resolution represents a point every 625sqcm, then the area being mapped above should represent a minimum of ~3.3 billion points of a LIDAR point cloud.

Another paper I found (see Evaluation of MapReduce for Gridding LIDAR Data) said that a LIDAR “grid point” (containing X, Y & Z coordinates) takes 52 bytes of data.

Given the above I estimate the 207sqkm LIDAR grid point cloud represents a minimum of ~172GB of data. There are LIDAR compression tools available, but even at 50% reduction, it’s still 85GB for 210sqkm.

My understanding is that the raw LIDAR data would be even bigger than this and the study applied a number of filters against the LIDAR map data to extract different types of features which of course would take even more space. And that’s just one ancient city complex.

With all the above the size of LIDAR raw data, grid point fields, and multiple filtered views is approaching significance (in storage terms). Moving and processing all this data must also be a problem. As evidence, the flights for the LIDAR runs over Angamuco, Mexico occurred in January 2011 and they were able to analyze the data sometime that summer, ~6 months late. Seems a bit long from my perspective maybe the data processing/analysis could use some help.

Indiana Jones meets Hadoop

That was the main subject of the second paper mentioned above done by researchers at the San Diego Supercomputing Center (SDSC). They essentially did a benchmark comparing MapReduce/Hadoop running on a relatively small cluster of 4 to 8 commodity nodes against an HPC cluster (running 28-Sun x4600M2 servers, using 8 processor, quad core nodes, with anywhere from 256 GB to 512GB [only on 8 nodes] of DRAM running a C++ implementation of the algorithm.

The results of their benchmarks were that the HPC cluster beat the Hadoop cluster only when all of the LIDAR data could fit in memory (on a DRAM per core basis), after that the Hadoop cluster performed just as well in elapsed wall clock time. Of course from a cost perspective the Hadoop cluster was much more economical.

The 8-node, Hadoop cluster was able to “grid” a 150M LIDAR derived point cloud at the 0.25m resolution in just a bit over 10 minutes. Now this processing step is just one of the many steps in LIDAR data analysis but it’s probably indicative of similar activity occurring earlier and later down the (data) line.


Let’s see 172GB per 207sqkm, the earth surface is 510Msqkm, says a similar resolution LIDAR grid point cloud of the entire earth’s surface would be about 0.5EB (Exabyte, 10**18 bytes). It’s just great to be in the storage business.