I read an article the other day, NASA Turns to the Cloud for Help With Next-Generation Earth Missions, about how NASA has started to migrate all its data to the cloud and intends to store all new data there as well. The hope is that researchers will no longer need to download NASA data but can instead access it directly using cloud compute resources.
It turns out that newer earth science satellites are generating so much data that hosting it all is becoming a challenge. With the quantities being discussed, researchers who download the data to perform research in their own environments may need days just for the transfer.
Until recently, earth science data has been hosted on, and downloadable from, NASA, ESA and other space organization sites. For example, see NASA’s GHRC DAAC (Global Hydrometeorology Resource Center Distributed Active Archive Center), ESA EarthOnline, JAXA GPM website, etc. Generally one could download a time series of data from any of their prior and current earth/planetary science missions without too much trouble.
But NASA’s newest earth science satellites will be generating lots of data. For instance, the SWOT (Surface Water and Ocean Topography) mission data load will be 20TB/day and the NISAR (NASA-ISRO Synthetic Aperture Radar) mission data load will be 80TB/day. And it’s only getting worse as more missions with newer instruments come online.
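To put those daily volumes in perspective, here is a rough back-of-the-envelope sketch. The 20TB/day and 80TB/day figures come from above; the 1 Gbit/s link speed and the `transfer_days` helper are my own assumptions for illustration, and real downloads would be slower once protocol overhead and contention are counted.

```python
# Rough transfer-time arithmetic for the mission data loads quoted above.
# The link speed is an assumed example, not a measured figure.

SWOT_TB_PER_DAY = 20    # from the article
NISAR_TB_PER_DAY = 80   # from the article


def transfer_days(terabytes: float, gbit_per_s: float) -> float:
    """Days needed to move `terabytes` (decimal TB) over an ideal link."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (gbit_per_s * 1e9)
    return seconds / 86_400


daily_tb = SWOT_TB_PER_DAY + NISAR_TB_PER_DAY   # 100 TB/day combined
yearly_pb = daily_tb * 365 / 1000               # 36.5 PB/year combined

# One day of combined output over an assumed, fully used 1 Gbit/s link
# takes a bit over 9 days to download.
days_for_one_day_of_data = transfer_days(daily_tb, gbit_per_s=1.0)

print(f"{daily_tb} TB/day, {yearly_pb} PB/year")
print(f"{days_for_one_day_of_data:.1f} days to download one day of data")
```

In other words, on that assumed link a researcher could never catch up with just these two missions, which is the core of the argument for moving the compute to the data.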
NASA estimates that, over time, they will store 247PB of data in their EarthData Cloud. At the moment, they have already migrated some of their Earth Science data to AWS (us-west-2), namely all of ASF [Alaska Satellite Facility] DAAC and some of PO.DAAC [Physical Oceanography], and over time all of it will migrate there.
NASA will eat any egress charges for EOSDIS data and is also paying any and all hosting fees to store the data in AWS. It's unclear whether they are using standard S3 or S3 Intelligent-Tiering. Presumably they are using S3 replication to ensure they don’t lose DAAC data in the cloud, but I don’t see any evidence of that in the literature I’ve read. Of course, replication would double the storage costs for their 247PB of DAAC data.
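For a sense of scale, here is a minimal cost sketch. Only the 247PB figure comes from NASA; the per-GB monthly rate below is a placeholder I picked for illustration, not a quoted AWS price and certainly not whatever NASA has negotiated, and the flat-rate model ignores tiering, request and retrieval charges.

```python
# Illustrative storage-cost arithmetic. The rate is an ASSUMED placeholder;
# substitute a real price for the storage class actually in use.

ARCHIVE_PB = 247                    # NASA's estimate, from the article
ASSUMED_USD_PER_GB_MONTH = 0.021    # hypothetical flat rate, for illustration


def monthly_storage_cost_usd(petabytes: float,
                             usd_per_gb_month: float,
                             copies: int = 1) -> float:
    """Flat-rate monthly cost for `copies` full copies of the archive."""
    gigabytes = petabytes * 1_000_000   # decimal units: 1 PB = 1e6 GB
    return gigabytes * usd_per_gb_month * copies


single_copy = monthly_storage_cost_usd(ARCHIVE_PB, ASSUMED_USD_PER_GB_MONTH)
replicated = monthly_storage_cost_usd(ARCHIVE_PB, ASSUMED_USD_PER_GB_MONTH,
                                      copies=2)

print(f"single copy: ${single_copy:,.0f}/month")
print(f"replicated:  ${replicated:,.0f}/month")
```

At the assumed rate, a single copy comes to roughly $5.2M a month and a replicated copy to twice that. The absolute numbers are only as good as the placeholder rate, but the doubling from replication holds regardless.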
Access to all this data is available to anyone with an EarthData login. There you can register for a profile to access NASA earth sciences data.
NASA’s EarthData also offers a number of AWS cloud based services to help one access this data:
- EarthData search – filtered search facility to access NASA EarthData by platform (e.g. satellite), instrument (e.g. camera/visual data), organization (e.g. NASA/JPL), etc.
- EarthData Common Metadata Repository – API driven metadata repository that “catalogs all data and service metadata records for NASA’s EOSDIS (Earth Observing System Data and Information System) system”, can be accessed by anyone, and includes programmatic access to EarthData search.
- EarthData Harmony – EarthData Jupyter notebook examples and API documentation for performing research on earth science data in the EarthData cloud.
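As a small illustration of what programmatic access to the metadata repository might look like, the sketch below only builds a search URL and makes no network request. The endpoint root and the `keyword` and `page_size` parameters reflect my understanding of the public CMR search API and should be checked against its documentation; `cmr_search_url` is a hypothetical helper of my own.

```python
# Build (but do not send) a Common Metadata Repository search URL.
# Endpoint and parameter names are assumptions to verify against the
# CMR API documentation before relying on them.
from urllib.parse import urlencode

CMR_SEARCH_ROOT = "https://cmr.earthdata.nasa.gov/search"


def cmr_search_url(concept: str, **params) -> str:
    """Return a JSON search URL for a concept type such as 'collections'."""
    # Sort the parameters so the resulting URL is deterministic.
    query = urlencode(sorted(params.items()))
    return f"{CMR_SEARCH_ROOT}/{concept}.json?{query}"


url = cmr_search_url("collections", keyword="SWOT", page_size=5)
print(url)
```

The resulting URL could then be fetched with any HTTP client; downloading the underlying data files, as opposed to metadata, is where the EarthData login comes in.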
One reason to move EOSDIS DAAC data to the cloud is so researchers no longer have to download data to run their analysis. By using in-cloud EC2 compute instances, they can run their research in AWS with direct, high speed access to the EarthData.
Of course, the researcher would need to purchase their EC2 compute directly from AWS. NASA publishes a sort of AWS pricing primer for researchers who want to use AWS EC2 compute to do research directly on the data in the cloud. NASA also offers a series of tutorials on how to use the AWS cloud for doing research on NASA DAAC data.
Where to from here?
I find this all somewhat discouraging. Yes it’s the Gov’t but one needs to wonder what the overall costs of hosting NASA DAAC data on the AWS cloud will be over the long haul. Most organizations use the cloud to prototype and scale up services but once these services have stabilized, they migrate them back to onprem/CoLo infrastructure. See for example, Dropbox’s move away from the [AWS] cloud for ~600PB of data.
I get it, the public cloud allows for nearly infinite data scalability. But cloud storage is not cheap, especially when you are talking about 100s of PBs. And in today’s world, with a whole bunch of open source solutions for object storage and services, one can recreate almost any cloud service in one's own data center, at a much lower price.
Sure, it will still take IT infrastructure and personnel to put it all together. But NASA doesn’t seem to be lacking in infrastructure or IT personnel. Even if one is enamored with AWS services and software infrastructure, one can always run AWS Outposts in one's own data centers. And DAAC services seem to be pretty stable over time. Yes, new satellites will generate more data, but the data load is understood and very predictable. So one should be able to anticipate all this and have infrastructure in place to deal with it.
Yes, having the ability to run analysis in the cloud directly on data that also sits in the cloud is useful, especially not having to download TBs of data. But these compute costs can also be significant and they are borne by the researcher, not NASA.
Another gripe is why use AWS alone? The other cloud providers all have similar object storage and compute capabilities. It seems wiser to me to set up the EarthData service such that different DAACs reside in different clouds. This would be more complex and harder to administer and use, but I believe in the long run it would lead to better, more effective services at a more reasonable price.
Going to the cloud doesn’t have to be a one way endeavor. After using the cloud for a while, NASA should have a better idea of the costs of doing so and at that time understand better what it can and cannot afford to do on its own.
It will be interesting to see what ESA, JAXA, CERN and other big science organizations do, as they are all in the same bind: their data seems to be growing without bound.