Data Science storage with NetApp’s Python Toolkit

I’ve got a book someplace (yet to be read completely) with the title Data science with Python. At a recent Storage Field Day 21 last month, NetApp was there discussing a number of their product offerings one of which was their Python SDK to manage NetApp storage for data scientists and AI researchers (see videos of their sessions here).

I’m not a data science expert but a Python SDK for storage management just makes so much sense to me I just had to take a look. Their GitHub repo is available online and they call it the NetApp Data Science Toolkit.

But first please take our new poll:

The challenge for data science and AI researchers is that it’s all about the data. How do you find the data, gain access to it, clean it, and process it quickly so you can do it all over again. Having some sort of Python SDK that allows you to do some rudimentary storage volume configuration, access, snapshotting etc. can make these sorts of pipelines be self-serviced rather than going back and forth with operations to get volumes configured, mounted, and services established.

NetApp Data Science Toolkit

The NetApp Data Science Toolkit can be PIP installed into anything with Python 3.5 or later and can be invoked via a command line or as a library of Python functions that can be invoked. The command line utility and the Python calls appear to be functionally equivalent.

pip3 install netapp-ontap pandas tabulate requests boto3

The Toolkit must be configured for your environment and NetApp storage but once that’s done your ready to rock and roll.

MLOps pipeline from Google

The command line is invoked with

./ntap_dsutil.py

following that command are subcommands and parameters specifying what ONTAP operation you want to perform and how it is to be done. Python function calls seem to follow the same parameterization as the CLI.

The CLI and Python function calls can run on MacOS or any Linux distribution. There’s a paper that discusses how to use the SDK to accelerate AI pipelines as well as another ReadMe that describes it’s use in Kubernetes with NetApp’s Trident CSI plugin.

The functionality supports NetApp AFF, FAS, Cloud Volumes and Select that are running ONTAP 9.7 or later. For a current list of ONTAP functions available, check out the toolkit. But for a overview these ONTAP functions were available.

  • For Volume Management – cloning, creating, listing all, deleting or mounting a volume,
  • For Snapshot Management – creating, deleting, listing and restoring snapshots (of volumes)
  • For Data Fabric Management – listing all cloud sync relationships, triggering a cloud sync operation, multi-thread pulling a bucket down from S3 storage (into a NetApp volume directory), pulling a single object down from S3 into a file, pushing the contents of a directory to bucket on S3 and pushing a file into an object on S3.
  • For Advanced Data Fabric Management – listing all SnapMirror relationships and triggering a sync operation for an existing SnapMirror relationship.

This is a pretty comprehensive list of NetApp ONTAP storage functionality. Having all this under control of Python and CLI for data scientist or AI researcher seems pretty impressive.

Of course not every option for all those functions are supported but it’s just a start (V1.1 of the toolkit). I’m sure there’s more to come, especially if customers demand it.

However, it would be nice to have an ONTAP simulator available with the toolkit that could be used to test out your Python code and CLI commands before using real NetApp storage. This would be very useful for those of us lacking our own test ONTAP storage, just hanging around on prem or in the cloud.

As Python becomes the language of choice for AI and now data science, it seems only natural that storage and data protection companies would start releasing Python SDKs/APIs for their product functionality. That way AI and data science researchers could embed any storage functionality they needed directly into their Python code or Jupyter Notebook application.

Having a Python SDK for NetApp ONTAP storage, means using data storage for your MLops or data science pipelines is that much easier.

Great move by NetApp. Ok where’s the rest of the industry?

Picture credit(s):

Data in motion #DellTechWorld

I (virtually) attended DTW this week and Michael Dell and others in their keynote segments mentioned that the new world involves both data at rest and data in motion. I was curious as to about this new concept of data in motion, so I spent some time looking into it.

AWS Lambda server less processing service and Apache Kafka probably best represents this idea of data in motion. Dell Boomi, IBM MQ, Google cloud Pub/Sub, etc. also provide similar services to Kafka.

With AWS Lambda, clients deposit data in object buckets and AWS automatically invokes some program, container, service, etc. to process that data and then the service goes away until the next data is deposited. Kafka is AWS Lambda on steroids.

Kafka is a completely open source (GitHub) system that’s run using a cluster of servers and provides a “message processing” system. A minimum Kafka cluster is 3 servers (containers, VMs or bare metal).

How does Kafka work

In Apache Kafka, you have producers, server/brokers and consumers. With Kafka, data comes in as events, with a key, values (essentially a bit stream, could be anything) and time-stamps which are created by producer clients and are automatically stored by Kafka servers or brokers and appended to topics (a sort of folder) in an ordered sequence. Topic events are then processed by consumer clients.

Topics are partitioned (sharded) using keys, and can be optionally replicated across a defined number of Kafka brokers within a cluster. Kafka clusters can span data centers , regions, clouds etc. Replication is done for fault tolerance. Topic partitioning provides scale out, distributed performance for Kafka.

Events can be simple messages for real time analysis or larger files for offline analysis. But they are all essentially produced, stored and consumed in an ordered, log like fashion.

Topic partitions can be multi-producer and multi-consumer. That is there can be 0, 1 or many producers of events in a topic (partition) and topic partitions can have 0, 1 or more consumers.

In Kafka, events are saved for a specified period and are not automatically jetisoned/deleted. As such, events can be read multiple times by consumers.

Kafka can also offer a guarantee that events are only processed once. Kafka can also guarantee that consumers of topic partitioned events always read events in arrival order.

Consumers register to see events they are interested in. As mentioned earlier, there can be multiple consumers of the same events. Consumers can take the form of micro services/containers, programs, systems, etc. When an event is stored, consumer clients registered for that event, get notified to process the event.

In Kafka producers and consumers are fully decoupled. They have no need to know about one another and indeed, can exist in different servers, clusters, data centers, etc. Event producers don’t wait on consumers. Event consumers are notified when an event is available and can do whatever processing is required for that event.

Kafka APIs

Kafka has APIs for:

  • Admin services API to provide monitoring and management of the Kafka cluster and services
  • Producers API to publish and create events
  • Consumers API to subscribe to read events
  • Kafka Streams API to supply higher level stream processing for events, such as micro services, with stateless processing, stateful processing, and within stateful processing, providing events within a (time) window. Events can be processed from one or more topics and used to transform (process) these into other events to be written to one or more topics. It supports per event processing with (2) millisecond latency for highly tuned systems. Streams use a Java API that can be deployed in containers, VMs, bare metal, in the cloud etc. Stream processing is not performed on the Kafka cluster but must be performed elsewhere. Kafka streams can be used to create advanced and complex data pipelines.
  • Kafka Connect API to supply the connections needed to get events from other outside, perhaps more traditional applications, environments, services into Kafka topics for processing and vice versa, output topic events to more traditional services. Connect services are available for many different applications, databases, systems, etc.. Connect can be used, for example, to provide a connection between an relational database and topics as well as connect topics to relational databases. You don’t code in Connect but rather provide declarative statements that define what data goes where.

Kafka is used in very many organizations (NY Times, LinkedIn, LINE, etc) to provide an almost, enterprise wide, all encompassing, processing bus where data comes in, is partitioned out to topics and then processed in real time or not. Kafka can be addictive. You start with a relatively small application and find uses for it throughout the company. Pretty soon, you are running your whole organization through Kafka.

Data in motion

So that’s an example of data in motion. Another way to think about Kafka and its data in motion is it’s represents the final step in the evolution of batch processing from mainframes of last century.

Batch processing of old, took a bunch of transactions, batched them together, and processed them one by one until the batch was done. With Kafka and similar systems, you essentially have a batch of one transaction and they provide all the framework and facilities needed to create, store and consume that single transaction (batch).

But in addition to this simplistic one transaction in, one process and one output. Kafka and other systems, provide a more general purpose system, with multiple transaction types (events) being created by multiple producers and being consumed by a multitude of processes, that can each produce one or more outputs which could be other events to start the process over again. This create event, process, create event, process, could go on ad infinitum.

And that’s what a data pipeline looks like. Event data comes in, it’s processed (filtered, aggregated, merged, etc.) and generates a different event which causes more processing, which creates other events, which causes other processing…..

And that’s data in motion.

Photo Credit(s):

All graphics and photos are from Apache Kafka website

Where should IoT data be processed – part 2

I wrote a post a while back on Where IOT data should be processed – part 1. We will get back to that post in a moment, but recently I read an article (How big data forced the hunt for ET intelligence to evolve) that mentioned after 20 years, they were shutting down SETI@home.

SETI@home was a crowdsourced computational network that took snippets of radio spectrum, sent them to 1000s of home computers to be analyzed during idle computer time, once processed the analysis was sent back to SETI@home. It was one of the first to use a crowdsourced approach to perform data processing. The data was collected at a radio telescope, sent to SETI@home and distributed from there.

6 Factors for IOT data processing

In my post I talked about 6 factors that should help determine where data is processed. Those 6 factors included

  • Data size which is a measure of the amount (GB, TB or PBs) of data that is being generated at an IOT node
  • Data pipe availability, which is all about the networking bandwidth that’s available at the IOT node. If we are talking some sort of low-bandwidth networking access then it probably makes sense to process the data more locally and send only results of processing up the stack.
  • Processing criticality which indicates how important is the processing of the data. If the processing could save a life then maybe it should be done as close as possible to where the data is generated. If the data processing is less critical it could perhaps be done at other nodes in an IOT network
  • Processing time and infrastructure cost which is all about what sort of computational resources are required to perform the processing and how much would it cost. If processing of the data is to undergo multiple passes or requires multi-core CPUs or GPUs, moving data off the IoT node and onto a more comprehensive server to process it, could make sense.
  • Compliance, governance and archive requirements, which discussed the potential need for all data to be available for regulatory audits and as such may need to be available at a central location anyway so why not perform processing there.
  • Data information funnel, which talked about the fact that an IoT network should be configured in layers and that each layer in the stack should probably be responsible for some portion of the data processing needed by the overall system, if nothing more than compressing the information before it is sent elsewhere.

Now that I review the list, the last, Data information funnel, factor really should be a function of the other factors rather than a separate factor.

In that blog post I promised to follow it up with some examples of the logic applied to real world problems. SETI is the first one I’ve seen in the literature

SETI’s IoT processing problem

Closeup front view of one antenna of the Allan Telescope Array, a radio telescope for combined radio astronomy and SETI (Search for Extraterrestrial Intelligence) research being built by the University of California at Berkeley, outside San Francisco. The first phase, consisting of 42 6 meter dish antennas like the one shown here, was completed in 2007. Eventually it will have 350 antennas. This type of antenna is called an offset Gregorian design. The incoming radio waves are reflected by the large parabolic dish onto a secondary concave parabolic reflector in front of the dish, and then into a feed horn. A metal shroud can be seen along the bottom of the secondary reflector which shields the antenna from ground noise. It covers the frequency range from 0.5 to 11.2 GHz.

The SETI researchers found that “The telescopes are now capable of producing so much data that it’s not possible to get that volume of data out to volunteers,” And “The discovery space is in these massive, massive data streams. And it’s just not efficient to distribute many terabits per second out to volunteers all over the world. It’s more efficient for that data processing to happen at the actual observatory.”

So they moved the data processing for the SETI IoT network from being distributed out to home computers throughout the world to being done at the (telescope) source where the data was originally generated.

This decision seems to rely on a couple of the factors above. Namely the pipe availability and data size factors. They had to move processing because no pipes existed to send Tb of data to 1000s of home computers. And finally, the processing time and infrastructure cost has come down so much, that it was just easier to do the processing onsite.

It doesn’t seem like processing criticality or compliance-governance-archive had any bearing on the decision.

So there’s the first example that seems to fit well into our data processing framework.

~~~~

We ought to be able to come up with a formula that uses all these factors and comes up to with a yes or no as to whether to process the data on the node or not.

Photo Credit(s)

ZooArchNet.org, a new collaboration for zoological-archeological data

Read an article the other day about a new collaboration data platform, the ZooArchNet, for archeological and zoological data ( data about animals and the history of humankind).

The collaboration was started at the Florida Museum at the University of Florida. They intend to construct a database that would allow researchers to track the history of animals and how humans have interacted with them over time.

The problem is there’s a lot of historical animal specimen information available in various locations/sites around the world and similarly, there’s a lot of data about the history of humanity, but there’s little that cross links the two. And by missing those cross links, researchers aren’t seeing the big picture, that humankind and animal-kind have co-existed since the dawn of time and have impacted each other throughout their history.

However, if there was a site where one could trace the history of animal and human life, across time, in a region, one could develop a better understanding of how they interact and impact one another.

Humankind interacting with animalkind

In the article, they discuss a number of examples where animals have been impacted by humankind over time. For example, originally the Mexican Turkey was domesticated for its feathers during Mayan, Aztec and other civilizations of Central America,. but over time it became a prized for use as food. While this was occurring, its range expanded considerably throughout North (and South) America.

It’s the understanding of habitat range over time and how humankind helped or hindered this range that’s best served by linking zoological and archeological data sets that exist in research libraries throughout the world.

How it works

One problem in cross linking such data is that it often exists in different formats and uses different metadata to describe it.

A key, early decision was to use a standard metadata format ,the Darwin Core (DwC) an outgrowth of the Dublin Core which is more focused on zoological data.

With this in place, the next problem was to translate specimen metadata into the DwC and extract the actual data (or URI’s) that described the specimen for harvesting. Once all that was accomplished they could migrate the specimen data or archeological data and host it/cross link it in their ZooArchNet database.

For example, the researchers at Florida Museum used the Open Context database to provide archeological informationand the Global Biodiversity Information Facility (GBIF) to supply biological diversity information and together the two were linked and cross indexed in the ZooArchNet database.

Once the data was available and located in Google Cloud storage, researchers could use Google BigQuery data analytics as well as other apps like (Google) indexers to create more data rich views and searchable indices for their ZooArchNet and VertNet web portals.

ZooArchNet is just starting. Most of the information currently available is about the few examples chosen to demonstrate the technology. As with anything like this, there’s a certain amount of crowd sourcing needed to make it worthwhile. It’s popularity will be a prime determinant on its usefulness over time. But anything that helps the world understand the true history of humanity’s impact on this life of this planet is worthwhile.

~~~~

Comments?

Photo Credit(s): “turkey bird” by watts photos1 is licensed under CC BY 2.0 

Workflow from ZooArchNet: Connecting zooarchaeological specimens to the biodiversity and archaeology data networks article

Darwin Core overview from Darwin Core: An Evolving Community-Developed Biodiversity Data Standard article

Earth globe within a locked cage

Blockchains at IBM

img_6985-2I attended IBM Edge 2016 (videos available here, login required) this past week and there was a lot of talk about their new blockchain service available on z Systems (LinuxONE).

IBM’s blockchain software/service  is based on the open source, Open Ledger, HyperLedger project.

Blockchains explained

1003163361_ba156d12f7We have discussed blockchain before (see my post on BlockStack). Blockchains can be used to implement an immutable ledger useful for smart contracts, electronic asset tracking, secured financial transactions, etc.

BlockStack was being used to implement Private Key Infrastructure and to implement a worldwide, distributed file system.

IBM’s Blockchain-as-a-service offering has a plugin based consensus that can use super majority rules (2/3+1 of members of a blockchain must agree to ledger contents) or can use consensus based on parties to a transaction (e.g. supplier and user of a component).

BitCoin (an early form of blockchain) consensus used data miners (performing hard cryptographic calculations) to determine the shared state of a ledger.

There can be any number of blockchains in existence at any one time. Microsoft Azure also offers Blockchain as a service.

The potential for blockchains are enormous and very disruptive to middlemen everywhere. Anywhere ledgers are used to keep track of assets, information, money, etc, that undergo transformations, transitions or transactions as they are further refined, produced and change hands, can be easily tracked in blockchains.  The only question is can these assets, information, currency, etc. be digitally fingerprinted and can that fingerprint be read/verified. If such is the case, then blockchains can be used to track them.

New uses for Blockchain

img_6995IBM showed a demo of their new supply chain management service based on z Systems blockchain in action.  IBM component suppliers record when they shipped component(s), shippers would record when they received the component(s), port authorities would record when components arrived at port, shippers would record when parts cleared customs and when they arrived at IBM facilities. Not sure if each of these transitions were recorded, but there were a number of records for each component shipment from supplier to IBM warehouse. This service is live and being used by IBM and its component suppliers right now.

Leanne Kemp, CEO Everledger, presented another example at IBM Edge (presumably built on z Systems Hyperledger service) used to track diamonds from mining, to cutter, to polishing, to wholesaler, to retailer, to purchaser, and beyond. Apparently the diamonds have a digital bar code/fingerprint/signature that’s imprinted microscopically on the diamond during processing and can be used to track diamonds throughout processing chain, all the way to end-user. This diamond blockchain is used for fraud detection, verification of ownership and digitally certify that the diamond was produced in accordance of the Kimberley Process.

Everledger can also be used to track any other asset that can be digitally fingerprinted as they flow from creation, to factory, to wholesaler, to retailer, to customer and after purchase.

Why z System blockchains

What makes z Systems a great way to implement blockchains is its securely, isolated partitioning and advanced cryptographic capabilities such as z System functionality accelerated hashing, signing & securing and hardware based encryption to speed up blockchain processing.  z Systems also has FIPS-140 level 4 certification which can provide the highest security possible for blockchain and other security based operations.

From IBM’s perspective blockchains speak to the advantages of the mainframe environments. Blockchains are compute intensive, they require sophisticated cryptographic services and represent formal systems of record, all traditional strengths of z Systems.

Aside from the service offering, IBM has made numerous contributions to the Hyperledger project. I assume one could just download the z Systems code and run it on any LinuxONE processing environment you want. Also, since Hyperledger is Linux based, it could just as easily run in any OpenPower server running an appropriate version of Linux.

Blockchains will be used to maintain the system of record of the future just like mainframes maintained the systems of record of today and the past.

Comments?

 

Big open data leads to citizen science

Read an article the other day in ScienceLine about the Astronomical Data Explosion.  It appears that as international observatories start to open up their archives and their astronomical data to anyone and anybody, people are starting to do useful science with it.

Hunting for planets

The story talked about a pair of amateur astronomers who were looking through Kepler telescope data which had recently been put online (see PlanetHunters.org) to find anomalies that signal the possibility of a planet.  They saw a diming of a particular star’s brightness and then saw it again 132 days later. At that point they brought it to the attention of real scientists who later discovered that what they found was a 4 star solar system which they labeled Tatooine.

It seems with all the latest astronomical observations coming in from Kepler, the Sloan Digital Sky Survey and Hubble observatories are generating a deluge of data. And although all this data is being subjected to intense scrutiny by professional astronomers, they can’t do everything they want to do with it.

Consequently, in astronomy today we now have come to a new world of abundant data but not enough resources to do all the science that can be done.  This is where the citizen or amateur scientist enters the picture. Using standard web accessible tools they are able to subject the data to many more eyes each looking for whatever interest spurs them on and as such, can often contribute real science from their efforts.

Citizen science platforms

It turns out PlanetHunters.org is one of a number of similar websites put up by Zooniverse to support citizen science in astronomy, biology, nature, climate and humanities. Their latest project is to classify animal found in snapshots taken on the Serengheti (see SnapshotSerengeti.org).

Of course crowdsourced scientific activity like this has been going on for a long time now with Boinc projects like SETI@Home screen savers that sifted through radio signals searching for extra-terestial signals. But that made use of the extra desktop compute cycles people were waisting with screen savers.

 

In contrast, Zooniverse started with the GalaxyZoo project (original retired site here). They put Hubble telescope images online and asked for amateur astronomers to classify the type of galaxies found in the images.

GalaxyZoo had modest aspirations at first but when they put the Hubble images online their servers were overwhelmed with the response and had to be beefed up considerably to deal with the traffic.  Overtime, they were able to get literally millions of galaxy classifications. Now they want more, and the recent incarnation of GalaxyZoo has put the brightest 250K galaxies online and they are asking for even finer, more detailed classifications of them.

Today’s Zooniverse projects are taking advantage of recent large and expanding data repositories plus newer data visualization tools to help employ human analysis to their data.  Automated tools are not yet sophisticated enough to classify images as well as a human can.

One criteria for Zooniverse projects is to have a massive amount of data which needs to be classified.  In this way, science is once again returning to it’s amateur roots but this time guided by professionals.  Together we can do more than what either could do apart.

~~~~

I suppose it was only a matter of time before science got inundated with more data than they could process effectively.  Having the ability to put all this data online, parcel it out to concerned citizens and ask them to help understand/classify it has brought a new dawn to citizen science.

Comments?

Photo credits:
Twin Suns on Mos Espa by Stéfan
BONIC running SETI@Home by Keng Susumpow
Galaxy Group Stephan’s Quintet by HubbleColor {Zolt}

Data hypervisor

(c) 2012 Silverton Consulting, Inc. All rights reserved

With all this talk of software defined networking and server virtualization where does storage virtualization stand.  I blogged about some problems with storage virtualization a week or so ago in my post on Storage Utilization is broke and this post takes it to the next level.  Also I was at a financial analyst conference this week in Vail where I heard Mark Lewis of Tekrocket but formerly of EMC discuss the need for a data hypervisor to provide software defined storage.

I now believe what we really need for true storage virtualization is a renewed focus on data hypervisor functionality.  The data hypervisor would need both a control plane and a data plane in order to function properly.   Ideally the control plane would set up the interface and routing for the data plane hardware and the server and/or backend storage would be none the wiser.

DMs everywhere

I envision a scenario where a customer’s application data is packaged with a data hypervisor which runs on a commodity data switch hardware with data plane and control plane software running on it.  Sort of creating (virtual) data machines or DMs.

All enterprise and nowadays most midrange storage provide most of the functionality of a storage control plane such as defining units of storage, setting up physical to logical storage mapping, incorporating monitoring, and management of the physical storage layer, etc.  So control planes are pervasive in today’s storage but proprietary.

In addition most storage systems have data plane functionality which operates to connect a host IO request to the actual data which resides in backend storage or internal cache.  But again although data planes are everywhere in storage today they are all proprietary to a specific vendor’s storage system.

Data switch needed

But in order to utilize a data hypervisor and create a more general purpose control plane layer, we need a more generic data plane layer that operates on commodity hardware. This is different from today’s SAN storage switches or DCB switches but similar in a some ways.

The functions of the data switch/data plane layer would be to take routing instructions from the control plane layer and direct the server IO request to the proper storage unit using the data plane layer.  Somewhere in this world view, probably at the data plane level it would introduce data protection services like RAID or other erasure coding schemes, point in time copy/clone services and replication services and other advanced storage features needed by enterprise storage today.

Also it would need to provide some automated storage movement across and within tiers of physical storage and it would connect server storage interfaces at the front end to storage interfaces at the backend.  Not unlike SAN or DCB switches but with much more advanced functionality.

Ideally data switch storage interfaces could attach to dedicated JBOD, Flash arrays as well as systems using DAS  storage.  In addition, it would be nice if the data switch could talk to real storage arrays on SAN, IP/SANs or NFS&CIFS/SMB storage systems.

The other thing one would like out of a data switch is support for a universal translator that would map one protocol to another, such as iSCSI to SAS, NFS to FC, or FC to NFS and any other combination, depending on the needs of the server and the storage in the configuration.

Now if the data switch were built on top of commodity x86 hardware and software with the data switch as just a specialized application that would create the underpinnings for a true data hypervisor with a control and data plane that could be independent and use anybody’s storage.

Data hypervisor

Assuming all this were available then we would have true storage virtualization.  With these capabilities, storage could be repurposed on the fly, added to, subtracted from, and in general be a fungible commodity not unlike server processing MIPs under VMware or Hyper-V.

Application data would then needed to be packaged into a data machine which would offer all the host services required to support host data access.  The data hypervisor would handle the linkages required to interface with the control and data layers.

Applications could be configured to utilize available storage at ease and storage could grow,  shrink or move to accommodate the required workload just as easily as VMs can be deployed today.

How we get there

Aside from the VMware, Citrix, Microsoft thrusts towards virtual storage there are plenty of storage virtualization solutions that can control most backend enterprise SAN storage. However, the problem with these solutions is that in general the execute only on a specific vendors hardware and don’t necessarily talk to DAS or JBOD storage.

In addition, not all of the current generation storage virtualization solutions are unified. That is most of these today only talk FC, FCoE or iSCSI and don’t support NFS or CIFS/SMB.

These don’t appear to be insurmountable obstacles and with proper allocation of R&D funding, could all be solved.

However the more problematic is that none of these solutions operate on commodity hardware or commodity software.

The hardware is probably the easiest to deal with. Today many enterprise storage systems are built ontop of x86 processor storage controllers. Albeit sometimes they incorporate specialized packaging for redundancy and high availability.

The harder problem may be commodity software. Although the genesis for a few storage virtualization systems might come from BSD or other “commodity” software operating systems. They have been modified over the years to no longer represent anything that can run on standard off the shelf operating systems.

Then again some storage virtualization systems started out with special home grown hardware and software. As such, converting these over to something more commodity oriented would be a major transition.

But the challenge is how to get there from here and would anyone want to take this on.  The other problem is that the value add that storage vendors supply currently would be somewhat eroded.  Not unlike what happened to proprietary Unix systems with the advent of VMware.

But this will not take place overnight and the company that takes this on and makes a go at it can have a significant software monopoly that would be hard to crack.

Perhaps it will take a startup to do this but I believe the main enterprise storage vendors are best positioned to take this on.

Comments?

A “few exabytes-a-day” from SKA

A number of radio telescopes, positioned close together pointed at a cloudy sky
VLA by C. G. P. Grey (cc) (from Flickr)

ArsTechnica reported today on the proposed Square Kilometer Array (SKA) radio telescope and it’s data requirements. IBM is in collaboration with the Netherlands Institute for Radio Astronomy (ASTRON) to help develop the SKA called the DOME project.

When completed in ~2024, the SKA will generate over an exabyte a day (10**18) of raw data.  I reported in a previous post how the world was generating an exabyte-a-day, but that was way back in 2009.

What is the SKA?

The new SKA telescope will be a configuration of “millions of radio telescopes” which when combined together will create a telescope with an aperture of one square kilometer, which is no small feet.  They hope that the telescope will be able to shed some light on galaxy evolution, cosmology and dark energy.  But it will go beyond that to investigating “strong-field tests of gravity“, “origins and evolution of cosmic magnetism” and search for life on other planets.

But the interesting part from a storage perspective is that the SKA will be generating a “few exabytes a day” of radio telescopic data for every full day of operation.   Apparently the new radio telescopes will make use of a new, more sensitive detector able to generate data of up to 10GB/second.

How much data, really?

The team projects final storage needs at between 300 to 1500 PB per year. This compares to the LHC at CERN which consumes ~15PB of storage per year.

It would seem that the immediate data download would be the few exabytes and then it would be post- or inline-processed into something more mangeable and store-able.  Unless they have some hellaciously fast processing, I am hard pressed to believe this could all happen inline.  But then they would need at least another “few exabytes” of storage to buffer the data feed before processing.

I guess that’s why it’s still a research project.  Presumably, this also says that the telescope won’t be in full operation every day of the year, at least at first.

The IBM-ASTRON DOME collaboration project

The joint research project was named for the structure that covers a major telescope and for a famous Swiss mountain.  Focus areas for the IBM-ASTRON DOME project include:

  • Advanced high performance computing utilizing 3D chip stacks for better energy efficiency
  • Optical interconnects with nanophotonics for high-speed data transfer
  • Storage for both high access performance access and for dense/energy efficient data storage.

In this last focus area, IBM is considering the use of phase change memories (PCM) for high access performance and new generation tape for dense/efficient storage.  We have discussed PCM before in a previous post as an alternative to NAND based storage today (see Graphene Flash Memory).  But IBM has also been investigating MRAM based race track memory as a potential future storage technology.  I would guess the advantage of PCM over MRAM might be access speed.

As for tape, IBM has already demonstrated in their labs technologies for a 35TB tape. However storing 1500 PB would take over 40K tapes per year so they may need another even higher capacities to support SKA tape data needs.

Of course new optical interconnects will be needed to move this much data around from telescope to data center and beyond.  It’s likely that the nanophotonics will play some part as an all optical network for transceivers, amplifiers, and other networking switching gear.

The 3D chip stacks have the advantage of decreasing chip IO and more dense packing of components will make efficient use of board space.  But how these help with energy efficiency is another question.  The team projects very high energy and cooling requirements for their exascale high performance computing complex.

If this is anything like CERN, datasets gathered onsite are initially processed then replicated for finer processing elsewhere (see 15PB a year created by CERN post.  But moving PBs around like SKA will require is way beyond today’s Internet infrastructure.

~~~~

Big science like this gives a whole new meaning to BIGData. Glad I am in the storage business.  Now just what exactly is nanophotonics, mems based phote-electronics?