Data Science storage with NetApp’s Python Toolkit

I’ve got a book someplace (yet to be read completely) with the title Data science with Python. At Storage Field Day 21 last month, NetApp discussed a number of their product offerings, one of which was their Python SDK for managing NetApp storage, aimed at data scientists and AI researchers (see videos of their sessions here).

I’m not a data science expert, but a Python SDK for storage management makes so much sense to me that I had to take a look. Their GitHub repo is available online, and they call it the NetApp Data Science Toolkit.

The challenge for data science and AI researchers is that it’s all about the data. How do you find the data, gain access to it, clean it, and process it quickly so you can do it all over again? A Python SDK that handles rudimentary storage volume configuration, access, snapshotting, etc. can make these sorts of pipelines self-service, rather than going back and forth with operations to get volumes configured, mounted, and services established.

NetApp Data Science Toolkit

The NetApp Data Science Toolkit can be pip installed into any environment with Python 3.5 or later and can be invoked either from the command line or as a library of Python functions. The command line utility and the Python calls appear to be functionally equivalent.

pip3 install netapp-ontap pandas tabulate requests boto3

The Toolkit must be configured for your environment and NetApp storage, but once that’s done you’re ready to rock and roll.

MLOps pipeline from Google

The command line is invoked with

./ntap_dsutil.py

Following that command are subcommands and parameters specifying which ONTAP operation you want to perform and how it is to be done. Python function calls seem to follow the same parameterization as the CLI.
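As a rough illustration of that "script, then subcommands, then parameters" shape, here is a minimal sketch of a Python helper that assembles such a command line. The subcommand words and parameter names used in the example call are hypothetical placeholders, not confirmed toolkit syntax, so check the toolkit's own documentation for the real ones. The helper defaults to a dry run that only prints the command.

```python
import shlex
import subprocess

TOOLKIT_CLI = "./ntap_dsutil.py"  # script name as given in the post


def build_command(action, target, **options):
    """Assemble an argv list: script, subcommand words, then --key=value parameters."""
    argv = [TOOLKIT_CLI, action, target]
    for key, value in options.items():
        flag = "--" + key.replace("_", "-")
        argv.append(f"{flag}={value}")
    return argv


def run_command(argv, dry_run=True):
    """Print the command in dry-run mode; otherwise execute it and return its stdout."""
    if dry_run:
        print(shlex.join(argv))
        return None
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return result.stdout


# Hypothetical subcommand and parameter names, for illustration only.
example = build_command("create", "volume", name="project1", size="10GB")
run_command(example)
```

Keeping command construction separate from execution means a pipeline can log or review exactly what it is about to ask the storage to do before anything runs.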

The CLI and Python function calls can run on macOS or any Linux distribution. There’s a paper that discusses how to use the SDK to accelerate AI pipelines, as well as another README that describes its use in Kubernetes with NetApp’s Trident CSI plugin.

The functionality supports NetApp AFF, FAS, Cloud Volumes and Select systems that are running ONTAP 9.7 or later. For a current list of the ONTAP functions available, check out the toolkit. But as an overview, these ONTAP functions were available:

  • For Volume Management – cloning, creating, listing all, deleting or mounting a volume.
  • For Snapshot Management – creating, deleting, listing and restoring snapshots (of volumes).
  • For Data Fabric Management – listing all cloud sync relationships, triggering a cloud sync operation, multi-thread pulling a bucket down from S3 storage (into a NetApp volume directory), pulling a single object down from S3 into a file, pushing the contents of a directory to a bucket on S3 and pushing a file into an object on S3.
  • For Advanced Data Fabric Management – listing all SnapMirror relationships and triggering a sync operation for an existing SnapMirror relationship.
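To give a feel for what "multi-thread pulling a bucket" involves, here is a minimal sketch of the general technique using Python's standard thread pool. It is not the toolkit's implementation. The `pull_objects` helper and the `fetch` callable are hypothetical names, and a plain dictionary stands in for the bucket so the sketch runs without cloud credentials. In practice `fetch` would wrap an S3 client call.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import tempfile


def pull_objects(keys, fetch, dest_dir, max_workers=8):
    """Download every key concurrently; fetch(key) must return the object's bytes."""
    dest = Path(dest_dir)

    def download(key):
        target = dest / key
        # Object keys may contain "/" so create the matching subdirectories.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(fetch(key))
        return target

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of keys in its results.
        return list(pool.map(download, keys))


# Stand-in for a bucket (hypothetical contents, for illustration only).
fake_bucket = {"train/a.csv": b"1,2\n", "train/b.csv": b"3,4\n", "README": b"demo\n"}

with tempfile.TemporaryDirectory() as tmp:
    paths = pull_objects(sorted(fake_bucket), fake_bucket.__getitem__, tmp)
    pulled = {p.relative_to(tmp).as_posix(): p.read_bytes() for p in paths}
```

Threads suit this job because each download spends most of its time waiting on the network, so many can be in flight at once.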

This is a pretty comprehensive list of NetApp ONTAP storage functionality. Having all this under the control of Python and a CLI for a data scientist or AI researcher seems pretty impressive.

Of course, not every option for all those functions is supported, but it’s just a start (V1.1 of the toolkit). I’m sure there’s more to come, especially if customers demand it.

However, it would be nice to have an ONTAP simulator available with the toolkit that could be used to test out your Python code and CLI commands before using real NetApp storage. This would be very useful for those of us lacking our own test ONTAP storage, just hanging around on prem or in the cloud.
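Short of a real simulator, one stopgap is to test pipeline logic against a tiny in-memory stand-in. The sketch below is exactly that and nothing more: `FakeStorage` and its method names are hypothetical, chosen to mirror the volume and snapshot operations listed above, and it models none of ONTAP's actual behavior.

```python
from dataclasses import dataclass, field


@dataclass
class FakeVolume:
    name: str
    files: dict = field(default_factory=dict)      # path -> contents
    snapshots: dict = field(default_factory=dict)  # snapshot name -> copy of files


class FakeStorage:
    """In-memory stand-in for volume and snapshot operations (not an ONTAP simulator)."""

    def __init__(self):
        self.volumes = {}

    def create_volume(self, name):
        if name in self.volumes:
            raise ValueError(f"volume {name!r} already exists")
        self.volumes[name] = FakeVolume(name)
        return self.volumes[name]

    def create_snapshot(self, volume, snapshot):
        vol = self.volumes[volume]
        vol.snapshots[snapshot] = dict(vol.files)

    def restore_snapshot(self, volume, snapshot):
        vol = self.volumes[volume]
        vol.files = dict(vol.snapshots[snapshot])

    def clone_volume(self, source, new_name):
        clone = self.create_volume(new_name)
        clone.files = dict(self.volumes[source].files)
        return clone

    def delete_volume(self, name):
        del self.volumes[name]

    def list_volumes(self):
        return sorted(self.volumes)


# Walk through a typical experiment: snapshot, change the data, clone, roll back.
storage = FakeStorage()
storage.create_volume("training_data")
storage.volumes["training_data"].files["labels.csv"] = "v1"
storage.create_snapshot("training_data", "before_cleaning")
storage.volumes["training_data"].files["labels.csv"] = "v2"
storage.clone_volume("training_data", "experiment_1")
storage.restore_snapshot("training_data", "before_cleaning")
```

A fake like this only checks that your code calls operations in a sensible order. It says nothing about performance, permissions or how a real array responds to errors.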

As Python becomes the language of choice for AI and now data science, it seems only natural that storage and data protection companies would start releasing Python SDKs/APIs for their product functionality. That way AI and data science researchers could embed any storage functionality they needed directly into their Python code or Jupyter Notebook application.

Having a Python SDK for NetApp ONTAP storage means using data storage for your MLOps or data science pipelines is that much easier.

Great move by NetApp. OK, where’s the rest of the industry?
