data access – Silverton Consulting

open source AGI or not – AGI part 8

Posted on October 19, 2023 by Ray in AGI, data access

Read a recent article in the NY Times, An industry insider drives an open alternative to big tech’s AI, about the Allen Institute for AI releasing a massive corpus of data, Dolma: 3 Trillion Token Open Corpus for Language Model Pretraining, that can be used to train LLMs and is available to download from HuggingFace.

The intent of the data release is to eventually supply an open source alternative to the closed source Google/OpenAI LLMs, and a more fully open source LLM than Meta’s Llama 2, that the world’s research community can use to understand, de-risk and further AI and ultimately AGI development.

We’ve written about AGI before (see our latest, One agent to rule them all – AGI part 7, which has links to parts 1-6 of our AGI posts). Needless to say it’s a very interesting topic to me and should be to the rest of humankind. LLMs are a significant step towards AGI IMHO.

One of the Allen Institute for AI’s (AI2) major goals is to open source an LLM (see Announcing AI2 OLMo, an Open Language Model Made by Scientists for Scientists), including the data (Dolma), the model, its weights, the training tools/code, the evaluation tools/code, and everything else that went into creating their OLMo (Open Language Model) LLM.

This way the world’s research community can see how it was created and perhaps help in ensuring it’s a good (whatever that means) LLM. Releasing Dolma is a first step towards a truly open source LLM.

The Dolma corpus

AI2 has released a report on the contents of Dolma (dolma-datasheet.pdf) which documents much of what went into creating the corpus.

The datasheet goes into a good level of detail on where the corpus data came from, how each data segment is licensed, and other metadata that allows researchers to understand its content.

For example, for the Common Crawl data, they have included each website’s URL as an identifier, and for The Stack data, the names of the GitHub repos used are included in the data’s metadata.

In addition, the Dolma corpus is released under an AI2 ImpACT license as a medium risk artifact, which requires disclosure for use (download). Medium risk ImpACT licensing means that you cannot re-distribute (externally) any copy of the corpus but you may distribute any derivatives of the corpus with “Flow down use restrictions”, “Attribution” and “Notices”.

All of which seems to say you can do an awful lot with the corpus and still be within its license restrictions. They do require a Derivative Impact Report to be filed, which is sort of a model card for the corpus derivative you have created.

What’s this got to do with AGI

All that being said, the path to AGI is still uncertain. But the textual abilities of recent LLM releases seem to be getting closer and closer to something that approaches human skill in creating text, code, interactive agents, etc. Yes, this may be just one “slim” domain of human intelligence, but textual skills, when and if perfected, can be applied to much of what white collar workers do these days, at least online.

A good text LLM would potentially put many of our jobs at risk but could also possibly open up a much more productive, online workforce, able to assimilate massive amounts of information, and supply correct-current-vetted answers to any query.

The elephant in the room

But all that begs the real question behind AI2’s open sourcing OLMo, which is how do we humans create a safe, effective AGI that can benefit all of mankind rather than any one organization or nation. One that can be used safely by everyone to do whatever is needed to make the world a better society for all.

Versus some artificially intelligent monstrosity that sees humankind, or any segment of it, as an enemy of whatever it believes needs to be done, and eliminates us or, worse, ignores us as irrelevant.

I’m of the opinion that the only way to create a safe and effective AGI for the world is to use an open source approach to create many (competing) AGIs. There are a number of benefits to this as I see it. With a truly open source AGI,

Any organization (with sufficient training resources) can have access to their personally trained AGI, which means no one organization or nation can gain the lion’s share of benefits from AGI.
It would allow the creation and deployment of many competing AGIs, which should help limit and check any one of them from doing us or the world any harm.
All of the world’s researchers can contribute to making it as safe as possible.
All of the world’s researchers can contribute to making it as multi-cultural, effective and correct as possible.
Anyone (with sufficient inferencing resources) can use it for their very own intelligent agent or to work on their very own personal world improvement projects.
Many cloud or service provider organizations (with sufficient inferencing resources) could make it available as a service to be used by anyone on an incremental, OpEx cost basis.

The risks of a truly open source AGI are also many and include:

Any bad actor, nation state, organization, billionaire, etc., could copy the AGI and train it as a weapon to eliminate their enemies or all of humankind, if so inclined.
Any bad actors could use it to swamp the internet and world’s media with biased information, disinformation or propaganda.
Any good actor or researcher, could, perhaps by mistake, unleash an AGI on an exponentially increasing, self-improvement cycle that could grow beyond our ability to control or to understand.
An AGI agent alone could take it upon itself to eliminate humanity or the world as the best option to save itself.

But all these are even more of a problem for closed or semi-open/semi-closed releases of AGIs, as the only organizations with the resources to do LLM research are very large tech companies or large, technically competent nation states. And all of these are already competing across the world stage.

The resources may still limit widespread use

One item that seems to be in the way of truly widely available AGI is the compute resources needed to train or to use one for inferencing. OpenAI has Microsoft and other select organizations funding their compute, Meta and Google have all their advertising revenue funding theirs.

AI2 seems to have access (and is looking for more funding for even more access) to the EU’s LUMI supercomputer (an HPE Cray system using AMD EPYC CPUs and AMD Instinct GPUs), located in a CSC data center in Finland. It is currently the EU’s fastest supercomputer at 375 CPU PFlops/550 GPU PFlops (~1.5M laptops).

Not many organizations, let alone nations could afford this level of compute.

But the funny thing is that compute (flops/$) doubles every 2 years or so. So, in six years or so, an equivalent of LUMI’s compute power would only require ~150K current laptops (rounding three doublings up to ~10X), and after another six years or so, ~15K laptops. At some point, ~18 years from now, one would only need ~1.5K laptops, or something any nation or organization could probably afford. Add another 15 years and we are down to a handful of laptops, which just about anyone with a family in the modern world could afford. So in ~33 years, or ~2056, any of us could train an LLM on our family’s compute resources. And that’s just the training compute.
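For anyone who wants to check the arithmetic, here is a minimal Python sketch of the same projection. It assumes the ~1.5M laptop equivalent for LUMI from above and a strict doubling of flops/$ every 2 years; the function name and defaults are mine. Strict doubling gives 8X per six years rather than the rounder ~10X used in the paragraph above, so the laptop counts come out somewhat higher (roughly 190K after six years, 23K after twelve, 2.9K after eighteen and about 16 after 33).

```python
def laptops_needed(years_from_now: float,
                   laptops_today: float = 1.5e6,
                   doubling_period_years: float = 2.0) -> float:
    """Laptop-equivalents needed to match today's LUMI after N years.

    Assumes compute per dollar (and so per laptop) doubles every
    `doubling_period_years`; both defaults are rough assumptions.
    """
    doublings = years_from_now / doubling_period_years
    return laptops_today / (2 ** doublings)


if __name__ == "__main__":
    for years in (6, 12, 18, 33):
        print(f"{years:>2} years out: ~{laptops_needed(years):,.0f} laptops")
```

Changing the doubling period to 1.8 years or so gets close to the ~10X per six years rounding.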

My guess, something like 10-100X less compute resources would be required to use it for inferencing. So that’s probably available for any organization to use right now or if not now, in 6 years or so.

~~~

I can’t wait until I can have my very own AGI to use to write RayOnStorage current-correct-vetted blog posts for me…

Comments?

Picture credit(s):

From AI2 Dolma blog post
From Race Faces by Jerome Rauckman
From dolma-datasheet.pdf
From May we introduce LUMI supercomputer web page

AWS Data Exchange vs Data Banks – part 2

Posted on March 14, 2023 by Ray in Artificial Intelligence, Cloud services, Cognitive computing, Data, data access, Data banks, Data economy, Data ownership, Data withdrawals, Deep Learning, Information economy, Initial Data Offerings (IDOs), Machine Learning, Neural network, Strategic Inflection Points

Saw that AWS announced a new Data Exchange service on their AWS Pi Day 2023. This is a completely managed service, available on the AWS Marketplace, to monetize data.

In a prior post on a topic I called data banks (Data banks, data deposits & data withdrawals…), I talked about the need to have some sort of automated support for personal data that would allow us to monetize it.

The hope then (4.5yrs ago) was that social media, search and other web services would supply all the data they have on us back to us and we could then sell it to others that wanted to use it.

In that post, I called the data the social media gave back to us data deposits, the place where that data was held and sold a data bank, and the sale of that data a data withdrawal. (I know talking about banks, deposits and withdrawals is probably not a great idea right now but this was back a ways).

AWS Data Exchange

1918 Farm Auction by dok1 (cc) (from Flickr)

With AWS Data Exchange, data owners can sell their data to data consumers. And it’s a completely AWS managed service. One presumably creates an S3 bucket with the data you want to sell, determines a price to sell the data for and a period clients can access that data for, and registers this with AWS. The AWS Data Exchange will then support any number of clients purchasing that data.

Presumably (although unstated in the service announcement), you’d be required to update and curate the data to ensure it’s correct and current, but other than that, once the data is on S3 and the offer is in place, you could just sit back and take the cash coming in.

I see the AWS Data Exchange service as a step on the path of data monetization for anyone. Yes it’s got to be on S3, and yes it’s via AWS Marketplace, which means that AWS gets a cut of any sale, but it’s certainly a step towards a freer data marketplace.

Changes I would like to see in the AWS Data Exchange service

Putting aside the need to have more than just AWS offer such a service (and I heartily request that all cloud service providers make a data exchange or something similar a fully supported offering of their respective storage services), this is not quite the complete data economy or ecosystem that I had envisioned in September of 2018.

If we just focus on the use (data withdrawal) side of a data economy, which is the main thing AWS Data Exchange seems to support, there are quite a few missing features IMHO,

Data use restrictions – We don’t want customers to obtain a copy of our data. We would very much like to restrict them to reading it and having plain text access to the data only during the period they have paid to access it. Once that period expires, all copies of the data need to be destroyed programmatically, cryptographically or in some other permanent/verifiable fashion. This can’t be done through license restrictions alone, which seems to be AWS Data Exchange’s current approach. Not sure what a viable alternative might be, but some sort of time-dependent or temporal encryption key that could be expired would be one step. Customers would then need to install some sort of data exchange service, on the servers using the data, that would support encrypted access/use.
Data traceability – Yes, clients who purchase access should have access to the data for whatever they want to use it for. But there should be some way to trace where our data ended up or what it was used for. If it’s to help train a NN, then I would like to see some sort of provenance or certificate applied to that NN, in a standardized structure, to indicate that it made use of our data as part of its training. Similarly, if it’s part of an online display tool, somewhere in the footnotes of the UI there would be a data origins certificate list which would have some way to point back to our data as the source of the information presented. Ditto for any application that made use of the data. AWS Data Exchange does nothing to support this. In reality something like this would need standards bodies to create certificates and additional structures for NNs, standard application packages, online services, etc. that would retain and provide proof of data origins via certificates.
Data locality – There are some jurisdictions around the world which restrict where data generated within their boundaries can be sent, processed or used. I take it that AWS Data Exchange deals with these restrictions by either not offering data under jurisdictional restrictions for sale outside governmental boundaries or gating purchase of the data outside valid jurisdictions. But given VPNs and similar services, this seems to be less than effective. If there’s some sort of temporal key encryption service to make use of our data, then it would seem reasonable to add some sort of regional key encryption to it.
Data auditability – There needs to be some way to ensure that our data is not used outside the organizations that have actually paid for it. And that if there’s some sort of data certificate saying that the application or service that used the data has access to that data, that this mechanism is mandated to be used, supported, and validated. In reality, something like this would need a whole re-thinking of how data is used in society. Financial auditing took centuries to take hold and become an effective (sometimes?) tool to monitor against financial abuse. Data auditing would need many of the same sorts of functionality, i.e. Certified Data Auditors, a Data Accounting Standards Board (DASB) which defines standardized reports as to how an entity is supposed to track and report on data usage, governmental regulations which require public (and private?) companies to report on the origins of the data they use on a yearly/quarterly basis, etc.
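To make the temporal and regional key idea a bit more concrete, here is a minimal Python sketch of the check a key-release service might run before handing a decryption key to a customer. Everything in it is hypothetical: the grant format, the function names and the signing secret are my own illustrations and are not part of AWS Data Exchange. It also only covers the "should this key still be released" decision; it does nothing about copies of plain text data a customer already holds, which is the harder problem.

```python
import hashlib
import hmac

# Hypothetical signing secret held only by the data owner's key service.
SECRET = b"data-owner-signing-key"


def issue_grant(customer: str, region: str, expires_at: int) -> str:
    """Return a signed grant: customer|region|expires_at|signature."""
    payload = f"{customer}|{region}|{expires_at}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}|{sig}"


def grant_is_valid(grant: str, request_region: str, now: int) -> bool:
    """Check signature, expiry and region before releasing a data key."""
    try:
        customer, region, expires_at, sig = grant.split("|")
    except ValueError:
        return False  # malformed grant
    payload = f"{customer}|{region}|{expires_at}"
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False  # grant was altered or not issued by us
    return now < int(expires_at) and request_region == region


if __name__ == "__main__":
    grant = issue_grant("acme", "eu", expires_at=1_000)
    print(grant_is_valid(grant, "eu", now=999))    # inside the paid period
    print(grant_is_valid(grant, "eu", now=1_000))  # period has expired
    print(grant_is_valid(grant, "us", now=999))    # wrong region
```

The times here are plain integers (think Unix seconds) so the example stays deterministic.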

Probably much more that could be added here but this should suffice for now.

Other changes to AWS Data Exchange processes

The AWS Pi Day 2023 announcement didn’t really describe the supplier end of how the service works. How one registers a bucket for sale was not described. I’d certainly want some sort of steganography service to tag the data being sold with the identity of those who purchased it. That way there might be some possibility of tracking who released any data exchange data into the wild.
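As a toy illustration of per-purchaser tagging, the Python sketch below hides a purchaser ID in a text field using zero-width characters. This is only an assumption about what such a service could do, not anything AWS offers, and the function names are made up. A tag like this is trivially stripped, so a real service would need something far more robust, but it shows the idea of each buyer receiving a slightly different copy.

```python
ZW0 = "\u200b"  # zero-width space, encodes a 0 bit
ZW1 = "\u200c"  # zero-width non-joiner, encodes a 1 bit


def embed_tag(text: str, purchaser_id: str) -> str:
    """Append the purchaser ID to text as invisible zero-width characters."""
    bits = "".join(f"{byte:08b}" for byte in purchaser_id.encode("utf-8"))
    return text + "".join(ZW1 if bit == "1" else ZW0 for bit in bits)


def extract_tag(text: str) -> str:
    """Recover a purchaser ID hidden by embed_tag (empty string if none)."""
    bits = "".join("1" if ch == ZW1 else "0"
                   for ch in text if ch in (ZW0, ZW1))
    usable = len(bits) - len(bits) % 8  # ignore any partial trailing byte
    raw = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    sold_copy = embed_tag("temperature,21.5", "customer-42")
    print(extract_tag(sold_copy))
```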

Also, how data exchange data access is billed seems a bit archaic. As far as I can determine, one gets unlimited access to data for some defined period (N months) for some specific amount ($s). And once that period expires, customers have to pay up or cease accessing the S3 data. I’d prefer to see at least a GB/month sort of cost structure. That way, if a customer copies all the data, they pay for that privilege, and if they want to reread the data multiple times, they pay for each data access. Presumably this would require some sort of solution to the data use restrictions above to enforce.
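A small Python sketch of the difference between the two billing models follows. The prices, periods and function names are invented for illustration; nothing here reflects actual AWS Data Exchange pricing.

```python
def flat_fee_cost(months: int, fee_per_period: float, period_months: int) -> float:
    """Unlimited access: pay one fee per (possibly partial) access period."""
    periods = -(-months // period_months)  # ceiling division
    return periods * fee_per_period


def metered_cost(gb_read_per_month: list[float], price_per_gb: float) -> float:
    """Metered access: pay for every GB read, including re-reads."""
    return sum(gb_read_per_month) * price_per_gb


if __name__ == "__main__":
    # A customer who copies 500GB once and never comes back, versus one
    # who rereads the same 500GB every month for six months.
    print(flat_fee_cost(6, fee_per_period=1000.0, period_months=6))
    print(metered_cost([500, 0, 0, 0, 0, 0], price_per_gb=0.25))
    print(metered_cost([500] * 6, price_per_gb=0.25))
```

Under the flat fee both customers pay the same; under metering the heavy reader pays six times as much.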

Data banks, deposits, withdrawals and Initial Data Offerings (IDOs)

The earlier post talks about an expanded data ecosystem or economy. And I won’t revisit all that here but one thing that I believe may be worth re-examining is Initial Data Offerings or IDOs.

As described in the earlier post, IDOs were a mechanism for data users to request permanent access to our data, but instead of paying a one time fee for it, they would offer data equity in the service.

Not unlike VC funding, each data provider would be supplied some % (data?) ownership in the service, and over time data ownership gets diluted at further data raises. But at some point, when the service is profitable, data ownership units could be purchased outright, so that the service could exit its private data use stage and go public (data use).
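To show what dilution at a further data raise would look like, here is a minimal Python sketch. The holder names and percentages are made up, and this is just the usual pro-rata dilution arithmetic applied to data ownership, not a worked-out IDO design.

```python
def dilute(holdings: dict[str, float], new_holder: str,
           new_fraction: float) -> dict[str, float]:
    """Give new_holder `new_fraction` of the service, shrinking everyone
    else's share pro rata so the total still sums to 1.0."""
    scaled = {name: share * (1.0 - new_fraction)
              for name, share in holdings.items()}
    scaled[new_holder] = scaled.get(new_holder, 0.0) + new_fraction
    return scaled


if __name__ == "__main__":
    # Hypothetical starting point: the service keeps 80%, the first data
    # providers share 20%. A second data raise hands out 25% of the service.
    after_ido = {"service": 0.80, "first_data_providers": 0.20}
    after_raise = dilute(after_ido, "second_data_providers", 0.25)
    for name, share in after_raise.items():
        print(f"{name}: {share:.1%}")
```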

Yeah, this all sounds complex, and AWS Data Exchange just sells data once and you have access to it for some period, establishing data usage rights. But I think that in order to compensate users for their data, there needs to be something like IDOs that provides data ownership shares in some service that can be transferred (sold) to others.

I didn’t flesh any of that out in the original post but I still think it’s the only way to truly compensate individuals (and corporations) for the (free) use of the data that web, AI and other systems are using to create their services.

~~~~

I wrote the older post in 2018 because I saw the potential for our data to be used by others to create/train services that generate lots of money for those organizations, but without any of our knowledge or outright consent and without compensating us for the data we have (inadvertently or advertently) created over our life span.

As an example, one can see how Getty Images is suing Stability AI (maker of Stable Diffusion), alleging free use of its copyrighted materials to train their AI NN. If one looks underneath the covers of ChatGPT, many image processing/facial recognition services, and many other NNs, much of the data used in training them was obtained by scraping web pages that weren’t originally intended to supply this sort of data to others.

For example, it wouldn’t surprise me to find out that RayOnStorage posts’ text has been scraped from the web and used to train some large language model like ChatGPT.

Do I receive any payment or ownership equity in any of these services – NO. I write these blog posts partially as a means of marketing my other consulting services but also because I have an abiding interest in the subject under discussion. I’m happy for humanity to read these and welcome comments on them by humans. But I’m not happy to have LLMs or other NNs use my text to train their models.

On the other hand, I’d gladly sell access to RayOnStorage posts text if they offered me a high but fair price for their use of it for some time period say one year… 🙂

Comments?

CTERA, Cloud NAS on steroids

Posted on August 13, 2021 by Ray in Cloud services, Cloud storage, data access, Data consistency, Data security, Distributed computing, File Storage, Mobile computing, Object storage, storage scalability, Strategic Inflection Points

We attended SFD22 last week and one of the presenters was CTERA, discussing their enterprise class, cloud NAS solution (for more information please see the SFD22 videos of their session).

We’ve heard a lot about cloud NAS systems lately (listen to our GreyBeards on Storage podcast with LucidLink from last month). Cloud NAS systems provide a NAS (SMB, NFS, and S3 object storage) front-end system that uses the cloud or on prem object storage to hold customer data, which is accessed through the use of (virtual or hardware) caching appliances.

These differ from file synch and share in that Cloud NAS systems

Don’t copy lots or all customer data to user devices; the only data that resides locally is metadata and the user’s or site’s working set (of files).
Do cache working set data locally to provide faster access
Do provide NFS, SMB and S3 access along with user drive, mobile app, API and web based access to customer data.
Do provide multiple options to host user data in multiple clouds or on prem
Do allow for some levels of collaboration on the same files

Although admittedly, the boundary lines between synch and share and Cloud NAS are starting to blur.

CTERA is a software defined solution. But they also offer a whole gaggle of hardware options for edge filers, ranging from a smart phone sized, 1TB flash cache for home office users to a multi-RU media edge server with 128TB of hybrid disk-SSD storage for 8K video editing.

They have HC100 edge filers, X-Series HCI edge servers, branch in a box, edge and Media edge filers. These latter systems have specialized support for MacOS and Adobe suite systems. For their HCI edge systems they support Nutanix, SimpliVity, HyperFlex and VxRail systems.

CTERA edge filers/servers can be clustered together to provide higher performance and HA. This way customers can scale-out their filers to supply whatever levels of IO performance they need. And CTERA allows customers to segregate file workloads/directories to be serviced by specific edge filer devices to minimize noisy neighbor performance problems.

CTERA supports a number of ways to access cloud NAS data:

Through (virtual or real) edge filers which present NFS, SMB or S3 access protocols
Through the use of CTERA Drive on MacOS or Windows desktop/laptop devices
Through a mobile device app for iOS or Android
Through their web portal
Through their API

CTERA uses an HA, dual-redundant Portal service, which is a cloud (or on prem) service that provides the CTERA metadata database, edge filer/server management and other services, such as web access, cloud drive end points, mobile apps, API, etc.

CTERA uses S3 or Azure compatible object storage for its backend, source-of-truth repository to hold customer file data. CTERA currently supports 36 on-prem and in-cloud object storage services. Customers can have their data in multiple object storage repositories. Customer files are mapped one to one to objects.

CTERA offers global dedupe, virus scanning, policy based scheduled snapshots and end to end encryption of customer data. Encryption keys can be held in the Portals or in a KMIP service that’s connected to the Portals.

CTERA has impressive data security support. Besides the end-to-end data encryption mentioned above, they also support dark sites and zero-trust authentication, and are DISA (Defense Information Systems Agency) certified.

Customer data can also be pinned to edge filers. Moreover, specific customer (directory/sub-directory) data can be hosted on specific buckets so that data can:

Stay within specified geographies,
Support multi-cloud services to eliminate vendor lock-in

CTERA file locking is what I would call hybrid. They offer strict consistency for file locking within sites but eventual consistency for file locking across sites. There are performance tradeoffs for strict consistency, so by using a hybrid approach, they offer most of what the world needs from file locking without incurring the performance overhead of strict consistency across sites. For another way to support hybrid file locking consistency, check out LucidLink’s approach (see the GreyBeards podcast with LucidLink above).

At the end of their session Aron Brand got up and took us into a deep dive on select portions of their system software. One thing I noticed is that the Portal is NOT in the data path. When the edge filers want to access a file, the Portal provides the credential verification and points the filer(s) to the appropriate object, and the filers take off from there.
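A toy Python sketch of that split between control path and data path is below. The class and method names are hypothetical stand-ins of mine and have nothing to do with CTERA’s actual software or API; the point is only that the portal hands back a location after checking credentials, and the file bytes then flow directly between the edge filer and the object store.

```python
class ObjectStore:
    """Stand-in for a backend object store: (bucket, key) -> bytes."""

    def __init__(self):
        self.objects = {}
        self.reads = 0

    def put(self, bucket, key, data):
        self.objects[(bucket, key)] = data

    def get(self, bucket, key):
        self.reads += 1
        return self.objects[(bucket, key)]


class Portal:
    """Control path only: verifies the filer and maps a path to an object."""

    def __init__(self):
        self.catalog = {}
        self.authorized_filers = set()

    def register(self, path, bucket, key):
        self.catalog[path] = (bucket, key)

    def locate(self, filer_id, path):
        if filer_id not in self.authorized_filers:
            raise PermissionError(f"filer {filer_id!r} is not authorized")
        return self.catalog[path]  # a location, never the file data


class EdgeFiler:
    """Data path: asks the portal where a file lives, then reads it itself."""

    def __init__(self, filer_id, portal, store):
        self.filer_id = filer_id
        self.portal = portal
        self.store = store

    def read_file(self, path):
        bucket, key = self.portal.locate(self.filer_id, path)
        return self.store.get(bucket, key)


if __name__ == "__main__":
    store, portal = ObjectStore(), Portal()
    store.put("eu-bucket", "obj-001", b"quarterly report")
    portal.register("/finance/q3.doc", "eu-bucket", "obj-001")
    portal.authorized_filers.add("branch-filer-1")
    filer = EdgeFiler("branch-filer-1", portal, store)
    print(filer.read_file("/finance/q3.doc"))
```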

CTERA’s customer list is very impressive. It seems that many large enterprises (50 of the WW F500) are customers of theirs. Some of the more prominent include GE, McDonalds, the US Navy, and the US Air Force.

Oh, and besides supporting potentially 1000s of sites and 100K users in the same name space, they also have intrinsic support for multi-tenancy and offer cloud data migration services. For example, one can use Portal services to migrate cloud data from one cloud object storage provider to another.

They also mentioned they are working on supplying K8S container access to CTERA’s global file system data.

There’s a lot to like in CTERA. We hadn’t heard of them before but they seem focused on enterprises with lots of sites, boatloads of users and massive amounts of data. It seems like our kind of storage system.

Comments?

Storageless data!?

Posted on March 10, 2021 (updated March 18, 2021) by Ray in data access, data mobility, Data QoS, Distributed computing, File Storage, Strategic Inflection Points

I (virtually) attended SFD21 earlier this year and a company called Hammerspace presented their vision for storageless data (see videos of their session at SFD21).

We’ve talked about them before but now they have something to offer the enterprise – data mobility or storageless data.

The white board after David Flynn’s session at SFD8

In essence, customers want to be able to run their workloads wherever it makes the most sense, on prem, in private cloud, and in the public cloud among other places. Historically, it’s been relatively painless to transfer an application’s binary from one to another data center, to a managed service provider or to the public cloud.

And with VMware Cloud Foundation, Kubernetes, Docker and Linux operating everywhere, the runtime environment and other OS services that applications depend on are pretty much available in any of those locations. So now customers have 2 out of 3, what’s left?

It’s all about the Data

Data can take a very long time to move around a data center, let alone across the web between locations. MBs and even GBs of data may be relatively painless to move, but TBs of data can take days, and moving PBs of data is suicidal.

For instance, when we signed up for a globally accessible file synch and share storage service, I probably had 75GB or so of data I wanted managed. It took literally several days to upload this. Yes, I didn’t have data center class internet access, but even that might have only sped this up 2-5X. Ok, now try this with 1TB or more and it’s pretty much going to take days, and you can multiply that by 1,000 to do a PB. And that’s if it happens to continue to perform the transfer without disruption.
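The back-of-envelope math is easy to sketch in Python. The link speeds and the 80% efficiency factor below are assumptions for illustration, not measurements from my upload.

```python
def transfer_days(size_bytes: float, link_mbps: float,
                  efficiency: float = 0.8) -> float:
    """Days needed to move size_bytes over a link of link_mbps megabits/s.

    `efficiency` is an assumed fudge factor for protocol overhead and
    contention; 1.0 would mean the link runs flat out the whole time.
    """
    seconds = (size_bytes * 8) / (link_mbps * 1e6 * efficiency)
    return seconds / 86_400


if __name__ == "__main__":
    GB, TB, PB = 1e9, 1e12, 1e15
    for label, size, mbps in [("75GB over a 5 Mbps uplink", 75 * GB, 5),
                              ("1TB over 100 Mbps", TB, 100),
                              ("1PB over 1 Gbps", PB, 1000)]:
        print(f"{label}: {transfer_days(size, mbps):.1f} days")
```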

So what’s Hammerspace storageless data got to do with any of this?

Hammerspace’s idea

It’s been sort of a ground truth of storage, since I’ve been in the industry (40+ years now), that not all random IO data is accessed at the same frequency. That is, some data is accessed a lot and other data accessed hardly at all. That’s why DRAM caching of data can be so important to a host or storage system.

Similarly for sequential access, if you can get the first blocks of data to the host and then stream the rest in time, a storage system can appear to read fast.

Now I won’t go into all the tricks of doing good data caching, (the secret sauce to every vendor’s enterprise storage), but if you can appear to cache data well, you don’t actually need to transfer all the data associated with an application to a location it’s running in, you can appear as if all the data is there, when actually only some of it is present.

Essentially, Hammerspace creates a global file system for your data, across any locations you wish to use it, with great caching, optimized data transfer and with real storage behind it. Servers running your applications mount a Hammerspace file system/share that stitches together all the file storage behind it, across all the locations it’s operating in.

An application request goes to Hammerspace and if the data is not present there, Hammerspace goes and fetches and caches blocks of data as fast as it can. This will let the application start performing IO while the rest of the data is being cached and if allowed, moved to the new location.
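The fetch-on-miss behavior can be sketched with a toy read-through block cache in Python. The classes below are my own illustration and say nothing about how Hammerspace is actually implemented; they just show an application reading a byte range while only the blocks it touches are pulled from the remote location.

```python
class RemoteStore:
    """Stand-in for file data living at some other (slow to reach) site."""

    def __init__(self, data: bytes, block_size: int):
        self.data = data
        self.block_size = block_size
        self.fetches = 0  # how many blocks crossed the "WAN"

    def fetch_block(self, index: int) -> bytes:
        self.fetches += 1
        start = index * self.block_size
        return self.data[start:start + self.block_size]


class ReadThroughCache:
    """Serve reads locally, fetching only the blocks that are missing."""

    def __init__(self, remote: RemoteStore):
        self.remote = remote
        self.blocks: dict[int, bytes] = {}

    def read(self, offset: int, length: int) -> bytes:
        if length <= 0:
            return b""
        bs = self.remote.block_size
        first, last = offset // bs, (offset + length - 1) // bs
        chunks = []
        for index in range(first, last + 1):
            if index not in self.blocks:  # cache miss: go get the block
                self.blocks[index] = self.remote.fetch_block(index)
            chunks.append(self.blocks[index])
        start = offset - first * bs
        return b"".join(chunks)[start:start + length]


if __name__ == "__main__":
    remote = RemoteStore(bytes(range(256)) * 4, block_size=64)
    cache = ReadThroughCache(remote)
    cache.read(100, 50)    # touches two blocks, both fetched remotely
    cache.read(100, 50)    # same range again, served from the cache
    print(remote.fetches)  # blocks fetched so far
```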

Storage can be unmanaged by Hammerspace, read-write managed by Hammerspace or read-only managed by Hammerspace. Customers who want the whole Hammerspace storageless data functionality would use read-write mode. Customers who want to continue to access their data directly, but want read access to it globally, would use read-only mode.

Once read-write storage is assigned to Hammerspace, it grabs all the file metadata information on the storage system. Once this process completes, customers no longer access this file data directly, but rather must access it through Hammerspace. At that point, this data is essentially storageless and can be accessed wherever Hammerspace services are available.

How does Hammerspace do it

Behind the scenes is a lot of technology, some of which is discussed in the SFD21 sessions (see videos above). Hammerspace is not in the data path but rather in the control path of data access. But it does orchestrate data movement, and it does route data IO requests from an application to where the data (currently) resides.

Hammerspace also supports Service Level Objectives (SLOs) for performance, geolocation, security, data protection options, etc. These can be used to keep data in particular regions, to encrypt data (using KMIP), ensure high performance, high data availability, etc.

Hammerspace can manage data across 32 separate sites. It takes a couple of hours to deploy per site. Each site has a Hammerspace metadata service with standalone access to all data within that site. For example, standalone access could be used in the event of a network loss.

At the moment, they support eventual consistency and don’t support a global lock service. Rather, Hammerspace uses a conflict resolution service in the event data is overwritten by two or more applications. For any file that was being updated in two or more locations, that file would be flagged as in conflict, Hammerspace would provide snapshots of the various versions of the file(s) and it would require some sort of manual intervention to resolve the conflict. Each location would have (temporary) access to the data it had written directly, but at some point the conflict would need resolution.
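As a rough illustration of flagging files as in conflict, here is a small Python sketch. It assumes a simple log of (path, site) write records collected since the sites last synchronized; the record format and function name are hypothetical and this is not Hammerspace’s actual conflict resolution service.

```python
def find_conflicts(writes: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Return the files written at two or more sites since the last sync.

    `writes` is a list of (path, site) records; a file written repeatedly
    at a single site is not a conflict.
    """
    sites_by_path: dict[str, set[str]] = {}
    for path, site in writes:
        sites_by_path.setdefault(path, set()).add(site)
    return {path: sorted(sites)
            for path, sites in sites_by_path.items() if len(sites) > 1}


if __name__ == "__main__":
    log = [("/proj/plan.doc", "nyc"), ("/proj/plan.doc", "lon"),
           ("/proj/notes.txt", "nyc"), ("/proj/notes.txt", "nyc")]
    for path, sites in find_conflicts(log).items():
        print(f"{path} is in conflict, written at: {', '.join(sites)}")
```

Each flagged file would then need its per-site versions kept (as snapshots) until someone decides which one wins.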

They also support NFS and SMB file access for the front end and use object storage services for backend data. Data is copied on demand to the local site’s storage when accessed, based on the SLO policies in effect for it. During data movement it is copied up, temporarily, into objects on AWS, Microsoft Azure, or GCP, and then copied down to the location it’s being moved to. I believe this temporary object data is encrypted and compressed. Hammerspace supports KMIP key providers.

Pricing for Hammerspace is on a managed capacity basis. But anyone can use Hammerspace for up to 10TB for free. Hammerspace is available in AWS marketplace for configuration there.

~~~~

Well it’s been a long time coming, but it appears to be here. Any customers wanting hybrid-cloud operations or global access to their data would be remiss to not check out Hammerspace.

[Edited after posting, The Eds.]

Data Science storage with NetApp’s Python Toolkit

Posted on February 22, 2021 (updated April 8, 2021) by Ray in data access, Data analytics, data logistics, Data science, Deep Learning, Machine Learning, Storage

I’ve got a book someplace (yet to be read completely) with the title Data science with Python. At Storage Field Day 21 last month, NetApp was there discussing a number of their product offerings, one of which was their Python SDK to manage NetApp storage for data scientists and AI researchers (see videos of their sessions here).

I’m not a data science expert but a Python SDK for storage management just makes so much sense to me I just had to take a look. Their GitHub repo is available online and they call it the NetApp Data Science Toolkit.

The challenge for data science and AI researchers is that it’s all about the data. How do you find the data, gain access to it, clean it, and process it quickly so you can do it all over again? Having some sort of Python SDK that allows you to do some rudimentary storage volume configuration, access, snapshotting, etc. can make these sorts of pipelines self-service, rather than going back and forth with operations to get volumes configured, mounted, and services established.

NetApp Data Science Toolkit

The NetApp Data Science Toolkit can be pip installed into anything with Python 3.5 or later and can be invoked via the command line or as a library of Python functions. The command line utility and the Python calls appear to be functionally equivalent.

pip3 install netapp-ontap pandas tabulate requests boto3

The Toolkit must be configured for your environment and NetApp storage but once that’s done you’re ready to rock and roll.

The command line is invoked with

./ntap_dsutil.py

following that command are subcommands and parameters specifying what ONTAP operation you want to perform and how it is to be done. Python function calls seem to follow the same parameterization as the CLI.

The CLI and Python function calls can run on macOS or any Linux distribution. There’s a paper that discusses how to use the SDK to accelerate AI pipelines as well as another ReadMe that describes its use in Kubernetes with NetApp’s Trident CSI plugin.

The functionality supports NetApp AFF, FAS, Cloud Volumes and Select systems that are running ONTAP 9.7 or later. For a current list of ONTAP functions available, check out the toolkit. But for an overview, these ONTAP functions were available:

For Volume Management – cloning, creating, listing all, deleting or mounting a volume.
For Snapshot Management – creating, deleting, listing and restoring snapshots (of volumes).
For Data Fabric Management – listing all cloud sync relationships, triggering a cloud sync operation, multi-thread pulling a bucket down from S3 storage (into a NetApp volume directory), pulling a single object down from S3 into a file, pushing the contents of a directory to a bucket on S3 and pushing a file into an object on S3.
For Advanced Data Fabric Management – listing all SnapMirror relationships and triggering a sync operation for an existing SnapMirror relationship.
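To give a feel for what "multi-thread pulling a bucket down" involves, here is a minimal sketch using only the Python standard library. It is not the toolkit's code: the function name pull_bucket and its parameters are hypothetical, and the fetch callable is injected so the sketch runs without S3 credentials. With boto3, fetch could plausibly wrap a get_object call and read the returned body, but that wiring is left as an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def pull_bucket(keys, fetch, dest_dir, max_workers=8):
    """Copy every object named in `keys` into `dest_dir` using a thread pool.

    Hypothetical helper, not part of the NetApp toolkit.
    `fetch(key)` must return the object's contents as bytes.
    """
    dest = Path(dest_dir)

    def _pull_one(key):
        target = dest / key
        # Keys may contain "/" separators, so create parent directories first.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(fetch(key))
        return target

    # pool.map runs the downloads concurrently but returns results
    # in the same order as `keys`.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_pull_one, keys))


if __name__ == "__main__":
    import tempfile

    # A dict stands in for the bucket so the example runs anywhere.
    fake_bucket = {"train/a.csv": b"1,2,3", "train/b.csv": b"4,5,6"}
    with tempfile.TemporaryDirectory() as tmp:
        for path in pull_bucket(list(fake_bucket), fake_bucket.__getitem__, tmp):
            print(path.name, path.read_bytes())
```

Threads suit this job because each transfer spends most of its time waiting on the network, so several can overlap even in a single Python process.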

This is a pretty comprehensive list of NetApp ONTAP storage functionality. Having all this under the control of Python and a CLI for a data scientist or AI researcher seems pretty impressive.

Of course, not every option for all those functions is supported, but it’s just a start (V1.1 of the toolkit). I’m sure there’s more to come, especially if customers demand it.

However, it would be nice to have an ONTAP simulator available with the toolkit that could be used to test out your Python code and CLI commands before using real NetApp storage. This would be very useful for those of us lacking our own test ONTAP storage, just hanging around on prem or in the cloud.
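Until something like that exists, one workaround is to test pipeline logic against a small in-memory stand-in. The sketch below is only an illustration of that idea: FakeVolumeStore and its method names are hypothetical, it does not mimic the toolkit's real function signatures, and it models only volume creation, snapshots and restores.

```python
import copy


class FakeVolumeStore:
    """Hypothetical in-memory stand-in for a storage backend (not NetApp code)."""

    def __init__(self):
        self._volumes = {}    # volume name -> {file path: contents}
        self._snapshots = {}  # volume name -> {snapshot name: frozen copy}

    def create_volume(self, name):
        if name in self._volumes:
            raise ValueError(f"volume {name!r} already exists")
        self._volumes[name] = {}
        self._snapshots[name] = {}

    def list_volumes(self):
        return sorted(self._volumes)

    def write(self, volume, path, data):
        self._volumes[volume][path] = data

    def read(self, volume, path):
        return self._volumes[volume][path]

    def create_snapshot(self, volume, snapshot):
        # Deep copy so later writes to the volume cannot alter the snapshot.
        self._snapshots[volume][snapshot] = copy.deepcopy(self._volumes[volume])

    def list_snapshots(self, volume):
        return sorted(self._snapshots[volume])

    def restore_snapshot(self, volume, snapshot):
        self._volumes[volume] = copy.deepcopy(self._snapshots[volume][snapshot])


def demo():
    """Snapshot a dataset, damage it, then roll back to the snapshot."""
    store = FakeVolumeStore()
    store.create_volume("project1")
    store.write("project1", "train.csv", "clean data")
    store.create_snapshot("project1", "before_experiment")
    store.write("project1", "train.csv", "corrupted data")
    store.restore_snapshot("project1", "before_experiment")
    return store.read("project1", "train.csv")


if __name__ == "__main__":
    print(demo())
```

A stand-in like this only checks that your pipeline calls snapshot and restore in the right order; it says nothing about how real ONTAP storage behaves.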

As Python becomes the language of choice for AI and now data science, it seems only natural that storage and data protection companies would start releasing Python SDKs/APIs for their product functionality. That way AI and data science researchers could embed any storage functionality they needed directly into their Python code or Jupyter Notebook application.

Having a Python SDK for NetApp ONTAP storage means using data storage for your MLOps or data science pipelines is that much easier.

Great move by NetApp. Ok where’s the rest of the industry?

Picture credit(s):

How to build a data science pipeline by Balázs Kégl
What’s data science pipeline by Geek for Geeks
MLOps: Continuous delivery and automation pipelines in ML from Google
CLI command and code snippets from NetApp Data Science Toolkit GitHub and other papers linked to in the repo’s ReadMe file

cOAlition S requires open access to funded research

Posted on January 26, 2021 by Ray in Business economics, data access, Information economy, R&D measures, Strategic Inflection Points, Visionary leadershp

I read a Science article this last week (A new mandate highlights costs and benefits of making all scientific articles free) about a group of funding organizations that have come together to mandate open access to all peer-reviewed research they fund, an initiative called Plan S. The list of organizations in cOAlition S is impressive, including national R&D funding agencies from the UK, Ireland, Norway, and a number of other countries, and organizations such as WHO, the Wellcome Trust, the Bill & Melinda Gates Foundation and more. The group is also being funded by the EU. Plan S takes effect this year.

Essentially, all research funded by these organizations must be immediately published in an open access forum or open access journal, or be freely available in an open access section of a publisher’s website, which means it can be read for free by anyone worldwide with access to the web. Authors and institutions will retain copyright for the work, and the work will be published under an open access license such as the CC BY (Creative Commons Attribution) license.

Why open access is important

At this blog, frequently we find ourselves writing about research which is only available on a paid subscription or on a pay per article basis. However, sometimes, if we search long enough, we find a duplicate of the article published in pre-print form in some preprint server or open access journal.

We have written about open access journals before (see our New Science combats Coronavirus post). Much of what we do on this blog would not be possible without open access resources like PLoS, BioRxiv, and PubMed.

Open access mandates are trending

Open access mandates have been around for a while now. And even the US Gov’t got into the act, mandating all research funded by the NIH be open access by 2008, with Dept of Agriculture and Energy following later (see wikipedia Open access mandates).

In addition, given the pandemic emergency, many research publishers like Nature and Elsevier made any and all information about the Coronavirus free access on their websites.

Impacts and R&D research publishing business model

Although research is funded by public organizations such as charities and government agencies, prior to open access mandates, most research was published in peer-reviewed journal magazines which charged a fee for access. For many research organizations, those fees were a cost of doing research. If you were an independent researcher or in an institution that couldn’t afford these fees, attempting to do cutting edge research was impossible without this access.

Yes, in some cases, those journal repositories waived these fees for deserving institutions and organizations, but this wasn’t the case for individual researchers. Or if you were truly diligent, you could request a copy of a paper from an author and wait.

Of course, journal publishers have real expenses they need to cover, as well as make a reasonable profit. But due to business consolidation, there were fewer independent journals around and, as a result, publishers charged bundled license fees for vast swathes of research articles. Such a wide bundle may or may not be of interest to an individual or an institution. On top of that, with consolidation, profits were becoming a more significant consideration.

So open access mandates often included funding to cover the fees publishers charge to supply open access. Such fees varied widely. So open access mandates also began to require that those fees be published, along with a description of how prices were calculated. By doing so, the hope was to make such costs more transparent.

Impacts on authors of research articles

Somewhere there’s an aphorism for researchers that says “publish or perish”, which means you must publish research in order to become a recognized expert in your field. Recognition is often the main driver behind better academic employment and more research funding.

However, it’s not just about the volume of published papers; the quality of research also matters. And the more highly regarded publishing outlets have an advantage here, in that they are de facto gatekeepers to what’s published in their journals. As such, where you publish can often lend credibility to any research.

Another thing changed over the last few decades: judging the quality of research has become more quantitative. Nowadays, research quality is also judged by the number of citations it receives. The more popular a publisher is, the more readers it has, which increases the possibility for citations.

Thus, most researchers try to publish their best work in highly regarded journals. And of course, these journals have a high cost to provide open access.

Successful research institutions can afford to pay these prices but those further down the totem pole cannot.

Most mandates come with additional funding to support paying the cost to supply open access. But they also require publishing and justifying these costs, in the belief that doing so will lend some transparency to them.

So the researcher is caught in the middle. Funding organizations want open access to research they fund. And publishers want to be paid a profit for that access.

History of research publication

Nature magazine first started publishing research in 1869, Science magazine first published in 1880, and the Royal Society first published research in 1665. So publishing research has been going on for over 350 years and, at least as a for-profit business model, since the mid-1800s.

Research prior to being published in journals was only available in books. And more than likely, the author of the research had to pay to have a book published and the publisher made money only when those books were sold. And prior to that, scientific research was mostly only available in a course of study, also mostly paid for by the student.

So science has always had a cost to access. What open access mandates are doing is moving this cost to something added to the funding of research.

Now if open access can only solve the reproducibility crisis in science we could have us a real scientific revolution.

Comments?

Photo Credits:

From the cOAlition S website
From the PLoS website
From the BioRxiv website
From wikipedia article on Sir Isaac Newton

Google Docs as subversive technology

Posted on June 12, 2020 by Ray in Crowdsourcing, data access, Strategic Inflection Points, Visionary leadershp

Read an article the other day in TechReview (How Google Docs became the social media of the resistance) about how Google Docs was being used to help coordinate and promote the resistance surrounding the recent Black Lives Matter movement.

The article points out that Google Docs are being used to share resources around anti-racism, email templates, bail resources, pro-bono legal assistance, etc. to help inform and coordinate the movement’s actions and activities.

Social unrest, the killer app for Google Docs

Protests could be the killer app for shared Google Docs. Facebook and other social media sites are better used for documenting the real time interactions during protests, but coordinating, motivating and informing the protests and protestors is better accomplished using Google Docs, a simple, web-based document editor and sharing service.

In pre-internet days, I suppose all this would have been done on hand copied, typeset printed, carbon copied or photocopied theses/pamphlets/fliers/printouts. From Luther’s list of grievances nailed to the church door and the Common Sense pamphlet during the USA revolutionary war to countless fliers during the 60’s protests, all these used the technology of the day to promote protest and revolution.

Nowadays all it takes is a shared Google Doc and a Google (drive) account.

Google Docs are everywhere

The high school that one of my kids went to uses Google Docs for sharing and submitting homework assignments.

Google Docs are shareable because they are hosted on Google Drive. Docs is just one component of the Google (G-)suite of web based apps that includes Google Sheets (spreadsheets), Google Slides (presentations) and Google Drive (file storage).

Moreover, any Google Doc, Sheet or Slide file can be shared with and edited by anyone. And Google services like Docs, Sheets, and Slides are usable anonymously. Anyone online can make a change to a shareable/editable doc, sheet, or slide and their changes are automatically saved to the Google Drive file.

Another thing is that any Google Doc can be shared with just a URL. And they can also be made read-only (or uneditable) by their owner at any time. And of course any Google Doc is backed up automatically by Google drive services.

Owners of documents can revert to previous versions of a Doc file. So if someone incorrectly (or maliciously) changes a doc, the originator can revert it back to a prior version.

Why not use a Wiki

I would think a Wiki would be better to use to coordinate, motivate and inform a protest. Once a Wiki is setup and started, it can be much easier to navigate, as easy to update, and can become a central repository of all information about a movement/protest.

But it takes a lot more effort and IT-web knowledge to set up a Wiki. And it has to have its own web address.

Another problem with a Wiki, is that it can become a central point which can be more easily attacked or disturbed. And Wiki edit wars are pretty common, so they too are not immune to malicious behavior.

But with 10s to 100s of Google Docs, spread across a similar number of user Google Drives, Google Docs are a much more distributed resource, less prone to a single point of attack. And they can be created and edited almost on a whim. And the only thing it takes is a Google login and Google Drive.

~~~~

Photocopiers were a controlled technology in the old Soviet Union, and even today Facebook and Twitter are restricted in China and other authoritarian states.

But Google Docs seems to have become a much more ubiquitous tool and the latest technology to aid, abet and support social resistance.

Photo credit(s):

Wikipedia, Martin Luther 95 Theses article
Star Tribune, State fair vendors focus of debate on BLM
Wikipedia, Common Sense article

New science used to combat COVID-19 disease

Posted on March 9, 2020March 9, 2020 by Ray in Crowdsourcing, data access, Data banks, Data deposits, R&D measures, Strategic Inflection Points, Visionary organizations

Read an article last week in Science Magazine (A completely new culture on doing research… ) on how the way science is done to combat disease has changed the last few years.

In the olden days (~3-5 years ago), disease outbreaks would generate a slew of research papers, which would be written and submitted for publication, and there they would sit until peer-reviewed, after which they might get published for the world to see for the first time. Estimates I’ve seen say that the scientific research publishing process takes anywhere from one month (very fast) to 4-8 months, assuming no major revisions are required.

With the emergence of the Zika virus and recent Ebola outbreaks, more and more biological research papers have become available through pre-print servers. These are web-sites which accept any research before publication (pre-print), posting the research for all to see, comment on and understand.

Open science via pre-print

Most of these pre-print servers focus on specific areas of science. For example bioRxiv is a pre-print server focused on Biology and medRxiv is for health sciences. On the other hand, arXiv is a pre-print server for “physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.” These are just a sampling of what’s available today.

In the past, scientific journals would not accept research that had been published before. But this has slowly changed as well. Now most scientific journals have policies on pre-print publication and will also publish pre-printed papers if they deem them worthwhile (see wikipedia article List of academic journals by pre-print policies).

As of today (9 March 2020), on bioRxiv there are 423 papers with the keyword “coronavirus” and 52 papers with the keyword “COVID-19”; some of these may be the same. The newest (Substrate specificity profiling of SARS-CoV-2 Mpro protease provides basis for anti-COVID-19 drug design) was published on 7 March 2020. The last sentence in their abstract says “The results of our work provide a structural framework for the design of inhibitors as antiviral agents or diagnostic tests.” The oldest on bioRxiv is dated 23 January 2020. Similarly, there are 326 papers on medRxiv with the keyword “coronavirus”, the newest published 5 March 2020.

Pre-print research is getting out in the open much sooner than ever before. But the downside is that pre-print papers may have serious mistakes or omissions in them, as they are not peer-reviewed. So the cost of rapid openness is the possibility that some research may be outright wrong, badly done, or lead researchers down blind alleys.

However, the upside is any bad research can be vetted sooner, if it’s open to the world. We see similar problems with open source software, some of it can be buggy or outright failure prone. But having it be open, and if it’s popular, many people will see the problems (or bugs) and fixes will be rapidly created to solve them. With pre-print research, the comment list associated with a pre-print can be long and often will identify problems in the research.

Open science through open journals

In addition to pre-print servers, we are also starting to see the increasing use of open scientific journals such as PLOS to publish formal research.

PLOS has a number of open journals focused on specific arenas of research, such as PLOS Biology, PLOS Pathogens, PLOS Medicine, etc.

Researchers or their institutions have to pay a nominal fee to publish in PLOS. But all PLOS publications are fully expert, peer-reviewed. But unlike research from say Nature, IEEE or other scientific journals, PLOS papers are free to anyone, and are widely available. (However, I just saw that SpringerNature is making all their coronavirus research free).

Open science via open data(sets)

Another aspect of scientific research that has undergone change of late is the sharing and publication of data used in the research.

Nature has a list of recommended data repositories. All these data repositories seem to be hosted by FAIRsharing at the University of Oxford and run by their Data Readiness Group. They list 1349 databases of which the vast majority (1250) are for the natural sciences with over 1380 standards used for data to be registered with FAIRsharing.

We’ve discussed similar data repositories in the past (please see Data banks, data deposits and data withdrawals, UK BioBank, Big open data leads to citizen science, etc). Having a place to store data used in research papers makes it easier to understand and replicate science.

Collaboration software

The other change to research activities is the use of collaborative software such as Slack. Researchers at UW Madison were already using Slack to collaborate on research, but when Coronavirus went public, they figured Slack could help here too. So they created a group (or channel) under their Slack site called “Wu-han Clan” and invited 69 researchers from around the world. The day after they created it, they held their first teleconference.

Other collaboration software exists today but Slack seems most popular. We use Slack for communications in our robotics club, blogging group, a couple of companies we work with, etc. Each has a number of invite-only channels, where channel members can post text, (data) files, links and just about anything else of interest to the channel.

Although I have not been invited to participate in Wu-han Clan (yet), I assume they use Slack to discuss and vet (pre-print) research, discuss research needs, and explore other ways to avert the pandemic.

~~~~

So there you have it. Coronavirus scientific research is happening at warp speed compared to diseases of yore. Technologies to support this sped-up research have all emerged over the last five to 10 years but are now being put to use more than ever before. Such technological advancement should lead to faster diagnosis, lower worldwide infection/mortality rates and a quicker medical solution.

Photo Credit(s):