Need memory, Intel’s Optane DC PM to the rescue

I attended Intel’s DataCentric Innovation Conference Tech Field Day eXclusive (TFDx) last April. There were a couple of items Intel presented at the show that peaked my interest there, one of which was DL Boost (see my Intel’s new DL Boost for AI inferencing blog post) and the other was Optane DC PM (data center persistent memory). This post is about Optane DC PM.

As you already know, Optane SSDs have been on the market now for at least a year or so and have not gained much market traction as of yet. I and others attribute this to the high price differential between Optane SSDs and NVMe Flash SSDs but others may say it’s a matter of production yields – probably a little of both.

But Optane, as announced, always had another form factor (if that’s the right term), as persistent memory that could dramatically increase the size of server memory to support new memory intensive applications at a lower price than DRAM.

I was at Nutanix .NEXT conference last week and saw a 4 socket server from DELL that had 6TB of DRAM in it (and 4-44 core CPUs). I didn’t ask the price but when I mentioned I wanted one for my home office, they said it could easily heat my house. So the other problem with a lot of DRAM is power consumption.

Optane DC PM (data center persistent) memory is intended to solve both the high cost and high power consumption problems of DRAM.

How does it work in a server

The newer Intel motherboards support up to 12 slots of memory per socket. And up to 6 of these can be Optane DC PM (512GB DIMM) or 3TB per socket. Optane DC PM is accessed via L1-L2 caching just like any other memory. Apparently with a dual socket system you can have up to 11 Optane DC PM DIMMs on the motherboard.

L1-L2 cache access times are on the order of picoseconds (10**[-12] seconds), DRAM is on the order of nanoseconds (10**[-9] seconds) and flash is on the order of 100 microseconds (100*10**[-6] seconds). So there’s a vast access time gulf between DRAM and Flash that could be exploited with the right technology – enter Optane DC PM.

The only detailed info I could find on Optane DC PM access times was in a research paper (see Basic performance of Intel Optane DC PMM research paper) and it said Optane DC PM assessing times are ~350 nanoseconds, or close to right between DRAM and Flash. At the show the development team indicated that Optane DC PM support about 3GB/sec of bandwidth per module (DIMM).

There are two ways to use Optane DC PM:

  1. Memory mode – in Memory mode, the data in Optane DC PM is thrown away during a power cycle. You must use a block of DRAM as a cache or rather a virtual memory block to the Optane DC PM acting as a paging store. Data is brought into the DRAM cache when accessed using its (virtual) DRAM address and when no longer used. it gets evicted (destaged) back out to Optane DC PM. When power is cycled the data in Optane DC PM is cleared out. Optane DC PM supports AES XTS-256 bit encryption and can easily be cleared by throwing away encryption keys during a power cycle.
  2. App Direct mode – in App Direct mode, Optane DC PM is accessed directly using application APIs and its data persists across power cycles. For App Direct mode, Optane DC PM is still AES 256 encrypted but here the encryption key is maintained across power cycles but is locked on power up and you need to use a pass phrase to unlock it. In this mode, applications are responsible for flushing (L1-L2) caches to Optane to retains all data written through L1-L2 to the Optane DC PM. There’s a GitHub Persistent Memory Development Kit (PMDK) library for that supports the App Direct mode API that applications need to use.

Both modes use DDR-T, (transactional DDR4) a new memory bus protocol for Optane DC PM access. In the DDR-T protocol, access to the memory bus is requested by a CPU and is granted by an Optane DC PM DIMM. All Optane DC PM DIMMs on a system can be accessed in parrallel.

You can use RDMA to replicate (App direct?) Optane DC PM data from one system to another. In order to support Memory and App Direct mode, Optane DC PM required CPU, BIOS and (application) software changes.

Most of the Optane DC PM support and cryptology logic is implemented in hardware. Optane DC PM has an address indirection table (AIT) to support 3D XPoint wear leveling maintained in DRAM but flushed to Optane during power loss. Transfers to 3D XPoint media is in 256 byte cache lines but the memory bus operates in 64 byte cache lines, so there’s a (DRAM) buffer between media and memory bus.

Optane also supports a high availability, or up to two device failure mode. In this scenario, if one Optane DC PM DIMM fails, the system can swap another spare Optane DC PM DIMM into that address space and continue to operate. If a 2nd Optane DC PM fails then the system fails. Not sure what happens to the data on the original Optane DC PM DIMM during a failure. It seems to me data would be lost, but it could depend on its failure mode.

In Memory mode, the expected ratio between DRAM size and Optane DC PM size is should be 32GB DRAM/6TB Optane DC PM. At the TFDx event, the Optane DC PM team had some performance charts showing different DRAM cache miss rates. Intel also announced new CPU monitoring statistics to track application/workloads impacting DRAM/Optane DC PM in Memory mode and to track Optane DC PM health.

Optane DC PM usage modes are established by the BIOS. It’s flexible enough to have the Optane DC PM usage modes be defined on a region by region basis. Not exactly sure what a region is, but it could be an address range spanning MB(?) of Optane DC PM. With both modes in operation on a system, data can be moved from Memory mode Optane to App direct mode Optane or vice versa.

Intel expects that lifetime of an Optane DC PM DIMM to be from 200-370PB of data writes. Optane DC PMs have a 5 year warrantee. Given its bandwidth (3GB/sec), 200PB of data writes should last ~2 years but that’s at 100% duty cycle, writing 3GB of data, every second of every day. So, 5 years should be a reasonable guarantee using a more realistic ~40% duty cycle.

What applications use Optane DC PM

The one of interest to most people seems to be SAP HANA. According to the development team, SAP HANA could use App Direct mode for main database storage and use DRAM for its delta column store. Cassandra could also use Optane in App Direct mode in a similar fashion.

Also something like a REDIS for key-value store could use Optane DC PM to store Values and use DRAM to store Keys.

But any application could take advantage of the extra memory made available with Optane DC PM DIMMs in Memory mode today. Of course any use of Optane DC PM would require the right levels of Intel Xeon CPUs (Cascade Lake), BIOSes and motherboards.

~~~~

Interested in learning more, TFDx videos of the event are available on the website noted previously. Also these TFDx bloggers also have posts specifically on Optane DC PM.

The coolest thing since sliced bread – Optane by Matt Leib, (@MBLeib)

Intel’s crossover point: A 3D spork by Stephen Foskett (@SFoskett)

Intel answering SAP HANA’s tough questions by Keith Townsend (@CTOAdvisor)

Comments?

For data that never rests, NetApp NDAS

NetApp co-founder, Dave Hitz announced he was becoming a NetApp Founder Emeritus at the Storage Field Day (SFD18) show. He gave a great session about what he and his Hitz foundation’s been doing (for one example see our Archeology meets big data, post). He also discussed at length where he felt the storage world (and NetApp) must do to address the opportunities of the new cloud world. But this post isn’t about Dave, it’s about NetApp Data Availability Service, NDAS.

NetApp NDAS, currently in Beta but GAing (hopefully) later this year, is an AWS marketplace data orchestration solution that manages primary to secondary to S3 movement for ONTAP data. Essentially, NetApp Data Availability Services extends ONTAP data lifecycle management to AWS cloud. But it’s more than just a way to archive ONTAP data.

NDAS orchestrates Snapmirror services across ONTAP systems and AWS. But once your ONTAP data is in S3 it supplies access to that data for authorized AWS applications and services. That way one can use their ONTAP data to provide data analytics, train AI models, and do just about anything you can do with AWS applications today. By using NDAS, customers can extract more value from their ONTAP data.

NDAS is not just copying data to S3 but is also copying ONTAP metadata, catalogues and other information that provides context for that data. By copying ONTAP catalog information, customers and authorized end users can have file level access to ONTAP data residing in S3 objects.

NDAS today, only supports copying data from secondary ONTAP systems to S3. But a future enhancement will expand this to copy primary ONTAP data to S3.

How does NDAS work

NDAS provisions (your) EC2 instances, and middleware to read the data from the secondary systems and copy it to S3 buckets which you provide. NDAS after initial configuration to point to your ONTAP secondary storage systems, will autodiscover all the data available that can be copied to the cloud.

NDAS will start cataloguing your ONTAP data. NDAS EC2 instances support the NDAS copy, view and a Google-like search processes.

NDAS search presents a simplified file system view into your ONTAP data copied to S3. That way customers can identify data that could be used for AI training or data analytics that run in the cloud to access the data.

There’s extensive security to insure that NDAS is properly authorized to access your ONTAP data. Normal S3 security options also apply such as to have the data be encrypted on S3. NDAS data is automatically encrypted in flight.

Moreover, NDAS S3 bucket data can be replicated across AWS regions . Also serverless/lambda funationality are fully supported from or NDAS S3 buckets. .

What can it do with the data

AWS applications can access the data directly through NDAS APIs. Or customers can manually extract data they want to further process using the NDAS GUI to identify and copy data of interests. NDAS essentially creates a small app layer that allows users to view and access the ONTAP data in S3 as a file system.

One can have different NDAS AMIs operating in different regions for faster access or to support GDPR compliance requirements. Alternatively, a customer could have one NDAS AMI accessing all their secondary ONTAP instances.

NDAS is intended to provide a data analyst or IT generalist access to ONTAP data. This way AI training and big data analytics applications which run easily in the cloud, can have access to ONTAP data. In this way, customers can more effectively utilize data that IT has been storing and maintaining, since time began.

One NDAS beta customer is a MLB team. They have over time instrumented their stadiums to generate lot’s of data about pitch speed, rotation, ball location as it crosses the plate, etc.   The problem with all this data is siloed in onprem or IOT systems that generated it. But the customer wants to use the data to improve players, coaches and the viewer experience. And all that needs tools, applications and software that’s just not available to run in the data center. But with NDAS all this data is now available to cloud applications.

NDAS is supported by any ONTAP 9.5 or later (FAS, AFF, Cloud ONTAP, ONTAPselect) secondary storage system. ONTAP 9.5 software contains all the services required to support NDAS. This includes the copy-to-cloud APIs, as well as the NDAS proxy, which supplies the secure interface to NDAS operating in the cloud.

NetApp’s NDAS sessions are pretty informative. Anyone interested in finding out more should checkout the videos available on TechFieldDay website and Dave’s session is also worth a view.

For more information on Dave’s session and NDAS check out:

NetApp, Cloudier than ever by Enrico Signoretti (@ESignoretti)

NetApp and the space in between by Dan Frith (@PenguinPunk)

~~~~

Comments?

IT in space

Read an article last week about all the startup activity that’s taking place in space systems and infrastructure (see: As rocket companies proliferate … new tech emerges leading to a new space race). This is a consequence of cheap(er) launch systems from SpaceX, Blue Origin, Rocket Lab and others.

SpaceBelt, storage in space

One startup that caught my eye was SpaceBelt from Cloud Constellation Corporation, that’s planning to put PB (4X library of congress) of data storage in a constellation of LEO satellites.

The LEO storage pool will be populated by multiple nodes (satellites) with a set of geo-synchronous access points to the LEO storage pool. Customers use ground based secure terminals to talk with geosynchronous access satellites which communicate to the LEO storage nodes to access data.

Their main selling points appear to be data security and availability. The only way to access the data is through secured satellite downlinks/uplinks and then you only get to the geo-synchronous satellites. From there, those satellites access the LEO storage cloud directly. Customers can’t access the storage cloud without going through the geo-synchronous layer first and the secured terminals.

The problem with terrestrial data is that it is prone to security threats as well as natural disasters which take out a data center or a region. But with all your data residing in a space cloud, such concerns shouldn’t be a problem. (However, gaining access to your ground stations is a whole different story.

AWS and Lockheed-Martin supply new ground station service

The other company of interest is not a startup but a link up between Amazon and Lockheed Martin (see: Amazon-Lockheed Martin …) that supplies a new cloud based, satellite ground station as a service offering. The new service will use Lockheed Martin ground stations.

Currently, the service is limited to S-Band and attennas located in Denver, but plans are to expand to X-Band and locations throughout the world. The plan is to have ground stations located close to AWS data centers, so data center customers can have high speed, access to satellite data.

There are other startups in the ground station as a service space, but none with the resources of Amazon-Lockheed. All of this competition is just getting off the ground, but a few have been leasing idle ground station resources to customers. The AWS service already has a few big customers, like DigitalGlobe.

One thing we have learned, is that the appeal of cloud services is as much about the ecosystem that surrounds it, as the service offering itself. So having satellite ground stations as a service is good, but having these services, tied directly into other public cloud computing infrastructure, is much much better. Google, Microsoft, IBM are you listening?

Data centers in space

Why stop at storage? Wouldn’t it be better to support both storage and computation in space. That way access latencies wouldn’t be a concern. When terrestrial disasters occur, it’s not just data at risk. Ditto, for security threats.

Having whole data centers, would represent a whole new stratum of cloud computing. Also, now IT could implement space native applications.

If Microsoft can run a data center under the oceans, I see no reason they couldn’t do so in orbit. Especially when human flight returns to NASA/SpaceX. Just imagine admins and service techs as astronauts.

And yet, security and availability aren’t the only threats one has to deal with. What happens to the space cloud when war breaks out and satellite killers are set loose.

Yes, space infrastructure is not subject to terrestrial disasters or internet based security risks, but there are other problems besides those and war that exist such as solar storms and space debris clouds. .

In the end, it’s important to have multiple, non-overlapping risk profiles for your IT infrastructure. That is each IT deployment, may be subject to one set of risks but those sets are disjoint with another IT deployment option. IT in space, that is subject to solar storms, space debris, and satellite killers is a nice complement to terrestrial cloud data centers, subject to natural disasters, internet security risks, and other earth-based, man made disasters.

On the other hand, a large, solar storm like the 1859 one, could knock every data system on the world or in orbit, out. As for under the sea, it probably depends on how deep it was submerged!!

Photo Credit(s): Screen shots from SpaceBelt youtube video (c) SpaceBelt

Screens shot from AWS Ground Station as a Service sign up page (c) Amazon-Lockheed

Screen shots from Microsoft’s Under the sea news feature (c) Microsoft

Learning machine learning – part 3

Image of the cover of the book Deep Learning with Python

Decided to take the plunge and purchase the Deep Learning with Python book and see what it has to offer. In prior posts (see Learning machine learning – part 1 & part 2) we were working with the cloud tutorials. This Part one is based on the book

It has a great introduction into deep learning which is a subset of machine learning. After what I know today, the Microsoft Azure session was more on traditional (statistical) machine learning and not deep learning.

 

Installing deep learning

In order to use the book, you need access to Keras, Python, Jupyter and a Keras backend (TensorFlow, Microsoft CNTK or Theanno).

I decided not to use any cloud solutions and rather install Python, Jupiter, TensorFlow and Keras on my MacBook. Although it probably would have been much easier (and more costly) to use any cloud solution.

I followed the directions on the installing TensorFlow website for the PIP install (you have to install a “virtual environment” and “PIP” first). The MacBook didn’t have a NVIDIA GPU so I needed to install the CPU version of TensorFlow.

But I had the hardest time running any of the book examples. Whenever I changed any command cell in a Jupyter notebook with Keras functionality in them (like adding a space to the end of an “import Keras” command line), it would throw a (module not found) error.

After days of web searching for what path is used for Jupyter notebook-iPython/Python imports (sys.path and PYTHONPATH) and where I should be importing Keras from (it’s not “~/ .keras”), I got nowhere closer to running anything.

I finally saw that I could directly install Keras (again, when I installed Tensorflow, it installed Keras as well) into my VENV. After I did that, everything worked. (I probably have one too many Keras environments, but who cares).

Finally getting the environment correct, I could now execute any command cells in a Jupyter notebook (with Keras functionality properly, well most of them anyways).

Jupyter notebooks for dummies…

It took me a while to figure out that the way you run a Jupyter notebook server is by issuing the command “jupyter notebook” (nowhere in the command’s help file, but can be found in Jupyter tutorials). That’s when I started to see the problems in the installation section above with my Keras installation.

Understanding Jupyter notebooks is non-trivial. Yes, I know it’s an interactive code and documentation environment. It’s sort of like BASIC on steroids with WORD functionality built in/escapeable into at any time.

First thing to understand is that when you open up a jupyter notebook, you haven’t executed anything yet. YES there are output lines in the notebook you just opened but NO, they aren’t from executing them under your client-server environment.

The output lines you see in the notebook is output from someone else’s execution run. So while they may look like they worked fine but they haven’t executed in your installation environment yet..

Also, when executing Jupyter notebook command cells, pay special attention to the In [?]: that’s shown to the right of every command cell.

When the ‘?’ is a number, like In:[12] that tells you what sequence (12th in the sequence) that (multi-line command cell) has been executed in and when the ‘?’ a “*”, like In[*], it says that the Jupyter notebook server is executing that command cell. 

Some command cells generate Out [?]: lines and others do not. So can’t use this to tell if something’s been executed or not. The only way to tell if some command cell has been executed is by seeing the In [n]: integer as n be incremented from the last command cell you executed. Of course you can execute command cells out of sequence if you wish.

Jupyter notebook coding/executing was weird as one who is more used to C, Java, and other coding languages and IDEs. A video tutorial on Jupiter notebooks would probably have helped here, but I couldn’t find one.

Running the examples

You can download all of the books current examples from the book’s website.

The book suggests you add model layers, subtract model layers and change the parameters of the number of nodes in a model as examples for you to try at home.

In general, doing so (once the environment was setup properly) seemed to work as desired. Adding layers didn’t seem to change the accuracy of the models, if anything it degraded it, and deleting layers didn’t help either. Ditto for adding or reducing node counts within a layer.

There’s a bunch of datasets that comes with Keras install used in the examples. Many examples have a first step where you modify this data so as to be more amenable to deep learning modeling.

For example, there’s a IMDB dataset that has film reviews. The film reviews are text files. But deep learning doesn’t work on text strings so you need to convert the text files into lists of integers. You do this by looking up each word in a word dictionary and substituting the index for each word in the review, generating an array (list) of integers.

This is all done through the NumPy package. It’s worth the time to become familiar with Python and probably NumPy. I took the verbal Python tutorial (but did nothing to learn NumPy).

Another example is a real estate prediction model that has 13 different parameters across 500 or so neighborhoods. The parameters are all different, some are distances, some %s, some pricing differences, etc. In order to perform deep learning on them, the example normalizes all of them, using distance from mean, in units of standard deviation.

There are other examples of data transformations as well. It seems that transforming your data into something amenable to deep learning is one part of the magic of deep learning.

Back to the book

Getting through chapter 3 of the book i- fairly straightforward when everything is set up properly. I found a iPad app (Juno) that could be used to connect to the Jupyter Server and it seemed to work once I found the proper command to use to start Jupyter (jupyter notebook –ip=”*”) and the proper Jupyter configuration parameters to use.

The examples are pretty self-documenting so you should be able to try out any of them on your own. The book adds great explanations on machine learning, deep learning and and the overall flow of how to approach a deep learning project.

Once you finish chapter 4 of the book you have all the tools one needs to tackle any deep learning project that you want to attack. You may need to read up on how to transform your data and you will probably be using one of the modeling techniques in one of the examples but it seems easy enough.

The rest of the book’s chapters (which I have yet to complete) deal with deep learning in practice and it’s in these chapters that you can learn some of the art of deep learning data science and model science..

~~~~

I ended up having fun with Jupyter notebooks, once I got them running with the iPad client in one hand and the book in the other. At the end of chapter 4, I startedto see some applications to my consulting business that might be interesting to model.

Using the Mac CPU was fast enough for the examples but I may have to tear down the crypto mine and use it as an AI server for my home network if I plan to tackle something with more data.

Wish me luck…

Learning Machine Learning – part 2

In Learning Machine Learning – part 1, we covered AWS and GCP tutorials on machine learning within each of their clouds. In part 2 we cover Microsoft tutorial(s) on machine learning in Azure.

I found Machine Learning Jump Start in Microsoft Visual Academy with instructors, Buck Woody and Seayoung Rhee. This is a series of 4 video tutorials on Azure ML Studio. ML studio seems similar to AWS SageMaker as it’s a framework to perform machine learning.

Azure (and probably AWS & GCP) have a number of methods to perform machine learning. ML studio happens to be the one that I found but there are many others worth examining.

Azure’s ML Studio tutorial videos were a better than AWS but not as good as GCP IMHO for learning machine learning.  There are four videos in the series. I watched  the first (~45 minutes), the second  (~45 minutes) and most of the third (only 25 of ~45 minutes).

Video 1 Concepts and setting up a ML Studio account

In the first video, the instructors took a long time to get going and then when  got someplace interesting, it was all play acting (human as a machine learner) to teach concepts.

The tutorials do distinguish between Supervised learning and Unsupervised learning. Both of which can apply to prediction or classification types of problems or outcomes. These are discussed as classic machine learning characteristics.

 

In the last 1/3 of the first video they discuss Azure ML Studio. It provides a common place to work and collaborate across team members. It also provides a graphical approach to machine learning. ML Studio also supports a programmable API, but I never got to that section in my viewing.

Some Azure ML Studio strengths:

  1. It provides industry recognized  data sets and data science algorithms that can be used as a black box, such as recommendation engines.
  2. It allows you to publish and consume machine learning solutions.

On the Azure portal there’s a machine learning studio icon (it’s now buried under the 100+ services link, in the AI + Machine Learning section). You use this to create a new ML studio workspace.

Inside a workspace you can use Azure ML studio services.  In the workspace you can review all your experiments (these are algorithms or predictive models being worked). 

In the Experiments page you can create a new experiment which is sort of a graphical workflow of the machine learning task.

There you will find a list of Azure sample data sets and sample algorithms that can be used in your experiment. The first video didn’t go into much detail on any of this other than showing you how to get started and create a ML studio workspace.

Video 2 how to use ML studio

Video 2 takes your ML studio workspace and runs a rudimentary experiment with it. In this video they walk you through selecting a data set, selecting algorithms to use and how to connect them into a machine learning workflow.

Creating an ML Studio experiment is almost like flowcharting your workflow. You select the data you want and drop it into the workflow. Next select an extraction engine you want to use and drop it into the work flow and connect it to the data. Then. you identify what you want to do with the data (like training) and drop that algorithm into the workflow and so on.  In the end you have defined a sequence of actions to perform on data.

In their example, dataset they use a user movie ratings dataset. They connect this to a bayesian learning model and to a IMDB database to extract movie titles. The tutorial experiment is a movie recommendation engine.Although it wasn’t a neural net many of the same techniques apply.

 

ML Studio uses an intuitive graphical approach to defining a machine learning workflow.

Video 3 publishing your ML Studio web service

Video 3 shows you how to publish (on Azure’s Marketplace) the recommendation engine created in video 2 as an OData web service.

I stopped watching the 3rd video after about 25 minutes as it was setting up various aspect of the OData web service to be deployed on Azure marketplace.

Using Azure ML studio seemed pretty straightforward. But it was much more data science/data analytics activity than neural network training.

The Azure MVA ML Studio tutorial was created in 2014 so some of the concepts are a bit dated but most still apply.

Looking today on the Azure Portal, I was still able to find the ML studio workspaces under one of the 10 AI + Machine Learning services.  Again I would have to say the GCP tutorial was a better fit for what I wanted which was how do I create a neural  net and get it trained.

Other ML approaches under Azure

There are other Azure approaches to machine learning and tutorials that support them. For example, there’s a quick start tutorial to understand how to use Python and Jupyter notebooks under Azure, which is probably closer to the neural net training in GCP.

I found myself skipping ahead a lot in video 1 as it was mainly about concepts and not much technical detail. Video 2 was a good intro into ML studio and Video 3 showed you how to publish a ML studio web service in Azure but it was more details than I wanted to know. I never got to video 4, which probably talked about ML Studio’s programable API.

If I had to do it over again, I probably would have viewed the quick start tutorial with Python and Jupyter notebooks, which sounded more like the GCP tutorials in the part 1 post.

On the other hand, Azure ML Studio tutorials supplied a good complement to the GCP tutorial, as a different (more graphical) way to do ML. It would probably be worthwhile to view before taking the AWS Sagemaker tutorials as it’s a bit higher level and quicker introduction into the workflow of AI and machine learning.

Comments?

Picture credit(s): Screen shots of Videos 1, 2 and 3 in the MVA series, (c) Microsoft 

UK Biobank & the data economy – part 2

A couple of weeks back I wrote a post about repositories for all the data that users generate these days and what to do with it.  (See our post on Data banks, deposits, … data economy – part 1).

This past week I read an article (see ScienceDaily Genetics of brain structure … article) which partially exemplifies what that post talked about. The research used publicly available genetic information to tease out brain structure hereditary characteristics.  The Science Daily article was a summary of research done at the University of Oxford using information provided from the UK Biobank.

Biobank as a data bank

The Biobank has recruited 500K participants from the UK,  aged 40-69,  between 2006-2010, to share their anonymized health data with researchers and scientist around the world. The Biobank is set up as a Scottish charity, funded by various health organizations in UK both gov’t and private. 

In addition to information collected during the baseline assessment: 

  • 100K participants have worn a 24 hour health monitoring device for a week and 20K have signed up to repeat this activity.
  • 500K participants are providing have been genotyped (DNA sequencing to determine hereditary genes)
  • 100K participants will be medically scanned (brain, heart, abdomen, bones, carotid artery) with images stored in the Biobank
  • 100K participants have signed up to receive questionnaires asking  about diet, exercise, work history, digestive health and other medical indicators..

There’s more. Biobank is linking to electronic health records (EHR) of participants to track their health over time. The Biobank is also starting to provide blood analysis and other detailed medical measures of subjects in the study.

UK Biobank (data bank) information uses

“UK Biobank is an open access resource. The Resource is open to bona fide scientists, undertaking health-related research that is in the public good. Approved scientists from the UK and overseas and from academia, government, charity and commercial companies can use the Resource. ….” (from UK Biobank scientists page).

Somewhat like open source code, the Biobank resource is made available to anyone (academia as well as industry), that can make valid use of its data BUT any research derived from its data must be published and made freely available to the Biobank and the world.

Biobank’s papers page documents some of the research that has already been published using their data. It lists the paper on genetics of brain study mentioned above and dozens more.

Differences from Data Banks

In the original data bank post:

  1. We thought data was only needed by  AI/deep learning. That seems naive now. The Biobank shows that AI/deep learning is not the only application/research that needs data.
  2. We thought data would be collected by only by hyper-scalars and other big web firms during normal user web activity. But their data is not the only data that matters.
  3. We thought data would be gathered for free. Good data can take many forms, and some may cost money.
  4. We thought profits from selling data would be split between the bank and users and could fund data bank operations. But in the Biobank, funding came from charitable contributions and data is available for free (to valid researchers).

Data banks can be an invaluable resource and may take many forms. Data that’s difficult to find can be gathered by charities and others that use funding to create, operate and gather the specific information needed for targeted research.

Comments?

Photo Credit(s): Bank on it by Alan Levine

Latest MRI – two screws in the kneecap by Becky Stern

Other graphics from the Genetics of brain structure… paper

 

 

Data banks, data deposits & data withdrawals in the data economy – part 1

perspective by anomalous4 (cc) (from Flickr)

Big data visualization, Facebook friend connections
Facebook friend carrousel by antjeverena (cc) (from flickr)

Read an interesting article this week in The Atlantic, Why Technology Favors Tyranny by Yuvai Noah Harari, about the inevitable future of technology and how the use of data will drive it.

At the end of the article Harari talks about the need to take back ownership of our data in order to gain some control over the tech giants that currently control our data.

In part 3, Harari discusses the coming AI revolution and the impact on humanity. Yes there will still be jobs, but early on less jobs for unskilled labor and over time less jobs for skilled labor.

Yet, our data continues to be valuable. AI neural net (NN) accuracy increases as a function of the amount of data used to train it. As a result,  he has the most data creates the best AI NN. This means our data has value and can be used over and over again to train other AI NNs. This all sounds like data is just another form of capital, at least for AI NN training.

If only we could own our data, then there would still be value from people’s (digital) exertions (labor), regardless of how much AI has taken over the reigns of production or reduced the need for human work.Safe by cjc4454 (cc) (from flickr)

Safe by cjc4454 (cc) (from flickr)What we need is data (savings) banks. These banks would hold people’s data, gathered from social media likes/dislikes,  cell phone metadata, app/web history, search history, credit history, purchase history,  photo/video streams, email streams, lab work, X-rays, wearables info, etc. Probably many more categories need to be identified but ultimately ALL the digital data we generate today would need to be owned by people and deposited in their digital bank accounts.

Data deposits?

Social media companies, telecom, search companies, financial services app companies, internet  providers, etc. anywhere you do business should supply a copy of the digital data they gather for a person back to that persons data bank account.

There are many technical problems to overcome here but it could be as simple as an object storage bucket, assigned to each person that each digital business deposits (XML versions of) our  digital data they create for everyone that uses their service. They would do this as compensation for using our data in their business activities.

How to change data ownership?

Today, we all sign user agreements which essentially gives a company the rights to our data in perpetuity. That needs to change. I see a few ways that this change could come about

  1. Countries could enact laws to insure personal data ownership resides in the person generating it and enforce periodic distribution of this data
  2. Market dynamics could impel data distribution, e.g. if some search firm supplied data to us, we would be more likely to use them.
  3. Societal changes, as AI becomes more important to profit making activities and reduces the need for human work, and as data continues to be an important factor in AI success, data ownership becomes essential to retaining the value of human labor in society.

Probably, all of the above and maybe more would be required to change the ownership structure of data.

How to profit from data?

Technical entities needing data to train AI NNs could solicit data contributions through an Initial Data Offering (IDO). IDO’s would specify types of data required and a proportion of AI NN ownership, they would cede to all  data providers. Data providers would be apportioned ownership based on the % identified and the number of IDO data subscribers.

perspective by anomalous4 (cc) (from Flickr)
perspective by anomalous4 (cc) (from Flickr)

Data banks would extract the data requested by the IDO and supply it to the IDO entity for use. For IDOs, just like ICO’s or IPO’s, some would fail and others would succeed. But the data used in them would represent an ownership share sort of like a  stock (data) certificate in the AI NN.

Data bank responsibilities

Data banks would have various responsibilities and would need to collect fees to perform them. For example, data banks would be responsible for:

  1. Protecting data deposits – to insure data deposits are never lost, are never accessed without permission, are always trackable as to how they are used..
  2. Performing data deposits – to verify that data is deposited from proper digital entities, to validate that data deposits are in a usable form and to properly store the data in a customers object storage bucket.
  3. Performing data withdrawals – upon customer request, to extract all the appropriate data requested by an IDO,  anonymize it, secure it, package it and send it to the IDO originator.
  4. Reconciling data accounts – to track data transactions, data banks would supply a monthly statement that identifies all data deposits and data withdrawals, data revenues and data expenses/fees.
  5. Enforcing data withdrawal types – to enforce data withdrawal types, as data  withdrawals can have many different characteristics, such as exclusivity, expiration, geographic bounds, etc. Data banks would need to enforce withdrawal characteristics, at least to the extent they can
  6. Auditing data transactions – to insure that data is used properly, a consortium of data banks or possibly data accountancies would need to audit AI training data sets to verify that only data that has been properly withdrawn is used in trying the NN. .

AI NN, tools and framework responsibilities

In order for personal data ownership to work well, AI NNs, tools and frameworks used today would need to change to account for data ownership.

  1. Generate, maintain and supply immutable data ownership digests – data ownership digests would be a sort of stock registry for the data used in training the AI NN. They would need to be a part of any AI NN and be viewable by proper data authorities
  2. Track data use – any and all data used in AI NN training should be traceable so that proper data ownership can be guaranteed.
  3. Identify AI NN revenues – NN revenues would need to be isolated, identified and accounted for so that data owners could be rewarded.
  4. Identify AI NN data expenses – NN data costs would need to somehow be isolated, identified and accounted for so that data expenses could be properly deducted from data owner awards. .

At some point there’s a need for almost a data profit and loss statement as well as a data balance sheet for at an AI NN level. The information supplied above should make auditing data ownership, use and rewards much more feasible. But it all starts with identifying data ownership and the data used in training the AI.

~~~~

There are a thousand more questions that come to mind. For example

  • Who owns earth sensing satellite, IoT sensors, weather sensors, car sensors etc. data? Everyone in the world (or country) being monitored is laboring to create the environment sensed by these devices. Shouldn’t this sensor data be apportioned to the people of the world or country where these sensors operate.
  • Who pays data bank fees? The generators/extractors of the data could pay in addition to providing data deposits for the privilege to use our data. I could also see the people paying.  Having the company pay would give them an incentive to make the data load be as efficient and complete as possible. Having the people pay would induce them to use their data more productively.
  • What’s a decent data expiration period? Given application time frames these days, 7-15 years would make sense. But what happens to the AI NN when data expires. Some way would need to be created to extract data from a NN, or the AI NN would need to cease being used and a new one would  need to be created with new data.
  • Can data deposits be rented/sold to data aggregators? Sort of like a AI VC partnership only using data deposits rather than money to fund AI startups.
  • What happens to data deposits when a person dies? Can one inherit a data deposits, would a data deposit inheritance be taxable as part of an estate transfer?

In the end, as data is required to train better AI, ownership of our data makes us all be capitalist (datalists) in the creation of new AI NNs and the subsequent advancement of society. And that’s a good thing.

Comments?

 

 

Marketing meet Big Data, call records, credit card purchases & demographics

Read an article in Science Daily (Understanding urban issues through credit cards) that talked about a study published in Nature (Sequences of purchases in credit card data reveals lifestyles of urban populations) that applies big data to B2C marketing.

The researchers examined call data records (CDRs), credit card transactions records (CCRs) and demographic (age, sex, residential zip code, wage level, etc.) data and did a cross table between them to identify sequences of purchases. They then used these sequences to identify different lifestyle groups in the urban area.

Marketing 2.0

The analyzed data from Mexico City, Mexico. The CCR data was collected for 10 weeks across 150K users. The had CDR data for 1/10th of the users for 6 months surrounding the 10 weeks duration. Credit card adoption is still low in Mexico (18%), so the analysis may be biased.  When thy matched CCR expenditures against median wages in a district and they found their participants came from higher wage populations. Their data also spanned all districts within the city.

The analysis identified sequences of purchase categories as well as expenditures.  They characterized purchase sequences as “words”.

 

 

 

Using the word data and further statistical analysis they were able to split the population up into 5 distinct lifestyle groups. 

The loops of icons above represent major purchase categories derived from the CCR data merchant category codes (MCC).  Each of the rings in “a” above show the same 12 major MCC purchase categories. If you look at each ring, one can identify a central or core node that seems to have the most incoming or outgoing arks. These seem to be the central purchases made by that lifestyle group after which they branch out to other purchase categories.

There are five different lifestyle categories (they also show the city average) delineated in the data:

  • Commuter – generally they have to pay tolls, have longer travel between home and work and have a diverse sequence of purchase that occurs after purchases from the toll category.
  • Household – purchases seem to center on grocery stores/supermarkets and then branch off from there.
  • Young – purchases seem to center on the taxicab category and then go to computer-networking, restaurants, grocery stores/supermarkets.
  • Hi-Tech – purchases seem to center on computer-networking,  then go to gas stations, grocery stores/supermarkets, restaurants, and telecomm.
  • Average – seems to have two focuses grocery stores/supermarkets and restaurants and then goes out from there to gas stations, specialty food stores and department stores.
  • Dinner-out – purchases seem to center on restaurants and then branch out fro there to computer-networking, gas stations, supermarkets, fast food, etc.

In “b”  breakout above, you can see the socio-demographic characteristics of each lifestyle group as compared with the median user. And in “c” one can see some population histograms of the demographic data.

They were then able to use the CDR data to construct a map of which lifestyle called which other life style to identify call correlation data. Most calls were contacts between the same groups but the second most active call was calls to householders.

They took this same analysis to another city in Mexico and came up with six  lifestyle categories, five of the same and a different one.

~~~~

When I went to Uni (a long long time ago), I attended an urban geography class that was much more scientific and mathematical than any other geography class I had ever attended. I remember asking the professor when did geography become an exact science. As best as I can recall, he laughed and said over the last decade.

Analysis like the above could make B2C marketing, almost an exact science.

Big Data meet Marketing – Buyer beware.

Comments?

Photo Credit(s):  All charts/photos are from the Nature article Sequences of purchase in credit card data reveal lifestyles in urban populations