AWS Data Exchange vs Data Banks – part 2

Saw where AWS announced a new Data Exchange service on their AWS Pi Day 2023. This is a completely managed service available on the AWS Marketplace to monetize data.

In a prior post on a topic I called data banks (Data banks, data deposits & data withdrawals…), I talked about the need to have some sort of automated support for personal data that would allow us to monetize it.

The hope then (4.5yrs ago) was that social media, search and other web services would supply all the data they have on us back to us and we could then sell it to others that wanted to use it.

In that post, I called the data the social media gave back to us data deposits, the place where that data was held and sold a data bank, and the sale of that data a data withdrawal. (I know talking about banks, deposits and withdrawals is probably not a great idea right now but this was back a ways).

AWS Data Exchange

1918 Farm Auction by dok1 (cc) (from Flickr)

With AWS Data Exchange, data owners can sell their data to data consumers. And it’s a completely AWS managed service. One presumably creates an S3 bucket with the data to be sold, determines a price for the data and a period that clients can access it for, and registers this with AWS. AWS Data Exchange will then support any number of clients purchasing the data.

Presumably (although unstated in the service announcement), you’d be required to update and curate the data to ensure it’s correct and current, but other than that, once the data is on S3 and the offer is in place you could just sit back and take the cash coming in.

I see the AWS Data Exchange service as a step on the path of data monetization for anyone. Yes it’s got to be on S3, and yes it’s via AWS Marketplace, which means that AWS gets a cut of any sale, but it’s certainly a step towards a freer data marketplace.

Changes I would like to see in the AWS Data Exchange service

Putting aside the need to have more than just AWS offer such a service (and I heartily request that all cloud service providers make a data exchange or something similar a fully supported offering of their respective storage services), this is not quite the complete data economy or ecosystem that I had envisioned in September of 2018.

If we just focus on the use (data withdrawal) side of a data economy, which is the main thing AWS Data Exchange seems to support, there are quite a few missing features IMHO:

  • Data use restrictions – We don’t want customers to obtain a copy of our data. We would very much like to restrict them to reading it and having plain text access to the data only during the period they have paid to access it. Once that period expires, all copies of the data need to be destroyed programmatically, cryptographically or in some other permanent/verifiable fashion. This can’t be done through license restrictions alone, which seems to be AWS Data Exchange’s current approach. I’m not sure what a viable alternative might be. Some sort of time-dependent or temporal encryption key that could be expired would be one step, but customers would need to install some sort of data exchange service, on the servers using the data, that would support encrypted access/use.
  • Data traceability – Yes, clients who purchase access should have access to the data for whatever they want to use it for. But there should be some way to trace where our data ended up or what it was used for. If it’s to help train a NN, then I would like to see some sort of provenance or certificate applied to that NN, in a standardized structure, to indicate that it made use of our data as part of its training. Similarly, if it’s part of an online display tool, somewhere in the footnotes of the UI there would be a data origins certificate list which would have some way to point back to our data as the source of the information presented. Ditto for any application that made use of the data. AWS Data Exchange does nothing to support this. In reality something like this would need standards bodies to create certificates and additional structures for NNs, standard application packages, online services etc. that would retain and provide proof of data origins via certificates.
  • Data locality – there are some jurisdictions around the world which restrict where data generated within their boundaries can be sent, processed or used. I take it that AWS Data Exchange deals with these restrictions by either not offering data under jurisdictional restrictions for sale outside governmental boundaries or gating purchase of the data outside valid jurisdictions. But given VPNs and similar services, this seems to be less effective. If there’s some sort of temporal key encryption service to make use of our data, then it would seem reasonable to add some sort of regional key encryption to it.
  • Data auditability – there needs to be some way to ensure that our data is not used outside the organizations that have actually paid for it. And if there’s some sort of data certificate saying that an application or service that used the data has access to that data, this mechanism needs to be mandated, supported, and validated. In reality, something like this would need a whole re-thinking of how data is used in society. Financial auditing took centuries to take hold and become an effective (sometimes?) tool to monitor against financial abuse. Data auditing would need many of the same sorts of functionality, i.e. Certified Data Auditors, a Data Accounting Standards Board (DASB) which defines standardized reports as to how an entity is supposed to track and report on data usage, governmental regulations which require public (and private?) companies to report on the origins of the data they use on a yearly/quarterly basis, etc.
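To make the temporal and regional key idea a bit more concrete, here is a minimal Python sketch. It is not encryption and it is not anything AWS Data Exchange offers; it only shows the check a customer-side data exchange service might run before releasing plain text. The secret, function names and grant format are all hypothetical, and nothing here destroys copies already made.

```python
import hashlib
import hmac
import time
from typing import Optional

# Hypothetical secret known only to the data owner's exchange service.
OWNER_SECRET = b"demo-secret-change-me"


def _sign(payload: str) -> str:
    """HMAC-SHA256 signature over the grant payload."""
    return hmac.new(OWNER_SECRET, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def issue_grant(buyer_id: str, region: str, expires_at: int) -> str:
    """Return a signed grant of the (assumed) form buyer|region|expiry|signature.

    buyer_id and region must not contain '|' in this toy format.
    """
    payload = f"{buyer_id}|{region}|{expires_at}"
    return f"{payload}|{_sign(payload)}"


def grant_is_valid(grant: str, region: str, now: Optional[int] = None) -> bool:
    """Accept the grant only if it is untampered, unexpired and used in-region."""
    parts = grant.split("|")
    if len(parts) != 4:
        return False
    buyer_id, granted_region, expires_at, signature = parts
    payload = f"{buyer_id}|{granted_region}|{expires_at}"
    if not hmac.compare_digest(signature.encode("utf-8"), _sign(payload).encode("utf-8")):
        return False
    current = int(time.time()) if now is None else now
    return current < int(expires_at) and region == granted_region
```

A real service would still need the hard part, which is making sure the data can only be read through such a check.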

Probably much more that could be added here but this should suffice for now.

Other changes to AWS Data Exchange processes

The AWS Pi Day 2023 announcement didn’t really describe the supplier end of how the service works. How one registers a bucket for sale was not described. I’d certainly want some sort of steganography service to tag the data being sold with the identity of those who purchased it. That way there might be some possibility of tracking who released any data exchange data into the wild.
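As a rough illustration of tagging data with the purchaser’s identity, here is a toy Python sketch that hides a buyer ID in text using zero-width characters. The function names are made up for this example, and the mark is trivially stripped by anyone who looks for it, so treat it as a sketch of the idea rather than a usable steganography service.

```python
ZERO = "\u200b"  # zero-width space, encodes bit 0
ONE = "\u200c"   # zero-width non-joiner, encodes bit 1


def tag_text(text: str, buyer_id: str) -> str:
    """Append an invisible per-buyer mark to a copy of the text."""
    bits = "".join(f"{byte:08b}" for byte in buyer_id.encode("utf-8"))
    mark = "".join(ONE if bit == "1" else ZERO for bit in bits)
    return text + mark


def read_tag(text: str) -> str:
    """Recover the buyer ID from whatever zero-width marks are present."""
    bits = "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))
    usable = len(bits) - len(bits) % 8  # ignore any trailing partial byte
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return data.decode("utf-8", errors="replace")
```

Each purchaser would get their own tagged copy, so a leaked copy could be read back to the buyer it was sold to.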

Also, how the data exchange data access is billed for seems a bit archaic. As far as I can determine one gets unlimited access to data for some defined period (N months) for some specific amount ($s). And once that period expires, customers have to pay up or cease accessing the S3 data. I’d prefer to see at least a GB/month sort of cost structure. That way, if a customer copies all the data they pay for that privilege, and if they want to reread the data multiple times they get to pay for that data access. Presumably this would require some sort of solution to the data use restrictions above to enforce.
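The difference between the two billing approaches is just arithmetic, sketched below in Python. The prices, period lengths and function names are hypothetical, picked only to show how a one-time copy and repeated rereads would be charged differently under metering.

```python
from typing import Sequence


def flat_cost(months: int, price_per_period: float, period_months: int) -> float:
    """Flat subscription: pay per started period, no matter how much is read."""
    periods = -(-months // period_months)  # ceiling division
    return periods * price_per_period


def metered_cost(gb_read_per_month: Sequence[float], price_per_gb: float) -> float:
    """Metered access: pay for every GB read, including rereads."""
    return sum(gb_read_per_month) * price_per_gb
```

Under the flat model a customer who copies everything once and a customer who rereads it every month pay the same; under the metered model the second customer pays more.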

Data banks, deposits, withdrawals and Initial Data Offerings (IDOs)

The earlier post talks about an expanded data ecosystem or economy. And I won’t revisit all that here but one thing that I believe may be worth re-examining is Initial Data Offerings or IDOs.

As described in the earlier post, IDOs were a mechanism for data users to request permanent access to our data but, instead of paying a one time fee for it, they would offer data equity in the service in exchange.

Not unlike VC, each data provider would be supplied some % (data?) ownership in the service and over time data ownership gets diluted at further data raises. But at some point, when the service is profitable, data ownership units could be purchased outright, so that the service could exit its private data use stage and go public (data use).
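For what dilution at further data raises might look like, here is a small Python sketch. It assumes the simplest possible rule, where each new raise takes a fixed fraction and every existing holder is scaled down proportionally; the function and the holder names are hypothetical.

```python
from typing import Dict


def dilute(holdings: Dict[str, float], new_provider: str, new_stake: float) -> Dict[str, float]:
    """Scale existing stakes by (1 - new_stake) and add the new data provider."""
    if not 0.0 <= new_stake < 1.0:
        raise ValueError("new_stake must be in [0, 1)")
    diluted = {name: share * (1.0 - new_stake) for name, share in holdings.items()}
    diluted[new_provider] = diluted.get(new_provider, 0.0) + new_stake
    return diluted
```

For instance, holders at 80% and 20% who accept a new 25% data raise end up at 60% and 15%, with total ownership still summing to 100%.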

Yeah, this all sounds complex, and AWS Data Exchange just sells data once and you have access to it for some period, establishing data usage rights. But I think that in order to compensate users for their data there needs to be something like IDOs that provides data ownership shares in some service that can be transferred (sold) to others.

I didn’t flesh any of that out in the original post but I still think it’s the only way to truly compensate individuals (and corporations) for the (free) use of the data that web, AI and other systems are using to create their services.

~~~~

I wrote the older post in 2018 because I saw the potential for our data to be used by others to create/train services that generate lots of money for those organizations but without any of our knowledge or outright consent and without compensating us for the data we have (inadvertently or advertently) created over our life span.

As an example, one can see how Getty Images is suing Stability AI, claiming it has had free use of Getty’s copyrighted materials to train its AI NN. If one looks underneath the covers of ChatGPT, many image processing/facial recognition services, and many other NNs, much of the data used in training them was obtained by scraping web pages that weren’t originally intended to supply this sort of data to others.

For example, it wouldn’t surprise me to find out that RayOnStorage posts text has been scraped from the web and used to train some large language model like ChatGPT.

Do I receive any payment or ownership equity in any of these services – NO. I write these blog posts partially as a means of marketing my other consulting services but also because I have an abiding interest in the subject under discussion. I’m happy for humanity to read these and welcome comments on them by humans. But I’m not happy to have LLMs or other NNs use my text to train their models.

On the other hand, I’d gladly sell access to RayOnStorage posts text if they offered me a high but fair price for their use of it for some time period say one year… 🙂

Comments?

Societal growth depends on IT

Read an interesting article the other day in ScienceDaily (IT played a key role in growth of ancient civilizations) and a Phys.Org article (Information drove development of early states), both of which were reporting on a Nature article (Scale and information processing thresholds in Holocene social evolution) which discussed how the growth of societies during ancient times was directly correlated to the information processing capabilities they possessed. In these articles IT meant writing, accounting, currency, etc., relatively primitive forms of IT but IT nonetheless.

Seshat: Global History Databank

What the researchers were able to do was to use the Seshat: Global History Databank which “systematically collects what is currently known about the social and political organization of human societies and how civilizations have evolved over time” and use the data to analyze the use of IT by societies.

We have talked about Seshat before (see our Data Analysis of History post).

The Seshat databank holds information on 30 (natural) geographical areas (NGA), ~400 societies and their history from 4000 BCE to 1900 CE.

Seshat has a ~100 page Code Book that identifies what kinds of information to collect on each society, how it is to be estimated, identified, listed, etc. to normalize the data in their databank. Their Code Book provides essential guidelines on how to gather the ~1500 variables collected on societies.

IT drives society growth

The researchers used the Seshat DB and ran a statistical principal component analysis (PCA) of the data to try to ascertain what drove society’s growth.

PCA (see wikipedia Principal Component Analysis article) essentially combines the original variables into a small number of principal components (PC1, PC2, etc.), each a weighted mix of those variables. Each component comes with a percentage (%Var) that says how much of the variance across all the variables it explains. PCA can be one, two, three or N-dimensional.

The researchers took Seshat’s 51 society variables and combined them into 9 (societal) complexity characteristics (CCs) and did a PCA of those variables across all the (285) societies’ information available at the time.
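To give a feel for what a PCA produces, here is a small pure-Python sketch that finds just the first principal component (its loadings and its %Var) of a toy table. This is not the researchers’ code or data; it uses power iteration on the covariance matrix rather than a full decomposition, and the function name is my own.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (loadings, fraction_of_variance) for PC1 of a small data table.

    rows is a list of equal-length numeric lists, one per society (or sample).
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeated multiplication converges on the direction
    # with the largest eigenvalue, i.e. the largest share of variance.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("data has no variance along the starting direction")
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance
```

With two perfectly correlated columns, PC1 explains all of the variance and its loadings point along the line the data falls on. Real data, like the 9 CCs, spreads its variance over several components.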

Fig. 2 says that the average PC1 component of all societies is driven by the changes (increases and decreases) in PC2 components. Decreases of PC2 depend on those elements of PC2 which are negative and increases in PC2 depend on those elements of PC2 which are positive.

The elements in PC2 that provide the largest positive impacts are writing (.31), texts (.24), money (.28), infrastructure (.12) and gvrnmnt (.06). The elements in PC2 that provide the largest negative impacts are PolTerr (polity area, -0.35), CapPop (capital population, -0.27), PolPop (polity population, -0.25) and levels (?, -0.15). Below is another way to look at this data.

The positive PC2 CC’s are tracked with the red line and the negative PC2 CC’s are tracked with the blue line. The black line is the summation of the blue and red lines and is effectively equal to the blue line in Fig 2 above.
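The red, blue and black lines can be thought of as three sums over the PC2 loadings quoted above. Here is a Python sketch of that bookkeeping. The loadings are the ones listed in this post; the CC values fed in are placeholders, and I’m assuming they have been standardized the way a PCA would expect.

```python
# PC2 loadings as quoted in the text above.
PC2_LOADINGS = {
    "writing": 0.31, "texts": 0.24, "money": 0.28,
    "infrastructure": 0.12, "gvrnmnt": 0.06,
    "PolTerr": -0.35, "CapPop": -0.27, "PolPop": -0.25, "levels": -0.15,
}


def pc2_contributions(cc_values):
    """Return (positive part, negative part, total) of a society's PC2 score.

    cc_values maps each CC name to its (assumed standardized) value.
    """
    positive = sum(PC2_LOADINGS[name] * value
                   for name, value in cc_values.items() if PC2_LOADINGS[name] > 0)
    negative = sum(PC2_LOADINGS[name] * value
                   for name, value in cc_values.items() if PC2_LOADINGS[name] < 0)
    return positive, negative, positive + negative
```

The positive part corresponds to the IT-like CCs, the negative part to the size-like CCs, and the total to the combined line.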

The researchers suggest that the inflection points in Fig 2 (and the black line in Fig 3) represent societal information processing thresholds. Once these IT thresholds have been passed, they change the direction that PC2 takes after that point.

In Fig 4 they have disaggregated the information averaged in Fig. 2 & 3 and show PC2 and PC1 trajectories for all 285 societies tracked in the Seshat DB. Over time, as PC1 goes more positive, societies start to converge on effectively the same level of PC2. At earlier times, societies tend to be more heterogeneous with varying PC2 (and PC1) values.

Essentially, societies’ IT processing characteristics tend to start out highly differentiated but over time, as societies grow, IT processing capabilities tend to converge and lead to the same levels of societal growth.

Classifying societies by IT

The Kardashev scale (see wikipedia Kardashev scale article) identifies levels or types of civilizations using their energy consumption. The Kardashev scale lists the types of civilizations as follows:

  • Type I Civilization can use and control all the energy available on its planet.
  • Type II Civilization can use and control all the energy available in its planetary system (its star and all the planets/other objects in orbit around it).
  • Type III Civilization can use and control all the energy available in its galaxy.

I can’t help but think that a more accurate scale for a civilization, society or polity’s level would be a scale based on its information processing power.

We could call this the Shin scale (named after the primary author of the Nature paper or the Shin-Price-Wolpert-Shimao-Tracy-Kohler scale). The Shin scale would list societies based on their IT levels.

  • Type A Societies have non-existent IT (writing, money, texts & infrastructure) which severely limits their population and territorial size.
  • Type B Societies have primitive forms of IT (writing, money, texts & infrastructure, ~MB (10**6) of data) which allows these societies to expand to their natural boundaries (with a pop of ~10M).
  • Type C Societies have normal (2020) levels of IT (world wide Internet with billions of connected smart phones, millions of servers, ZB (10**21) of data, etc.) which allows societies to expand beyond their natural boundaries across the whole planet (pop of ~10B).
  • Type D Societies have high levels of IT (speculation here but quintillion connected smart dust devices, trillion (10**12) servers, 10**36 bytes of data) which allows societies to expand beyond their home planet (pop of ~10T).
  • Type E Societies have even higher levels of IT (more speculation here, 10**36 smart molecules, quintillion (10**18) servers, 10**51 bytes of data) which allows societies to expand beyond their home planetary system (pop of ~10Q).
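Going by the data volumes in the list above, each type is 10**15 times the one before it. A small Python sketch of the scale follows, under the assumption (mine, not the Nature paper’s) that each data volume is treated as the minimum needed to reach that type.

```python
# (type, minimum bytes of data), highest first. Treating these figures as
# thresholds is an assumption made for this sketch.
SHIN_SCALE = [("E", 10**51), ("D", 10**36), ("C", 10**21), ("B", 10**6)]


def shin_type(bytes_of_data: int) -> str:
    """Classify a society by how much data its IT holds."""
    for label, threshold in SHIN_SCALE:
        if bytes_of_data >= threshold:
            return label
    return "A"
```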

I’d list Type F societies here but I can’t think of anything smaller than a molecule that could potentially be smart; perhaps this signifies a lack of imagination on my part.

Comments?


New science used to combat COVID-19 disease

Read an article last week in Science Magazine (A completely new culture on doing research… ) on how the way science is done to combat disease has changed over the last few years.

In the olden days (~3-5 years ago), disease outbreaks would generate a slew of research papers to be written, submitted for publication and there they would sit, until peer-reviewed, after which they might get published for the world to see for the first time. Estimates I’ve seen say that the scientific research publishing process takes anywhere from one month (very fast) to 4-8 months, assuming no major revisions are required.

With the emergence of the Zika virus and recent Ebola outbreaks, more and more biological research papers have become available through pre-print servers. These are web-sites which accept any research before publication (pre-print), posting the research for all to see, comment and understand.

Open science via pre-print

Most of these pre-print servers focus on specific areas of science. For example bioRxiv is a pre-print server focused on Biology and medRxiv is for health sciences. On the other hand, arXiv is a pre-print server for “physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.” These are just a sampling of what’s available today.

In the past, scientific journals would not accept research that had been published before. But this has slowly changed as well. Now most scientific journals have policies on pre-print publication and will also publish pre-printed papers if they deem them worthwhile (see wikipedia article List of academic journals by pre-print policies).

As of today (9 March 2020), on bioRxiv there are 423 papers with the keyword “coronavirus” and 52 papers with the keyword “COVID-19”; some of these may be the same. The newest (Substrate specificity profiling of SARS-CoV-2 Mpro protease provides basis for anti-COVID-19 drug design) was published on 3/7/2020. The last sentence in their abstract says “The results of our work provide a structural framework for the design of inhibitors as antiviral agents or diagnostic tests.” The oldest on bioRxiv is dated 23 January 2020. Similarly, there are 326 papers on medRxiv with the keyword “coronavirus”, the newest published 5 March 2020.

Pre-print research is getting out in the open much sooner than ever before. But the downside is that pre-print papers may have serious mistakes or omissions in them as they are not peer-reviewed. So the cost of rapid openness is the possibility that some research may be outright wrong, badly done, or lead researchers down blind alleys.

However, the upside is any bad research can be vetted sooner, if it’s open to the world. We see similar problems with open source software, some of it can be buggy or outright failure prone. But having it be open, and if it’s popular, many people will see the problems (or bugs) and fixes will be rapidly created to solve them. With pre-print research, the comment list associated with a pre-print can be long and often will identify problems in the research.

Open science through open journals

In addition to pre-print servers, we are also starting to see the increasing use of open scientific journals such as PLOS to publish formal research.

PLOS has a number of open journals focused on specific arenas of research, such as PLOS Biology, PLOS Pathogens, PLOS Medicine, etc.

Researchers or their institutions have to pay a nominal fee to publish in PLOS, but all PLOS publications are fully expert, peer-reviewed. And unlike research from say Nature, IEEE or other scientific journals, PLOS papers are free to anyone, and are widely available. (However, I just saw that SpringerNature is making all their coronavirus research free).

Open science via open data(sets)

Another aspect of scientific research that has undergone change of late is the sharing and publication of data used in the research.

Nature has a list of recommended data repositories. All these data repositories seem to be hosted by FAIRsharing at the University of Oxford and run by their Data Readiness Group. They list 1349 databases of which the vast majority (1250) are for the natural sciences with over 1380 standards used for data to be registered with FAIRsharing.

We’ve discussed similar data repositories in the past (please see Data banks, data deposits and data withdrawals, UK BioBank, Big open data leads to citizen science, etc). Having a place to store data used in research papers makes it easier to understand and replicate science.

Collaboration software

The other change to research activities is the use of collaborative software such as Slack. Researchers at UW Madison were already using Slack to collaborate on research but when Coronavirus went public, they thought Slack could help here too. So they created a group (or channel) under their Slack site called “Wu-han Clan” and invited 69 researchers from around the world. The day after they created it they held their first teleconference.

Other collaboration software exists today but Slack seems most popular. We use Slack for communications in our robotics club, blogging group, a couple of companies we work with, etc. Each has a number of invite-only channels, where channel members can post text, (data) files, links and just about anything else of interest to the channel.

Although I have not been invited to participate in Wu-han Clan (yet), I assume they use Slack to discuss and vet (pre-print) research, discuss research needs, and other ways to avert the pandemic.

~~~~

So there you have it. Coronavirus scientific research is happening at warp speed compared to diseases of yore. Technologies to support this sped up research have all emerged over the last five to 10 years but are now being put to use more than ever before. Such technological advancement should lead to faster diagnosis, lower worldwide infection/mortality rates and a quicker medical solution.


Data analysis of history

Read an article the other day in The Guardian (History as a giant data set: how analyzing the past could save the future), which talks about this new discipline called cliodynamics (see wikipedia cliodynamics article). There was a Nature article (in 2012), Human Cycles: History as Science, which described cliodynamics in a bit more detail.

Cliodynamics uses mathematical systems theory on historical data to predict what will happen in the future for society. According to The Guardian and Nature articles, the originator of cliodynamics, Peter Turchin, predicted in 2010 that the world would change dramatically for the worse over the coming decade, with violence peaking in 2020.

What is cliodynamics

Cliodynamics depends on vast databases of historical data that have been amassed over the last decade or so. For instance, there’s the Seshat Global History Databank (started in 2011, has 3 datasets: moralizing gods, axial age history [8th to 3rd cent. BCE], & social complexity), the International Institute of Social History (est. 1935, in 2013 re-organized their collection to focus on data, has 33 dataverses ranging from data on apprenticeships, prices and wage history, strike history of various countries and time periods, etc.), and the Google NGRAM viewer (started in 2010, provides keyword statistics on Google Books).

Cliodynamics uses the information from databases like the above to devise a mathematical model of the history of the world. From their mathematical model, cliodynamics researchers have discerned patterns or cycles in human endeavors that have persisted over centuries.

Cliodynamic cycles

Two cycles of interest come to mind:

  • Secular cycle – this plays out over 2-3 centuries and starts out with a new egalitarian society that has low levels of inequality where the supply and demand for labor are roughly equal. Over time as population grows, the supply of labor outstrips demand and inequality increases. Elites then start to battle one another, and war and political instability result in a new, more equal society, re-starting the cycle.
  • Fathers and sons cycle – this plays out over 50 years and starts when the (fathers) generation responds violently to social injustice and the next (sons) generation resigns itself to injustice (or hopefully resolves it) until the next (fathers) generation sees injustice again and erupts violently, re-starting the cycle over again.

It’s this last cycle that Turchin predicted to peak again in 2020, the last one peaking in 1970 and the ones before that peaking in 1920 and 1870.
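The arithmetic behind that prediction is just a fixed 50 year period, which can be sketched in a couple of lines of Python. The function name is mine, and real cliodynamic models are of course far richer than a fixed-period clock.

```python
def cycle_peaks(first_peak: int, period_years: int, count: int) -> list:
    """List the peak years of a strictly periodic cycle."""
    return [first_peak + i * period_years for i in range(count)]


# Starting from the 1870 peak, four peaks of a 50 year cycle:
print(cycle_peaks(1870, 50, 4))  # [1870, 1920, 1970, 2020]
```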

We’ve seen such theories before. In the 19th and 20th centuries there were plenty of historical theorists. Probably the most prominent was Marx but there were others as well.

The problem with cliodynamics, good data

Sparsity and accuracy of data have always been a problem with historical study. Much information is lost through natural or manmade disasters and much of what’s left is biased. Nonetheless, more and more data of a historical nature is being amassed every day, most of it quantitative and suitable for analysis.

Historical data, where available, can be assessed scientifically, and analyzed by using current tools such as data analytics, machine learning, & deep learning to ascertain trends and make predictions. And the more data available, the more accurate these analyses and predictions can become. Cliodynamics pre-dates many of these tools, but that’s no excuse for not taking advantage of them.

~~~~

As for 2020, AI, automation and globalization have led and will lead to more job disruption. Inequality is also on the rise, at least throughout much of the west. And then there’s Brexit, USA elections and general mid-east turmoil that all seem to be on the horizon.

Stay tuned, 2020 seems only months away.

Photo Credits:

From Key Historic Figures of WW1 article, Mansell/Getty Images, (c) ThoughtCo

Anti War March (1968 Chicago) By David Wilson , CC BY 2.0, Link

Eleven times Americans have marched on Washington, (1920, Washington DC) (c) Smithsonian Magazine

UK Biobank & the data economy – part 2

A couple of weeks back I wrote a post about repositories for all the data that users generate these days and what to do with it.  (See our post on Data banks, deposits, … data economy – part 1).

This past week I read an article (see ScienceDaily Genetics of brain structure … article) which partially exemplifies what that post talked about. The research used publicly available genetic information to tease out brain structure hereditary characteristics.  The Science Daily article was a summary of research done at the University of Oxford using information provided from the UK Biobank.

Biobank as a data bank

The Biobank has recruited 500K participants from the UK, aged 40-69, between 2006-2010, to share their anonymized health data with researchers and scientists around the world. The Biobank is set up as a Scottish charity, funded by various health organizations in the UK, both gov’t and private.

In addition to information collected during the baseline assessment: 

  • 100K participants have worn a 24 hour health monitoring device for a week and 20K have signed up to repeat this activity.
  • 500K participants have been genotyped (DNA sequencing to determine hereditary genes)
  • 100K participants will be medically scanned (brain, heart, abdomen, bones, carotid artery) with images stored in the Biobank
  • 100K participants have signed up to receive questionnaires asking about diet, exercise, work history, digestive health and other medical indicators.

There’s more. Biobank is linking to electronic health records (EHR) of participants to track their health over time. The Biobank is also starting to provide blood analysis and other detailed medical measures of subjects in the study.

UK Biobank (data bank) information uses

“UK Biobank is an open access resource. The Resource is open to bona fide scientists, undertaking health-related research that is in the public good. Approved scientists from the UK and overseas and from academia, government, charity and commercial companies can use the Resource. ….” (from UK Biobank scientists page).

Somewhat like open source code, the Biobank resource is made available to anyone (academia as well as industry), that can make valid use of its data BUT any research derived from its data must be published and made freely available to the Biobank and the world.

Biobank’s papers page documents some of the research that has already been published using their data. It lists the paper on genetics of brain study mentioned above and dozens more.

Differences from Data Banks

In the original data bank post:

  1. We thought data was only needed by  AI/deep learning. That seems naive now. The Biobank shows that AI/deep learning is not the only application/research that needs data.
  2. We thought data would be collected only by hyperscalers and other big web firms during normal user web activity. But their data is not the only data that matters.
  3. We thought data would be gathered for free. Good data can take many forms, and some may cost money.
  4. We thought profits from selling data would be split between the bank and users and could fund data bank operations. But in the Biobank, funding came from charitable contributions and data is available for free (to valid researchers).

Data banks can be an invaluable resource and may take many forms. Data that’s difficult to find can be gathered by charities and others that use funding to create, operate and gather the specific information needed for targeted research.

Comments?

Photo Credit(s): Bank on it by Alan Levine

Latest MRI – two screws in the kneecap by Becky Stern

Other graphics from the Genetics of brain structure… paper

 

 

perspective by anomalous4 (cc) (from Flickr)

Data banks, data deposits & data withdrawals in the data economy – part 1

Big data visualization, Facebook friend connections
Facebook friend carrousel by antjeverena (cc) (from flickr)

Read an interesting article this week in The Atlantic, Why Technology Favors Tyranny by Yuval Noah Harari, about the inevitable future of technology and how the use of data will drive it.

At the end of the article Harari talks about the need to take back ownership of our data in order to gain some control over the tech giants that currently control our data.

In part 3, Harari discusses the coming AI revolution and the impact on humanity. Yes there will still be jobs, but early on fewer jobs for unskilled labor and over time fewer jobs for skilled labor.

Yet, our data continues to be valuable. AI neural net (NN) accuracy increases as a function of the amount of data used to train it. As a result, he who has the most data creates the best AI NN. This means our data has value and can be used over and over again to train other AI NNs. This all sounds like data is just another form of capital, at least for AI NN training.

Safe by cjc4454 (cc) (from flickr)

If only we could own our data, then there would still be value from people’s (digital) exertions (labor), regardless of how much AI has taken over the reins of production or reduced the need for human work.

What we need is data (savings) banks. These banks would hold people’s data, gathered from social media likes/dislikes,  cell phone metadata, app/web history, search history, credit history, purchase history,  photo/video streams, email streams, lab work, X-rays, wearables info, etc. Probably many more categories need to be identified but ultimately ALL the digital data we generate today would need to be owned by people and deposited in their digital bank accounts.

Data deposits?

Social media companies, telecom, search companies, financial services app companies, internet  providers, etc. anywhere you do business should supply a copy of the digital data they gather for a person back to that persons data bank account.

There are many technical problems to overcome here, but it could be as simple as an object storage bucket assigned to each person, into which each digital business deposits (XML versions of) the digital data it creates for everyone who uses its service. They would do this as compensation for using our data in their business activities.

How to change data ownership?

Today, we all sign user agreements which essentially give a company the rights to our data in perpetuity. That needs to change. I see a few ways that this change could come about:

  1. Countries could enact laws to ensure personal data ownership resides in the person generating it and enforce periodic distribution of this data.
  2. Market dynamics could impel data distribution, e.g. if some search firm supplied data to us, we would be more likely to use them.
  3. Societal changes, as AI becomes more important to profit making activities and reduces the need for human work, and as data continues to be an important factor in AI success, data ownership becomes essential to retaining the value of human labor in society.

Probably, all of the above and maybe more would be required to change the ownership structure of data.

How to profit from data?

Technical entities needing data to train AI NNs could solicit data contributions through an Initial Data Offering (IDO). IDOs would specify the types of data required and the proportion of AI NN ownership they would cede to all data providers. Data providers would be apportioned ownership based on the % identified and the number of IDO data subscribers.
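As a rough sketch of that apportionment, assume the simplest possible rule: the ceded share is split equally among all subscribers. The function name, the equal split and the example names are all assumptions for illustration; a real IDO might weight shares by how much data each provider supplies.

```python
from fractions import Fraction


def apportion_ido_ownership(ceded_share, subscribers):
    """Split the ownership share an IDO cedes equally among its data subscribers.

    Hypothetical rule: every subscriber gets the same slice. Fractions are
    used so the slices always sum exactly to the ceded share.
    """
    if not subscribers:
        raise ValueError("an IDO needs at least one data subscriber")
    ceded_share = Fraction(ceded_share)
    if not 0 <= ceded_share <= 1:
        raise ValueError("ceded share must be between 0 and 1")
    per_subscriber = ceded_share / len(subscribers)
    return {name: per_subscriber for name in subscribers}


# An IDO ceding 20% of the AI NN to four data providers.
shares = apportion_ido_ownership(Fraction(20, 100), ["alice", "bob", "carol", "dave"])
for name, share in shares.items():
    print(f"{name}: {float(share):.1%}")
```

With four subscribers and 20% ceded, each provider would hold 5% of the AI NN.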

perspective by anomalous4 (cc) (from Flickr)

Data banks would extract the data requested by the IDO and supply it to the IDO entity for use. For IDOs, just like ICOs or IPOs, some would fail and others would succeed. But the data used in them would represent an ownership share, sort of like a stock (data) certificate in the AI NN.

Data bank responsibilities

Data banks would have various responsibilities and would need to collect fees to perform them. For example, data banks would be responsible for:

  1. Protecting data deposits – to ensure data deposits are never lost, are never accessed without permission and are always trackable as to how they are used.
  2. Performing data deposits – to verify that data is deposited from proper digital entities, to validate that data deposits are in a usable form and to properly store the data in a customer’s object storage bucket.
  3. Performing data withdrawals – upon customer request, to extract all the appropriate data requested by an IDO, anonymize it, secure it, package it and send it to the IDO originator.
  4. Reconciling data accounts – to track data transactions, data banks would supply a monthly statement that identifies all data deposits and data withdrawals, data revenues and data expenses/fees.
  5. Enforcing data withdrawal types – data withdrawals can have many different characteristics, such as exclusivity, expiration, geographic bounds, etc. Data banks would need to enforce these withdrawal characteristics, at least to the extent they can.
  6. Auditing data transactions – to ensure that data is used properly, a consortium of data banks or possibly data accountancies would need to audit AI training data sets to verify that only data that has been properly withdrawn is used in training the NN.
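
To make the withdrawal and enforcement duties a bit more concrete, here is a minimal sketch. Everything in it is hypothetical: the `WithdrawalTerms` fields, the function names and the category names are made up for illustration. Note also that a salted hash only pseudonymizes the owner; real anonymization of personal data is a much harder problem.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class WithdrawalTerms:
    """Hypothetical characteristics attached to a data withdrawal."""
    expires: date
    allowed_regions: frozenset
    exclusive: bool = False


def pseudonymize(owner_id, salt):
    """Replace the owner id with a salted hash (pseudonymous, not anonymous)."""
    return hashlib.sha256((salt + owner_id).encode("utf-8")).hexdigest()[:16]


def withdraw(owner_id, deposits, requested_categories, salt):
    """Package only the categories an IDO asked for, under a pseudonym."""
    return {
        "owner": pseudonymize(owner_id, salt),
        "data": {c: deposits[c] for c in requested_categories if c in deposits},
    }


def use_permitted(terms, on_date, region):
    """Check a proposed use against the expiration and geographic bounds."""
    return on_date <= terms.expires and region in terms.allowed_regions


deposits = {"search_history": ["query 1", "query 2"], "purchase_history": ["item A"]}
package = withdraw("alice", deposits, ["search_history"], salt="bank-secret")
terms = WithdrawalTerms(expires=date(2030, 1, 1), allowed_regions=frozenset({"US"}))

print(sorted(package["data"]))                      # only the requested category
print(use_permitted(terms, date(2025, 6, 1), "US"))
print(use_permitted(terms, date(2031, 6, 1), "US"))
```

The point of the sketch is only that withdrawal terms travel with the data as something a bank (or auditor) can check mechanically.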

AI NN, tools and framework responsibilities

In order for personal data ownership to work well, AI NNs, tools and frameworks used today would need to change to account for data ownership.

  1. Generate, maintain and supply immutable data ownership digests – data ownership digests would be a sort of stock registry for the data used in training the AI NN. They would need to be a part of any AI NN and be viewable by proper data authorities
  2. Track data use – any and all data used in AI NN training should be traceable so that proper data ownership can be guaranteed.
  3. Identify AI NN revenues – NN revenues would need to be isolated, identified and accounted for so that data owners could be rewarded.
  4. Identify AI NN data expenses – NN data costs would need to somehow be isolated, identified and accounted for so that data expenses could be properly deducted from data owner awards.
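
One way a data ownership digest could be made tamper-evident is a simple hash chain, where each registry entry includes the hash of the entry before it. This is only a sketch of one possible approach, not something any AI framework provides today; the field names (`owner`, `dataset_id`, `share_pct`) are assumptions. A hash chain makes changes detectable rather than impossible, so "immutable" here really means "auditable".

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(body):
    """Hash an entry body in a stable (sorted-key) JSON form."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def add_entry(registry, owner, dataset_id, share_pct):
    """Append an ownership record chained to the previous record's hash."""
    prev_hash = registry[-1]["hash"] if registry else GENESIS_HASH
    body = {"owner": owner, "dataset_id": dataset_id,
            "share_pct": share_pct, "prev_hash": prev_hash}
    registry.append({**body, "hash": _entry_hash(body)})
    return registry


def verify(registry):
    """Return True only if every entry is intact and correctly chained."""
    prev_hash = GENESIS_HASH
    for entry in registry:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev_hash or entry["hash"] != _entry_hash(body):
            return False
        prev_hash = entry["hash"]
    return True


registry = []
add_entry(registry, "alice", "search-2018", 5)
add_entry(registry, "bob", "purchases-2018", 5)
print(verify(registry))   # chain is intact

registry[0]["share_pct"] = 50  # someone quietly rewrites an ownership share
print(verify(registry))   # the change is detected
```

A data authority holding only the latest hash could detect any rewrite of earlier ownership records.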

At some point there’s a need for almost a data profit and loss statement as well as a data balance sheet at the AI NN level. The information supplied above should make auditing data ownership, use and rewards much more feasible. But it all starts with identifying data ownership and the data used in training the AI.

~~~~

There are a thousand more questions that come to mind. For example:

  • Who owns earth sensing satellite, IoT sensor, weather sensor, car sensor, etc. data? Everyone in the world (or country) being monitored is laboring to create the environment sensed by these devices. Shouldn’t this sensor data be apportioned to the people of the world or country where these sensors operate?
  • Who pays data bank fees? The generators/extractors of the data could pay, in addition to providing data deposits, for the privilege of using our data. I could also see the people paying. Having the company pay would give them an incentive to make the data load as efficient and complete as possible. Having the people pay would induce them to use their data more productively.
  • What’s a decent data expiration period? Given application time frames these days, 7-15 years would make sense. But what happens to the AI NN when data expires? Some way would need to be created to extract data from a NN, or the AI NN would need to cease being used and a new one would need to be created with new data.
  • Can data deposits be rented/sold to data aggregators? Sort of like an AI VC partnership, only using data deposits rather than money to fund AI startups.
  • What happens to data deposits when a person dies? Can one inherit data deposits? Would a data deposit inheritance be taxable as part of an estate transfer?

In the end, as data is required to train better AI, ownership of our data makes us all capitalists (datalists) in the creation of new AI NNs and the subsequent advancement of society. And that’s a good thing.

Comments?