All that AI DL training data comes from us

Read a couple of articles the past few weeks that highlighted something not many of us are aware of: most of the data used to train AI deep learning (DL) models comes from us.

That is, it comes through our ignorance or tacit acceptance of licenses for the apps we use every day, and from just walking around and interacting with the world.

The article in The Atlantic, The AI supply chain runs on ignorance, talks about Ever, a picture sharing app (like Flickr), where users opted in to its facial recognition software to tag people in pictures. Ever also used that data (tagged by machine or person) to train its facial recognition software, which it sells to government agencies throughout the world.

The second article, in Engadget, Colorado College students were secretly used to train AI facial recognition (software), talks about a group using a telephoto security camera that was pointed at a high traffic area on campus. The data obtained was used to help train an AI DL model to identify facial characteristics from far away.

The article went on to say that gathering photos from people in public places is not against the law. The study was also cleared by the school. The database was not released until after the students graduated but it did have information about the time and date the photos were taken.

But that’s nothing…

The same thing applies to video sharing and photo animation models, podcasting and text speaking models, blogging and written word generation models, etc. All this data is just lying around the web, freely available for any AI DL data engineer to grab and use to train their models. The article which included the image below talks about a new dataset of millions of webpages.

From an OpenAI paper on better language models showing the accuracy of some AI DL models “trained on a new dataset of millions of webpages called WebText.”

Google photo search is scanning the web and has access to any photo posted to use for training data. Facebook, IG, and others have millions of photos that people are posting online every day, many of which are tagged with information identifying the people in the photos. I’m sure somewhere there’s a clause in a license agreement that says your photos, when posted on our app, no longer belong to you alone.

As security cameras become more pervasive, camera data will readily be used to train even more advanced facial recognition models without your say so, approval or even awareness that it is happening. And this is in the first world, where data privacy and identity security protections are supposedly paramount. Imagine how the rest of the world’s data will be used.

With AI DL models, it’s all about the data. Yes, much of it is messy and has to be cleaned up, massaged and sometimes annotated to be useful for DL training. But the origins of that training data are typically not disclosed to either the AI data engineers or the people that created it.

We all thought China would have a lead in AI DL because of their unfettered access to data, but the west has its own way to gain unconstrained access to vast amounts of data. And we are living through it today.

Yes, AI DL models have the potential to drastically help the world, humanity and government do good things better. But a dark side to AI DL models also exists, one that helps bad actors, organizations and even some government agencies do evil.

Caveat usor (May the user beware)



Photo Credit(s): “Still Watching You” by jhcrow is licensed under CC BY-NC 2.0 

“Computational Photography Homework 1 Results.” by kscottz is licensed under CC BY-NC 2.0

From Language models are unsupervised multi-task learners OpenAI research paper

perspective by anomalous4 (cc) (from Flickr)

Data banks, data deposits & data withdrawals in the data economy – part 1

Big data visualization, Facebook friend connections
Facebook friend carrousel by antjeverena (cc) (from flickr)

Read an interesting article this week in The Atlantic, Why Technology Favors Tyranny by Yuval Noah Harari, about the inevitable future of technology and how the use of data will drive it.

At the end of the article Harari talks about the need to take back ownership of our data in order to gain some control over the tech giants that currently control our data.

In part 3, Harari discusses the coming AI revolution and its impact on humanity. Yes, there will still be jobs, but early on there will be fewer jobs for unskilled labor and over time fewer jobs for skilled labor.

Yet, our data continues to be valuable. AI neural net (NN) accuracy increases as a function of the amount of data used to train it. As a result, whoever has the most data creates the best AI NN. This means our data has value and can be used over and over again to train other AI NNs. This all sounds like data is just another form of capital, at least for AI NN training.

If only we could own our data, then there would still be value from people’s (digital) exertions (labor), regardless of how much AI has taken over the reins of production or reduced the need for human work.

Safe by cjc4454 (cc) (from flickr)

What we need is data (savings) banks. These banks would hold people’s data, gathered from social media likes/dislikes, cell phone metadata, app/web history, search history, credit history, purchase history, photo/video streams, email streams, lab work, X-rays, wearables info, etc. Probably many more categories need to be identified, but ultimately ALL the digital data we generate today would need to be owned by people and deposited in their digital bank accounts.

Data deposits?

Social media companies, telecoms, search companies, financial services app companies, internet providers, etc. (anywhere you do business) should supply a copy of the digital data they gather about a person back to that person’s data bank account.

There are many technical problems to overcome here, but it could be as simple as an object storage bucket assigned to each person, into which each digital business deposits (XML versions of) the digital data it creates for everyone that uses its service. Businesses would do this as compensation for using our data in their business activities.
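To make the bucket idea a little more concrete, here is a minimal Python sketch of a data deposit. It is only an illustration under stated assumptions: the `buckets` dictionary is an in-memory stand-in for a real object store, and the names `make_deposit_xml`, `deposit`, the XML element names and the key layout are all hypothetical, not part of any existing service or standard.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical in-memory stand-in for an object store: one "bucket" per person,
# mapping object keys to XML payloads.
buckets: dict[str, dict[str, bytes]] = defaultdict(dict)


def make_deposit_xml(person_id, source, category, records):
    """Serialize one business's records about one person as XML bytes.

    Each record is a dict whose keys must be valid XML element names
    (an assumption of this sketch).
    """
    root = ET.Element("deposit", person=person_id, source=source, category=category)
    for rec in records:
        item = ET.SubElement(root, "record")
        for key, value in rec.items():
            ET.SubElement(item, key).text = str(value)
    return ET.tostring(root, encoding="utf-8")


def deposit(person_id, source, category, records, when=None):
    """Store a deposit in the person's bucket and return its object key."""
    when = when or datetime.now(timezone.utc)
    key = f"{source}/{category}/{when:%Y%m%dT%H%M%SZ}.xml"
    buckets[person_id][key] = make_deposit_xml(person_id, source, category, records)
    return key
```

A search firm, for instance, might call `deposit("alice", "search-co", "search_history", [{"query": "weather"}])` once per reporting period. A real system would also need authentication, encryption and an agreed schema, none of which is modeled here.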

How to change data ownership?

Today, we all sign user agreements which essentially give a company the rights to our data in perpetuity. That needs to change. I see a few ways that this change could come about:

  1. Countries could enact laws to ensure personal data ownership resides in the person generating it and enforce periodic distribution of this data.
  2. Market dynamics could impel data distribution, e.g. if some search firm supplied data to us, we would be more likely to use them.
  3. Societal changes could demand it. As AI becomes more important to profit making activities and reduces the need for human work, and as data continues to be an important factor in AI success, data ownership becomes essential to retaining the value of human labor in society.

Probably, all of the above and maybe more would be required to change the ownership structure of data.

How to profit from data?

Technical entities needing data to train AI NNs could solicit data contributions through an Initial Data Offering (IDO). An IDO would specify the types of data required and the proportion of AI NN ownership that the entity would cede to all data providers. Data providers would then be apportioned ownership based on that percentage and the number of IDO data subscribers.
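As a worked example of that apportionment, the sketch below splits the ceded ownership equally among subscribers. The equal split is an assumption made for illustration (an IDO could just as well weight shares by how much data each person supplies), and the function name `ido_shares` is hypothetical.

```python
from fractions import Fraction


def ido_shares(ceded, subscribers):
    """Split the ceded ownership fraction equally among unique subscribers.

    `ceded` is the fraction of AI NN ownership the IDO gives up to data
    providers (e.g. Fraction(1, 5) for 20%). Equal weighting is an
    assumption of this sketch.
    """
    unique = list(dict.fromkeys(subscribers))  # drop duplicates, keep order
    if not unique:
        raise ValueError("an IDO needs at least one data subscriber")
    if not 0 <= ceded <= 1:
        raise ValueError("ceded ownership must be between 0 and 1")
    per_subscriber = Fraction(ceded) / len(unique)
    return {name: per_subscriber for name in unique}
```

With 20% ceded and four subscribers, each subscriber would hold 1/20 (5%) of the AI NN. Note how quickly an individual share shrinks as the subscriber count grows into the millions.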

perspective by anomalous4 (cc) (from Flickr)

Data banks would extract the data requested by the IDO and supply it to the IDO entity for use. Just like ICOs or IPOs, some IDOs would fail and others would succeed. But the data used in them would represent an ownership share, sort of like a stock (data) certificate in the AI NN.

Data bank responsibilities

Data banks would have various responsibilities and would need to collect fees to perform them. For example, data banks would be responsible for:

  1. Protecting data deposits – to ensure data deposits are never lost, are never accessed without permission, and are always trackable as to how they are used.
  2. Performing data deposits – to verify that data is deposited from proper digital entities, to validate that data deposits are in a usable form and to properly store the data in a customer’s object storage bucket.
  3. Performing data withdrawals – upon customer request, to extract all the appropriate data requested by an IDO, anonymize it, secure it, package it and send it to the IDO originator.
  4. Reconciling data accounts – to track data transactions, data banks would supply a monthly statement that identifies all data deposits and data withdrawals, data revenues and data expenses/fees.
  5. Enforcing data withdrawal types – data withdrawals can have many different characteristics, such as exclusivity, expiration, geographic bounds, etc. Data banks would need to enforce these withdrawal characteristics, at least to the extent they can.
  6. Auditing data transactions – to ensure that data is used properly, a consortium of data banks or possibly data accountancies would need to audit AI training data sets to verify that only data that has been properly withdrawn is used in training the NN.
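A few of these responsibilities (deposits, withdrawals, expiration enforcement and statements) can be sketched in a handful of lines of Python. Everything here is a hypothetical illustration: the `Account` and `Withdrawal` classes and their fields are invented for this sketch, only expiration is enforced (exclusivity and geographic bounds are left out), and replacing the owner's name with a salted hash is pseudonymization at best, well short of real anonymization.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Withdrawal:
    ido: str          # identifier of the requesting IDO (hypothetical)
    category: str     # which kind of data is requested
    expires: date     # last day the withdrawal terms are valid


@dataclass
class Account:
    owner: str
    deposits: dict = field(default_factory=dict)  # category -> list of records
    ledger: list = field(default_factory=list)    # (kind, category, counterparty)

    def deposit(self, source, category, records):
        """Record data supplied by a business and log it in the ledger."""
        self.deposits.setdefault(category, []).extend(records)
        self.ledger.append(("deposit", category, source))

    def withdraw(self, request, today, salt):
        """Package data for an IDO, refusing requests whose terms have expired."""
        if today > request.expires:
            raise PermissionError("withdrawal terms have expired")
        # Salted hash stands in for the owner's identity in the package.
        pseudonym = hashlib.sha256(salt + self.owner.encode()).hexdigest()[:16]
        self.ledger.append(("withdrawal", request.category, request.ido))
        return {
            "subject": pseudonym,
            "category": request.category,
            "expires": request.expires.isoformat(),
            "records": list(self.deposits.get(request.category, [])),
        }

    def statement(self):
        """Return one line per data transaction, oldest first."""
        return [f"{kind}: {category} ({party})" for kind, category, party in self.ledger]
```

The ledger is what makes the monthly statement and later audits possible: every deposit and withdrawal leaves a record of what moved and who the counterparty was.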

AI NN, tools and framework responsibilities

In order for personal data ownership to work well, AI NNs, tools and frameworks used today would need to change to account for data ownership.

  1. Generate, maintain and supply immutable data ownership digests – data ownership digests would be a sort of stock registry for the data used in training the AI NN. They would need to be a part of any AI NN and be viewable by proper data authorities.
  2. Track data use – any and all data used in AI NN training should be traceable so that proper data ownership can be guaranteed.
  3. Identify AI NN revenues – NN revenues would need to be isolated, identified and accounted for so that data owners could be rewarded.
  4. Identify AI NN data expenses – NN data costs would need to somehow be isolated, identified and accounted for so that data expenses could be properly deducted from data owner awards.

At some point there’s a need for almost a data profit and loss statement as well as a data balance sheet at the AI NN level. The information supplied above should make auditing data ownership, use and rewards much more feasible. But it all starts with identifying data ownership and the data used in training the AI.
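One possible shape for such an ownership digest is a hash chain, where each entry records who owns a piece of training data and commits to the entry before it. The Python sketch below is an assumption-laden illustration, not a description of any existing framework: the function names are hypothetical, a hash chain makes tampering detectable rather than impossible, and splitting rewards by number of entries is just one of many possible rules.

```python
import hashlib
import json
from collections import Counter

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(body):
    """Hash the owner, data hash and previous-entry hash of one entry."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def add_entry(registry, owner, data):
    """Append an ownership record for one piece of training data."""
    prev = registry[-1]["entry_hash"] if registry else GENESIS
    body = {"owner": owner, "data_hash": hashlib.sha256(data).hexdigest(), "prev": prev}
    registry.append({**body, "entry_hash": _entry_hash(body)})


def verify(registry):
    """Return True if no entry has been altered, removed or reordered."""
    prev = GENESIS
    for entry in registry:
        body = {k: entry[k] for k in ("owner", "data_hash", "prev")}
        if entry["prev"] != prev or entry["entry_hash"] != _entry_hash(body):
            return False
        prev = entry["entry_hash"]
    return True


def owner_rewards(registry, revenue, data_expenses, ceded):
    """Split the ceded share of net data revenue by entries contributed.

    Weighting by entry count is an assumption of this sketch.
    """
    net = max(revenue - data_expenses, 0) * ceded
    counts = Counter(entry["owner"] for entry in registry)
    total = sum(counts.values())
    return {owner: net * n / total for owner, n in counts.items()}
```

The `owner_rewards` function is a tiny version of the data profit and loss idea: revenues minus data expenses, times the share ceded to data owners, divided according to what the registry says each owner contributed.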


There are a thousand more questions that come to mind. For example:

  • Who owns the data from earth sensing satellites, IoT sensors, weather sensors, car sensors, etc.? Everyone in the world (or country) being monitored is laboring to create the environment sensed by these devices. Shouldn’t this sensor data be apportioned to the people of the world or country where these sensors operate?
  • Who pays data bank fees? The generators/extractors of the data could pay, in addition to providing data deposits, for the privilege of using our data. I could also see the people paying. Having the company pay would give them an incentive to make the data load as efficient and complete as possible. Having the people pay would induce them to use their data more productively.
  • What’s a decent data expiration period? Given application time frames these days, 7-15 years would make sense. But what happens to the AI NN when data expires? Some way would need to be created to extract data from an NN, or the AI NN would need to cease being used and a new one would need to be created with new data.
  • Can data deposits be rented/sold to data aggregators? Sort of like an AI VC partnership, only using data deposits rather than money to fund AI startups.
  • What happens to data deposits when a person dies? Can one inherit a data deposit? Would a data deposit inheritance be taxable as part of an estate transfer?

In the end, as data is required to train better AI, ownership of our data makes us all capitalists (datalists) in the creation of new AI NNs and the subsequent advancement of society. And that’s a good thing.