All that AI DL training data comes from us

Read a couple of articles the past few weeks that highlighted something that not many of us are aware of, most of the data used to train AI deep learning (DL) models comes from us.

That is through our ignorance or tacit acceptation of licenses for apps that we use every day and for just walking around/interacting with the world.

The article in Atlantic, The AI supply chain runs on ignorance, talks about Ever, a picture sharing app (like Flickr), where users opted in to its facial recognition software to tag people in pictures. Ever also used that (tagged by machine or person) data to train its facial recognition software which it sells to government agencies throughout the world.

The second article, in Engadget , Colorado College students were secretly used to train AI facial recognition (software), talks about a group using a telephoto security camera than was pointed at a high traffic area on campus. The data obtained was used to help train an AI DL model to identify facial characteristics from far away.

The article went on to say that gathering photos from people in public places is not against the law. The study was also cleared by the school. The database was not released until after the students graduated but it did have information about the time and date the photos were taken.

But that’s nothing…

The same thing applies to video sharing and photo animation models, podcasting and text speaking models, blogging and written word generation models, etc. All this data is just lying around the web, freely available for any AI DL data engineer to grab and use to train their models. The article which included the image below talks about a new dataset of millions of webpages.

From an OpenAI paper on better language models showing the accuracy of some AI DL models “trained on a new dataset of millions of webpages called WebText.”

,Google photo search is scanning the web and has access to any photo posted to use for training data. Facebook, IG, and others have millions of photos that people are posting online every day, many of which are tagged, with information identifying people in the photos. I’m sure some where there’s a clause in a license agreement that says your photos, when posted on our app, no longer belong to you alone.

As security cameras become more pervasive, camera data will readily be used to train even more advanced facial recognition models without your say so, approval or even appreciation that it is happening. And this is in the first world, with data privacy and identity security protections paramount, imagine how the rest of the world’s data will be used.

With AI DL models, it’s all about the data. Yes much of it is messy and has to be cleaned up, massaged and sometimes annotated to be useful for DL training. But the origins of that training data are typically not disclosed to the AI data engineers nor the people that created it.

We all thought China would have a lead in AI DL because of their unfettered access to data, but the west has its own way to gain unconstrained access to vast amounts of data. And we are living through it today.

Yes AI DL models have the potential to drastically help the world, humanity and government do good things better. But a dark side to AI DL models also exist to help bad actors, organizations and even some government agencies do evil.

Caveat usor (May the user beware)

~~~~

Comments?

Photo Credit(s): “Still Watching You” by jhcrow is licensed under CC BY-NC 2.0 

Computational Photography Homework 1 Results.” by kscottz is licensed under CC BY-NC 2.0 

From Language models are unsupervised multi-task learners OpenAI research paper

ZooArchNet.org, a new collaboration for zoological-archeological data

Read an article the other day about a new collaboration data platform, the ZooArchNet, for archeological and zoological data ( data about animals and the history of humankind).

The collaboration was started at the Florida Museum at the University of Florida. They intend to construct a database that would allow researchers to track the history of animals and how humans have interacted with them over time.

The problem is there’s a lot of historical animal specimen information available in various locations/sites around the world and similarly, there’s a lot of data about the history of humanity, but there’s little that cross links the two. And by missing those cross links, researchers aren’t seeing the big picture, that humankind and animal-kind have co-existed since the dawn of time and have impacted each other throughout their history.

However, if there was a site where one could trace the history of animal and human life, across time, in a region, one could develop a better understanding of how they interact and impact one another.

Humankind interacting with animalkind

In the article, they discuss a number of examples where animals have been impacted by humankind over time. For example, originally the Mexican Turkey was domesticated for its feathers during Mayan, Aztec and other civilizations of Central America,. but over time it became a prized for use as food. While this was occurring, its range expanded considerably throughout North (and South) America.

It’s the understanding of habitat range over time and how humankind helped or hindered this range that’s best served by linking zoological and archeological data sets that exist in research libraries throughout the world.

How it works

One problem in cross linking such data is that it often exists in different formats and uses different metadata to describe it.

A key, early decision was to use a standard metadata format ,the Darwin Core (DwC) an outgrowth of the Dublin Core which is more focused on zoological data.

With this in place, the next problem was to translate specimen metadata into the DwC and extract the actual data (or URI’s) that described the specimen for harvesting. Once all that was accomplished they could migrate the specimen data or archeological data and host it/cross link it in their ZooArchNet database.

For example, the researchers at Florida Museum used the Open Context database to provide archeological informationand the Global Biodiversity Information Facility (GBIF) to supply biological diversity information and together the two were linked and cross indexed in the ZooArchNet database.

Once the data was available and located in Google Cloud storage, researchers could use Google BigQuery data analytics as well as other apps like (Google) indexers to create more data rich views and searchable indices for their ZooArchNet and VertNet web portals.

ZooArchNet is just starting. Most of the information currently available is about the few examples chosen to demonstrate the technology. As with anything like this, there’s a certain amount of crowd sourcing needed to make it worthwhile. It’s popularity will be a prime determinant on its usefulness over time. But anything that helps the world understand the true history of humanity’s impact on this life of this planet is worthwhile.

~~~~

Comments?

Photo Credit(s): “turkey bird” by watts photos1 is licensed under CC BY 2.0 

Workflow from ZooArchNet: Connecting zooarchaeological specimens to the biodiversity and archaeology data networks article

Darwin Core overview from Darwin Core: An Evolving Community-Developed Biodiversity Data Standard article

Need memory, Intel’s Optane DC PM to the rescue

I attended Intel’s DataCentric Innovation Conference Tech Field Day eXclusive (TFDx) last April. There were a couple of items Intel presented at the show that peaked my interest there, one of which was DL Boost (see my Intel’s new DL Boost for AI inferencing blog post) and the other was Optane DC PM (data center persistent memory). This post is about Optane DC PM.

As you already know, Optane SSDs have been on the market now for at least a year or so and have not gained much market traction as of yet. I and others attribute this to the high price differential between Optane SSDs and NVMe Flash SSDs but others may say it’s a matter of production yields – probably a little of both.

But Optane, as announced, always had another form factor (if that’s the right term), as persistent memory that could dramatically increase the size of server memory to support new memory intensive applications at a lower price than DRAM.

I was at Nutanix .NEXT conference last week and saw a 4 socket server from DELL that had 6TB of DRAM in it (and 4-44 core CPUs). I didn’t ask the price but when I mentioned I wanted one for my home office, they said it could easily heat my house. So the other problem with a lot of DRAM is power consumption.

Optane DC PM (data center persistent) memory is intended to solve both the high cost and high power consumption problems of DRAM.

How does it work in a server

The newer Intel motherboards support up to 12 slots of memory per socket. And up to 6 of these can be Optane DC PM (512GB DIMM) or 3TB per socket. Optane DC PM is accessed via L1-L2 caching just like any other memory. Apparently with a dual socket system you can have up to 11 Optane DC PM DIMMs on the motherboard.

L1-L2 cache access times are on the order of picoseconds (10**[-12] seconds), DRAM is on the order of nanoseconds (10**[-9] seconds) and flash is on the order of 100 microseconds (100*10**[-6] seconds). So there’s a vast access time gulf between DRAM and Flash that could be exploited with the right technology – enter Optane DC PM.

The only detailed info I could find on Optane DC PM access times was in a research paper (see Basic performance of Intel Optane DC PMM research paper) and it said Optane DC PM assessing times are ~350 nanoseconds, or close to right between DRAM and Flash. At the show the development team indicated that Optane DC PM support about 3GB/sec of bandwidth per module (DIMM).

There are two ways to use Optane DC PM:

  1. Memory mode – in Memory mode, the data in Optane DC PM is thrown away during a power cycle. You must use a block of DRAM as a cache or rather a virtual memory block to the Optane DC PM acting as a paging store. Data is brought into the DRAM cache when accessed using its (virtual) DRAM address and when no longer used. it gets evicted (destaged) back out to Optane DC PM. When power is cycled the data in Optane DC PM is cleared out. Optane DC PM supports AES XTS-256 bit encryption and can easily be cleared by throwing away encryption keys during a power cycle.
  2. App Direct mode – in App Direct mode, Optane DC PM is accessed directly using application APIs and its data persists across power cycles. For App Direct mode, Optane DC PM is still AES 256 encrypted but here the encryption key is maintained across power cycles but is locked on power up and you need to use a pass phrase to unlock it. In this mode, applications are responsible for flushing (L1-L2) caches to Optane to retains all data written through L1-L2 to the Optane DC PM. There’s a GitHub Persistent Memory Development Kit (PMDK) library for that supports the App Direct mode API that applications need to use.

Both modes use DDR-T, (transactional DDR4) a new memory bus protocol for Optane DC PM access. In the DDR-T protocol, access to the memory bus is requested by a CPU and is granted by an Optane DC PM DIMM. All Optane DC PM DIMMs on a system can be accessed in parrallel.

You can use RDMA to replicate (App direct?) Optane DC PM data from one system to another. In order to support Memory and App Direct mode, Optane DC PM required CPU, BIOS and (application) software changes.

Most of the Optane DC PM support and cryptology logic is implemented in hardware. Optane DC PM has an address indirection table (AIT) to support 3D XPoint wear leveling maintained in DRAM but flushed to Optane during power loss. Transfers to 3D XPoint media is in 256 byte cache lines but the memory bus operates in 64 byte cache lines, so there’s a (DRAM) buffer between media and memory bus.

Optane also supports a high availability, or up to two device failure mode. In this scenario, if one Optane DC PM DIMM fails, the system can swap another spare Optane DC PM DIMM into that address space and continue to operate. If a 2nd Optane DC PM fails then the system fails. Not sure what happens to the data on the original Optane DC PM DIMM during a failure. It seems to me data would be lost, but it could depend on its failure mode.

In Memory mode, the expected ratio between DRAM size and Optane DC PM size is should be 32GB DRAM/6TB Optane DC PM. At the TFDx event, the Optane DC PM team had some performance charts showing different DRAM cache miss rates. Intel also announced new CPU monitoring statistics to track application/workloads impacting DRAM/Optane DC PM in Memory mode and to track Optane DC PM health.

Optane DC PM usage modes are established by the BIOS. It’s flexible enough to have the Optane DC PM usage modes be defined on a region by region basis. Not exactly sure what a region is, but it could be an address range spanning MB(?) of Optane DC PM. With both modes in operation on a system, data can be moved from Memory mode Optane to App direct mode Optane or vice versa.

Intel expects that lifetime of an Optane DC PM DIMM to be from 200-370PB of data writes. Optane DC PMs have a 5 year warrantee. Given its bandwidth (3GB/sec), 200PB of data writes should last ~2 years but that’s at 100% duty cycle, writing 3GB of data, every second of every day. So, 5 years should be a reasonable guarantee using a more realistic ~40% duty cycle.

What applications use Optane DC PM

The one of interest to most people seems to be SAP HANA. According to the development team, SAP HANA could use App Direct mode for main database storage and use DRAM for its delta column store. Cassandra could also use Optane in App Direct mode in a similar fashion.

Also something like a REDIS for key-value store could use Optane DC PM to store Values and use DRAM to store Keys.

But any application could take advantage of the extra memory made available with Optane DC PM DIMMs in Memory mode today. Of course any use of Optane DC PM would require the right levels of Intel Xeon CPUs (Cascade Lake), BIOSes and motherboards.

~~~~

Interested in learning more, TFDx videos of the event are available on the website noted previously. Also these TFDx bloggers also have posts specifically on Optane DC PM.

The coolest thing since sliced bread – Optane by Matt Leib, (@MBLeib)

Intel’s crossover point: A 3D spork by Stephen Foskett (@SFoskett)

Intel answering SAP HANA’s tough questions by Keith Townsend (@CTOAdvisor)

Comments?