For AGI, is reward enough – part 4

Last May, an article came out of DeepMind research titled Reward is enough. It was published in an artificial intelligence journal but PDFs of it are available free of charge.

The article points out that according to DeepMind researchers, using reinforcement learning and an appropriate reward signal is sufficient to attain AGI (artificial general intelligence). We have written about the perils and pitfalls of AGI before (see Existential event risks [-part-0]NVIDIA Triton GMI, a step to far[-part-1]The Myth of AGI [-part-2], and Towards a better AGI – part 3ish. (Sorry, I only started numbering them after part 3ish).

My last post on AGI inclined towards the belief that AGI was not possible without combining deduction, induction and abduction (probabilistic reasoning) together and that any such AGI was a distant dream at best.

Then I read the Reward is Enough article and it implied that they saw a realistic roadmap towards achieving AGI based solely on reward signals and Reinforcement Learning (wikipedia article on Reinforcement Learning ). To read the article was disheartening at best. After the article came out, I made it a hobby to understand everything I could about Reinforcement Learning to understand whether what they are talking is feasible or not.

Reinforcement learning, explained

Let’s just say that the text book, Reinforcement Learning, is not the easiest read I’ve seen. But I gave it a shot and although I’m no where near finished, (lost somewhere in chapter 4), I’ve come away with a better appreciation of reinforcement learning.

The premise of Reinforcement Learning, as I understand it, is to construct a program that performs a sequence of steps based on state or environment the program is working on, records that sequence and tags or values that sequence with a reward signal (i.e., +1 for good job, -1 for bad, etc.). Depending on whether the steps are finite, i.,e, always ends or infinite, never ends, the reward tagging could be cumulative (finite steps) or discounted (infinite steps).

The record of the program’s sequence of steps would include the state or the environment and the next step that was taken. Doing this until the program completes the task or if, infinite, whenever the discounted reward signal is minuscule enough to not matter anymore.

Once you have a log or record of the state, the step taken in that state and the reward for that step you have a policy used to take better steps. Over time, with sufficient state-step-reward sequences, one can build a policy that would work’s very well for the problem at hand.

Reinforcement learning, a chess playing example

Let’s say you want to create a chess playing program using reinforcement learning. If a sequence of moves ends the game, you can tag each move in that sequence with a reward (say +1 for wins, 0 for draws and -1 for losing), perhaps discounted by the number of moves it took to win. The “sequence of steps” would include the game board and the move chosen by the program for that board position.

Figure 2: Comparison with specialized programs. (A) Tournament evaluation of AlphaZero in chess, shogi, and Go in matches against respectively Stockfish, Elmo, and the previously published version of AlphaGo Zero (AG0) that was trained for 3 days. In the top bar, AlphaZero plays white; in the bottom bar AlphaZero plays black. Each bar shows the results from AlphaZero’s perspective: win (‘W’, green), draw (‘D’, grey), loss (‘L’, red). (B) Scalability of AlphaZero with thinking time, compared to Stockfish and Elmo. Stockfish and Elmo always receive full time (3 hours per game plus 15 seconds per move), time for AlphaZero is scaled down as indicated. (C) Extra evaluations of AlphaZero in chess against the most recent version of Stockfish at the time of writing, and against Stockfish with a strong opening book. Extra evaluations of AlphaZero in shogi were carried out against another strong shogi program Aperyqhapaq at full time controls and against Elmo under 2017 CSA world championship time controls (10 minutes per game plus 10 seconds per move). (D) Average result of chess matches starting from different opening positions: either common human positions, or the 2016 TCEC world championship opening positions . Average result of shogi matches starting from common human positions . CSA world
championship games start from the initial board position.

If your policy incorporates enough winning chess move sequences and the program encounters one of these in a game and if move recorded won, select that move, if lost, select another valid move at random. If the program runs across a board position its never seen before, choose a valid move at random.

Do this enough times and you can build a winning white playing chess policy. Doing something similar for black playing program would build a winning black playing chess policy.

The researchers at DeepMind explain their AlphaZero program which plays chess, shogi, and Go in another research article, A general reinforcement learning algorithm that masters chess, shogi and Go through self-play.

Reinforcement learning and AGI

So now what does all that have to do with creating AGI. The premise of the paper is that by using rewards and reinforcement learning, one could program a policy for any domain that one encounters in the world.

For example, using the above chart, if we were to construct reinforcement learning programs that mimicked perception (object classification/detection) abilities, memory ((image/verbal/emotional/?) abilities, motor control abilities, etc. Each subsystem could be trained to solve the arena needed. And over time, if we built up enough of these subsystems one could somehow construct an AGI system of subsystems, that would match human levels of intelligence.

The paper’s main hypothesis is “(Reward is enough) Intelligence, and its associated abilities, can be understood as subserving the maximization of reward by an agent acting in its environment.”

Given where I am today, I agree with the hypothesis. But the crux of the problem is in the details. Yes, for a game of multiple players and where a reward signal of some type can be computed, a reinforcement learning program can be crafted that plays better than any human but this is only because one can create programs that can play that game, one can create programs that understand whether the game is won or lost and use all this to improve the game playing policy over time and game iterations.

Does rewards and reinforcement learning provide a roadmap to AGI

To use reinforcement learning to achieve AGI implies that

  • One can identify all the arenas required for (human) intelligence
  • One can compute a proper reward signal for each arena involved in (human) intelligence,
  • One can programmatically compute appropriate steps to take to solve that arena’s activity,
  • One can save a sequence of state-steps taken to solve that arena’s problem, and
  • One can run sequences of steps enough times to produce a good policy for that arena.

There are a number of potential difficulties in the above. For instance, what’s the state the program operates in.

For a human, which has 500K(?) pressure, pain, cold, & heat sensors throughout the exterior and interior of the body, two eyes, ears, & nostrils, one tongue, two balance sensors, tired, anxious, hunger, sadness, happiness, and pleasure signals, and 600 muscles actuating the position of five fingers/hand, toes/foot, two eyes ears, feet, legs, hands, and arms, one head and torso. Such a “body state, becomes quite complex. Any state that records all this would be quite large. Ok it’s just data, just throw more storage at the problem – my kind of problem.

The compute power to create good policies for each subsystem would also be substantial and in the end determining the correct reward signal would be non-trivial for each and every subsystem. Yet, all it takes is money, time and effort and all this could be accomplished.

So, yes, given all the above creating an AGI, that matches human levels of intelligence, using reinforcement learning techniques and rewards is certainly possible. But given all the state information, action possibilities and reward signals inherent in a human interacting in the world today, any human level AGI, would seem unfeasible in the next year or so.

One item of interest, recent DeepMind researchers have create MuZero which learns how to play Go, Chess, Shogi and Atari games without any pre-programmed knowledge of the games (that is how to play the game, how to determine if the game is won or lost, etc.). It managed to come up with it’s own internal reward signal for each game and determined what the proper moves were for each game. This seemed to combine a deep learning neural network together with reinforcement learning techniques to craft a rewards signal and valid move policies.

Alternatives to full AGI

But who says you need AGI, for something that might be a useful to us. Let’s say you just want to construct an intelligent oracle that understood all human generated knowledge and science and could answer any question posed to it. With the only response capabilities being audio, video, images and text.

Even an intelligent oracle such as the above would need an extremely large state. Such a state would include all human and machine generated information at some point in time. And any reward signal needed to generate a good oracle policy would need to be very sophisticated, it would need to determine whether the oracle’s answer; was good or not. And of course the steps to take to answer a query are uncountable, 1st there’s understanding the query, next searching out and examining every piece of information in the state space for relevance, and finally using all that information to answer to the question.

I’m probably missing a few steps in the above, and it almost makes creating a human level AGI seem easier.

Perhaps the MuZero techniques might have an answer to some or all of the above.

~~~~

Yes, reinforcement learning is a valid roadmap to achieving AGI, but can it be done today – no. Tomorrow, perhaps.

Photo credit(s):

Math’s war on Gerrymandering

Read an article the other day in MIT Technology Review, Mathematicians are deploying algorithms to stop gerrymandering, that discussed how a bunch of mathematicians had created tools (Python and R applications, links below) which can be used to test whether state redistricting maps are fair or not.

The best introduction to what these applications can do is in a 2021 I. E. Block Community Lecture: Jonathan Christopher Mattingly captured on YouTube video.

For those outside the states, US Congress House of Representatives are elected via state districts. The term gerrymandering was coined in 1812 over a district map created for Massachusetts that took the form of a dragon which all but assured that the district would go Democrat-Republican vs. Federalist in the next election, (source: Wikipedia entry on Gerrymandering). This re-districting process is done every 10 years in every stateafter the US Census Bureau releases their decennial census results which determines the number of representatives that each state elects to the US House of Representatives.

Funny thing about gerrymandering, both the Democrats and the Republicans have done this in the past and will most likely do so in the future. Once district maps are approved they typically stay that way until the next census.

Essentially, what the mathematicians have done is create a way togenerate a vast number of district maps, under a specific series of constraints/guidelines and then can use these series of maps to characterize the democrat-republican split of a trial election based on some recent election result.

One can see in the above show the # of Democrats that would have been elected using the data from the 2012 and 2016 election results under four distinct maps vs a histogram representing the distribution of all the maps the mathematician’s systems created. The four specific maps indicated on the histograms are.

  • NC2012- thrown out by the state courts as being unfair
  • NC2016– one that NC legislature came out with also thrown out as being unfair as it’s equivalent to the NC2012 map
  • Bipartisan Judges – one that a group of independent judges came out with, and
  • Remedial=NC2020 – one submitted by the mathematicians which they deemed “fairer”
These North Carolina congressional district maps illustrate how geometry is not a fail-safe indicator of gerrymandering. The NC 2012 map, with its bizarre district boundaries, was deemed by the courts to be a racial gerrymander. The replacement, the NC 2016 map, looks quite different and tame by comparison, but was deemed to be an unconstitutional political gerrymander. Analysis by Duke’s Jonathan Mattingly and his team showed that the 2012 and 2016 maps were politically equivalent in their partisan outcomes. A court-appointed expert drew the NC 2020 map.

The algorithms (in Python, GitHub repo for GerryChain and R, GitHub repo for redist) take as input an election map which is a combined US census blocks (physical groupings of population defined by US Census Bureau) and some prior election results (that provide the democrat-republican votes for a particular election within those census blocks). And using this data and a list of districting constraints, such as, compactness requirements, minimal breaks of counties (state political units), population equivalence, etc. and using these inputs, the apps generate a multitude or ensemble of district maps for a state. FYI, districting constraints differ from state to state.

Once they have this ensemble of district maps that adhere to the states specified districting constraints one can compare any sample districting map to the histogram and see if a specific map is fair or not. “Fairness” means that it would result in the same Democrat-Republican split that occurred from the highest number of ensemble maps.

The latest process by which districting maps can be created is documented in a research article, Recombination: A Family of Markov Chains for Redistricting. But most of the prior generations all seem to use a tree structure together with a markov chain approach. At the leaves of the tree are the census blocks and the tree hierarchy algorithmically represents the different districting hierarchies.

Presumably, a Markov Chain encodes a method to represent the state’s districting constraints. And what the algorithm does is traverse the tree of census blocks, using the markov chain and randomness, to create district maps by splitting a branch of the tree (=district) off somewhere in the hierarchy, above the census blocks.

Doing this randomly, over a number of iterations, provides a group or ensemble of proper districting maps that can be used to build the histogram for a specific election result. (Suggest reading the above research report for more information on how this works).

One can see the effect of different election results on the distribution of election results that would have occurred with the current ensemble of maps. For example, USH12 uses the census block voting results from the North Carolina, US (Congressional) House election of 2012 and the GOV16 uses the results from the North Carolina Governor election of 2016

It’s somewhat surprising that the US Supreme Court has ruled that districting is a states issue and not subject to constitutional oversight. Not sure I agree but I’m no constitutional scholar/lawyer. So, all of the legal disputes surrounding state’s re-districting maps have been accomplished in state courts.

But what the mathematicians have done is provide the tools needed to create a multitude of districting maps and when one uses prior election results at the census block level, one can see whether any new re-districting map is representative of what one would see if one drew 100s or 1000s of proper redistricting maps.

Let’s hope this all leads to fairer state and federal elections in the future.

Comments?

Photo Credits:

CTERA, Cloud NAS on steroids

We attended SFD22 last week and one of the presenters was CTERA, (for more information please see SFD22 videos of their session) discussing their enterprise class, cloud NAS solution.

We’ve heard a lot about cloud NAS systems lately (see our/listen to our GreyBeards on Storage podcast with LucidLink from last month). Cloud NAS systems provide a NAS (SMB, NFS, and S3 object storage) front-end system that uses the cloud or onprem object storage to hold customer data which is accessed through the use of (virtual or hardware) caching appliances.

These differ from file synch and share in that Cloud NAS systems

  • Don’t copy lots or all customer data to user devices, the only data that resides locally is metadata and the user’s or site’s working set (of files).
  • Do cache working set data locally to provide faster access
  • Do provide NFS, SMB and S3 access along with user drive, mobile app, API and web based access to customer data.
  • Do provide multiple options to host user data in multiple clouds or on prem
  • Do allow for some levels of collaboration on the same files

Although admittedly, the boundary lines between synch and share and Cloud NAS are starting to blur.

CTERA is a software defined solution. But, they also offer a whole gaggle of hardware options for edge filers, ranging from smart phone sized, 1TB flash cache for home office user to a multi-RU media edge server with 128TB of hybrid disk-SSD solution for 8K video editing.

They have HC100 edge filers, X-Series HCI edge servers, branch in a box, edge and Media edge filers. These later systems have specialized support for MacOS and Adobe suite systems. For their HCI edge systems they support Nutanix, Simplicity, HyperFlex and VxRail systems.

CTERA edge filers/servers can be clustered together to provide higher performance and HA. This way customers can scale-out their filers to supply whatever levels of IO performance they need. And CTERA allows customers to segregate (file workloads/directories) to be serviced by specific edge filer devices to minimize noisy neighbor performance problems.

CTERA supports a number of ways to access cloud NAS data:

  • Through (virtual or real) edge filers which present NFS, SMB or S3 access protocols
  • Through the use of CTERA Drive on MacOS or Windows desktop/laptop devices
  • Through a mobile device app for IOS or Android
  • Through their web portal
  • Through their API

CTERA uses a, HA, dual redundant, Portal service which is a cloud (or on prem) service that provides CTERA metadata database, edge filer/server management and other services, such as web access, cloud drive end points, mobile apps, API, etc.

CTERA uses S3 or Azure compatible object storage for its backend, source of truth repository to hold customer file data. CTERA currently supports 36 on-prem and in cloud object storage services. Customers can have their data in multiple object storage repositories. Customer files are mapped one to one to objects.

CTERA offers global dedupe, virus scanning, policy based scheduled snapshots and end to end encryption of customer data. Encryption keys can be held in the Portals or in a KMIP service that’s connected to the Portals.

CTERA has impressive data security support. As mentioned above end-to-end data encryption but they also support dark sites, zero-trust authentication and are DISA (Defense Information Systems Agency) certified.

Customer data can also be pinned to edge filers, Moreover, specific customer (director/sub-directorydirectories) data can be hosted on specific buckets so that data can:

  • Stay within specified geographies,
  • Support multi-cloud services to eliminate vendor lock-in

CTERA file locking is what I would call hybrid. They offer strict consistency for file locking within sites but eventual consistency for file locking across sites. There are performance tradeoffs for strict consistency, so by using a hybrid approach, they offer most of what the world needs from file locking without incurring the performance overhead of strict consistency across sites. For another way to do support hybrid file locking consistency check out LucidLink’s approach (see the GreyBeards podcast with LucidLink above).

At the end of their session Aron Brand got up and took us into a deep dive on select portions of their system software. One thing I noticed is that the portal is NOT in the data path. Once the edge filers want to access a file, the Portal provides the credential verification and points the filer(s) to the appropriate object and the filers take off from there.

CTERA’s customer list is very impressive. It seems that many (50 of WW F500) large enterprises are customers of theirs. Some of the more prominent include GE, McDonalds, US Navy, and the US Air Force.

Oh and besides supporting potentially 1000s of sites, 100K users in the same name space, and they also have intrinsic support for multi-tenancy and offer cloud data migration services. For example, one can use Portal services to migrate cloud data from one cloud object storage provider to another.

They also mentioned they are working on supplying K8S container access to CTERA’s global file system data.

There’s a lot to like in CTERA. We hadn’t heard of them before but they seem focused on enterprise’s with lots of sites, boatloads of users and massive amounts of data. It seems like our kind of storage system.

Comments?

Storywrangler, ranking tweet ngrams over time

Read a couple of articles the past few weeks on a project in Vermont that has randomly selected 10% of all tweets (150 Billion) since the beginning of Twitter (2008) and can search and rank this tweet corpus for ngrams (1-, 2-, & 3-word phrases). All of these articles were reporting on a Science Advances article: Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter.

Why we need Storywrangler

The challenge with all social media is that it is transient, here now, (mostly) gone tomorrow. That is once posted, if it’s liked/re-posted/re-tweeted it can exist in echoes of the original on the service for some time, and if not, it dies out very quickly never to be seen (externally ever) again. While each of us could potentially see every tweet we have ever created (when this post is published it should be my 5387th tweet on my twitter account) but most of us cannot see this history for others.

All that makes viewing what goes on on social media impossible which leads to a lot of mis-understanding and makes it difficult to analyze. It would be great if we had a way of looking at social media activity in more detail to understand it better.

I wrote about this before (see my Computational anthropology & archeology post) and if anything, the need for such capabilities has become even more important in today’s society.

If only there was a way to examine the twitter-verse. What’s mainly lacking is a corpus of all tweets that have ever been tweeted. A way to slice, dice, search, and rank this text data would be a godsend to understanding (twitter and maybe social) history, in real time.

Storywrangler, has a randomized version of 10% of all tweets since twitter started. And it provides ngram searching and ranking over a specified time interval. It’s not everything but it’s a start.

Storywrangler currently has over 1 trillion (1- to 3- word) ngrams and they support ngram rankings for over 150 different languages.

Google books ngram viewer

The idea for the Storywrangler project came from Google’s books ngram viewer. Google’s ngram viewer has a corpus of Google books, over a time period (from 1800 to 2019) and allows one to search for ngrams (1- to 5-word phrases) over any time period they support.

Google’s ngram viewer charts ngrams with a vertical axis that is the % of all ngrams in their book corpus. One can see the rise and fall of ngrams, e.g., “atomic power”. The phrase “atomic power” peaked in Google books around 1960 at a height of 0.000260% of all 2 word ngrams. The time period level of granularity is a year.

The nice thing about Google books ngram data is you can download their book ngram data yourself. The data is of the form of tab separated list of rows with ngram text (1 to 5 words), year, how many times it occurred that year, on how many pages, on how many books on each row. Google books ngram data is generally about 2 years old.

Unclear just how much data is in Google’s books ngram database but for instance in the 1 gram English fiction list, they show a sample of two rows (the 3,000,000 and 3,000,001 rows) which are the 1978 and 1979 book counts for the word “circumvallate”.

Storywrangler tweet ngram viewer

The usage tab on the Storywrangler website provides a search engine that one can use to input N-grams that you want to search the corpus for and can visualize how their rank changes over time. For example, one can do a similar search on the “atomic power” ngram only for tweets.

From Storywrangler search one can see that peak tweet use of “Atomic Power” and “ATOMIC POWER” occurred somewhere in July of 2020 (only way to see the month is to hover over that line) and it’s rank reached somewhere around ~10,000 highest used tweet 2 word ngram during that time.

It’s interesting to see that ngram books and ngram twitter don’t seem to have any correlation. For example the prior best ranking for atomic power (~200Kth highest) was in June of 2015. There was no similar peak for book ngrams of the phrase.

For Storywrangler you can download a JSON or CSV version of the charts displayed. It’s not the complete ngram history that Google book ngram viewer provides. Storywrangler data is generally about 2 days old.

The other nice thing about Storywrangler is under the real-time tab it will show you ngram rankings at 15 minute intervals for whatever timeline you wish to see. Also under the trending tab it will show you the changing ranks for the top 5 ngrams over a selected time period. And the languagetab will do tracking for tweet language use for select languages. The common tab will track the ranking of most common ngrams (pretty boring mostly articles/prepositions) over time. And for any of these searches one can turn on or off retweet counting, which can help to eliminate bot activity.

Storywrangler provides a number of other statistics for ngrams other than just ranking such as odds (of occurring) and frequency (of occurrence). And one can also track rank change, old (years) rank vs. current (year) rank, rank (turbulence) divergence.

~~~~

Comments?

Photo Credit(s):

Kasten, Kafka & the quest to protect data #CFD11

We attended Cloud Field Day 11 (CFD11) last week and among the vendors at the show was Kasten by Veeam talking about their extensive Kubernetes data protection/DR/migration capabilities. All that was worthy of its own blog post but somewhere near the end of their session, they started discussing how they plan to backup Apache Kafka message traffic

We have discussed Kafka before (see our Data in motion post). For those who have read that post or are familiar with Kafka, you can ignore the Kafka primer section below and just for the record, I’m no Kafka expert.

Kafka primer

Kafka is a massively scalable message bus and real time stream processing system. Messages or records come in from Kafka producers and are sent to Kafka consumers or subscribers for processing. Kafka supplies a number of guarantees one of which is that messages are processed once and only once.

Messages or records are key-value pairs that are generated continuously and are processed by Kafka apps. Message streams are split into topics.

Kafka connectors allow message traffic to be extracted/generated from external databases and other systems . Kafka stream processors use Kafka streaming primitives and stream flow graphs to construct applications that process unbounded, continuously updated data sets.

Topics in Kafka can have multi-producers and multi-subscribers. Each topic can be further split across partitions which can be processed in different servers or brokers in a Kafka cluster. It’s this partitioning that allows Kafka to scale from processing 1 message to millions of messages per second.

Consumers/subscribers can also be producers, so a message that comes in and is processed by a subscriber could create other messages on topics that need to be processed. Messages are placed on topic partitions in the order they are received.

Messages can be batched and added to a partition or not. Recently Kafka added support for sticky partitions. Normally messages are assigned to partitions based on key hashing but sometimes messages have null keys and in this case Kafka assigned them to partitions in a round robin fashion. Sticky partitioning strategy amends that approach to use batches of messages rather than single messages when adding null key messages to partitions.

Kafka essentially provides a realtime streaming message/event processing system that scales very well.

Data protection for Kafka streams without Kasten

This is our best guess as to what this looks like, so bear with us.

Sometime after messages are processed by a subscriber they can be sent to a logging system. If that system records those messages to external storage, they would be available to be copied off-line and backed up.

How far behind real time, Kafka logs happen to be can vary significantly based on the incoming/outgoing message traffic, resources available for logging and the overall throughput of your logging process/storage. Could it be an hour or two behind, possibly but I doubt it, could it be 1 second behind, also unlikely. Somewhere in between those two extremes seems reasonable

Log processing messages just like any topic could be partitioned. So any single log would only have messages on that partition. So there would need to be some post process to stitch all these log partitions together to get a consistent message stream.

In any case, there is potentially a wide gulf between the message data that is being processed and the logs which hold them. But then they need to be backed up as well so there’s another gap introduced between the backed up data what is happening in real time This is especially true the more messages that are being processed in your Kafka cluster.

(Possible) data protection for Kafka streams with Kasten

For starters, it wasn’t clear what Kasten presented was a current product offering, beta offering or just a germ of an idea. So bear that in mind. See the videos of their CFD11 session here for their take on this.

Kafka supports topic replication. What this means is that any topic can be replicated to other servers in the Kafka cluster. Topics can have 0 to N replications and one server is always designated leader (or primary) for a topic and other replicas are secondary. The way it’s supposed to work is that the primary topic partition server will not acknowledge or formally accept any message until it is replicated to all other topic replicas.

What Kasten is proposing is to use topic replicas and take hot snapshots of replicated topic partitions at the secondary server. That way there should be a minimal impact on primary topic message processing. Once snapped, topic partition message data can be backed up and sent offsite for DR.

We have a problem with this chart. Our understanding is that if we take the numbers to be a message id or sequence number, each partition should be getting different messages rather than the the same messages. Again we are not Kafka experts. The Eds.

However, even though Kasten plans to issue hot snapshots, one after another, for each partition, there is still a small time difference between each snap request. As a result, the overall state of the topic partitions when snapped may be slightly inconsistent. In fact, there is a small possibility that when you stitch all the partition snapshots together for a topic, some messages may be missing.

For example, say a topic is replicated and when looking at the replicated topic partition some message say Msg[2001] was delayed (due to server load) in being replicated for partition 0, while Kasten was hot snapping replica partition 15 which already held Msg[2002]. In this case the snapshots missed Msg[2001]. Thus hot snapshots in aggregate, can be missing one or (potentially) more messages.

Kasten called this a crash consistent backup but it’s not the term we would use (not sure what the term would be but it’s worse than crash consistent). But to our knowledge, this was the first approach that Kasten (or anyone else) has described that comes close to providing data protection for Kafka messaging.

As another alternative, Kasten suggested one could make the backed up set of partitions better would by post processing all the snapshots to find the last message point where all prior messages were available and jettisoning any followon messages. In this way, after post processing, the set of data derived from the hot snaps would be crash consistent up to that message point..

It seems to me, the next step to create an application consistent Kafka messaging backup would require Kafka to provide some way to quiesce a topic message stream. Once quiesced and after some delay to accept all in flight messages, hot snaps could be taken. The resultant snapshots in aggregate would have all the messages to the last one accepted.

Unclear whether quiescing a Kafka topic stream, even for a matter of seconds to minutes, is feasible for a system that processes 1000s to millions of messages per second.

Comments?

Photo Credit(s):

Photonic [AI] computing seeing the light of day – part 2

Read an interesting article in Analytics India Magazine (MIT Researchers Make New Chips That Work On Light) about a startup out of MIT focused on using photonics for AI/ML/DL activities. Not exactly neuromorphic chips, but using analog photonics interactions to perform computational intensive operations required by todays deep neural net training.

We’ve written about photonics computing before ( see Photonic computing seeing the light of day [-part 1]). That post was about spin outs from Princeton and MIT back in 2019. We showed a bit more on how photonics can perform multiplication and other computations with less power.

The article (noted above) talked about LightIntelligence, an MIT spinout/ startup that’s been around since ~2017, but there’s another company in the same space, also out of MIT called LightMatter that just announced early access to their hardware system.

The CEOs of both companies collaborated on a paper (#1&2 authors of the 10 author paper) written back in 2017 on Deep Learning with Coherent Nanophotonic Circuits. This seemed to be the event that launched both companies.

LightMatter just received $80M in Series B funding ( bringing total funding to $113M) last month and LightIntelligence seems to have $40M in total funding So both have decent funding but, LightMatter seems further ahead in funding and product technology.

LightMatter

LightMatter Envise Photonics-RISC AI processing chip

LightMatter Envise AI chip uses standard RISC electronic cores together with Photo Arithmetic Units for accelerated AI computations. Each Envise chip has 500MB of SRAM for large models, offers 400Gbps chip to chip interconnect fabric, 256 RISC cores, a Graph processor, 294 photonic arithmetic units and PCIe 4.0 connectivity.

LightMatter has just announced early access for their Envise AI photonics server. It’s an 4U, AI server with 16 Envise chips, 2 AMD EPYC CPUs, (16×400=)6.4Tpbs optical fabric for inter-chip communications, 1TB of DDR4 DRAM, 3TB of NVMe SSD and supports 2-200GbE SmartNICs for outside communications.

Envise also offers Idiom Software that interfaces with standard AI frameworks to transform models for photonics computing to use Envise hardware . Developers select Envise hardware to run their AI models on and Idiom automatically re-compiles (IdCompile) their model into more parallelized, photonics operations. Idiom also has a model profiler (IdProfiler) to help debug and visualize photonic models in operation (training or inferencing?) on Envise hardware. Idiom also offers an AI model library (IdML) which provides a PyTorch frontend to help compress and quantize a standard set of AI models.

LightMatter also announced their Passage optical interconnect chip that supplies 100Tbps optical switch for photonics, CPU or GPU processing. It’s huge, 8″x8″ and built on 5nm/7nm node process. Passage can connect up to 48 photonics, CPU or GPU chips that are built onto of it (one can see the space for each of these 48 [sub-]chips on the chip). LightMatter states that 40 Passage (photonic/optical) lanes are the width of one optical fibre. Passage chips are sampling now.

LightMatter Passage photonics-transistor chip (carrier) that provides a photonics programmable interconnect for inter-[photonics-electronic-]chip communications.

LightIntelligence

They don’t appear to be announcing any specific hardware just yet but they are at work in creating the world largest integrated photonics processing system. But LightIntelligence have published a number of research papers focused on photonic approaches to CNNs, RNNs/LSTMs/GRUs, Recurrent ISING machines, statistical computing, and invisibility cloaking.

Turns out the processing power needed to provide invisibility cloaking is very intensive and as its all pixels, photonics offers serious speedups (for invisibility, see Nature article, behind paywall).

Photonics Recurrent ISLING Sampler (PRIS)

LightIntelligence did produce a prototype photonics processor in 2019. And they believe the will have de-risked 80-90% of their photonics technology by year end 2021.

If I had to guess, it would appear as if LightIntelligence is trying to re-imagine deep learning taking a predominately all photonics approach.

Why photonics for AI DL

It turns out that one can use the interaction/interference between two light beams to perform matrix multiplication and other computations a lot faster, with a lot less power than using standard RISC (or CISC) electronic processor architectures. Typical GPUs run 400W each and multi-GPU training activities are commonplace today.

The research documented in the (Deep learning using nanophotonics) paper was based on using an optical FPGA which we have talked about before (See Photonics or Optical FPGAs on the horizon) to prototype the technology back in 2017.

Can photonics change the technology underpinning AI or computing?

If by using photonics, one could speed up AI inferencing by 3-5X and do it with 5-6X less power, you might have a market. These are LightMatter Envise performance numbers on ResNet50 with ImageNet and BERT-Base with SQUAD v1.1 against NVIDIA DGX-A100 (state of the art) AI processing system.

The challenge to changing the technology behind multi-million/billion/trillion dollar industry is that it’s not sufficient to offer a product better than the competition. One has to offer a technology that’s better enough to fund the building of a new (multi-million/billion/trillion dollar) ecosystem surrounding that technology. In order to do that it’s got to be orders of magnitude faster/lower power/better so that commercial customers adopt it en masse.

I like where LightMatter is going with their Passage chip. But their Envise server doesn’t seem fast enough to give them enough traction to build a photonics ecosystem or to fund Envise 2, 3, 4, etc. to change the industry.

The 2017 (Deep learning using nanophotonics) paper predicted that an all optical/photonics implementation of CNN would use 3 orders of magnitude less power for small models and that advantage would only go up for larger models (not counting power for data movement, photo detectors, etc.). Now if that’s truly feasible and maybe it takes a more photonics intensive processor to get there, then photonics technology could truly transform the AI or for that matter the computing industry.

But the other thing that LightIntelligence and LightMatter may be counting on is the slowdown in Moore’s law which may inhibit further advances in electronics processing power. Whether the silicon industry is ready to throw in the towel yet on Moore’s law is TBD.

Comments?

Photo Credit(s):

New era of graphical AI is near #AIFD2 @Intel

I attended AIFD2 ( videos of their sessions available here) a couple of weeks back and for the last session, Intel presented information on what they had been working on for new graphical optimized cores and a partner they have, called Katana Graph, which supports a highly optimized graphical analytics processing tool set using latest generation Xeon compute and Optane PMEM.

What’s so special about graphs

The challenges with graphical processing is that it’s nothing like standard 2D tables/images or 3D oriented data sets. It’s essentially a non-Euclidean data space that has nodes with edges that connect them.

But graphs are everywhere we look today, for instance, “friend” connection graphs, “terrorist” networks, page rank algorithms, drug impacts on biochemical pathways, cut points (single points of failure in networks or electrical grids), and of course optimized routing.

The challenge is that large graphs aren’t easily processed with standard scale up or scale out architectures. Part of this is that graphs are very sparse, one node could point to one other node or to millions. Due to this sparsity, standard data caching fetch logic (such as fetching everything adjacent to a memory request) and standardized vector processing (same instructions applied to data in sequence) don’t work very well at all. Also standard compute branch prediction logic doesn’t work. (Not sure why but apparently branching for graph processing depends more on data at the node or in the edge connecting nodes).

Intel talked about a new compute core they’ve been working on, which was was in response to a DARPA funded activity to speed up graphical processing and activities 1000X over current CPU/GPU hardware capabilities.

Intel presented on their PIUMA core technology was also described in a 2020 research paper (Programmable Integrated and Unified Memory Architecture) and YouTube video (Programmable Unified Memory Architecture).

Intel’s PIUMA Technology

DARPA’s goals became public in 2017 and described their Hierarchical Identity Verify Exploit (HIVE) architecture. HIVE is DOD’s description of a graphical analytics processor and is a multi-institutional initiative to speed up graphical processing. .

Intel PIUMA cores come with a multitude of 64-bit RISC processor pipelines with a global (shared) address space, memory and network interfaces that are optimized for 8 byte data transfers, a (globally addressed) scratchpad memory and an offload engine for common operations like scatter/gather memory access.

Each multi-thread PIUMA core has a set of instruction caches, small data caches and register files to support each thread (pipeline) in execution. And a PIUMA core has a number of multi-thread cores that are connected together.

PIUMA cores are optimized for TTEPS (Tera-Traversed Edges Per Second) and attempt to balance IO, memory and compute for graphical activities. PIUMA multi-thread cores are tied together into (completely connected) clique into a tile, multiple tiles are connected within a single node and multiple nodes are tied together with a 8 byte transfer optimized network into a PIUMA system.

P[I]UMA (labeled PUMA in the video) multi-thread cores apparently eschew extensive data and instruction caching to focus on creating a large number of relatively simple cores, that can process a multitude of threads at the same time. Most of these threads will be waiting on memory, so the more threads executing, the less likely that whole pipeline will need to be idle, and hopefully the more processing speedup can result.

Performance of P[I]UMA architecture vs. a standard Xeon compute architecture on graphical analytics and other graph oriented tasks were simulated with some results presented below.

Simulated speedup for a single node with P[I]UMAtechnology vs. Xeon range anywhere from 3.1x to 279x and depends on the amount of computation required at each node (or edge). (Intel saw no speedups between a single Xeon node and multiple Xeon Nodes, so the speedup results for 16 P[I]UMA nodes was 16X a single P[I]UMA node).

Having a global address space across all PIUMA nodes in a system is pretty impressive. We guess this is intrinsic to their (large) graph processing performance and is dependent on their use of photonics HyperX networking between nodes for low latency, small (8 byte) data access.

Katana Graph software

Another part of Intel’s session at AIFD2 was on their partnership with Katana Graph, a scale out graph analytics software provider. Katana Graph can take advantage of ubiquitous Xeon compute and Optane PMEM to speed up and scale-out graph processing. Katana Graph uses Intel’s oneAPI.

Katana graph is architected to support some of the largest graphs around. They tested it with the WDC12 web data commons 2012 page crawl with 3.5B nodes (pages) and 128B connections (links) between nodes.

Katana runs on AWS, Azure, GCP hyperscaler environment as well as on prem and can scale up to 256 systems.

Katana Graph performance results for Graph Neural Networks (GNNs) is shown below. GNNs are similar to AI/ML/DL CNNs but use graphical data rather than images. One can take a graph and reduce (convolute) and summarize segments to classify them. Moreover, GNNs can be used to understand whether two nodes are connected and whether two (sub)graphs are equivalent/similar.

In addition to GNNs, Katana Graph supports Graph Transformer Networks (GTNs) which can analyze meta paths within a larger, heterogeneous graph. The challenge with large graphs (say friend/terrorist networks) is that there are a large number of distinct sub-graphs within the graph. GTNs can break heterogenous graphs into sub- or meta-graphs, which can then be used to understand these relationships at smaller scales.

At AIFD2, Intel also presented an update on their Analytics Zoo, which is Intel’s MLops framework. But that will need to wait for another time.

~~~~

It was sort of a revelation to me that graphical data was not amenable to normal compute core processing using today’s GPUs or CPUs. DARPA (and Intel) saw this defect as a need for a completely different, brand new compute architecture.

Even so, Intel’s partnership with Katana Graph says that even today compute environment could provide higher performance on graphical data with suitable optimizations.

It would be interesting to see what Katana Graph could do using PIUMA technology and appropriate optimizations.

In any case, we shouldn’t need to wait long, Intel indicated in the video that P[I]UMA Technology chips could be here within the next year or so.

Comments?

Photo Credit(s):

  • From Intel’s AIFD2 presentations
  • From Intel’s PUMA you tube video

Swarm learning for distributed & confidential machine learning

Read an article the other week about researchers in Germany working with a form of distributed machine learning they called swarm learning (see: AI with swarm intelligence: a novel technology for cooperative analysis …) which was reporting on a Nature magazine article (see: Swarm Learning for decentralized and confidential clinical machine learning).

The problem of shared machine learning is particularly accute with medical data. Many countries specifically call out patient medical information as data that can’t be shared between organizations (even within country) unless specifically authorized by a patient.

So these organizations and others are turning to use distributed machine learning as a way to 1) protect data across nodes and 2) provide accurate predictions that uses all the data even though portions of that data aren’t visible. There are two forms of distributed machine learning that I’m aware of federated and now swarm learning.

The main advantages of federated and swarm learning is that the data can be kept in the hospital, medical lab or facility without having to be revealed outside that privileged domain BUT the [machine] learning that’s derived from that data can be shared with other organizations and used in aggregate, to increase the prediction/classification model accuracy across all locations.

How distributed machine learning works

Distributed machine learning starts with a common model that all nodes will download and use to share learnings. At some agreed to time (across the learning network), all the nodes use their latest data to re-train the common model and share new training results (essentially weights used in the neural network layers) with all other members of the learning network.

Shared learnings would be encrypted with TLS plus some form of homomorphic encryption that allowed for calculations over the encrypted data.

In both federated and swarm learning, the sharing mechanism was facilitated by a privileged block chain (apparently Etherium for swarm). All learning nodes would use this blockchain to share learnings and download any updates to the common model after sharing.

Federated vs. Swarm learning

The main difference between federated and swarm learning is that with federated learning there is a central authority that updates the model(s) and with swarm learning that processing is replaced by a smart contract executing within the blockchain. Updating model(s) is done by each node updating the blockchain with shared data and then once all updates are in, it triggers a smart contract to execute some Etherium VM code which aggregates all the learnings and constructs a new model (or at least new weights for the model). Thus no node is responsible for updating the model, it’s all embedded into a smart contract within the Etherium block chain. .

Buthow does the swarm (or smart contract) update the common model’s weights. The Nature article states that they used either a straight average or a weighted average (weighted by “weight” of a node [we assume this is a function of the node’s re-training dataset size]) to update all parameters of the common model(s).

Testing Swarm vs. Centralized vs. Individual (node) model learning

In the Nature paper, the researchers compared a central model, where all data is available to retrain the models, with one utilizing swarm learning. To perform the comparison, they had all nodes contribute 20% of their test data to a central repository, which ran the common swarm updated model against this data to compute an accuracy metric for the swarm. The resulting accuracy of the central vs swarm learning comparison look identical.

They also ran the comparison of each individual node (just using the common model and then retraining it over time without sharing this information to the swarm versus using the swarm learning approach. In this comparison the swarm learning approach alway seemed to have as good as if not better accuracy and much narrower dispersion.

In the Nature paper, the researchers used swarm learning to manage the machine learning model predictions for detecting COVID19, Leukemia, Tuberculosis, and other lung diseases. All of these used public data, which included PBMC (peripheral blood mono-nuclear cells) transcription data, whole blood transcription data, and X-ray images.

Swarm learning also provides the ability to onboard new nodes in the network. Which would supply the common model and it’s current weights to the new node and add it to the shared learning smart contract.

The code for the swarm learning can be downloaded from HPE (requires an HPE passport login [it’s free]). The code for the models and data processing used in the paper are available from github. All this seems relatively straight forward, one could use the HPE Swarm Learning Library to facilitate doing this or code it up oneself.

Photo Credit(s):