I attended AIFD2 (videos of their sessions are available here) a couple of weeks back, and for the last session Intel presented what they have been working on for new graph-optimized cores, along with a partner of theirs, Katana Graph, which supports a highly optimized graph analytics tool set using latest-generation Xeon compute and Optane PMEM.
What’s so special about graphs
The challenge with graph processing is that graph data is nothing like standard 2D tables/images or 3D-oriented data sets. It’s essentially a non-Euclidean data space made up of nodes and the edges that connect them.
But graphs are everywhere we look today, for instance, “friend” connection graphs, “terrorist” networks, page rank algorithms, drug impacts on biochemical pathways, cut points (single points of failure in networks or electrical grids), and of course optimized routing.
The challenge is that large graphs aren’t easily processed with standard scale up or scale out architectures. Part of this is that graphs are very sparse and irregular: one node could point to one other node or to millions. Due to this sparsity, standard data caching fetch logic (such as fetching everything adjacent to a memory request) and standardized vector processing (same instructions applied to data in sequence) don’t work very well at all. Standard compute branch prediction logic doesn’t work well either. (Not sure why, but apparently branching for graph processing depends more on the data at the node or in the edge connecting nodes.)
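To make the memory access problem a bit more concrete, here is a minimal sketch of a breadth-first traversal over a graph stored in compressed sparse row (CSR) form. The function names and the tiny example graph are my own illustration, not anything Intel or Katana Graph showed. The thing to notice is that the neighbor lists have wildly different lengths and that each neighbor lookup jumps to a data-dependent spot in the `level` array, which is the kind of access pattern that prefetchers and vector units struggle with.

```python
from collections import deque


def to_csr(num_nodes, edges):
    """Build CSR arrays (offsets, targets) from a list of (src, dst) edges."""
    degree = [0] * num_nodes
    for src, _ in edges:
        degree[src] += 1
    # offsets[n]..offsets[n + 1] is the slice of `targets` holding n's neighbors
    offsets = [0] * (num_nodes + 1)
    for node in range(num_nodes):
        offsets[node + 1] = offsets[node] + degree[node]
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()  # next free slot for each source node
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def bfs_levels(offsets, targets, source):
    """Return the hop count from `source` to every node (-1 if unreachable)."""
    level = [-1] * (len(offsets) - 1)
    level[source] = 0
    frontier = deque([source])
    while frontier:
        node = frontier.popleft()
        # Neighbor list length varies per node: 0 for some, millions for others
        for i in range(offsets[node], offsets[node + 1]):
            neighbor = targets[i]
            # Data-dependent jump into `level`, and a data-dependent branch
            if level[neighbor] == -1:
                level[neighbor] = level[node] + 1
                frontier.append(neighbor)
    return level


# Hypothetical 7-node graph: node 0 is a small "hub", the rest form a chain
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 5), (5, 6)]
offsets, targets = to_csr(7, edges)
levels = bfs_levels(offsets, targets, 0)
print(offsets)
print(levels)
```

In a real graph with billions of edges, `targets` and `level` are far larger than any cache, so nearly every one of those lookups is a trip to memory.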
Intel talked about a new compute core they’ve been working on in response to a DARPA-funded activity to speed up graph processing 1000X over current CPU/GPU hardware capabilities.
The PIUMA core technology Intel presented was also described in a 2020 research paper (Programmable Integrated and Unified Memory Architecture) and a YouTube video (Programmable Unified Memory Architecture).
Intel’s PIUMA Technology
DARPA’s goals became public in 2017 and described their Hierarchical Identify Verify Exploit (HIVE) architecture. HIVE is DOD’s description of a graph analytics processor and is a multi-institutional initiative to speed up graph processing.
Intel PIUMA cores come with a multitude of 64-bit RISC processor pipelines with a global (shared) address space, memory and network interfaces that are optimized for 8 byte data transfers, a (globally addressed) scratchpad memory and an offload engine for common operations like scatter/gather memory access.
Each multi-thread PIUMA core has a set of instruction caches, small data caches and register files to support each thread (pipeline) in execution. And a PIUMA core has a number of multi-thread cores that are connected together.
PIUMA cores are optimized for TTEPS (Tera-Traversed Edges Per Second) and attempt to balance IO, memory and compute for graph workloads. PIUMA multi-thread cores are tied together in a (completely connected) clique to form a tile, multiple tiles are connected within a single node, and multiple nodes are tied together with an 8 byte transfer optimized network into a PIUMA system.
P[I]UMA (labeled PUMA in the video) multi-thread cores apparently eschew extensive data and instruction caching in favor of a large number of relatively simple cores that can process a multitude of threads at the same time. Most of these threads will be waiting on memory, so the more threads in execution, the less likely it is that the whole pipeline will sit idle, and hopefully the greater the resulting speedup.
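The "more threads hide more memory waits" argument can be put into a back-of-the-envelope model. This is only a sketch under my own assumptions: each thread alternates between a fixed burst of compute and a fixed memory wait, and the pipeline can switch to any ready thread at no cost. The cycle counts below are made-up illustrative numbers, not PIUMA specifications.

```python
def pipeline_utilization(threads, compute_cycles, memory_wait_cycles):
    """Fraction of time the pipeline has useful work, in a simple
    interleaving model where each thread computes, then waits on memory."""
    busy_share = compute_cycles / (compute_cycles + memory_wait_cycles)
    # Each extra thread fills another slice of the wait time, up to 100%
    return min(1.0, threads * busy_share)


def threads_to_saturate(compute_cycles, memory_wait_cycles):
    """Smallest thread count that keeps the pipeline busy in this model."""
    total = compute_cycles + memory_wait_cycles
    return -(-total // compute_cycles)  # ceiling division on integers


# Assumed numbers: 10 cycles of work per 190 cycles spent waiting on memory
COMPUTE, WAIT = 10, 190
for n in (1, 4, 16, 32):
    print(n, "threads ->", pipeline_utilization(n, COMPUTE, WAIT))
print("threads needed:", threads_to_saturate(COMPUTE, WAIT))
```

Under these assumed numbers a single thread keeps the pipeline busy only 5% of the time, and it takes about 20 threads to fill it. Real hardware is messier than this, but it shows why trading big caches for lots of simple threads can make sense when almost every access misses anyway.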
The performance of the P[I]UMA architecture vs. a standard Xeon compute architecture on graph analytics and other graph-oriented tasks was simulated, with some results presented below.
Simulated speedups for a single node with P[I]UMA technology vs. Xeon range anywhere from 3.1x to 279x, depending on the amount of computation required at each node (or edge). (Intel saw no speedups between a single Xeon node and multiple Xeon nodes, so the speedup results for 16 P[I]UMA nodes were 16X those of a single P[I]UMA node.)
Having a global address space across all PIUMA nodes in a system is pretty impressive. We guess this is intrinsic to their (large) graph processing performance and is dependent on their use of photonics HyperX networking between nodes for low latency, small (8 byte) data access.
Katana Graph software
Another part of Intel’s session at AIFD2 was on their partnership with Katana Graph, a scale out graph analytics software provider. Katana Graph can take advantage of ubiquitous Xeon compute and Optane PMEM to speed up and scale-out graph processing. Katana Graph uses Intel’s oneAPI.
Katana Graph is architected to support some of the largest graphs around. They tested it with the WDC12 (Web Data Commons 2012) page crawl, which has 3.5B nodes (pages) and 128B connections (links) between nodes.
Katana runs in AWS, Azure and GCP hyperscaler environments as well as on prem, and can scale up to 256 systems.
Katana Graph performance results for Graph Neural Networks (GNNs) are shown below. GNNs are similar to AI/ML/DL CNNs but use graph data rather than images. One can take a graph and reduce (convolve) and summarize segments to classify them. Moreover, GNNs can be used to understand whether two nodes are connected and whether two (sub)graphs are equivalent/similar.
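For a feel of what "convolving" over a graph means, here is a minimal sketch of one neighbor-averaging step, the basic building block of many GNN layers. It is a generic illustration and not Katana Graph's implementation: the function name, the three-node graph and the feature values are all hypothetical, and a real GNN layer would also multiply by learned weights and apply a nonlinearity.

```python
def mean_aggregate(neighbors, features):
    """One round of message passing: each node's new feature vector is the
    mean of its own vector and the vectors of its neighbors."""
    updated = {}
    for node, own in features.items():
        group = [own] + [features[n] for n in neighbors.get(node, [])]
        updated[node] = [
            sum(vec[i] for vec in group) / len(group)
            for i in range(len(own))
        ]
    return updated


# Hypothetical graph: "a" is linked to "b" and "c"
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.0, 1.0]}

hidden = mean_aggregate(neighbors, features)
print(hidden)
```

Where a CNN averages over a fixed 3x3 patch of pixels, this averages over whatever neighbors a node happens to have, so the "patch" is a different size and shape at every node. Stacking several such rounds lets information flow from nodes several hops away.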
In addition to GNNs, Katana Graph supports Graph Transformer Networks (GTNs), which can analyze meta paths within a larger, heterogeneous graph. The challenge with large graphs (say friend/terrorist networks) is that there are a large number of distinct sub-graphs within the graph. GTNs can break heterogeneous graphs into sub- or meta-graphs, which can then be used to understand these relationships at smaller scales.
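As I understand it, a meta path is just a chain of edge types, and following one is equivalent to composing (multiplying) the adjacency relations for each type. The sketch below shows that composition step only. The edge types, node names and the `compose` helper are hypothetical, and an actual GTN learns which edge types to chain together rather than having them hard coded like this.

```python
def compose(first, second):
    """Follow one edge from `first`, then one from `second`.
    Returns {src: {dst: number_of_distinct_two_hop_paths}}."""
    result = {}
    for src, mids in first.items():
        for mid in mids:
            for dst in second.get(mid, []):
                row = result.setdefault(src, {})
                row[dst] = row.get(dst, 0) + 1
    return result


# Hypothetical heterogeneous graph with two edge types
member_of = {"ann": ["club1", "club2"], "bob": ["club2"]}  # person -> group
based_in = {"club1": ["paris"], "club2": ["paris"]}        # group -> city

# Meta path person -> group -> city, collapsed into a person -> city graph
person_city = compose(member_of, based_in)
print(person_city)
```

The resulting person-to-city graph is much smaller and single-purpose, which is the sort of meta-graph that can then be analyzed on its own.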
At AIFD2, Intel also presented an update on their Analytics Zoo, which is Intel’s MLops framework. But that will need to wait for another time.
It was sort of a revelation to me that graph data was not amenable to normal compute core processing using today’s GPUs or CPUs. DARPA (and Intel) saw this defect as a need for a completely different, brand new compute architecture.
Even so, Intel’s partnership with Katana Graph says that even today’s compute environments can provide higher performance on graph data with suitable optimizations.
It would be interesting to see what Katana Graph could do using PIUMA technology and appropriate optimizations.
In any case, we shouldn’t need to wait long. Intel indicated in the video that P[I]UMA technology chips could be here within the next year or so.
- From Intel’s AIFD2 presentations
- From Intel’s PUMA YouTube video