114: GreyBeards talk computational storage with Tong Zhang, Co-Founder & Chief Scientist, ScaleFlux

Seeing as how one topic on last year’s FMS2020 wrap-up with Jim Handy was the rise of computational storage, and it’s been a long time since we discussed it (see GreyBeards talk with Scott Shadley at NGD Systems), we thought it time to check in on the technology. So we reached out to Dr. Tong Zhang, Chief Scientist and Co-founder, ScaleFlux, to see what’s going on. ScaleFlux is seeing rising adoption of their product at hyper-scalers as well as large enterprises. Their computational storage product is a programmable, FPGA-based SSD that comes in 4TB and 8TB capacities.

Tong was very knowledgeable on current industry trends (Moore’s law slowing & others) that have created an opening for computational storage and other outboard compute. He is also well versed in how some of the world’s biggest customers are using the technology to work faster and cheaper in their data centers. Listen to the podcast to learn more.

At the start Tong mentioned Alibaba’s use of ScaleFlux’s transparent, line speed, outboard encryption/decryption and compression/decompression. And, depending on the data, they can see compression ratios far exceeding 2:1. As such, customers not only benefit from a cheaper $/GB but can also see better NAND endurance and higher performance.

Hosts can do compression and encryption, but doing so takes a lot of CPU cycles. It turns out that compression is more compute intensive than encryption. Tong said that most modern cores can encrypt/decrypt at 1GB/sec but, depending on the compression algorithm, can only compress at 40 to 100MB/sec. ScaleFlux, on the other hand, can compress and decompress at PCIe bus speeds without consuming host CPU cycles.
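To get a feel for the host-side cost Tong describes, here is a minimal sketch that compresses a buffer on one core and reports the ratio and throughput. It uses Python's standard zlib module as a stand-in; the sample data and the `compress_stats` helper are our own assumptions, not anything ScaleFlux uses, and the numbers will vary widely with the data and the machine.

```python
import time
import zlib

def compress_stats(data: bytes, level: int = 6) -> dict:
    """Compress data on the host CPU and report ratio and throughput."""
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    # Round-trip check: decompression must restore the original bytes.
    assert zlib.decompress(packed) == data
    return {
        "ratio": len(data) / len(packed),
        "mb_per_sec": len(data) / 1e6 / elapsed if elapsed > 0 else float("inf"),
    }

# Hypothetical sample: repetitive log-like text, which compresses well.
sample = b"2021-01-15 INFO request served in 12ms\n" * 50_000
stats = compress_stats(sample)
print(f"ratio {stats['ratio']:.1f}:1 at {stats['mb_per_sec']:.0f} MB/s on one core")
```

Less repetitive data will show a lower ratio, and a higher compression level trades throughput for ratio.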

Most storage controllers that offer compression/decompression must have some sort of LBA (logical block address) virtualization. Because while the host may be writing 512 or 4096 byte blocks, what’s actually written to the NAND is more like, 231 or 1999 bytes. So packing these odd, variable length blocks into NAND blocks can become a problem. But most SSDs already have a flash translation layer (FTL) where LBA addresses are mapped, over time, to different physical NAND page/block addresses. ScaleFlux has combined support for LBA virtualization and FTL into the same process and by doing so, they reduce IO overhead to perform better.
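As a rough illustration of folding LBA virtualization and the FTL into one mapping step, the sketch below keeps a single table from host LBA to a variable-length compressed extent packed into NAND pages. Everything here (the `CompressingFTL` class, the 16KB page size, zlib as the compressor) is an assumption made for illustration and says nothing about ScaleFlux's actual design; garbage collection and wear leveling are left out.

```python
import zlib
from dataclasses import dataclass

PAGE_SIZE = 16384  # assumed NAND page size in bytes, for illustration only

@dataclass
class Extent:
    page: int    # physical NAND page holding the compressed block
    offset: int  # byte offset inside that page
    length: int  # compressed length in bytes

class CompressingFTL:
    """Toy single-table map from host LBA to a variable-length extent."""

    def __init__(self):
        self.pages = [bytearray()]  # append-only log of NAND pages
        self.table = {}             # LBA -> Extent (one lookup, no second layer)

    def write(self, lba: int, block: bytes) -> None:
        packed = zlib.compress(block)
        if len(self.pages[-1]) + len(packed) > PAGE_SIZE:
            self.pages.append(bytearray())  # open a fresh page when full
        page = len(self.pages) - 1
        offset = len(self.pages[page])
        self.pages[page] += packed
        # Overwrites simply remap the LBA; the old extent becomes garbage.
        self.table[lba] = Extent(page, offset, len(packed))

    def read(self, lba: int) -> bytes:
        e = self.table[lba]
        return zlib.decompress(bytes(self.pages[e.page][e.offset:e.offset + e.length]))

ftl = CompressingFTL()
ftl.write(7, b"A" * 4096)             # highly compressible 4096 byte block
ftl.write(8, bytes(range(256)) * 16)  # another 4096 byte block
ftl.write(7, b"B" * 4096)             # overwrite: LBA 7 now points elsewhere
```

Because the compressed length lives in the same table entry as the physical location, a read needs only one lookup, which is the kind of overhead saving the combined approach is after.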

ScaleFlux’s drive is an NVMe SSD, which already supports great native response times. But when compression means only half or less of the host’s data actually has to be written to NAND, latencies can drop even more.

Although their current generation product is based on TLC NAND they are working on the next generation which will support QLC. And the benefits of writing and reading less data should also help QLC endurance and performance.

Although ScaleFlux is seeing great adoption with just outboard transparent compression and encryption, there is more that could be done. For example:

  • Filtering queries at the drive rather than at the host. If customers can send a search key/phrase or other filtering request directly to the drive, the drive can pass over all its data and send back just the data that matches that filter request.
  • Transcoding and other data format changes. Although transcoding makes a lot of sense to do outboard, Tong also mentioned format changes. We asked him to clarify and he said consider a row based database that needs to be accessed in columnar format. If the drive could change the format from one to the other, it opens up more analytics tool sets.
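Both ideas in the list above can be sketched in a few lines. The example below is purely illustrative: the function names, the sample rows and the idea of passing a predicate to the drive are our assumptions, not a ScaleFlux or NVMe interface. It filters rows where they are stored and then pivots the matching rows from a row layout to a columnar one, assuming every row carries the same fields.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]

def drive_side_filter(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Scan every stored row on the 'drive' and return only the matches."""
    return [row for row in rows if predicate(row)]

def rows_to_columns(rows: List[Row]) -> Dict[str, List[object]]:
    """Pivot row-oriented records into a column-oriented layout."""
    columns: Dict[str, List[object]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

# Hypothetical table contents held on the drive.
stored = [
    {"id": 1, "region": "east", "sales": 120},
    {"id": 2, "region": "west", "sales": 75},
    {"id": 3, "region": "east", "sales": 310},
]
matches = drive_side_filter(stored, lambda r: r["region"] == "east")
columns = rows_to_columns(matches)
print(columns)
```

The point of doing this at the drive is that only `matches` (or `columns`) would cross the PCIe bus, rather than the whole table.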

At the moment, ScaleFlux engineering teams are the ones that program the FPGA to perform outboard functionality. But in a future release, they plan to add ARM cores in a SoC, which can handle more general purpose outboard functionality as code.

Because of this added complexity of compression, encryption and other outboard logic, we asked Tong what power loss protection was available at the drive level. Tong assured us that once data has been received by their device, it is maintained across a power failure with CAPs and other logic to offload it.

Tong also mentioned that Intel, AWS and the NVMe standard committee are looking at adding some computational storage support into the NVMe standard, so applications and host software can invoke and maybe modify outboard functionality on the fly. Sort of like loading containers of functionality to run on the fly on an SSD drive.

Dr. Tong Zhang, Chief Scientist and Co-founder, ScaleFlux

Dr. Tong Zhang is a well-established researcher with significant contributions to data storage systems and VLSI signal processing. Dr. Zhang is responsible for developing key techniques and algorithms for ScaleFlux’s Computational Storage products and exploring their use in mainstream application domains.

He is currently a Professor at Rensselaer Polytechnic Institute (RPI). His current and past research spans databases, filesystems, solid-state and magnetic data storage devices and systems, digital signal processing and communication, error correction coding, VLSI architectures, and computer architecture.

He has published over 150 technical papers at prestigious USENIX/IEEE/ACM conferences and journals with the citation h-index of 36, and has served as general and technical program chairs for several premier conferences. Among his many research accomplishments, he made pioneering contributions to establishing flash memory signal processing and enabling practical implementation of low-density parity-check (LDPC) codecs. He received two best paper awards and has over 20 issued/pending US patent applications.

He holds BS/MS degrees in EE from the Xi’an Jiaotong University, China, and PhD degree in ECE from the University of Minnesota.

112: GreyBeards annual year end wrap-up with Keith & Matt

It’s the end of the year, so time for our regular year end wrap up discussion with the GreyBeards. 2020 has been an interesting year to say the least. It started out just fine, then COVID19 showed up and threw a wrench in everyone’s plans and as the year closes, we were just starting to see some semblance of the new normal, when one of the largest security breaches in years shows up. Whew, almost glad that’s over and onto 2021.

As always the GreyBeards had a great discussion on these and other topics to highlight the year just past. The talk was wide ranging and hard to characterize but I did my best below. Listen to the podcast to learn more.

COVID19’s impact on the enterprise

It will probably take some time before we learn the true, long term impacts of COVID19 on IT but one major change has to be the massive Work From Home (WFH) transition that took place overnight.

While WFH can be more productive for some, the lack of face2face interaction can be challenging for others. The fact that many of the GreyBeards have been working from home for decades now, left us a bit oblivious to how jarring this transition can be for newcomers.

There are definitely some psychological changes that need to occur to be productive at WFH. Organization skills become even more important. Structured interactions (read conference calls, Zoom/Webex and other forms of communication) become much more important. And then there’s security.

Turns out VMware and others have been touting VDI solutions for the past decade or so to better support remote work and at the same time providing corporate levels of security for remote work. While occasionally this doesn’t work quite as well as expected, it’s certainly much much better than having end users access corporate data without any security around that data or worse yet, the “bring your own device”. All these VDI solutions had a field day when WFH happened.

Many workers found they could be more productive at WFH, due to fewer distractions, no commute time and more flexible hours. What happens to all these current WFHers when COVID19 is vanquished is anyone’s guess.

We thought there might be less need for large office campuses/buildings. But there’s something to be said for more collaboration and random interactions through face2face meetings that can only occur in an office setting with workers present at the same time. Some organizations will take to this new way of work while others will try to dial WFH back to non-existent. Where your organization fits on this spectrum and why, will be telling across a number of dimensions.

The rise of ARM

There’s been a slow but steady improvement in ARM processors over the last almost half century. Nowadays it’s starting to make a place for itself in the enterprise. ARM has always been the goto microprocessor for low power solutions (like smartphones) but nowadays they are being deployed in the cloud and even the enterprise. These can be used as server processors but even outside servers, ARM cores are showing up in hardware accelerators as the brains behind SmartNICs, DPUs, SPUs, etc.

Keith made mention of AWS’s 2nd generation Graviton 64-bit ARM processor EC2 instances. And yes, there are significant cost (& power) savings to be had using AWS Graviton ARM instances. So the cloud is starting to adopt them. Somewhere over the past couple of years I heard that VMware was porting ESX to work on ARM cores.

But apparently, it’s not just as simple as dropping an ARM multi-core processor into a server and recompiling your code and away you go. Applications need a certain amount of optimization to run effectively on ARM processors. And the speed up between non-optimized and optimized versions of an application running on ARM cores is significant.

As for SmartNICs and DPUs, these are data networking hardware accelerators that provide real time processing capabilities needed to keep up with higher speed networking, 100GbE and beyond. These DPUs perform deep packet inspection, data compression, encryption and other services all at wire speeds. Yes, you could devote 1 or more X86 cores to do this, but it’s much cheaper (and more effective) to do this outside the CPU core. Moreover, performing this activity at the network entry point to the server means that much of this data doesn’t have to be transferred back and forth through server memory. So not only does it save CPU core cycles but also memory size and memory & PCIe bus bandwidth. We published a recent podcast with Kevin Deierling, NVIDIA Networking discussing DPUs if you want to learn more.

Pat made mention at (virtual) VMworld of their plans to port ESX to the DPU. Keith followed up on this and asked some other execs at VMware about it, and they said VMware will more likely support DPUs as just another hardware accelerator in their cluster. In either case, CPU cycles should be freed up and this should help VMware use X86 cores more efficiently. And perhaps this will help them engage in more CPU constrained environments such as Telecom.

Then there’s computational storage. We have been watching this technology for a couple of years now and it’s seeing some success in being deployed to public cloud environments. They seem to be being used to provide outboard data compression. It’s unclear whether these systems depend on ARM processing or not but my bet is that they do. To learn more about computational storage check out these podcasts, FMS2020 wrap up with Jim Handy and our talk with Scott Shadley on NGD’s computational storage.

System security

At yearend, we are learning of a massive security breach throughout US government IT facilities. All based on what is believed to be a Russian hack to a software package that is embedded in a popular networking tool software solution, SolarWinds. They are calling this a software supply chain hack. Although we are mainly hearing about government agencies being hacked, SolarWinds is also pervasive in the enterprise as well.

There have been many hardware supply chain hacks in the past, where a board supplier used chips or logic that weren’t properly vetted. Over time, hardware suppliers have started to scrutinize their supply chains better and have reduced this risk.

And the US government has been lobbying for the industry to use a security chip with a backdoor or to supply back doors to smartphone encryption capabilities. Luckily, so far, none of these have been implemented by industry.

What Russia has shown us is that this particular hack is not limited to the hardware sphere. Software supply chain risk can’t be ignored anymore.

This means that any software application supplier will need to secure their supply chain or bring it all in house. Which may mean that costs for these packages will go up. It’s possible that using a pure open source supply chain may reduce this risk as well. At least that’s the promise of open source.

We said 2020 was an interesting year and it’s going out with a bang.

Matt Leib (@MBLeib), one of our co-hosts, has been blogging in the storage space for over 10 years, with work experience both on the engineering and presales/product marketing sides. His blog is at Virtually Tied to My Desktop and he’s on LinkedIn.

Keith Townsend (@CTOAdvisor) is an IT thought leader who has written articles for many industry publications, interviewed many industry heavyweights, worked with Silicon Valley startups, and engineered cloud infrastructure for large government organizations. Keith is the co-founder of The CTO Advisor, blogs at Virtualized Geek, and can be found on LinkedIn.

109: GreyBeards talk SmartNICs & DPUs with Kevin Deierling, Head of Marketing at NVIDIA Networking

We decided to take a short break (of sorts) from storage to talk about something equally important to the enterprise, networking. At (virtual) VMworld a month or so ago, Pat made mention of developing support for SmartNIC-DPUs and even porting vSphere to run on top of a DPU. So we thought it best to go to the source of this technology and talk with Kevin Deierling (@TechSeerKD), Head of Marketing at NVIDIA Networking, who are the ones supplying these SmartNICs to VMware and others in the industry.

Kevin is always a pleasure to talk with and comes with a wealth of expertise and understanding of the technology underlying data centers today. The GreyBeards found our discussion to be very educational on what a SmartNIC or DPU can do and why VMware and others would be driving to rapidly adopt the technology. Listen to the podcast to learn more.

NVIDIA’s recent acquisition of Mellanox brought them Mellanox’s NIC, switch and router technology. And while Mellanox, and now NVIDIA have some pretty impressive switches and routers, what interested the GreyBeards was their SmartNIC technology.

Essentially, SmartNICs provide acceleration and offload of data handling needs required to move data around an enterprise network. These offload services include, at a minimum, encryption/decryption, packet pacing (delivering gadzillion video streams at the right speed to ensure proper playback by all), compression, firewalls, NVMeoF/RoCE, TCP/IP, GPU direct storage (GDS) transfers, VLAN micro-segmentation, scaling, and anything else that requires real time processing to perform at line speeds.

For those who haven’t heard of it, GDS transfers data from storage directly into GPU memory and from GPU memory directly to storage without any CPU cycles or server memory involvement, other than to set up the transfer. This extends NVMeoF RDMA tech to/from storage and server memory, to GPUs. That is, GDS offers a RDMA like path between storage and GPU memory. GPU to/from server memory direct interface already exists over the PCIe bus.

But even with all the offloads and accelerators above, they can also offer an additional secure enclave outside the TPM in the CPU, to better isolate security sensitive functionality for a data center. (See DPU below).

Kevin mentioned multiple times that the new unit of computation is no longer a server but rather is now a data center. When you have public cloud, private cloud and other systems that all serve up virtual CPUs, NICs, GPUs and storage, what’s really being supplied to a user is a virtual data center. Cloud providers can carve up their hardware and serve it to you any way you want or need it. Virtual data centers can provide a multitude of VMs and any infrastructure that customers need to use to run their workloads.

Kevin mentioned that by using SmartNICs, IT or cloud providers can return 30% of the processor cycles (that were being spent doing networking work on CPUs) back to workloads that run on CPUs. Any data center can effectively obtain 30% more CPU cycles and increased networking speed and performance just by deploying SmartNICs throughout all the servers in their environment.
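A quick bit of arithmetic on that 30% figure, as a sketch under our own assumptions: that the 30% is a share of total CPU cycles and that the SmartNIC takes over all of that networking work. Under those assumptions, the gain measured against what workloads had before the offload is larger than 30%. The `workload_gain` helper is ours, not something from the podcast.

```python
def workload_gain(offloaded_fraction: float) -> float:
    """Relative increase in cycles available to workloads after offload."""
    before = 1.0 - offloaded_fraction  # share workloads had before offload
    after = 1.0                        # assume the offload frees all of it
    return (after - before) / before

gain = workload_gain(0.30)
print(f"{gain:.1%} more cycles for workloads")
```

In other words, getting 30 points of total capacity back is worth roughly 43% more cycles to workloads that previously only had 70 points to work with.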

SmartNICs are an outgrowth of Mellanox technology embedded in their HPC InfiniBand and high end Ethernet switches/routers. Mellanox had been well known for their support of NVMeoF/RoCE to supply high IOPs/low-latency IO activity for NVMe storage over Ethernet and before that their InfiniBand RDMA technologies.

As Mellanox came out with their 2nd Gen SmartNIC they began to call their solution a “DPU” (data processing unit), which they see forming part of a “holy trinity” underpinning the new data center which has CPUs, GPUs and now DPUs. But a DPU is more than just a SmartNIC.

All NVIDIA SmartNICs and DPUs are based on Mellanox’s BlueField cards and chip technology. Their DPU uses BlueField2 (gen 2 technology) chips, which has a multi-core ARM engine inside of it and memory which can be used to perform computational processing in addition to the onboard offload/acceleration capabilities.

Besides adding VMware support for SmartNICs, PatG also mentioned that they were porting vSphere (ESX) to run on top of NVIDIA Networking DPUs. This would move VMware’s core hypervisor functionality from running on CPUs to running on DPUs. This of course would free up most if not all VMware Hypervisor CPU cycles for use by customer workloads.

During our discussion with Kevin, we talked a lot about the coming of AI-ML-DL workloads, which will require ever more bandwidth, ever lower latencies and ever more compute power. NVIDIA was a significant early enabler of AI-ML-DL with their CUDA API that allowed a GPU to be used to perform DL network training and inferencing. As such, CUDA became an industry wide phenomenon allowing industry wide GPUs to be used as DL compute engines.

NVIDIA plans to do the same with their SmartNICs and DPUs. NVIDIA Networking is releasing the DOCA (Data center On a Chip Architecture) SDK and API. DOCA provides the API to use the BlueField2 chips and cards which are the central technology behind their DPU. They have also announced a roadmap to continue enhancing DOCA, as they have done with CUDA, over the foreseeable future, to add more bandwidth, speed and functionality to DPUs.

It turns out the real problem which forced Mellanox and now NVIDIA to create SmartNICs was the need to support the extremely low latencies required for NVMeoF and GDS IO.

It wasn’t clear that the public cloud providers were using SmartNICs but Kevin said it’s been sort of a widely known secret that they have been using the tech. The public clouds (AWS, Azure, Alibaba) have been deploying SmartNICs in their environments for some time now. Always on the lookout for any technology that frees up compute resources to be deployed for cloud users, it appears that public cloud providers were early adopters of SmartNICs.

Kevin Deierling, Head of Marketing NVIDIA Networking

Kevin is an entrepreneur, innovator, and technology executive with a proven track record of creating profitable businesses in highly competitive markets.

Kevin has been a founder or senior executive at five startups that have achieved positive outcomes (3 IPOs, 2 acquisitions). Combining both technical and business expertise, he has variously served as the chief officer of technology, architecture, and marketing of these companies where he led the development of strategy and products across a broad range of disciplines including: networking, security, cloud, Big Data, machine learning, virtualization, storage, smart energy, bio-sensors, and DNA sequencing.

Kevin has over 25 patents in the fields of networking, wireless, security, error correction, video compression, smart energy, bio-electronics, and DNA sequencing technologies.

When not driving new technology, he finds time for fly-fishing, cycling, bee keeping, & organic farming.

105: Greybeards talk new datacenter architecture with Pradeep Sindhu, CEO & Co-founder, Fungible

Neither Ray nor Keith had met Pradeep before, but Ray was very interested in Fungible’s technology. It turns out Pradeep Sindhu, CEO and Co-founder, Fungible, has had a long and varied career in the industry, starting at Xerox PARC, then co-founding and becoming chief scientist at Juniper, and now rearchitecting the data center with Fungible. As Pradeep mentioned at the end of the podcast, he has always been drawn to hard problems with the potential to open up immense possibilities. What he did at Juniper and what he is planning to accomplish with Fungible both fit that pattern.

Today, in a typical data center, we have servers, networking and storage equipment all connected through a fabric. But from Pradeep’s perspective none of it works well in support of data centric computing. What we have today operates like turning a screw with a pair of pliers. But if there existed some hardware that could execute data centric computing well (or to follow the metaphor, a screwdriver), the data center would operate much more efficiently, with more performance and better resource use.

Fungible was founded in 2015 with the idea that the industry is moving to a data centric computing paradigm and today’s data center is ill equipped to take IT there.

What is data centric computing

The IT industry has been moving to a new type of computing, that is focused on short bursts of CPU activity with relatively small packets of data coming off the network (from sensors/outside world, from storage, from other servers, etc.). Those workloads are often transient, short lived, are intended to be performed quickly and may not leave any persistent state.

We can see this in the emergence of micro-services architectures with Docker and k8s containers. But you don’t have to be using containers. It’s also present in machine learning, where the update cycle of the neural network (with accelerators) takes lots of small bursts of computation while it consumes lots of small data items (pictures, text documents, ticker/status logs, etc.).

Furthermore, the move to commodity hardware has taken the same x86/ARM core CPUs and used them to execute these small bursts of computation. And for some of these operations that may still make sense. But when the data center uses these same cores to perform data path packet processing, it bogs down the network. It consumes a lot of power, adds overhead (higher latencies), leads to packet loss, injects network jitter and causes a host of other problems.

So, in order to get the data packets to where they need to be without those problems, networking endpoints need to be changed out to something designed to support data path critical workloads. Pradeep calls these data path critical work items “run to complete” code.

The critical question is what proportion of IT workloads are “data centric” vs. not. While it might not be that high today, Pradeep and Fungible are betting that it’s going to get much higher over time. If we look at hyper-scalers today, they are at the forefront of this computing paradigm change and many of their workloads are moving to containerized execution.

The DPU enables data centric computing

Fungible plans to add a DPU that supports a power efficient, “run-to-complete” programming engine to the data center. By using DPUs, they can create a true fabric (using IPoE) that’s low latency, low jitter, lossless and provides full cross-sectional bandwidth.

The problem as Pradeep sees it is that X86 and ARM cores are just not made to execute run-to-complete workloads well, and this is required to provide a true fabric. Fungible, on the other hand, has designed the DPU from the start to execute run-to-complete work.

Pradeep sees the data center of tomorrow utilizing JBoF(lash) & JBoD(isk) boxes with DPU(s) in front of them providing storage server services (block, file and object), JBoGP(Us) or JBoFP(GAs) boxes with DPU(s) in front of them providing accelerator/graphics server services, and compute boxes with DPU(s) and x86/ARM cores with DRAM-Optane PMEM in them providing CPU server and client services. All the DPUs together in a cluster would in total provide true fabric services.

Essentially, the DPUs would take over all data path operations and the storage, GPUs and CPUs would handle everything else. In effect, this segregates data path and control path services in the data center.

Greenfield, brownfield or both

Keith and I both assumed this would be great for green field deployments. But Pradeep said it’s designed to be incrementally added to servers, JBoFs, JBoDs, JBoGs/JBoFPs and start providing data path services within current data center fabric environments, even as the rest of the data center remains unchanged.

At some point we talked about the programming model of the DPU. The DPU offers a bring your own Linux OS that can be programmed in any language you choose. But the critical, data-path functionality is coded in “C” to run as fast and as efficiently as possible.

Fungible has designed this hardware themselves. We didn’t get to talk about how they plan to market their product to the data center.

Pradeep also said to stay tuned, as they were just about to announce their first product offering based on the DPU.

The podcast ran ~38 minutes. Pradeep, given his education and experience, is a very knowledgeable individual about the data center environment today. He’s certainly one of the most interesting IT technologists we have talked with in a while on the GreyBeards podcast. To say what Fungible is trying to do is aggressive and bold is an understatement. But Pradeep feels this is the only way forward to liberate the data center from its data path chains today. Both Keith and I thought we needed at least another hour or so to truly understand what they are doing and where they are going with it. Listen to the podcast to learn more.

Pradeep Sindhu, CEO and Co-Founder, Fungible

Pradeep Sindhu is CEO and Co-Founder of Fungible, a Santa Clara-based startup providing at-scale, next-generation solutions for the data center, cloud and IT industries. He has been at the forefront of the network and processing industry for over three decades.

As the co-founder and CTO of Juniper Networks, he played a central role in the architecture, design and development of Juniper’s M40 router – the M series was the first of its kind, offering the industry true decoupling of the control plane and the forwarding plane.

Prior to Juniper, he was a Principal Scientist and Distinguished Engineer at the Computer Science Lab at Xerox’s Palo Alto Research Center (PARC) pushing the envelope on what silicon could do for networking and processing.

He is passionate about new ways to support our growing data-centric world with the right combination of hardware and software to build the infrastructure our future needs.

099: GreyBeards talk Folding@Home with Mike Harsch, a longtime enthusiast

Microscopic picture of Coronavirus

Mike Harsch (@harschness) is a personal friend, a computer enthusiast with a particular and enduring interest in distributed systems and GPU computing. Mike’s been a longtime user and proponent of Folding@Home, a distributed system focused on protein dynamics that anyone can download and run on their personal computer(s) or gaming devices.

We started the discussion on the history of distributed processing using home computers. Mike apparently first ran across these systems in college and was using one in his college dorm room, back in 1997. At the time there was a system called distributed.net, which was attempting to crack the (RC5-56[bit]) encryption keys used for computer security and offered a $10K prize for solving it. That was solved in 250 days (source: Wikipedia article on distributed.net). Distributed.net is still up and working but since then they have moved to ever larger keys.
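To put the RC5-56 effort in perspective, here is a small back-of-the-envelope calculation. It only uses the two numbers mentioned above (a 56-bit key and 250 days) and computes an upper bound: the average rate needed to sweep the entire keyspace in that time. The real effort stopped as soon as the key was found, so the actual rate was lower than this bound.

```python
KEY_BITS = 56
DAYS = 250

keyspace = 2 ** KEY_BITS       # number of possible 56-bit keys
seconds = DAYS * 24 * 60 * 60  # 250 days expressed in seconds
# Upper bound: the rate needed to sweep the whole keyspace in that time.
# The real effort ended when the key was found, so it searched less.
max_rate = keyspace / seconds
print(f"{keyspace:,} keys, up to {max_rate / 1e9:.2f} billion keys/sec")
```

Each extra key bit doubles the keyspace, which is why the move to ever larger keys keeps this kind of distributed search busy.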

Next came SETI@Home, which was a 2nd gen distributed system. SETI@Home sent out slices of recorded radio telescope spectrum and tasked people’s computers (during screen saving) to analyze that spectrum for alien signals. SETI@Home painted a nice image of the analysis. SETI@Home also used some gamification, where users gained points for analyzing spectrum. Over time they had something like a leader board tracking the top users. Recently, SETI@Home shut down their distributed system and changed their focus to analyzing all the results they received from their users. I was a SETI@Home user for a while.


Folding@Home is a 3rd generation distributed computing solution built along the same lines but rather than searching for aliens, with Folding@Home you are running a simulation of what a protein molecule does over time. Mike mentioned that a typical Folding@Home work unit is to simulate a few nanoseconds in the life of a protein and this could take an hour or more on an x86 class multi-core CPU (with less time on GPUs).

Mike mentioned that there was a recent Ask Me Anything (AMA) event on Reddit with the team on Folding@Home answering questions. And on March 15th, the team at Folding@Home clarified how they are helping to solve the COVID-19 pandemic.

Keith has used Folding@Home in the past. And my son was an early user as well.

What Folding@Home does

Folding@Home uses idle CPU or GPU time on home gaming platforms/computers/servers or data center servers. Initially, in October of 2000, it was used to understand protein folding. But nowadays it’s gone beyond just folding, to simulate the life of a protein.

Prior to their turn to concentrate on COVID-19, they usually had ~30K active users, supplying ~100PFlops (100 quadrillion x86 double precision floating point operations per second) of compute power.

You get points for doing Folding@Home work. When Folding@Home was launched it was designed to use a single CPU/single core. Sometime in 2006, they released an SMP version of the code, which could use multi-cores. Later they released a multi-threaded version which worked better on multi-core CPUs. And within the last few years, they have released GPU support that could take advantage of the massive numbers of GPU cores available today.

Mike said that a Folding@Home work unit on a GPU generally runs 10 to 100X faster than on multi-core/multi-threaded CPU systems.

Around Feb 27, Folding@Home announced they were going to focus all their efforts on understanding how to combat the COVID-19 coronavirus. After the announcement, their user count went through the roof, to now ~400K active users/day. This led to throttling requests for work and delays in handling responses. Over the ensuing weeks, (as of 3/18), they seem to have added enough resources to support their current levels of users.

The architecture of the old Folding@Home system was 2 tiered: they had a set of Folding@Home front-end servers that handled web traffic and distributed the work requests/responses to a set of backend servers that supplied work requests to users and combined work results. In their latest rush they seem to have had to add servers, networking and storage to both tiers.

Sometime around March 25th, Folding@Home became the first and only ExaFlop supercomputer, achieving 1.56 (x86) ExaFlops (1 ExaFlop is 10**18 FLOPS; source: Wikipedia article on Folding@Home), with over 1 million active computing devices (GPUs & CPUs) in their network (see: Greg Bowman’s status tweet).

Deploying Folding@Home on your systems

Folding@Home operates on any number of endpoint device OSs and gaming console systems. It comes in two software packages: one is the software that logs into the Folding@Home server to gather the next work unit to perform and the other is the one that does the simulation work. They have an option to paint a picture of what is happening but most disable this feature to devote 100% of any idle CPU/GPU resources to the simulation. They also have a support forum, if you have any questions or need assistance in deploying their software.
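The split between a package that fetches work and a package that does the simulation can be pictured with a toy loop like the one below. This is only a sketch of the general pattern: the names (`WorkUnit`, `run_core`, `run_client`), the in-memory queue standing in for the server, and the placeholder arithmetic are all our assumptions and bear no relation to the real Folding@Home code or protocol.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class WorkUnit:
    unit_id: int
    steps: int  # number of simulation steps to run

def run_core(unit: WorkUnit) -> float:
    """Stand-in for the simulation package: a trivial numeric loop."""
    position = 0.0
    for step in range(unit.steps):
        position += 0.5 * (-1) ** step  # placeholder, not real molecular dynamics
    return position

def run_client(server_queue: deque, results: dict) -> int:
    """Stand-in for the client package: fetch, simulate, report."""
    completed = 0
    while server_queue:
        unit = server_queue.popleft()           # gather the next work unit
        results[unit.unit_id] = run_core(unit)  # hand it to the simulation code
        completed += 1                          # one unit of credit per result
    return completed

# Hypothetical server handing out three work units.
queue = deque(WorkUnit(i, steps=1000) for i in range(3))
results = {}
done = run_client(queue, results)
print(f"completed {done} work units")
```

Keeping the two pieces separate means the simulation code can be swapped (CPU, SMP or GPU versions) without changing how work is fetched and reported.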

Keith mentioned that some gal at VMware asked VMware users to devote their home server CPUs/GPUs to the project. I checked their website and they have a vSphere appliance (FLING) that will run Folding@Home and will register itself as joining the VMware team. Mike mentioned that GitHub (announced on Twitter) was going to supply up to 60K CPU core hours a day to the project. They recently reported that they are shifting work units from understanding COVID-19 to screening compounds for therapeutic potential against the coronavirus.

The world needs you to help solve the COVID-19 pandemic. So join up with Folding@Home to do your part. Downloading the software and installing it on a Mac was easy. Just don’t forget to reboot afterwards and then run FAHcontrol and FAHviewer in “Applications/Folding@home” folder to see what’s going on.

The podcast runs a little under 40 minutes. Mike was very knowledgeable about the IT side of Folding@Home, but was less knowledgeable about the biological side of what they are doing.  Listen to the podcast to learn more.

Mike Harsch, a computer enthusiast

Mike is a long time computer enthusiast with particular interests in distributed systems and GPU computing.  He lives in CO and has a basement full of (GPUs &) computers.

Mike and I have co-coached a local high school, FTC robotics team for the last 4 years. And Mike has been involved with FTC robotics for much longer than that.