Was talking with Qumulo‘s CEO Peter Godman earlier this week for another GreyBeards On Storage Podcast (not available yet). One thing he said which was hard for me to comprehend was that they were putting out a new storage software release every 2 weeks.
Their customers are updating their storage system software every 2 weeks.
In my past life as a storage systems development director, we would normally have to wait months if not quarters before customers updated their systems to the latest release. As a result, we strived to put out an update at most, once a quarter with a major release every year to 18 months or so.
To me releasing code to the field every two weeks sounds impossible or at best very risky. Then I look at my iPhone. I get updates from Twitter, Facebook, LinkedIN and others, every other week. And Software-as-a-service (SaaS) solutions often update their systems frequently, if not every other week. Should storage software be any different?
It turns out Peter and his development team at Qumulo have adopted SaaS engineering methodology, which I believe uses Agile development.
As I understand it Agile development has a couple of tenets (see Wikipedia article for more information):
- Individuals and interaction – leading to co-located teams, with heavy use of pair programming, and developer generated automated testing, rather than dispersed teams with developers and QA separate but (occasionally) equal.
- Working software – using working software as a means of validating requirements, showing off changes and modifying code rather than developing reams of documentation.
- (Continuous) Customer collaboration – using direct engagement with customers over time to understand changes (using working software) rather than one time contracts for specifications of functionality
- Responding to change – changing direction in real time using immediate customer feedback rather than waiting months or a year or more to change development direction to address customer concerns.
In addition to all the above, Agile development typically uses Scrum for product planning. An Agile Scrum (see picture above & Wikipedia article) is a weekly (maybe daily) planning meeting, held standing up and discussing what changes go into the code next.
This is all fine for application development which involves a few dozen person years of effort but storage software development typically takes multiple person centuries of development & QA effort. In my past life, our storage system regression testing typically took 24 hours or more and proper QA validation took six months or more of elapsed time with ~ 5 person years or so of effort, not to mention beta testing the system at a few, carefully selected customer sites for 6 weeks or more. How can you compress this all into a few weeks?
Software development on Agile
With Agile, you probably aren’t beta testing a new release for 6 weeks anywhere, anymore. While you may beta test a new storage system for a period of time you can’t afford the time to do this on subsequent release updates anymore.
Next, there is no QA. It’s just a developer/engineer and their partner. Together they own code change and its corresponding test suite. When one adds functionality to the system, it’s up to the team to add new tests to validate it. Test automation helps streamline the process.
Finally, there’s continuous integration to the release code in flight. Used to be a developer would package up a change, then validate it themselves (any way they wanted), then regression test it integrated with the current build, and then if it was deemed important enough, it would be incorporated into the next (daily) build of the software. If it wasn’t important, it could wait on the shelf (degenerating over time due to lack of integration) until it came up for inclusion. In contrast, I believe Agile software builds happen hourly or even more often (in real time perhaps), changes are integrated as soon as they pass automated testing, and are never put on the shelf. Larger changes may still be delayed until a critical mass is available, but if it’s properly designed even major changes can be implemented incrementally. Once in the build, automated testing insures that any new change doesn’t impact working functionality.
Due to the length of our update cycle, we often had 2 or more releases being validated at any one time. Unclear to me whether Agile allows for multiple releases in flight as it just adds to the complexity and any change may have to be tailored for each release it goes into.
Storage on Agile
Vendors are probably not doing this with hardware that’s also undergoing significant change. Trying to do both would seem suicidal.
Hardware modifications often introduce timing alterations that can expose code bugs that had never been seen before. Hardware changes also take a longer time to instantiate (build into electronics). This can be worked around by using hardware simulators but timing is often not the same as the real hardware and it can take 10X to 100X more real-time to execute simple operations. Nonetheless, new hardware typically takes weeks to months to debug and this can be especially hard if the software is changing as well.
Similar to hardware concerns, OS or host storage protocol changes (say from NFSv3 to NFSv4) would take a lot more testing/debugging to get right.
So it helps if the hardware doesn’t change, the OS doesn’t change and the host IO protocol doesn’t change when your using Agile to develop storage software.
The other thing that we ran into is that over time, regression testing just kept growing and took longer and longer to complete. We made it a point of adding regression tests to validate any data loss fix we ever had in the field. Some of these required manual intervention (such as hardware bugs that need to be manually injected). This is less of a problem with a new storage system and limited field experience, but over time fixes accumulate and from a customer perspective, tests validating them are hard to get rid of.
Hardware on Agile
Although a lot of hardware these days is implemented as ASICs, it can also be implemented via Field Programmable Gate Arrays (FPGAs). Some FPGAs can be configured at runtime (see Wikipedia article on FPGAs), that is in the field, almost on demand.
FPGA programming is done using a hardware description language, an electronic logic coding scheme. It looks very much like software development of hardware logic. Why can’t this be incrementally implemented, continuously integrated, automatically validated and released to the field every two weeks.
As discussed above, the major concern is that new hardware introducing timing changes which expose hard to find (software and hardware) bugs.
And incremental development of original hardware, seems akin to having a building’s foundation changing while your adding more stories. One needs a critical mass of hardware to get to a base level of functionality to run storage functionality. This is less of a problem when one’s adding or modifying functionality for current running hardware.
I suppose Qumulo’s use of Agile shouldn’t be much of a surprise. They’re a startup, with limited resources, and need to play catchup with a lot of functionality to implement. It’s risky from my perspective but you have to take calculated risks if your going to win the storage game.
You have to give Qumulo credit for developing their storage using Agile and being gutsy enough to take it directly to the field. Let’s hope it continues to work for them.
Photo Credits: “Scrum Framework” by Source (WP:NFCC#4). Licensed under Fair use via Wikipedia