For some time now I have been experimenting with different approaches to normalize IO activity (in the chart above it’s NFS throughput operations per second) for systems that use SSDs or Flash Cache. My previous attempt (see prior SPECsfs2008 chart of the month post) normalized based on GB of NAND capacity used in a submission.
I found the previous chart to be somewhat lacking so this quarter I decided to use SSD device and/or Flash Cache card count instead. This approach is shown in the above chart. Funny thing, although the rankings were exactly the same between the two charts one can see significant changes in the magnitudes achieved, especially in the relative values, between the top 2 rankings.
For example, in the prior chart the Avere FXT 3500 result still came in at number one, but whereas here it achieved ~390K NFS ops/sec/SSD, on the prior chart it obtained ~2000 NFS ops/sec/NAND-GB. But more interesting was the number two result. Here the NetApp FAS6240 with 1TB Flash Cache card achieved ~190K NFS ops/sec/FC-card, but on the prior chart it only hit ~185 NFS ops/sec/NAND-GB.
That means on this version of the normalization the Avere is about 2X more effective than the NetApp FAS6240 with 1TB FlashCache card, whereas in the prior chart it was over 10X more effective in ops/sec/NAND-GB. I feel this is getting closer to the truth but not quite there yet.
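As a quick check on those two ratios, here is a minimal sketch that uses only the approximate values quoted above (read off the charts, so treat the results as rough). The variable names are just illustrative.

```python
# Approximate values quoted in the text for the top two results.
avere_per_ssd = 390_000     # NFS ops/sec per SSD (this chart)
netapp_per_card = 190_000   # NFS ops/sec per Flash Cache card (this chart)
avere_per_gb = 2_000        # NFS ops/sec per NAND GB (prior chart)
netapp_per_gb = 185         # NFS ops/sec per NAND GB (prior chart)

# Relative advantage of Avere over NetApp under each normalization.
ratio_by_device = avere_per_ssd / netapp_per_card
ratio_by_gb = avere_per_gb / netapp_per_gb

print(f"per device:  {ratio_by_device:.1f}X")
print(f"per NAND GB: {ratio_by_gb:.1f}X")
```

The rankings don’t change, but the gap between #1 and #2 shrinks by roughly a factor of five when counting devices rather than GB.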
We still have the problem that all the SPECsfs2008 submissions that use SSDs or FlashCache also have disk drives as well as (sometimes significant) DRAM cache in them. So doing a pure SSD normalization may never suffice for these systems.
On the other hand, I have taken a shot at normalizing SPECsfs2008 performance for SSDs-NAND, disk devices and DRAM caching as one dimension in a ChampionsChart™ I use for a NAS Buying Guide, for sale on my website. If you’re interested in seeing it, drop me a line, or better yet purchase the guide.
The complete SPECsfs2008 performance report went out in SCI’s June newsletter. But a copy of the report will be posted on our dispatches page sometime next month (if all goes well). However, you can get the SPECsfs2008 performance analysis now and subscribe to future free newsletters by just using the signup form above right.
For a more extensive discussion of current NAS or file system storage performance covering SPECsfs2008 (Top 20) results and our new ChampionsChart™ for NFS and CIFS storage systems, please see SCI’s NAS Buying Guide available from our website.
As always, we welcome any suggestions or comments on how to improve our analysis of SPECsfs2008 results or any of our other storage performance analyses.
For some time now I have been using OPS/drive to measure storage system disk drive efficiency but have so far failed to come up with anything similar for flash or SSD use. The problem with flash in storage is that it can be used as a cache or as a storage device. Even when used as a storage device under automated storage tiering, SSD advantages can be difficult to pin down.
In my March newsletter as a first attempt to measure storage system flash efficiency I supplied a new chart shown above, which plots the top 10 NFS throughput ops/second/GB of NAND used in the SPECsfs2008 results.
What’s with Avere?
Something different has occurred with the (#1) Avere FXT 3500 44-node system in the chart. The 44-node Avere system only used ~800GB of flash as a ZIL (ZFS intent log, per the SPECsfs report). However, it also had ~7TB of DRAM across its 44 nodes, most of which was used for file IO caching. If we incorporated storage system memory size with flash GB in the above chart, it would have dropped the Avere numbers by a factor of ~9 while only dropping the others by a factor of ~2, which would still give the Avere a significant advantage but not quite so stunning. Also, the Avere system front-ends other NAS systems (this one running ZFS), so it’s not quite the same as being a direct NAS storage system like the others on this chart.
The remainder of the chart (#2-10) belongs to NetApp and their FlashCache (or PAM) cards. Even Oracle’s Sun ZFS Storage 7320 appliance did not come close to either the Avere FXT 3500 system or the NetApp storage on this chart. There were at least 10 other SPECsfs2008 NFS results that used some form of flash but were not fast enough to rank on this chart.
Other measures of flash effectiveness
This metric still doesn’t quite capture flash efficiency. I was discussing flash performance with another startup the other day and they suggested that SSD drive count might be a better alternative. Such a measure would take into consideration that each SSD has only a certain performance level it can sustain, not unlike disk drives.
In that case Avere’s 44-node system had 4 drives, and each NetApp system had two FlashCache cards, representing 2 SSDs per NetApp node. I will try that next time to see if it’s a better fit.
The complete SPECsfs2008 performance report went out in SCI’s March newsletter. But a copy of the report will be posted on our dispatches page sometime next month (if all goes well). However, you can get the SPECsfs performance analysis now and subscribe to future free newsletters by just sending us an email or using the signup form above right.
[We are still catching up on our charts for the past quarter but this one brings us up to date through last month]
There’s just something about a million SPECsfs2008® NFS throughput operations per second that kind of excites me (weird, I know). Yes, it takes over 44-nodes of Avere FXT 3500 with over 6TB of DRAM cache, 140-nodes of EMC Isilon S200 with almost 7TB of DRAM cache and 25TB of SSDs or at least 16-nodes of NetApp FAS6240 in Data ONTAP 8.1 cluster mode with 8TB of FlashCache to get to that level.
Nevertheless, a million NFS throughput operations is something worth celebrating. It’s not often one achieves a 2X improvement in performance over a previous record. Something significant has changed here.
The age of scale-out
We have reached a point where scaling systems out can provide linear performance improvements, at least up to a point. For example, the EMC Isilon and NetApp FAS6240 had a close to linear speed up in performance as they added nodes, indicating (to me at least) there may be more there if they just throw more storage nodes at the problem. Although maybe they saw some drop off and didn’t wish to show the world, or potentially the costs became prohibitive and they had to stop someplace. On the other hand, Avere only benchmarked their 44-node system with their current hardware (FXT 3500); they must have figured winning the crown was enough.
However, I would like to point out that throwing just any hardware at these systems doesn’t necessarily increase performance. Previously (see my CIFS vs NFS corrected post), we had shown the linear regression for NFS throughput against spindle count and although the fit was good (R**2 of ~0.82), it wasn’t perfect. And of course we eliminated any SSDs from that prior analysis. (Probably should consider eliminating any system with more than a TB of DRAM as well, but this was before the 44-node Avere result was out.)
Speaking of disk drives, the FAS6240 system nodes had 72-450GB 15Krpm disks, the Isilon nodes had 24-300GB 10Krpm disks and each Avere node had 15-600GB 7.2Krpm SAS disks. However, the Avere system also had 4 Solaris ZFS file storage systems behind it, each of which had another 22-3TB (7.2Krpm, I think) disks. Given all that, the 16-node NetApp system, 140-node Isilon and the 44-node Avere systems had a total of 1152, 3360 and 748 disk drives respectively. Of course, this doesn’t count the system disks for the Isilon and Avere systems nor any of the SSDs or FlashCache in the various configurations.
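For anyone who wants to check the drive totals, here is a small sketch of the arithmetic using the per-node counts given above. The helper name total_disks is just for illustration.

```python
def total_disks(nodes, disks_per_node, backend_systems=0, disks_per_backend=0):
    """Data disks in the cluster nodes plus any back-end filers behind them."""
    return nodes * disks_per_node + backend_systems * disks_per_backend

netapp_disks = total_disks(16, 72)    # 16 FAS6240 nodes, 72 disks each
isilon_disks = total_disks(140, 24)   # 140 Isilon S200 nodes, 24 disks each
# 44 Avere nodes with 15 disks each, plus 4 ZFS back-ends with 22 disks each
avere_disks = total_disks(44, 15, backend_systems=4, disks_per_backend=22)

print(netapp_disks, isilon_disks, avere_disks)  # 1152 3360 748
```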
I would say with this round of SPECsfs2008 benchmarks scale-out NAS systems have come out. It’s too bad that both NetApp and Avere didn’t release comparable CIFS benchmark results which would have helped in my perennial discussion on CIFS vs. NFS.
But there’s always next time.
The full SPECsfs2008 performance report went out to our newsletter subscribers last December. A copy of the full report will be up on the dispatches page of our site sometime later this month (if all goes well). However, you can see our full SPECsfs2008 performance analysis now and subscribe to our free monthly newsletter to receive future reports directly by just sending us an email or using the signup form above right.
For a more extensive discussion of file and NAS storage performance covering top 30 SPECsfs2008 results and NAS storage system features and functionality, please consider purchasing our NAS Buying Guide available from SCI’s website.
As always, we welcome any suggestions on how to improve our analysis of SPECsfs2008 results or any of our other storage system performance discussions.
We made a mistake in our last post discussing CIFS vs. NFS results using SPECsfs2008 benchmarks by including some storage systems that had SSDs in this analysis. All of our other per spindle/disk drive analyses exclude SSDs and NAND cache because they skew per drive results so much. We have corrected this in the above chart which includes all the SPECsfs2008 results, up to the end of last month.
However, even with the corrections the results stand pretty much the way they were. CIFS is doing more throughput per disk drive spindle than NFS for all benchmark results not using SSDs or Flash Cache.
Dropping SSD results changed the linear regression equations. Specifically, the R**2 for CIFS and NFS dropped from 0.99 to 0.98 and from 0.92 to 0.82, and the B coefficient dropped from 463 to 405 and from 296 to 258, respectively.
I would be remiss if I didn’t discuss a few caveats with this analysis.
Now there are even fewer results in both CIFS and NFS groups, down to 15 for CIFS and 38 for NFS. For any sort of correlation comparison, more results would have better statistical significance.
In the NFS data, we include some NAS systems which have lots of DRAM cache (almost 0.5TB). We should probably exclude these as well, which might drop the NFS line down some more (at least lower the B value).
There are not a lot of enterprise level CIFS systems in current SPECsfs results, with or without SSD or NAND caching. Most CIFS benchmarks are from midrange or lower filers. Unclear why these would do much better on a per spindle basis than a wider sample of NFS systems, but they obviously do.
All that aside, it seems crystal clear here, that CIFS provides more throughput per spindle.
In contrast, we have shown in past posts that the limited number of systems that submitted benchmarks for both CIFS and NFS typically show roughly equivalent throughput results for the two protocols. (See my other previous post on this aspect of the CIFS vs. NFS discussion.)
Also, in our last post we discussed some of the criticism leveled against this analysis and provided our view to refute these issues. Mostly the concerns stem from the major differences between the stateful CIFS protocol and the stateless NFS protocol.
But from my perspective it’s all about the data. How quickly can I read a file, how fast can I create a file. Given similar storage systems, with similar SW, cache and hard disk drives, it’s now clear to me that CIFS provides faster access to data than NFS does, at least on a per spindle basis.
Nevertheless, more data may invalidate these results, so stay tuned.
Why this is should probably be the subject of another post, but it may have a lot to do with the fact that NFS is stateless….
It appears that the system uses 200K disk drives to support the 120PB of storage. The disk drives are packed in a new wider rack and are water cooled. According to the news report the new wider drive trays hold more drives than current drive trays available on the market.
For instance, HP has a hot pluggable, 100 SFF (small form factor 2.5″) disk enclosure that sits in 3U of standard rack space. 200K SFF disks would take up about 154 full racks, not counting the interconnect switching that would be required. Unclear whether water cooling would increase the density much but I suppose a wider tray with special cooling might get you more drives per floor tile.
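The ~154 rack figure can be reproduced with a short sketch. Note that the rack height is not stated above; the 39U of usable space per rack (13 enclosures of 3U each) is an assumption chosen because it matches the number quoted. A full 42U would fit 14 enclosures and come out closer to 143 racks.

```python
import math

TOTAL_DRIVES = 200_000
DRIVES_PER_ENCLOSURE = 100   # HP SFF enclosure, per the text
ENCLOSURE_HEIGHT_U = 3
USABLE_RACK_U = 39           # ASSUMPTION: not stated in the post

enclosures = math.ceil(TOTAL_DRIVES / DRIVES_PER_ENCLOSURE)
enclosures_per_rack = USABLE_RACK_U // ENCLOSURE_HEIGHT_U
racks = math.ceil(enclosures / enclosures_per_rack)

print(enclosures, enclosures_per_rack, racks)  # 2000 13 154
```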
There was no mention of interconnect, but today’s drives use either SAS or SATA. SAS interconnects for 200K drives would require many separate SAS busses. With a SAS expander addressing 255 drives or other expanders, one would need at least 4 SAS busses, but this would put up to ~64K drives on each bus and would not perform well. Something more like 64-128 drives per bus would perform much better, and each drive would need dual pathing. If we use 100 drives per SAS string, that’s 2000 SAS drive strings or at least 4000 SAS busses (dual port access to the drives).
Shared storage cluster – where GPFS front end nodes access shared storage across the backend. This is generally SAN storage system(s). But given the requirements for high density, it doesn’t seem likely that the 120PB storage system uses SAN storage in the backend.
Networked based cluster – here the GPFS front end nodes talk over a LAN to a cluster of NSD (Network Shared Disk) servers which can have access to all or some of the storage. My guess is this is what will be used in the 120PB storage system.
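Here is a rough sketch of the SAS bus arithmetic. The two-level fan-out (an expander addressing 255 further expanders, each addressing 255 drives) is my reading of the "255 drives or other expanders" remark and should be treated as an assumption, not a statement about the SAS spec.

```python
import math

TOTAL_DRIVES = 200_000

# ASSUMPTION: two expander levels of 255 each, i.e. ~64K drives per bus.
MAX_DRIVES_PER_BUS = 255 * 255
min_busses = math.ceil(TOTAL_DRIVES / MAX_DRIVES_PER_BUS)

# The more practical layout from the text: 100 drives per string, dual pathed.
DRIVES_PER_STRING = 100
PATHS_PER_DRIVE = 2
strings = TOTAL_DRIVES // DRIVES_PER_STRING
busses = strings * PATHS_PER_DRIVE

print(MAX_DRIVES_PER_BUS, min_busses)  # 65025 4
print(strings, busses)                 # 2000 4000
```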
Shared Network based clusters – this looks just like a bunch of NSD servers but provides access across multiple NSD clusters.
Given the above, ~100 drives per NSD server means another 1U per 100 drives or (given HP drive density) 4U per 100 drives, for 1000 drives and 10 IO servers per 40U rack (not counting switching). At this density it takes ~200 racks for 120PB of raw storage and NSD nodes, or 2000 NSD nodes in total.
Unclear how many GPFS front end nodes would be needed on top of this but even if it were 1 GPFS frontend node for every 5 NSD nodes, we are talking another 400 GPFS frontend nodes and at 1U per server, another 10 racks or so (not counting switching).
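The node and rack counts above can be worked through in a few lines. This is only a sketch of my back-of-the-envelope numbers (100 drives per NSD server, a 3U enclosure plus a 1U server, 40U racks, 1 GPFS front end node per 5 NSD nodes), none of which come from IBM.

```python
import math

TOTAL_DRIVES = 200_000
DRIVES_PER_NSD = 100
U_PER_NSD_UNIT = 3 + 1     # 3U drive enclosure + 1U NSD server
RACK_U = 40
NSD_PER_GPFS_NODE = 5      # guess from the text, not an IBM figure

units_per_rack = RACK_U // U_PER_NSD_UNIT            # NSD server + drives
drives_per_rack = units_per_rack * DRIVES_PER_NSD
storage_racks = math.ceil(TOTAL_DRIVES / drives_per_rack)
nsd_nodes = TOTAL_DRIVES // DRIVES_PER_NSD

gpfs_nodes = nsd_nodes // NSD_PER_GPFS_NODE
gpfs_racks = math.ceil(gpfs_nodes / RACK_U)          # 1U per front end node
total_racks = storage_racks + gpfs_racks             # before any switching

print(storage_racks, nsd_nodes, gpfs_nodes, gpfs_racks, total_racks)
# 200 2000 400 10 210
```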
If my calculations are correct we are talking over 210 racks with switching thrown in to support the storage. According to IBM’s discussion on the Storage challenges for petascale systems, it probably provides ~6TB/sec of data transfer which should be easy with 200K disks but may require even more SAS busses (maybe ~10K vs. the 2K SAS strings, or 4K busses, discussed above).
IBM GPFS is used behind the scenes in IBM’s commercial SONAS storage system but has been around as a cluster file system designed for HPC environments for over 15 years now.
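To see why ~6TB/sec is easy for the drives but harder for the busses, here is a quick sketch. It assumes decimal units (1TB = 10**12 bytes) and that load spreads evenly across drives and busses, which is an idealization.

```python
TARGET_BYTES_PER_SEC = 6e12   # ~6TB/sec, decimal units assumed
TOTAL_DRIVES = 200_000

# Sustained rate each drive must deliver if load is spread evenly.
per_drive_mb = TARGET_BYTES_PER_SEC / TOTAL_DRIVES / 1e6

def per_bus_mb(busses):
    """MB/sec each SAS bus must carry for the given bus count."""
    return TARGET_BYTES_PER_SEC / busses / 1e6

print(f"per drive: {per_drive_mb:.0f} MB/sec")
print(f"per bus with 4000 busses:  {per_bus_mb(4000):.0f} MB/sec")
print(f"per bus with 10000 busses: {per_bus_mb(10000):.0f} MB/sec")
```

About 30 MB/sec per drive is modest, while 1500 MB/sec per bus across 4000 busses is a lot to sustain, which is why something closer to 10K busses looks more comfortable.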
Given this many disk drives something needs to be done about protecting against drive failure. IBM has been talking about declustered RAID algorithms for their next generation HPC storage system which spreads the parity across more disks and as such, speeds up rebuild time at the cost of reducing effective capacity. There was no mention of effective capacity in the report but this would be a reasonable tradeoff. A 200K drive storage system should have a drive failure every 10 hours, on average (assuming a 2 million hour MTBF). Let’s hope they get drive rebuild time down much below that.
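The failure interval is simple to work out. This sketch assumes drive failures are independent and occur at a constant rate, which is what a flat MTBF figure implies; real drive populations are messier.

```python
MTBF_HOURS = 2_000_000     # per-drive MTBF assumed in the text
TOTAL_DRIVES = 200_000
HOURS_PER_YEAR = 24 * 365

# With independent, constant-rate failures the fleet-wide interval
# is the per-drive MTBF divided by the number of drives.
hours_between_failures = MTBF_HOURS / TOTAL_DRIVES
failures_per_year = HOURS_PER_YEAR / hours_between_failures

print(hours_between_failures, failures_per_year)  # 10.0 876.0
```

So on these assumptions the system would be rebuilding somewhere close to 900 drives a year.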
The system is expected to hold around a trillion files. Not sure but even at 1024 bytes of metadata per file, this number of files would chew up ~1PB of metadata storage space.
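A quick sketch of the metadata estimate follows. The 1024 bytes per file is only my guess from above, not a GPFS figure.

```python
FILES = 10**12                 # around a trillion files
METADATA_BYTES_PER_FILE = 1024 # guess from the text, not a GPFS number

metadata_bytes = FILES * METADATA_BYTES_PER_FILE
metadata_pb = metadata_bytes / 10**15    # decimal petabytes
metadata_pib = metadata_bytes / 2**50    # binary pebibytes

print(f"{metadata_pb:.3f} PB ({metadata_pib:.3f} PiB)")
```

Either way it is around 1PB, or a bit under 1% of the 120PB total.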
GPFS provides ILM (information life cycle management, or data placement based on information attributes) using automated policies and supports external storage pools outside the GPFS cluster storage. ILM within the GPFS cluster supports file placement across different tiers of storage.
All the discussion up to now revolved around homogeneous backend storage but it’s quite possible that multiple storage tiers could also be used. For example, a high density but slower storage tier could be combined with a low density but faster storage tier to provide a more cost effective storage system. However, it’s unclear whether the application (real world modeling) could readily utilize this sort of storage architecture or whether they would care about system cost.
Nonetheless, presumably an external storage pool would be a useful adjunct to any 120PB storage system for HPC applications.
Can it be done?
Let’s see, 400 GPFS nodes, 2000 NSD nodes, and 200K drives. Seems like the hardware would be readily doable (not sure why they needed watercooling but hopefully they obtained better drive density that way).
It would seem that a 20X multiplier times a current Isilon cluster or even a 10X multiple of a currently supported SONAS system would take some software effort to work together, but seems entirely within reason.
Of course, IBM Almaden is working on a project to support Hadoop over GPFS which might not be optimum for real world modeling but would nonetheless support the node count being talked about here.
I wish there was some real technical information on the project out on the web but I could not find any. Much of this is informed conjecture based on current GPFS system and storage hardware capabilities. But hopefully, I haven’t traveled too far astray.
Well there has been more activity for both CIFS and NFS protocols since our last discussion and it showed, once again that CIFS was faster than NFS but rather than going down that same path again, I decided to try something different.
As a result, we published the above chart which places all NFS and CIFS disk only submissions in stark contrast.
This chart was originally an attempt to refute many analysts’ contention that storage benchmarks are more of a contest as to who has thrown more disks at the problem rather than some objective truth about the performance of one product or another.
But a curious thought occurred to me as I was looking at these charts for CIFS and NFS last month. What if I plotted both results on the same chart? Wouldn’t such a chart provide some additional rationale to our discussion on CIFS vs. NFS?
Sure enough, it did.
From my perspective this chart proves that CIFS is faster than NFS. But, maybe a couple of points might clarify my analysis:
I have tried to eliminate any use of SSDs or NAND caching from this chart as they just confound the issue. Also, all disk-based, NFS and CIFS benchmarks are represented on the above charts, not just those that have submitted both CIFS and NFS results on the same hardware.
There is an industry wide view that CIFS and NFS are impossible to compare because one is stateful (CIFS) and the other stateless (NFS). I happen to think this is wrong. Most users just want to know which is faster and/or better. It would be easier to analyze this if SPECsfs2008 reported data transfer rates rather than operations/second rates but they don’t.
As such, one potential problem with comparing the two on the above chart is that the percentage of “real” data transfers represented by “operations per second” may be different. Ok, this would need to be normalized if there were a large difference between CIFS and NFS. But when examining the SPECsfs2008 user’s guide spec., one sees that NFS read and write data ops are 28.0% of all operations and CIFS read and write data ops are 29.1% of all operations. As they aren’t that different, the above chart should correlate well to the number of data operations done by each separate protocol. If anything, normalization would show an even larger advantage for CIFS, not a smaller one.
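To put a number on how much that normalization would matter, here is a minimal sketch using only the two data-op percentages quoted above.

```python
NFS_DATA_OP_FRACTION = 0.280    # read + write share of NFS ops
CIFS_DATA_OP_FRACTION = 0.291   # read + write share of CIFS ops

# Factor by which any CIFS-over-NFS ops/sec ratio would grow if we
# counted only read and write data operations.
data_op_adjustment = CIFS_DATA_OP_FRACTION / NFS_DATA_OP_FRACTION

print(f"adjustment: {data_op_adjustment:.3f}X")
```

That works out to roughly a 4% bump in CIFS’s favor, small enough to ignore but in the direction claimed.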
Another potential concern one needs to consider is the difference in the average data transfer size between the protocols. The user guide doesn’t discriminate between access transfer rates for NFS or CIFS, so we assume it’s the same for the two protocols. Given that assumption, then the above chart provides a reasonable correlation to the protocols relative data transfer rates.
The one real concern on this chart is the limited number of CIFS disk benchmarks. At this time there are about 20 CIFS disk benchmarks vs. 40 NFS disk benchmarks. So the data is pretty slim for CIFS, nonetheless, 20 is almost enough to make this statistically significant. So with more data the advantage may change slightly but I don’t think it will ever shift back to NFS.
Ok, now that I have all the provisos dealt with, what’s the chart really telling me?
One has to look at the linear regression equations to understand this but, CIFS does ~463.0 operations/second per disk drive and NFS does ~296.5 operations/second per disk drive. What this says is, all things being equal, i.e., the same hardware and disk drive count, CIFS does about 1.6X (463.0/296.5) more operations per second than NFS and correspondingly, CIFS provides ~1.6X more data per second than NFS does.
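Here is a small sketch of what those two slopes imply. The 500-drive system is a hypothetical example picked only to show the arithmetic; it is not a benchmark result.

```python
CIFS_OPS_PER_DRIVE = 463.0   # regression slope quoted in the text
NFS_OPS_PER_DRIVE = 296.5    # regression slope quoted in the text

def predicted_ops(drives, ops_per_drive):
    """Throughput implied by a regression line forced through the origin."""
    return drives * ops_per_drive

cifs_advantage = CIFS_OPS_PER_DRIVE / NFS_OPS_PER_DRIVE

# Hypothetical 500-drive system, for illustration only.
hypothetical_drives = 500
cifs_ops = predicted_ops(hypothetical_drives, CIFS_OPS_PER_DRIVE)
nfs_ops = predicted_ops(hypothetical_drives, NFS_OPS_PER_DRIVE)

print(f"CIFS advantage: {cifs_advantage:.2f}X")
print(f"{hypothetical_drives} drives: CIFS {cifs_ops:,.0f} vs NFS {nfs_ops:,.0f} ops/sec")
```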
The full SPECsfs 2008 report went out to our newsletter subscribers last month. The above chart has been modified somewhat from a plot in our published report, but the data is the same (forced the linear equations to have an intercept of 0 to eliminate the constant, displayed the R**2 for CIFS, and fixed the vertical axis title).
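For readers curious what forcing the intercept to 0 means in practice, the least-squares slope for a line through the origin is just sum(x*y)/sum(x*x). The sketch below shows that formula; I produced the chart with a spreadsheet trend line, so this is an equivalent calculation rather than the tool I used, and the sample points are made up purely to exercise the function.

```python
def slope_through_origin(drive_counts, ops_per_sec):
    """Least-squares slope b for the model y = b * x (no intercept)."""
    sum_xy = sum(x * y for x, y in zip(drive_counts, ops_per_sec))
    sum_xx = sum(x * x for x in drive_counts)
    return sum_xy / sum_xx

# Made-up points for illustration only; these are NOT SPECsfs2008 results.
example_drives = [100, 200, 400]
example_ops = [30_000, 60_000, 120_000]   # exactly 300 ops/sec per drive

example_slope = slope_through_origin(example_drives, example_ops)
print(example_slope)  # 300.0
```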
The above chart comes from last month’s newsletter on the latest SPECsfs2008 file system performance benchmark results and depicts a scatter plot of system NFS throughput operations per second versus the number of disk drives in the system being tested. We eliminate from this chart any system that makes use of Flash Cache/SSDs or any other performance use of NAND (see below on why SONAS was still included).
One constant complaint of benchmarks is that system vendors can just throw hardware at the problem to attain better results. The scatter plot above is one attempt to get to the truth in that complaint.
The regression equation shows that NFS throughput operations per second = 193.68*(number of disk drives) + 23834. The R**2 for the fit is 0.87, which is pretty good but not exactly perfect. So given these results, one would have to conclude there is some truth in the complaint but it doesn’t tell the whole story. (Regardless of how much it pains me to admit it.)
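The equation is easy to play with. Here is a minimal sketch; the 1000-drive input is a hypothetical round number, not one of the submissions.

```python
SLOPE = 193.68       # NFS ops/sec per additional disk drive
INTERCEPT = 23_834   # NFS ops/sec at zero drives, per the fit

def predicted_nfs_ops(drives):
    """NFS throughput ops/sec predicted by the regression line."""
    return SLOPE * drives + INTERCEPT

# Hypothetical 1000-drive system, for illustration only.
example_ops = predicted_nfs_ops(1000)
print(f"{example_ops:,.0f} NFS ops/sec")
```

With an R**2 of 0.87, roughly 13% of the variation in throughput is not explained by drive count alone, so any single prediction can be off by a fair margin.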
A couple of other interesting things about the chart:
IBM released a new SONAS benchmark with 1975 disks, with 16 interface and 10 storage nodes to attain its 403K NFS ops/second. Now the SONAS had 512GB of NV Flash, which I assume is being used for redundancy purposes on writes and not as a speedup for read activity. Also the SONAS system complex had over 2.4TB of cache (includes the NV Flash). So there was a lot of cache to throw at the problem.
HP BL860c results were from a system with 1480 drives, 4 nodes (blades) and ~800GB of cache to attain its 333K NFS ops/second.
(aside) Probably need to do a chart like this with amount of cache as the x variable (/aside)
In the same report we talked about the new #1 performing EMC VNX Gateway that used 75TB of SAS-SSDs and 4 VNX5700’s as its backend. It was able to reach 497K NFS ops/sec. It doesn’t show up on this chart because of its extensive use of SSDs. But according to the equation above one would need to use ~2500 disk drives to attain similar performance without SSDs and, I believe, a whole lot of cache.
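Inverting the regression equation gives the drive count behind that estimate, and the same line can be compared against the SONAS and HP results mentioned above. This is only a sketch of the arithmetic using the rounded figures quoted in this post.

```python
SLOPE = 193.68
INTERCEPT = 23_834

def predicted_nfs_ops(drives):
    return SLOPE * drives + INTERCEPT

def drives_needed(target_ops):
    """Disk drives the regression line implies for a target ops/sec."""
    return (target_ops - INTERCEPT) / SLOPE

vnx_equivalent_drives = drives_needed(497_000)

# Actual (rounded) results quoted in the text vs. what the line predicts.
sonas_predicted = predicted_nfs_ops(1975)   # actual ~403K
hp_predicted = predicted_nfs_ops(1480)      # actual ~333K

print(f"drives for 497K ops/sec: {vnx_equivalent_drives:,.0f}")
print(f"SONAS predicted: {sonas_predicted:,.0f}")
print(f"HP predicted:    {hp_predicted:,.0f}")
```

The line gives about 2,443 drives for 497K ops/sec, which is where the rough ~2500 comes from. SONAS lands very close to the line, while the HP result comes in a bit above it.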
The full performance dispatch will be up on our website after the middle of next month (I promise) but if one is interested in seeing it sooner sign up for our free monthly newsletter (see subscription request, above right) or subscribe by email and we will send the current issue along with download instructions for this and other reports. If you need an even more in-depth analysis of NAS system performance please consider purchasing SCI’s NAS Buying Guide also available from our website.
As always, we welcome any constructive suggestions on how to improve any of our storage performance analysis.
Jose Barreto blogged about a recent report Microsoft did on File Server Capacity Tool (FSCT) results (blog here, report here). As you may know, FSCT is a free tool released in September of 2009, available from Microsoft, that verifies an SMB (CIFS) and/or SMB2 storage server configuration.
The FSCT can be used by anyone to verify that an SMB/SMB2 file server configuration can adequately support a particular number of users, doing typical Microsoft Office/Windows Explorer work with home folders.
Jetstress for SMB file systems?
FSCT reminds me a little of Microsoft’s Jetstress tool used in the Exchange Solution Review Program (ESRP) which I have discussed extensively in prior blog posts (search my blog) and other reports (search my website). Essentially, FSCT has a simulated “home folder” workload which can be dialed up or down by the number of users selected. As such, it can be used to validate any NAS system which supports SMB/SMB2 or CIFS protocol.
Both Jetstress and FSCT are capacity verification tools. However, I look at all such tools as a way of measuring system performance for a solution environment and FSCT is no exception.
Microsoft FSCT results
In Jose’s post on the report he discusses performance for five different storage server configurations running anywhere from 4500 to 23,000 active home directory users, employing white box servers running Windows (Storage) Server 2008 and 2008 R2 with various server hardware and SAS disk configurations.
Network throughput ranged from 114 to 650 MB/sec. Certainly respectable numbers and somewhat orthogonal to the NFS and CIFS throughput operations/second reported by SPECsfs2008. Unclear if FSCT reports activity in operations/second.
Microsoft’s FSCT reports did not specifically state what the throughput was other than at the scenario level. I assume the network throughput that Jose reported was extracted concurrently with the test run from something akin to Perfmon. FSCT seems to only report performance or throughput as the number of home folder scenarios sustainable per second and the number of users. Perhaps there is an easy way to convert user scenarios to network throughput?
While the results for the file server runs look interesting, I always want more. For whatever reason, I have lately become enamored with ESRP’s log playback results (see my latest ESRP blog post) and it’s not clear whether FSCT reports anything similar to this. Something like file server simulated backup performance would suffice from my perspective.
Despite that, another performance tool is always of interest and I am sure my readers will want to take a look as well. The current FSCT tester can be downloaded here.
Not sure whether Microsoft will be posting vendor results for FSCT similar to what they do for Jetstress via ESRP but that would be a great next step. Getting the vendors onboard is another problem entirely. SPECsfs2008 took almost a year to get the first 12 (NFS) submissions and today, almost 9 months later there are still only ~40 NFS and ~20 CIFS submissions.