OCZ just announced that their new Octane 1TB SSD can perform reads and writes in under 100μsec (specifically, “Read: 0.06ms; Write: 0.09ms”). Such fast access times boggle the imagination and, even with SATA 3, seem almost unobtainable.
Speed matters, especially with SSDs
Why would any device try to reach a 90μsec write access time and a 60μsec read access time? With the advent of high-speed stock trading, where even distance matters a lot, latency is becoming a hot topic once again.
How SSD access time translates into storage system latency or response time is another matter. But one can see some seriously fast storage system latencies (or LRT) in TMS’s latest RamSan SPC-1 benchmark results: under ~90μsec, measured at the host level! (See my May dispatch on latest SPC performance). On the other hand, how they measure 90μsec host-level latencies without a logic analyzer attached is beyond me.
How are they doing this?
How can OCZ’s SATA SSD deliver such fast access times? NAND is too slow to provide this access time for writes, so there must be some magic. For instance, NAND writes (programming) can take on the order of a couple of hundred μsec, and that doesn’t include the erase time, which is more like 1/2msec. So the only way to support a 90μsec write or storage system access time with NAND chips is by buffering write data into an “on-device” DRAM cache.
NAND reads are quite a bit faster, on the order of 25μsec for the first byte and 25nsec for each byte after that. As such, SSD read data could conceivably be coming directly from NAND. However, you have to set aside some device latency/access time to perform IO command processing, chip addressing, channel setup, etc. Thus, it wouldn’t surprise me to see them using the DRAM cache for read data as well.
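As a rough check on those read figures, here is a minimal Python sketch that estimates how long a serial read from a single NAND chip would take, using the ~25μsec first-byte and ~25nsec per-additional-byte numbers quoted above. The function name and the transfer sizes are illustrative assumptions. The sketch ignores command processing, channel setup and any parallelism across NAND channels.

```python
# Rough NAND read-time estimate based on the figures quoted in the text.
# Assumption: bytes are streamed serially from a single chip, no overhead.
FIRST_BYTE_USEC = 25.0   # ~25 usec to reach the first byte
PER_BYTE_USEC = 0.025    # ~25 nsec for each byte after the first


def nand_read_usec(num_bytes: int) -> float:
    """Estimated time in usec to read num_bytes from one NAND chip."""
    if num_bytes < 1:
        raise ValueError("num_bytes must be at least 1")
    return FIRST_BYTE_USEC + (num_bytes - 1) * PER_BYTE_USEC


if __name__ == "__main__":
    # Illustrative transfer sizes, not taken from the OCZ announcement.
    for size in (512, 4096, 8192):
        print(f"{size:5d} bytes: ~{nand_read_usec(size):.1f} usec")
```

Under these assumptions a 512-byte read takes roughly 38μsec, which fits inside the quoted 60μsec. A 4KB read from a single chip takes roughly 127μsec, so larger reads would seem to need either the DRAM cache or data spread across several chips read in parallel.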
I never thought I would see sub-1msec storage system response times, but that barrier was broken a couple of years ago by IBM’s DS8300 Turbo. With the advent of DRAM caching for NAND SSDs and the new, purpose-built all-SSD storage systems, it seems we are already in the age of sub-100μsec response times.
I fear that to get much below this, we may need something like the next generation of SATA or SAS to come out, along with even faster processing/memory speeds. But from where I sit, sub-10μsec response times don’t seem that far away. By then, distance will matter even more.
Since our last blog post on this subject there have been 6 new entries in the LRT Top 10 (#3-6 & 9-10). As can be seen in the chart, which combines SPC-1 and SPC-1/E results, response times vary considerably. 7 of these top 10 LRT results come from subsystems which either have all SSDs (#1-4, 7 & 9) or have a large NAND cache (#5). The newest members on this chart were the NetApp FAS3270A and the Xiotech Emprise 5000-300GB disk drives, which were published recently.
The NetApp FAS3270A, a mid-range subsystem with 1TB of NAND cache (512GB in each controller), seemed to do pretty well here, with only all-SSD systems doing better than it and a pair of all-SSD systems doing worse than it. Coming in under 1msec LRT is no small feat. We are certain the NAND cache helped NetApp achieve their superior responsiveness.
What the Xiotech Emprise 5000-300GB storage subsystem is doing here is another question. They have always done well on an IOPS/drive basis (see SPC-1&-1/E results IOPs/Drive – chart of the month), but being top ten in LRT had not previously been their forte. How one coaxes a 1.47msec LRT out of a 20-drive system that costs only ~$41K, about 12X lower than the median price (~$509K) of the other subsystems here, is a mystery. Of course, they were using RAID 1, but so were half of the subsystems on this chart.
The full performance dispatch will be up on our website in a couple of weeks but if you are interested in seeing it sooner just sign up for our free monthly newsletter (see upper right) or subscribe by email and we will send you the current issue with download instructions for this and other reports.
As always, we welcome any constructive suggestions on how to improve our storage performance analysis.
Lost in much of the discussion on storage system performance is the need for both throughput and response time measurements.
By IO throughput I generally mean data transfer speed in megabytes per second (MB/s or MBPS); however, another definition of throughput is IO operations per second (IO/s or IOPS). I prefer the MB/s designation for storage system throughput because it complements response time well, whereas IO/s can often be confounded with response time. Nevertheless, both metrics qualify as storage system throughput.
By IO response time I mean the time it takes a storage system to perform an IO operation from start to finish, usually measured in milliseconds, although lately some subsystems have dropped below the 1msec threshold. (See my post from last year on SPC LRT results for information on some top response time results).
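To make the relationship between these metrics concrete, here is a minimal Python sketch. It assumes a steady-state workload with a fixed number of outstanding IOs and applies Little's law (throughput equals concurrency divided by response time). The function names, the queue depth of 32 and the 4KB IO size are illustrative assumptions, and a MB is taken as 1024KB.

```python
# Sketch of how IO/s, MB/s and response time relate under simple assumptions.

def iops_from_response_time(outstanding_ios: int, response_time_msec: float) -> float:
    """Little's law: IO/s = outstanding IOs / response time (in seconds)."""
    return outstanding_ios * 1000.0 / response_time_msec


def mb_per_sec(iops: float, io_size_kb: float) -> float:
    """Data transfer rate implied by an IO rate and a fixed IO size."""
    return iops * io_size_kb / 1024.0


if __name__ == "__main__":
    # Hypothetical workload: 32 outstanding IOs of 4KB each.
    for rt_msec in (1.0, 0.5):
        iops = iops_from_response_time(32, rt_msec)
        print(f"{rt_msec} msec -> {iops:.0f} IO/s, {mb_per_sec(iops, 4):.0f} MB/s")
```

With the queue depth held fixed, halving the response time doubles the IO/s, which is one way to see why IO/s and response time are hard to separate for small-block workloads.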
Benchmark measurements of response time and throughput
Both Standard Performance Evaluation Corporation’s SPECsfs2008 and Storage Performance Council’s SPC-1 provide response time measurements although they measure substantially different quantities. The problem with SPECsfs2008’s measurement of ORT (overall response time) is that it’s calculated as a mean across the whole benchmark run rather than a strict measurement of least response time at low file request rates. I believe any response time metric should measure the minimum response time achievable from a storage system although I can understand SPECsfs2008’s point of view.
On the other hand, SPC-1’s measurement of LRT (least response time) is just what I would like to see in a response time measurement. SPC-1 provides the time it takes to complete an IO operation at very low request rates.
With regard to throughput, once again SPECsfs2008’s measurement leaves something to be desired, as it’s strictly a measurement of NFS or CIFS operations per second. Of course this includes a number (>40%) of non-data transfer requests as well as data transfers, so it confounds any measurement of how much data can be transferred per second. But, from their perspective, a file system needs to do more than just read and write data, which is why they mix these other requests in with their measurement of NAS throughput.
Storage Performance Council’s SPC-1 reports throughput results as IOPS and provides no direct measure of MB/s unless one looks to their SPC-2 benchmark results. SPC-2 reports a direct measure of MBPS, which is an average of three different data-intensive workloads: large file access, video-on-demand and a large database query workload.
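The difference between the two styles of measurement can be sketched in a few lines of Python. The load points and response times below are hypothetical, and the two functions are simplified stand-ins for "a mean across the whole run" and "response time at the lowest request rate". They are not the actual SPECsfs2008 or SPC-1 formulas.

```python
from statistics import mean

# Hypothetical (load percent, average response time in msec) measurements.
load_points = [(10, 0.8), (50, 1.4), (80, 2.6), (90, 3.9), (95, 5.1), (100, 7.2)]


def mean_response_time(points):
    """Mean across the whole run, in the spirit of an overall response time."""
    return mean(rt for _, rt in points)


def low_load_response_time(points):
    """Response time at the lowest load point, in the spirit of LRT."""
    lowest = min(points, key=lambda p: p[0])
    return lowest[1]


if __name__ == "__main__":
    print(f"mean over run : {mean_response_time(load_points):.2f} msec")
    print(f"at lowest load: {low_load_response_time(load_points):.2f} msec")
```

In this made-up run the mean is several times the low-load figure, because queueing at high load dominates the average. That is why a run-wide mean says little about the minimum response time a subsystem can deliver.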
Why response time and throughput matter
Historically, we used to say that OLTP (online transaction processing) activity performance was entirely dependent on response time: the better the storage system response time, the better your OLTP systems performed. Nowadays it’s a bit more complex, as some of today’s database queries can depend as much on sequential database transfers (or throughput) as on individual IO response time. Nonetheless, I feel that there is still a large component of response-time-critical workloads out there that perform much better with shorter response times.
On the other hand, high throughput has its growing gaggle of adherents as well. When it comes to high sequential data transfer workloads such as data warehouse queries, video or audio editing/download or large file data transfers, throughput as measured by MB/s reigns supreme – higher MB/s can lead to much faster workloads.
The only question that remains is who needs higher throughput as measured by IO/s rather than MB/s. I would contend that mixed workloads which contain components of random as well as sequential IOs and typically smaller data transfers can benefit from high IO/s storage systems. The only confounding matter is that these workloads obviously benefit from better response times as well. That’s why throughput as measured by IO/s is a much more difficult number to understand than any pure MB/s numbers.
Now there is a contingent of performance gurus today that believe that IO response times no longer matter. In fact if one looks at SPC-1 results, it takes some effort to find its LRT measurement. It’s not included in the summary report.
Also, in the post mentioned above there appears to be a definite bifurcation of storage subsystems with respect to response time, i.e., some subsystems are focused on response time while others are not. I would have liked to see some more of the top enterprise storage subsystems represented in the top LRT subsystems but alas, they are missing.
Call me old fashioned but I feel that response time represents a very important and orthogonal performance measure with respect to throughput of any storage subsystem and as such, should be much more widely disseminated than it is today.
For example, there is a substantive difference between a fighter jet’s or race car’s top speed and its maneuverability. I would compare top speed to storage throughput and maneuverability to IO response time. Perhaps this doesn’t matter as much for a jet liner or family car, but it can matter a lot in the right domain.
Now do you want your storage subsystem to be a jet fighter or a jet liner – you decide.
The above chart shows the top 12 LRT™ (least response time) results for Storage Performance Council’s SPC-1 benchmark. The vertical axis is the LRT in milliseconds (msec) for the top benchmark runs. As can be seen, the two subsystems from TMS (RamSan400 and RamSan320) dominate this category with LRTs significantly less than 2.5msec. IBM’s DS8300 and its Turbo cousin come in next, followed by a slew of others.
The 1msec barrier
Aside from the blistering LRT from the TMS systems, one significant item in the chart above is that the two IBM DS8300 systems crack the 1msec barrier using rotating media. I didn’t think I would ever see the day. Of course, this happened 3 or more years ago. Still, it’s kind of interesting that there haven’t been more vendors with subsystems that can achieve this.
LRT is probably most useful for high cache hit workloads. For these workloads the data comes directly out of cache and the only thing between a server and its data is subsystem IO overhead, measured here as LRT.
Encryption cheap and fast?
The other interesting tidbit from the chart is that the DS5300 with full drive encryption (FDE), using drives which I believe come from Seagate, cracks into the top 12 at 1.8msec, exactly equal to the IBM DS5300 without FDE. Now FDE from Seagate is a hardware drive encryption capability and its overhead might not be measurable at a subsystem level. Nonetheless, it shows that having data security need not reduce performance.
What is not shown in the above chart is that adding FDE to the base subsystem cost only an additional US$10K (base DS5300 listed at US$722K and FDE version at US$732K). That seems like a small price to pay for data security, which in this case is simply a matter of turning it on, generating keys, and forgetting it.
FDE is a hard drive feature where the drive itself encrypts all data written to, and decrypts all data read from, the drive, and requires a subsystem-supplied drive key at power on/reset. In this way the data is never in plaintext on the drive itself. If the drive were taken out of the subsystem and attached to a drive tester, all one would see is ciphertext. Similar capabilities have been available in enterprise and SMB tape drives in the past, but to my knowledge the IBM DS5300 FDE is the first disk storage benchmark with drive encryption.
I believe the key manager for the DS5300 FDE is integrated within the subsystem. Most shops would need a separate, standalone key manager for more extensive data security. I believe the DS5300 can also interface with a standalone (IBM) key manager. In any event, it’s still an easy and simple step towards increased data security for a data center.