OCZ just announced that their new Octane 1TB SSD can perform reads and writes in under 100 μsec (specifically "Read: 0.06ms; Write: 0.09ms"). Such fast access times boggle the imagination and, even with SATA 3, seem almost unobtainable.
Speed matters, especially with SSDs
Why would any device try to reach a 90 μsec write access time and a 60 μsec read access time? With the advent of high-speed stock trading, where even distance matters a lot, latency is becoming a hot topic once again.
How SSD access time translates into storage system latency or response time is another matter. But one can see some seriously fast storage system latencies (or LRT) in TMS's latest RamSan SPC-1 benchmark results: under ~90 μsec measured at the host level! (See my May dispatch on the latest SPC performance.) On the other hand, how they measure 90 μsec host-level latencies without a logic analyzer attached is beyond me.
How are they doing this?
How can OCZ's SATA SSD deliver such fast access times? NAND is too slow to provide this access time for writes, so there must be some magic. For instance, NAND writes (programming) can take on the order of a couple hundred μsec, and that doesn't include the erase time, which is more like 0.5 msec. So the only way to support a 90 μsec write or storage system access time with NAND chips is by buffering write data in an "on-device" DRAM cache.
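A little arithmetic with the rough timings above shows why the write path can't be NAND-only. (The DRAM/overhead figure and the pages-per-block geometry below are illustrative assumptions, not OCZ specs.)

```python
# Rough write-path arithmetic using the NAND timings cited above. The
# DRAM-ack figure and pages-per-block geometry are illustrative
# assumptions, not any vendor's published specs.
NAND_PROGRAM_US = 200.0   # page program: "a couple hundred usec"
NAND_ERASE_US = 500.0     # block erase: ~0.5 msec
PAGES_PER_BLOCK = 64      # erase cost amortized over this many pages
DRAM_ACK_US = 10.0        # DRAM buffer write + command overhead

# Writing straight to NAND can never meet a 90 usec target:
direct_write_us = NAND_PROGRAM_US + NAND_ERASE_US / PAGES_PER_BLOCK
print(f"direct NAND write: {direct_write_us:.1f} usec")

# Acknowledging the host once data lands in DRAM hides the program
# time, which then completes in the background:
buffered_write_us = DRAM_ACK_US
print(f"DRAM-buffered ack: {buffered_write_us:.1f} usec")
```

Even with the erase amortized across a whole block, the direct path lands well above 90 μsec, so the acknowledgment almost has to come from DRAM.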
NAND reads are quite a bit faster, on the order of 25 μsec for the first byte and 25 nsec for each byte after that. As such, SSD read data could conceivably come directly from NAND. However, you have to set aside some device latency/access time for I/O command processing, chip addressing, channel setup, etc. Thus, it wouldn't surprise me to see them using the DRAM cache for read data as well.
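Using the read timings above, a quick sketch (assuming a single serial channel, with channel count as a free parameter) shows that even reads need parallelism to approach a 60 μsec target once transfer time is counted:

```python
# Back-of-the-envelope NAND read latency, using the figures quoted
# above: ~25 usec to the first byte, ~25 nsec per byte thereafter.
# Single-channel serial transfer is the baseline assumption.
FIRST_BYTE_US = 25.0   # initial array-access latency, usec
PER_BYTE_US = 0.025    # serial transfer time per byte, usec (25 nsec)

def nand_read_latency_us(nbytes: int, channels: int = 1) -> float:
    """Time to read nbytes from NAND over `channels` parallel channels."""
    per_channel = nbytes / channels
    return FIRST_BYTE_US + per_channel * PER_BYTE_US

for kb in (0.5, 4):
    one = nand_read_latency_us(int(kb * 1024))
    eight = nand_read_latency_us(int(kb * 1024), channels=8)
    print(f"{kb:>4} KB read: {one:6.1f} usec (1 channel), "
          f"{eight:6.1f} usec (8 channels)")
```

A 4 KB read over one channel takes well over 100 μsec in transfer alone, which is why multi-channel controllers (and possibly the DRAM cache) matter even on the read side.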
I never thought I would see sub-1 msec storage system response times, but that barrier was broken a couple of years ago with IBM's DS8300 Turbo. With the advent of DRAM caching for NAND SSDs and the new, purpose-built all-SSD storage systems, it seems we are already in the age of sub-100 μsec response times.
I fear that to get much below this we may need the next generation of SATA or SAS and even faster processing/memory speeds. But from where I sit, sub-10 μsec response times don't seem that far away. By then, distance will matter even more.
Since our last blog post on this subject there have been six new entries in the LRT Top 10 (#3-6 & 9-10). As can be seen in the chart, which combines SPC-1 and SPC-1/E results, response times vary considerably. Seven of these top 10 LRT results come from subsystems which either use all SSDs (#1-4, 7 & 9) or have a large NAND cache (#5). The newest members on this chart were the NetApp FAS3270A and the Xiotech Emprise 5000 with 300GB disk drives, both published recently.
The NetApp FAS3270A, a mid-range subsystem with 1TB of NAND cache (512GB in each controller), seemed to do pretty well here, with most of the all-SSD systems doing better than it and a pair of them doing worse. Coming in under 1 msec LRT is no small feat. We are certain the NAND cache helped NetApp achieve its superior responsiveness.
What the Xiotech Emprise 5000-300GB storage subsystem is doing here is another question. They have always done well on an IOPS/drive basis (see SPC-1 & -1/E results IOPS/drive chart of the month), but being top ten in LRT had not previously been their forte. How one coaxes a 1.47 msec LRT out of a 20-drive system that costs only ~$41K, 12X lower than the median price (~$509K) of the other subsystems here, is a mystery. Of course, they were using RAID 1, but so were half of the subsystems on this chart.
The full performance dispatch will be up on our website in a couple of weeks but if you are interested in seeing it sooner just sign up for our free monthly newsletter (see upper right) or subscribe by email and we will send you the current issue with download instructions for this and other reports.
As always, we welcome any constructive suggestions on how to improve our storage performance analysis.
SSD and/or SSS (solid state storage) performance is a mystery to most end-users. The technology is inherently asymmetrical, i.e., it reads much faster than it writes. I have written on some of these topics before (STEC’s new MLC drive, Toshiba’s MLC flash, Tape V Disk V SSD V RAM) but the issue is much more complex when you put these devices behind storage subsystems or in client servers.
Some items that need to be considered when measuring SSD/SSS performance include:
Is this a new or used SSD?
What R:W ratio will we use?
What blocksize should be used?
Do we use sequential or random I/O?
What block inter-reference interval should be used?
This list is necessarily incomplete but it’s representative of the sort of things that should be considered to measure SSD/SSS performance.
New device or pre-conditioned
Hard drives show little performance difference whether new or pre-owned, defect skips notwithstanding. In contrast, SSDs/SSSs can perform very differently when new versus after they have been used for a short period, depending on their internal architecture. A new SSD can write without erasure throughout its entire memory address space, but sooner or later wear leveling must kick in to equalize the use of the device's NAND memory blocks. Wear leveling causes both reads and rewrites of data during its processing. Such activity takes bandwidth and controller processing away from normal I/O. If you have a new device, it may take days or weeks of activity (depending on how fast you write) to reach the device's steady state, where each write causes some sort of wear leveling activity.
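A toy flash-translation-layer model makes the new-versus-used gap concrete. The geometry and the greedy garbage-collection policy below are simplifying assumptions, not any vendor's firmware: during the first fill every host write costs exactly one NAND program, but at steady state relocation programs pile on top.

```python
import random

# Toy SSD model: a new device writes into fresh pages at cost 1 program
# per host write, but a used device must garbage collect, relocating
# live data before it can erase. Geometry and the greedy GC policy are
# simplifying assumptions, not any vendor's actual firmware.
BLOCKS, PAGES, SPARE = 32, 16, 4
CAPACITY = (BLOCKS - SPARE) * PAGES   # logical pages exposed to the host

def write_amplification(n_writes: int, seed: int = 1) -> float:
    """NAND page programs per host write under greedy garbage collection."""
    rng = random.Random(seed)
    valid = [set() for _ in range(BLOCKS)]  # live logical pages per block
    used = [0] * BLOCKS                     # programmed page slots per block
    where = {}                              # logical page -> physical block
    programs = 0

    def free_slot() -> int:
        """Find a block with an unprogrammed page, garbage collecting if needed."""
        nonlocal programs
        for b in range(BLOCKS):
            if used[b] < PAGES:
                return b
        # All slots programmed: erase the block with the least live data,
        # first relocating its live pages. These extra programs are the
        # overhead a brand-new device never pays.
        victim = min(range(BLOCKS), key=lambda b: len(valid[b]))
        movers = list(valid[victim])
        valid[victim].clear()
        used[victim] = 0                    # "erase"
        for lp in movers:
            b = free_slot()
            valid[b].add(lp); used[b] += 1; where[lp] = b
            programs += 1
        return free_slot()

    for _ in range(n_writes):
        lp = rng.randrange(CAPACITY)        # random host write
        if lp in where:
            valid[where[lp]].discard(lp)    # overwrite invalidates old copy
        b = free_slot()
        valid[b].add(lp); used[b] += 1; where[lp] = b
        programs += 1
    return programs / n_writes

print(f"new device (first fill):  WA = {write_amplification(CAPACITY):.2f}")
print(f"steady state (10x fills): WA = {write_amplification(CAPACITY * 10):.2f}")
```

The first-fill run reports a write amplification of exactly 1.0; the longer run comes out above 1.0, which is precisely the extra background work a benchmark on a fresh device never sees.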
Read:write ratio to use
Historically, hard drives have had slightly slower write seeks than read seeks, due to the need to be more accurately positioned to write data than to read it. As such, it might take 0.5 msec longer to write than to read 4K bytes. But for SSDs the problem is much more acute, e.g., read times can be in the tens of microseconds while write times approach a millisecond for some SSDs/SSSs. This is due to the nature of NAND flash: a block must be erased before it can be programmed (written), and the programming process takes a lot longer than a read.
So the question for measuring SSD performance is what read-to-write (R:W) ratio to use. Historically, a R:W of 2:1 was used to simulate enterprise environments, but most devices are now seeing something more like 1:1 for enterprise applications, due to the caching and buffering provided by controllers and host memory. I can't speak as well for desktop environments, but it wouldn't surprise me to see 2:1 used to simulate desktop workloads as well.
SSDs operate a lot faster with a 1000:1 workload than with a 1:1 workload. Most SSD data sheets tout a significant read I/O rate, but only for 100% read workloads. This is like a subsystem vendor quoting a 100% read cache hit rate (which some do), something unusual in the real world of storage.
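The effect of the R:W mix on average latency is easy to see with asymmetric timings. The figures below are illustrative assumptions, not any particular device's specs:

```python
# Weighted-average latency for an asymmetric device (illustrative
# NAND-ish timings, not a specific SSD's datasheet numbers).
READ_US, WRITE_US = 25.0, 250.0

def avg_latency_us(reads: int, writes: int) -> float:
    """Average per-I/O latency for a reads:writes mix."""
    return (reads * READ_US + writes * WRITE_US) / (reads + writes)

for r, w in ((100, 0), (1000, 1), (2, 1), (1, 1)):
    print(f"{r}:{w} R:W -> {avg_latency_us(r, w):6.1f} usec average")
```

A 100% read (or 1000:1) workload stays near the 25 μsec read time, while a 1:1 mix is dominated by writes, which is exactly why datasheet read-only numbers overstate real-world performance.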
Blocksize to use
Hard drives are not insensitive to blocksize, as blocks can span tracks, requiring track-to-track seeks to be read or written. However, SSDs can also interact adversely with varying blocksizes. This depends on the internal SSD architecture and is due to over-optimizing write performance.
With an SSD, you erase a block of NAND and write a page or sector of NAND at a time. As writes take much longer than reads, many SSD vendors add parallelism to improve write throughput, programming multiple sectors at the same time. Thus, if your blocksize is an integral multiple of the multi-sector size, write performance is great; if not, performance can suffer.
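A small sketch of the alignment effect (the stripe width below is an illustrative assumption, not a real controller's geometry):

```python
# How host blocksize interacts with the SSD's multi-sector program
# stripe. The 16 KB stripe width is an illustrative assumption.
STRIPE_BYTES = 16 * 1024   # e.g. 4 sectors x 4 KB programmed in parallel

def program_ops(blocksize: int) -> int:
    """Parallel program operations needed to write one host block."""
    return -(-blocksize // STRIPE_BYTES)   # ceiling division

def wasted_fraction(blocksize: int) -> float:
    """Fraction of programmed capacity that is padding, not host data."""
    return 1 - blocksize / (program_ops(blocksize) * STRIPE_BYTES)

for kb in (16, 32, 20):
    bs = kb * 1024
    print(f"{kb:>2} KB block: {program_ops(bs)} program op(s), "
          f"{wasted_fraction(bs):.0%} padding")
```

An integral multiple of the stripe (16 or 32 KB here) wastes nothing, while a misaligned 20 KB block burns two full program operations with over a third of the programmed capacity as padding.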
In all honesty, similar issues exist with hard drive sector sizes. If your blocksize is an integral multiple of the drive sector size, performance is great; if not, too bad. In contrast to SSDs, drive sector size is often configurable at the device level.
Sequential vs. random IO
Hard drives perform sequential I/O much better than random I/O. For SSDs this is not much of a problem: once wear leveling kicks in, it's all random to the NAND flash anyway. So when comparing hard drives to SSDs, the level of sequentiality is a critical parameter to control.
Cache hit rate
The block inter-reference interval simply measures how often the same block is re-referenced. This is important for caching devices and systems because it ultimately determines the cache hit rate (reading data directly from cache instead of the device's storage). Hard drives have onboard caches of 8 to 32MB today. SSDs also have a DRAM cache, used for data buffering among other things. SSD vendors typically publicize their cache size, so in order to ensure 0 cache hits one needs a block inter-reference footprint close to the device's capacity. Not a problem today with 146GB devices, but as they move to 300GB and larger, it becomes more of a problem to completely characterize device performance.
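A back-of-the-envelope view shows why the benchmark's footprint must approach device capacity to eliminate cache hits. The model ignores replacement policy and locality, and the 64 MB cache size is an illustrative assumption, not a specific device's spec:

```python
# For uniformly random access over a benchmark footprint, the expected
# hit rate in an on-device cache is roughly cache/footprint. This
# ignores replacement policy and locality; the 64 MB cache size is an
# illustrative assumption.
def expected_hit_rate(cache_bytes: float, footprint_bytes: float) -> float:
    """Fraction of accesses served from cache under uniform random access."""
    return min(1.0, cache_bytes / footprint_bytes)

MB, GB = 1 << 20, 1 << 30
cache = 64 * MB
for footprint in (64 * MB, 1 * GB, 146 * GB, 300 * GB):
    hr = expected_hit_rate(cache, footprint)
    print(f"footprint {footprint / GB:7.3f} GB: expected hit rate {hr:.4%}")
```

With a footprint matching the cache every access hits; spread the same accesses across a 146GB or 300GB device and the hit rate drops to a small fraction of a percent, effectively exercising the NAND itself.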
So how do we get a handle on SSD performance? SNIA and others are working on a specification on how to measure SSD performance that will one day become a standard. When the standard is available we will have benchmarks and service groups that can run these benchmarks to validate SSD vendor performance claims. Until then – caveat emptor.
Of course most end users would claim that device performance is not as important as (sub)system performance which is another matter entirely…