IO performance – Silverton Consulting

OCZ just announced that their new Octane 1TB SSD can perform reads and writes in under 100 μsec (specifically “Read: 0.06ms; Write: 0.09ms”). Such fast access times boggle the imagination and, even with SATA 3, seem almost unobtainable.

Speed matters, especially with SSDs

Why would any device try to reach a 90μsec write access time and a 60μsec read access time? With the advent of high-speed stock trading, where even distance matters a lot, latency is becoming a hot topic once again.

Although from my perspective it never really went away (see my Storage throughput vs. IO response time and why it matters post). So access times measured in tens of μsec are just fine by me.

How SSD access time translates into storage system latency or response time is another matter. But one can see some seriously fast storage system latencies (or LRT) in TMS’s latest RAMSAN SPC-1 benchmark results, under ~90μsec measured at the host level! (See my May dispatch on latest SPC performance). On the other hand, how they measure 90μsec host level latencies without a logic analyzer attached is beyond me.

How are they doing this?

How can OCZ’s SATA SSD deliver such fast access times? NAND is too slow to provide this access time for writes, so there must be some magic. For instance, NAND writes (programming) can take on the order of a couple of hundred μsec, and that doesn’t include the erase time, which is more like 1/2 msec. So the only way to support a 90μsec write or storage system access time with NAND chips is by buffering write data in an on-device DRAM cache.
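As a back-of-envelope check on that argument, the sketch below compares the quoted 0.09 ms write access time with rough NAND timings. The 200 μsec program time and 500 μsec erase time are my readings of "a couple of 100μsecs" and "1/2msec" above, and the function name is purely illustrative.

```python
# Rough NAND timings read off the text above; real chips vary (assumed values).
NAND_PROGRAM_US = 200.0   # "a couple of 100 usec" to program a page
NAND_ERASE_US = 500.0     # "more like 1/2 msec" to erase a block
QUOTED_WRITE_US = 90.0    # OCZ's quoted 0.09 ms write access time


def write_needs_buffering(quoted_us, program_us, erase_us=0.0):
    """Return True when NAND alone cannot meet the quoted write time."""
    return program_us + erase_us > quoted_us


# Program time alone, then program plus erase.
print(write_needs_buffering(QUOTED_WRITE_US, NAND_PROGRAM_US))
print(write_needs_buffering(QUOTED_WRITE_US, NAND_PROGRAM_US, NAND_ERASE_US))
```

Even with the erase left out, the program time alone is more than double the quoted figure, which is why a DRAM write buffer looks like the only plausible explanation.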

NAND reads are quite a bit faster, on the order of 25μsec for the first byte and 25nsec for each byte after that. As such, SSD read data could conceivably be coming directly from NAND. However, you have to set aside some device latency/access time to perform IO command processing, chip addressing, channel setup, etc. Thus, it wouldn’t surprise me to see them using the DRAM cache for read data as well.
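To put numbers on the read side, here is a minimal sketch that applies the two figures above (25 μsec to the first byte, 25 nsec per byte after that) to a couple of transfer sizes. The sizes are my own picks, not from OCZ's data sheet, and the model assumes a single chip transferring serially, which ignores any striping across channels a real SSD may do.

```python
FIRST_BYTE_US = 25.0    # time to the first byte, from the text
PER_BYTE_US = 0.025     # 25 nsec for each following byte, from the text


def nand_read_us(nbytes):
    """Estimated time to read nbytes straight from one NAND chip, in usec."""
    if nbytes < 1:
        raise ValueError("nbytes must be at least 1")
    return FIRST_BYTE_US + (nbytes - 1) * PER_BYTE_US


for size in (512, 4096):   # assumed transfer sizes, for illustration only
    print(f"{size:5d} bytes: {nand_read_us(size):6.1f} usec")
```

Under these assumptions a 512 byte read comes in under 40 μsec, leaving some room for command processing inside a 60 μsec budget, while a serial 4KB read would not fit at all.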

—–

I never thought I would see sub-1msec storage system response times but that barrier was broken a couple of years ago with IBM’s Turbo 8300. With the advent of DRAM caching for NAND SSDs and the new, purpose-built all-SSD storage systems, it seems we are already in the age of sub-100μsec response times.

I fear that to get much below this, we may need something like the next generation of SATA or SAS to come out, along with even faster processing/memory speeds. But from where I sit sub-10μsec response times don’t seem that far away. By then, distance will matter even more.

Comments?

SSDs! :) by gimpbully (cc) (from flickr)

SSD and/or SSS (solid state storage) performance is a mystery to most end-users. The technology is inherently asymmetrical, i.e., it reads much faster than it writes. I have written on some of these topics before (STEC’s new MLC drive, Toshiba’s MLC flash, Tape V Disk V SSD V RAM) but the issue is much more complex when you put these devices behind storage subsystems or in client servers.

Some items that need to be considered when measuring SSD/SSS performance include:

Is this a new or used SSD?
What R:W ratio will we use?
What blocksize should be used?
Do we use sequential or random I/O?
What block inter-reference interval should be used?

This list is necessarily incomplete but it’s representative of the sort of things that should be considered to measure SSD/SSS performance.
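One way to keep these considerations straight is to record them as a single workload description per benchmark run. The sketch below is only an illustration: the class name, field names and sample values are all hypothetical and do not come from any benchmark tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SsdWorkload:
    """One benchmark configuration; every field name here is hypothetical."""
    preconditioned: bool         # new device, or already at steady state?
    read_fraction: float         # 0.5 means a 1:1 R:W ratio
    block_size_bytes: int        # transfer size per IO
    random_fraction: float       # 0.0 = fully sequential, 1.0 = fully random
    reref_interval_bytes: int    # data touched before a block is re-referenced

    def __post_init__(self):
        # Reject settings that cannot describe a real workload.
        for name in ("read_fraction", "random_fraction"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1")
        if self.block_size_bytes <= 0 or self.reref_interval_bytes <= 0:
            raise ValueError("sizes must be positive")


# Sample values only: a used device, 1:1 R:W, 4KB random IO.
enterprise = SsdWorkload(True, 0.5, 4096, 1.0, 146 * 10**9)
print(enterprise)
```

Publishing a record like this next to any performance number would at least tell the reader which of the questions above were answered, and how.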

New device or pre-conditioned

Hard drives show little performance difference whether new or pre-owned, defect skips notwithstanding. In contrast, SSDs/SSSs can perform very differently when they are new versus when they have been used for a short period, depending on their internal architecture. A new SSD can write without erasure throughout its entire memory address space but sooner or later wear leveling must kick in to equalize the use of the device’s NAND memory blocks. Wear leveling causes both reads and rewrites of data during its processing. Such activity takes bandwidth and controller processing away from normal IO. If you have a new device it may take days or weeks of activity (depending on how fast you write) to attain the device’s steady state, where each write causes some sort of wear leveling activity.

R:W Ratio

Historically, hard drives have had slightly slower write seeks than reads, due to the need to be more accurately positioned to write data than to read it. As such, it might take 0.5msec longer to write than to read 4K bytes. But for SSDs the problem is much more acute, e.g. read times can be in microseconds while write times can almost be in milliseconds for some SSDs/SSSs. This is due to the nature of NAND flash: a block has to be erased before it can be programmed (written), and the programming process takes a lot longer than a read.

So the question for measuring SSD performance is what read to write (R:W) ratio to use. Historically an R:W of 2:1 was used to simulate enterprise environments but most devices are starting to see more like 1:1 for enterprise applications due to the caching and buffering provided by controllers and host memory. I can’t speak as well for desktop environments but it wouldn’t surprise me to see 2:1 used to simulate desktop workloads as well.

SSDs operate a lot faster if their workload is 1000:1 than for 1:1 workloads. Most SSD data sheets tout a significant read I/O rate but only for 100% read workloads. This is like a subsystem vendor quoting performance at a 100% read cache hit rate (which some do); such workloads are unusual in the real world of storage.
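To see how much the mix matters, here is a minimal sketch that turns a R:W ratio into an IO rate. It assumes one IO outstanding at a time and made-up device timings of 100 μsec per read and 900 μsec per write; neither number comes from a data sheet.

```python
def mixed_iops(reads, writes, read_us, write_us):
    """IOPS for a reads:writes mix with one IO outstanding at a time."""
    total = reads + writes
    if total <= 0:
        raise ValueError("need at least one read or write")
    # Average service time is the mix-weighted mean of read and write times.
    avg_us = (reads * read_us + writes * write_us) / total
    return 1_000_000 / avg_us


READ_US, WRITE_US = 100.0, 900.0   # assumed timings, for illustration only

for reads, writes in ((1, 0), (1000, 1), (2, 1), (1, 1)):
    iops = mixed_iops(reads, writes, READ_US, WRITE_US)
    print(f"{reads}:{writes} -> {iops:,.0f} IOPS")
```

With these assumed timings the 100% read figure is five times the 1:1 figure, which is the gap a data sheet quoting only read rates hides.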

Blocksize to use

Hard drives are not insensitive to blocksizes, as blocks can potentially span tracks, which requires track-to-track seeks to read or write them. SSDs can also have some adverse interaction with varying blocksizes. This is dependent on the internal SSD architecture and is due to over-optimizing write performance.

With an SSD, you erase a block of NAND and write a page or sector of NAND at a time. As writes take much longer than reads, many SSD vendors add parallelism to improve write throughput. Parallelism writes or programs multiple sectors at the same time. Thus, if your blocksize is an integral multiple of the multi-sector size written, performance is great; if not, performance can suffer.
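The alignment point can be sketched as a small calculation: given a write and the multi-sector size the device programs in parallel, how much of the write lands in whole units and how much is left over. The function and the 16KB unit below are assumptions for illustration; devices do not usually publish this size.

```python
def split_write(offset, length, stripe_bytes):
    """Split a write into (full_stripes, partial_bytes).

    stripe_bytes stands in for the multi-sector size the device programs
    in parallel (an assumed value; it is not normally exposed).
    """
    if offset < 0 or length <= 0 or stripe_bytes <= 0:
        raise ValueError("offset must be >= 0; length and stripe_bytes > 0")
    end = offset + length
    first_full = -(-offset // stripe_bytes) * stripe_bytes   # round offset up
    last_full = (end // stripe_bytes) * stripe_bytes         # round end down
    if last_full <= first_full:
        return 0, length          # the write never covers a whole stripe
    full = (last_full - first_full) // stripe_bytes
    partial = (first_full - offset) + (end - last_full)
    return full, partial


STRIPE = 16 * 1024   # assumed 16KB multi-sector size

print(split_write(0, 32 * 1024, STRIPE))      # aligned, integral multiple
print(split_write(0, 20 * 1024, STRIPE))      # aligned start, ragged end
print(split_write(4 * 1024, 16 * 1024, STRIPE))  # right size, wrong offset
```

Any leftover bytes are where the trouble would come from, since the device presumably has to handle them outside its parallel fast path.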

In all honesty, similar issues exist with hard drive sector sizes. If your blocksize is an integral multiple of the drive sector size then performance is great; if not, too bad. In contrast to SSDs, drive sector size is often configurable at the device level.

Sequential vs. random IO

Hard drives perform sequential IO much better than random IO. For SSDs this is not much of a problem, as once wear leveling kicks in, it’s all random to the NAND flash. So when comparing hard drives to SSDs the level of sequentiality is a critical parameter to control.

Cache hit rate

The block inter-reference interval simply measures how often the same block is re-referenced. This is important for caching devices and systems because it ultimately determines the cache hit rate (reading data directly from cache instead of the device storage). Hard drives have onboard cache of 8 to 32MB today. SSD drives also have a DRAM cache for data buffering and other uses. SSDs typically publicize their cache size so in order to ensure 0 cache hits one needs a block inter-reference interval close to the device’s capacity. Not a problem today with 146GB devices but as they move to 300GB and larger it becomes more of a problem to completely characterize device performance.
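The link between inter-reference interval and hit rate can be shown with a toy simulation. The sketch below assumes a plain LRU cache and a workload that cycles through a fixed number of blocks; real device caches are unlikely to be this simple, and the sizes are arbitrary.

```python
from collections import OrderedDict


def lru_hit_rate(cache_blocks, interval_blocks, passes=4):
    """Hit rate of an LRU cache for a workload cycling through interval_blocks blocks."""
    cache = OrderedDict()
    hits = accesses = 0
    for _ in range(passes):
        for block in range(interval_blocks):
            accesses += 1
            if block in cache:
                hits += 1
                cache.move_to_end(block)        # mark as most recently used
            else:
                cache[block] = True
                if len(cache) > cache_blocks:
                    cache.popitem(last=False)   # evict least recently used
    return hits / accesses


# Interval fits in the cache versus interval one block larger than the cache.
print(lru_hit_rate(cache_blocks=8, interval_blocks=8))
print(lru_hit_rate(cache_blocks=8, interval_blocks=9))
```

Once the interval exceeds the cache size, this toy cache never hits, which is the condition a benchmark needs in order to measure the device rather than its DRAM.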

The future

So how do we get a handle on SSD performance? SNIA and others are working on a specification on how to measure SSD performance that will one day become a standard. When the standard is available we will have benchmarks and service groups that can run these benchmarks to validate SSD vendor performance claims. Until then – caveat emptor.

Of course most end users would claim that device performance is not as important as (sub)system performance which is another matter entirely…

Tag: IO performance

OCZ’s new Octane SATA SSD pushes latency limits below 100μsec