Must be on a materials science binge these days. I read another article this week in Phys.org on “Major leap towards data storage at the molecular level” reporting on a Nature article “Molecular magnetic hysteresis at 60K“, where researchers from University of Manchester, led by Dr David Mills and Dr Nicholas Chilton from the School of Chemistry, have come up with a new material that provides molecular level magnetics at almost liquid nitrogen temperatures.
Previously, molecular magnets only operated at 4 to 14K (kelvin), according to research done over the last 25 years or so, but this new research shows similar effects operating at ~60K, close to liquid nitrogen temperatures. Nitrogen freezes at 63K and boils at ~77K, so it is liquid somewhere between those two temperatures.
What's the new material?
The new material, “hexa-tert-butyldysprosocenium complex—[Dy(Cpttt)2][B(C6F5)4], with Cpttt = {C5H2tBu3-1,2,4} and tBu = C(CH3)3“, dysprosocenium for short, was designed (?) by the researchers at Manchester and was shown to exhibit magnetism at the molecular level at 60K.
The storage effect is hysteresis, a material's ability to remember the last (magnetic/electrical/?) field it was exposed to; magnetic field strength is measured in oersteds.
The researchers claim the new material provides magnetic hysteresis at a sweep level of 22 oersteds. Not sure what “sweep level of 22 oersteds” means but I assume a molecule of the material is magnetized with a field strength of 22 oersteds and retains this magnetic field over time.
Reports of disk's death have been greatly exaggerated
Disk industry researchers have been investigating HAMR, ([laser] heat assisted magnetic recording, see my Disk density hits new record … post) for some time now to increase disk storage density. But to my knowledge HAMR has not come out in any generally available disk device on the market yet. HAMR was supposed to provide the next big increase in disk storage densities.
Maybe they should be looking at CAMMR, or cold assisted magnetic molecular recording (heard it here, 1st).
According to Dr Chilton, using the new material at 60K in a disk device would increase capacity by 100X. Western Digital just announced a 20TB MyBook Duo disk system for desktop storage and backup. With this new material, at 100X current densities, we could have a 2PB MyBook Duo storage system on our desktops.
That should keep my ever increasing video-photo-music library in fine shape and everything else backed up for a little while longer.
(Storage QoW 15-003): Will we see SMR (shingled magnetic recording) disks in GA enterprise storage systems over the next 12 months?
Are there two vendors of SMR?
Yes, both Seagate and HGST have announced and are currently shipping (?) SMR drives: HGST has a 10TB drive and Seagate has had an 8TB drive on the market since last summer.
One other interesting fact is that SMR will be the common format for all future disk head technologies including HAMR, MAMR, & BPMR (see presentation).
What would storage vendors have to do to support SMR drives?
Because of the nature of SMR disks, writes overlap other tracks, so they must be written, at least in part, sequentially (see our original post on Sequential only disks). Another post I did reported on recent work by Garth Gibson at CMU (Shingled Magnetic Recording disks), which showed how multiple bands or zones on an SMR disk could be used, some of which could be written randomly and others only sequentially, but all of which could be read randomly. With such an approach you could have a reasonable file system on an SMR device with a metadata partition (randomly writeable) and a data partition (sequentially writeable).
In order to support SMR devices, changes have been requested to the T10 SCSI & T13 ATA command protocols. Such changes would include (a rough sketch of these semantics follows the list):
SMR devices support a new write cursor for each SMR sequential band.
SMR devices support sequential writes within SMR sequential bands at the write cursor.
SMR band write cursors can be read, queried for status and reset to 0. SMR sequential band LBA writes only occur at the band cursor and, for each LBA written, the SMR device increments the band cursor by one.
SMR devices can report their band map layout.
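To make these semantics concrete, here's a minimal sketch in Python (purely my own illustration, not any vendor's firmware nor the literal T10/T13 command definitions) of a sequential band with a write cursor:

```python
# Illustrative model of the proposed SMR sequential-band semantics:
# writes only land at the band's write cursor, the cursor advances by
# one per LBA written, and it can be read or reset to zero.

class SequentialBand:
    def __init__(self, start_lba, size):
        self.start_lba = start_lba   # first LBA of the band
        self.size = size             # number of LBAs in the band
        self.cursor = 0              # relative position of next writable LBA
        self.data = {}               # LBA -> block contents

    def write(self, lba, block):
        if self.cursor >= self.size:
            raise IOError("band full")
        if lba != self.start_lba + self.cursor:
            # Only a write at the current cursor position is accepted.
            raise IOError("write rejected: not at the band's write cursor")
        self.data[lba] = block
        self.cursor += 1             # device increments the cursor per LBA written

    def read(self, lba):
        # Reads remain random anywhere within the band.
        return self.data.get(lba)

    def reset_cursor(self):
        # Resetting the cursor to 0 effectively erases the band.
        self.cursor = 0
        self.data.clear()

# A device would also report its band map, e.g. (start LBA, size, cursor) per band.
bands = [SequentialBand(0, 1_000_000), SequentialBand(1_000_000, 1_000_000)]
band_map = [(b.start_lba, b.size, b.cursor) for b in bands]
```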
The presentation refers to multiple approaches to SMR support or SMR drive modes:
Restricted SMR devices – where all writes must occur at a band write cursor and any random writes are rejected by the device. But performance would be predictable.
Host Aware SMR devices – where the host using the SMR devices is aware of SMR characteristics and actively manages the device using write cursors and band maps to write the most data to the device. However, the device will accept random writes and will perform them for the host. This will result in sub-optimal and non-predictable drive performance.
Drive managed SMR devices – where the SMR device acts like a randomly accessed disk device but internally maps random writes to sequential writes using virtualization of the drive LBA map, not unlike SSDs do today (see the sketch after this list). These devices would be backward compatible with today's disk devices, but drive performance would be poor and non-predictable.
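For the drive managed case, the internal remapping could look something like this toy sketch (my simplification of the idea, not how any shipping drive actually implements it), with random host writes appended sequentially and an indirection map tracking where each LBA currently lives:

```python
# Toy drive-managed SMR model: random host writes are appended to the
# sequentially written media (much like an SSD flash translation layer)
# and a map translates host LBAs to their current physical location.

class DriveManagedSMR:
    def __init__(self):
        self.media = []        # physically sequential "shingled" space
        self.lba_map = {}      # host LBA -> index in self.media

    def write(self, host_lba, block):
        # Host sees a random-write disk; the drive writes sequentially.
        self.lba_map[host_lba] = len(self.media)
        self.media.append(block)   # stale copies become garbage to collect later

    def read(self, host_lba):
        idx = self.lba_map.get(host_lba)
        return None if idx is None else self.media[idx]

drive = DriveManagedSMR()
drive.write(42, b"hello")
drive.write(7, b"world")
drive.write(42, b"hello again")    # an overwrite just appends and remaps
assert drive.read(42) == b"hello again"
```

The catch, as with SSDs, is that the stale copies left behind eventually have to be garbage collected, which is exactly why drive managed performance would be non-predictable.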
It's unclear which of these drive modes are currently shipping, but I believe Restricted SMR device mode is already available, and drive manufacturers would be working on Host Aware and Drive managed modes to help adoption.
So, assuming Restricted SMR device mode availability and prototypes of the T10/T13 changes, there are significant but known changes needed for enterprise storage systems to support SMR devices.
Nevertheless, a number of hybrid storage systems already implement Log Structured File (LSF) systems on their backends, which mostly write sequentially to backend devices, so moving to SMR restricted device mode would be easier for these systems.
It's unclear how many storage systems have such a backend, but NetApp uses one for WAFL and just about every other hybrid startup has an LSF format for their backend layout. So, being conservative, let's say 50% of enterprise hybrid storage vendors use LSF.
The other 50% would have more of a problem implementing SMR restricted mode devices, but it's only a matter of time before all will need to go that way, assuming they still use disks. So, we are primarily talking about hybrid storage systems.
All major storage vendors support hybrid storage and about 60% of startups support hybrid storage, so adding these together, maybe about 75% of enterprise storage vendors have hybrid offerings.
Using the analysis from QoW 15-001, about 60% of enterprise storage vendors will probably ship new hardware versions of their systems over the next 12 months. So, of the 13 likely new hardware systems over the next 12 months, 75% have hybrid solutions and 50% of those have LSF, meaning ~4.9 new hardware systems released over the next 12 months will be hybrid with LSF backends already.
What are the advantages of SMR?
SMR devices will have higher storage densities and lower cost. Today disk drives are running 6-8TB and the SMR devices run 8-10TB so a 25-30% step up in storage capacity is possible with SMR devices.
New drive support has in the past been relatively easy because command sets/formats haven't changed much over the past 7 years or so, but SMR is different and will take more effort to support. The fact that all new drives will be SMR over time gives more emphasis to getting on the bandwagon as soon as feasible. So, I would give a storage vendor an 80% likelihood of implementing SMR, assuming they have new systems coming out, are already hybrid and are already using LSF.
So, taking the ~4.9 systems that are LSF/hybrid and being released, times 0.8, says ~3.9 systems will introduce SMR devices over the next 12 months.
For non-LSF hybrid systems, the effort seems much harder, so I would give the likelihood of implementing SMR about a 40% chance. So, of the ~8.1 systems left that will be introduced in the next year, 75% are hybrid, or ~6.1 systems, and with a 40% likelihood of implementing SMR, ~2.4 of these non-LSF systems will probably introduce SMR devices.
There's one other category that we need to consider and that would be startups in stealth. These could have been designing their hybrid storage for SMR from the get go. In the QoW 15-001 analysis I assumed another ~1.8 startup vendors would emerge to GA over the next 12 months. And if we assume that 75% of these are hybrid, then there are ~1.4 startup vendors that could be using SMR technology in their hybrid storage, for a total of 3.9 + 2.4 + 1.4 (=1.8*0.75) = ~7.7 systems that have a high probability of SMR implementation over the next 12 months in GA enterprise storage products.
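Pulling those back-of-the-envelope numbers together (just my own rough estimates restated as a quick calculation):

```python
# Rough recomputation of the forecast arithmetic above.
new_systems = 13            # likely new hardware systems in the next 12 months (QoW 15-001)
hybrid_frac = 0.75          # fraction with hybrid solutions
lsf_frac    = 0.50          # fraction of hybrids already using an LSF backend

lsf_hybrid = new_systems * hybrid_frac * lsf_frac          # ~4.9 LSF/hybrid systems
non_lsf    = (new_systems - lsf_hybrid) * hybrid_frac      # ~6.1 hybrid, non-LSF systems
stealth    = 1.8 * 0.75                                    # ~1.4 hybrid stealth startups

smr_total = lsf_hybrid * 0.80 + non_lsf * 0.40 + stealth
print(round(lsf_hybrid * 0.80, 1), round(non_lsf * 0.40, 1),
      round(stealth, 1), round(smr_total, 1))
# -> roughly 3.9, 2.4, 1.4 and ~7.7 systems in total
```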
Forecast
So my forecast of SMR adoption by enterprise storage is Yes, with a 0.85 probability (unclear exactly what the probability should be, but it's highly probable).
Recall that shingled magnetic recording uses a write head that overwrites multiple tracks at a time (see graphic above), with one track being properly written and the adjacent (inward) tracks being overwritten. As the head moves to the next track, that track can be properly written but more adjacent (inward) tracks are overwritten, etc. In this fashion data can be written sequentially, on overlapping write passes. In contrast, read heads can be much narrower and are able to read a single track.
In my post, I assumed that this would mean that the new shingled magnetic recording disks would need to be accessed sequentially, not unlike tape. Such a change would need a massive rewrite of host storage software to only write data sequentially. I had suggested this could potentially work if one were to add some SSD or other NVRAM to the device to help manage the mapping of the data to the disk. Possibly that, plus a very sophisticated drive controller, not unlike SSD wear leveling today, could handle mapping a physically sequentially accessed disk to a virtually randomly accessed storage protocol.
Garth’s approach to the SMR dilemma
Garth and his team of researchers are taking another tack at the problem. In his view there are multiple groups of tracks on an SMR disk (zones or bands). Each band can be written either sequentially or randomly, but all bands can be read randomly. One can break up the disk into sections of multiple shingled bands that are written sequentially, plus fewer non-shingled bands that can be written randomly. Of course, there would be a gap between the shingled bands in order not to overwrite adjacent bands. And there would also be gaps between the randomly written tracks in a non-shingled partition to allow for the wider track writing that occurs with the SMR write head.
His pitch at the conference dealt with some characteristics of such a multi-band disk device, such as:
How to determine the density for a device that has multiple bands of both shingled write data and randomly written data.
How big or small a shingled band should be in order to support “normal” small block and randomly accessed file IO.
How many randomly written tracks or what the capacity of the non-shingled bands would need to be to support “normal” file IO activity.
For maximum areal density one would want large shingled bands. There are other interesting considerations that were not as obvious, but I won't go into them here.
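To make that tradeoff concrete, here's a toy calculation (the track counts are invented purely for illustration) of how band size amortizes the guard gap between shingled bands:

```python
# Toy model: effective track utilization of an SMR layout where each
# shingled band of N tracks is followed by a guard gap of G tracks.
# Track counts here are made up purely for illustration.

def utilization(band_tracks, gap_tracks):
    return band_tracks / (band_tracks + gap_tracks)

for band in (10, 100, 1000):
    pct = 100 * utilization(band, gap_tracks=5)
    print(f"{band} tracks/band -> {pct:.1f}% of tracks usable")

# Larger bands amortize the guard gap better (higher areal density),
# but a whole band must be rewritten to update data inside it.
```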
SCSI protocol changes for SMR disks
The other, more interesting section of Garth's talk was on recently proposed T10 and T13 changes to support SMR disks with shingled and non-shingled partitions, and on what would need to be done to support such devices.
The SCSI protocol changes being considered to support SMR devices include:
A new write cursor for shingled write bands that indicates the next LBA to be written. The write cursor starts out at a relative band address of 0 and as each LBA is written consecutively in the band it’s incremented by one.
A write cursor can be reset (to zero), indicating that the band has been erased.
Each drive maintains the band map and current cursor position within each band and this can be requested by SCSI drivers to understand the configuration of the drive.
Probably other changes are required as well but these seem sufficient to flesh out the problem.
SMR device software support
Garth and his team implemented an SMR device, emulated in software using real random accessed devices. They then implemented an SMR device driver that used the proposed standards changes and finally, implemented a ShingledFS file system to use this emulated SMR disk to see how it would work. (See their report on Shingled Magnetic Recording for Big Data Applications for more information.)
The CMU team implemented a log structured file system for the ShingledFS that only wrote data to the emulated SMR disk shingled partition sequentially, except for mapping and meta-data information which was written and updated randomly in a non-shingled partition.
You may recall that a log structured file system is essentially written as a sequential stream of data (not unlike a log). But there is additional mapping required that indicates where file data is located in the log which allows for randomly accessing the file data.
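Here's a minimal sketch of the idea (my own toy example, not CMU's ShingledFS): data is only ever appended to the log, which suits a shingled band, while a small map, which could live in a randomly writable non-shingled band, records where the current copy of each file block sits, so reads can remain random:

```python
# Toy log-structured store: appends all data sequentially (a good fit for
# an SMR shingled band) and keeps a map for random reads. Updating a block
# just appends a new copy and re-points the map; old copies become garbage.

class LogStructuredStore:
    def __init__(self):
        self.log = []          # sequentially written data blocks
        self.index = {}        # (file, block_no) -> position in the log

    def write(self, file, block_no, data):
        self.index[(file, block_no)] = len(self.log)
        self.log.append(data)

    def read(self, file, block_no):
        pos = self.index.get((file, block_no))
        return None if pos is None else self.log[pos]

store = LogStructuredStore()
store.write("movie.mp4", 0, b"frame-0")
store.write("movie.mp4", 0, b"frame-0-v2")    # update appends, never overwrites
assert store.read("movie.mp4", 0) == b"frame-0-v2"
```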
In their report and at the conference, Garth presented some benchmark results for a big data application called Terasort (essentially Teragen, Terasort and Teravalidate), which seems to use Hadoop to sort a large body of data. I'm not sure I can replicate this information here, but suffice it to say that at the moment the emulated SMR device with ShingledFS did not beat a base EXT3 or FUSE setup on the same hardware for these applications.
Now, the CMU project was done by a bunch of smart researchers, but it's still relatively new and not necessarily that optimized. Thus, there's probably some room for improvement in the ShingledFS and maybe even in the emulated SMR device and/or the SMR device driver.
At the moment, Garth and his team seem to believe that SMR devices are certainly feasible and would take only modest changes to the SCSI protocols to support. As for file system support, there is plenty of history surrounding log structured file systems, so these are certainly doable, but supporting an SMR device would probably require extensive development in the various OSs. The device driver changes don't seem to be as significant.
~~~~
It certainly looks like there are going to be SMR devices in our future. It's just a question of whether they will ever be as widely supported as the randomly accessed disk devices we know and love today. Possibly, this could all sit behind a storage subsystem that makes the technology available as networked storage capacity, and over time maybe SMR devices could be supported in more standard OS device drivers and file systems. Nevertheless, to keep capacity and areal density on their current growth trajectory, SMR disks are coming; it's just a matter of time.
(SCISFS111221-001) (c) 2011 Silverton Consulting, All Rights Reserved
[We are still catching up on our charts for the past quarter but this one brings us up to date through last month]
There's just something about a million SPECsfs2008(r) NFS throughput operations per second that kind of excites me (weird, I know). Yes, it takes over 44 nodes of Avere FXT 3500 with over 6TB of DRAM cache, 140 nodes of EMC Isilon S200 with almost 7TB of DRAM cache and 25TB of SSDs, or at least 16 nodes of NetApp FAS6240 in Data ONTAP 8.1 cluster mode with 8TB of FlashCache to get to that level.
Nevertheless, a million NFS throughput operations is something worth celebrating. It’s not often one achieves a 2X improvement in performance over a previous record. Something significant has changed here.
The age of scale-out
We have reached a point where scaling systems out can provide linear performance improvements, at least up to a point. For example, the EMC Isilon and NetApp FAS6240 had close to linear speed ups in performance as they added nodes, indicating (to me at least) there may be more there if they just throw more storage nodes at the problem. Although maybe they saw some drop off and didn't wish to show the world, or potentially the costs became prohibitive and they had to stop someplace. On the other hand, Avere only benchmarked a 44-node system with their current hardware (FXT 3500); they must have figured winning the crown was enough.
However, I would like to point out that throwing just any hardware at these systems doesn't necessarily increase performance. Previously (see my CIFS vs. NFS corrected post), we had shown the linear regression for NFS throughput against spindle count, and although the regression coefficient was good (~R**2 of 0.82), it wasn't perfect. And of course we eliminated any SSDs from that prior analysis. (We should probably consider eliminating any system with more than a TB of DRAM as well – but that analysis was done before the 44-node Avere result was out.)
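For anyone who wants to try this at home, the regression itself is nothing fancy; a sketch like the following (with placeholder numbers standing in for the actual SPECsfs2008 submissions) is enough to produce a slope and an R**2 for throughput against spindle count:

```python
# Ordinary least squares of NFS throughput (ops/sec) against disk spindle
# count, plus R**2. The (spindles, ops/sec) pairs below are placeholders;
# the real analysis used the published SPECsfs2008 submissions.

def linear_regression(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

spindles = [100, 200, 400, 800, 1600]                  # placeholder data
ops      = [25_000, 48_000, 105_000, 190_000, 420_000]
slope, intercept, r2 = linear_regression(spindles, ops)
print(f"ops/sec ~= {slope:.0f} * spindles + {intercept:.0f}, R**2 = {r2:.2f}")
```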
Speaking of disk drives, the FAS6240 system nodes had 72-450GB 15Krpm disks, the Isilon nodes had 24-300GB 10Krpm disks and each Avere node had 15-600GB 7.2Krpm SAS disks. However the Avere system also had a 4-Solaris ZFS file storage systems behind it each of which had another 22-3TB (7.2Krpm, I think) disks. Given all that, the 16-node NetApp system, 140-node Isilon and the 44-node Avere systems had a total of 1152, 3360 and 748 disk drives respectively. Of course, this doesn’t count the system disks for the Isilon and Avere systems nor any of the SSDs or FlashCache in the various configurations.
I would say that with this round of SPECsfs2008 benchmarks, scale-out NAS systems have come into their own. It's too bad that neither NetApp nor Avere released comparable CIFS benchmark results, which would have helped in my perennial discussion on CIFS vs. NFS.
But there’s always next time.
~~~~
The full SPECsfs2008 performance report went out to our newsletter subscribers last December. A copy of the full report will be up on the dispatches page of our site sometime later this month (if all goes well). However, you can see our full SPECsfs2008 performance analysis now and subscribe to our free monthly newsletter to receive future reports directly by just sending us an email or using the signup form above right.
For a more extensive discussion of file and NAS storage performance covering top 30 SPECsfs2008 results and NAS storage system features and functionality, please consider purchasing our NAS Buying Guide available from SCI’s website.
As always, we welcome any suggestions on how to improve our analysis of SPECsfs2008 results or any of our other storage system performance discussions.
SCISPC110822-002 (c) 2011 Silverton Consulting, All Rights Reserved
There really weren't that many new submissions for the Storage Performance Council SPC-1 or SPC-2 benchmarks this past quarter (just the new Fujitsu DX80S2 SPC-2 run), so we thought it time to roll out a new chart.
The chart above shows a scatter plot of the number of disk drives in a submission vs. the MB/sec attained for the Large Database Query (LDQ) component of an SPC-2 benchmark.
As anyone who follows this blog and our twitter feed knows, we continue to have an ongoing, long running discussion on how I/O benchmarks such as this are mostly just a measure of how much hardware (disks and controllers) is thrown at them. We added a linear regression line to the above chart to evaluate the validity of that claim and, as clearly shown above, disk drive count is NOT highly correlated with SPC-2 performance.
We necessarily exclude from this analysis any system results that used NAND based caching or SSD devices so as to focus specifically on disk drive count relevance. There are not a lot of these in the SPC-2 results, but there are enough that including them would make the correlation look even worse.
We chose to display only the LDQ segment of the SPC-2 benchmark because it has the best correlation between workload and disk count, with the highest R**2 at 0.41. The other components of the SPC-2 benchmark, video on demand (VOD) and large file processing (LFP), as well as the aggregate MBPS, all had R**2's of less than 0.36.
For instance, just look at the vertical strip centered around 775 disk drives. There are two systems that show up here, one doing ~6,000 MBPS and the other doing ~11,500 MBPS – quite a difference. The fact that these are two different storage architectures from the same vendor is even more informative.
Why is the overall correlation so poor?
One can only speculate, but there must be something about system sophistication at work in SPC-2 results. It's probably tied to better caching, better data layout on disk and better IO latency, but that's only an educated guess. For example:
Most of the SPC-2 workload is sequential in nature. How a storage system detects sequentiality in a seemingly random IO mix is an art form and what a system does armed with that knowledge is probably more of a science.
In the old days of big, expensive CKD DASD, sequential data was all laid out consecutively (barring lacing) around a track and up a cylinder. In these days of zoned FBA disks, one can only hope that sequential data resides in laced sectors along consecutive tracks on the media, minimizing any head seek activity. Another approach, popular this last decade, has been to throw more disks at the problem, resulting in many more seeking heads to handle the workload, no matter where the data lies.
IO latency is another factor. We have discussed this before (see our Storage throughput vs. IO response time and why it matters post). One key to system throughput is how quickly data gets out of cache and into the hands of servers. The other part of this, of course, is how fast the storage system gets data from disk into cache in the first place.
Systems that do these better will perform better on SPC-2 like benchmarks that focus on raw sequential throughput.
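As an example of what sequential detection involves, a stream detector can be as simple as the sketch below (purely illustrative; real array firmware is far more elaborate), watching whether each new IO starts where a recent one left off:

```python
# Minimal sequential-stream detector: if an IO starts where a recently
# seen IO ended, count it toward an active stream; after a few hits the
# system could start prefetching ahead of that stream.

class SequentialDetector:
    def __init__(self, promote_after=3):
        self.streams = {}                  # next expected LBA -> run length so far
        self.promote_after = promote_after

    def observe(self, lba, length):
        run = self.streams.pop(lba, 0) + 1
        self.streams[lba + length] = run   # where the next sequential IO would start
        return run >= self.promote_after   # True -> worth prefetching

det = SequentialDetector()
for start in (1000, 5000, 1008, 1016, 1024):   # a mix of random and sequential IOs
    if det.observe(start, 8):
        print("sequential stream detected at LBA", start)
```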
Comments?
—–
The full SPC performance report went out to our newsletter subscribers last month. A copy of the full report will be up on the dispatches page of our website later next month. However, you can get this information now and subscribe to future newsletters to receive these reports even earlier by just sending us an email or using the signup form above right.
As always, we welcome any suggestions on how to improve our analysis of SPC results or any of our other storage system performance discussions.
A head assembly on a Seagate disk drive by Robert Scoble (cc) (from flickr)
Last week, Hitachi Global Storage Technologies (acquired by Western Digital, with the deal closing in 4Q2011) and Seagate announced some higher capacity disk drives for desktop applications.
Most of us in the industry have become somewhat jaded with respect to new capacity offerings. But last week's announcements may give one pause.
Hitachi announced that they are shipping 3.5″ platters holding over 1TB per platter, using 569Gb/sqin technology. In the past, shipping full-height 3.5″ disk drives have held 4-6 platters. Given the platter capacity available now, 4-6TB drives are certainly feasible or just around the corner. Both Seagate and Samsung beat HGST to 1TB platter capacities, which they announced in May of this year and began shipping in drives in June.
Speaking of 4TB drives, Seagate announced a new 4TB desktop external disk drive. I couldn't locate any information about the number of platters or the Gb/sqin of their technology, but 4 platters are certainly feasible and, as a result, a 4TB disk drive is available today.
I don't know about you, but a 4TB disk drive for a desktop seems about as much as I could ever use. But looking seriously at my desktop environment, my CAGR for storage (measured as fully compressed TAR files) is ~61% year over year. At that rate, I will need a 4TB drive for backup purposes in about 7 years, and if I assume a 2X compression rate, then a 4TB desktop drive will be needed in ~3.5 years (darn music, movies, photos, …). And we are not heavy digital media consumers; others who shoot and edit their own video probably use orders of magnitude more storage.
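For what it's worth, that growth rate compounds quickly; here's the doubling time it implies:

```python
import math

cagr = 0.61    # year-over-year storage growth rate from above
doubling_time = math.log(2) / math.log(1 + cagr)
print(f"At {cagr:.0%} CAGR, storage needs double roughly every {doubling_time:.1f} years")
# -> roughly every 1.5 years
```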
Hard to believe, but given current trends it seems inevitable: a 4TB disk drive will become a necessity for us within the next 4 years.
SCISPC110527-004 (c) 2011 Silverton Consulting, Inc., All Rights Reserved
The above chart is from our May Storage Intelligence newsletter dispatch on system performance and shows the latest Storage Performance Council SPC-1 benchmark results in a scatter plot, with IO/sec [or IOPS(tm)] on the vertical axis and number of disk drives on the horizontal axis. We have tried to remove all results that used NAND flash as a cache or SSDs. Also, this displays only results below $100/GB.
One negative view of benchmarks such as SPC-1 is that published results are almost entirely due to the hardware thrown at them or, in this case, the number of disk drives (or SSDs) in the system configuration. An R**2 of 0.93 shows a pretty good correlation of IOPS performance against disk drive count and would seem to bear this view out, but that is an incorrect interpretation of the results.
Just look at the wide variation beyond the 500 disk drive count versus below that, where there are only a few outliers with a much narrower variance. As such, we would have to say that below ~500 drives most storage systems seem to attain a reasonable rate of IOPS as a function of the number of spindles present, but beyond that point the relationship starts to break down. There are certainly storage systems at the over-500-drive level that perform much better than average for their drive configuration and some that perform worse.
For example, consider the triangle formed by the three best performing (IOPS) results on this chart. The one at 300K IOPS with ~1150 disk drives is from Huawei Symantec and is their 8-node Oceanspace S8100 storage system whereas the other system with similar IOPS performance at ~315K IOPS used ~2050 disk drives and is a 4-node, IBM SVC (5.1) system with DS8700 backend storage. In contrast, the highest performer on this chart at ~380K IOPS, also had ~2050 disk drives and is a 6-node IBM SVC (5.1) with DS8700 backend storage.
Given the above analysis there seems to be much more to system performance than merely disk drive count, at least at the over 500 disk count level.
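One way to see this is to compute IOPS per spindle for the three systems just discussed (numbers are approximate, read off the chart):

```python
# Rough IOPS-per-drive for the three top results discussed above
# (IOPS and drive counts are approximate, taken from the chart).
systems = [
    ("Huawei Symantec Oceanspace S8100, 8-node",  300_000, 1150),
    ("IBM SVC 5.1, 4-node, DS8700 backend",       315_000, 2050),
    ("IBM SVC 5.1, 6-node, DS8700 backend",       380_000, 2050),
]
for name, iops, drives in systems:
    print(f"{name}: ~{iops / drives:.0f} IOPS/drive")

# Same drive counts, very different IOPS per drive -- spindle count alone
# doesn't determine performance at this scale.
```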
—-
The full performance dispatch will be up on our website after the middle of next month, but if you are interested in viewing this today, please sign up for our free monthly newsletter (see subscription request, above right) or subscribe by email and we'll send you the current issue. If you need more analysis of SAN storage performance, please consider purchasing SCI's SAN Storage Briefing.
As always, we welcome all constructive suggestions on how to improve any of our storage performance analyses.
Micron just announced a new SSD drive based on their 34nm SLC NAND technology with some pretty impressive performance numbers. They used an independent organization, Calypso SSD testing, to supply the performance numbers:
Random Read 44,000 IO/sec
Random Writes 16,000 IO/sec
Sequential Read 360MB/sec
Sequential Write 255MB/sec
Even more impressive considering this performance was generated using SATA 6Gb/s and measured after reaching “SNIA test specification – steady state” (see my post on SNIA's new SSD performance test specification).
The new SATA 6Gb/s interface is a bit of a gamble, but one can always use an interposer to support FC or SAS interfaces. In addition, today many storage subsystems already support SATA drives, so its interface may not even be an issue. The P300 can easily run at 3Gb/s SATA if that's what's available; sequential performance suffers, but random IOPS won't be much impacted by interface speed.
The advantage of SATA 6Gb/s is that it's a simple interface and costs less to implement than SAS or FC. The downside is the loss of performance until 6Gb/s SATA takes over enterprise storage.
P300’s SSD longevity
I have done many posts discussing SSDs and their longevity or write endurance, but this is the first time I have heard any vendor describe drive longevity using “total bytes written” to a drive. Presumably this is a new SSD write endurance standard coming out of JEDEC, but I was unable to find any reference to the standard's definition.
In any case, the P300 comes in 50GB, 100GB and 200GB capacities, and the 200GB drive has a “total bytes written” capability of 3.5PB, with the smaller versions having proportionally lower longevity specs. For the 200GB drive, that's almost 5 years of 10 complete full drive writes a day, every day of the year. This seems enough from my perspective to put any SSD longevity considerations to rest. Although at 255MB/sec sequential writes, the P300 could actually sustain ~10X that write rate per day – assuming you never read any data back?
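The endurance arithmetic is easy to check; a rough recomputation of the numbers above:

```python
# Rough check of the P300 endurance math quoted above.
capacity_tb    = 0.2      # 200GB drive
tbw_pb         = 3.5      # rated "total bytes written", in PB
writes_per_day = 10       # full drive writes per day

years = tbw_pb * 1000 / (capacity_tb * writes_per_day * 365)
print(f"~{years:.1f} years at {writes_per_day} full drive writes/day")    # ~4.8 years

# At 255MB/sec of sustained sequential writes the drive could absorb far more:
tb_per_day = 255e6 * 86400 / 1e12                                         # ~22 TB/day
print(f"~{tb_per_day / capacity_tb:.0f} full drive writes/day possible at 255MB/sec")
```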
I am sure over provisioning, wear leveling and other techniques were used to attain this longevity. Nonetheless, whatever they did, the SSD market could use more of it. At this level of SSD longevity the P300 could almost be used in a backup dedupe appliance, if there was need for the performance.
You may recall that Micron and Intel have a joint venture to produce NAND chips. But the joint venture doesn’t include applications of their NAND technology. This is why Intel has their own SSD products and why Micron has started to introduce their own products as well.
—–
So which would you rather see for an SSD longevity specification:
Drive MTBF,
Total bytes written to the drive,
Total number of Program/Erase cycles, or
Total drive lifetime, based on some (undefined) predicted write rate per day?
Personally I like total bytes written because it defines the drive reliability in terms everyone can readily understand but what do you think?