Living forever – the end of evolution part-3

Posted on April 15, 2022April 15, 2022 by Ray in Data consistency, Data integrity, Data precision, data protection, Data retention, Storage longevity, Storage reliability, Strategic Inflection Points

Read an article yesterday on researchers who had been studying various mammals and trying to determine the number of DNA mutations they accumulate at about the time they die. The researchers found that after about 800 mutations for mole rats, they die, see Nature article Somatic mutation rates scale with lifespan across mammals and Telegraph article reporting on the research, Mystery of why humans die around 80 may finally be solved.

Similarly, at around 3500 mutations humans die, at around 3000 mutations dogs die and at around 1500 mutations mice die. But the real interesting thing is that the DNA mutation rates and mammal lifespan are highly (negatively) correlated. That is higher mutation rates lead to mammals with shorter life spans.

C. Linear regression of somatic substitution burden (corrected for analysable genome size) on individual age for dog, human, mouse and naked mole-rat samples. Samples from the same individual are shown in the same colour. Regression was performed using mean mutation burdens per individual. Shaded areas indicate 95% confidence intervals of the regression line. A shows microscopic images of sample mammalian cels and the DNA strands examined and B shows the distribution of different types of DNA mutations (substitutions or indels [insertion/deletions of DNA]).

The Telegraph article seems to imply that at 800 mutations all mammals die. But the Nature Article clearly indicates that death is at different mutation counts for each different type of mammal.

Such research show one way on how to live forever. We have talked about similar topics in the distant past see …-the end of evolution part 1 & part 2

But in any case it turns out that one of the leading factors that explains the average age of a mammal at death is its DNA mutation rate. Again, mammals with lower DNA mutation rates live longer on average and mammals with higher DNA mutation rates live shorter lives on average.

Moral of the story

if you want to live longer reduce your DNA mutation rates.

c, Zero-intercept LME regression of somatic mutation rate on inverse lifespan (1/lifespan), presented on the scale of untransformed lifespan (x axis). For simplicity, the y axis shows mean mutation rates per species, although rates per crypt were used in the regression. The darker shaded area indicates 95% CI of the regression line, and the lighter shaded area marks a twofold deviation from the line. Point estimate and 95% CI of the regression slope (k), FVE and range of end-of-lifespan burden are indicated.

All astronauts are subject to significant forms of cosmic radiation which can’t help but accelerate DNA mutations. So one would have to say that the risk of being an astronaut is that you will die younger.

Moon and Martian colonists will also have the same problem. People traveling, living and working there will have an increased risk of dying young. And of course anyone that works around radiation has the same risk.

Note, the mutation counts/mutation rates, that seem to govern life span are averages. Some individuals have lower mutation rates than their species and some (no doubt) have higher rates. These should have shorter and longer lives on average, respectively.

Given this variability in DNA mutation rates, I would propose that space agencies use as one selection criteria, the astronauts/colonists DNA mutation rate. So that humans which have lower than average DNA mutation rates have a higher priority of being selected to become astronauts/extra-earth colonists. One could using this research and assaying astronauts as they come back to earth for their DNA mutation counts, could theoretically determine the impact to their average life span.

In addition, most life extension research is focused on rejuvenating cellular or organism functionality, mainly through the use of young blood, other select nutrients, stem cells that target specific organs, etc. For example, see MIT Scientists Say They’ve Invented a Treatment That Reverses Hearing Loss which involves taking human cells, transform them into stem cells (at a certain maturity) and injecting them into the ear drum.

Living forever

In prior posts on this topic (see parts 1 &2 linked above) we suggested that with DNA computation and DNA storage (see or listen rather, to our GBoS podcast with CTO of Catalog) now becoming viable, one could potentially come up with a DNA program that could

Store an individuals DNA using some very reliable and long lived coding fashion (inside a cell or external to the cell) and
Craft a DNA program that could periodically be activated (cellular crontab) to access the stored DNA for the individual(in the cell would be easiest) and use this copy to replace/correct any DNA mutation throughout an individuals cells.

And we would need a very reliable and correct copy of that person’s DNA (using SHA256 hashing, CRCs, ECC, Parity and every other way to insure the DNA as captured is stored correctly forever). And the earlier we obtained the DNA copy for an individual human, the better.

Also, we would need a copy of the program (and probably the DNA) to be present in every cell in a human for this to work effectively. .

However, if we could capture a good copy of a person’s DNA early in their life we could, perhaps, sometime later, incorporate DNA code/program into the individual to use this copy and sweep through a person’s body (at that point in time) and correct any mutations that have accumulated to date. Ultimately, one could schedule this activity to occur like an annual checkup.

So yeah, life extension research can continue along the lines they are going and you can have a bunch of point solutions for cellular/organism malfunction OR it can focus on correctly copying and storing DNA forever and creating a DNA program that can correct DNA defects in every individual cell, using the stored DNA.

End of evolution

Yes mammals and that means any human could live forever this way. But it would signify the start of the end of evolution for the human species. That is whenever we captured their DNA copy, from that point on evolution (by mutating DNA) of that individual and any offspring of that individual could no longer take place. And if enough humans do this, throughout their lifespan, it means the end of evolution for humanity as a species

This assumes that evolution (which is natural variation driven by genetic mutation & survival of the fittest) requires DNA variation (essentially mutation) to drive the species forward.

~~~~

So my guess, is either we can live forever and stagnate as a species OR live normal lifespans and evolve as a species into something better over time. I believe nature has made it’s choice.

The surprising thing is that we are at a point in humanities existence where we can conceive of doing away with this natural process – evolution, forever.

Photo Credit(s):

From Nature Article, Somatic mutation rates scale with lifespan across mammals
From Nature Article, Somatic mutation rates scale with lifespan across mammals

DNA storage using nicks

Posted on April 15, 2020April 15, 2020 by Ray in Data density, Storage performance, Storage reliability, System effectiveness

Read an article the other day in Scientific American (“Punch card” DNA …) which was reporting on a Nature Magazine Article (DNA punch cards for storing data… ). The articles discussed a new approach to storing (and encoding) data into DNA sequences.

We have talked about DNA storage over the years (most recently, see our Random access DNA object storage post) so it’s been under study for almost a decade.

In prior research on DNA storage, scientists encoded data directly into the nucleotides used to store genetic information. As you may recall, there are two complementary nucleotides A-T (adenine-thymine) and G-C (guanine-cytosine) that constitute the genetic code in a DNA strand. One could use one of these pairs to encode a 1 bit and the other for a 0 bit and just lay them out along a DNA strand.

The main problem with nucleotide encoding of data in DNA is that it’s slow to write and read and very error prone (storing data in DNA separate nucleotides is a lossy data storage). Researchers have now come up with a better way.

Using DNA nicks to store bits

One could encode information in DNA by utilizing the topology of a DNA strand. Each DNA strand is actually made up of a sugar phosphate back bone with a nucleotide (A, C, G or T) hanging off of it, and then a hydrogen bond to its nucleotide complement (T, G, C or A, respectively) which is attached to another sugar phosphate backbone.

It appears that one can deform the sugar phosphate back bone at certain positions and retain an intact DNA strand. It’s in this deformation that the researchers are encoding bits and they call this a “DNA nick”.

Writing DNA nick data

The researchers have taken a standard DNA strand (E-coli), and identified unique sites on it that they can nick to encode data. They have identified multiple (mostly unique) sites for nick data along this DNA, the scientists call “registers” but we would call sectors or segments. Each DNA sector can contain a certain amount of nick data, say 5 to 10 bits. The selected DNA strand has enough unique sectors to record 80 bits (10 bytes) of data. Not quite a punch card (80 bytes of data) but it’s early yet.

Each register or sector is made up of 450 base (nucleotide) pairs. As DNA has two separate strands connected together, the researchers can increase DNA nick storage density by writing both strands creating a sort of two sided punch card. They use this other or alternate (“anti-sense”) side of the DNA strand nicks for the value “2”. We would have thought they would have used the absent of a nick in this alternate strand as being “3” but they seem to just use it as another way to indicate “0” .

The researchers found an enzyme they could use to nick a specific position on a DNA strand called the PfAgo (Pyrococcus furiosus Argonaute) enzyme. The enzyme can de designed to nick distinct locations and register (sectors) along the DNA strand. They designed 1024 (2**10) versions of this enzyme to create all possible 10 bit data patterns for each sector on the DNA strand.

Writing DNA nick data is done via adding the proper enzyme combinations to a solution with the DNA strand. All sector writes are done in parallel and it takes about 40 minutes.

Also the same PfAgo enzyme sequence is able to write (nick) multiple DNA strands without additional effort. So we can replicate the data as many times as there are DNA strands in the solution, or replicating the DNA nick data for disaster recovery.

Reading DNA nick data

Reading the DNA nick data is a bit more complicated.

In Figure 1 the read process starts by by denaturing (splitting dual strands into single strands dsDNA) and then splitting the single strands (ssDNA) up based on register or sector length which are then sequenced. The specific register (sector) sequences are identified in the sequence data and can then be read/decoded and placed in the 80 bit string. The current read process is destructive of the DNA strand (read once).

There was no information on the read time but my guess is it takes hours to perform. Another (faster) approach uses a “two-dimensional (2D) solid-state nanopore membrane” that can read the nick information directly from a DNA string without dsDNA-ssDNA steps. Also this approach is non-destructive, so the same DNA strand could be read multiple times.

Other storage characteristics of nicked DNA

Given the register nature of the nicked DNA data organization, it appears that data can be read and written randomly, rather than sequentially. So nicked DNA storage is by definition, a random access device.

Although not discussed in the paper, it appears as if the DNA nicked data can be modified. That is the same DNA string could have its data be modified (written multiple times).

The researcher claim that nicked DNA storage is so reliable that there is no need for error correction. I’m skeptical but it does appear to be more reliable than previous generations of DNA storage encoding. However, there is a possibility that during destructive read out we could lose a register or two. Yes one would know that the register bits are lost which is good. But some level of ECC could be used to reconstruct any lost register bits, with some reduction in data density.

The one significant advantage of DNA storage has always been its exceptional data density or bits stored per volume. Nicked storage reduces this volumetric density significantly, i.e, 10 bits per 450 (+ some additional DNA base pairs required for register spacing) base pairs or so nicked DNA storage reduces DNA storage volumetric density by at least a factor of 45X. Current DNA storage is capable of storing 215M GB per gram or 215 PB/gram. Reducing this by let’s say 100X, would still be a significant storage density at ~2PB/gram.

Comments?

Picture Credit(s):

From wikipedia article on Punched card
From wikipedia article on DNA
Fig 1 from the Nature Magazine Article DNA punchcards for storing data on native DNA sequences via enzymatic nicking
Fig 2 from the Nature Magazine Article DNA punchcards for storing data on native DNA sequences via enzymatic nicking
Fig 3 from the Nature Magazine Article DNA punchcards for storing data on native DNA sequences via enzymatic nicking

Surprises in disk reliability from Microsoft’s “free cooled” datacenters

Posted on July 1, 2016 by Ray in Energy efficiency, Storage reliability

HH5 At Usenix ATC’16 last week, there was a “best of the rest” session which repeated selected papers presented at FAST’16 earlier this year. One that caught my interest was discussing disk reliability in free cooled data centers at Microsoft (Environmental conditions and disk reliability in free-cooled datacenters, see pp. 53-66).

The paper discusses disk reliability at 9 different datacenters in Microsoft for over 1M drives over the course of 1.5 to 4 years vs. how datacenters were cooled.
Continue reading “Surprises in disk reliability from Microsoft’s “free cooled” datacenters” →

Has triple parity Raid time come?

Posted on June 8, 2016June 17, 2016 by Ray in data protection, SSD storage, Storage reliability

Back at SFD10 a couple of weeks back now when visiting with Nimble Storage they mentioned that their latest all flash storage array was going to support triple-parity RAID.

And last week at a NetApp-SolidFire analyst event, someone mentioned that the new ONTAP 9 triple parity RAID-TEC™ for larger SSDs. Also heard at the meeting was that a 15.3TB SSD would take on the order of 12 hours to rebuild.

Need for better protection

When Nimble discussed the need for triple parity RAID they mentioned the report from Google I talked about recently (see my Surprises from 4 years of SSD experience at Google post). In that post, the main surprise was the amount of read errors they had seen from the SSDs they deployed throughout their data center.

I think the need for triple-parity RAID and larger (+15TB SSDs) will become more common over time. There’s no reason to think that the SSD vendors will stop at 15TB. And if it takes 12 hours to rebuild a 15TB one, I think it’s probably something like ~30 hours to rebuild a 30TB one, which is just a generation or two away.

A read error on one SSD in a RAID group during an SSD rebuild can be masked by having dual parity. A read error on two SSDs can only be masked by having triple parity RAID.
Continue reading “Has triple parity Raid time come?” →

Oracle (finally) releases StorageTek VSM6

Posted on October 25, 2012 by Ray in Clustered storage, Data, Data density, Disaster Recovery, ESCON, Ethernet, FICON, Networking, R&D measures, Storage reliability, Storage virtualization, System effectiveness, Systems, Tape storage, Uncategorized

[Full disclosure: I helped develop the underlying hardware for VSM 1-3 and also way back, worked on HSC for StorageTek libraries.]

Virtual Storage Manager System 6 (VSM6) is here. Not exactly sure when VSM5 or VSM5E were released but it seems like an awful long time in Internet years. The new VSM6 migrates the platform to Solaris software and hardware while expanding capacity and improving performance.

What’s VSM?

Oracle StorageTek VSM is a virtual tape system for mainframe, System z environments. It provides a multi-tiered storage system which includes both physical disk and (optional) tape storage for long term big data requirements for z OS applications.

VSM6 emulates up to 256 virtual IBM tape transports but actually moves data to and from VSM Virtual Tape Storage Subsystem (VTSS) disk storage and backend real tape transports housed in automated tape libraries. As VSM data ages, it can be migrated out to physical tape such as a StorageTek SL8500 Modular [Tape] Library system that is attached behind the VSM6 VTSS or system controller.

VSM6 offers a number of replication solutions for DR to keep data in multiple sites in synch and to copy data to offsite locations. In addition, real tape channel extension can be used to extend the VSM storage to span onsite and offsite repositories.

One can cluster together up to 256 VSM VTSSs into a tapeplex which is then managed under one pane of glass as a single large data repository using HSC software.

What’s new with VSM6?

The new VSM6 hardware increases volatile cache to 128GB from 32GB (in VSM5). Non-volatile cache goes up as well, now supporting up to ~440MB, up from 256MB in the previous version. Power, cooling and weight all seem to have also gone up (the wrong direction??) vis a vis VSM5.

The new VSM6 removes the ESCON option of previous generations and moves to 8 FICON and 8 GbE Virtual Library Extension (VLE) links. FICON channels are used for both host access (frontend) and real tape drive access (backend). VLE was introduced in VSM5 and offers a ZFS based commodity disk tier behind the VSM VTSS for storing data that requires longer residency on disk. Also, VSM supports a tapeless or disk-only solution for high performance requirements.

System capacity moves from 90TB (gosh that was a while ago) to now support up to 1.2PB of data. I believe much of this comes from supporting the new T10,000C tape cartridge and drive (5TB uncompressed). With the ability of VSM to cluster more VSM systems to the tapeplex, system capacity can now reach over 300PB.

Somewhere along the way VSM started supporting triple redundancy for the VTSS disk storage which provides better availability than RAID6. Not sure why they thought this was important but it does deal with increasing disk failures.

Oracle stated that VSM6 supports up to 1.5GB/Sec of throughput. Presumably this is landing data on disk or transferring the data to backend tape but not both. There doesn’t appear to be any standard benchmarking for these sorts of systems so, will take their word for it.

Why would anyone want one?

Well it turns out plenty of mainframe systems use tape for a number of things such as data backup, HSM, and big data batch applications. Once you get past the sunk costs for tape transports, automation, cartridges and VSMs, VSM storage can be a pretty competitive data storage solution for the mainframe environment.

The fact that most mainframe environments grew up with tape and have long ago invested in transports, automation and new cartridges probably makes VSM6 an even better buy. But tape is also making a comeback in open systems with LTO-5 and now LTO-6 coming out and with Oracle’s 5TB T10000C cartridge and IBM’s 4TB 3592 JC cartridge.

Not to mention Linear Tape File System (LTFS) as a new tape format that provides a file system for tape data which has brought renewed interest in all sorts of tape storage applications.

Competition not standing still

EMC introduced their Disk Library for Mainframe 6000 (DLm6000) product that supports two different backends to deal with the diversity of tape use in the mainframe environment. Moreover, IBM has continuously enhanced their Virtual Tape Server the TS7700 but I would have to say it doesn’t come close to these capacities.

Lately, when I talked with long time StorageTek tape mainframe customers they have all said the same thing. When is VSM6 coming out and when will Oracle get their act in gear and start supporting us again. Hopefully this signals a new emphasis on this market. Although who is losing and who is winning in the mainframe tape market is the subject of much debate, there is no doubt that the lack of any update to VSM has hurt Oracle StorageTek tape business.

Something tells me that Oracle may have fixed this problem. We hope that we start to see some more timely VSM enhancements in the future, for their sake and especially for their customers.

~~~~

Comments?

~~~~

Image credit: Interior of StorageTek tape library at NERSC (2) by Derrick Coetzee

Graphene Flash Memory

Posted on September 6, 2011April 10, 2012 by Ray in Data, Data density, Data integrity, Data retention, SSD storage, Storage, storage economics, Storage energy use, Storage longevity, Storage reliability, Strategic Inflection Points, System effectiveness, Uncategorized

Model of graphene structure by CORE-Materials (cc) (from Flickr)

I have been thinking about writing a post on “Is Flash Dead?” for a while now. Well at least since talking with IBM research a couple of weeks ago on their new memory technologies that they have been working on.

But then this new Technology Review article came out discussing recent research on Graphene Flash Memory.

Problems with NAND Flash

As we have discussed before, NAND flash memory has some serious limitations as it’s shrunk below 11nm or so. For instance, write endurance plummets, memory retention times are reduced and cell-to-cell interactions increase significantly.

These issues are not that much of a problem with today’s flash at 20nm or so. But to continue to follow Moore’s law and drop the price of NAND flash on a $/Gb basis, it will need to shrink below 16nm. At that point or soon thereafter, current NAND flash technology will no longer be viable.

Other non-NAND based non-volatile memories

That’s why IBM and others are working on different types of non-volatile storage such as PCM (phase change memory), MRAM (magnetic RAM) , FeRAM (Ferroelectric RAM) and others. All these have the potential to improve general reliability characteristics beyond where NAND Flash is today and where it will be tomorrow as chip geometries shrink even more.

IBM seems to be betting on MRAM or racetrack memory technology because it has near DRAM performance, extremely low power and can store far more data in the same amount of space. It sort of reminds me of delay line memory where bits were stored on a wire line and read out as they passed across a read/write circuit. Only in the case of racetrack memory, the delay line is etched in a silicon circuit indentation with the read/write head implemented at the bottom of the cleft.

Graphene as the solution

Then along comes Graphene based Flash Memory. Graphene can apparently be used as a substitute for the storage layer in a flash memory cell. According to the report, the graphene stores data using less power and with better stability over time. Both crucial problems with NAND flash memory as it’s shrunk below today’s geometries. The research is being done at UCLA and is supported by Samsung, a significant manufacturer of NAND flash memory today.

Current demonstration chips are much larger than would be useful. However, given graphene’s material characteristics, the researchers believe there should be no problem scaling it down below where NAND Flash would start exhibiting problems. The next iteration of research will be to see if their scaling assumptions can hold when device geometry is shrunk.

The other problem is getting graphene, a new material, into current chip production. Current materials used in chip manufacturing lines are very tightly controlled and building hybrid graphene devices to the same level of manufacturing tolerances and control will take some effort.

So don’t look for Graphene Flash Memory to show up anytime soon. But given that 16nm chip geometries are only a couple of years out and 11nm, a couple of years beyond that, it wouldn’t surprise me to see Graphene based Flash Memory introduced in about 4 years or so. Then again, I am no materials expert, so don’t hold me to this timeline.

—-

Comments?

IBM’s 120PB storage system

Posted on September 2, 2011September 29, 2011 by Ray in Clustered storage, Data, Data integrity, data logistics, data protection, Disk storage, Distributed computing, File Storage, Infiniband, Information economy, Networking, SAS, Storage, Storage architecture, Storage availability, Storage density, Storage energy use, Storage Features, Storage reliability, Strategy, System effectiveness, Systems

Susitna Glacier, Alaska by NASA Goddard Photo and Video (cc) (from Flickr)

Talk about big data, Technology Review reported this week that IBM is building a 120PB storage system for some unnamed customer. Details are sketchy and I cannot seem to find any announcement of this on IBM.com.

Hardware

It appears that the system uses 200K disk drives to support the 120PB of storage. The disk drives are packed in a new wider rack and are water cooled. According to the news report the new wider drive trays hold more drives than current drive trays available on the market.

For instance, HP has a hot pluggable, 100 SFF (small form factor 2.5″) disk enclosure that sits in 3U of standard rack space. 200K SFF disks would take up about 154 full racks, not counting the interconnect switching that would be required. Unclear whether water cooling would increase the density much but I suppose a wider tray with special cooling might get you more drives per floor tile.

There was no mention of interconnect, but today’s drives use either SAS or SATA. SAS interconnects for 200K drives would require many separate SAS busses. With an SAS expander addressing 255 drives or other expanders, one would need at least 4 SAS busses but this would have ~64K drives per bus and would not perform well. Something more like 64-128 drives per bus would have much better performer and each drive would need dual pathing, and if we use 100 drives per SAS string, that’s 2000 SAS drive strings or at least 4000 SAS busses (dual port access to the drives).

The report mentioned GPFS as the underlying software which supports three cluster types today:

Shared storage cluster – where GPFS front end nodes access shared storage across the backend. This is generally SAN storage system(s). But the requirements for high density, it doesn’t seem likely that the 120PB storage system uses SAN storage in the backend.
Networked based cluster – here the GPFS front end nodes talk over a LAN to a cluster of NSD (network storage director?) servers which can have access to all or some of the storage. My guess is this is what will be used in the 120PB storage system
Shared Network based clusters – this looks just like a bunch of NSD servers but provides access across multiple NSD clusters.

Given the above, with ~100 drives per NSD server means another 1U extra per 100 drives or (given HP drive density) 4U per 100 drives for 1000 drives and 10 IO servers per 40U rack, (not counting switching). At this density it takes ~200 racks for 120PB of raw storage and NSD nodes or 2000 NSD nodes.

Unclear how many GPFS front end nodes would be needed on top of this but even if it were 1 GPFS frontend node for every 5 NSD nodes, we are talking another 400 GPFS frontend nodes and at 1U per server, another 10 racks or so (not counting switching).

If my calculations are correct we are talking over 210 racks with switching thrown in to support the storage. According to IBM’s discussion on the Storage challenges for petascale systems, it probably provides ~6TB/sec of data transfer which should be easy with 200K disks but may require even more SAS busses (maybe ~10K vs. the 2K discussed above).

Software

IBM GPFS is used behind the scenes in IBM’s commercial SONAS storage system but has been around as a cluster file system designed for HPC environments for over 15 years or more now.

Given this many disk drives something needs to be done about protecting against drive failure. IBM has been talking about declustered RAID algorithms for their next generation HPC storage system which spreads the parity across more disks and as such, speeds up rebuild time at the cost of reducing effective capacity. There was no mention of effective capacity in the report but this would be a reasonable tradeoff. A 200K drive storage system should have a drive failure every 10 hours, on average (assuming a 2 million hour MTBF). Let’s hope they get drive rebuild time down much below that.

The system is expected to hold around a trillion files. Not sure but even at 1024 bytes of metadata per file, this number of files would chew up ~1PB of metadata storage space.

GPFS provides ILM (information life cycle management, or data placement based on information attributes) using automated policies and supports external storage pools outside the GPFS cluster storage. ILM within the GPFS cluster supports file placement across different tiers of storage.

All the discussion up to now revolved around homogeneous backend storage but it’s quite possible that multiple storage tiers could also be used. For example, a high density but slower storage tier could be combined with a low density but faster storage tier to provide a more cost effective storage system. Although, it’s unclear whether the application (real world modeling) could readily utilize this sort of storage architecture nor whether they would care about system cost.

Nonetheless, presumably an external storage pool would be a useful adjunct to any 120PB storage system for HPC applications.

Can it be done?

Let’s see, 400 GPFS nodes, 2000 NSD nodes, and 200K drives. Seems like the hardware would be readily doable (not sure why they needed watercooling but hopefully they obtained better drive density that way).

Luckily GPFS supports Infiniband which can support 10,000 nodes within a single subnet. Thus an Infiniband interconnect between the GPFS and NSD nodes could easily support a 2400 node cluster.

The only real question is can a GPFS software system handle 2000 NSD nodes and 400 GPFS nodes with trillions of files over 120PB of raw storage.

As a comparison here are some recent examples of scale out NAS systems:

EMC Isilon demonstrated a 140 node storage cluster in a SPECsfs2008 benchmark.
IBM has already announced support for a 14.4PB SONAS storage system using high density drives

It would seem that a 20X multiplier times a current Isilon cluster or even a 10X multiple of a currently supported SONAS system would take some software effort to work together, but seems entirely within reason.

On the other hand, Yahoo supports a 4000-node Hadoop cluster and seems to work just fine. So from a feasability perspective, a 2500 node GPFS-NSD node system seems just a walk in the park for Hadoop.

Of course, IBM Almaden is working on project to support Hadoop over GPFS which might not be optimum for real world modeling but would nonetheless support the node count being talked about here.

——

I wish there was some real technical information on the project out on the web but I could not find any. Much of this is informed conjecture based on current GPFS system and storage hardware capabilities. But hopefully, I haven’t traveled to far astray.

Comments?

Pure Storage surfaces

Posted on August 23, 2011April 10, 2012 by Ray in Block Storage, Clustered storage, Data reduction, FC, SSD storage, Storage, Storage performance, Storage reliability, Strategic Inflection Points, Strategy, Systems

We were talking with Pure Storage last week, another SSD startup which just emerged out of stealth mode today. Somewhat like SolidFire which we discussed a month or so ago, Pure Storage uses only SSDs to provide primary storage. In this case, they are supporting a FC front end, with an all SSDs backend, and implementing internal data deduplication and compression, to try to address the needs of enterprise tier 1 storage.

Pure Storage is in final beta testing with their product and plan to GA sometime around the end of the year.

Pure Storage hardware

Their system is built around MLC SSDs which are available from many vendors but with a strategic investment from Samsung, currently use that vendor’s storage. As we know, MLC has write endurance limitations but Pure Storage was built from the ground up knowing they were going to use this technology and have built their IP to counteract these issues.

The system is available in one or two controller configurations, with an Infiniband interconnect between the controllers, 6Gbps SAS backend, 48GB of DRAM per controller for caching purposes, and NV-RAM for power outages. Each controller has 12-cores supplied by 2-Intel Xeon processor chips.

With the first release they are limiting the controllers to one or two (HA option) but their storage system is capable of clustering together many more, maybe even up to 8-controllers using the Infiniband back end.

Each storage shelf provides 5.5TB of raw storage using 2.5″ 256GB MLC SSDs. It looks like each controller can handle up to 2-storage shelfs with the HA (dual controller option) supporting 4 drive shelfs for up to 22TB of raw storage.

Pure Storage Performance

Although these numbers are not independently verified, the company says a single controller (with 1-storage shelf) they can do 200K sustained 4K random read IOPS, 2GB/sec bandwidth, 140K sustained write IOPS, or 500MB/s of write bandwidth. A dual controller system (with 2-storage shelfs) can achieve 300K random read IOPS, 3GB/sec bandwidth, 180K write IOPS or 1GB/sec of write bandwidth. They also claim that they can do all this IO with an under 1 msec. latency.

One of the things they pride themselves on is consistent performance. They have built their storage such that they can deliver this consistent performance even under load conditions.

Given the amount of SSDs in their system this isn’t screaming performance but is certainly up there with many enterprise class systems sporting over 1000 disks. The random write performance is not bad considering this is MLC. On the other hand the sequential write bandwidth is probably their weakest spec and reflects their use of MLC flash.

Purity software

One key to Pure Storage (and SolidFire for that matter) is their use of inline data compression and deduplication. By using these techniques and basing their system storage on MLC, Pure Storage believes they can close the price gap between disk and SSD storage systems.

The problems with data reduction technologies is that not all environments can benefit from them and they both require lots of CPU power to perform well. Pure Storage believes they have the horsepower (with 12 cores per controller) to support these services and are focusing their sales activities on those (VMware, Oracle, and SQL server) environments which have historically proven to be good candidates for data reduction.

In addition, they perform a lot of optimizations in their backend data layout to prolong the life of MLC storage. Specifically, they use a write chunk size that matches the underlying MLC SSDs page width so as not to waste endurance with partial data writes. Also they migrate old data to new locations occasionally to maintain “data freshness” which can be a problem with MLC storage if the data is not touched often enough. Probably other stuff as well, but essentially they are tuning their backend use to optimize endurance and performance of their SSD storage.

Furthermore, they have created a new RAID 3D scheme which provides an adaptive parity scheme based on the number of available drives that protects against any dual SSD failure. They provide triple parity, dual parity for drive failures and another parity for unrecoverable bit errors within a data payload. In most cases, a failed drive will not induce an immediate rebuild but rather a reconfiguration of data and parity to accommodate the failing drive and rebuild it onto new drives over time.

At the moment, they don’t have snapshots or data replication but they said these capabilities are on their roadmap for future delivery.

—-

In the mean time, all SSD storage systems seem to be coming out of the wood work. We mentioned SolidFire, but WhipTail is another one and I am sure there are plenty more in stealth waiting for the right moment to emerge.

I was at a conference about two months ago where I predicted that all SSD systems would be coming out with little of the engineering development of storage systems of yore. Based on the performance available from a single SSD, one wouldn’t need 100s of SSDs to generate 100K IOPS or more. Pure Storage is doing this level of IO with only 22 MLC SSDs and a high-end, but essentially off-the-shelf controller.

Just imagine what one could do if you threw some custom hardware at it…

Comments?