A college course on identifying BS

Read an article the other day from Recode (These University of Washington professors teaching a course on Calling BS) that seems very timely. The syllabus is online (Calling Bullshit — Syllabus) and it looks like a great start on identifying falsehood wherever it can be found.

In the beginning, what’s BS?

The course syllabus starts out by referencing Brandolini's Bullshit Asymmetry Principle (also known as Brandolini's Law): the amount of energy needed to refute BS is an order of magnitude bigger than the amount needed to produce it.

Then it goes into a rather lengthy definition of BS from Harry Frankfurt's 1986 essay On Bullshit. In summary, Frankfurt starts out reviewing a previous author's discussion of humbug and ends up at the OED. Suffice it to say, Frankfurt's description of BS runs the gamut from deceptive misrepresentation to something just short of lying.

The course syllabus goes on to reference two lengthy discussions of Frankfurt's seminal On Bullshit essay, but both Cohen's response, Deeper into BS, and Eubanks & Schaeffer's A kind word for BS: … are focused more on academic research than on everyday life and news.

How to mathematically test for BS

The course then goes into mathematical tests for BS, ranging from Fermi questions to the GRIM test to Benford's 1938 Law of Anomalous Numbers. These tests are all ways of looking at data and numbers and estimating whether they are bogus or not. Benford's paper notes that the first pages of logarithm tables are always more worn than the rest, because numbers whose leading digit is 1 occur more often than numbers starting with any other digit.
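
To make Benford's observation concrete, here's a minimal Python sketch (my own illustration, not from the course) that compares a data set's leading-digit frequencies against Benford's expected distribution; big deviations are a hint, though not proof, that the numbers were fabricated.

import math
from collections import Counter

def benford_expected(d):
    # Benford's law: P(d) = log10(1 + 1/d) for leading digit d = 1..9
    return math.log10(1 + 1 / d)

def leading_digit_frequencies(values):
    # Tally the leading (non-zero) digit of each value
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}

# Powers of 2 are known to follow Benford's law closely; fabricated numbers usually don't.
observed = leading_digit_frequencies([2 ** n for n in range(1, 200)])
for d in range(1, 10):
    print(f"{d}: observed {observed[d]:.3f}, expected {benford_expected(d):.3f}")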

How rumors propagate

The next section of the course (week 4) talks about the natural ecology of BS.

Here there’s reference to an article by Friggeri, et al, on Rumor Cascades, which discusses the frequency with which patently both true, false and partially true/partially false rumors are “shared” on social media (Facebook).

The study uses Snopes.com, a website that evaluates the veracity of published rumors, to classify rumors as true, false, or mixed. It then examines how those rumors are shared over time on Facebook.

Summarizing the research: both false and true rumors propagate sporadically on Facebook, and even rumors that Snopes.com has flagged as false or mixed continue to propagate. This seems to indicate that sharers either ignore a rumor's truthfulness or are simply unaware of the Snopes.com assessment.

Other topics on calling BS

The course syllabus goes on to cover:

  • causality (correlation is not causation, a common misconception exploited in BS);
  • statistical traps and trickery (used to create BS);
  • data visualization (which can be used to hide BS);
  • big data (GIGO leads to BS);
  • publication bias (most published research presents positive results – where's all the negative-results research?);
  • predatory publishing and scientific misconduct (organizations that work to create BS for others);
  • the ethics of calling BS (the line between criticism and harassment);
  • fake news; and
  • refuting BS.

Fake news

The section on fake news is very interesting. They reference an article in the NYT, The Agency, about how a group in Russia has been wreaking havoc across the internet with fake news and bogus news sites.

But there’s more another article on NYT website, Inside a fake news sausage factory, details how multiple websites started publishing bogus news and then used advertisement revenue to tell them which bogus news generated more ad revenue – apparently there’s money to be made in advertising fake news. (Sigh, probably explains why I can’t seem to get any sponsors for my websites…).

Improving the course

How to improve their course? I'd certainly take a look at what Facebook and others are doing to identify BS/fake news and see whether those efforts are working effectively.

Another area to add might be a historical review of fake rumors, news or information. This is not a new phenomenon. It’s been going on since time began.

In addition, there’s little discussion of the consequences of BS on life, politics, war, etc. The world has been irrevocably changed in the past  on account of false information. Knowing how bad this has been this might lend some urgency to studying how to better identify BS.

There’s a lot of focus on Academia in the course and although this is no doubt needed, most people need to understand whether the news they see every day is fake or not. Focusing more on this would be worthwhile.

~~~~

I admire the University of Washington professors for putting this course together. It's really something that everyone needs to understand nowadays.

They say the lectures will be recorded and published online – good for them. Also, the current syllabus is for a one-credit-hour course, but they would like to expand it to a three- or four-credit-hour course – another great idea.

Comments?

Photo credit(s): The Donation of Constantine; New York World – Remember the Maine, Public Domain; Benjamin Franklin's Bag of Scalps letter; fake-news-rides-sociales by Portal GDA

BlockStack, a Bitcoin secured global name space for distributed storage

At the USENIX ATC conference a couple of weeks ago, a number of researchers presented their BlockStack global name space and storage system, which is built on the blockchain-based Bitcoin network. Their paper was titled "Blockstack: A global naming and storage system secured by blockchain" (see pp. 181-194 in the USENIX ATC'16 proceedings).

Bitcoin blockchain simplified

Blockchain’s like Bitcoin have a number of interesting properties including completely distributed understanding of current state, based on hashing and an always appended to log of transactions.

Blockchain nodes all participate in validating the current block of transactions and some nodes (deemed “miners” in Bitcoin) supply new blocks of transactions for validation.

All blockchain transactions are sent to every node. The blockchain software in each node timestamps the transactions and accumulates them in an ordered, append-only log (the "block"), which is then hashed. Each new block contains the hash of the previous block – the "chain" in blockchain.

The miner's block is then compared against the non-miner nodes' blocks (their hashes are compared), and if they are equal, everyone reaches consensus (agrees) that the transaction block is valid. Then the next miner supplies a new block of transactions, and the process repeats. (See Wikipedia's article for more info.)
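
To make the hash-chain idea concrete, here's a minimal Python sketch of my own (grossly simplified: no mining, proof-of-work, consensus or networking, and the function names and transactions are mine, not Bitcoin's or Blockstack's). Each block records the hash of its predecessor, so tampering with any earlier block breaks every hash that follows.

import hashlib
import json
import time

def hash_block(block):
    # Hash the block's canonical JSON form
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def make_block(transactions, prev_hash):
    # Each block timestamps an ordered batch of transactions and
    # carries the hash of the previous block (the "chain" part).
    return {"timestamp": time.time(), "transactions": transactions, "prev_hash": prev_hash}

def verify_chain(chain):
    # Recompute each block's hash and check it matches the next block's prev_hash
    for prev, curr in zip(chain, chain[1:]):
        if curr["prev_hash"] != hash_block(prev):
            return False
    return True

chain = [make_block(["genesis"], prev_hash="0" * 64)]
chain.append(make_block(["alice->bob 1 BTC"], prev_hash=hash_block(chain[-1])))
chain.append(make_block(["bob->carol 0.5 BTC"], prev_hash=hash_block(chain[-1])))
print(verify_chain(chain))                            # True
chain[1]["transactions"][0] = "alice->bob 100 BTC"    # tamper with history
print(verify_chain(chain))                            # False: hashes no longer line up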

All blockchain transactions are owned by a cryptographic address. Each cryptographic address has a public and private key associated with it.

Surprises from 4 years of SSD experience at Google

Flash field experience at Google 

In a FAST'16 article I recently read (Flash reliability in production: the expected and unexpected, see p. 67), researchers at Google reported on field experience with flash drives in their data centers, totaling many millions of drive days covering MLC, eMLC and SLC drives with a minimum of 4 years of production use (3 years for eMLC). In some cases, they had 2 generations of the same drive in their field population. SSD reliability in the field is not what I would have expected and was a surprise to Google as well.

The SSDs seem to be used in a number of different application areas, but mainly as drives with a custom-designed PCIe interface (FusionIO drives maybe?). Aside from the technology changes, there were some lithography changes as well: from 50 to 34nm for SLC, from 50 to 43nm for MLC, and from 32 to 25nm for eMLC NAND technology.

SCI's (Storage QoW 15-001) 3D XPoint in next year's storage, forecast=NO with 0.62 probability

So, on to my forecast for the first question of the week (#Storage-QoW 2015-001): Will 3D XPoint be GA'd in enterprise storage systems within 12 months?

I believe the answer will be Yes with a 0.38 probability or conversely, No with a 0.62 probability.

We need to decompose the question to come up with a reasonable answer.

1. How much of an advantage will 3D XPoint provide storage systems?

The claim is 1000X the speed of NAND, 1000X the endurance of NAND, and 10X the density of DRAM. But I believe the relative advantage of the new technology depends mostly on its price. So the question becomes: what will 3D XPoint technology cost ($/GB)?

It's probably going to be way more expensive than NAND on a $/GB basis (@$2.44/64Gb MLC or ~$0.31/GB). But how will it be priced relative to DRAM (@$2.23/4Gb DDR4 or ~$4.46/GB) and (asynch) SRAM (@$7.80/16Mb or ~$3900.00/GB)?

More than likely, it's going to cost more than DRAM because it's non-volatile and almost as fast to access. As for how it relates to SRAM, the pricing gulf between DRAM and asynch SRAM is so huge that pricing it even at 1/10th of SRAM's cost would seriously reduce the market. And I don't think it's going to be too close to DRAM, so call it ~10X the cost of DRAM, or $44.60/GB. [Probably more like a range of prices, with $44.60 at 0.5 probability, $22.30 at 0.25 and $66.90 at 0.1. It's unclear how to incorporate such pricing variability into a forecast.]
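
Here's the $/GB arithmetic as a small Python sketch, using the chip prices quoted above; the 10X-DRAM multiplier for 3D XPoint is just my working assumption from the paragraph above.

# Convert quoted chip prices into $/GB (8 bits per byte; 1024-based units ignored
# for simplicity, since these are rough estimates anyway)
def per_gb(price_dollars, capacity_bits):
    gigabytes = capacity_bits / 8 / 1e9
    return price_dollars / gigabytes

nand_mlc = per_gb(2.44, 64e9)        # 64Gb MLC NAND
dram_ddr4 = per_gb(2.23, 4e9)        # 4Gb DDR4 DRAM
sram_async = per_gb(7.80, 16e6)      # 16Mb async SRAM

# Working assumption from the text: 3D XPoint lands at roughly 10X DRAM
xpoint_est = 10 * dram_ddr4

print(f"NAND MLC : ${nand_mlc:.3f}/GB")        # ~$0.31/GB
print(f"DRAM DDR4: ${dram_ddr4:.2f}/GB")       # ~$4.46/GB
print(f"SRAM     : ${sram_async:,.0f}/GB")     # ~$3,900/GB
print(f"3D XPoint estimate: ${xpoint_est:.2f}/GB")   # ~$44.60/GB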

At $44.60/GB, what could 3D XPoint NVM replace in a storage system: 1) non-volatile cache, 2) DRAM cache, 3) flash cache, 4) PCIe flash storage, or 5) SSD storage in storage control units?

Non-volatile caching today uses either battery-backed DRAM (with or without SSD offload) or SuperCap-backed DRAM with SSD offload. Non-volatile caches can be anywhere from 1/16 to 1/2 of total system cache size. The average enterprise-class storage system has ~412GB of cache, so a non-volatile cache could be anywhere from 26 to 206GB; let's say ~150GB of 3D XPoint, which at ~$45/GB would cost $6.8K in chips alone. Add in $1K of circuitry and it's $7.8K.

  • For battery-backed DRAM – 150GB of DRAM would cost ~$670 in chips, plus an SSD (~300GB) at ~$90, and 2 batteries (an 8hr lithium battery costs $32), so $64. Add charging/discharging circuitry, battery FRU enclosures, and whatever else I'm missing, and maybe all the extras come to another $500, or ~$1.3K total. So at $45/GB, the 3D XPoint non-volatile cache would run ~6.0X the cost of battery-backed DRAM.
  • For SuperCap-backed DRAM – similarly, a SuperCap cache would have the same DRAM and SSD costs ($670 & $90 respectively). SuperCaps in equivalent (Wh) configurations run 20X the price of batteries, so $1.3K. Charging/discharging circuitry and FRU enclosures would be simpler than for batteries, maybe half as much, so add $250 for all the extras, which means a total SuperCap-backed DRAM cost of ~$2.3K. That puts 3D XPoint at 3.4X the cost of SuperCap-backed DRAM.

In these configurations, a 3D XPoint non-volatile memory would replace lots of circuitry (battery or SuperCap charging/discharging and other circuitry) plus the SSD. So 3D XPoint non-volatile cache could drastically simplify hardware logic and also software coding for power outages/failures. Fewer parts and less coding have some intrinsic value beyond pure cost, difficult to quantify, but substantive nonetheless.

As for using 3D XPoint to replace volatile DRAM cache, another advantage is that you wouldn't need a separate non-volatile cache, and systems wouldn't have to copy data between caches. But at $45/GB, the costs would be significant. A 412GB DRAM cache would cost $1.8K in DRAM chips and maybe another $1K in circuitry, so ~$2.8K. Doing the same in 3D XPoint would run $18K in chips and the same $1K in circuitry, so $19K. But we eliminate the non-volatile cache. Factoring that in, the all-3D XPoint cache would run ~$19K vs. $2.8K + $2.3K = $5.1K for a DRAM volatile cache plus a (SuperCap-backed) non-volatile cache, or ~3.7X higher cost.
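
Pulling the part-cost estimates above into one back-of-envelope Python sketch (all figures are the rough ones quoted above, so treat the printed ratios as approximate):

# Rough parts-cost comparison for a ~150GB non-volatile cache and a ~412GB main cache,
# using the per-GB prices and circuitry guesses from the text
DRAM_GB, NAND_GB, XPOINT_GB = 4.46, 0.31, 44.60

def battery_backed_dram(cache_gb=150):
    dram = cache_gb * DRAM_GB                  # ~$670 in DRAM chips
    ssd = 300 * NAND_GB                        # ~$90 offload SSD
    batteries = 2 * 32                         # two 8hr lithium batteries
    extras = 500                               # charging circuitry, FRU enclosures, etc.
    return dram + ssd + batteries + extras     # ~$1.3K

def supercap_backed_dram(cache_gb=150):
    dram = cache_gb * DRAM_GB
    ssd = 300 * NAND_GB
    supercaps = 2 * 32 * 20                    # SuperCaps at ~20X battery price
    extras = 250                               # simpler charging/enclosure circuitry
    return dram + ssd + supercaps + extras     # ~$2.3K

def xpoint_nv_cache(cache_gb=150):
    return cache_gb * XPOINT_GB + 1000         # chips plus ~$1K of circuitry

print(f"3D XPoint NV cache vs battery DRAM : {xpoint_nv_cache()/battery_backed_dram():.1f}X")
print(f"3D XPoint NV cache vs SuperCap DRAM: {xpoint_nv_cache()/supercap_backed_dram():.1f}X")

# Replacing the whole 412GB volatile cache with 3D XPoint (and dropping the NV cache)
dram_plus_nv = 412 * DRAM_GB + 1000 + supercap_backed_dram()   # ~$5.1K
all_xpoint   = 412 * XPOINT_GB + 1000                          # ~$19K
print(f"All-3D XPoint cache vs DRAM + NV cache: {all_xpoint/dram_plus_nv:.1f}X")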

Again, the parts cost differential is not the whole story. But replacing volatile cache AND non-volatile cache would probably require more coding not less.

As for using 3D XPoint as a replacement for flash cache, I don't think it's likely, because the cost differential at $45/GB is ~100X flash costs (not counting the PCIe controller and other logic). Ditto for PCIe flash and SSD storage.

Being 10X denser than DRAM is great, but board footprint is not a significant storage system cost factor today.

So at a $45/GB price maybe there’s a 0.35 likelihood that storage systems would adopt the technology.

2. How many vendors are likely to GA new enterprise storage hardware in the next 12 months?

We can use major vendors to help estimate this. I used IBM, EMC, HDS, HP and NetApp as representing the major vendors for this analysis.

IBM (2 of 4) 

  • They just released a new DS8880 last fall, and the prior version, the DS8870, came out in Oct 2013, so the DS8K seems to be on a 24-month development cycle. So it's very unlikely we will see a new DS8K released in the next 12 months. 
  • The SVC DH8 engine hardware was introduced in May 2014, and the SVC CG8 engine was introduced in May 2011. So SVC hardware seems to be on a 36-month cycle, and it's very unlikely a new SVC hardware engine will be released in the next 12 months.
  • FlashSystem 900 hardware was just rolled out in 1Q 2015, and the FlashSystem 840 was introduced in January 2014. So FlashSystem hardware is on a ~15-month cycle, and it is very likely that new FlashSystem hardware will be released in the next 12 months. 
  • XIV Gen 3 hardware was introduced in July 2011. It's unclear when Gen 2 was rolled out, but IBM acquired XIV in Jan 2008 and released an IBM version in August 2008. So XIV is on a ~36-month cycle, and it is very likely that a new generation of XIV will be released in the next 12 months. 

EMC (3 of 4, see below) 

  • VMAX3 was GA’d in 3Q (Sep) 2014. VMAX2 was available Sep 2012, which puts VMAX on 24 month cycle. So, it’s very likely that a new VMAX will be released in the next 12 months.
  • VNX2 was announced May 2013 and GA'd Sep 2013. VNX1 was announced Jan 2011 and GA'd by May 2011. That puts VNX on a ~28-month cycle, which means we should have already seen a new one, so it's very likely we will see a new version of VNX in the next 12 months.  
  • XtremIO hardware was introduced in Mar, 2013 with no new significant hardware changes since. With a lack of history to guide us let’s assume a 24 month cycle. So, it’s very likely we will see a new version of XtremIO hardware in the next 12 months.
  • Isilon S200/X200 was introduced in April 2011 and the X400 was released in May 2012, which put Isilon on a 13-month cycle then, but there's been nothing since. So it's very likely we will see a new version of Isilon hardware in the next 12 months. 

However, EMC is unlikely to update all their storage hardware in the same 12 months. That being said, XtremIO could use a hardware boost, as IBM and the startups are pushing AFA technology pretty hard here. Isilon is getting long in the tooth, so that's another likely changeover. Since VNX is more overdue than VMAX, I'd have to say it's likely new VNX, XtremIO & Isilon hardware will be seen over the next year. 

HDS (1 of 3) 

  • Hitachi VSP G1000 came out in Apr of 2014. HDS VSP came out in Sep of 2010. So HDS VSP is on a 43 month cycle. So it’s very unlikely we will see a new VSP in 12 months. 
  • Hitachi HUS VM came out in Sep 2012.  As far as I can tell there were no prior generation systems. But HDS just came out with the G200-G800 series, leaving the HUS VM as the last one not updated so, it’s very likely we will see a new version of HUS VM in the next 12 months.
  • The Hitachi VSP G800, G600, G400, G200 series came out in Nov 2015. The Hitachi AMS 2500 series came out in April 2012. So the mid-range systems seem to be on a 43-month cycle, and it's very unlikely we will see a new version of the HDS G200-G800 series in the next 12 months.

HP (1 of 2) 

  • HP 3PAR 20000 was introduced August, 2015 and the previous generation system, 3PAR 10000 was introduced in June, 2012. This puts the 3PAR on a 38 month cycle. So it’s very unlikely we will see a new version of 3PAR in the next 12 months. 
  • MSA 1040 was introduced in Mar 2014. MSA 2040 was introduced in May 2013. This puts the MSA on ~10 month cycle. So it’s very likely we will see a new version of MSA in the next 12 months. 

NetApp (2 of 2)

  • FAS8080 EX was introduced in June 2014. FAS6200 was introduced in Feb 2013, which puts the high-end FAS systems on a 16-month cycle. So it's very likely we will see a new version of high-end FAS in the next 12 months.
  • NetApp FAS8040-8060 series scale out systems were introduced in Feb 2014. FAS3200 series was introduced in Nov of 2012. Which puts the FAS systems on a 15 month cycle. A new midrange release seems overdue, so it’s very likely we will see a new version of mid-range FAS in the next 12 months.

Overall, the likelihood of new hardware being released by the major vendors is (2+3+1+1+2)/15 = 9/15, or a ~0.60 probability of new hardware in the next 12 months.

Applying that 0.60 to non-major storage vendors that typically have only one storage system GA'd at a time – a list that includes Coho Data, DataCore, Data Gravity, Dell, DDN, Fujitsu, Infinidat, NEC, Nexenta, NexGen Storage, Nimble, Pure, Qumulo, Quantum, SolidFire, Tegile, Tintri, Violin Memory, X-IO, and probably a couple more I'm missing – of these ~21 non-major/startup vendors, we are likely to see ~13 new (non-major) hardware systems in the next 12 months. 
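
The tally, in a few lines of Python (the per-vendor counts are the ones worked out in the lists above):

# Tally of "new hardware likely in the next 12 months" from the vendor-by-vendor review
major_vendors = {"IBM": (2, 4), "EMC": (3, 4), "HDS": (1, 3), "HP": (1, 2), "NetApp": (2, 2)}

likely = sum(hits for hits, _ in major_vendors.values())   # 9
total = sum(n for _, n in major_vendors.values())          # 15
p_new_hw = likely / total                                  # ~0.60

non_major_count = 21                                       # non-majors/startups with one GA system each
expected_non_major = round(non_major_count * p_new_hw)     # ~13 new systems

print(f"Major-vendor refresh probability : {p_new_hw:.2f}")
print(f"Expected non-major new systems   : {expected_non_major}")
print(f"Total expected new systems       : {likely + expected_non_major}")   # ~22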

Some of these non-major systems are based on standard off-the-shelf, Intel server hardware and some vendors (Infinidat, Violin Memory & X-IO) have their own hardware designed systems. Of the 9 major vendor products identified above, six (IBM XIV, EMC VNX, EMC Isilon, EMC XtremIO, HP MSA and NetApp mid-range) use off the shelf, server hardware.

So, all told, my best guess is we should see (9+13=) 22 new enterprise storage systems introduced in the next 12 months from major and non-major storage vendors. 

3. How likely is it that Intel-Micron will come out with GA chip products in the next 6 months?

They claimed they were sampling product to vendors back at Flash Summit in August 2015. So it's very likely (0.85 probability) that Intel-Micron will produce 3D XPoint chips in the next 6 months.

Some systems (IBM FlashSystems, NetApp high-end, and HUS VM) could make use of raw chips or even a new level of storage connected to a memory bus. But all of them could easily take advantage of a 3D XPoint device packaged as NVMe PCIe-connected storage.

But to be useable for most vendor storage systems being GA’d over the next year, any new chip technology has to be available for use in 6 months at the latest.

4. How likely is it that Intel-Micron will produce servers with 3D XPoint in the next 6 months?

Listening in at Flash Summit, this seems to be their preferred technological approach to market. And since most storage vendors use standard Intel servers, this would seem to be the easiest way to adopt it. If the chips are available, I deem it 0.65 probable that Intel will GA server hardware with 3D XPoint technology in the next 6 months. 

I'm not sure any of the major or non-major vendors above could possibly use server hardware introduced later than 6 months out, but Qumulo uses Agile development and releases GA code every 2 weeks, so they could take this on later than most.

But given the chip pricing, lack of significant advantage, and coding update requirements, I deem it 0.33 probability that vendors will adopt the technology even if it’s in a new server that they can use.

Summary

So there's a 0.85 probability of chips being available within 6 months for the 3 potential major systems that could use them directly, which leaves us with ~2.6 systems using 3D XPoint chip technology directly. 

With a 0.65 probability of servers using 3D XPoint coming out in 6 months and a 0.45 probability of new storage systems adopting the technology for caching, that's a combined 0.29 probability. Applied to the 18 other new systems coming out, ~5.2 systems could potentially adopt the server technology.

That's a total of 7.8 systems out of a potential 22 new systems, or a 0.35 probability. 

That's just the known non-major and startup storage vendors with GA products; what about the stealth(ier) startups without GA storage, like Primary Data? There are probably 2 or 3 non-GA storage startups. If we assume the same 0.6 probability of GA hardware next year, that's an additional 1.8 systems. More than likely these will depend on standard servers, so the 0.65 Intel-server probability applies. So we're likely to see an additional ~1.2 systems here, for a total of 9.0 new systems adopting 3D XPoint tech in the next 12 months.

So it's 9 systems out of 23.8, or ~0.38 probable. So my forecast is Yes at 0.38 probability. 
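
For reference, here's the forecast arithmetic from the summary collected into one Python sketch (the probabilities and system counts are the ones estimated above):

# Roll-up of the forecast arithmetic from the summary above
p_chips_6mo   = 0.85   # Intel-Micron ships 3D XPoint chips within 6 months
p_servers_6mo = 0.65   # Intel ships servers with 3D XPoint within 6 months
p_adopt_cache = 0.45   # a new storage system adopts the technology for caching

chip_direct_candidates = 3     # systems that could use raw chips (FlashSystem, high-end FAS, HUS VM)
server_based_candidates = 18   # other new systems riding on standard Intel servers (figure from the text)

chip_direct  = p_chips_6mo * chip_direct_candidates                     # ~2.6 systems
server_based = p_servers_6mo * p_adopt_cache * server_based_candidates  # ~5.2 systems
stealth      = 3 * 0.60 * p_servers_6mo                                 # ~1.2 systems from non-GA startups

adopters = chip_direct + server_based + stealth     # ~9.0 systems
candidates = 22 + 3 * 0.60                          # 22 known plus ~1.8 stealth = 23.8

print(f"Expected adopters : {adopters:.1f} of {candidates:.1f}")
print(f"Forecast P(yes)   : {adopters / candidates:.2f}")   # ~0.38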

Pricing is a key factor here. I assumed a single price, but a range of prices is more likely, and factoring that range in would be more accurate. I just don't know how to do that, yet.

~~~~

I could go on for another 1000 words and still be no closer to an estimate. Somebody please check my math.

Comments?

Photo Credit(s): (iTech Androidi) 3D XPoint – Intel’s new Storage chip is 1000 faster than flash memory

Million year optical disk

Read an article the other day about scientists creating an optical disk that would be readable in a million years or so. The article in Science Mag, titled A million-year hard disk, was about how to warn people in the far future about potential dangers being created today.

A while back I wrote about a 1000 year archive, which was predominantly about disappearing formats. At the time, I believed that, given the growth in data density, information could easily be copied and saved over time, but that the formats for that data would be long gone by the time someone tried to read it.

The million year optical disk eliminates the format problem by using pixelated images etched on the media – which works just dandy if you happen to have a microscope handy.

Why would you need a million year disk

The problem is how to warn people in the far future not to mess with the radioactive waste deposits buried below them. If the waste is radioactive for a million years, you need something that lasts long enough to tell people to keep away from it.

Stone markers last for a few thousand years at best but get overgrown and wear down in time. For instance, my grandmother’s tombstone in Northern Italy has already been worn down so much that it’s almost unreadable. And that’s not even 80 yrs old yet.

But a sapphire hard disk that could easily be read with any serviceable microscope might do the job.

How to create a million year disk

This new disk is similar to the old StorageTek 100K-year optical tape. Both depend on microscopic impressions, something like bits physically marked on the media.

For the optical disk, the bits are created by etching a sapphire platter with platinum. Apparently the prototype costs €25K, but they're hoping the price comes down with production.

There are actually two 20cm (7.9in) wide disks that are molecularly fused together, and each disk can store 40K miniaturized pages of text or images. They are doing accelerated life testing on the sapphire disks by bathing them in acid to ensure a 10M-year life for the media and message.

Presumably the images are grey tone (or in this case, platinum tone). If I assume 100KB per page, that's about 4GB per disk, something around a single-layer DVD in a much larger form factor.
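
The capacity arithmetic, for what it's worth (the 100KB-per-page figure is my assumption, not from the article):

# Capacity estimate: 40K miniaturized pages per disk at an assumed ~100KB per page
pages_per_disk = 40_000
bytes_per_page = 100 * 1024          # assumption, not from the article

capacity_bytes = pages_per_disk * bytes_per_page
print(f"~{capacity_bytes / 1e9:.1f} GB per disk")   # ~4.1 GB, roughly a single-layer DVD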

Why sapphire

It appears that sapphire is readily available from industrial processes and seems impervious to the wear that harms other materials. But that's what they are trying to prove.

It's unclear why they decided to "molecularly" fuse two platters together. It seems to me this could easily be a weak link in the technology over the course of a dozen millennia or so. On the other hand, more storage is always a good thing.

~~~~

In the end, creating dangers today that last millions of years requires some serious thought about how to warn future generations.

Image: Clock of the Long Now by Arenamontanus

Top 10 blog posts for 2011

Merry Christmas! Buon Natale! Frohe Weihnachten! by Jakob Montrasio (cc) (from Flickr)

Happy Holidays.

I ranked my blog posts using a ratio of hits to post age and have identified the top 10 most popular posts for 2011 (so far):

  1. Vsphere 5 storage enhancements – We discuss some of the more interesting storage oriented Vsphere 5 announcements that included a new DAS storage appliance, host based (software) replication service, storage DRS and other capabilities.
  2. Intel’s 320 SSD 8MB problem – We discuss a recent bug (since fixed) which left the Intel 320 SSD drive with only 8MB of storage, we presumed the bug was in the load leveling logic/block mapping logic of the drive controller.
  3. Analog neural simulation or digital neuromorphic computing vs AI – We talk about recent advances to providing both analog (MIT) and digital versions (IBM) of neural computation vs. the more traditional AI approaches to intelligent computing.
  4. Potential data loss using SSD RAID groups – We note the possibility for catastrophic data loss when using equally used SSDs in RAID groups.
  5. How has IBM research changed – We examine some of the changes at IBM research that have occurred over the past 50 years or so which have led to much more productive research results.
  6. HDS buys BlueArc – We consider the implications of the recent acquisition of BlueArc storage systems by their major OEM partner, Hitachi Data Systems.
  7. OCZ’s latest Z-Drive R4 series PCIe SSD – Not sure why this got so much traffic but its OCZ’s latest PCIe SSD device with 500K IOPS performance.
  8. Will Hybrid drives conquer enterprise storage – We discuss the unlikely possibility that Hybrid drives (NAND/Flash cache and disk drive in the same device) will be used as backend storage for enterprise storage systems.
  9. SNIA CDMI plugfest for cloud storage and cloud data services – We were invited to sit in on a recent SNIA Cloud Data Management Initiative (CDMI) plugfest and talk to some of the participants about where CDMI is heading and what it means for cloud storage and data services.
  10. Is FC dead?! – What with the introduction of 40GbE FCoE just around the corner, 10GbE cards coming down in price and Brocade’s poor YoY quarterly storage revenue results, we discuss the potential implications on FC infrastructure and its future in the data center.

~~~~

I would have to say #3, 5, and 9 were the most fun for me to do. Not sure why, but #10 probably generated the most twitter traffic. Why the others were so popular is hard for me to understand.

Comments?

Graphene Flash Memory

Model of graphene structure by CORE-Materials (cc) (from Flickr)

I have been thinking about writing a post on "Is Flash Dead?" for a while now. Well, at least since talking with IBM Research a couple of weeks ago about the new memory technologies they have been working on.

But then this new Technology Review article came out discussing recent research on Graphene Flash Memory.

Problems with NAND Flash

As we have discussed before, NAND flash memory has some serious limitations as it’s shrunk below 11nm or so. For instance, write endurance plummets, memory retention times are reduced and cell-to-cell interactions increase significantly.

These issues are not that much of a problem with today’s flash at 20nm or so. But to continue to follow Moore’s law and drop the price of NAND flash on a $/Gb basis, it will need to shrink below 16nm.  At that point or soon thereafter, current NAND flash technology will no longer be viable.

Other non-NAND based non-volatile memories

That's why IBM and others are working on different types of non-volatile storage such as PCM (phase change memory), MRAM (magnetic RAM), FeRAM (ferroelectric RAM) and others. All these have the potential to improve general reliability characteristics beyond where NAND Flash is today and where it will be tomorrow as chip geometries shrink even more.

IBM seems to be betting on MRAM or racetrack memory technology because it has near DRAM performance, extremely low power and can store far more data in the same amount of space. It sort of reminds me of delay line memory where bits were stored on a wire line and read out as they passed across a read/write circuit. Only in the case of racetrack memory, the delay line is etched in a silicon circuit indentation with the read/write head implemented at the bottom of the cleft.

Graphene as the solution

Then along comes Graphene based Flash Memory.  Graphene can apparently be used as a substitute for the storage layer in a flash memory cell.  According to the report, the graphene stores data using less power and with better stability over time.  Both crucial problems with NAND flash memory as it’s shrunk below today’s geometries.  The research is being done at UCLA and is supported by Samsung, a significant manufacturer of NAND flash memory today.

Current demonstration chips are much larger than would be useful.  However, given graphene’s material characteristics, the researchers believe there should be no problem scaling it down below where NAND Flash would start exhibiting problems.  The next iteration of research will be to see if their scaling assumptions can hold when device geometry is shrunk.

The other problem is getting graphene, a new material, into current chip production.  Current materials used in chip manufacturing lines are very tightly controlled and  building hybrid graphene devices to the same level of manufacturing tolerances and control will take some effort.

So don’t look for Graphene Flash Memory to show up anytime soon. But given that 16nm chip geometries are only a couple of years out and 11nm, a couple of years beyond that, it wouldn’t surprise me to see Graphene based Flash Memory introduced in about 4 years or so.  Then again, I am no materials expert, so don’t hold me to this timeline.

 

—-

Comments?

IBM’s 120PB storage system

Susitna Glacier, Alaska by NASA Goddard Photo and Video (cc) (from Flickr)

Talk about big data, Technology Review reported this week that IBM is building a 120PB storage system for some unnamed customer.  Details are sketchy and I cannot seem to find any announcement of this on IBM.com.

Hardware

It appears that the system uses 200K disk drives to support the 120PB of storage.  The disk drives are packed in a new wider rack and are water cooled.  According to the news report the new wider drive trays hold more drives than current drive trays available on the market.

For instance, HP has a hot pluggable, 100 SFF (small form factor 2.5″) disk enclosure that sits in 3U of standard rack space.  200K SFF disks would take up about 154 full racks, not counting the interconnect switching that would be required.  Unclear whether water cooling would increase the density much but I suppose a wider tray with special cooling might get you more drives per floor tile.

There was no mention of the interconnect, but today's drives use either SAS or SATA. SAS interconnects for 200K drives would require many separate SAS busses. With an SAS expander addressing 255 drives or other expanders, one would need at least 4 SAS busses, but that would mean ~64K drives per bus and would not perform well. Something more like 64-128 drives per bus would perform much better, and each drive would need dual pathing. If we use 100 drives per SAS string, that's 2000 SAS drive strings, or at least 4000 SAS busses (for dual-port access to the drives).

The report mentioned GPFS as the underlying software which supports three cluster types today:

  • Shared storage cluster – where GPFS front-end nodes access shared storage across the backend. This is generally SAN storage system(s). But given the requirements for high density, it doesn't seem likely that the 120PB storage system uses SAN storage in the backend.
  • Network-based cluster – here the GPFS front-end nodes talk over a LAN to a cluster of NSD (Network Shared Disk) servers, which can have access to all or some of the storage. My guess is this is what will be used in the 120PB storage system.
  • Shared Network based clusters – this looks just like a bunch of NSD servers but provides access across multiple NSD clusters.

Given the above, ~100 drives per NSD server means another 1U per 100 drives, or (given HP drive density) 4U per 100 drives. That works out to 1000 drives and 10 IO servers per 40U rack (not counting switching). At this density it takes ~200 racks for 120PB of raw storage plus NSD nodes, or 2000 NSD nodes.

It's unclear how many GPFS front-end nodes would be needed on top of this, but even at 1 GPFS front-end node for every 5 NSD nodes, we are talking another 400 GPFS front-end nodes, and at 1U per server, another 10 racks or so (not counting switching).

If my calculations are correct, we are talking over 210 racks, with switching thrown in, to support the storage. According to IBM's discussion of the storage challenges for petascale systems, it probably provides ~6TB/sec of data transfer, which should be easy with 200K disks but may require even more SAS busses (maybe ~10K vs. the 2K discussed above).
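
Here's the sizing arithmetic above gathered into one Python sketch; the enclosure density, drives-per-string and nodes-per-rack figures are the assumptions I used above, not anything IBM has published:

# Back-of-envelope sizing for a 200K-drive system, using the assumptions in the text
drives = 200_000

# Drive enclosures: 100 SFF drives in 3U, 13 enclosures (39U) per rack
enclosures = drives / 100                    # 2,000 enclosures
drive_racks = enclosures / 13                # ~154 racks for drives alone

# SAS strings: ~100 drives per string, dual-pathed
sas_strings = drives / 100                   # 2,000 strings
sas_busses = sas_strings * 2                 # ~4,000 busses for dual-port access

# NSD servers: ~100 drives (1U) per NSD server; GPFS front ends at 1 per 5 NSD nodes
nsd_nodes = drives / 100                     # 2,000
gpfs_nodes = nsd_nodes / 5                   # 400

# Racks: 1000 drives + 10 IO servers per 40U rack, plus 1U GPFS front-end nodes
storage_racks = drives / 1000                # ~200 racks of drives + NSD nodes
frontend_racks = gpfs_nodes / 40             # ~10 racks of GPFS front ends

print(f"Drive enclosures : {enclosures:.0f} ({drive_racks:.0f} racks of drives alone)")
print(f"SAS busses       : {sas_busses:.0f}")
print(f"NSD / GPFS nodes : {nsd_nodes:.0f} / {gpfs_nodes:.0f}")
print(f"Total racks      : ~{storage_racks + frontend_racks:.0f}, plus switching")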

Software

IBM GPFS is used behind the scenes in IBM's commercial SONAS storage system, but has been around as a cluster file system designed for HPC environments for 15 years or more now.

Given this many disk drives something needs to be done about protecting against drive failure.  IBM has been talking about declustered RAID algorithms for their next generation HPC storage system which spreads the parity across more disks and as such, speeds up rebuild time at the cost of reducing effective capacity. There was no mention of effective capacity in the report but this would be a reasonable tradeoff.  A 200K drive storage system should have a drive failure every 10 hours, on average (assuming a 2 million hour MTBF).  Let’s hope they get drive rebuild time down much below that.

The system is expected to hold around a trillion files.  Not sure but even at 1024 bytes of metadata per file, this number of files would chew up ~1PB of metadata storage space.
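
A quick check of those last two back-of-envelope numbers (the MTBF and metadata-per-file figures are the assumptions stated above):

# Expected drive-failure interval and metadata footprint, per the assumptions above
drives = 200_000
mtbf_hours = 2_000_000                     # assumed 2 million hour MTBF per drive
print(f"One drive failure every ~{mtbf_hours / drives:.0f} hours")   # ~10 hours

files = 1e12                               # ~a trillion files
metadata_per_file = 1024                   # bytes per file, the text's assumption
print(f"Metadata footprint ~{files * metadata_per_file / 1e15:.1f} PB")   # ~1 PB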

GPFS provides ILM (information life cycle management, or data placement based on information attributes) using automated policies and supports external storage pools outside the GPFS cluster storage.  ILM within the GPFS cluster supports file placement across different tiers of storage.

All the discussion up to now revolved around homogeneous backend storage but it’s quite possible that multiple storage tiers could also be used.  For example, a high density but slower storage tier could be combined with a low density but faster storage tier to provide a more cost effective storage system.  Although, it’s unclear whether the application (real world modeling) could readily utilize this sort of storage architecture nor whether they would care about system cost.

Nonetheless, presumably an external storage pool would be a useful adjunct to any 120PB storage system for HPC applications.

Can it be done?

Let’s see, 400 GPFS nodes, 2000 NSD nodes, and 200K drives. Seems like the hardware would be readily doable (not sure why they needed watercooling but hopefully they obtained better drive density that way).

Luckily GPFS supports Infiniband which can support 10,000 nodes within a single subnet.  Thus an Infiniband interconnect between the GPFS and NSD nodes could easily support a 2400 node cluster.

The only real question is can a GPFS software system handle 2000 NSD nodes and 400 GPFS nodes with trillions of files over 120PB of raw storage.

As a comparison here are some recent examples of scale out NAS systems:

It would seem that a 20X multiplier times a current Isilon cluster or even a 10X multiple of a currently supported SONAS system would take some software effort to work together, but seems entirely within reason.

On the other hand, Yahoo supports a 4000-node Hadoop cluster and it seems to work just fine. So from a feasibility perspective, a 2400-node GPFS-NSD system seems like a walk in the park compared to what Hadoop already handles.

Of course, IBM Almaden is working on a project to support Hadoop over GPFS, which might not be optimum for real world modeling but would nonetheless support the node count being talked about here.

——

I wish there were some real technical information on the project out on the web, but I could not find any. Much of this is informed conjecture based on current GPFS system and storage hardware capabilities. But hopefully, I haven't traveled too far astray.

Comments?