Two paths to better software

Read an article last week in the Atlantic, The coming software apocalypse, about some of the problems in how we develop software today.

Most software development today is editing text files. Some of these text files have 1,000s of lines and are connected to other text files with 1,000s of more lines which are connected to other text files with 1,000s of lines, etc. Pretty soon you have millions of lines of code all interacting with one another.

The problem

Been there done that and it’s not pretty. We even spent some time trying to reduce the code bloat by macro-izing some of it, and that just made it harder to understand, but reduced the lines of code.

The problem is much worse now where . we have software everywhere you look, from the escalator-elevator you take up and down between floors, to the cars you drive around town, to the trains and airplanes you travel between cities.

All of these literally have millions of lines of code controlling them and are many more each year. How can they all possibly be correct.

Well you can test the s&*t out of them. But you can’t cover every path in a lifetime or ten of testing a million line program. And even if you could, changing a single line would generate another 100K or more paths to test. So testing was never a true answer.

Two solutions

The article talks about two approaches that have some merit to solve the real problem.

  • Model based development, a new development and coding environment. In this approach your not so much coding as playing with a model of the behavior your looking for. Say you were coding robot control logic, rather than editing 1000s of lines of Java text, you work with a model of your robot and its environment on 1/2 a screen and on the other half, model parameters (dials, sliders, arrow keys, etc) and logic (sequences) that you  manipulate to do what the robot needs to do. Sort of like Scratch on steroids (see my post on 10 years of Scratch) with the sprite being whatever you need to code for be it a jet engine, automobile, elevator, whatever. The playground would be a realtime/real life simulation of the entity under control of the code and you would code by setting parameters  and defining sequences. But the feedback would be immediate!
  • TLA+ a formal design verification approach. Formal methods have been around since the early 70s. They are used to rigorously specify a design of  some code or a whole system. The idea is that if you can specify a  provably correct design, then the code (derived from that design) has the potential to be more correct. Yes there’s still the translation from code to design that’s error prone but the likelihood is that these errors will be smaller in scope than having a design that wrong.

Model based  development

One can find model based development already in the Apple new application development language, Swift, ANSYS SCADE suite based on Esterel Technologies, and Light Table software development environment.

I have never used any of them but they all look interesting. Esterel was developed for safety critical, real-time aerospace applications. Light Table was a kickstarter project started by a leading engineer of Microsoft’s Visual Studio, the leading IDE. Apple Swift was developed to make it much easier to develop IOS apps.

TLA+

TLA+ takes a bit getting used to. All formal methods depend on advanced mathematics and sophisticated logic and requires an adequate understanding of these in order to use properly. TLA+ was developed by Leslie Lamport and stands for temporal logic of actions.

TLA+ specifications identify the set of all correct system actions. I would call it a formal pseudo code.

There’s apparently a video course , a hyperbook and a book on the language It’s being used in AWS and Microsoft XBOX and Azure. (See the wikipedia TLA+ article for more information).

There’s PlusCal algorithm (specification) language which is translated into a TLA+ specification which can then be checked by the automated TLC model checker.  There’s also an automated TLAPS, a TLA+ proof system although it doesn’t support all of the TLA+ primitives.  There’s a whole TLA+ toolbox that has these and other tools that can make TLA+ easier to use.

~~~~

We dabbled in formal specifications methods for on our million+ line storage system at a former employer. It worked well and cleaned up a integrity critical area of the product. Alas, we didn’t expand it’s use to other areas of the product and it sort of fell out of favor. But it worked when and where we applied it.

Of course this was before automated formal methods of today, but even manual methods of specification precision can be helpful to think out what a design has to do to be correct.

I have no doubt that both TLA+ formal methods and model based development approaches and more are required to truly vanquish the coming software apocalypse.

At least until artificial intelligence starts developing all our code for us.

Comments?

Photo Credits: Six easy pieces of quantitatively analyzing open source, SAP Research;

Spaghetti code still existed, Toolbox.com;

How to write apps with Swift, MacWorld;

Modeling the dining philosophers problem in TLA+, Metadata blog

 

A college course on identifying BS

Read an article the other day from Recode (These University of Washington professors teaching a course on Calling BS) that seems very timely. The syllabus is online (Calling Bullshit — Syllabus) and it looks like a great start on identifying falsehood wherever it can be found.

In the beginning, what’s BS?

The course syllabus starts out referencing Brandolini’s Bullshit Asymmetry Principal (Law): the amount of energy needed to refute BS is an order of magnitude bigger than to produce it.

Then it goes into a rather lengthy definition of BS from Harry Frankfort’s 1986 On Bullshit article. In sum, it starts out reviewing a previous author’s discussions on Humbug and ends up at the OED. Suffice it to say Frankfurt’s description of BS runs the gamut from: Deceptive misrepresentation to short of lying.

They course syllabus goes on to reference two lengthy discussions/comments on Frankfurt’s seminal On Bullshit article, but both Cohen’s response, Deeper into BS and Eubank & Schaeffer’s A kind word for BS: …  are focused more on academic research rather than everyday life and news.

How to mathematically test for BS

The course then goes into mathematical tests for BS that range from Fermi’s questions, the Grim Test and Benford’s 1936 Law of Anomalous Numbers. These tests are all ways of looking at data and numbers and estimating whether they are bogus or not. Benford’s paper/book talks about how the first page of logarithms is always more used than others because numbers that start with 1 are more frequent than any other number.

How rumors propagate

The next section of the course (week 4) talks about the natural ecology of BS.

Here there’s reference to an article by Friggeri, et al, on Rumor Cascades, which discusses the frequency with which patently both true, false and partially true/partially false rumors are “shared” on social media (Facebook).

The professors look at a website called Snopes.com which evaluates the veracity of publishes rumors uses this to classify the veracity of rumors. Next they examine how these rumors are shared over time on Facebook.

Summarizing their research, both false and true rumors propagate sporadically on Facebook. But even verified false or mixed true/mixed false rumors (identified by Snopes.com) continue to propagate on Facebook. This seems to indicate that rumor sharers are ignoring the rumor’s truthfulness or are just unaware of the Snopes.com assessment of the rumor.

Other topics on calling BS

The course syllabus goes on to causality (correlation is not causation, a common misconception used in BS), statistical traps and trickery (used to create BS), data visualization (which can be used to hide BS), big data (GiGo leads to BS), publication bias (e.g., most published research presents positive results, where’s all the negative results research…), predatory publishing and scientific misconduct (organizations that work to create BS for others), the ethics of calling BS (the line between criticism and harassment), fake news and refuting BS.

Fake news

The section on Fake News is very interesting. They reference an article in the NYT, The Agency about how a group in Russia have been reaping havoc across the internet with fake news and bogus news sites.

But there’s more another article on NYT website, Inside a fake news sausage factory, details how multiple websites started publishing bogus news and then used advertisement revenue to tell them which bogus news generated more ad revenue – apparently there’s money to be made in advertising fake news. (Sigh, probably explains why I can’t seem to get any sponsors for my websites…).

Improving the course

How to improve their course? I’d certainly take a look at what Facebook and others are doing to identify BS/fake news and see if these are working effectively.

Another area to add might be a historical review of fake rumors, news or information. This is not a new phenomenon. It’s been going on since time began.

In addition, there’s little discussion of the consequences of BS on life, politics, war, etc. The world has been irrevocably changed in the past  on account of false information. Knowing how bad this has been this might lend some urgency to studying how to better identify BS.

There’s a lot of focus on Academia in the course and although this is no doubt needed, most people need to understand whether the news they see every day is fake or not. Focusing more on this would be worthwhile.

~~~~

I admire the University of Washington professors putting this course together. It’s really something that everyone needs to understand  nowadays.

They say the lectures will be recorded and published online – good for them. Also, the current course syllabus is for a one credit hour course but they would like to expand it to a three to four credit hour course – another great idea

Comments?

Photo credit(s): The Donation of ConstantineNew York World – Remember the Maine, Public Domain; Benjamin Franklin’s Bag of Scalps letter;  fake-news-rides-sociales by Portal GDA

Insecure SHA-1 imperils Internet security, PKI, and most password systems

safe 'n green by Robert S. Donovan (cc) (from flickr)
safe ‘n green by Robert S. Donovan (cc) (from flickr)

I suppose it’s inevitable but surprising nonetheless.  A recent article Faster computation will damage the Internet’s integrity in MIT Technology Review indicates that by 2018, SHA-1 will be crackable by any determined large  organization. Similarly, just a few years later,  perhaps by 2021 a much smaller organization will have the computational power to crack SHA-1 hash codes.

What’s a hash?

Cryptographic hash functions like SHA-1 are designed such that, when a string of characters is “hash”ed they generate a binary value which has a couple of great properties:

  • Irreversibility – given a text string and a “hash_value” generated by hashing “text_string”, there is no way to determine what the “text_string” was from its hash_value.
  • Uniqueness – given two or more text strings, “text_string1” and “text_string2” they should generate two unique hash values, “hash_value1” and “hash_value2”.

Although hash functions are designed to be irreversible that doesn’t mean that they couldn’t be broken via a brute force attack. For example, if one were to try every known text string, sooner or later one would come up with a “text_string1” that hashes to “hash_value1”.

But perhaps even more serious, the SHA-1 algorithm is prone to hash collisions  which makes fails the uniqueness property above.  That is, there are a few “text_string1″s that hash to the same “hash_value1”.

All this wouldn’t be much of a problem except that with Moore’s law in force and continuing for the next 6 years or so we will have processing power in chips capable of doing a brute force attack against SHA-1 to find text_strings that match any specific hash value.

So what’s the big deal?

Well it turns out that SHA-1 algorithms underpin almost all secure data transmissions today. That is, most Public-key infrastructure (PKI) depend on SHA-1 to sign digital certificates.  And although that’s pretty bad, what’s even worse is that Secure Socket Layer/Transport Layer Security (SSL/TLS) used by “https://” websites the world over also depend on SHA-1 to send key information used to encrypt/decrypt secure Internet transactions.

On top of all that, many of today’s secure systems with passwords, use SHA-1 to hash passwords and instead of storing actual passwords in plain-text on their password files, they only store the SHA-1 hash of the passwords.  As such, by 2021, anyone that can read the hashed password file can retrieve any password in plain text.

What all this means is that by 2018 for some and 2021 or thereabouts for just about anybody else, todays secure internet traffic, PKI and most system passwords will no longer be secure.

What needs to be done

It turns out that NSA knew about the failings of SHA-1 quite awhile ago and as such, NIST released SHA-2 as a new hash algorithm and its functional replacement.  Probably just in time, this month, NIST announced a winner for a new SHA-3 algorithm as a functional replacement for SHA-2.

This may take awhile, what needs to be done is to have all digital certificates that use SHA-1, be invalidated with new ones generated using SHA-2 or SHA-3.  And of course, TLS and SSL Internet functionality all have to be re-coded to recognize and use SHA-2 or SHA-3, instead of SHA-1.

Finally, for most of those password systems, users will need to re-login and have their password hashes changed over from SHA-1 to SHA-2 or SHA-3.

Naturally, in order to use SHA-2 or SHA-3 many systems may need to be upgraded to later levels of code.  Seems like Y2K all over again, only this time it’s security that’s going to crash.  It’s good to be in the consulting business, again.

~~~~

But the real problem IMHO, is Moore’s law.  If it continues to double processing power/transistor density every two years or so, how long before SHA-2 or SHA-3 succumb to same sorts of brute force attacks?  Given that, we appear destined to change hashing, encryption and other security algorithms every decade or so until Moore’s law slows down or god forbid, stops altogether.

Comments?

 

Shingled magnetic recording disks

A couple of weeks ago I attended a day of the SNIA Storage Developers Conference (SDC) where Garth Gibson of Carnegie Mellon University Parallel Data Lab (CMU PDL) and Panasas was giving a talk of what they are up to at CMU’s storage lab.  His talk at the conference was on shingled magnetic recording (SMR) disks. We have discussed this topic before in our posts on Sequential only disks?!  and in Disk trends revisited.  SMR may require a re-thinking of how we currently access disk storage.

Recall that shingled magnetic recording uses a write head that overwrites multiple tracks at a time (see graphic above), with one track being properly written and the adjacent (inward) tracks being overwritten. As the head moves to the next track, that track can be properly written but more adjacent (inward) tracks are overwritten, etc. In this fashion data can be written sequentially, on overlapping write passes.  In contrast, read heads can be much narrower and are able to read a single track.

In my post, I assumed that this would mean that the new shingled magnetic recording disks would need to be accessed sequentially not unlike tape. Such a change would need a massive rewrite to only write data sequentially.  I had suggested this could potentially work if one were to add some SSD or other NVRAM to the device to help manage the mapping of the data to the disk.  Possibly that plus a very sophisticated drive controller, not unlike SSD wear leveling today, could handle mapping a physically sequentially accessed disk to a virtually randomly accessed storage protocol.

Garth’s approach to the SMR dilemma

Garth and his team of researchers are taking another tack at the problem. In his view there are multiple groups of tracks on an SMR disk (zones or bands).  Each band can be either written sequentially or randomly but all bands can be read randomly.  One can break up the disk to include sections of multiple shingled bands, that are sequentially written and less, non-shingled bands that can be randomly written. Of course there would be a gap between the shingled bands in order not to overwrite adjacent bands. And there would also be gaps between the randomly written tracks in a non-shingled partition to allow for the wider track writing that occurs with the SMR write head.

His pitch at the conference dealt with some characteristics of such a multi-band disk device.  Such as

  • How to determine the density for a device that has multiple bands of both shingled write data and randomly written data.
  • How big or small a shingled band should be in order to support “normal” small block and randomly accessed file IO.
  • How many randomly written tracks or what the capacity of the non-shingled bands would need to be to support “normal” file IO activity.

For maximum areal density one would want large shingled bands.  There are other interesting considerations that were not as obvious but I won’t go into here.

SCSI protocol changes for SMR disks

The other, more interesting section of Garth’s talk was on recent proposed T10 and T13 changes to support SMR disks that supported shingled and non-shingled partitions and what needed to be done to support SMR devices.

The SCSI protocol changes being considered to support SMR devices include:

  • A new write cursor for shingled write bands that indicates the next LBA to be written.  The write cursor starts out at a relative band address of 0 and as each LBA is written consecutively in the band it’s incremented by one.
  • A write cursor can be reset (to zero) indicating that the band has been erased
  • Each drive maintains the band map and current cursor position within each band and this can be requested by SCSI drivers to understand the configuration of the drive.

Probably other changes are required as well but these seem sufficient to flesh out the problem.

SMR device software support

Garth and his team implemented an SMR device, emulated in software using real random accessed devices.  They then implemented an SMR device driver that used the proposed standards changes and finally, implemented a ShingledFS file system to use this emulated SMR disk to see how it would work.  (See their report on Shingled Magnetic Recording for Big Data Applications for more information.)

The CMU team implemented a log structured file system for the ShingledFS that only wrote data to the emulated SMR disk shingled partition sequentially, except for mapping and meta-data information which was written and updated randomly in a non-shingled partition.

You may recall that a log structured file system is essentially written as a sequential stream of data (not unlike a log).  But there is additional mapping required that indicates where file data is located in the log which allows for randomly accessing the file data.

In their report and at the conference, Garth presented some benchmark results for a big data application called Terasort (essentially Teragen, Terasort and Teravalidate) which seems to use Hadoop to sort a large body of data.   Not sure I can replicate this information here but suffice it to say at the moment the emulated SMR device with ShingledFS did not beat a base EXT3 or FUSE using the same hardware for these applications.

Now the CMU project wAs done by a bunch of smart researchers but it’s still relatively new and not necessarily that optimized.  Thus, there’s probably some room for improvement in the ShingledFS and maybe even the emulated SMR device and/or the SMR device driver.

At the moment Garth and his team seem to believe that SMR devices are certainly feasible and would take only modest changes to the SCSI protocols to support such devices.  As for file system support there is plenty of history surrounding log structured file systems so these are certainly doable but would require probably extensive development to implemented in various OS to support an SMR device.  The device driver changes don’t seem to be as significant.

~~~~

It certainly looks like there’s going to be SMR devices in our future.  It’s just a question whether they will be ever as widely supported as the randomly accessed disk device we know and love today.  Possibly, this could all be behind a storage subsystem that makes the technology available as networked storage capacity and over time maybe SMR devices could be implemented in more standard OS device drivers and file systems.  Nevertheless, to keep capacity and areal density on their current growth trajectory, SMR disks are coming, it’s just a matter of time.

Comments?

Image: (c) 2012 Hitachi Global Storage Technologies, from IEEE SCV Magnetics Society presentation by Roger Wood

 

Robots on the road

Just heard that California is about to start working on formal regulations for robot cars to travel their roads which is the second state to regulate these autonomous machines, the first was Nevada.  At the moment the legislation signed into law requires CA to draft regulations for these vehicles by January 1, 2015.

I suppose being in the IT industry this shouldn’t be a surprise to me or anyone else. Google has been running autonomously driven vehicles for over 300K miles now.

But it always seems a bit jarring when something like this goes from testing  to production, seems almost Jetson like.  I remember seeing a video of something like this from Bell Labs/GM Labs or somebody like that when they were talking about the future way back in the 60s of last century.  Gosh only 50 years later and its almost here.

DARPA Grand Challenges spurred it on

Of course it all started probably in the late 70s when AI was just firing up.  But robot cars seemed to really take off when DARPA, back in 2004 wanted to push the technology to develop a autonomous vehicle for the DOD. They funded a and created the DARPA Grand Challenge.

In 2004 the requirements were to drive over 150 miles (240 km) in and around the Mojave desert in southwestern USA. In that first year, none of the vehicles managed to finish the distance.  Over the next few years, the course got more difficult, the prize money increased, and the vehicles got a lot smarter.

In 2005 DARPA grand challenge once again a rural setting, 5 vehicles finished the course 1 from Stanford, 2 from Carnegie Mellon (CMU), 1 from Oshkosh Trucking, and the other 1 from Gray’s Insurance Company.  At first I thought an insurance company, then it hit me maybe there’s a connection to auto insurance.

DARPA’s next challenge for 2007 was for an urban driving environment but this time  DARPA providing research funding to a select group as well as larger prize to any winners.  Six teams were able to finish the urban challenge, 1 each from CMU, Stanford, Virginia Tech, MIT, University of Pennsylvania & Lehigh University and Cornell University.  That was the last DARPA challenge for autonomous vehicles, seems they had what they wanted.

Google’s streetview helped

Sometime around 2010, Google started working withg self-driving cars to provide some of the streetview shots they needed.  Shortly thereafter they had  logged ~140K miles with them.   Fast forward a couple of years and Google’s Sergey Brin was claiming that people will be driving in robotic cars in 5 years. To get their self-driving cars up and running they hired the leaders of both the CMU and Stanford teams as well as somebody who worked on the first autonomous motorcycle which ran in the Urban Challenge.

For all of the 300K miles they currently have logged, the cars were manned by a safety driver and a software engineer in the car, just for safety reasons.  Also, local police were notified that the car would be in their area.  Before the autonomous car took off another car, this one driven by a human, was sent out to map out the route in detail including all traffic signs, signals, lane markers, etc.  This was then up(?) loaded to the self-driving car which followed the same exact route.

I couldn’t find and detailed hardware list but Google’s blog post on the start of the project indicated computers (maybe 2 for HA), multiple cameras, infrared sensors, laser rangefinders, radar, and probably multiple servos (gear shift, steering, accelerator and brake pedals), all fitted to Toyota Prius cars.  Although the servos may no longer be as necessary as many new cars, use drive by wire for some of these function.

Monetization?

I could imagine quite a few ways to monetize self-driving, robotic cars:

  • License the service to the major auto and truck manufacturers around the world, with the additional hardware either supplied as a car/truck option (probably at first) or provided on all cars/trucks (probably a ways down the line).
  • Cars/trucks would need computer screens for the driving console as well as probably for entertainment.  Possibly advertisements on these screens could be used to offset some of the licensing/hardware costs.
  • Insurance companies may wish to subsidize the cost of the system.  Especially, if the cars could reduce accidents, it would then have a positive ROI, just for accident reduction alone, let alone saving lives.
  • In the car internet would need to be more available (see below). This would no doubt be based on 4G or whatever the next cellular technology comes along. Maybe the mobile phone companies would want to help subsidize this service, like they do for phones, if you had to sign a contract for a couple of years. I am thinking the detailed maps required for self-driving might require a more bandwidth than Google Maps does today, which could help chew up those bandwidth limits.
  • With all these sensors, it’s quite possible that self-driving cars, when being driven by humans, could be used to map new routes.  If you elected to provide these sorts of services then maybe one could also get something of a kickback.

I assume the robotic cars need Internet access but nothing I read says for sure. Maybe they could get by without Internet access if they just used manual driving mode for those sections of travel which lacked Internet  Perhaps, the cars could download the route before it went into self-driving mode and that way if you kept to the plan you would be ok.

Other uses of robotic cars

Of course with all these Internet enabled cars, tollways and city centers could readily establish new congestion based pricing.  Police could potentially override a car and cause it to pull over, automatically without the driver being able to stop it.  Traffic data would be much more available, more detailed, and more real time than it is already.  All these additional services could help to offset the cost of the HW and licensing of the self-driving service.

The original reason for the DARPA grand challenge was to provide a way to get troops and/or equipment from one place to another without soldiers having to drive the whole way there.  Today, this is still a dream but if self-driving cars become a reality in 5 years or so, I would think the DOD could have something deployed before then.

~~~~

If the self-driving car maps require more detailed information than today’s GPS maps, there’s probably a storage angle here both in car and at some centralized data center(s) located around a country.  If the cars could be also used to map new routes,  perhaps even a skosh more storage would be required in car.

Just imagine driving cross country and being able to sleep most of the way, all by yourself with your self-driving car.  Now if they could only make a port-a-potty that would fit inside a sedan I would be all set to go…, literally 🙂

Comments?

Image: Google streetview self-driving car by DoNotLick

 

Big science/big data ENCODE project decodes “Junk DNA”

Project ENCODE (ENCyclopedia of DNA Elements) results were recently announced. The ENCODE project was done by a consortium of over 400 researchers from 32 institutions and has deciphered the functionality of so called Junk DNA in the human genome. They have determined that junk DNA is actually used to regulate gene expression.  Or junk DNA is really on-off switches for protein encoding DNA.  ENCODE project results were published by Nature,  Scientific American, New York Times and others.

The paper in Nature ENCODE Explained is probably the best introduction to  the project. But probably the best resource on the project computational aspects comes from these papers at Nature, The making of ENCODE lessons for BIG data projects by Ewan Birney and ENCODE: the human encyclopedia by Brendan Maher.

I have been following the Bioinformatics/DNA scene for some time now. (Please see Genome Informatics …, DITS, Codons, & Chromozones …, DNA Computing …, DNA Computing … – part 2).  But this is perhaps the first time it has all come together to explain the architecture of DNA and potentially how it all works together to define a human.

Project ENCODE results

It seems like there were at least four major results from the project.

  • Junk DNA is actually programming for protein production in a cell.  Scientists previously estimated that <3% of human DNA’s over 3 billion base pairs encode for proteins.  Recent ENCODE results seem to indicate that at least 9% of this human DNA and potentially, as much as 50% provide regulation for when to use those protein encoding DNA.
  • Regulation DNA undergoes a lot of evolutionary drift. That is it seems to be heavily modified across species. For instance, protein encoding genes seem to be fairly static and differ very little between species. On the the other hand, regulating DNA varies widely between these very same species.  One downside to all this evolutionary variation is that regulatory DNA also seems to be the location for many inherited diseases.
  • Project ENCODE has further narrowed the “Known Unknowns” of human DNA.  For instance, about 80% of human DNA is transcribed by RNA. Which means on top of the <3% protein encoding DNA and ~9-50% regulation DNA already identified, there is another 68 to 27% of DNA that do something important to help cells transform DNA into life giving proteins. What that residual DNA does is TBD and is subject for the next phase of the ENCODE project (see below).
  • There are cell specific regulation DNA.  That is there are regulation DNA that are specifically activated if it’s bone cell, skin cell, liver cell, etc.  Such cell specific regulatory DNA helps to generate the cells necessary to create each of our organs and regulate their functions.  I suppose this was a foregone conclusion but it’s proven now

There are promoter regulatory DNA which are located ahead and in close proximity to the proteins that are being encoded and enhancer/inhibitor regulatory DNA which are located a far DNA distance away from the proteins they regulate.

I believe it seems that we are seeing two different evolutionary time frames being represented in the promoter vs. enhancer/inhibitor regulatory DNA.  Whereas promoter DNA seem closely associated with protein encoding DNA, the enhancer DNA seems more like patches or hacks that fixed problems in the original promoter-protein encoding DNA sequences, sort of like patch Tuesday DNA that fixes problems with the original regulation activity.

While I am excited about Project ENCODE results. I find the big science/big data aspects somewhat more interesting.

Genome Big Science/Big Data at work

Some stats from the ENCODE Project:

  • Almost 1650 experiments on around 180 cell types were conducted to generate data for the ENCODE project.   All told almost 12,000 files were analyzed from these experiments.
  • 15TB of data were used in the project
  • ENCODE project internal Wiki had 18.5K page edits and almost 250K page views.

With this much work going on around the world, data quality control was a necessary, ongoing consideration.   It took about half way into the project before they figured out  how to define and assess data quality from experiments.   What emerged from this was a set of published data standards (see data quality UCSC website) used to determine if experimental data were to be accepted or rejected as input to the project.  In the end the retrospectively applied the data quality standards to the earlier experiments and had to jettison some that were scientifically important but exhibited low data quality.

There was a separation between the data generation team (experimenters) and the data analysis team.  The data quality guidelines represented a key criteria that governed these two team interactions.

Apparently the real analysis began when they started layering the base level experiments on top of one another.  This layering activity led to researchers further identifying the interactions and associations between regulatory DNA and protein encoding DNA.

All the data from the ENCODE project has been released and are available to anyone interested. They also have provided a search and browser capability for the data. All this can be found on the top UCSC website. Further, from this same site one can download the software tools used to analyze, browse and search the data if necessary.

This multi-year project had an interesting management team that created a “spine of leadership”.  This team consisted of a few leading scientists and a few full time scientifically aware project officers that held the project together, pushed it along and over time delivered the results.

There were also a set of elaborate rules that were crafted so that all the institutions, researchers and management could interact without friction.  This included rules guiding data quality (discussed above), codes of conduct, data release process, etc.

What no Hadoop?

What I didn’t find was any details on the backend server, network or storage used by the project or the generic data analysis tools.  I suspect Hadoop, MapReduce, HBase, etc. were somehow involved but could find no reference to this.

I expected with the different experiments and wide variety of data fusion going on that there would be some MapReduce scripting that would transcribe the data so it could be further analyzed by other project tools.  Alas, I didn’t find any information about these tools in the 30+ research papers that were published in the last week or so.

It looks like the genomic analysis tools used in the ENCODE project are all open source. They useh the OpenHelix project deliverables.  But even a search of the project didn’t reveal any hadoop references.

~~~~

The ENCODE pilot project (2003-2007) cost ~$53M, the full ENCODE project’s recent results cost somewhere around $130M and they are now looking to the next stage of the ENCODE project estimated to cost ~$123M.  Of course there are 1000s of more human cell types that need to be examined and ~30% more DNA that needs to be figured out. But this all seems relatively straight forward now that the ENCODE project has laid out an architectural framework for human DNA.

Anyone out there that knows more about the data processing/data analytics side of the ENCODE project please drop me a line.  I would love to hear more about it or you can always comment here.

Comments?

Image: From Project Encode, Credits: Darryl Leja (NHGRI), Ian Dunham (EBI)

The end of NAND is near, maybe…

In honor of today’s Flash Summit conference, I give my semi-annual amateur view of competing NAND technologies.

I was talking with a major storage vendor today and they said they were sampling sub-20nm NAND chips with P/E cycles of 300 with a data retention period under a week at room temperatures. With those specifications these chips almost can’t get out of the factory with any life left in them.

On the other hand the only sub-20nm (19nm) NAND information I could find online were inside the new Toshiba THNSNF SSDs with toggle MLC NAND that guaranteed data retention of 3 months at 40°C.   I could not find any published P/E cycle specifications for the NAND in their drive but presumably this is at most equivalent to their prior generation 24 nm NAND or at worse somewhere below that generations P/E cycles. (Of course, I couldn’t find P/E cycle specifications for that drive either but similar technology in other drives seems to offer native 3000 P/E cycles.)

Intel-Micron, SanDisk and others have all recently announced 20nm MLC NAND chips with a P/E cycles around 3K to 5K.

Nevertheless, as NAND chips go beyond their rated P/E cycle quantities, NAND bit errors increase. With a more powerful ECC algorithm in SSDs and NAND controllers, one can still correct the data coming off the NAND chips.  However at some point beyond 24 bit ECC this probably becomes unsustainable. (See interesting post by NexGen on ECC capabilities as NAND die size shrinks).

Not sure how to bridge the gap between 3-5K P/E cycles and the 300 P/E cycles being seen by storage vendors above but this may be a function of prototype vs. production technology and possibly it had other characteristics they were interested in.

But given the declining endurance of NAND below 20nm, some industry players are investigating other solid state storage technologies to replace NAND, e.g.,  MRAM, FeRAM, PCM and ReRAM all of which are current contenders, at least from a research perspective.

MRAM is currently available in small capacities from Everspin and elsewhere but hasn’t really come up with similar densities on the order of today’s NAND technologies.

ReRAM is starting to emerge in low power applications as a substitute for SRAM/DRAM, but it’s still early yet.

I haven’t heard much about FeRAM other than last year researchers at Purdue having invented a new non-destructive read FeRAM they call FeTRAM.   Standard FeRAMs are already in commercial use, albeit in limited applications from Ramtron and others but density is still a hurdle and write performance is a problem.

Recently the PCM approach has heated up as PCM technology is now commercially available being released by Micro.  Yes the technology has a long way to go to catch up with NAND densities (available at 45nm technology) but it’s yet another start down a technology pathway to build volume and research ways to reduce cost, increase density and generally improve the technology.  In the mean time I hear it’s an order of magnitude faster than NAND.

Racetrack memory, a form of MRAM using wires to store multiple bits, isn’t standing still either.  Last December, IBM announced they have demonstrated  Racetrack memory chips in their labs.  With this milestone IBM has shown how a complete Racetrack memory chip could be fabricated on a CMOS technology lines.

However, in the same press release from IBM on recent research results, they announced a new technique to construct CMOS compatible graphene devices on a chip.  As we have previously reported, another approach to replacing standard NAND technology  uses graphene transistors to replace the storage layer of NAND flash.  Graphene NAND holds the promise of increasing density with much better endurance, retention and reliability than today’s NAND.

So as of today, NAND is still the king of solid state storage technologies but there are a number of princelings and other emerging pretenders, all vying for its throne of tomorrow.

Comments?

Image: 20 nanometer NAND Flash chip by IntelFreePress

New wireless technology augmenting data center cabling

1906 Patent for Wireless Telegraphy by Wesley Fryer (cc) (from Flickr)
1906 Patent for Wireless Telegraphy by Wesley Fryer (cc) (from Flickr)

I read a report today in Technology Review about how Bouncing data would speed up data centers, which talked about using wireless technology and special ceiling tiles to create dedicated data links between servers.  The wireless signal was in the 60Ghz range and would yield something on the order of couple of Gb per second.

The cable mess

Wireless could solve a problem evident to anyone that has looked under data center floor tiles today – cabling.  Underneath our data centers today there is a spaghetti-like labyrinth of cables connecting servers to switches to storage and back again.  The amount of cables underneath some data centers is so deep and impenetrable that some shops don’t even try to extract old cables when replacing equipment just leaving them in place and layering on new ones as the need arises.

Bouncing data around a data center

The nice thing about the new wireless technology is that you can easily set up a link between two servers (or servers and switches) by just properly positioning antenna and ceiling tiles, without needing any cables.  However, in order to increase bandwidth and reduce interference the signal has to be narrowly focused which makes the technology point-to-point, requiring line of sight between the end points.   But with signal bouncing ceiling tiles, a “line-of-sight” pathway could readily be created around the data center.

This could easily be accomplished by different shaped ceiling tiles such as pyramids, flat panels, or other geometric configurations that would guide the radio signal to the correct transceiver.

I see it all now, the data center of the future would have its ceiling studded with geometrical figures protruding below the tiles, providing wave guides for wireless data paths, routing the signals around obstacles to its final destination.

Probably other questions remain.

  • It appears the technology can only support 4 channels per stream.  Which means it might not scale up to much beyond current speeds.
  • Electromagnetic radiation is something most IT equipment tries to eliminate rather than transmit.  Having something generate and receive radio waves in a data center may require different equipment regulations and having those types of signals bouncing around a data center may make proper shielding more of a concern..
  • Signaling interference is a real problem which might make routing these signals even more of a problem than routing cables.  Which is why I believe they need  some sort of multi-directional wireless switching equipment might help.

In the report, there wasn’t any discussion as to the energy costs of the wireless technology and that may be another issue to consider. However, any reduction in cabling can only help IT labor costs which are a major factor in today’s data center economics.

~~~~

It’s just in investigation stages now but Intel, IBM and others are certainly thinking about how wireless technology could help the data centers of tomorrow reduce costs, clutter and cables.

All this gives a whole new meaning to top of rack switching.

Comments?