And of course the buildings going up at Ground Zero are all smart buildings as well, containing sensors embedded in the structure, the infrastructure, and anywhere else that matters.
But what does this mean in terms of data?
Data requirements will explode as the smart home and other sensor clouds build out. For example, even if a smart thermostat only issues a message every 15 minutes and the message is only 256 bytes, the data from the 130 million households in the US alone would be an additional ~3.2TB/day. And that’s just one sensor per household.
If you add the smart power meter, lawn sensor, intrusion/fire/chemical sensor, and god forbid, the refrigerator and freezer product sensors to the mix, that's another 16TB/day of incoming data.
And that’s just assuming a 256 byte payload per sensor every 15 minutes. The intrusion sensors could easily be a combination of multiple, real time exterior video feeds as well as multi-point intrusion/motion/fire/chemical sensors which would generate much, much more data.
But we still have smart roads/bridges, smart cars/trucks, smart skyscrapers, smart port facilities, smart railroads, smart boats/ferries, etc. to come. I could go on but the list seems long enough already. Each of these could generate another ~19TB/day data stream, if not more. Some of these infrastructure entities/devices are much more complex than a house, and there are a lot more cars on the road than houses in the US.
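For anyone who wants to check the arithmetic, here's a quick Python sketch using the assumptions above (256-byte payload, one message every 15 minutes, 130 million US households):

```python
# Back-of-the-envelope sensor-cloud data sizing, using the post's assumptions.
PAYLOAD_BYTES = 256
MSGS_PER_DAY = 24 * 60 // 15          # one message every 15 minutes = 96/day
HOUSEHOLDS = 130_000_000              # US households

def daily_bytes(sensors_per_household: int) -> float:
    """Total bytes/day across all households."""
    return PAYLOAD_BYTES * MSGS_PER_DAY * HOUSEHOLDS * sensors_per_household

one_sensor_tb = daily_bytes(1) / 1e12   # thermostat only: ~3.2 TB/day
six_sensor_tb = daily_bytes(6) / 1e12   # thermostat + 5 more sensors: ~19 TB/day
print(f"{one_sensor_tb:.1f} TB/day, {six_sensor_tb:.1f} TB/day")
```

Bump sensors_per_household (or the payload size, for video feeds) and the totals grow accordingly.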
It’s great to be in the (cloud) storage business
All that data has to be stored somewhere and that place is going to be the cloud. The Honeywell smart thermostat uses Opower’s cloud storage and computing infrastructure specifically designed to support better power management for heating and cooling the home. Following this approach, it’s certainly feasible that more cloud services would come online to support each of the smart entities discussed above.
Naturally, using this data to provide real time understanding of the infrastructure they monitor will require big data analytics. Hadoop and its counterparts are the only platforms around today that are up to this task.
So cloud computing, cloud storage, and big data analytics have yet another part to play. This time in the upcoming sensor cloud that will envelop the world and all of its infrastructure.
I don't know about you, but a 4TB disk drive for a desktop seems about as much as I could ever use. But looking seriously at my desktop environment, my CAGR for storage (measured as fully compressed TAR files) is ~61% year over year. At that rate, I will need a 4TB drive for backup purposes in about 7 years, and if I assume a 2X compression rate, a 4TB desktop drive for the uncompressed data will be needed about a year and a half sooner, in ~5.5 years (darn music, movies, photos, …). And we are not heavy digital media consumers; others who shoot and edit their own video probably use orders of magnitude more storage.
Hard to believe, but given current trends, it's inevitable that a 4TB disk drive will become a necessity for us within the next 6 years.
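As a sketch, the projection looks like this; note the starting size is an assumption back-solved from the 7-year, 4TB backup figure:

```python
import math

# Sketch of the drive-capacity projection. The starting size is an assumption
# back-solved from the post's figure of 4TB reached in ~7 years at 61%/yr.
growth = 0.61
start_tb = 4 / (1 + growth) ** 7        # ~0.14TB of compressed backups today

def years_to_reach(target_tb, current_tb, rate):
    """Years until current_tb grows to target_tb at the given annual rate."""
    return math.log(target_tb / current_tb) / math.log(1 + rate)

# Uncompressed data is 2x the compressed backup size, so it hits 4TB sooner.
uncompressed_years = years_to_reach(4, 2 * start_tb, growth)
print(f"~{uncompressed_years:.1f} years")
```

Note that doubling the data only buys back log(2)/log(1.61), about a year and a half, rather than halving the horizon.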
It appears that the system uses 200K disk drives to support the 120PB of storage. The disk drives are packed in a new wider rack and are water cooled. According to the news report the new wider drive trays hold more drives than current drive trays available on the market.
For instance, HP has a hot pluggable, 100 SFF (small form factor 2.5″) disk enclosure that sits in 3U of standard rack space. 200K SFF disks would take up about 154 full racks, not counting the interconnect switching that would be required. Unclear whether water cooling would increase the density much but I suppose a wider tray with special cooling might get you more drives per floor tile.
There was no mention of interconnect, but today's drives use either SAS or SATA. SAS interconnects for 200K drives would require many separate SAS busses. With SAS expanders addressing 255 drives or other expanders, one would need at least 4 SAS busses, but this would put up to ~64K drives on a bus and would not perform well. Something more like 64-128 drives per bus would perform much better, and each drive would need dual pathing; if we use 100 drives per SAS string, that's 2000 SAS drive strings, or at least 4000 SAS busses (for dual-port access to the drives).
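The string and bus counts are simple division:

```python
# SAS fan-out sketch using the post's numbers: ~100 drives per SAS string,
# dual-ported drives (so two busses per string).
DRIVES = 200_000
DRIVES_PER_STRING = 100

strings = DRIVES // DRIVES_PER_STRING   # 2000 SAS drive strings
busses = strings * 2                    # 4000 busses for dual-path access
print(strings, busses)
```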
Shared storage cluster – where GPFS front end nodes access shared storage across the backend, generally SAN storage system(s). But given the requirements for high density, it doesn't seem likely that the 120PB storage system uses SAN storage in the backend.
Networked based cluster – here the GPFS front end nodes talk over a LAN to a cluster of NSD (network shared disk) servers which can have access to all or some of the storage. My guess is this is what will be used in the 120PB storage system.
Shared Network based clusters – this looks just like a bunch of NSD servers but provides access across multiple NSD clusters.
Given the above, ~100 drives per NSD server means an extra 1U per 100 drives, or (given HP's drive density) 4U per 100 drives, which works out to 1000 drives and 10 NSD servers per 40U rack (not counting switching). At this density it takes ~200 racks for 120PB of raw storage and NSD nodes, or 2000 NSD nodes in total.
Unclear how many GPFS front end nodes would be needed on top of this but even if it were 1 GPFS frontend node for every 5 NSD nodes, we are talking another 400 GPFS frontend nodes and at 1U per server, another 10 racks or so (not counting switching).
If my calculations are correct we are talking over 210 racks with switching thrown in to support the storage. According to IBM’s discussion on the Storage challenges for petascale systems, it probably provides ~6TB/sec of data transfer which should be easy with 200K disks but may require even more SAS busses (maybe ~10K vs. the 2K discussed above).
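Putting those density assumptions together in a quick sketch:

```python
# Rack-count sketch from the post's density assumptions: 100 SFF drives per
# 3U enclosure plus a 1U NSD server per enclosure, packed into 40U racks.
DRIVES = 200_000
DRIVES_PER_ENCLOSURE = 100
U_PER_GROUP = 3 + 1                           # 3U enclosure + 1U NSD server
RACK_U = 40

enclosures = DRIVES // DRIVES_PER_ENCLOSURE   # 2000 enclosures = 2000 NSD nodes
groups_per_rack = RACK_U // U_PER_GROUP       # 10 groups = 1000 drives per rack
racks = enclosures // groups_per_rack         # 200 racks, before switching
gpfs_nodes = enclosures // 5                  # 1 GPFS front end per 5 NSD nodes
print(racks, enclosures, gpfs_nodes)
```

Add ~10 racks for the 400 GPFS front end nodes plus switching and you land at the 210+ racks mentioned above.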
IBM GPFS is used behind the scenes in IBM's commercial SONAS storage system but has been around as a cluster file system designed for HPC environments for over 15 years now.
Given this many disk drives something needs to be done about protecting against drive failure. IBM has been talking about declustered RAID algorithms for their next generation HPC storage system which spreads the parity across more disks and as such, speeds up rebuild time at the cost of reducing effective capacity. There was no mention of effective capacity in the report but this would be a reasonable tradeoff. A 200K drive storage system should have a drive failure every 10 hours, on average (assuming a 2 million hour MTBF). Let’s hope they get drive rebuild time down much below that.
The system is expected to hold around a trillion files. Not sure but even at 1024 bytes of metadata per file, this number of files would chew up ~1PB of metadata storage space.
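Both of those figures are quick one-liners to verify:

```python
# Two quick checks from the post: mean time between drive failures across the
# whole population, and metadata footprint for a trillion files.
MTBF_HOURS = 2_000_000
DRIVES = 200_000
hours_between_failures = MTBF_HOURS / DRIVES          # 10 hours, on average

FILES = 10**12                                        # a trillion files
METADATA_BYTES_PER_FILE = 1024
metadata_pb = FILES * METADATA_BYTES_PER_FILE / 10**15   # ~1PB of metadata
print(hours_between_failures, metadata_pb)
```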
GPFS provides ILM (information life cycle management, or data placement based on information attributes) using automated policies and supports external storage pools outside the GPFS cluster storage. ILM within the GPFS cluster supports file placement across different tiers of storage.
All the discussion up to now revolved around homogeneous backend storage but it’s quite possible that multiple storage tiers could also be used. For example, a high density but slower storage tier could be combined with a low density but faster storage tier to provide a more cost effective storage system. Although, it’s unclear whether the application (real world modeling) could readily utilize this sort of storage architecture nor whether they would care about system cost.
Nonetheless, presumably an external storage pool would be a useful adjunct to any 120PB storage system for HPC applications.
Can it be done?
Let's see: 400 GPFS nodes, 2000 NSD nodes, and 200K drives. The hardware seems readily doable (not sure why they needed water cooling, but hopefully they obtained better drive density that way).
It would seem that a 20X multiplier times a current Isilon cluster or even a 10X multiple of a currently supported SONAS system would take some software effort to work together, but seems entirely within reason.
Of course, IBM Almaden is working on a project to support Hadoop over GPFS which might not be optimum for real world modeling but would nonetheless support the node count being talked about here.
I wish there were some real technical information on the project out on the web but I could not find any. Much of this is informed conjecture based on current GPFS system and storage hardware capabilities. But hopefully, I haven't traveled too far astray.
There was some twitter traffic yesterday on how Facebook was locked into using MySQL (see article here) and as such, was having to shard their MySQL database across 1000s of database partitions and memcached servers in order to keep up with the processing load.
The article indicated that this was painful, costly and time consuming. Also they said Facebook would be better served moving to something else. One answer was to replace MySQL with recently emerging, NewSQL database technology.
One problem with old SQL database systems is they were never architected to scale beyond a single server. As such, multi-server transactional operations were always a short-term fix bolted onto the underlying system, not a design goal. Sharding emerged as one way to distribute the data across multiple RDBMS servers.
Relational database tables are sharded by partitioning them via a key. By hashing this key one can partition a busy table across a number of servers and use the hash function to lookup where to process/access table data. An alternative to hashing is to use a search lookup function to determine which server has the table data you need and process it there.
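A minimal sketch of the hashing approach (the server names are made up for illustration; this is not any particular product's implementation):

```python
import hashlib

# Minimal hash-sharding sketch: a stable hash of the shard key picks which
# database server owns the row, so every client maps a key the same way.
SERVERS = ["db0", "db1", "db2", "db3"]   # hypothetical shard servers

def shard_for(key: str) -> str:
    """Return the server that owns all rows for this shard key."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

# All rows sharing a key land on one server; joins across keys may not.
print(shard_for("user:1001"), shard_for("user:1002"))
```

The search-lookup alternative just replaces the hash with a directory table mapping keys (or key ranges) to servers.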
In any case, sharding causes a number of new problems. Namely,
Cross-shard joins – anytime you need data from more than one shard server you lose the advantages of distributing data across nodes. Thus, cross-shard joins need to be avoided to retain performance.
Load balancing shards – to spread workload you need to split the data by processing activity. But knowing ahead of time what the table processing will look like is hard, and one week's processing may vary considerably from the next week's load. As such, it's hard to load balance shard servers.
Non-consistent shards – by spreading transactions across multiple database servers and partitions, transactional consistency can no longer be guaranteed. While for some applications this may not be a concern, traditional RDBMS activity is consistent.
These are just some of the issues with sharding and I am certain there are more.
What about Hadoop projects and its alternatives?
One possibility is to use Hadoop and its distributed database solutions. However, Hadoop systems were not intended for transaction processing. Nonetheless, Cassandra and HyperTable (see my post on Hadoop – Part 2) can be used for transaction processing, and at least Cassandra can be tailored to any consistency level. But neither Cassandra nor HyperTable is really meant to support high throughput, consistent transaction processing.
The other, non-Hadoop distributed database solutions support data analytics, and most are not positioned as transaction processing systems (see Big Data – Part 3). Teradata might be considered the lone exception here; it can be a very capable transaction oriented database system in addition to its data warehouse operations. But it's probably not widely distributed or scaleable above a certain threshold.
The problems with most of the Hadoop and non-Hadoop systems above mainly revolve around the lack of support for ACID transactions, i.e., atomic, consistent, isolated, and durable transaction processing. In fact, most of the above solutions relax one or more of these characteristics to provide a scaleable transaction processing model.
NewSQL to the rescue
There are some new emerging database systems that are designed from the ground up to operate in distributed environments called “NewSQL” databases. Specifically,
Clustrix – is a MySQL compatible replacement, delivered as a hardware appliance that can be distributed across a number of nodes that retains fully ACID transaction compliance.
NimbusDB – is a client-cloud based SQL service which distributes copies of data across multiple nodes and offers a majority of SQL99 standard services.
VoltDB – is a fully SQL compatible, ACID compliant, distributed, in-memory database system offered as a software only solution executing on 64bit CentOS system but is compatible with any POSIX-compliant, 64bit Linux platform.
Xeround – is a cloud based, MySQL compatible replacement delivered as a (Amazon, Rackspace and others) service offering that provides ACID compliant transaction processing across distributed nodes.
I might be missing some, but these seem to be the main ones today. All the above seem to take a different tack to offer distributed SQL services. Some of the above relax ACID compliance in order to offer distributed services. But for all of them distributed scale out performance is key and they all offer purpose built, distributed transactional relational database services.
RDBMS technology has evolved over the last four decades and has had at least ~35 years of running major transactional systems. But today's hardware architectures together with web scale performance requirements stretch these systems beyond their original design envelope. As such, NewSQL database systems have emerged to replace old SQL technology with a new, intrinsically distributed system architecture, providing high performing, scaleable transactional database services for today and the foreseeable future.
To partition this space just a bit, there is unstructured data analysis and structured data analysis. Hadoop is used to analyze unstructured data (although Hadoop is also used to parse and structure that data).
On the other hand, for structured data there are a number of other options currently available. Namely:
EMC Greenplum – a relational database that is available as software only and now as a hardware appliance as well. Greenplum supports both row and column oriented data structuring and has support for policy based data placement across multiple storage tiers. There is a packaged solution that consists of Greenplum software and a Hadoop distribution running on a Greenplum appliance.
HP Vertica – a column oriented, relational database that is available currently in a software only distribution. Vertica supports aggressive data compression and provides high throughput query performance. They were early supporters of Hadoop integration providing Hadoop MapReduce and Pig API connectors to provide Hadoop access to data in Vertica databases and job scheduling integration.
IBM Netezza – a relational database system based on a proprietary hardware analysis engine configured in a blade system. Netezza is the second oldest solution on this list (see Teradata for the oldest). Since the acquisition by IBM, Netezza provides its highest performing solution on IBM blade hardware, but all of its systems depend on purpose built FPGA chips designed to perform high speed queries across relational data. Netezza has a number of partners and/or homegrown solutions that provide specialized analysis for specific verticals such as retail, telcom, finserv, and others. Also, Netezza provides tight integration with various Oracle functionality, but there doesn't appear to be much direct integration with Hadoop on their website.
ParAccel – a column based, relational database that is available in a software only solution. ParAccel offers a number of storage deployment options including an all in-memory database, DAS database or SSD database. In addition, ParAccel offers a Blended Scan approach providing a two tier database structure with DAS and SAN storage. There appears to be some integration with Hadoop indicating that data stored in HDFS and structured by MapReduce can be loaded and analyzed by ParAccel.
Teradata – a relational database based on proprietary, purpose built appliance hardware. Teradata recently came out with an all-SSD solution which provides very high performance for database queries. The company was started in 1979, has been very successful in the retail, telcom and finserv verticals, and offers a number of special purpose applications supporting data analysis for these and other verticals. There appears to be some integration with Hadoop but it's not prominent on their website.
Probably missing a few other solutions but these appear to be the main ones at the moment.
In any case, both Hadoop and most of its software-only, structured competition are based on a massively parallelized, shared-nothing set of Linux servers. The two hardware based solutions listed above (Teradata and Netezza) also operate in a massively parallel processing mode to load and analyze data. Such solutions provide scale-out performance at a reasonable cost to support very large databases (PBs of data).
Now that EMC owns Greenplum and HP owns Vertica, we are likely to see more appliance based packaging options for both of these offerings. EMC has taken the lead here and has already announced Greenplum specific appliance packages.
One lingering question about these solutions is why customers don't use current traditional database systems (Oracle, DB2, Postgres, MySQL) to do this analysis. The answer seems to lie in the fact that these traditional solutions are not massively parallelized. Thus, doing this analysis on TBs or PBs of data would take too long. Moreover, the cost to support data analysis with traditional database solutions over PBs of data would be prohibitive. For these reasons, and the fact that compute power has become so cheap nowadays, structured data analytics for large databases has migrated to these special purpose, massively parallelized solutions.
I was talking with another analyst the other day by the name of John Koller of Kai Consulting who specializes in the medical space and he was talking about the rise of electronic pathology (e-pathology). I hadn’t heard about this one.
He said that just like radiology had done in the recent past, pathology investigations are moving to make use of digital formats.
What does that mean?
The biopsies taken today for cancer and disease diagnosis, which involve one or more specimens of tissue examined under a microscope, will now be digitized, and the digital files will be inspected instead of the original slide.
Apparently microscopic examinations typically use a 1×3 inch slide that can have the whole slide devoted to some tissue matter. To do a pathological examination, one has to digitize the whole slide, under magnification, at various depths within the tissue. According to Koller, any tissue is essentially a 3D structure, and pathological exams must inspect different depths (slices) within the sample to form a diagnosis.
I was struck by the need for different slices of the same specimen. I hadn’t anticipated that but whenever I look in a microscope, I am always adjusting the focal length, showing different depths within the slide. So it makes sense, if you want to understand the pathology of a tissue sample, multiple views (or slices) at different depths are a necessity.
So what does a slide take in storage capacity?
Koller said, an uncompressed, full slide will take about 300GB of space. However, with compression and the fact that most often the slide is not completely used, a more typical space consumption would be on the order of 3 to 5GB per specimen.
As for volume, Koller indicated that a medium hospital facility (~300 beds) typically does around 30K radiological studies a year but about 10X that in pathological studies. So at 300K pathological examinations done a year, we are talking about 900TB to 1.5PB of digitized specimen images a year for a mid-sized hospital.
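A quick back-of-the-envelope on those numbers, for anyone who wants to adjust the assumptions for their own facility:

```python
# Annual e-pathology storage for a mid-sized (~300 bed) hospital, using the
# post's figures: ~300K pathology studies/year at 3-5GB per specimen.
STUDIES_PER_YEAR = 300_000
low_tb = STUDIES_PER_YEAR * 3 / 1000     # 3GB/specimen -> 900 TB/year
high_tb = STUDIES_PER_YEAR * 5 / 1000    # 5GB/specimen -> 1500 TB (1.5PB)/year
print(low_tb, high_tb)
```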
Why move to e-pathology?
It can open up a whole myriad of telemedicine offerings similar to the radiological study services currently available around the globe. Today, non-electronic pathology involves sending specimens off to a local lab and examination by medical technicians under microscope. But with e-pathology, the specimen gets digitized (where, the hospital, the lab, ?) and then the digital files can be sent anywhere around the world, wherever someone is qualified and available to scrutinize them.
At a recent analyst event we were discussing big data, and aside from the analytics component and other markets, the vendor mentioned that content archives are starting to explode. Given where e-pathology is heading, I can understand why.
I was at another conference the other day where someone showed a chart that said the world will create 35ZB (10**21) of data and content in 2020 from 800EB (10**18) in 2009.
Every time I see something like this I cringe. Yes, lots of data is being created today, but what does that tell us about corporate data growth? Not much, I'd wager.
That being said, I have a couple of questions I would ask of the people who estimated this:
How much is personal data and how much is corporate data?
Did you factor in how entertainment data growth rates will change over time?
These two questions are crucial.
Entertainment dominates data growth
Just as personal entertainment is becoming the major consumer of national bandwidth (see study [requires login]), it’s clear to me that the majority of the data being created today is for personal consumption/entertainment – video, music, and image files.
I look at my own office: our corporate data (office files, PDFs, text, etc.) represents ~14% of the data we keep. Images, music, video, and audio take up the remainder of our data footprint. Is this data growing? Yes, faster than I would like, but the corporate data is only averaging ~30% YoY growth while the overall data growth for our shop is averaging ~116% YoY. [As I interrupt this activity to load up another 3.3GB of photos and videos from our camera]
Moreover, although some media content is of significant external interest to select companies today (Media and Entertainment, social media photo/video sharing sites, mapping/satellite, healthcare, etc.), most corporations don't deal with lots of video, music, or audio data. Thus, I personally see 30% as a more realistic growth rate for corporate data than 116%.
Will entertainment data growth flatten?
Will we see a drop in the entertainment data growth rates over time? Undoubtedly.
Two factors will reduce the growth of this data.
What happens to entertainment data recording formats? I believe media recording formats are starting to level out. The issue here is one of fidelity to nature: how closely a digital representation matches reality as we perceive it. For example, most digital projection systems in movie theaters today run from ~2 to 8TB per feature length motion picture, which seems to indicate that at some point further gains in fidelity (more pixels per frame) may not be worth it. Similar issues will ultimately lead to a slowing down of other media encoding formats.
When will all the people that can create content be doing so? Recent data indicates that more than 2B people will be on the internet this year, or ~28% of the world's population. But at some point we must reach saturation on internet penetration, and when that happens data growth rates should also start to level out. Let's say, for argument's sake, that the 800EB in 2009 was correct and that there were 1.5B internet users in 2009. That works out to a data and content footprint of ~0.5TB per internet user, which seems high but certainly doable.
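The per-user arithmetic:

```python
# Per-user footprint implied by the 2009 estimate: 800EB of data and content
# spread across ~1.5B internet users.
total_eb = 800
users = 1.5e9
tb_per_user = total_eb * 10**6 / users   # 1 EB = 10**6 TB
print(f"{tb_per_user:.2f} TB/user")      # ~0.5TB per internet user
```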
Once these two factors level off, we should see world data and content growth rates plummet. Nonetheless, internet user population growth could be driving data growth rates for some time to come.
The scary part is that 35ZB represents only a ~41% compound annual growth rate over the period against the baseline 2009 data and content creation levels.
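Checking that growth rate:

```python
# Implied compound annual growth rate from 800EB (2009) to 35ZB (2020).
start_eb = 800
end_eb = 35_000        # 35ZB expressed in EB
years = 2020 - 2009

cagr = (end_eb / start_eb) ** (1 / years) - 1
print(f"{cagr:.0%}")   # ~41% per year
```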
But I must assume this estimate doesn’t consider much growth in digital creators of content, otherwise these numbers should go up substantially. In the last week, I ran across someone who said there would be 6B internet users by the end of the decade (can’t seem to recall where, but it was a TEDx video). I find that a little hard to believe but this was based on the assumption that most people will have smart phones with cellular data plans by that time. If that be the case, 35ZB seems awfully short of the mark.
A previous post blows this discussion completely away with just one application (see Yottabytes by 2015 for the NSA; a Yottabyte (YB) is 10**24 bytes of data), and I had already discussed an Exabyte-a-day and 3.3 Exabytes-a-day in prior posts. [Note, those YB by 2015 are all audio (phone) recordings, but if we start using Skype Video, FaceTime, and other video communications technologies, can Nonabytes (10**27) be far behind… BOOM!]
I started out thinking that 35ZB by 2020 wasn’t pertinent to corporate considerations and figured things had to flatten out, then convinced myself that it wasn’t large enough to accommodate internet user growth, and then finally recalled prior posts that put all this into even more perspective.
I have had this conversation before (and have blogged about it with Crowdsourcing business analyst …) where there is lots of time and effort (person years?) devoted to scheduling one-on-one meetings between analyst firms and corporate executives. I may be repeating my earlier post but the problem persists and I see an obvious easier way to solve this.
Auction off 1 on 1 time slots
By doing this the company puts the burden on the analyst community: give every firm some amount of "analyst bucks" (A$) and then auction off executive meeting slots. In this way the crowd of analysts would determine who best to meet with whom (putting crowdsourcing to work).
Consider today’s solution:
Send out a list of topics to be discussed at the meeting,
Have the analyst firm select their top 3 or 5 topics, and
Have analyst relations sift the requests and executive availability to schedule the meetings.
For analyst events with 100s of analyst firms, 20 or more executives, and 10 or more time slots, the scheduling activity can become quite complex and time consuming.
I understand a corporation’s need to make the most effective use of analysts and executive management time, but what better way to make this determination than to let the (analyst) market decide.
How an executive 1 on 1 auction could work
The way I see it is to hold some sort of Dutch or Japanese auction (see the Wikipedia auction article) where all the analyst firm representatives attend a WebEx session and bid for 1 on 1 time slots with various executives. In this fashion the company could have the whole schedule laid out in a single day, with the only effort being to identify executives and time slots and to supply A$ to the analyst firms.
It doesn't even need to be that sophisticated and potentially could be done on eBay with real money supplied by the company (usable only for bidding in executive time slot auctions) and donated to charity when the process is finished. There are any number of ways to do this on the quick and cheap. However, eBay may be a bit too public; doing this over a conference call with WebEx would probably suffice just as well and could be totally private.
Of course, with this approach the company may find that there are some executives in higher demand than others. If so, perhaps a secondary auction could be held with more of their time slots. Ditto for executives whose time slots are not in demand: they could be released from providing time for 1 on 1 meetings.
In my prior post I mentioned the option that the corporation might want more control over who meets whom. In that case, allocating some A$ to the corporate executives (or A/R as their proxy) to augment analyst firm bids might do the trick. Of course, providing certain firms more A$ would also give them preferential access. Obviously, this wouldn't provide as much absolute control as spending person years of effort on 1 on 1 scheduling, but it would provide a quick and relatively easy solution for both the analyst firms and analyst relations.
But how much to grant to each analyst firm?
The critical question is the amount of A$ to provide each firm. This might take some thought, but there is an easy solution: just use last year's analyst spend as the amount of A$ to provide the firm. Another option is to provide some base level of analyst bucks to any firm invited to attend and then add more based on the prior year's spend.
Possibly, a less appealing approach (to me at least) is to give each analyst firm an amount proportional to their annual revenue regardless of company spending with the firm. But perhaps some combination of the above, say
1/3 base amount for any invitee + 1/3 proportional to annual spend +1/3 proportional to annual firm revenue = A$s
In my previous post I suggested so many A$ per analyst; as such, bigger firms with more analysts would get more than firms with fewer analysts. But the formula described above makes more sense to me.
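A sketch of how the blended allocation could work (firm names and figures are made up for illustration):

```python
# Sketch of the blended analyst-buck formula above: a third of the A$ pool
# split equally among invitees, a third proportional to last year's spend,
# a third proportional to firm revenue. All inputs are hypothetical examples.
def allocate(firms, pool):
    """firms: {name: (annual_spend, annual_revenue)}; returns {name: A$}."""
    n = len(firms)
    spend_total = sum(s for s, _ in firms.values())
    rev_total = sum(r for _, r in firms.values())
    third = pool / 3
    return {
        name: third / n                     # equal base amount for any invitee
        + third * spend / spend_total       # proportional to last year's spend
        + third * rev / rev_total           # proportional to firm revenue
        for name, (spend, rev) in firms.items()
    }

budgets = allocate({"FirmA": (100, 50), "FirmB": (50, 150)}, pool=300)
print(budgets)   # the per-firm budgets always sum back to the A$ pool
```

Note the weights conserve the pool regardless of the inputs, so the company only has to decide the total A$ to put in play.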
Information provided to facilitate the 1 on 1’s auction
In order for the auction to work well, analyst firms would need to know something about the executives whose time is being auctioned off; beyond that, just a schedule of the available time slots would allow the auction to work. Some idea of the company's org chart and where each executive fits in would also be very useful to facilitate the auction.
That’s it, pretty simple, set up a conference call, send out executive information and org chart, allocate analyst bucks and let the bidding begin.
Auctioning off Lot-132: 30 minutes of Ray Lucchesi’s time …, let the bidding begin.