Western Digital at SFD15: ActiveScale object storage

Phil Bullinger and his staff from Western Digital presented at Storage Field Day 15 (SFD15) on a number of their enterprise products, including Tegile IntelliFlash, but the one that caught my interest was their ActiveScale object store, which came with their Amplidata acquisition back in 2015.

ActiveScale is an on-premises object storage system that provides cloud-like economics for customer data.

ActiveScale Hardware

ActiveScale systems can both scale up and scale out within a single site. ActiveScale systems have both storage nodes and system nodes. Storage nodes perform erasure coding, while system nodes are the control points and metadata managers for the object store.

ActiveScale comes in two appliance configurations, each containing the system nodes, storage nodes and disk storage required. The two appliances are:

  • ActiveScale P100 is a 7U, 720TB pod system. A full rack of P100s can read 8GB/sec and can offer 17-9s data availability. The P100 can scale up to 2.1PB in a single rack and up to 18PB in the same namespace. The P100 is the higher performing solution, with better performing storage and system nodes.
  • ActiveScale X100 is a 42U, rack-scale solution that holds up to 588 12TB drives or 5.8PB per rack. The X100 can scale up to 9 racks, or 52PB, in the same namespace. The X100 is a denser configuration with only 6 storage nodes and, as such, has a better $/GB than the P100 above.

As WDC is both the supplier of the ActiveScale appliance and a supplier of disk storage, they can be fairly aggressive with pricing on appliance systems.

Data integrity in ActiveScale

They make a point of saying that ActiveScale object metadata and data are stored separately. By separating data and metadata, they claim to be more resilient to system failures. Object metadata is 3-way replicated, in a replicated database residing on the system nodes. Other object systems often store metadata together with the object data itself.

Object data can be erasure coded. That is, object data is chunked, erasure coding protected and then spread across multiple disk drives for data protection. ActiveScale erasure coding is called BitSpread. With BitSpread customers identify the number of disk drives to spread object data across and the number of drive failures the system should recover from without data loss.

A typical BitSpread configuration splits object data into 18 chunks and spreads these chunks across a storage column. A storage column consists of from 6 to 18 storage nodes. There’s no pre-allocated space in BitSpread; object data chunks are allocated to disk storage based on the current capacity and performance of the system, within redundancy constraints.
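WDC didn’t go into the placement algorithm itself, but a minimal sketch of what a capacity-aware spread policy like this could look like (node names, free-capacity figures and the per-node cap below are my own assumptions, not WDC’s implementation) might be:

```python
from dataclasses import dataclass

@dataclass
class Drive:
    node: str        # storage node this drive belongs to (assumed layout)
    drive_id: str
    free_gb: float

def place_chunks(drives, n_chunks=18, max_chunk_losses=5):
    """Spread n_chunks across distinct drives, preferring the emptiest
    drives, while capping chunks per storage node so that losing any single
    node never costs more chunks than the (assumed) code can tolerate."""
    per_node_cap = max_chunk_losses
    placement, per_node = [], {}
    for drive in sorted(drives, key=lambda d: -d.free_gb):   # emptiest first
        if per_node.get(drive.node, 0) >= per_node_cap:
            continue
        placement.append(drive)
        per_node[drive.node] = per_node.get(drive.node, 0) + 1
        if len(placement) == n_chunks:
            return placement
    raise RuntimeError("not enough drives/nodes to satisfy the spread policy")

# toy usage: 6 storage nodes with 4 drives each
drives = [Drive(f"node{n}", f"n{n}d{d}", free_gb=1000 - 10 * d)
          for n in range(6) for d in range(4)]
for drive in place_chunks(drives):
    print(drive.node, drive.drive_id)
```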

In addition, ActiveScale has a background task called BitDynamics that scans  erasure coded chunks and does a mathematical health check on the object data. If a chunk is bad, the object data chunk can be recovered and re-erasure coded back to proper health.
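The actual math behind BitDynamics’ health check wasn’t detailed, but conceptually it’s a background scrubber. A toy sketch, using a plain digest comparison as a stand-in for the real erasure-code health check, would be something like:

```python
import hashlib

def verify_chunks(chunks):
    """Background integrity pass: recompute each chunk's digest and flag
    any that no longer match the checksum recorded at write time.
    Flagged chunks would then be recovered and re-erasure coded."""
    bad = []
    for chunk_id, (data, recorded_digest) in chunks.items():
        if hashlib.sha256(data).hexdigest() != recorded_digest:
            bad.append(chunk_id)
    return bad

# toy usage
chunks = {
    "obj1/chunk0": (b"hello", hashlib.sha256(b"hello").hexdigest()),
    "obj1/chunk1": (b"wXrld", hashlib.sha256(b"world").hexdigest()),  # simulated bit rot
}
print(verify_chunks(chunks))   # -> ['obj1/chunk1']
```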

WDC performance testing shows that BitDynamics causes no performance degradation when performing re-erasure coding. Indeed, they took 98 drives out of an ActiveScale cluster, BitDynamics re-coded all that data onto other disk drives, and they detected no performance impact. There’s no indication of how long re-encoding 98 disk drives of data took, nor of the object store’s capacity utilization at the time of the test, but presumably there’s a report someplace to back this up.

Unlike many public cloud based object storage systems, ActiveScale is strongly consistent. That is, an object put (write) is not acknowledged to the entity doing the put until the object metadata and object data are properly and safely recorded in the object store.
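In other words, the acknowledgement is the last step. A toy sketch of such a put path, with made-up stand-ins for the data path and the 3-way replicated metadata database (not ActiveScale’s actual interfaces), might look like:

```python
class MetadataReplica:
    """Stand-in for one copy of the 3-way replicated metadata database."""
    def __init__(self):
        self.entries = {}
    def commit(self, bucket, key, locations):
        self.entries[(bucket, key)] = locations

class DataStore:
    """Stand-in for the erasure coding data path on the storage nodes."""
    def __init__(self):
        self.chunks = {}
    def write_chunks(self, bucket, key, data):
        # real system: chunk, erasure code and spread across drives
        self.chunks[(bucket, key)] = data
        return [f"drive{i}" for i in range(18)]

def put_object(bucket, key, data, data_store, metadata_replicas):
    """Strongly consistent put: only acknowledge after the object data is
    durably stored AND the metadata is committed to every replica."""
    locations = data_store.write_chunks(bucket, key, data)
    for replica in metadata_replicas:
        replica.commit(bucket, key, locations)
    return {"status": 200, "key": key}     # the ack happens last

store, replicas = DataStore(), [MetadataReplica() for _ in range(3)]
print(put_object("media", "concert.mp4", b"...", store, replicas))
```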

ActiveScale also supports 3-site erasure coding. GeoSpread is their approach to erasure coding across sites. In this case, object metadata is replicated across 3 system nodes across the sites, and object data plus erasure coded information is split into 20 chunks which are then spread across the three sites. This way, if any one site goes down, the other two sites have sufficient metadata, object data chunks and erasure coded information to reconstruct the data.
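WDC didn’t give the exact reconstruction threshold for the 20-chunk GeoSpread layout, but assuming a roughly even 7/7/6 split across sites and (my assumption) that any 13 chunks suffice to rebuild an object, the single-site-failure arithmetic works out:

```python
# Hypothetical GeoSpread-style layout: 20 chunks split 7/7/6 across 3 sites.
# The reconstruction threshold of 13 is an assumption, not a published spec;
# the point is only that losing one whole site still leaves enough chunks.
chunks_per_site = {"site_a": 7, "site_b": 7, "site_c": 6}
chunks_needed = 13

for down_site in chunks_per_site:
    surviving = sum(n for site, n in chunks_per_site.items() if site != down_site)
    status = "recoverable" if surviving >= chunks_needed else "data loss"
    print(f"{down_site} down: {surviving}/20 chunks survive -> {status}")
```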

ActiveScale 5.2 now supports asynchronous replication. That is, any one ActiveScale cluster can replicate to any other ActiveScale cluster located continental distances away.

It’s unclear how GeoSpread and asynchronous replication would interact, but my guess is that each of the 3 GeoSpread sites could be asynchronously replicated to 3 other sites for maximum redundancy.

Both GeoSpread and ActiveScale replication impact performance,  depending on how far the sites are from one another and the speed and bandwidth of the links between sites.

ActiveScale markets

ActiveScale’s biggest market is media and entertainment (M&E), mostly used for media archive or tape replacement/augmentation. WDC showed one customer case study for the Montreux Jazz Festival, which migrated 49 years of performance videos up to ActiveScale and can now stream any performance, on request, without delay. Montreux media is GeoSpread across 3 sites in France. Another option is to perform transcoding on the object media in real time and stream the transcoded media.

Another large market is Bio/Life Sciences. Medical and biological scanners are transitioning to higher resolution scans, which take more data space, and this sort of medical information needs to be kept for a long time.

Data analytics on ActiveScale

One other emerging market is data analytics. With the new S3A (the Hadoop S3 adapter), Hadoop clusters can now use object storage as a 2nd tier. One problem with data analytics is that it involves lots of data, and storing it all in triplicate costs an awful lot.

In the big data world, datasets can get very large very quickly; indeed, PB-sized data sets aren’t that unusual, and native HDFS stores everything in triplicate. When HDFS runs out of space you have to delete data. Before S3A, the only way to increase storage was to scale out (adding compute, storage and networking) just to add capacity.

Using Hadoop’s S3A, ActiveScale can provide a cold archive tier for data analytics. From a Hadoop user/application perspective, S3A ActiveScale storage looks like just another directory under HDFS (the Hadoop Distributed File System). You can run MapReduce or other Hadoop applications directly against object buckets, but a more realistic approach is to move inactive or cold data from a disk-resident HDFS directory to an S3A directory.
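For example, here’s a rough PySpark sketch of pointing the S3A connector at an on-prem object store and mixing hot HDFS data with cold bucket data. The endpoint, bucket, credentials, paths and column names are placeholders (not actual ActiveScale settings), and I’m assuming both tiers hold the same schema:

```python
from pyspark.sql import SparkSession

# Point the Hadoop S3A connector at an on-prem object store instead of AWS.
# Endpoint, bucket name and credentials below are placeholders.
spark = (SparkSession.builder
         .appName("s3a-cold-tier-demo")
         .config("spark.hadoop.fs.s3a.endpoint", "https://activescale.example.com")
         .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
         .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
         .config("spark.hadoop.fs.s3a.path.style.access", "true")
         .getOrCreate())

# Hot data stays on HDFS, cold data lives in an object bucket; both are
# just paths as far as the application is concerned.
hot = spark.read.parquet("hdfs:///data/events/2018/")
cold = spark.read.parquet("s3a://archive-bucket/events/2015/")
hot.union(cold).groupBy("event_type").count().show()
```

The point is simply that the object tier shows up as another path; all the object store specifics live in the fs.s3a.* configuration.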

HDFS and MapReduce are tightly coupled and were designed to have data close to where the computation happens. So, as long as the active or working set data is on HDFS disk storage or directly in memory, the rest of the (inactive) data can all be placed on S3A object storage. Inactive data is normally historical data no longer being actively analyzed, while newer data would be actively analyzed. Older, inactive data can be manually or automatically archived off to S3A; with Hive you can partition your database to have active data in HDFS disk storage and inactive data in S3A.
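A minimal sketch of such an archival step, using the standard hadoop distcp tool with hypothetical paths and bucket names (in practice you’d also repoint the Hive partition location at the s3a:// path), could be:

```python
import subprocess

def archive_partition(year_month):
    """Copy an inactive HDFS partition to the S3A cold tier, then remove
    the HDFS copy. Paths and bucket are hypothetical examples."""
    src = f"hdfs:///warehouse/events/dt={year_month}"
    dst = f"s3a://archive-bucket/events/dt={year_month}"
    subprocess.run(["hadoop", "distcp", src, dst], check=True)
    subprocess.run(["hdfs", "dfs", "-rm", "-r", "-skipTrash", src], check=True)

archive_partition("2015-01")
```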

Another approach: if the active, working set data can all fit directly in memory, then the data itself can reside on S3A object storage. This way the data is read from S3A storage into memory, analyzed there, and the output written back to the object store or to HDFS disk. Because the data is only read (loaded) once, there’s only a minimal performance penalty to using S3A storage.

Western Digital is an active contributor to Hadoop S3A and has recently added performance improvements to S3A, such as better caching, partial object reading, and core XML performance tuning options.

~~~~
If you’re interested in learning more about Western Digital ActiveScale, check out the videos referenced earlier and their website.

Also you may be interested in these other posts on the WD sessions at SFD15:

The A is for Active, The S is for Scale by Dan Firth (@PenguinPunk)

Comments?

Releasing social and mobile data as a public good

I have been reading a book recently, called Uncharted: Big data as a lens on human culture by Erez Aiden and Jean-Baptiste Michel that discusses the use of Google’s Ngram search engine which counts phrases (Ngrams) used in all the books they have digitized. Ngram phrases are charted against other Ngrams and plotted in real time.

It’s an interesting concept. One example they use is the 3-gram “United States are” vs. “United States is”, which shows that the singular version of the phrase, often claimed to have emerged immediately after the Civil War, was actually in use before the war and really didn’t take off until the 1880s, 15 years after the war’s end.

I haven’t finished the book yet but it got me thinking. The authors petitioned Google to gain access to the Ngram data, which led to their original research. Then, after their original research period was up, they convinced Google to release the information to the general public. Great for them, but it was a one-time event that only worked this time through luck and persistence.

The world needs more data

But there’s plenty of other information or data out there that we could use to learn an awful lot about human social interaction and other attributes of the world, buried away in corporate databases. Yes, sometimes this information is made public (like Google’s), or made available for specific research (see my post on using mobile phone data to understand people’s mobility in an urban environment), but these are special situations. Once the research is over, the data is typically no longer available to the general public, and getting future or past data outside the research boundaries requires yet another research proposal.

And yet books and magazines are universally available for a fair price to anyone and are available in most research libraries as a general public good for free.  Why should electronic data be any different?

Social and mobile data as a public good

What I would propose is that the Library of Congress and other research libraries around the world have access to all corporate data that documents interaction between humans, humans and the environment, humanity and society, etc.  This data would be freely available to anyone with library access and could be used to provide information for research activities that have yet to be envisioned.

Hopefully all of this data would be released, free of charge (or for some nominal fee), to these institutions after some period of time has elapsed. For example, if we were talking about Twitter feeds, Facebook feeds, Instagram feeds, etc., the data would be provided from, say, 7 years back on a recurring yearly or quarterly basis. I’m not sure if the delay should be 7, 10 or 15 years, but after some judicious period of time the data would be released and made publicly available.

There are a number of other considerations:

  • Anonymity – somehow any information about a person’s identity, actual location, or other potentially identifying characteristics would need to be removed from all the data. I realize this may reduce the value of the data to future researchers, but it must be done. I also realize that this may not be an easy thing to accomplish, and that is why the data could potentially be sold for a fair price to research libraries. Perhaps after 35 to 100 years or so the identifying information could be re-incorporated into the original data set, but I think this is highly unlikely.
  • Accessibility – somehow the data would need to have an easily accessible and understandable description that would enable any researcher to understand the underlying format of the data. This description should probably be in XML or some other universal description language. At a minimum, it would need to include meta-data descriptions of the structure of the data, with all the tables, rows and fields completely described. This could be in SQL format or just XML, but it needs to be made available. Also, the data release itself would need to be available in a database or in flat file formats that could be loaded by the research libraries and then accessed by researchers. I would expect this to use some sort of open source database/file service tools such as MySQL or other database engines. These databases represent the counterpart to bookshelves in today’s libraries and have to be universally accessible and forever available.
  • Identifiability – somehow the data releases would need to be universally identifiable, not unlike the ISBN scheme currently in use for books and magazines and the ISRC scheme used for recordings. This would allow researchers to uniquely refer to any data set that underpins their research. It would also allow the world’s research libraries to ensure that they purchase and maintain all the data that becomes available, by using some sort of master worldwide catalog holding pointers to all this data currently held in research institutions. Such a catalog entry would represent additional meta-data for the data release and would be the counterpart to an online library card catalog.
  • Legality – somehow any data release would need to respect the local data privacy and protection laws of the country where the data resides. This could potentially limit data generated in one country, say Germany, to being held in that country only. I would think this could be easily accomplished, as long as that country is willing to host all its data in its research institutions.

I am probably forgetting a dozen more considerations but this covers most of it.

How to get companies to release their data

One consideration that quickly comes to mind is how to compel companies to release their data in a timely fashion. I believe that data such as this is inherently valuable to a company, but that its corporate value starts to diminish over time and after some period goes to 0.

However, the value of such data to the world follows the inverse curve: the further away we are from the specific time period when the data was created, the more value it has for future research endeavors. Just consider what current researchers do with letters, books and magazine articles from the past when they are researching a specific time period in history.

But we need to act now. We are already over 7 years into the Facebook era, and mobile phones have been around for decades now. We have probably already lost much of the mobile phone tracking information from the ’80s, ’90s and ’00s, and may already be losing the data from the early ’10s. Some social networks have already risen and gone into a long eclipse, where historical data is probably their lowest concern. There is nothing that compels organizations to keep this data around today.

Types of data to release

Obviously, any social networking data, mobile phone data, or email/chat/texting data should all be available to the world after 7 or more years. Private photo libraries, video feeds, audio recordings, etc. should also be released, if not already readily available. Less clear to me are utility data, such as smart power meter readings, water consumption readings, traffic tollway activity, etc.

I would say that one standard to use might be: if there is any current research activity based on private, corporate data, then that data should ultimately become available to the world. The downside is that companies may be more reluctant to grant research access if doing so is a criterion for releasing the data.

But maybe the researchers themselves should be able to submit requests for data releases and that way it wouldn’t matter if the companies declined or not.

There is no way anyone could possibly identify all the data that future researchers will need, so I would err on the side of being more inclusive rather than less inclusive in identifying classes of data to be released.

The dawn of Psychohistory

The Uncharted book above seems to me to represent a first step toward realizing a science of Psychohistory, as envisioned in Asimov’s Foundation Trilogy. It’s unclear whether this will ever be a true, quantified scientific endeavor, but with appropriate data releases readily available for research, perhaps someday in the future we can help create the science of Psychohistory. In the meantime, through judicious, periodic data releases and appropriate research, we can certainly better understand how the world works and just maybe improve its workings for everyone on the planet.

Comments?

Picture Credit(s): Amazon and Wikipedia 

New cloud storage and Hadoop managed service offering from Spring SNW

Strange Clouds by michaelroper (cc) (from Flickr)

Last week I posted my thoughts on Spring SNW in Dallas, but there were two more items that kept coming back to me (aside from the tornados). The first was a new cloud storage startup called Symform and the other was an announcement from SunGard about their new Hadoop managed services offering.

Symform

Symform offers an interesting alternative for cloud storage that avoids the buildup of large multi-site data centers and instead uses your desktop storage as a sort of crowd-sourced storage cloud, a kind of BitTorrent-style cloud storage.

You may recall I discussed such peer-to-peer cloud storage and computing services in a posting a couple of years ago. It seems Symform has taken this task on, at least for storage.

A customer downloads (Windows or Mac) software which is installed and executes on their desktop. The first thing you have to do, after providing security credentials, is identify which directories will be moved to the cloud; the second is to say whether you wish to contribute to Symform’s cloud storage and where this contributed storage is located. Symform maintains a cloud management data center which records all the metadata about your cloud-resident data and everyone’s contributed storage space.

Symform cloud data is split up into 64MB blocks and encrypted (AES-256) using a randomly generated key (known only to Symform). Each block is then broken up into 64 fragments, with 32 parity fragments (using erasure coding) added to the stream, which is then written to 96 different locations. With this arrangement, the system could lose 32 fragments out of the 96 and still reconstitute your 64MB of data. The metadata supporting all this activity sits in Symform’s data center.
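Symform hasn’t published the pipeline beyond what’s described above, but a rough sketch of the block-preparation stage (AES-256 encrypt, then fragment; the erasure coding itself is omitted) and the loss-tolerance arithmetic would be something like:

```python
import os
import secrets
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

DATA_FRAGMENTS, PARITY_FRAGMENTS = 64, 32
TOTAL_LOCATIONS = DATA_FRAGMENTS + PARITY_FRAGMENTS      # 96 locations

def prepare_block(block):
    """Encrypt a block with AES-256 and cut the ciphertext into 64 data
    fragments. The 32 parity fragments would come from the erasure code
    (not shown); any 64 of the resulting 96 fragments rebuild the block."""
    key = AESGCM.generate_key(bit_length=256)    # random, per-block key
    nonce = secrets.token_bytes(12)
    ciphertext = AESGCM(key).encrypt(nonce, block, None)
    frag_len = -(-len(ciphertext) // DATA_FRAGMENTS)     # ceiling division
    fragments = [ciphertext[i * frag_len:(i + 1) * frag_len]
                 for i in range(DATA_FRAGMENTS)]
    return key, nonce, fragments

# toy run on a small block (real blocks would be 64MB)
key, nonce, fragments = prepare_block(os.urandom(1 << 20))
print(len(fragments), "data fragments;",
      "fragments that can be lost:", TOTAL_LOCATIONS - DATA_FRAGMENTS)
```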

It’s unclear to me what you have to provide as far as ongoing access to your contributed storage. I would guess you need to provide 7X24 access to this storage, but the 32 parity fragments are there to cover network/power failures outside your control.

Cloud storage performance is an outcome of the many fragments dispersed throughout their worldwide storage cloud. It’s similar to a BitTorrent stream, with all 96 locations participating in reconstituting your 64MB of data. Of course, not all 96 locations have to be active, just some subset holding at least 64 fragments, but it’s still cloud storage, so data access latency is on the order of internet time (many seconds). Nonetheless, once data transfer begins, throughput performance can be pretty high, which means your data should arrive shortly thereafter.

Pricing seemed comparable to other cloud storage services, with a monthly base access fee and a storage amount fee on top of that. But you can receive significant discounts if you contribute storage, and your first 200GB is free as long as you contribute 200GB of storage space to the Symform cloud.

Sungard’s new Apache Hadoop managed service

Hadoop Logo (from http://hadoop.apache.org website)

We are well aware of Sungard’s business continuity/disaster recovery (BC/DR) services, an IT mainstay for decades now. But sometime within the last decade or so Sungard has been expanding outside this space by moving into managed availability services.

Apparently this began when Sungard noticed the number of new web apps being deployed each year exceeded the number of client server apps. Then along came virtualization, which reduced the need for lots of server and storage hardware for BC/DR.

As evidence of this trend, last year Sungard announced a new enterprise-class computing cloud service. And in last week’s announcement, Sungard has teamed up with EMC Greenplum to supply an enterprise-ready Apache Hadoop managed service offering.

Recall that EMC Greenplum offers their own Apache Hadoop supported distribution, Greenplum HD. Sungard is basing their service on this distribution. But there’s more.

In conjunction with Hadoop, Sungard adds Greenplum appliances. With this configuration, Sungard can load Hadoop-processed, structured data into a Greenplum relational database for high performance data analytics. Once there, any standard SQL analytics and queries can be used to analyze the data.

With these services Sungard is attempting to provide a unified analytics service that spans all structured, semi-structured and unstructured data.

~~~~

There was probably more to Spring SNW, but given my limited time on the exhibition floor and in vendor discussions, these and my previously published post cover what was of most interest to me.

The sensor cloud comes home

We thought the advent of smart power meters would be the killer app for building the sensor cloud in the home. But this week Honeywell announced a new smart thermostat that attaches to the Internet and uses Opower’s cloud service to record and analyze home heating and cooling demand. It looks to be an even better bet.

9/11 Memorial renderings, aerial view (c) 9/11 Memorial.org (from their website)

Just this past week, on a PBS NOVA telecast, Engineering Ground Zero, about building the 9/11 memorial in NYC, it was mentioned that all the trees planted in the memorial have individual sensors to measure soil chemistry, dampness, and other tree health indicators. Yes, even trees are getting on the sensor cloud.

And of course the buildings going up at Ground Zero are all smart buildings as well, containing sensors embedded in the structure, the infrastructure, and anywhere else that matters.

But what does this mean in terms of data?

Data requirements will explode as the smart home and other sensor clouds build out.  For example, even if a smart thermostat only issues a message every 15 minutes and the message is only 256 bytes, the data from the 130 million households in the US alone would be an additional ~3.2TB/day.  And that’s just one sensor per household.

If you add the smart power meter, lawn sensor, intrusion/fire/chemical sensor, and, god forbid, the refrigerator and freezer product sensors to the mix, that’s another 16TB/day of incoming data.

And that’s just assuming a 256 byte payload per sensor every 15 minutes.  The intrusion sensors could easily be a combination of multiple, real time exterior video feeds as well as multi-point intrusion/motion/fire/chemical sensors which would generate much, much more data.

But we also have smart roads/bridges, smart cars/trucks, smart skyscrapers, smart port facilities, smart railroads, smart boats/ferries, etc. to come. I could go on but the list seems long enough already. Each of these could generate another ~19TB/day data stream, if not more. Some of these infrastructure entities/devices are much more complex than a house, and there are a lot more cars on the road than houses in the US.
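For what it’s worth, the back-of-the-envelope arithmetic behind those numbers (256-byte messages every 15 minutes across ~130 million US households) checks out:

```python
BYTES_PER_MSG = 256
MSGS_PER_DAY = 24 * 60 // 15          # one message every 15 minutes = 96/day
US_HOUSEHOLDS = 130_000_000

def tb_per_day(sensors_per_household):
    daily_bytes = US_HOUSEHOLDS * sensors_per_household * MSGS_PER_DAY * BYTES_PER_MSG
    return daily_bytes / 1e12

print(f"1 sensor/household : {tb_per_day(1):.1f} TB/day")    # ~3.2
print(f"5 more sensors     : {tb_per_day(5):.1f} TB/day")    # ~16
print(f"all 6 together     : {tb_per_day(6):.1f} TB/day")    # ~19
```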

It’s great to be in the (cloud) storage business

All that data has to be stored somewhere and that place is going to be the cloud.  The Honeywell smart thermostat uses Opower’s cloud storage and computing infrastructure specifically designed to support better power management for heating and cooling the home.  Following this approach, it’s certainly feasible that more cloud services would come online to support each of the smart entities discussed above.

Naturally, using this data to provide real time understanding of the infrastructure they monitor will require big data analytics. Hadoop and its counterparts are the only platforms around today that are up to this task.

—-

So cloud computing, cloud storage, and big data analytics have yet another part to play, this time in the upcoming sensor cloud that will envelop the world and all of its infrastructure.

Welcome to the future, it’s almost here already.

Comments?