Can we back up a PB?

Tradition says no way. IT backup history says not on your life. Common sense would say never in a million years.

Most organizations with a PB of data or more depend on remote replication to protect against data center outages or massive data loss. This, of course, costs ~2X your original data center. And for some organizations one copy is not enough, so make that ~3X.

I don’t know what PB-scale data storage costs these days, but I can’t believe it’s under a couple of million dollars (USD) in hw & sw costs, and probably at least another million or so in OpEx/year. Multiply that by 2 or 3X and you’re now talking real money.

How could backup help?

Well, for one, you wouldn’t need replicas, so that would cut your hw & sw acquisition costs by a factor of 2 or 3. But backup storage is not free either. So you’d probably need to add back 30-50% of the original data center in hw & sw costs for backups.

You certainly wouldn’t need as many admins. And power for backup storage should also be substantially less. So maybe your OpEx would only be 1.5X in total for the original PB and its backups.
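
To make that concrete, here’s a back-of-the-envelope comparison in Python using the rough figures above. All the dollar amounts and percentages are my illustrative assumptions, not quotes.

```python
# Back-of-the-envelope comparison: remote replication vs. backup for ~1PB.
# All figures are illustrative assumptions taken from the rough numbers above.

primary_capex = 2_000_000   # assumed hw & sw cost of the primary PB ($)
primary_opex = 1_000_000    # assumed OpEx/year for the primary PB ($)

# Replication: 1 or 2 remote copies, each roughly another data center.
for copies in (1, 2):
    print(f"{copies} replica(s): capex ${primary_capex * (1 + copies):,}, "
          f"opex ${primary_opex * (1 + copies):,}/yr")

# Backup: add back ~30-50% of the primary in hw & sw, OpEx maybe ~1.5X total.
for extra in (0.3, 0.5):
    print(f"backup (+{extra:.0%} storage): capex ${primary_capex * (1 + extra):,.0f}, "
          f"opex ${primary_opex * 1.5:,.0f}/yr")
```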

But what could possibly back up a PB of data?

We were talking with Igneous at Cloud Field Day 8 (CFD8, see their video here) a couple of weeks back and they said they could, and do, back up PBs of data for customers today. A while back, we also talked with them on a GreyBeards on Storage podcast.

The problems with backing up a PB seem insurmountable. First you have to be able to scan a PB of data. This means looking into multiple file systems on many different hardware platforms, across potentially multiple data centers, and that’s just to get a baseline of what all needs to be backed up.

Then at some point you actually have to store all that data on backup storage. So, to gain some cost advantage, you’d want to compress and deduplicate a PB of data, so that the first full backup wouldn’t take a full PB of backup storage.

Then of course you have to transfer a PB of data to your backup storage, in something that wouldn’t take months to perform. And that just gets you the first full backup.

Next comes the daily scan of what’s changed. This has to re-scan your PB of data to find the 100TB or so that’s changed over the last 24 hrs. Sometime after that scan completes, all that 100TB or so of changed data needs to be compressed, deduped and transferred again to backup storage.
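
At its simplest, that change scan is a walk over file metadata, comparing modification times against the last backup. A minimal, single-filesystem sketch (nothing like Igneous’s actual scanner, and the /data path is hypothetical):

```python
import os
import time

def find_changed_files(root, last_backup_time):
    """Yield (path, size) for files modified since the last backup time.

    A toy, single-node change scan; a PB-scale scanner would do this over
    NFS/SMB or vendor APIs, across many filers, and in parallel.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path, follow_symlinks=False)
            except OSError:
                continue                      # file vanished or unreadable; skip it
            if st.st_mtime > last_backup_time:
                yield path, st.st_size

if __name__ == "__main__":
    one_day_ago = time.time() - 24 * 3600
    changed = list(find_changed_files("/data", one_day_ago))   # hypothetical root
    print(f"{len(changed)} files changed in the last 24 hours")
```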

And if that’s not enough, you have to do it all over again, every day, from now on, almost forever. And data continues to grow. So 1PB today is likely to be 2PB or more in 12 months (it’s great to be in the storage business).

So those are the challenges. How can it be done, effectively, day in and day out, so that IT can depend on their data being backed up?

Igneous to the rescue…

First, Igneous came out of stealth a while back (listen to our podcast) with a couple of unique capabilities needed for massive data repository discovery and analysis. That is, they built a unique engine to scan and index PB-scale data repositories. This was so they could provide administrators better visibility into their PB-scale data repositories. But this isn’t about that product, it’s about backup.

But some of the capabilities they needed to support that product helped them perform backups as well. For instance, their scan needed to handle PBs of data. They came up with AdaptiveSCAN, which doesn’t use the standard NFS and SMB data transfer paths to gain access to file metadata. Opening a file over NFS or SMB takes quite a lot of NFS or SMB transactions. But to access metadata only, one doesn’t need all those NFS and SMB capabilities; it can be done with much less overhead, even when using NFS or SMB.
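
You can see the same trade-off on a local filesystem: enumerating directory entries and reading their cached stat info is far cheaper than opening and reading every file. A rough local analogy (my sketch, not how AdaptiveSCAN actually drives NFS/SMB):

```python
import os

def scan_metadata(root):
    """Collect (path, size, mtime) for every file without opening any of them.

    With os.scandir(), is_dir()/is_file() can usually be answered from the
    directory listing itself, and each entry's stat() result is cached, so
    the per-file cost stays small -- the same principle a metadata-only
    NFS/SMB scan exploits at much larger scale.
    """
    stack = [root]
    while stack:
        with os.scandir(stack.pop()) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    st = entry.stat(follow_symlinks=False)
                    yield entry.path, st.st_size, st.st_mtime
```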

Of course, having a way to scan billions of files was a major accomplishment, but then where do you put all that metadata? And how can you access it effectively to support backing up a PB-scale data repository? They needed some serious data indexing capabilities, and so came up with InfiniteINDEX.
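
The shape of such an index is easy to imagine, even if the scale isn’t: one record per file, queryable by path or modification time, so the next scan and the next restore don’t require another full walk. A toy sketch with SQLite (again, just the idea, not InfiniteINDEX):

```python
import sqlite3

# One row per file: enough to answer "what changed since t?" and
# "where does this path live?" without re-walking the filesystem.
conn = sqlite3.connect("file_index.db")
conn.execute("""CREATE TABLE IF NOT EXISTS files (
                    path  TEXT PRIMARY KEY,
                    size  INTEGER,
                    mtime REAL)""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_mtime ON files (mtime)")

def record(path, size, mtime):
    """Insert or update the index entry for one file."""
    conn.execute(
        "INSERT INTO files (path, size, mtime) VALUES (?, ?, ?) "
        "ON CONFLICT(path) DO UPDATE SET size = excluded.size, mtime = excluded.mtime",
        (path, size, mtime))

def changed_since(timestamp):
    """Return (path, size) for everything modified after the given time."""
    return conn.execute(
        "SELECT path, size FROM files WHERE mtime > ?", (timestamp,)).fetchall()
```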

Now, a trillion-item index seems a bit much, even for PB-scale repositories. But my guess is they have eyes on taking their PB-scale backups and going after even bigger fish. That is, offering backups for EB-scale data repositories. And that might just take a trillion-item index.

Next, moving PBs or even TBs of data quickly is no small trick. As the development team at Igneous mostly came from unstructured data providers, they also understood and have access to APIs for most storage vendors (NetApp, Dell-EMC Isilon, Pure FlashBlade, Qumulo, etc.). As such, where available, they utilized those native vendor storage API calls to help them move data rather than having to open an NFS or SMB file and read it.

Of course, even doing all that, moving 100TBs of data around or scanning PB sized data repositories is going to take a lot of processing and IO bandwidth to do in a reasonable period of time. 

So another capability they developed is massive parallelism. That is, being able to distribute scan, indexing or data movement work out to multiple systems. In that fashion, it can be accomplished in significantly less wall clock time.
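
On a single box the same idea looks like sharding the namespace and scanning the shards concurrently. A process-pool sketch follows (the repository root is hypothetical, and real scan-out across many machines is a much harder problem):

```python
import os
from concurrent.futures import ProcessPoolExecutor

def scan_shard(root):
    """Scan one top-level directory (a 'shard' of the namespace) and
    return how many files and bytes it holds."""
    files, size = 0, 0
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            try:
                size += os.stat(os.path.join(dirpath, name)).st_size
                files += 1
            except OSError:
                pass
    return root, files, size

if __name__ == "__main__":
    base = "/data"   # hypothetical repository root
    shards = [e.path for e in os.scandir(base) if e.is_dir()]
    with ProcessPoolExecutor() as pool:            # one worker per CPU by default
        for root, files, size in pool.map(scan_shard, shards):
            print(f"{root}: {files} files, {size / 1e12:.2f} TB")
```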

Well, with all that, they pretty much had the guts of a backup application for PB data repositories, but they still didn’t have the glue to put it all together. Recently they announced just that: Igneous DataProtect, a full-scale backup application for PBs of data.

I suppose I haven’t done justice to all of what they have developed or talked about at their session, so I would suggest viewing their talk at CFD8 and listening to our GBoS podcast to learn more. They did demo their product at CFD8 but I believe it was a canned demo.

I didn’t think I’d see the day when some vendor would offer backup services for PBs of data, let alone be shooting for more, but there you have it. Igneous means to take your PB-scale data repositories and make them as easy to operate as TB-scale data repositories. They call that democratizing data.

Comments?

See these other CFD8 bloggers’ write-ups on Igneous.

CFD8  – Igneous Follow Up  by Nate Avery (@Nathaniel_Avery)

Picture credit(s): All from screen saves during Igneous’s session at CFD8

Marketing meet Big Data, call records, credit card purchases & demographics

Read an article in Science Daily (Understanding urban issues through credit cards) that talked about a study published in Nature (Sequences of purchases in credit card data reveal lifestyles in urban populations) that applies big data to B2C marketing.

The researchers examined call data records (CDRs), credit card transaction records (CCRs) and demographic data (age, sex, residential zip code, wage level, etc.) and cross-tabulated them to identify sequences of purchases. They then used these sequences to identify different lifestyle groups in the urban area.

Marketing 2.0

They analyzed data from Mexico City, Mexico. The CCR data was collected over 10 weeks across 150K users. They had CDR data for 1/10th of those users for the 6 months surrounding the 10-week period. Credit card adoption is still low in Mexico (18%), so the analysis may be biased. When they matched CCR expenditures against median wages in a district, they found their participants came from higher-wage populations. Their data also spanned all districts within the city.

The analysis identified sequences of purchase categories as well as expenditures.  They characterized purchase sequences as “words”.
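
As I read it, the encoding works roughly like this: take each user’s time-ordered transactions, keep only the merchant category of each, and slide a window over the sequence to form “words”. A hypothetical sketch with made-up transactions:

```python
from collections import Counter

# Hypothetical transactions: (user_id, timestamp, merchant_category)
transactions = [
    ("u1", "2016-03-01T08:10", "toll"),
    ("u1", "2016-03-01T12:30", "restaurant"),
    ("u1", "2016-03-01T18:05", "grocery"),
    ("u2", "2016-03-01T09:00", "taxi"),
    ("u2", "2016-03-01T20:15", "restaurant"),
    ("u2", "2016-03-02T08:45", "computer-networking"),
]

def purchase_words(txns, length=3):
    """Turn each user's time-ordered category sequence into sliding 'words'."""
    by_user = {}
    for user, ts, category in sorted(txns):      # sorts by user, then timestamp
        by_user.setdefault(user, []).append(category)
    words = Counter()
    for seq in by_user.values():
        for i in range(len(seq) - length + 1):
            words["->".join(seq[i:i + length])] += 1
    return words

print(purchase_words(transactions))
# Counter({'toll->restaurant->grocery': 1, 'taxi->restaurant->computer-networking': 1})
```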

Using the word data and further statistical analysis they were able to split the population up into 5 distinct lifestyle groups. 
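
Conceptually, that grouping step is a clustering problem: build a per-user vector of purchase-“word” counts and cluster the vectors. A hypothetical sketch with scikit-learn (k-means here is my stand-in; the paper’s statistical method may differ, and the counts are made up):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction import DictVectorizer

# Hypothetical per-user purchase-"word" counts.
user_words = {
    "u1": {"toll->restaurant->grocery": 4, "toll->gas": 2},
    "u2": {"taxi->restaurant->computer-networking": 5},
    "u3": {"grocery->grocery->restaurant": 6},
    "u4": {"computer-networking->gas->grocery": 2, "computer-networking->telecom": 4},
    "u5": {"restaurant->fastfood->gas": 3},
}

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(user_words.values())          # users x word-count matrix
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

for user, label in zip(user_words, labels):
    print(user, "-> lifestyle cluster", label)
```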

The loops of icons above represent major purchase categories derived from the CCR data merchant category codes (MCC). Each of the rings in “a” above shows the same 12 major MCC purchase categories. In each ring, one can identify a central or core node that has the most incoming or outgoing arcs. These seem to be the central purchases made by that lifestyle group, after which they branch out to other purchase categories.

There are five different lifestyle categories (they also show the city average) delineated in the data:

  • Commuter – generally they have to pay tolls, have longer travel between home and work, and have a diverse sequence of purchases that occurs after purchases from the toll category.
  • Household – purchases seem to center on grocery stores/supermarkets and then branch off from there.
  • Young – purchases seem to center on the taxicab category and then go to computer-networking, restaurants, grocery stores/supermarkets.
  • Hi-Tech – purchases seem to center on computer-networking, then go to gas stations, grocery stores/supermarkets, restaurants, and telecomm.
  • Average – seems to have two focuses, grocery stores/supermarkets and restaurants, and then goes out from there to gas stations, specialty food stores and department stores.
  • Dinner-out – purchases seem to center on restaurants and then branch out from there to computer-networking, gas stations, supermarkets, fast food, etc.

In the “b” breakout above, you can see the socio-demographic characteristics of each lifestyle group as compared with the median user. And in “c”, one can see some population histograms of the demographic data.

They were then able to use the CDR data to construct a map of which lifestyle called which other lifestyle, to identify call correlations. Most calls were contacts within the same group, but the second most active connection was calls to the household group.
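
That call map is essentially a cross-tabulation of calls by the lifestyle label of each party, something like this hypothetical pandas sketch:

```python
import pandas as pd

# Hypothetical call records, already joined with each party's lifestyle label.
calls = pd.DataFrame({
    "caller": ["commuter", "young", "household", "commuter", "hi-tech", "young"],
    "callee": ["commuter", "household", "household", "household", "young", "young"],
})

# Fraction of each caller group's calls going to each callee group.
call_matrix = pd.crosstab(calls["caller"], calls["callee"], normalize="index")
print(call_matrix.round(2))
```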

They took this same analysis to another city in Mexico and came up with six lifestyle categories: five the same and one different.

~~~~

When I went to Uni (a long long time ago), I attended an urban geography class that was much more scientific and mathematical than any other geography class I had ever attended. I remember asking the professor when did geography become an exact science. As best as I can recall, he laughed and said over the last decade.

Analysis like the above could make B2C marketing almost an exact science.

Big Data meet Marketing – Buyer beware.

Comments?

Photo Credit(s): All charts/photos are from the Nature article Sequences of purchases in credit card data reveal lifestyles in urban populations

Stanford Data Lab students hit the ground running…

Read an article (Students confront the messiness of data) today about Stanford’s Data Lab  and how their students are trained to cleanup and analyze real world data.

The Data Lab teaches two courses the Data Challenge Lab course and the Data Impact Lab course. The Challenge Lab is an introductory course in data gathering, cleanup and analysis. The Impact Lab is where advanced students tackle real world, high impact problems through data analysis.

Data Challenge Lab

Their Data Challenge Lab course is a 10-week course with no prerequisites that teaches students how to analyze real-world data to solve problems.

There are no lectures. You’re given project datasets and the tools to manipulate, visualize and analyze the data. Your goal is to master the tools, clean up the data and gather insights from the data. Professors are there to provide one-on-one help so you can step through the data provided and understand how to use the tools.

In the information provided on their website there were no references and no information about the specific tools used in the Data Challenge Lab to manipulate, visualize and analyze the data. From an outsider’s viewpoint, it would be great to have a list of references or websites describing the tools being used, and maybe the datasets that are accessed.

Data Impact Lab

The Data Impact lab course is an independent study course, whose only pre-req is the Data Challenge Lab.

Here students are joined into interdisciplinary teams with practitioner partners to tackle ongoing, real-world problems with their new data analysis capabilities.

There is no set time frame for the course and it is a non-credit activity. But here students help to solve real world problems.

Current projects in the Impact lab include:

  • The California Poverty Project to create an interactive map of poverty in California to supply geographic guidance to aid agencies helping the poor.
  • The Zambia Malaria Project to create an interactive map of malarial infestation to help NGOs and other agencies target remediation activity.

Previous Impact Lab projects include: the Poverty Alleviation Project to provide a multi-dimensional index of poverty status for areas in Kenya so that NGOs can use these maps to target randomized experiments in poverty eradication and the Data Journalism Project to bring data analysis tools to breaking stories and other journalistic endeavors.

~~~~

Courses like these should be much more widely available. It’s almost the analog to the scientific method, only for the 21st century.

Science has gotten to a point, these days, where data analysis is a core discipline that everyone should know how to do. Maybe it doesn’t have to involve Hadoop but rudimentary data analysis, manipulation, and visualization needs to be in everyone’s tool box.
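
For what it’s worth, that rudimentary tool box can be quite small these days; a few lines of pandas cover loading, cleaning, summarizing and plotting (the file name and column names below are made up):

```python
import pandas as pd

# Load a hypothetical messy CSV, clean it up a bit, and summarize it.
df = pd.read_csv("survey.csv")                  # hypothetical data file
df = df.dropna(subset=["age", "income"])        # drop rows missing key fields
df["age"] = df["age"].astype(int)

print(df.describe())                            # quick numeric summary
print(df.groupby("region")["income"].median())  # median income per region

df.plot.scatter(x="age", y="income")            # basic visualization
```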

Data 101 anyone?

Photo Credit(s): Big_Data_Prob | KamiPhuc;

Southbound traffic speeds on Masonic avenue on different dates | Eric Fisher;

Unlucky Haiti (1981-2010) | Jer Thorp;

Bristol Cycling Level by Wards 2011 | Sam Saunders

Domesticating data

Read an article the other day from MIT News (Taming Data) about a new system that scans all your tabular data and provides an easy way to query all this data from one system. The researchers call the system the Data Civilizer.

What does it do

Tabular data seems to be the one constant in corporate data (that, and for me, PowerPoint and Word docs). Most databases are tables of one form or another (some row- and some column-based). Lots of operational data is in spreadsheets (tables by another name) of some type. And when I look over most IT/Networking/Storage management GUIs, tables (rows and columns) of data are the norm.

The Data Civilizer takes all this tabular data and analyzes it all, column by column, and calculates descriptive characterization statistics for each column.

Numerical data could be characterized by range, standard deviation, median/average, cardinality, etc. For textual data, a list of the words in the column by frequency might suffice. It also indexes every word in the tables it analyzes.

Armed with its statistical characterization of each column, the Data Civilizer can then generate a similarity index between any two columns of data across the tables it has analyzed. In that way it can connect data in one table with data in another.

Once it has a similarity matrix and has indexed all the words in every table column it has analyzed, it can then map the tabular data, showing which columns look similar to other columns. Then any arbitrary query for data can be executed on any table that contains similar data, supplying the results of the query across the multiple tables it has analyzed.
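
A toy version of the idea, to make it concrete (my sketch, not the Data Civilizer’s actual algorithm): profile each column, then score pairs of columns by how much their profiles or value sets overlap.

```python
import statistics

def profile(column):
    """Compute a small statistical profile for one column of values."""
    if all(isinstance(v, (int, float)) for v in column):
        return {"kind": "numeric",
                "min": min(column), "max": max(column),
                "mean": statistics.mean(column),
                "stdev": statistics.pstdev(column)}
    return {"kind": "text",
            "values": set(map(str, column)),
            "cardinality": len(set(column))}

def similarity(p1, p2):
    """Crude similarity score between two column profiles."""
    if p1["kind"] != p2["kind"]:
        return 0.0
    if p1["kind"] == "text":
        # Jaccard overlap of the distinct values in each column.
        inter = len(p1["values"] & p2["values"])
        union = len(p1["values"] | p2["values"])
        return inter / union if union else 0.0
    # Numeric columns: overlap of their value ranges.
    lo, hi = max(p1["min"], p2["min"]), min(p1["max"], p2["max"])
    span = max(p1["max"], p2["max"]) - min(p1["min"], p2["min"])
    return max(0.0, (hi - lo) / span) if span else 1.0

cust = profile(["NY", "CA", "TX", "CA"])   # hypothetical "state" column
orders = profile(["CA", "WA", "NY"])       # another table's "region" column
print(similarity(cust, orders))            # 0.5 -- likely the same kind of data
```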

Potential improvements

The researchers indicated that they currently don’t support every table data format. This may be a sizable task on its own.

In addition, statistical characterization or classification seems old school nowadays. Most new AI is moving off statistical analysis to more neural-net types of classification. Unclear if you could just feed all the tabular data to a deep learning neural net, but if the end game is to find similarities across disparate data sets, then neural nets are probably a better way to go. How you would combine this with brute-force indexing of all tabular data words is another question.

~~~~

In the end as I look at my company’s information, even most of my Word docs are organized in some sort of table, so cross table queries could help me a lot. Let me know when it can handle Excel and Word docs and I’ll take another look.

Photo Credit(s): Linear system table representation 2 by Ronald O’ Daniel

Glenda Sims by Glendathegood


Big open data leads to citizen science

Read an article the other day in ScienceLine about the Astronomical Data Explosion.  It appears that as international observatories start to open up their archives and their astronomical data to anyone and anybody, people are starting to do useful science with it.

Hunting for planets

The story talked about a pair of amateur astronomers who were looking through Kepler telescope data, which had recently been put online (see PlanetHunters.org), to find anomalies that signal the possibility of a planet. They saw a dimming of a particular star’s brightness and then saw it again 132 days later. At that point they brought it to the attention of real scientists, who later discovered that what they found was a 4-star solar system, which they labeled Tatooine.
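
The kind of signal those planet hunters flag by eye is easy to state in code: a repeating dip in a star’s measured brightness. A toy detector on made-up numbers:

```python
# Toy transit-dip detector on a made-up light curve: flag times when the
# star's measured brightness drops well below its typical level.
brightness = [1.00, 1.01, 0.99, 0.97, 1.00, 0.82, 0.80, 1.00, 1.01, 0.99]
times      = [0,    1,    2,    3,    4,    5,    6,    7,    8,    9]   # days

baseline = sorted(brightness)[len(brightness) // 2]   # median brightness
threshold = 0.95 * baseline                           # a 5% dip is suspicious

dips = [t for t, b in zip(times, brightness) if b < threshold]
print("possible transits at days:", dips)             # -> [5, 6]
```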

It seems the latest astronomical observations coming in from the Kepler, Sloan Digital Sky Survey and Hubble observatories are generating a deluge of data. And although all this data is being subjected to intense scrutiny by professional astronomers, they can’t do everything they want to do with it.

Consequently, astronomy today has come to a new world of abundant data but not enough resources to do all the science that can be done. This is where the citizen or amateur scientist enters the picture. Using standard, web-accessible tools, they are able to subject the data to many more eyes, each looking for whatever interest spurs them on, and as such can often contribute real science from their efforts.

Citizen science platforms

It turns out PlanetHunters.org is one of a number of similar websites put up by Zooniverse to support citizen science in astronomy, biology, nature, climate and humanities. Their latest project is to classify animals found in snapshots taken on the Serengeti (see SnapshotSerengeti.org).

Of course, crowdsourced scientific activity like this has been going on for a long time now with BOINC projects like the SETI@Home screen saver, which sifted through radio signals searching for extraterrestrial signals. But that made use of the extra desktop compute cycles people were wasting with screen savers.


In contrast, Zooniverse started with the GalaxyZoo project (original retired site here). They put Hubble telescope images online and asked amateur astronomers to classify the types of galaxies found in the images.

GalaxyZoo had modest aspirations at first, but when they put the Hubble images online their servers were overwhelmed with the response and had to be beefed up considerably to deal with the traffic. Over time, they were able to get literally millions of galaxy classifications. Now they want more, and the recent incarnation of GalaxyZoo has put the brightest 250K galaxies online and is asking for even finer, more detailed classifications of them.

Today’s Zooniverse projects are taking advantage of recent large and expanding data repositories, plus newer data visualization tools, to help apply human analysis to their data. Automated tools are not yet sophisticated enough to classify images as well as a human can.

One criterion for Zooniverse projects is to have a massive amount of data which needs to be classified. In this way, science is once again returning to its amateur roots, but this time guided by professionals. Together we can do more than what either could do apart.

~~~~

I suppose it was only a matter of time before science got inundated with more data than it could process effectively. Having the ability to put all this data online, parcel it out to concerned citizens and ask them to help understand/classify it has brought a new dawn to citizen science.

Comments?

Photo credits:
Twin Suns on Mos Espa by Stéfan
BONIC running SETI@Home by Keng Susumpow
Galaxy Group Stephan’s Quintet by HubbleColor {Zolt}

Backup is for (E)discovery too

Electronic Discovery Reference Model (from EDRM.net)

There has been lots of talk in the twitterverse and elsewhere on how “backup is used for restore and archive is for e-discovery”, but I beg to differ.

If one were to take the time to review the EDRM (Electronic Discovery Reference Model) and analyze what happens during actual e-discovery processes, one would see that nothing is outside the domain of court discovery requests. Backups have held and always will hold discoverable data, just as online and user desktop/laptop storage does. In contrast, archives are not necessarily a primary source of discoverable data.

In my view, any data not in archive is, by definition, online or on user desktop/laptop storage. Once online, data is most likely being backed up periodically and will show up in backups long before it’s moved to archive. Data deletions and other modifications can often be reconstructed from backups much better than from archive (with the possible exception of records management systems). Also, reconstructing data proliferation, such as who had a copy of what data when, is often crucial to court proceedings and normally can only be reconstructed from backups.

Archives have a number of purposes, but the primary one is to move data that doesn’t change off company storage and out of its backup stream. Another popular reason for archive is to satisfy compliance regimes that require companies to hold data for periods of time, such as those mandated by the SEC, HIPAA, SOX, and others. For example, SEC brokerage records must be held long after an account goes inactive, HIPAA health records must be held long after a hospital visit, and SOX requires corporate records to be held long after corporate transactions transpire. Such records are more for compliance and/or customer back-history requests than e-discovery, but here again any data stored by the corporation is discoverable.

So I believe it’s wrong to say that backup is only for restore and archive is only for discovery. Information anywhere within a company is discoverable. However, I would venture to say that a majority of e-discovery data comes from backups rather than elsewhere.

Now, as for using backups for restore,…