Marketing meet Big Data, call records, credit card purchases & demographics

Read an article in Science Daily (Understanding urban issues through credit cards) that talked about a study published in Nature (Sequences of purchases in credit card data reveals lifestyles of urban populations) that applies big data to B2C marketing.

The researchers examined call data records (CDRs), credit card transactions records (CCRs) and demographic (age, sex, residential zip code, wage level, etc.) data and did a cross table between them to identify sequences of purchases. They then used these sequences to identify different lifestyle groups in the urban area.

Marketing 2.0

The analyzed data from Mexico City, Mexico. The CCR data was collected for 10 weeks across 150K users. The had CDR data for 1/10th of the users for 6 months surrounding the 10 weeks duration. Credit card adoption is still low in Mexico (18%), so the analysis may be biased.  When thy matched CCR expenditures against median wages in a district and they found their participants came from higher wage populations. Their data also spanned all districts within the city.

The analysis identified sequences of purchase categories as well as expenditures.  They characterized purchase sequences as “words”.

 

 

 

Using the word data and further statistical analysis they were able to split the population up into 5 distinct lifestyle groups. 

The loops of icons above represent major purchase categories derived from the CCR data merchant category codes (MCC).  Each of the rings in “a” above show the same 12 major MCC purchase categories. If you look at each ring, one can identify a central or core node that seems to have the most incoming or outgoing arks. These seem to be the central purchases made by that lifestyle group after which they branch out to other purchase categories.

There are five different lifestyle categories (they also show the city average) delineated in the data:

  • Commuter – generally they have to pay tolls, have longer travel between home and work and have a diverse sequence of purchase that occurs after purchases from the toll category.
  • Household – purchases seem to center on grocery stores/supermarkets and then branch off from there.
  • Young – purchases seem to center on the taxicab category and then go to computer-networking, restaurants, grocery stores/supermarkets.
  • Hi-Tech – purchases seem to center on computer-networking,  then go to gas stations, grocery stores/supermarkets, restaurants, and telecomm.
  • Average – seems to have two focuses grocery stores/supermarkets and restaurants and then goes out from there to gas stations, specialty food stores and department stores.
  • Dinner-out – purchases seem to center on restaurants and then branch out fro there to computer-networking, gas stations, supermarkets, fast food, etc.

In “b”  breakout above, you can see the socio-demographic characteristics of each lifestyle group as compared with the median user. And in “c” one can see some population histograms of the demographic data.

They were then able to use the CDR data to construct a map of which lifestyle called which other life style to identify call correlation data. Most calls were contacts between the same groups but the second most active call was calls to householders.

They took this same analysis to another city in Mexico and came up with six  lifestyle categories, five of the same and a different one.

~~~~

When I went to Uni (a long long time ago), I attended an urban geography class that was much more scientific and mathematical than any other geography class I had ever attended. I remember asking the professor when did geography become an exact science. As best as I can recall, he laughed and said over the last decade.

Analysis like the above could make B2C marketing, almost an exact science.

Big Data meet Marketing – Buyer beware.

Comments?

Photo Credit(s):  All charts/photos are from the Nature article Sequences of purchase in credit card data reveal lifestyles in urban populations

Stanford Data Lab students hit the ground running…

Read an article (Students confront the messiness of data) today about Stanford’s Data Lab  and how their students are trained to cleanup and analyze real world data.

The Data Lab teaches two courses the Data Challenge Lab course and the Data Impact Lab course. The Challenge Lab is an introductory course in data gathering, cleanup and analysis. The Impact Lab is where advanced students tackle real world, high impact problems through data analysis.

Data Challenge Lab

Their Data Challenge Lab course is a 10 week course with no pre-requisites that teaches students how to analyze real world data to solve problems.

Their are no lectures. You’re given project datasets and the tools to manipulate, visualize and analyze the data. Your goal is to master the tools, cleanup the data and gather insights from the data. Professors are there to provide one on one help so you can step through the data provided and understand how to use the tools.

In the information provided on their website there were no references and no information about the specific tools used in the Data Challenge Lab to manipulate, visualize and analyze the data. From an outsiders’ viewpoint it would be great to have a list of references or even websites describing the tools being used and maybe the datasets that are accessed.

Data Impact Lab

The Data Impact lab course is an independent study course, whose only pre-req is the Data Challenge Lab.

Here students are joined into interdisplinary teams with practitioner partners to tackle ongoing, real world problems with their new data analysis capabilities.

There is no set time frame for the course and it is a non-credit activity. But here students help to solve real world problems.

Current projects in the Impact lab include:

  • The California Poverty Project  to create an interactive map of poverty in California to supply geographic guidance to aid agencies helping the poor
  • The Zambia Malaria Project to create an interactive map of malarial infestation to help NGOs and other agencies target remediation activity.

Previous Impact Lab projects include: the Poverty Alleviation Project to provide a multi-dimensional index of poverty status for areas in Kenya so that NGOs can use these maps to target randomized experiments in poverty eradication and the Data Journalism Project to bring data analysis tools to breaking stories and other journalistic endeavors.

~~~~

Courses like these should be much more widely available. It’s almost the analog to the scientific method, only for the 21st century.

Science has gotten to a point, these days, where data analysis is a core discipline that everyone should know how to do. Maybe it doesn’t have to involve Hadoop but rudimentary data analysis, manipulation, and visualization needs to be in everyone’s tool box.

Data 101 anyone?

Photo Credit(s): Big_Data_Prob | KamiPhuc;

Southbound traffic speeds on Masonic avenue on different dates | Eric Fisher;

Unlucky Haiti (1981-2010) | Jer Thorp;

Bristol Cycling Level by Wards 2011 | Sam Saunders

Domesticating data

4111674475_76be20e180_zRead an article the other day from MIT News (Taming Data) about a new system that scans all your tabular data and provides an easy way to query all this data from one system. The researchers call the system the Data Civilizer.

What does it do

Tabular data seems to be the one constant in corporate data (that and for me PowerPoint and Word docs). Most data bases are tables of one form or another (some row and some column based). Lots of operational data is in spreadsheets (tables by another name) of some type.  And when I look over most IT/Networking/Storage management GUIs, tables (rows and columns) of data are the norm.

156788318_628fb0e4dc_oThe Data Civilizer takes all this tabular data and analyzes it all, column by column, and calculates descriptive characterization statistics for each column.

Numerical data could be characterized by range, standard deviation, median/average, cardinality etc. For textual data a list of words in the column by frequency might suffice. It also indexes every  word in the tables it analyzes.

Armed with its statistical characterization of each column, the Data Civilizer can then generate a similarity index between any two columns of data across the tables it has analyzed. In that way it can connect data in one table with data in another.

Once it has a similarity matrix and has indexed all the words in every table column it has analyzed, it can then map the tabular data, showing which columns look similar to other columns. Then any arbitrary query for data, can be executed on any table that contains similar data supplying the results of the query across the multiple tables it has analyzed.

Potential improvements

The researchers indicated that they currently don’t support every table data format. This may be a sizable task on its own.

In addition statistical characterization or classification seems old school nowadays. Most new AI is moving off statistical analysis to more neural net types of classification. Unclear if you could just feed all the tabular data to a deep learning neural net, but if the end game is to find similarities across disparate data sets, then neural nets are probably a better way to go. How you would combine this with brute force indexing of all tabular data words is another question.

~~~~

In the end as I look at my company’s information, even most of my Word docs are organized in some sort of table, so cross table queries could help me a lot. Let me know when it can handle Excel and Word docs and I’ll take another look.

Photo Credit(s): Linear system table representation 2 by Ronald O’ Daniel

Glenda Sims by Glendathegood

 

Big open data leads to citizen science

Read an article the other day in ScienceLine about the Astronomical Data Explosion.  It appears that as international observatories start to open up their archives and their astronomical data to anyone and anybody, people are starting to do useful science with it.

Hunting for planets

The story talked about a pair of amateur astronomers who were looking through Kepler telescope data which had recently been put online (see PlanetHunters.org) to find anomalies that signal the possibility of a planet.  They saw a diming of a particular star’s brightness and then saw it again 132 days later. At that point they brought it to the attention of real scientists who later discovered that what they found was a 4 star solar system which they labeled Tatooine.

It seems with all the latest astronomical observations coming in from Kepler, the Sloan Digital Sky Survey and Hubble observatories are generating a deluge of data. And although all this data is being subjected to intense scrutiny by professional astronomers, they can’t do everything they want to do with it.

Consequently, in astronomy today we now have come to a new world of abundant data but not enough resources to do all the science that can be done.  This is where the citizen or amateur scientist enters the picture. Using standard web accessible tools they are able to subject the data to many more eyes each looking for whatever interest spurs them on and as such, can often contribute real science from their efforts.

Citizen science platforms

It turns out PlanetHunters.org is one of a number of similar websites put up by Zooniverse to support citizen science in astronomy, biology, nature, climate and humanities. Their latest project is to classify animal found in snapshots taken on the Serengheti (see SnapshotSerengeti.org).

Of course crowdsourced scientific activity like this has been going on for a long time now with Boinc projects like SETI@Home screen savers that sifted through radio signals searching for extra-terestial signals. But that made use of the extra desktop compute cycles people were waisting with screen savers.

 

In contrast, Zooniverse started with the GalaxyZoo project (original retired site here). They put Hubble telescope images online and asked for amateur astronomers to classify the type of galaxies found in the images.

GalaxyZoo had modest aspirations at first but when they put the Hubble images online their servers were overwhelmed with the response and had to be beefed up considerably to deal with the traffic.  Overtime, they were able to get literally millions of galaxy classifications. Now they want more, and the recent incarnation of GalaxyZoo has put the brightest 250K galaxies online and they are asking for even finer, more detailed classifications of them.

Today’s Zooniverse projects are taking advantage of recent large and expanding data repositories plus newer data visualization tools to help employ human analysis to their data.  Automated tools are not yet sophisticated enough to classify images as well as a human can.

One criteria for Zooniverse projects is to have a massive amount of data which needs to be classified.  In this way, science is once again returning to it’s amateur roots but this time guided by professionals.  Together we can do more than what either could do apart.

~~~~

I suppose it was only a matter of time before science got inundated with more data than they could process effectively.  Having the ability to put all this data online, parcel it out to concerned citizens and ask them to help understand/classify it has brought a new dawn to citizen science.

Comments?

Photo credits:
Twin Suns on Mos Espa by Stéfan
BONIC running SETI@Home by Keng Susumpow
Galaxy Group Stephan’s Quintet by HubbleColor {Zolt}

Backup is for (E)discovery too

Electronic Discovery Reference Model (from EDRM.net)
Electronic Discovery Reference Model (from EDRM.net)

There has been lot’s of talk in twitterverse and elsewhere on how “backup is used for restore and archive is for e-discovery”, but I beg to differ.

If one were to take the time to review the EDRM (Electronic Discovery Reference Model) and analyze what happens during actual e-discovery processes, one would see that nothing is outside the domain of court discovery requests. Backups have and always will hold discoverable data just as online and user desktop/laptop storage do. In contrast, archives are not necessarily a primary source of discoverable data.

In my view, any data not in archive, by definition is online or on user desktop/laptop storage. Once online, data is most likely being backed up periodically and will show up in backups long before it’s moved to archive. Data deletions and other modifications can often be reconstructed from backups much better than from archive (with the possible exception of records management systems). Also, reconstructing data proliferation, such as who had a copy of what data when, is often crucial to court proceedings and normally, can only be reconstructed from backups.

Archives have a number of purposes but primarily it’s to move data that doesn’t change off company storage and out of its backup stream. Another popular reason for archive is to be used to satisfy compliance regimens that require companies to hold data for periods of time, such as mandated by SEC, HIPPA, SOX, and others. For example, SEC brokerage records must be held long after an account goes inactive, HIPPA health records must be held long after a hospital visit, SOX requires corporate records to be held long after corporate transactions transpire. Such records are more for compliance and/or customer back-history request purposes than e-discovery but here again any data stored by the corporation is discoverable.

So I believe it’s wrong to say that Backup is only for restore and archive is only for discovery. Information, anywhere within a company is discoverable. However, I would venture to say that a majority of e-discovery data comes from backups rather than elsewhere.

Now, as for using backups for restore,…