Storywrangler, ranking tweet ngrams over time

Read a couple of articles over the past few weeks about a project in Vermont that has randomly sampled 10% of all tweets (~150 billion) since Twitter's early days (2008) and can search and rank this tweet corpus for ngrams (1-, 2-, and 3-word phrases). All of these articles were reporting on a Science Advances paper: Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter.

Why we need Storywrangler

The challenge with all social media is that it is transient: here now, (mostly) gone tomorrow. That is, once posted, if a tweet is liked/re-posted/re-tweeted it can exist in echoes of the original on the service for some time; if not, it dies out very quickly, never to be (externally) seen again. And while each of us can potentially see every tweet we have ever created (when this post is published it should be my 5387th tweet on my Twitter account), most of us cannot see this history for others.

All of that makes viewing what goes on on social media impossible, which leads to a lot of misunderstanding and makes it difficult to analyze. It would be great if we had a way of looking at social media activity in more detail to understand it better.

I wrote about this before (see my Computational anthropology & archeology post) and if anything, the need for such capabilities has become even more important in today’s society.

If only there were a way to examine the twitter-verse. What's mainly lacking is a corpus of all tweets that have ever been tweeted. A way to slice, dice, search, and rank this text data would be a godsend to understanding Twitter (and maybe social) history, in real time.

Storywrangler has a random 10% sample of all tweets since Twitter started, and it provides ngram searching and ranking over a specified time interval. It's not everything, but it's a start.

Storywrangler currently has over 1 trillion (1- to 3-word) ngrams and supports ngram rankings for over 150 different languages.

Google books ngram viewer

The idea for the Storywrangler project came from Google's books ngram viewer. Google's ngram viewer is built on a corpus of Google Books spanning 1800 to 2019 and allows one to search for ngrams (1- to 5-word phrases) over any time period within that range.

Google's ngram viewer charts ngrams with a vertical axis that is the percentage of all ngrams in their book corpus, so one can see the rise and fall of phrases, e.g., "atomic power". The phrase "atomic power" peaked in Google books around 1960 at a height of 0.000260% of all 2-word ngrams. The time granularity is a year.

The nice thing about Google Books ngram data is that you can download the book ngram data yourself. The data takes the form of tab-separated rows, each with the ngram text (1 to 5 words), a year, how many times it occurred that year, on how many pages, and in how many books. Google Books ngram data is generally about 2 years old.
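
For illustration, here's a minimal sketch of tallying per-year counts for one phrase from a downloaded Google Books ngram file, assuming the tab-separated layout described above (the exact columns vary by dataset version, so treat the indices and file name as placeholders):

import csv
from collections import defaultdict

def yearly_counts(path, phrase):
    # Sum per-year occurrence counts for one ngram from a Google Books
    # ngram export (tab-separated: ngram, year, match_count, ...).
    counts = defaultdict(int)
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.reader(f, delimiter='\t'):
            if row and row[0] == phrase:
                counts[int(row[1])] += int(row[2])
    return dict(sorted(counts.items()))

# Hypothetical file name, just for illustration:
# print(yearly_counts('googlebooks-eng-2gram-sample.tsv', 'atomic power'))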

It's unclear just how much data is in Google's books ngram database, but for instance in the 1-gram English fiction list, they show a sample of two rows (the 3,000,000th and 3,000,001st rows), which are the 1978 and 1979 book counts for the word "circumvallate".

Storywrangler tweet ngram viewer

The usage tab on the Storywrangler website provides a search engine where one can input the ngrams to search the corpus for and visualize how their rank changes over time. For example, one can do a similar search on the "atomic power" ngram, only for tweets.

From a Storywrangler search one can see that peak tweet use of "Atomic Power" and "ATOMIC POWER" occurred sometime in July of 2020 (the only way to see the month is to hover over that line), when its rank reached somewhere around the ~10,000th most used 2-word tweet ngram.

It's interesting to see that book ngrams and twitter ngrams don't seem to have much correlation. For example, the prior best ranking for atomic power (~200,000th highest) was in June of 2015, and there was no similar peak in the book ngrams for the phrase.

From Storywrangler you can download a JSON or CSV version of the charts displayed. It's not the complete ngram history that the Google Books ngram viewer provides. Storywrangler data is generally about 2 days old.
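
As a quick example, here's a minimal sketch of plotting rank over time from one of those downloaded CSVs; the file and column names ('date', 'rank') are assumptions, not Storywrangler's documented schema, so adjust them to whatever the file actually contains:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('atomic_power_2gram.csv', parse_dates=['date'])  # assumed columns
plt.plot(df['date'], df['rank'])
plt.yscale('log')            # ranks span several orders of magnitude
plt.gca().invert_yaxis()     # rank 1 is "best", so put it at the top
plt.xlabel('Date')
plt.ylabel('2-gram rank')
plt.title('"atomic power" rank over time (Storywrangler download)')
plt.show()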

The other nice thing about Storywrangler is that the real-time tab will show you ngram rankings at 15-minute intervals for whatever timeline you wish to see. The trending tab will show you the changing ranks for the top 5 ngrams over a selected time period. The language tab will track tweet language use for selected languages. The common tab will track the ranking of the most common ngrams (pretty boring, mostly articles/prepositions) over time. And for any of these searches one can turn retweet counting on or off, which can help eliminate bot activity.

Storywrangler provides a number of other statistics for ngrams besides ranking, such as odds (of occurring) and frequency (of occurrence). And one can also track rank change, prior-year rank vs. current-year rank, and rank (turbulence) divergence.
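
To give a feel for what a rank comparison looks like, here's a toy calculation comparing an ngram's rank in two periods. The actual rank-turbulence divergence used by the Storywrangler team is a more elaborate measure over whole rank distributions, so this is purely an illustration:

def rank_shift(rank_then, rank_now):
    # Ratio of inverse ranks: values > 1 mean the ngram is relatively
    # more prominent now than it was before.
    return (1.0 / rank_now) / (1.0 / rank_then)

# "atomic power": ~200,000th in June 2015 vs ~10,000th in July 2020
print(rank_shift(200_000, 10_000))   # -> 20.0, i.e. roughly 20x more prominent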

~~~~

Comments?

Photo Credit(s):

Software defined power grid

Read an article this past week in IEEE Spectrum (The Software Defined Power Grid is here) about a company that has been implementing software defined power grids throughout the USA and the world to better integrate and utilize renewable energy alongside conventional power generation equipment.

Moreover, within the last year or so, Tesla has installed a Virtual Power Plant (VPP) using residential solar and grid scale batteries to better manage the electrical grid of South Australia (see Tesla’s Australian VPP propped up grid during coal outage). VPP use to offset power outages would necessitate something like a software defined power grid.

Software defined power grid

Not sure if there's a formal definition somewhere, but from our perspective, a software defined power grid is one where power generation and control are all done through programmatic automation. The human operator still exists to monitor and override when something goes wrong, but they are not involved in the moment-to-moment control of which power is stored vs. fed into the grid.

About a decade ago, we wrote a post about smart power meters (Smart metering's data storage appetite) discussing the implementation of smart meters for homeowners that had some capabilities to help monitor and control power use. Although that technology still exists, the software defined power grid has moved well beyond it.

The IEEE Spectrum article talks about phasor measurement units (PMUs) that are already installed throughout most power grids. It turns out that most PMUs are capable of transmitting phasor power status 60 times a second, and each status report is time stamped with high-accuracy, GPS-synchronized time.

On the other hand, most power grids today use SCADA (supervisory control and data acquisition) systems to monitor and manage the grid. But SCADA systems only send data every 2-4 seconds. So even though PMUs are installed in most power grids, their information plays a smaller role than SCADA data in the monitoring, management and control of most (non-software-defined) power grids.
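
A back-of-the-envelope comparison of the two reporting rates shows why PMU data is so much richer (the per-report size is an assumption just to make the arithmetic concrete):

pmu_rate_hz = 60          # PMU status reports per second (from the article)
scada_interval_s = 3      # SCADA reports every 2-4 seconds; use 3 s here
report_bytes = 100        # assumed size of one status report

pmu_mb_per_day = pmu_rate_hz * 86_400 * report_bytes / 1e6
scada_mb_per_day = (86_400 / scada_interval_s) * report_bytes / 1e6

print(f"PMU:   {pmu_mb_per_day:,.0f} MB/day per monitored point")     # ~518 MB
print(f"SCADA: {scada_mb_per_day:,.1f} MB/day per monitored point")   # ~2.9 MB
print(f"PMU sends ~{pmu_rate_hz * scada_interval_s}x more samples")   # ~180x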

One software defined power grid

PXiSE, the company in the IEEE Spectrum article, implemented their first demonstration project in Hawaii. That power grid had reached the limit of wind and solar power that it could support with human management. The company took their time and implemented a digital simulation of the power grid. With the simulation in hand, battery storage and an off-the-shelf PC, the company was able to manage the grid's power generation mix in real time with complete automation.

After that success, the company next turned to a micro-grid (building-level power) with electric vehicles, battery and solar power. Their software defined power grid reduced peak electricity demand within the building, saving significant money. With that success the company took their software defined power grid on the road to South Korea, Chile, Mexico and a number of other locations around the world.

Tesla’s VPP

The Tesla VPP in South Australia is planned to consist of up to 50K houses with solar PV panels and 13.5KWh of battery each, able to deliver up to 250MW of power generation and 650MWh of power storage.
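
A quick sanity check on those numbers (the per-house output is inferred from the totals, not a published spec):

houses = 50_000
battery_kwh = 13.5            # kWh of battery per house (from the article)
planned_mw = 250              # MW of planned generation
planned_mwh = 650             # MWh of planned storage

raw_storage_mwh = houses * battery_kwh / 1_000     # = 675 MWh raw capacity
per_house_kw = planned_mw * 1_000 / houses         # = 5 kW per house

print(f"{raw_storage_mwh:.0f} MWh of raw battery vs {planned_mwh} MWh planned")
print(f"Implied output of about {per_house_kw:.0f} kW per house")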

At the present time, the system has ~1000 house systems installed, but even with that limited generation and storage capability it has already been called upon at least twice to compensate for coal-generation power outages. To manage each and every household, they'd need something akin to the smart meters mentioned above in conjunction with a plethora of PMUs.

Puerto Rico’s power grid problems and solutions

There was an article not so long ago in IEEE Spectrum about the disruption to Puerto Rico's power grid caused by Hurricanes Irma and Maria (Rebuilding Puerto Rico's Power Grid: The Inside Story) and a subsequent article on making Puerto Rico's power grid more resilient to hurricanes and other natural disasters (How to harden Puerto Rico's power grid). The latter article talked about creating micro-grids, community PV and battery storage that could be disconnected from the main grid in times of disaster but also used to distribute power generation throughout the island.

Although the researchers didn’t call for the software defined power grid, it is our understanding that something similar would be an outstanding addition to their work there.

~~~~

As the use of renewables goes up and the price of batteries decreases while their capabilities improve, more and more power grids will need to become software defined. In the end, more software defined power grids with increasing renewable power generation and storage will make any power grid more resilient and more fault tolerant.

Photo Credit(s):

All that AI DL training data comes from us

Read a couple of articles over the past few weeks that highlighted something not many of us are aware of: most of the data used to train AI deep learning (DL) models comes from us.

That is, it comes through our ignorance of, or tacit acceptance of, the licenses for apps we use every day, and from just walking around and interacting with the world.

The article in The Atlantic, The AI supply chain runs on ignorance, talks about Ever, a picture sharing app (like Flickr), where users opted in to its facial recognition software to tag people in pictures. Ever also used that (machine- or person-tagged) data to train its facial recognition software, which it sells to government agencies throughout the world.

The second article, in Engadget, Colorado College students were secretly used to train AI facial recognition (software), talks about a group using a telephoto security camera that was pointed at a high-traffic area on campus. The data obtained was used to help train an AI DL model to identify facial characteristics from far away.

The article went on to say that gathering photos from people in public places is not against the law. The study was also cleared by the school. The database was not released until after the students graduated but it did have information about the time and date the photos were taken.

But that’s nothing…

The same thing applies to video sharing and photo animation models, podcasting and text-to-speech models, blogging and written word generation models, etc. All this data is just lying around the web, freely available for any AI DL data engineer to grab and use to train their models. The paper behind the image below describes a new dataset of millions of webpages.

From an OpenAI paper on better language models showing the accuracy of some AI DL models “trained on a new dataset of millions of webpages called WebText.”

Google photo search is scanning the web and has access to any photo posted, to use for training data. Facebook, IG, and others have millions of photos that people post online every day, many of which are tagged with information identifying the people in the photos. I'm sure somewhere there's a clause in a license agreement that says your photos, when posted on our app, no longer belong to you alone.

As security cameras become more pervasive, camera data will readily be used to train even more advanced facial recognition models without your say-so, approval or even awareness that it is happening. And this is in the first world, where data privacy and identity security protections are supposedly paramount; imagine how the rest of the world's data will be used.

With AI DL models, it’s all about the data. Yes much of it is messy and has to be cleaned up, massaged and sometimes annotated to be useful for DL training. But the origins of that training data are typically not disclosed to the AI data engineers nor the people that created it.

We all thought China would have a lead in AI DL because of their unfettered access to data, but the west has its own way to gain unconstrained access to vast amounts of data. And we are living through it today.

Yes, AI DL models have the potential to drastically help the world, humanity and governments do good things better. But a dark side to AI DL models also exists, helping bad actors, organizations and even some government agencies do evil.

Caveat usor (May the user beware)

~~~~

Comments?

Photo Credit(s): “Still Watching You” by jhcrow is licensed under CC BY-NC 2.0 

“Computational Photography Homework 1 Results” by kscottz is licensed under CC BY-NC 2.0

From Language models are unsupervised multi-task learners OpenAI research paper

IT in space

Read an article last week about all the startup activity that’s taking place in space systems and infrastructure (see: As rocket companies proliferate … new tech emerges leading to a new space race). This is a consequence of cheap(er) launch systems from SpaceX, Blue Origin, Rocket Lab and others.

SpaceBelt, storage in space

One startup that caught my eye was SpaceBelt from Cloud Constellation Corporation, which is planning to put a PB (4X the Library of Congress) of data storage in a constellation of LEO satellites.

The LEO storage pool will be populated by multiple nodes (satellites), with a set of geosynchronous satellites acting as access points. Customers use ground-based secure terminals to talk with the geosynchronous access satellites, which communicate with the LEO storage nodes to access data.

Their main selling points appear to be data security and availability. The only way to access the data is through secured satellite downlinks/uplinks and then you only get to the geo-synchronous satellites. From there, those satellites access the LEO storage cloud directly. Customers can’t access the storage cloud without going through the geo-synchronous layer first and the secured terminals.
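
One consequence of that ground-terminal-to-GEO-to-LEO path is propagation delay. A rough speed-of-light estimate (standard orbital altitudes, plus a very simplified one-bounce model of the hops, are my assumptions):

C_KM_PER_S = 299_792      # speed of light
GEO_ALT_KM = 35_786       # geosynchronous altitude
LEO_ALT_KM = 1_000        # assumed LEO storage-node altitude

# Ground up to GEO, across/down to a LEO node, and the same path back.
one_way_km = GEO_ALT_KM + (GEO_ALT_KM - LEO_ALT_KM)
round_trip_ms = 2 * one_way_km / C_KM_PER_S * 1_000
print(f"~{round_trip_ms:.0f} ms propagation delay per storage round trip")  # ~470 ms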

The problem with terrestrial data is that it is prone to security threats as well as natural disasters, which can take out a data center or a region. With all your data residing in a space cloud, such concerns shouldn't be a problem. (However, gaining access to your ground stations is a whole different story.)

AWS and Lockheed-Martin supply new ground station service

The other company of interest is not a startup but a link up between Amazon and Lockheed Martin (see: Amazon-Lockheed Martin …) that supplies a new cloud based, satellite ground station as a service offering. The new service will use Lockheed Martin ground stations.

Currently, the service is limited to S-band and antennas located in Denver, but plans are to expand to X-band and locations throughout the world. The plan is to have ground stations located close to AWS data centers, so data center customers can have high-speed access to satellite data.

There are other startups in the ground station as a service space, but none with the resources of Amazon-Lockheed. All of this competition is just getting off the ground, but a few have been leasing idle ground station resources to customers. The AWS service already has a few big customers, like DigitalGlobe.

One thing we have learned, is that the appeal of cloud services is as much about the ecosystem that surrounds it, as the service offering itself. So having satellite ground stations as a service is good, but having these services, tied directly into other public cloud computing infrastructure, is much much better. Google, Microsoft, IBM are you listening?

Data centers in space

Why stop at storage? Wouldn't it be better to support both storage and computation in space? That way access latencies wouldn't be a concern. And when terrestrial disasters occur, it's not just data at risk; ditto for security threats.

Having whole data centers in orbit would represent a whole new stratum of cloud computing. It would also let IT implement space-native applications.

If Microsoft can run a data center under the oceans, I see no reason they couldn’t do so in orbit. Especially when human flight returns to NASA/SpaceX. Just imagine admins and service techs as astronauts.

And yet, security and availability aren't the only threats one has to deal with. What happens to the space cloud when war breaks out and satellite killers are set loose?

Yes, space infrastructure is not subject to terrestrial disasters or internet-based security risks, but there are other problems besides those and war, such as solar storms and space debris clouds.

In the end, it's important to have multiple, non-overlapping risk profiles for your IT infrastructure. That is, each IT deployment may be subject to one set of risks, but those sets should be disjoint from those of another IT deployment option. IT in space, subject to solar storms, space debris, and satellite killers, is a nice complement to terrestrial cloud data centers, subject to natural disasters, internet security risks, and other earth-based, man-made disasters.

On the other hand, a large solar storm like the 1859 one could knock out every data system in the world or in orbit. As for under the sea, it probably depends on how deep it was submerged!!

Photo Credit(s): Screen shots from SpaceBelt youtube video (c) SpaceBelt

Screen shots from AWS Ground Station as a Service sign up page (c) Amazon-Lockheed

Screen shots from Microsoft’s Under the sea news feature (c) Microsoft

The wizardry of StorMagic

We talked with Hans O’Sullivan, CEO, and Chris Farey, CTO, of StorMagic during Storage Field Day 6 (SFD6, view videos of their session) a couple of weeks back, and they presented some interesting technology, at least to me.

Their SvSAN software defined storage solution has been around since 2009, and was originally intended to provide shared storage for SMB environments, but in 2011 its focus changed to remote office/branch office (ROBO) sites for larger customers.

What makes SvSAN such an appealing solution is that it's a software-only storage solution that can use a minimum of 2 servers to provide a high availability, shared block storage cluster, which can all be managed from one central site. SvSAN installs as a virtual storage appliance (VSA) that runs as a virtual machine under a hypervisor, and you can assign it to manage as much or as little of the direct-attached or SAN-attached storage available to the server.

SvSAN customers

As of last count they had 30K licenses in 64 countries across 6 continents, were managing over 57PB of data, and had one (large retail) customer with over 2000 sites managed from one central location. They had pictures of one customer in their presentation which, judging by the colors, made it obvious who it was, though they couldn't actually say.

One customer with 1000s of sites had prior storage that was causing 100s of store outages a year, each of which averaged 6 hours to recover and cost them $6K. Failure costs could be much larger, and outages much longer, if there was data loss. They obviously needed a much more reliable storage system and wanted to reduce their cost of maintenance. Turning to SvSAN saved them lots of money and time and eliminated their maintenance downtime.

Their largest vertical is retail, but StorMagic does well in most ROBO environments, which have limited IT staff and limited data requirements. Other verticals they mentioned included defense (they specifically mentioned the German Army, who have a parachute-deployable, all-SSD SvSAN storage/data center), manufacturing (with small remote factories), government (with numerous sites around the world), financial services (banks with many remote offices), restaurant and hotel chains, large energy companies, wind farms, etc. Hans mentioned one large wind farm operator that said their "field" data centers were so remote it took 6 days to get someone out to them to solve a problem, but they needed 600GB of shared storage to manage the complex.

SvSAN architecture

SvSAN uses synchronous mirroring between pairs of servers so that the data is constantly available on both servers of a pair. Presumably the storage available to the SvSAN VSAs running in the two servers has to be similar in capacity and performance.

An SvSAN cluster can grow by adding pairs of servers or by adding storage to an already present SvSAN cluster. One can have as many pairs of servers in a local SvSAN cluster as one wants (there's probably some maximum here, but I can't recall what they said). The cluster interconnect is 1GbE or 10GbE. Most (~90%) of SvSAN implementations are under 2TB of data, but their largest single clustered configuration is 200TB.

SvSAN supplies iSCSI storage services and runs inside a Linux virtual machine. SvSAN can support both bare-metal and virtualized server environments.

All the storage within a server that is assigned to SvSAN is pooled together and carved out as iSCSI virtual disks. SvSAN can make use of RAID controllers with JBODs, DAS or even SAN storage; anything that is accessible to the virtual machine can be configured as part of SvSAN's storage pool.

Servers accessing the shared iSCSI storage may access either of the servers in a synchronously mirrored pair. As it's a synchronous mirror, any write to one of the servers is automatically mirrored to the other side before an acknowledgement is sent back to the host. Synchronous mirroring depends on multi-pathing software at the host.

As in any solution that supports active-active read-write access, there is a need for a quorum service to be hosted somewhere in the environment, hopefully at a location distinct from where a problem could potentially occur, though it doesn't have to be. In StorMagic's case this could reside on any physical server, even in the same environment. The quorum service is there to decide which of the two copies is "more" current when there is some sort of split-brain scenario, that is, when the two servers in a synchronized pair lose communication with one another. At that point the quorum service declares one dead and the other active, and from that point on all IO activity must be done through the active SvSAN server. The quorum service can run on Linux or Windows, remotely or locally. Any configuration changes need to be communicated to the quorum service.
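
A minimal sketch of the write path and the split-brain arbitration described above (this is generic synchronous-mirroring logic with made-up object interfaces, not StorMagic's actual implementation):

class MirroredVolume:
    # Generic synchronous mirror: acknowledge the host only after both copies land.
    def __init__(self, local, peer, quorum):
        self.local, self.peer, self.quorum = local, peer, quorum
        self.active_only = False    # True once quorum has fenced off the peer

    def write(self, block, data):
        self.local.write(block, data)
        if not self.active_only:
            try:
                self.peer.write(block, data)     # mirror before acknowledging
            except ConnectionError:
                # Lost contact with the peer: ask the quorum service which
                # side stays active, then continue on one copy (or stop).
                winner = self.quorum.arbitrate(self.local.id, self.peer.id)
                if winner != self.local.id:
                    raise RuntimeError("fenced off by quorum; stop serving I/O")
                self.active_only = True
        return "ack"                             # host sees the ack only now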

They have a bare metal recovery solution. Specifically, when one server fails, customers can ship out another server with a matching configuration to be installed at the remote site. When the new server comes up, it auto-configures its storage and networking by using the currently active server in the environment and starts a resynchronization process with that server. This means it can be brought up into high availability mode with almost no IT support, other than what it takes to power the server and connect some networking ports. This was made for ROBO!

Code upgrades can be done by taking one of the pair of servers down, loading the new code and resynching its data. Once the resynch completes you can do the same with the other server.

They support a fast-resynch service for when one of the pair goes down for any reason. At that point the active server starts tracking any changes that occur in a journal, and when the other server comes back up, it just resends the changes that have occurred since that server went down.
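
The fast-resynch idea can be sketched as a change journal kept while the partner is down and replayed when it returns; again, an illustration of the concept, not StorMagic's code:

class ChangeJournal:
    # Track which blocks changed while the mirror partner was down,
    # then resend only those blocks when it comes back.
    def __init__(self):
        self.dirty = set()

    def record(self, block):
        self.dirty.add(block)

    def resync(self, read_block, send_to_peer):
        for block in sorted(self.dirty):
            send_to_peer(block, read_block(block))
        self.dirty.clear()          # back in sync; full mirroring resumes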

SvSAN has support for SSDs and just released an SSD write-back caching feature to help improve disk write speeds. They also support an all-SSD configuration for harsh environments.

StorMagic also offers an option for non-mirrored disk but I can’t imagine why anyone would use it.

They can dynamically move one mirrored iSCSI volume from one pair of servers to another, without disrupting application activity.

The minimum hardware configuration requires a single core, but SvSAN can use as many cores as you can give it. StorMagic commented that a single core maxes out at 50-60K IOPS, but you can always add more cores to the solution.

The SvSAN cluster can be managed in VMware vCenter or Microsoft System Center (MSSC) and it maintains statistics which help monitor the storage clusters in the remote office environments.

They also have a scripted recipe to help bring up multiple duplicate remote sites where local staff only need to plug in minimal networking and some storage information and they are ready to go.

SvSAN pricing and other information

Their product lists at $2K for 2 servers and 2TB of data storage, and they have standard license options for 4, 8, and 16TB across a server pair, after which it's unlimited amounts of storage for $10K. This doesn't include hardware or physical data storage; it's just for the SvSAN software and management.

They offer a free 60 day evaluation license on their website (see link above).

There was a lot of twitter traffic and onsite discussion as to how this compared to HP's StoreVirtual VSA solution. The contention was that StoreVirtual required more nodes, but there was no one from HP there to dispute this.

Didn't hear much about snapshots, thin provisioning, remote replication, deduplication or encryption. But for ROBO environments, which are typically under 2TB, most of these features are probably overkill, especially when there's no permanent onsite IT staff to support the local storage environment.

~~~~

I had talked with StorMagic previously at one or more of the storage/IT conferences we have attended in the past and had relegated them to SMB storage solutions. But after talking with them at SFD6, their solution became much clearer. All of the sophisticated functionality they have developed, together with their software-only approach, seems to be a very appealing solution for these ROBO environments.


Protest intensity, world news database and big data – chart of the month

Read an article the other day on the analysis of the Arab Spring (Did the Arab Spring really spark a wave of global protests, in Foreign Policy) using a Google Ideas sponsored project, the GDELT Project (Global Database of Events, Language and Tone), a file of events extracted from worldwide media sources. The GDELT database uses sophisticated language processing to extract "event" data from news media streams and supplies this information in database form. The database can be analyzed to identify trends in world events and possibly to better understand what led up to events that occur on our planet.

[Chart: time domain run chart showing protest intensity every month for the last 30 years, with running average]

GDELT Project

The GDELT database records over 300 categories of events, each geo-referenced down to the city or mountaintop level and time-referenced. The event data dates back to 1979. GDELT captures some 60 attributes for any event that occurs, generating a giant spreadsheet of event information with location, time, parties, and myriad other attributes all identified and cross-referenceable.

Besides the extensive spreadsheet of world event attribute data, the GDELT project also supplies a knowledge-graph-oriented view of its event data. The GDELT knowledge graph "compiles a list of every person, organization, company, location and over 230 themes and emotions from every news report" that can then be used to create network diagrams/graphs to better visualize interactions between events.

For example, see the Global Conversation in Foreign Policy for a network diagram of every person mentioned in the news during 6 months of 2013. You can zoom in or out to see how people identified in news reports are connected during those six months. So if you were interested in, let's say, the Syrian civil war, you could easily see at a glance any news item that mentioned Syria or was located in Syria from 1979 to now. Wow!

Arab Spring and Worldwide Protest

Getting back to the chart-of-the-month, the graphic above shows the "protest intensity" by month for the last 30 years, with a running average charted in black, using GDELT data. (It's better seen in the FP article linked above, or just click on it for an expanded view.)

One can see from the chart that there was a significant increase in protest activity after January 2011, which corresponds to the beginning of the Arab Spring. But the striking inference from the chart is that this increase has continued ever since, suggesting the Arab Spring has had a lasting effect, significantly increasing worldwide protest activity.
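
For anyone who wants to try this themselves, here's a hedged sketch of computing a similar monthly protest-intensity series from a GDELT event export. The column names and the use of CAMEO root code 14 for protest events are from my reading of the GDELT documentation, so verify them before relying on the result:

import pandas as pd

# Assumed: a GDELT-style tab-separated event file (with a header) that has at
# least a MonthYear (YYYYMM) column and an EventRootCode (CAMEO root) column.
events = pd.read_csv('gdelt_events.tsv', sep='\t',
                     usecols=['MonthYear', 'EventRootCode'],
                     dtype={'MonthYear': str, 'EventRootCode': str})

protests = events[events['EventRootCode'] == '14']          # CAMEO 14 = protest
monthly_protests = protests.groupby('MonthYear').size()
monthly_total = events.groupby('MonthYear').size()

# Normalize by total events so the series reflects intensity,
# not just growth in media coverage over the years.
intensity = monthly_protests / monthly_total
print(intensity.rolling(12).mean().tail())                  # 12-month running average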

This is just one example of the types of research available with the GDELT data.

~~~~

I have talked in the past about how (telecom, social media and other) organizations should deposit their corporate/interaction data sets in some public repository, for the greater good of humanity, so that any researcher could use them (see my Data of the world, lay down your chains post for more on this). The GDELT Project is Google Ideas doing this on a larger scale than I ever thought feasible. Way to go.

Comments?

 Image credits: (c) 2014 ForeignPolicy.com, All Rights Reserved


Replacing the Internet?

safe ‘n green by Robert S. Donovan (cc) (from flickr)

Was reading an article the other day from TechCrunch that said Servers need to die to save the Internet. This article talked about a startup called MaidSafe which is attempting to re-architect/re-implement/replace the Internet into a Peer-2-Peer, mesh network and storage service which they call the SAFE (Secure Access for Everyone) network. By doing so, they hope to eliminate the need for network servers and storage.

Sometime in the past I wrote a blog post about Peer-2-Peer cloud storage (see Free P2P Cloud Storage and Computing if interested). But it seems MaidSafe has taken this to a more extreme level. By the way, the acronym MAID used in their name stands for Massive Array of Internet Disks; sound familiar?

Crypto currency eco-system

The article talks about MaidSafe's SAFE network ultimately replacing the Internet, but at the start it seems more to be a way to deploy secure P2P cloud storage. One interesting aspect of the MaidSafe system is that you can dedicate a portion of your Internet-connected computers' storage, computing and bandwidth to the network and get paid for it. Assuming you dedicate more resources than you actually use, you will be paid safecoins for this service.

For example, users that wish to participate in the SAFE network’s data storage service run a Vault application and indicate how much internal storage to devote to the service. They will be compensated with safecoins when someone retrieves data from their vault.

Safecoins are a new BitCoin-like internet currency. Currently one safecoin is worth about $0.02, but there was a time when BitCoins were worth a similar amount. The MaidSafe organization states that there will be a limit to the number of safecoins that can ever be produced (4.3 billion), so there's obviously a point where they will become more valuable if MaidSafe and their SAFE network become successful over time. Also, earned safecoins can be used to pay for other MaidSafe network services as they become available.

Application developers can code their safecoin wallet IDs directly into their apps and have the SAFE network automatically pay them for application/service use. This should make it much easier for app developers to make money off their creations, as they will no longer have to rely on advertising support or offer different product tiers (free for simple users, paid for experts) to make money from apps. I suppose this could apply in a similar fashion to information providers on the SAFE network: an information warehouse could charge safecoins for document downloads or online access.

All data objects are encrypted, split and randomly distributed across the SAFE network

The SAFE network encrypts and splits any data and then randomly distributes these data splits uniformly across its network of nodes. The data is also encrypted in transit across the Internet using rUDPs (reliable UDPs), and SAFE doesn't use standard DNS services. Makes me wonder how SAFE or Internet network nodes know where rUDP packets need to go next without DNS, but I'm no networking expert. Apparently, by encrypting rUDPs and not using DNS, SAFE network traffic should not be prone to deep packet inspection nor be easy to filter out (except of course if you block all rUDP traffic). The fact that all SAFE network traffic is encrypted also makes it much harder for intelligence agencies to eavesdrop on any conversations that occur.

The SAFE network depends on a decentralized PKI to authenticate and supply encryption keys. All SAFE network data is either encrypted by clients or cryptographically signed by the clients and as such, can be cryptographically validated at network endpoints.

Each data chunk is replicated on, at a minimum, 4 different SAFE network nodes, which provides resilience in case a network node goes down/offline. Each data object could potentially be split up into 100s to 1000s of data chunks. Also, each data object has its own encryption key, dependent on the data itself, which is never stored with the data chunks. Again this provides even better security, but the question becomes where all this metadata (data object encryption keys, chunk locations, PKI keys, node IP locations, etc.) gets stored, how it is secured, and how it is protected from loss. If they are playing the game right, all this is just another data object which is encrypted, split and randomly distributed, but some entity needs to know how to get to the metadata root element to find it all in case of a network outage.
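
To make the split-encrypt-distribute idea concrete, here's a simplified sketch using generic content-derived chunk keys and random 4-way placement; this is not MaidSafe's actual self-encryption scheme, and the node interface is made up for illustration:

import base64, hashlib, random
from cryptography.fernet import Fernet    # pip install cryptography

CHUNK_SIZE = 1024 * 1024    # 1 MB chunks (arbitrary choice)
REPLICAS = 4                # minimum copies per chunk, per the description above

def store_object(data: bytes, nodes: list):
    # Split data into chunks, encrypt each with a key derived from its own
    # content, and place each encrypted chunk on 4 randomly chosen nodes.
    placements = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        key = base64.urlsafe_b64encode(hashlib.sha256(chunk).digest())
        token = Fernet(key).encrypt(chunk)
        targets = random.sample(nodes, REPLICAS)
        for node in targets:
            node.put(hashlib.sha256(token).hexdigest(), token)   # assumed node API
        placements.append((key, targets))
    # This metadata (keys + locations) must itself be stored and protected.
    return placements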

Supposedly, MaidSafe can detect within 20 msec if a node is no longer available and reconfigure the whole network. This probably means that each SAFE network node and endpoint is responsible for some network transaction/activity every 10-20 msec, such as a SAFE network heartbeat to say it is still alive.

It’s unclear to me whether the encryption key(s) used for rUDPs and the encryption key used for the data object are one and the same, functionally related, or completely independent? And how a “decentralized PKI”  and “self authentication” works is beyond me but they published a paper on it, if interested.

For-profit open source business model

MaidSafe code is completely Open Source (available at MaidSafe GitHub) and their APIs are freely available to anyone and require no API key. They also have multiple approved and pending patents which have been provided free to the world for use, which they use in a defensive capacity.

MaidSafe says it will take a 5% cut of all safecoin transactions over the SAFE network. As the network grows, their revenue should grow commensurately. The money will be used to maintain the core network software, and MaidSafe said that their 5% cut will be shared with developers that help develop/fix the core SAFE network code.

They are hoping to have multiple development groups maintaining the code. They currently have some across Europe and in California in the US. But this is just a start.

They are just now coming out of stealth and have recently received a $6M USD investment (by auctioning off MaidSafeCoins, a progenitor of safecoins), but they have been in operation, architecting/designing/developing the core code, for 8+ years now, which probably qualifies them as the longest running startup on the planet.

Replacing the Internet

MaidSafe believes that the Internet as currently designed is too dependent on server farms to hold pages and other data. Having a single place where network data is held is inherently less secure than having data spread out uniformly/randomly across multiple nodes. Also, the fact that most network traffic is in plain text (un-encrypted) means anyone in the network data path can examine and potentially filter out data packets.

I am not sure how the SAFE network can be used to replace the Internet, but then I'm no networking expert. For example, from my perspective, SAFE is dependent on current Internet infrastructure to store and forward rUDPs along its trunk lines and network end-paths. I don't see how SAFE can replace this current Internet infrastructure, especially with nodes only present at the endpoints of the network.

I suppose as applications and other services start to make use of SAFE network core capabilities, the SAFE network could become more like a mesh network and less dependent on the hub-and-spoke Internet we have today. As a mesh network, node endpoints can store and forward packets themselves to locally accessed neighbors and only go out on Internet hubs/trunk lines when they have to go beyond the local network link.

Moreover, the SAFE network can make Internet infrastructure less vulnerable to filtering and spying. Also, it's clear that SAFE applications no longer execute on data center servers somewhere but rather execute on end-point nodes of the SAFE network. This has a number of advantages, namely:

  • SAFE applications are less susceptible to denial of service attacks because they can execute on many nodes.
  • SAFE applications are inherently more resilient because they operate across multiple nodes all the time.
  • SAFE applications support faster execution because they could potentially be executing closer to the user and could have many more instances running throughout the SAFE network.

Still all of this doesn’t replace the Internet hub and spoke architecture we have today but it does replace application server farms, CDNs, cloud storage data centers and probably another half dozen Internet infrastructure/services I don’t know anything about.

Yes, I can see how MaidSafe and its SAFE network can change the Internet as we know and love it today and make it much more secure and resilient.

Not sure how having all SAFE data encrypted will work with search engines and other web crawlers, but maybe if you want the data searchable, you just cryptographically sign it rather than encrypt it. This could be both a good and a bad thing for the world.

Nonetheless, you have to give the MaidSafe group a lot of kudos/congrats for taking on securing the Internet and making it much more resilient. They have an active blog and forum that discuss the technology and what's happening with it, and I encourage anyone interested in the technology to visit their website to learn more.

~~~~

Comments?

Computational Anthropology & Archeology

Read an article this week from Technology Review on The Emerging Science of Computational Anthropology. It was about the use of raw social media feeds to study the patterns of human behavior and how they change over time. In this article, they had come up with some heuristics that could be used to identify when people are local to an area and when they are visiting or new to an area.

Also, this past week there was an article in the Economist, Mining for Tweets of Gold, about the startup DataMinr, which uses raw twitter feeds to supply information about what's going on in the world today. Apparently DataMinr is used by quite a few financial firms, news outlets, and others and has a good reputation for discovering news items that have not been reported yet. DataMinr is just one of a number of commercial entities doing this sort of analysis on Twitter data.

A couple of weeks ago I wrote a blog post on Free Social and Mobile Data as a Public Good. In that post I indicated that social and mobile data should be published, periodically in an open format, so that any researcher could examine it around the world.

Computational Anthropology

Anthropology is the comparative study of human culture and condition, both past and present. There are many branches to the study of anthropology, including but not limited to physical/biological, social/cultural, archeological and linguistic anthropology. Using social media/mobile data to understand human behavior, development and culture would fit into the social/cultural branch of anthropology.

I have also previously written about some recent computational anthropological research (although I didn't call it that); please see my Cheap phones + big data = better world and Mobile phone metadata underpins a new science posts. The fact is that mobile phone metadata can be used to create a detailed and deep understanding of a society's mobility. A better understanding of human mobility in a region can be used to create more effective mass transit and more efficient road networks, and to reduce pollution/energy use, among other things.

Social media can be used in a similar manner, but it offers more than just location information; some of it is about how people describe events and how they interact through text and media technologies. One research paper discussed how tweets could be used to detect earthquakes in real time (see: Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors).
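
A toy sketch of the burst-detection idea behind that kind of event detection, flagging a window whose keyword count far exceeds the recent baseline (the actual paper uses probabilistic classifiers plus location estimation, so this is only the core intuition):

from collections import deque

class KeywordBurstDetector:
    # Flag an event when keyword mentions in the current time window
    # far exceed the average of recent windows.
    def __init__(self, history_windows=24, factor=5.0):
        self.history = deque(maxlen=history_windows)
        self.factor = factor

    def observe_window(self, count):
        baseline = sum(self.history) / len(self.history) if self.history else 0
        burst = bool(self.history) and count > self.factor * max(baseline, 1)
        self.history.append(count)
        return burst

# Counts of tweets containing "earthquake" per 10-minute window:
detector = KeywordBurstDetector()
for c in [3, 2, 4, 3, 2, 180]:
    print(c, detector.observe_window(c))    # only the 180-count window is flagged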

Although the location information provided by mobile phone data is more important to governments and transportation officials, it appears as if social media data is more important to organizations seeking news, events, or sentiment trending analysis.

Sources of the data today

Recently, Twitter announced that it would make its data available to a handful of research organizations (see: Twitter releasing trove of user data …).

On the other hand, Facebook and LinkedIn seem a bit more restrictive in allowing access to their data. They have a few data scientists on staff, but if you want access to their data you have to apply for it and only a few are accepted.

Although Google, Twitter, Facebook, LinkedIn and the telecoms represent the lion's share of social/mobile data out there today, there are plenty of other sources of information that could potentially be useful. Notwithstanding the NSA, there is currently very limited research access to the actual content of mobile phone texts/messaging and, god forbid, emails. Although privacy concerns are high, I believe this ultimately needs to change.

Imagine if researchers had access to all the texts of a high school student body. Yes, much of it would be worthless, but some of it would tell a very revealing story about teenage relationships, interests and culture, among other things. And having this sort of information over time could reveal the history of teenage cultural change. Much of this would have previously been gleaned from magazines, but today texts would represent a much more granular level of this information.

Computational Archeology

Archeology is just anthropology from a historical perspective, i.e., it is the study of the history of cultures, societies and life. Computational archeology would apply to the history of the use of computers, social media, telecommunications, the Internet/WWW, etc.

There are only a few resources widely available for this data, such as the Internet Archive. But much of the history of WWW, social media, and telecom use resides in current and defunct organizations that, aside from Twitter, continue to be very stingy with their data.

Over time all such data will be lost or become inaccessible unless something is done to make it available to research organizations. I believe sooner or later humanity will wise up to the loss of this treasure trove of information and create some sort of historical archive for this data and require companies to supply this data over time.

Comments?

Photo Credit(s): State of the Linked Open Data Cloud (LOD), September 2011 by Duncan Hull