Storywrangler, ranking tweet ngrams over time

Read a couple of articles the past few weeks on a project in Vermont that has randomly selected 10% of all tweets (150 Billion) since the beginning of Twitter (2008) and can search and rank this tweet corpus for ngrams (1-, 2-, & 3-word phrases). All of these articles were reporting on a Science Advances article: Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter.

Why we need Storywrangler

The challenge with all social media is that it is transient, here now, (mostly) gone tomorrow. That is once posted, if it’s liked/re-posted/re-tweeted it can exist in echoes of the original on the service for some time, and if not, it dies out very quickly never to be seen (externally ever) again. While each of us could potentially see every tweet we have ever created (when this post is published it should be my 5387th tweet on my twitter account) but most of us cannot see this history for others.

All that makes viewing what goes on on social media impossible which leads to a lot of mis-understanding and makes it difficult to analyze. It would be great if we had a way of looking at social media activity in more detail to understand it better.

I wrote about this before (see my Computational anthropology & archeology post) and if anything, the need for such capabilities has become even more important in today’s society.

If only there was a way to examine the twitter-verse. What’s mainly lacking is a corpus of all tweets that have ever been tweeted. A way to slice, dice, search, and rank this text data would be a godsend to understanding (twitter and maybe social) history, in real time.

Storywrangler, has a randomized version of 10% of all tweets since twitter started. And it provides ngram searching and ranking over a specified time interval. It’s not everything but it’s a start.

Storywrangler currently has over 1 trillion (1- to 3- word) ngrams and they support ngram rankings for over 150 different languages.

Google books ngram viewer

The idea for the Storywrangler project came from Google’s books ngram viewer. Google’s ngram viewer has a corpus of Google books, over a time period (from 1800 to 2019) and allows one to search for ngrams (1- to 5-word phrases) over any time period they support.

Google’s ngram viewer charts ngrams with a vertical axis that is the % of all ngrams in their book corpus. One can see the rise and fall of ngrams, e.g., “atomic power”. The phrase “atomic power” peaked in Google books around 1960 at a height of 0.000260% of all 2 word ngrams. The time period level of granularity is a year.

The nice thing about Google books ngram data is you can download their book ngram data yourself. The data is of the form of tab separated list of rows with ngram text (1 to 5 words), year, how many times it occurred that year, on how many pages, on how many books on each row. Google books ngram data is generally about 2 years old.

Unclear just how much data is in Google’s books ngram database but for instance in the 1 gram English fiction list, they show a sample of two rows (the 3,000,000 and 3,000,001 rows) which are the 1978 and 1979 book counts for the word “circumvallate”.

Storywrangler tweet ngram viewer

The usage tab on the Storywrangler website provides a search engine that one can use to input N-grams that you want to search the corpus for and can visualize how their rank changes over time. For example, one can do a similar search on the “atomic power” ngram only for tweets.

From Storywrangler search one can see that peak tweet use of “Atomic Power” and “ATOMIC POWER” occurred somewhere in July of 2020 (only way to see the month is to hover over that line) and it’s rank reached somewhere around ~10,000 highest used tweet 2 word ngram during that time.

It’s interesting to see that ngram books and ngram twitter don’t seem to have any correlation. For example the prior best ranking for atomic power (~200Kth highest) was in June of 2015. There was no similar peak for book ngrams of the phrase.

For Storywrangler you can download a JSON or CSV version of the charts displayed. It’s not the complete ngram history that Google book ngram viewer provides. Storywrangler data is generally about 2 days old.

The other nice thing about Storywrangler is under the real-time tab it will show you ngram rankings at 15 minute intervals for whatever timeline you wish to see. Also under the trending tab it will show you the changing ranks for the top 5 ngrams over a selected time period. And the languagetab will do tracking for tweet language use for select languages. The common tab will track the ranking of most common ngrams (pretty boring mostly articles/prepositions) over time. And for any of these searches one can turn on or off retweet counting, which can help to eliminate bot activity.

Storywrangler provides a number of other statistics for ngrams other than just ranking such as odds (of occurring) and frequency (of occurrence). And one can also track rank change, old (years) rank vs. current (year) rank, rank (turbulence) divergence.



Photo Credit(s):

The beginning of the end of cancer

Read an article today about new research done to apply big data analytics against multiple cancer strains to identify key control mechanisms that allow cancer to survive in the body and multiply. The article Big data analysis find cancer’s key vulnerabilities discusses their discovery of 24 “master regulators” that are present in a number of different cancers. The original research article is in Cell (behind paywall) but I managed to find a preprint on BiorXiv.

From a (software) coding perspective, it’s almost like a majority of cancers are re-using the same modules to perform functions that are needed by the cancer cells. Not all cancers exhibit all master regulator blocks but all the cancers that they have examined have some of them.

The researchers examined the regulatory/signaling networks of proteins in 112 cancer cell lines. They identified 407 master regulatory proteins and further analysis showed that these protiens were associated with 24 master regulatory architectures (oncotectures). A decent laymen description of a cancer oncotecture can be found in an old (2016) Economist Article Cancer’s master criminals…

Master regulatory proteins

According to the Economist article master regulatory proteins are proteins that regulate processes in a cancer cell that cause other proteins to be made, which cause other proteins to be made, etc. which affect the way a cancer cell lives and propagates inside a body.

Biologists call these sorts of proteins transcription factors which controls the copying of DNA information into mRNA which are then taken to protein factories to create proteins from that blueprint.

The research team believe they24 master regulatory (MR) blocks, if they could be disabled somehow, would disrupt the cancer cell and ultimately eliminate that cancer from a body.

It’s almost like a DevOps script that automates the deployment of software inside the cloud. The fact that they have identified 24 master regulatory (MR blocks) architectures (sequences of proteins that are occur) that apply to a wide set of cancer tumor sub-types implies that these could be needed to regulate the functionality of these cancers. If drugs could be devised to interrupt, change or deactivate these master regulatory blocks it’s quite possible that these cancers would be eliminated.

Identifying MR Blocks using (Bio/Life Sciences) Big Data

It all starts with VIPER analysis (GitHub repo) that measures a specific proteins transcriptional activity level. In this fashion they were able to analyze the 112 tumor subtype proteome (the total complement of all proteins active in a cell). And whittle these down, using cluster analysis to those that were especially relevant for the cancer cell transcription activity.

They then used DIGGIT analysis (GitHub repo of R implementation) to identify the MR proteins and identify cellular mutations that led to them. The types of mutations can be copy number, single point or gene fusion. DIGGIT analysis can help identify which of the mutations are responsible for the protein being analyzed. The DIGGIT process is a multi-step, analytical approach to identifying candidate MR proteins.

Then using tumor checkpoint hypothesis and Bayesian analysis/integration they further ranked the MR candidate proteins. Tumor checkpoints are state transitions in the life of a cancer cell where the cell assesses its environment and then determines what actions to take next.

The tumor checkpoint hypothesis says that during the life cycle of a cancer cell it goes through various state transitions. The researchers have shown that these state transitions are managed by the MR blocks they have identified.

In the final step in their analysis, they used tumor checkpoint hypothesis and modularity with saturation & modularity analysis to identify top MR proteins and the MR blocks active in the 112 tumor subtypes.

At the end of their analysis, they had identified 24 MR blocks which solely or in some combination are present in each of the 112 tumor subtypes. If these MR blocks could be attacked by specific drugs then each of these 112 tumor subtypes could essentially be eliminated from a body or rather cure that cancer.

Photo Credit(s):