Visionary leadershp – Page 4 – Silverton Consulting

Better core allocation for congested web apps

Posted on February 23, 2019February 23, 2019 by Ray in Distributed computing, Energy efficiency, Processing performance, Strategic Inflection Points, System effectiveness, Visionary leadershp

Read an article in ScienceDaily (Achieving greater efficiency for fast datacenter operations) today that discussed some research done at MIT CSAIL to be presented next week at NSDI’19 discussing Shenango, a new algorithm to allocate idle CPU cores to process latency sensitive transaction workloads. The paper is to be presented on February 27th. (I may update this with more details on Shenango after the paper is published)

t appears that for many web-scale applications, response time is driven mostly by tail latencies (slowest service determines web page response). For these 10K-100K server environments, they have always had to over provision CPU cores to support reducing service tail latency. This has led to 100s to 1000s of cores, mostly sitting idle (but powered on) for much of the time.

here’s been some solutions that try to better use idle cores, but their core allocation responsiveness has been in the milliseconds. With 10-100s of threads that make up web service , allocating CPU resources in milliseconds was too slow

Arachne, a core aware thread scheduler

One approach to better core allocation uses Arachne: Core Aware Thread Management, out of Stanford.

With Arachne, threads are assigned to an application and each is given a priority. Arachne attempts to schedule them in priority order across an array of cores at its disposal.

Arachne’s Core Arbiter code is what assigns application threads to cores and runs under Linux at the user level. Some of its timings seem pretty fast. In the paper cited above, Arachne was able to schedule a thread to a core in under 300nsec.

Under Arachne, there are two sets of cores, managed and unmanaged cores and applications. Unmanaged cores run normal (non-Arachne, unmanaged) applications and threads. Managed cores or applications use Arachne to assign cores.

Arachne uses a Linux construct called cpusets, a collection of cores and memory banks, to allocate resources to run application threads. Cores and memory banks move between managed and unmanaged based on applications being run. Arachne assumes that managed apps have higher priority than unmanaged apps.

That is at the start of Arachne, all cores exist in the unmanaged set. The Core Arbiter executes here as well. As applications are scheduled to run, the Arbiter grabs cpusets from unmanaged applications or a free pool and assigns them to run application threads. When the application completes the cpusets are returned to the unmanaged pool.

Arachne allocates cores based on a priority scheme with 8 levels. Highest priority managed applications/threads get cpusets first, lower priority managed application threads next, and unmanaged applications last

There’s a set of APIs that applications must use to request and free cores when no longer in use. Arachne seems pretty general purpose, and as it operates with both normal (unmanaged) Linux applications as well as (Arachne) managed applications is appealing.

Shenango core allocation

Untitled by johnwilson1969 (cc) (from Flickr)

Not much technical information on Shenango was available as we published this post, but their is some information in the MIT/ScienceDaily piece and some in the Arachne paper.

It appears as if Shenango detects applications suffering from high tail latency by interfacing with the network stack and seeing if packets have been waiting to be processed. It does this every 5 usecs and if a packet has been waiting since last time, it’s considered a candidate for more cores, has tail latency problems and is congested.

IIt seems to do the same for computational processes that have been waiting for some service response. Shenango implements an IOKernel that handles core allocation to apps. Shenango IO

Shenango apps use an API to indicate when they are not processing time sensitive services and when they are. If they are not, their cores can be released to more time sensitive apps that are encountering congestion

Presumably Shenango does not execute at the user level. And it’s unclear whether it can operate with both (Linux) normal and Shanango managed applications. And it also appears to be tied tightly to the network stack. Whether any of this matters to web-scale application users/developers is subject to debate.

However, the fact that it only alters core allocations when applications are congested seems a nice feature.

~~~~

The Arachne paper said it “improved SLO MemCached by 37% and reduced tail latency by 10X” . The only metric available in the Shenango discussion was that they increased typical web-scale server CPU core allocation from 60% to 100%

f Shenango or Arachne can reduce over provisioning of CPU cores and memory, it could lead to significant energy and server savings. Especially for customers running 10K servers or more.

perspective by anomalous4 (cc) (from Flickr)

Data banks, data deposits & data withdrawals in the data economy – part 1

Posted on September 5, 2018March 13, 2023 by Ray in Artificial Intelligence, Data, Data banks, Data deposits, Data economy, Data ownership, Data withdrawals, Initial Data Offerings (IDOs), Market dynamics, Neural network, Strategic Inflection Points, Visionary leadershp

Big data visualization, Facebook friend connections — Facebook friend carrousel by antjeverena (cc) (from flickr)

Read an interesting article this week in The Atlantic, Why Technology Favors Tyranny by Yuvai Noah Harari, about the inevitable future of technology and how the use of data will drive it.

At the end of the article Harari talks about the need to take back ownership of our data in order to gain some control over the tech giants that currently control our data.

In part 3, Harari discusses the coming AI revolution and the impact on humanity. Yes there will still be jobs, but early on less jobs for unskilled labor and over time less jobs for skilled labor.

Yet, our data continues to be valuable. AI neural net (NN) accuracy increases as a function of the amount of data used to train it. As a result, he who has the most data creates the best AI NN. This means our data has value and can be used over and over again to train other AI NNs. This all sounds like data is just another form of capital, at least for AI NN training.

If only we could own our data, then there would still be value from people’s (digital) exertions (labor), regardless of how much AI has taken over the reigns of production or reduced the need for human work.

What we need is data (savings) banks. These banks would hold people’s data, gathered from social media likes/dislikes, cell phone metadata, app/web history, search history, credit history, purchase history, photo/video streams, email streams, lab work, X-rays, wearables info, etc. Probably many more categories need to be identified but ultimately ALL the digital data we generate today would need to be owned by people and deposited in their digital bank accounts.

Data deposits?

Social media companies, telecom, search companies, financial services app companies, internet providers, etc. anywhere you do business should supply a copy of the digital data they gather for a person back to that persons data bank account.

There are many technical problems to overcome here but it could be as simple as an object storage bucket, assigned to each person that each digital business deposits (XML versions of) our digital data they create for everyone that uses their service. They would do this as compensation for using our data in their business activities.

How to change data ownership?

Today, we all sign user agreements which essentially gives a company the rights to our data in perpetuity. That needs to change. I see a few ways that this change could come about

Countries could enact laws to insure personal data ownership resides in the person generating it and enforce periodic distribution of this data
Market dynamics could impel data distribution, e.g. if some search firm supplied data to us, we would be more likely to use them.
Societal changes, as AI becomes more important to profit making activities and reduces the need for human work, and as data continues to be an important factor in AI success, data ownership becomes essential to retaining the value of human labor in society.

Probably, all of the above and maybe more would be required to change the ownership structure of data.

How to profit from data?

Technical entities needing data to train AI NNs could solicit data contributions through an Initial Data Offering (IDO). IDO’s would specify types of data required and a proportion of AI NN ownership, they would cede to all data providers. Data providers would be apportioned ownership based on the % identified and the number of IDO data subscribers.

Data banks would extract the data requested by the IDO and supply it to the IDO entity for use. For IDOs, just like ICO’s or IPO’s, some would fail and others would succeed. But the data used in them would represent an ownership share sort of like a stock (data) certificate in the AI NN.

Data bank responsibilities

Data banks would have various responsibilities and would need to collect fees to perform them. For example, data banks would be responsible for:

Protecting data deposits – to insure data deposits are never lost, are never accessed without permission, are always trackable as to how they are used..
Performing data deposits – to verify that data is deposited from proper digital entities, to validate that data deposits are in a usable form and to properly store the data in a customers object storage bucket.
Performing data withdrawals – upon customer request, to extract all the appropriate data requested by an IDO, anonymize it, secure it, package it and send it to the IDO originator.
Reconciling data accounts – to track data transactions, data banks would supply a monthly statement that identifies all data deposits and data withdrawals, data revenues and data expenses/fees.
Enforcing data withdrawal types – to enforce data withdrawal types, as data withdrawals can have many different characteristics, such as exclusivity, expiration, geographic bounds, etc. Data banks would need to enforce withdrawal characteristics, at least to the extent they can
Auditing data transactions – to insure that data is used properly, a consortium of data banks or possibly data accountancies would need to audit AI training data sets to verify that only data that has been properly withdrawn is used in trying the NN. .

AI NN, tools and framework responsibilities

In order for personal data ownership to work well, AI NNs, tools and frameworks used today would need to change to account for data ownership.

Generate, maintain and supply immutable data ownership digests – data ownership digests would be a sort of stock registry for the data used in training the AI NN. They would need to be a part of any AI NN and be viewable by proper data authorities
Track data use – any and all data used in AI NN training should be traceable so that proper data ownership can be guaranteed.
Identify AI NN revenues – NN revenues would need to be isolated, identified and accounted for so that data owners could be rewarded.
Identify AI NN data expenses – NN data costs would need to somehow be isolated, identified and accounted for so that data expenses could be properly deducted from data owner awards. .

At some point there’s a need for almost a data profit and loss statement as well as a data balance sheet for at an AI NN level. The information supplied above should make auditing data ownership, use and rewards much more feasible. But it all starts with identifying data ownership and the data used in training the AI.

~~~~

There are a thousand more questions that come to mind. For example

Who owns earth sensing satellite, IoT sensors, weather sensors, car sensors etc. data? Everyone in the world (or country) being monitored is laboring to create the environment sensed by these devices. Shouldn’t this sensor data be apportioned to the people of the world or country where these sensors operate.
Who pays data bank fees? The generators/extractors of the data could pay in addition to providing data deposits for the privilege to use our data. I could also see the people paying. Having the company pay would give them an incentive to make the data load be as efficient and complete as possible. Having the people pay would induce them to use their data more productively.
What’s a decent data expiration period? Given application time frames these days, 7-15 years would make sense. But what happens to the AI NN when data expires. Some way would need to be created to extract data from a NN, or the AI NN would need to cease being used and a new one would need to be created with new data.
Can data deposits be rented/sold to data aggregators? Sort of like a AI VC partnership only using data deposits rather than money to fund AI startups.
What happens to data deposits when a person dies? Can one inherit a data deposits, would a data deposit inheritance be taxable as part of an estate transfer?

In the end, as data is required to train better AI, ownership of our data makes us all be capitalist (datalists) in the creation of new AI NNs and the subsequent advancement of society. And that’s a good thing.

Comments?

AI processing at the edge

Posted on August 16, 2018April 8, 2021 by Ray in Artificial Intelligence, Cognitive computing, Decision making, Distributed computing, Neural network, Processing performance, R&D measures, Strategic Inflection Points, System effectiveness, Visionary leadershp

Read a couple of articles over the past few weeks (TechCrunch: Google is making a fast, specialized TPU chip for edge devices … and IEEE Spectrum: Two startups use processing in flash for AI at the edge) about chips for AI at the IoT edge.

The two startups, Syntiant and Mythic, are moving to analog only or analog-digital solutions to provide AI processing needed at the edge while Google is taking their TPU technology to the edge. We have written about Google’s TPU before (see: TPU and hardware vs. software innovation (round 3) post).

But first please take our new poll:

The major challenge in AI processing at the edge is power consumption. Both startups attack the power problem by using flash and other analog circuitry to provide power efficient compute.

Google attacked the power problem with their original TPU by reducing computational precision from 64- to 8-bits. By reducing transistor counts, they lowered power requirements proportionally.

AI today is based on neural networks (NN), that connect simulated neurons via simulated synapses with weights attached to indicate whether to boost or decrease the signal being transmitted. AI learning is done by setting those weights and creating the connections between simulated neurons and the synapses. So learning is setting weights and establishing connections. Actual inferences (using AI to do something) is a process of exciting input simulated neurons/synapses and letting the signal flow through the NN with each weight being used to determine output(s).

AI with standard compute

The problem with doing AI learning or inferencing with normal CPUs or even CUDAs is that the NN does thousands if not millions of multiplication-accumulation actions at each simulated synapse-neuron connection. Doing all these multiplication-accumulation takes power. CPUs and CUDAs can do these sorts of operations on 32 or 64 bit numbers or even floating point but it still takes power.

AI processing power

AI processing power is measured in trillions of (accumulate-multiply) operations per second per watt (TOPS/W). Mythic believes it can perform 4 TOPS/W and Syntiant says it can do 20 TOPS/W. In comparison, the NVIDIA Volta V100 can do about 0.4 TOPS/W (according to the article). Although comparing Syntiant-Mythic TOPS to NVIDIA TOPS is a little like comparing apples to oranges.

A current Intel Xeon Platinum 8180M (2.5Ghz, 28 Core processors, 205 W) can probably do (assuming one multiplication-accumulation per hertz) about 2.5 Billion X 28 Cores = 70 Billion Ops Second/205 W or 0.3 GOPS/W (source: Platinum 8180M Data sheet).

As for Google’s TPU TOPS/W, TPU2 is rated at 45 GFLOPS/chip and best guess for power consumption is between 160W and 200W, let’s say 180W. With power at that level, TPU2 should hit 0.25 GFLOPS/W. TPU3 is coming out with 8X the power but it uses water cooling (read LOTS MORE POWER).

Nonetheless, it appears that Mythic and Syntiant are one to two orders of magnitude better than the best that NVIDIA and TPU2 can do today and many orders of magnitude better than Intel X86.

Improving TOPS/W

Using NAND, as an analog memory to read, write and hold NN weights is an easy way to reduce power consumption. Combine that with analog circuitry that can do multiplication and addition with those flash values and you have a AI NN processor. This way you reduce the need to hold weights in memory and do compute in registers by collapsing both compute and memory into the same componentry.

The major difference between Syntiant and Mythic seems to be the amount of analog circuitry they use. Mythic seems to relegate the analog circuitry to an accelerator while Syntiant has a more extensive use of analog circuitry throughout their chip. Probably why it can perform 5X the TOPS/W of Mythic’s IPU.

IBM and others have been working on neuromorphic chips some of which are analog based and others which are all digital based. We’ve written extensively on IBM and some on MIT’s approaches (for the latest on IBM see: More power efficient deep learning through IBM and PCM, and for MIT see: MIT builds an analog synapse chip) and follow the links there to learn more.

~~~~

Special purpose AI hardware is emerging from the labs and finally reaching reality. IBM R&D has been playing with it for a long time. Google is working on TPU3 so there’s no stopping them. And startups are seeing an opening and are taking everyone on. Stay tuned, were in for a good long ride before the someone rises above the crowd and becomes the next chip giant.

Comments?

Photo Credit(s): TechCrunch Google is making a fast, specialized TPU chip for edge devices … article

Introduction to Digital Design Verification at Mythic, Medium.com Article

Images from Google Cloud Platform Blog on the TPU

Two startups use processing in flash for AI at the edge, IEEE Spectrum article courtesy of Mythic

Infinidat’s ~90% (average read hit) solution – #SFD16

Posted on July 15, 2018August 7, 2018 by Ray in Block Storage, Disk storage, Market dynamics, SSD storage, Storage performance, System effectiveness, Visionary leadershp

We attended a SFD16 session at Infinidat‘s offices in Waltham, MA a couple of weeks ago.

The session was led off by Doc D’Errico (@docderrico) but CTO, Brian Carmody (@initzero) did most of the talking and Erik Kaulberg (@ekaulberg) discussed some of their new products. I have known Doc, Erik and Brian for years and all of them are industry heavyweights, having worked at major and startup storage companies for decades.

Infinidat’s challenge

The challenge that Infinidat has is how to perform as well as (or better than) an all flash array when you have hybrid flash – disk storage.

The advantage of spinning disk is that it’s relatively cheap ($/GB) storage with good throughput, great reliability and reasonable volumetric density (GB/mm**3). However, its random read access time is much, two orders of magnitude worse than flash (~10 msec vs. ~100 µsec).

Fortunately, DRAM has a random access time of ~100 nsec which is 3 orders of magnitude better than flash. A manager I had once said everyone wants their data stored on tape but accessed out of memory.

So the problem comes down to insuring that application data is sitting in DRAM (cache) when requested. This is called a read (cache) hit. [Write hits are easier because they are typically written directly into memory and later destaged. So essentially all writes are cache hits.]

Mainframes vs. open systems

In the old days with MVS on mainframes, read hit rates of ~70%+ were considered good and doable. Over time MVS and z/OS, its followon, started providing hints about what IO’s were coming next. With mainframe hints, storage systems started hitting ~90% read hit rates.

Open systems never seemed to come close to those hit rates. A typical open system hit rate was ~50% for well behaved applications. For VMware and it’s IO mixer, a 50% read hit rate was aspirational and hardly ever achieved in reality. This has improved over time, but nothing comes close to 90% read hit rates over extended periods.

So, for a storage systems in open system environments to average an 90% DRAM read hit cache hit rate is unheard of, and only seen for brief intervals at best, with especially well behaved applications and not under virtualization. For a customer to see an average DRAM hit rate exceeding 90%, over the course of multiple days was inconceivable,.

And yet that’s exactly what Brian’s showing in the photo above, an average over 1 week of over 95% DRAM read hit rate over a whole week, for an major retail/e-commerce/ERP customer during one of the heaviest, if not the heaviest activity weeks of the year, The data set size was 1.5PB.

How is this possible, what does Infinidat do differently to predict which data applications need moment to moment, over the course of the heaviest retail week in the year.

Infinidat’s Neural caching algorithm

It all starts with writes. When data is written to Infinidat, it records a terrain map of all the other data that has been written recently. I suppose one could think of this as a 2 dimensional map, with spots on the map being the equivalent of data in the storage system that have recently been written. This map changes over time so it’s more like a movie stream of frames showing, from frame to frame all the recently written data in the system at any point in time. Of course the frame rate for this stream is the IO rate.

When a IO request comes in for a specific record, Infinidat uses an index to locate the last time (frame) the record was written, and using this snapshot of all data written at the same time, it reads into cache all the other data written at that time.

This additional data could be read from disk or SSD. Read throughput is slightly better for flash over disk but not orders of magnitude.

In any case, any system could do this sort of caching algorithm, iff they had the processing power needed, had the metadata layout which made recording the IO stream frame by frame space efficient, had the metadata indexing which would enable them to locate the last frame a record was written in AND had the IO parallelism required to do a whole lot of IO all the time to keep that DRAM cache filled with hit candidates. Did I mention that Infinidat uses a three controller storage system, unlike the rest of the industry that uses a two controller system. This gives them 50% more horse power and data paths to get data into cache.

Brian goes into some depth on Neural caching implementation but there’s plenty of secret sauce behind it.

~~~~

Somewhere in his presentation, Brian stated that across their entire customer base (~3.4EB @ end of last quarter), they average a 90% read (DRAM) cache hit rate, inconceivable in the old days, nigh impossible today. Of course it only gets better from here.

If a hybrid system can continuously average a 95% read cache hit rate for a customer over weeks of IO, there’s no reason that system couldn’t outperform an AFA, even an NVMeoF AFA storage system.

I suppose cache hit rates like these could be application dependent but they didn’t seem to say anything about specific verticals they were targeting. And at 3+ EB it doesn’t appear to be application specific.

Comments?

For more information you may want to see these other SFD16 participants write ups on Infinidat:

Infinidat, and #SFD16, Data is the new currency by Matt Leib (@MBLeib)

Skyrmion and chiral bobber solitons for racetrack storage

Posted on July 7, 2018 by Ray in R&D measures, SSD storage, Storage architecture, Strategic Inflection Points, System effectiveness, Visionary leadershp

Read an article this week in Science Daily (Magnetic skyrmions: Not the only one of their class; …) about new magnetic structures that could lend themselves to creating a new type of moving, non-volatile storage. (There’s more information in the press release and the Nature paper [DOI: 10.1038/s41565-018-0093-3], behind a paywall).

Skyrmions and chiral bobbers are both considered magnetic solitons, types of magnetic structures only 10’s of nm wide, that can move around, in sort of a race track configuration.

Delay line memories

Early in computing history, there was a type of memory called a delay line memory which used various mechanisms (mercury, magneto-resistence, capacitors, etc.) arranged along a circular line such as a wire, and had moving pulses of memory that raced around it. .

One problem with delay line memory was that it was accessed sequentially rather than core which could be accessed randomly. When using delay lines to change a bit, one had to wait until the bit came under the read/write head . It usually took microseconds for a bit to rotate around the memory line and delay line memories had a capacity of a few thousand bits 256-512 bytes per line, in today’s vernacular.

Delay lines predate computers and had been used for decades to delay any electronic or acoustic signal before retransmission.

A new racetrack

Solitons are being investigated to be used in a new form of delay line memory, called racetrack memory. Skyrmions had been discovered a while ago but the existence of chiral bobbers was only theoretical until researchers discovered them in their lab.

Previously, the thought was that one would encode digital data with only skyrmions and spaces. But the discovery of chiral bobbers and the fact that they can co-exist with skyrmions, means that chiral bobbers and skyrmions can be used together in a racetrack fashion to record digital data. And the fact that both can move or migrate through a material makes them ideal for racetrack storage.

Unclear whether chiral bobbers and skyrmions only have two states or more but the more the merrier for storage. I am assuming that bit density or reliability is increased by having chiral bobbers in the chain rather than spaces.

Unlike disk devices with both rotating media and moving read-write heads, the motion of skyrmion-chiral bobber racetrack storage is controlled by a very weak pulse of current and requires no moving/mechanical parts prone to wear/tear. Moreover, as a solid state devices, racetrack memory is not sensitive to induced/organic vibration or shock, So, theoretically these devices should have higher reliability than disk devices.

There was no information comparing the new racetrack memory reliability to NAND or 3D Crosspoint/PCM SSDs, but there may be some advantage here as well. I suppose one would need to understand how to miniaturize the read-erase-write head to the right form factor for nm racetracks to understand how it compares.

And I didn’t see anything describing how long it takes to rotate through bits on a skyrmion-chiral bobber racetrack. Of course, this would depend on the number of bits on a racetrack, but some indication of how long it takes one bit to move, one postition on the racetrack would be helpful to see what its rotational latency might be.

~~~~

At the moment, reading and writing skyrmions and the newly discovered chiral bobbers takes a lot of advanced equipment and is only done in major labs. As such, I don’t see a skyrmion-chiral bobber racetrack storage device arriving on my desktop anytime soon. But the fact that there’s a long way to go before, we run out of magnetic storage options, even if it is on a chip rather than magnetic media, is comforting to know. Even if we don’t ever come up with an economical way to produce it.

I wonder if you could synchronize rotational timing across a number of racetrack devices, at least that way you could be reading/erasing/writing a whole byte, word, double word etc, at a time, rather than a single bit.

Comments?

Photo Credit(s): From Experimental observation of chiral magnetic bobbers in B20 Type FeGe paper

From Experimental observation of chiral magnetic bobbers in B20 Type FeGe paper

From Timeline of computer history Magnetoresistive delay lines

From Experimental observation of chiral magnetic bobbers in B20 Type FeGe paper

MIT’s new Navion chip for better Nano drone navigation

Posted on June 21, 2018 by Ray in Artificial Intelligence, Data compression, Drones, Strategic Inflection Points, Visionary leadershp

Read an article this week in Science Daily (Chip upgrade help’s bee-sized drones navigate) about a recent chip created by MIT, called Navion, that reduces size and power consumption for electronics used in drone navigation. The chip is also documented on MIT’s Navion project homepage and in a technical paper describing the new VIO (Visual-Inertial Odometry ) Navion chip.

The Navion chip can perform inertial measurement at 52Khz as well as process video streams of 752×480 stereo images at 171 frames per second in a 20 sqmm package consuming only 24mW of power. The chip was fabricated on a 65nm CMOS process line.

Navion is the result of a collaborative design process which optimized electronics required to perform drone navigation processing. By placing all the memory required for inertial measurement and image analysis and all the processing hardware on the same chip, they have substantially reduced power consumption and space requirements for drone navigation.

Navion architecture

Navion uses a state of the art, non-linear factor graph optimization algorithm to navigate in space. It doesn’t sound like DL neural net image recognition but more like a statistical/probabilistic approach to image mapping and place estimation. The chip uses image compression, two stage memory, and sparse linear solver memory to reduce image processing memory requirements from 3.5MB to less than 1MB.

The chip uses 3 inputs: two images (right & left image) and IMU (inertial management unit sensor) and has one (complex output), its estimate of the current state of where it is on the map.

Navion processing creates and maintains a 3D map using stereo images and provides navigational support to move through that space. According to the paper, the Navion chip updates the state(s) and sparse 3D map at a KF (Kalman filter) rate of between 16 and 90 fps. Navion also offers configurations options to maximize accuracy, throughput or energy efficiency.

Navion compares well to other navigation electronics

The table shows comparisons of the Navion chip against other traditional navigational systems that use Xeon, ARM or FPGA chips. As far as I can tell it’s either much better or at least on a par with these other larger, more complex, power hungry systems.

Nano drones are coming to our space, sooner than anyone expects.

Comments?

Stanford Data Lab students hit the ground running…

Posted on June 6, 2018 by Ray in Data, Data analytics, Data discovery, Data science, Information economy, Visionary leadershp

Read an article (Students confront the messiness of data) today about Stanford’s Data Lab and how their students are trained to cleanup and analyze real world data.

The Data Lab teaches two courses the Data Challenge Lab course and the Data Impact Lab course. The Challenge Lab is an introductory course in data gathering, cleanup and analysis. The Impact Lab is where advanced students tackle real world, high impact problems through data analysis.

Data Challenge Lab

Their Data Challenge Lab course is a 10 week course with no pre-requisites that teaches students how to analyze real world data to solve problems.

Their are no lectures. You’re given project datasets and the tools to manipulate, visualize and analyze the data. Your goal is to master the tools, cleanup the data and gather insights from the data. Professors are there to provide one on one help so you can step through the data provided and understand how to use the tools.

In the information provided on their website there were no references and no information about the specific tools used in the Data Challenge Lab to manipulate, visualize and analyze the data. From an outsiders’ viewpoint it would be great to have a list of references or even websites describing the tools being used and maybe the datasets that are accessed.

Data Impact Lab

The Data Impact lab course is an independent study course, whose only pre-req is the Data Challenge Lab.

Here students are joined into interdisplinary teams with practitioner partners to tackle ongoing, real world problems with their new data analysis capabilities.

There is no set time frame for the course and it is a non-credit activity. But here students help to solve real world problems.

Current projects in the Impact lab include:

The California Poverty Project to create an interactive map of poverty in California to supply geographic guidance to aid agencies helping the poor
The Zambia Malaria Project to create an interactive map of malarial infestation to help NGOs and other agencies target remediation activity.

Previous Impact Lab projects include: the Poverty Alleviation Project to provide a multi-dimensional index of poverty status for areas in Kenya so that NGOs can use these maps to target randomized experiments in poverty eradication and the Data Journalism Project to bring data analysis tools to breaking stories and other journalistic endeavors.

~~~~

Courses like these should be much more widely available. It’s almost the analog to the scientific method, only for the 21st century.

Science has gotten to a point, these days, where data analysis is a core discipline that everyone should know how to do. Maybe it doesn’t have to involve Hadoop but rudimentary data analysis, manipulation, and visualization needs to be in everyone’s tool box.

Data 101 anyone?

Photo Credit(s): Big_Data_Prob | KamiPhuc;

Southbound traffic speeds on Masonic avenue on different dates | Eric Fisher;

Unlucky Haiti (1981-2010) | Jer Thorp;

Bristol Cycling Level by Wards 2011 | Sam Saunders

Information flows everywhere – part 1

Posted on May 28, 2018 by Ray in Internet of Things, Strategic Inflection Points, Visionary leadershp

Read an article today from Scientific American (Sewage is helping cities flush out the opioid crisis) about how using chemical analysis of wastewater can be used to assess the extent of the opioid crisis in their city.

Wastewater information highway

There’s a lab at ASU (Arizona State University) that chemically analyzes samples of wastewater to determine the amount of drugs that a city’s population excretes. They can provide a near real-time assessment of the proportion of drugs in city sewage and thereby, in a city’s population.

The problem with public drug use surveys and hospital data gathering is that they take time. Moreover, surveys and hospital data gathering typically come long after drugs problem have become a serious problem in a city’s population.

Wastewater sample drug analysis can be done in a matter of days and can be redone as often as needed. Such data could be used to track intervention activities and see if they have a real impact (positive or negative) on drug use in a population.

Neighborhood health

In addition, by sampling sewage at a neighborhood level, one can gain an assessment of drug problems at any sub-division of a city that’s needed.

The above article talks about an MIT program with Cary, NC (from Biobot.io) that is designing robots to traverse sewer pipes and analyze wastewater chemical makeup in real time, reporting this back to ground stations around the city.

With such an approach, one could almost zero in (depending on sewer pipe networks) on any neighborhood in a city, target specific interventions at that level and measure impact in (digestion delayed) real time. Doing so, cities or states for that matter, could experiment with different interventions on a neighborhood by neighborhood basis and gain statistical evidence on drug problem intervention effectiveness.

But, you can analyze wastewater for any number of variables, such as viruses, bacteria, enzymes, etc. Any of which can lead to a better understanding of a population’s health.

~~~~

Two things I want to leave you with:

First, public health has had a major impact on human health and has doubled our lifespan in 200 years. All modern cities have water treatment plants today to insure water quality and thereby, have reduced the incidence of cholera and other waterborne epidemics in their cities. Wastewater analysis has the potential for significant improvements in population health monitoring. Just like water treatment, wastewater analysis will someday become common public health practice in modern cities throughout the world.

Second, I was at a conference this week which presented a slide that there was no cold data anymore (Pure//Accelerate 2018). This was in reference to re-analyzing old, cold data can often lead to insights and process improvements that were not obvious at first glance.

But it’s not just data anymore. Any activity done by man needs to be analyzed for (inherent & invisible) information flows that could be extracted to make the world a better place.

Photo Credit(s):

DSCN7989: Albuquerque Film Office
Prescription prices Ver3 by Chris Potter
Barcelona by SirisVisual