Google releases new Cloud TPU & Machine Learning supercomputer in the cloud

Last year about this time Google released their 1st generation TPU chip to the world (see my TPU and HW vs. SW … post for more info).

This year they are releasing a new version of their hardware called the Cloud TPU chip and making it available in a cluster on their Google Cloud.  Cloud TPU is in Alpha testing now. As I understand it, access to the Cloud TPU will eventually be free to researchers who promise to freely publish their research and at a price for everyone else.

What’s different between TPU v1 and Cloud TPU v2

The differences between version 1 and 2 mostly seem to be tied to training Machine Learning Models.

TPU v1 didn’t have any real ability to train machine learning (ML) models. It was a relatively dumb (8 bit ALU) chip but if you had say a ML model already created to do something like understand speech, you could load that model into the TPU v1 board and have it be executed very fast. The TPU v1 chip board was also placed on a separate PCIe board (I think), connected to normal x86 CPUs  as sort of a CPU accelerator. The advantage of TPU v1 over GPUs or normal X86 CPUs was mostly in power consumption and speed of ML model execution.

Cloud TPU v2 looks to be a standalone multi-processor device, that’s connected to others via what looks like Ethernet connections. One thing that Google seems to be highlighting is the Cloud TPU’s floating point performance. A Cloud TPU device (board) is capable of 180 TeraFlops (trillion or 10^12 floating point operations per second). A 64 Cloud TPU device pod can theoretically execute 11.5 PetaFlops (10^15 FLops).

TPU v1 had no floating point capabilities whatsoever. So Cloud TPU is intended to speed up the training part of ML models which requires extensive floating point calculations. Presumably, they have also improved the ML model execution processing in Cloud TPU vs. TPU V1 as well. More information on their Cloud TPU chips is available here.

So how do you code a TPU?

Both TPU v1 and Cloud TPU are programmed by Google’s open source TensorFlow. TensorFlow is a set of software libraries to facilitate numerical computation via data flow graph programming.

Apparently with data flow programming you have many nodes and many more connections between them. When a connection is fired between nodes it transfers a multi-dimensional matrix (tensor) to the node. I guess the node takes this multidimensional array does some (floating point) calculations on this data and then determines which of its outgoing connections to fire and how to alter the tensor to send to across those connections.

Apparently, TensorFlow works with X86 servers, GPU chips, TPU v1 or Cloud TPU. Google TensorFlow 1.2.0 is now available. Google says that TensorFlow is in use in over 6000 open source projects. TensorFlow uses Python and 1.2.0 runs on Linux, Mac, & Windows. More information on TensorFlow can be found here.

So where can I get some Cloud TPUs

Google is releasing their new Cloud TPU in the TensorFlow Research Cloud (TFRC). The TFRC has 1000 Cloud TPU devices connected together which can be used by any organization to train machine learning algorithms and execute machine learning algorithms.

I signed up (here) to be an alpha tester. During the signup process the site asked me: what hardware (GPUs, CPUs) and platforms I was currently using to training my ML models; how long does my ML model take to train; how large a training (data) set do I use (ranging from 10GB to >1PB) as well as other ML model oriented questions. I guess there trying to understand what the market requirements are outside of Google’s own use.

Google’s been using more ML and other AI technologies in many of their products and this will no doubt accelerate with the introduction of the Cloud TPU. Making it available to others is an interesting play but this would be one way to amortize the cost of creating the chip. Another way would be to sell the Cloud TPU directly to businesses, government agencies, non government agencies, etc.

I have no real idea what I am going to do with alpha access to the TFRC but I was thinking maybe I could feed it all my blog posts and train a ML model to start writing blog post for me. If anyone has any other ideas, please let me know.


Photo credit(s): From Google’s website on the new Cloud TPU


TPU and hardware vs. software innovation (round 3)

tpu-2At Google IO conference this week, they revealed (see Google supercharges machine learning tasks …) that they had been designing and operating their own processor chips in order to optimize machine learning.

They called the new chip, a Tensor Processing Unit (TPU). According to Google, the TPU provides an order of magnitude more power efficient machine learning over what’s achievable via off the shelf GPU/CPUs. TensorFlow is Google’s open sourced machine learning  software.

This is very interesting, as Google and the rest of the hype-scale hive seem to have latched onto open sourced software and commodity hardware for all their innovation. This has led the industry to believe that hardware customization/innovation is dead and the only thing anyone needs is software developers. I believe this is incorrect and that hardware innovation combined with software innovation is a better way, (see Commodity hardware always loses and Better storage through hardware posts).
Continue reading “TPU and hardware vs. software innovation (round 3)”

Surprises from 4 years of SSD experience at Google

Flash field experience at Google 

Overview SSDsIn a FAST’16 article I recently read (Flash reliability in production: the expected and unexpected, see p. 67), researchers at Google reported on field experience with flash drives in their data centers, totaling many millions of drive days covering MLC, eMLC and SLC drives with a minimum of 4 years of production use (3 years for eMLC). In some cases, they had 2 generations of the same drive in their field population. SSD reliability in the field is not what I would have expected and was a surprise to Google as well.

The SSDs seem to be used in a number of different application areas but mainly as SSDs with a custom designed PCIe interface (FusionIO drives maybe?). Aside from the technology changes, there were some lithographic changes as well from 50 to 34nm for SLC and 50 to 43nm for MLC drives and from 32 to 25nm for eMLC NAND technology.
Continue reading “Surprises from 4 years of SSD experience at Google”

The rise of mobile and the death of rest

Read a couple of articles this week about the rise of mobile computing.  About a decade ago I was at a conference where one of the keynotes was on the inevitability of ubiquitous computing or everywhere computing.  I believe now that smart phones have arrived, we have realized that dream.

How big companies die

One article I read was from Forbes on Here’s why Google and Facebook might disappear in the next 5 years.  The central tenet of their discussion was that the rise of mobile is a new paradigm shift just like Web 1.0 and Web 2.0 emerged over time and reinvented most of the industries that went before them.

Most companies around before the internet were unable to see and understand what would constitute a viable business model in the new Web 1.0 environment. Similarly, the major players in Web 1.0 never really saw the transition that occurred to a more interactive, information sharing that became Web 2.0.

The problem is that all these companies grew up in the reigning paradigm of the day and became successful by seeing the transition as a new way of doing business. They just couldn’t conceive that another way of doing business was coming along that was strategically different and thus, highly damaging to their now outdated, business models.

Full speed ahead

But it even get’s worse. Another article I read from Tecnology Review was titled Questions for Mobile Computing.

One interesting tidbit is that time it’s taking to reach a certain adoption level in the market is shrinking. The chart (from Apple) showed that both the iPhone and iPad has drastically shrunk the time it took to attain high market adoption.

Mobile business models

The main question in the article was how web 2.0 advertising revenue business models were going to translate into a mobile environment where they no longer controlled advertising.  Many Web2.0 companies seem to be ignoring mobile at the moment but it won’t take long for companies focused on this new computing tsunami to roll over them.

Apple and Google have taken two distinctly divergent approaches to this market but at least they are (massively) engaged.  That’s more than can be said for some of the web 2.0 properties out there ignoring mobile to their long term detriment.

The fact is that mobile is a new computing platform.  It’s possibilities are truly extraordinary from mHealth (see my post on mHealth taking off in Kenya) and  mCurrency today to Google glasses of tomorrow.

I strongly believe that those companies that see this shift now and go after it with new business models to profit from mobile computing will succeed faster and mightier than we have ever seen.   The rest will be left in the dust.

The funny bit is that it’s not the developed world that’s taking the new model to new directions but the developing world.  They seem better able to see mobile computing for what it is, an relatively easy way to leapfrog from the 19th century to the 21st in one jump.

So what are profitable business models that leverage mobile computing?

OpenFlow, the next wave in networking

OpenFlow Logo (from
OpenFlow Logo (from

Read two articles recently about how OpenFlow‘s Software Defined Networking is going to take over the networking world, just like VMware and it’s brethern have taken over the server world.

Essentially, OpenFlow is a network protocol that separates the control management of a networking switch or router (control plane) from it’s data path activities (data plane).  For most current switches, control management consists of vendor supplied,  special purpose software which differs for each and every vendor and sometimes even varies  across vendor product lines.

In contrast, data path activities are fairly similar for most of today’s switches and is generally implemented in custom hardware so as to be lightening fast.

However, the main problem with today’s routers and switches is that there is no standard way to talk or even modify the control management software to modify it’s data plane activities.

OpenFlow to the rescue

OpenFlow changes all that. First it specifies a protocol or interface between a switches control plane and it’s data plane.  This allows that control plane to run on any server and still provide management for a router or switch data path activities.  By doing this OpenFlow provides Software Defined Networking (SDN).

Once OpenFlow switches and control software are in place, the SDN can better control and manage networking activity to optimize for performance, utilization or any other number of parameters.

Products are starting to come out which support OpenFlow protocols.  For example, a new OpenFlow compatible ethernet switch is available from IBM (their RackSwitch G8264 & G8264T) and HP has recently released OpenFlow software for their ethernet switches (see OpenFlow blog post).  At least some in the industry are starting to see the light.

Google implements OpenFlow

The surprising thing is that one article I read recently is about Google running an OpenFlow network on it’s data center backbone (see Wired’s Google goes with the Flow article).   In the article it discusses how a top Google scientist talked about how they implemented OpenFlow for their internal network architecture at the Open Networking Summit yesterday.

Google’s internal network connects it’s multiple data centers together to provide Google Apps and other web services.  Apparently, Google has been secretly creating/buying OpenFlow networking equipment and creating it’s own OpenFlow software. This new SDN they have constructed has given them the ability to change their internal network backbone in minutes which would have taken days, weeks or even months before. Also, OpenFlow has given Google the ability to simulate network changes ahead of time allowing them to see what potential changes will do for them.

One key metric is that Google now runs their backbone network close to 100% utilized at all times whereas before they worked hard to get it to 30-40% utilization.

Nicira revolutionizes networking

The other article I read was about a startup called Nicira out of Palo Alto, CA which is taking OpenFlow to the next level by defining a Network Virtual Platform (NVP) and Open vSwitches (OVS).

  • A NVP  is a network virtualization platform controller which consists of cluster of x86 servers running the network virtualization control software providing a RESTful web services API and defines/manages virtual networks.
  • An OVS is an Open vSwitch software designed for remote control that either runs as a complete software only service in various hypervisors or as gateway software connecting VLANs running on proprietary vendor hardware to the SDN.

OVS gateway services can be used with current generation switches/routers or be used with high performing, simple L3 switches specifically designed for OpenFlow management.

Nonetheless, with NVP and OVS deployed over your networking hardware it removes many of the limitations inherent in current networking services.  For example, Nicira network virtualization, allows the movement of application workloads across subnets while maintaining L2 adjacency, scalable multi-tenant isolation and the ability to repurpose physical infrastrucuture on demand.

By virtualizing the network, the network switching/router hardware becomes a pool of IP-switching services, available to be repurposed and/or reprogrammed at a moments notice.  Not unlike what VMware did with servers through virtualization.

Customers for Nicira include eBay, RackSpace and AT&T to name just a few.  It seems that networking virtualization is especially valuable to big web services and cloud services companies.


Virtualization takes on another industry, this time networking and changes it forever.

We really need something like OpenFlow for storage.  Taking storage administration out of the vendor hands and placing it elsewhere.  Defining an open storage management protocol that all storage vendors would honor.

The main problem with storage virtualization today is it’s kind of like VLANs, all vendor specific.   Without, something like a standard protocol, that proscribes a storage management plane’s capabilities and a storage data plane’s capabilities we can not really have storage virtualization.

Google vs. National Information Exchange Model

Information Exchange Package Documents (IEPI) lifecycle from
Information Exchange Package Documents (IEPI) lifecycle from

Wouldn’t the National information exchange be better served by deferring the National Information Exchange Model (NIEM) and instead implementing some sort of Google-like search of federal, state, and municipal text data records.  Most federal, state and local data resides in sophisticated databases using their information management tools but such tools all seem to support ways to create a PDF, DOC, or other text output for their information records.   Once in text form, such data could easily be indexed by Google or other search engines, and thus, searched by any term in the text record.

Now this could never completely replace NIEM, e.g., it could never offer even “close-to” real-time information sharing.  But true real-time sharing would be impossible even with NIEM.  And whereas NIEM is still under discussion today (years after its initial draft) and will no doubt require even more time to fully  implement, text based search could be available today with minimal cost and effort.

What would be missing from a text based search scheme vs. NIEM:

  • “Near” realtime sharing of information
  • Security constraints on information being shared
  • Contextual information surrounding data records,
  • Semantic information explaining data fields

Text based information sharing in operation

How would something like a Google type text search work to share government information.  As discussed above government information management tools would need to convert data records into text.  This could be a PDF, text file, DOC file, PPT, and more formats could be supported in the future.

Once text versions of data records were available, it would need to be uploaded to a (federally hosted) special website where a search engine could scan and index it.  Indexing such a repository would be no more complex than doing the same for the web today.  Even so it will take time to scan and index the data.  Until this is done, searching the data will not be available.  However, Google and others can scan web pages in seconds and often scan websites daily so the delay may be as little as minutes to days after data upload.

Securing text based search data

Search security could be accomplished in any number of ways, e.g., with different levels of websites or directories established at each security level.   Assuming one used different websites then Google or another search engine could be directed to search any security level site at your level and below for information you requested. This may take some effort to implement but even today one can restrict a Google search to a set of websites.  It’s conceivable that some script could be developed to invoke a search request based on your security level to restrict search results.

Gaining participation

Once the upload websites/repositories are up and running, getting federal, state and local government to place data into those repositories may take some persuasion.  Federal funding can be used as one means to enforce compliance.  Bootstrapping data loading into the searchable repository can help insure initial usage and once that is established hopefully, ease of access and search effectiveness, can help insure it’s continued use.

Interim path to NIEM

One loses all contextual and most semantic information when converting a database record into text format but that can’t be helped.   What one gains by doing this is an almost immediate searchable repository of information.

For example, Google can be licensed to operate on internal sites for a fair but high fee and we’re sure Microsoft is willing to do the same for Bing/Fast.  Setting up a website to do the uploads can take an hour or so by using something like WordPress and file management plugins like FileBase but other alternatives exist.

Would this support the traffic for the entire nation’s information repository, probably not.  However, it would be an quick and easy proof of concept which could go a long way to getting information exchange started. Nonetheless, I wouldn’t underestimate the speed and efficiency of WordPress as it supports a number of highly active websites/blogs.  Over time such a WordPress website could be optimized, if necessary, to support even higher performance.

As this takes off, perhaps the need for NIEM becomes less time sensitive and will allow it to take a more reasoned approach.  Also as the web and search engines start to become more semantically aware perhaps the need for NIEM becomes less so.  Even so, there may ultimately need to be something like NIEM to facilitate increased security, real-time search, database context and semantics.

In the mean time, a more primitive textual search mechanism such as described above could be up and available for download within a day or so. True, it wouldn’t provide real time search, wouldn’t provide everything NIEM could do, but it could provide viable, actionable information exchange today.

I am probably over simplifying the complexity to provide true information sharing but such a capability could go a long way to help integrate governmental information sharing needed to support national security.