Artistic AI

Read a couple of articles in the past few weeks: one on OpenAI’s Jukebox and another, in Art in America, on computer generated art ((Artistically) Creative AI poses problems to art criticism). Both of these discuss how AI is starting to have an impact on music and the arts.

I can recall, as far back as when I was in college (a very long time ago), discussions about computer generated artwork. The creative AI article covers some of the history of computer art, which in those days used computers to generate random patterns, some of which would be considered art.

AI painting

More recent attempts at AI-created artwork use deep learning neural networks together with generative adversarial networks (GANs). These involve essentially two different neural networks.

  • The first is an art deep neural network (Art DNN) discriminator (a classification neural network) that is trained on an art genre such as classical, medieval or modern art paintings. This Art DNN is used to grade a new piece of art on how well it conforms to the genre it has been trained on. For example, an Art DNN could be trained on Monet’s body of work and then be able to grade any new art on how well it conforms to Monet’s style of art.
  • The second is an Art GAN generator, which is used to generate random artworks that can then be fed to the Art DNN to determine whether they’re any good. That grade is then used as reinforcement to modify the Art GAN so it generates a better match over time (a minimal sketch of this two-network setup follows this list).
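
To make the division of labor concrete, here’s a minimal sketch (in PyTorch) of the two-network setup described above. The image size, layer widths and training loop are illustrative assumptions of mine, not any published Art GAN architecture:

```python
import torch
import torch.nn as nn

class ArtDiscriminator(nn.Module):
    """The 'Art DNN': grades a (flattened) image on how well it fits the training genre."""
    def __init__(self, img_dim=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1), nn.Sigmoid(),       # 1.0 = looks like the genre, 0.0 = doesn't
        )

    def forward(self, x):
        return self.net(x)

class ArtGenerator(nn.Module):
    """The 'Art GAN' generator: turns random noise into a candidate artwork."""
    def __init__(self, noise_dim=100, img_dim=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, 512), nn.ReLU(),
            nn.Linear(512, img_dim), nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)

# One adversarial round: the discriminator learns to separate real paintings from
# generated ones, and the generator learns to fool the discriminator.
D, G = ArtDiscriminator(), ArtGenerator()
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.rand(16, 64 * 64 * 3)             # stand-in for a batch of real genre paintings
fake = G(torch.randn(16, 100))

# Discriminator step: grade real images as 1, generated images as 0.
loss_d = bce(D(real), torch.ones(16, 1)) + bce(D(fake.detach()), torch.zeros(16, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: adjust the generator so its output is graded closer to 1.
loss_g = bce(D(G(torch.randn(16, 100))), torch.ones(16, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```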

The use of these two types of networks has proved to be very useful in current AI game playing as well as in many other DNNs that don’t start with a classified data set.

However, in this case, a human artist does perform useful additional work during the process. An artist selects the paintings to be used to train the Art DNN. And the artist is active in tweaking/tuning the Art GAN to generate the (random) artwork that approximates the targeted artist.

And it’s in these two roles that there is a place for a (human) artist in creative art generation activities.

AI music

Using AI to generate songs is a bit more complex and requires at least 3 different DNNs to generate the music and another couple for the lyrics:

  • First, a song tokenizer DNN, which is trained to compress an artist’s songs into, for lack of a better word, musical phrases or tokens. That way they can take the raw audio of an artist’s song and split it up into tokens, each of which takes one of 2048 (0-2047) values. They actually compress (encode) the artist’s songs at 3 different resolutions, which apparently lose some information at each level but retain musical attributes such as pitch, timbre and volume.
  • A second, musical token generative DNN, which is trained to generate musical tokens matching the distribution of a selected artist. This is used to generate a sequence of musical tokens that matches an artist’s musical work. They use a technique based on sparse transformers that can generate (long) sequences of tokens based on a training dataset.
  • A third, song de-tokenizer DNN, which is trained to take the generated musical tokens (at the three resolutions) and convert them back into musical compositions (a simplified sketch of the three-stage pipeline follows this list).
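
Here’s a highly simplified sketch of that three-stage pipeline. The 2048-value token vocabulary and the idea of multiple resolutions follow the post; the class names, shapes and (untrained) models are illustrative assumptions of mine, not OpenAI Jukebox’s actual code (the real code is on GitHub, as noted below):

```python
import torch
import torch.nn as nn

VOCAB = 2048   # each musical token takes one of 2048 values (0-2047)

class SongTokenizer(nn.Module):
    """Stage 1: compress raw audio into discrete musical tokens (the encoder side)."""
    def __init__(self, hop=128):
        super().__init__()
        self.hop = hop
        self.proj = nn.Linear(hop, VOCAB)

    def forward(self, audio):                          # audio: (batch, samples)
        frames = audio.unfold(1, self.hop, self.hop)   # chop audio into frames
        return self.proj(frames).argmax(-1)            # (batch, n_tokens) integer tokens

class TokenPrior(nn.Module):
    """Stage 2: a transformer trained to model token sequences in an artist's style."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 256)
        layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
        self.core = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(256, VOCAB)

    def forward(self, tokens):                         # tokens: (batch, n_tokens)
        return self.head(self.core(self.embed(tokens)))  # logits over the token vocabulary

class SongDetokenizer(nn.Module):
    """Stage 3: decode generated tokens back into an audio waveform (the decoder side)."""
    def __init__(self, hop=128):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, hop)

    def forward(self, tokens):
        return self.embed(tokens).reshape(tokens.shape[0], -1)  # (batch, samples)

# Wiring the three stages together (untrained here, so the output is just noise).
# Jukebox does this at three resolutions; only one level is shown.
audio = torch.randn(1, 128 * 100)            # stand-in for raw audio
tokens = SongTokenizer()(audio)              # encode audio -> tokens
logits = TokenPrior()(tokens)                # model the token distribution
new_tokens = logits.argmax(-1)               # crude stand-in for sampling new tokens
new_audio = SongDetokenizer()(new_tokens)    # decode tokens -> audio
```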

These three pretty much constitute the bulk of the work for AI to generate song music. They augment the data with information from LyricWiki, which has the lyrics for 600K recorded songs in English. LyricWiki also has song metadata which includes the artist, the genre, keywords associated with the song, etc. When training the music generator they add the artist’s name and genre information so that the musical token generator DNN can construct a song specific to an artist and a genre.

The lyrics take another couple of steps. They have the lyrics for every recorded song of an artist from LyricWiki. They use a number of techniques to generate the lyrics for each song and to time the lyrics to the music, including a lexical text generator trained on the artist’s lyrics. Suggest you check out the explanation on OpenAI Jukebox’s website to learn more.

As part of the music generation process, the models learn how to classify songs to a genre. They have taken the body of work for a number of artists and placed them in genre categories.

The OpenAI Jukebox website has a number of examples on its home page as well as a complete catalog behind the home page. The catalog has over 7,000 songs across a number of genres, from Acoustic to Rock and everything in between, in the fashion of a number of artists in each genre, both with and without lyrics. For the (100%) blues category they have over 75 songs in the style of artists from B.B. King to Taj Mahal, including songs similar to Fats Domino, Muddy Waters, Johnny Winter and more.

OpenAI Jukebox calls the songs “re-renditions” of the artist, and calls the process of adding lyrics to the songs lyric conditioning.

Source code for the song generator DNNs is available on GitHub. You can use the code to train on your own music and have it generate songs in your own musical style.

The songs sound ok but not great. The tokenizer/de-tokenizer process introduces noise into the generated music. I suppose finer time resolution in tokenizing might reduce this somewhat, but maybe not.

~~~~

The AI song generator is ok, but they need more work on the lyrics and on reducing the noise. The fact that they have generated so many re-renditions suggests to me that the process at this point is completely automated.

I’m also impressed with the AI painter. Yes, there’s human interaction involved (atm), but it does generate some interesting pictures that follow the style of a targeted artist. I really wanted to see a Picasso generated painting or even a Jackson Pollock generated painting. Now that would be interesting.

So now we have AI song generators and AI painting generators but there’s a lot more to artworks than paintings and songs, such as sculpture, photography, videography, etc. It seems that many of the above approaches to painting and music could be applied to some of these as well.

And then there are plays, fiction and non-fiction works. The songs are ~3 minutes in length and the lyrics are not very long, so anything longer may represent a serious hurdle for any AI generator. For now, these are still safe.

Photo credits:

A tale of two countries and how they controlled the Coronavirus

Read an article in IEEE Spectrum last week about Taiwan’s response to COVID-19 (see: Big data helps Taiwan fight Coronavirus) which was reporting on an article in JAMA (see Response to COVID-19 in Taiwan) about Taiwan’s success in controlling the COVID-19 outbreak in their country.

I originally intended this post to be solely about Taiwan’s response to the virus but then thought it would be more instructive to compare and contrast the Taiwan and South Korea responses to the virus, as both seem to have it under control now (18 Mar 2020).

But first a little about the two countries (source wikipedia: South Korea and Taiwan articles):

Taiwan (TWN) and South Korea (ROK) both enjoy close proximity to China, with considerable trade and travel between their countries and the mainland:

  • South Korea (ROK) has a population of ~50.8M, an area of 38.6K SqMi (100.0K SqKm) and extends about 680 Mi (1100 Km) away from the Asian mainland (China).
  • Taiwan (TWN ) has a population of ~23.4M, an area of 13.8K SqMi (35.8K Sq Km) and is about 110 Mi (180 Km) away from the Asian mainland (China).

COVID-19 disease progression & response in TWN and ROK

There’s lots of information about TWN’s response (see articles mentioned above) to the virus but less so on ROK’s response.

Nonetheless, here are some highlights of the progression of the pandemic and how each country reacted (source for disease/case progression: Wikipedia Coronavirus timeline Nov’19 to Jan’20 and Coronavirus timeline Feb’20; source for TWN response: JAMA article supplement; source for ROK response: Timeline: What the world can learn from South Korea’s COVID-19 response).

  • Dec. 31, 2019: China Wuhan municipal health announced “urgent notice on the treatment of pneumonia of unknown cause”. Taiwan immediately tightened inbound screening processes. ==> TWN: officials board and inspect passengers for fever or pneumonia symptoms on direct flights from Wuhan
  • Jan. 8, 2020: ROK identifies 1st possible case of the disease in a woman who recently returned from Wuhan, China
  • Jan 20: ROK reports 1st laboratory confirmed case ==> TWN: Central Epidemic Command Center activated, activates Level 2 travel alert for Wuhan; ROK CDC starts daily press briefings on disease progress in the nation
  • Jan. 21: TWN identifies 1st laboratory confirmed case ==> TWN: activates Level 3 travel alert for Wuhan
  • Jan 22: ==> TWN: cancels entry permits for 459 tourists from Wuhan set to arrive later in Jan
  • Jan 23: ==> TWN: bans residents from Wuhan, travelers from China required to make online health declaration before entering
  • Jan. 24 ROK reports 2nd laboratory confirmed case ==> TWN bans export of facemasks; ROK, sometime around now, the gov’t started tracking confirmed cases using credit card and CCTV data to understand where patients contracted the disease
  • Jan. 25: ==> TWN: tours to china are suspended until Jan 31, activates level 3 travel alert for Hubei Province and Level 2 for rest of China, enacts export ban on surgical masks until Feb 23
  • Jan 26: ==> TWN: all tour groups from Wuhan have to leave,
  • Jan. 27: TWN reports 1st domestic transmission of the disease ==> TWN NHIA and NIA (National health and immigration authorities) integrate (adding all hospital) patients’ past 14-day travel history into the NHIA database, all tour groups from Hubei Province have to leave
  • Jan 28: ==> TWN: activates Level 3 travel alert for all of China except Hong Kong and Macau; ROK requests inspection of all people who have traveled from Wuhan in the past 14 days
  • Jan 29: ==> TWN: institutes electronic monitoring of all quarantined patients via gov’t issued cell phones; ROK about now requests production of massive numbers of WHO approved test kits for the Coronavirus
  • Jan. 30: ROK reports 2 more (4 total) confirmed cases of the disease ==> TWN: tours to or transiting China suspended until Feb 29;
  • Jan 31: ==> TWN: all remaining tour groups from China asked to leave
  • Feb 2 ==> TWN extended school break from Feb 15 to Feb 25, gov’t facilities made available for quarantine, soldiers mobilized to man facemask production lines, 60 additional machines installed so that daily facemask output could reach 10M facemasks a day.
  • Feb 3: ==> TWN: enacts name based rationing system for facemasks, develops mobile phone app to allow public to see pharmacy mask stocks, Wenzhou city Level 2 travel alert; ROK CDC releases enhanced quarantine guidelines to manage the disease outbreak, as of today ROK CDC starts making 2-3 press releases a day on the progress of the disease
  • Feb 5: ==> TWN: Zhejiang Province Level 2 travel alert, all cruise ships with suspected cases in the past 28 days banned, any cruise ship with previous dockings in China, Hong Kong, or Macau in the past 14 days banned
  • Feb 6: ==> TWN: tours to Hong Kong & Macau suspended until Feb 29, all Chinese nationals banned, all international cruise ships banned, all contacts of Diamond Princess cruise ship passengers who disembarked on Jan 31 traced
  • Feb 7: ==> TWN: all foreign nationals with travel to China, Hong Kong or Macau in the past 14 days are banned, all foreigners must see an immigration officer
  • Feb 14: ==> TWN: entry quarantine system launched; travelers fill out an electronic health declaration for faster entry
  • Feb 16: ==> TWN: NHIA database expanded to cover 30-day travel history for travelers from or transiting through China, Hong Kong, Macau, Singapore and Thailand.
  • Feb 18 ==> TWN: all hospitals, clinics and pharmacies have access to patients travel history; ROK most institutions postpone the re-start of school after spring break
  • Feb 19 ==> TWN establishes gov’t policies to disinfect schools and school areas, school buses, high speed rail, railways, tour busses and taxis
  • Feb 20 ==> ROK Daegu requests all individuals to stay home
  • Feb 21 ==> TWN establishes school suspension guidelines based on cases diagnosed in school; ROK Seoul closes all public gatherings and protests
  • Feb 24 ==> TWN: travelers with a history of travel to China, travelers from countries with Level 1 or 2 travel alerts, and all foreign nationals subject to 14-day quarantine (by this time many countries are in Level 1-2-3 travel alert status in TWN)
  • Feb 26 ==> ROK opens drive-thru testing clinics, patients are informed via text messages (3 days later) the results of their tests
  • Mar 3? ==> ROK starts selling facemasks at post offices
  • Mar 5 ==> ROK bans the export of face masks

As of Mar 16 (as reported in Wikipedia), TWN had 67 cases and 1 death, and ROK had 8,326 cases and 75 deaths. As of Mar 13 (as reported in the Our World in Data article), TWN had tested 16,089 people and ROK had tested 248,647 people.

Summary of TWN and ROK responses to the virus

For starters, both TWN and ROK learned valuable lessons from the last infections that came out of China (SARS and H1N1) and used those lessons to deal better with COVID-19. Also, neither country had any problem accessing credit information, mobile phone location data, CCTV cameras or any other electronic information to trace infected people in their respective countries.

If I had to characterize the responses to the virus from the two countries:

  1. TWN was seemingly focused early on reducing infections from outside, controlling & providing face masks to all, and identifying gov’t policies (ceasing public gathering, quarantine and disinfectant procedure) to reduce transmission of the disease. They augmented and promoted the use of public NHIA databases to track recent travel activity and used any information available to monitor the infected and track down anyone they may have contacted. Although TWN has increased testing over time, they did not seem to have much of an emphasis on broad testing. At this point, TWN seems to have the virus under control.
  2. ROK was all about public communications, policies (quarantine and openness), aggressively testing their population and quarantining those that were infected. ROK also tracked the goings on and contacts of anyone that was infected. ROK started early on broadly testing anyone that wanted to be tested. Using test results, infected individuals were asked to quarantine. A reporter I saw talking about ROK mentioned 3 T’s: Target, Test, & Trace. At this point, ROK seems to have the virus under control.

In addition, Asian countries in general are more prone to use face masks when traveling, which may somewhat restrict Coronavirus transmission. Most of the public in these countries (now) routinely wear face masks when out and about, and previously they routinely wore face masks when traveling, to reduce disease transmission.

Also, both countries took the news out of Wuhan, China about the extent of the infections, deaths and ease of disease transmission as truthful, and acted on it before any significant infections were detected in their respective countries.

What the rest of the world can learn from these two countries

What we need to take from TWN & ROK is that:

  1. Face masks and surgical masks are a critical resource during any pandemic. National production needs to be boosted immediately, with pricing and distribution controls, so that they are not hoarded nor subject to price gouging. In the USA we have had nothing on this front other than requests to the public to stop hoarding them (and the lack of availability to support healthcare workers).
  2. Test kits are also a critical resource during any pandemic. Selection of the test kit, validation and boosting production of test kits needs to be an early and high priority. The USA seems to have fallen down on this job.
  3. Travel restrictions, control and quarantines need to be instituted early on for infected countries. The USA did take action to restrict travel and did institute quarantines on cruise ship passengers and any repatriated nationals from China.
  4. Limited testing can help control the virus as long as it’s properly targeted. Mass, or rather less targeted, testing can also help control the virus. In the USA, given the lack of test kits, we are limited to targeted testing.
  5. Open, rapid and constant communications can be an important adjunct to help control virus spread. The USA seems to be still working on this. Many states seem to have set up special communications channels to discuss the latest information. But there doesn’t seem to be any ongoing, every day communications effort on behalf of the USA CDC to communicate pandemic status.
  6. When one country reports infections, deaths and ease of transmission of a disease, start taking serious precautions immediately. Disease transmission in our travel-intensive world is much too easy and rapid to stop once it takes hold in a nation. Any infectious agent with high death rates and seemingly easy transmission that shows up in any nation today must be taken seriously as the start of something much bigger.

Stay safe, be well.

~~~~

Comments?

Photo Credit(s):

Breaking IoT security

Earth globe within a locked cage

Read an article the other day (Researchers exploit low entropy of IoT devices to break RSA certificates) about researchers cracking IoT device security and breaking their public key encryption keys. The report focused on PKI and RSA certificates and IoT devices. The article mentioned the research paper describing the attack in more detail.

safe ‘n green by Robert S. Donovan (cc) (from flickr)

RSA certificates publish a public key and the digital signature of the certificate and identify the device that owns the certificate.

What the researchers were able to show was that ~250K keys in IoT device RSA certificates were insecure. They were able to compromise the 250K RSA certificates using a single Microsoft Azure VM and about $3K of computer time.

It turns out that if two RSA certificate public keys share the same factor, it’s much easier to determine the greatest common divisor (GCD) of the two public keys than it is to factor either one of them. And once you have the GCD of the two keys, it’s relatively trivial to determine the other factor in each public key. And that’s just what they did.
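
A toy-scale sketch of that shared-factor attack, assuming two devices whose weak randomness produced moduli with a common prime; real RSA moduli are 2048+ bits, but the GCD step works the same way:

```python
from math import gcd

p, q1, q2 = 10007, 10009, 10037   # p is the shared factor from weak randomness
n1 = p * q1                        # public modulus of device 1
n2 = p * q2                        # public modulus of device 2

shared = gcd(n1, n2)               # computing a GCD is cheap, even for huge numbers
assert shared == p

# Once one factor is known, the other falls out by simple division,
# and with both factors the private key can be reconstructed.
print(n1 // shared, n2 // shared)  # -> q1, q2
```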

Public key infrastructure (PKI) encryption depends on asymmetric cryptography, using a “public” key to encrypt messages (or to encrypt a one-time key to be used in later encryption of messages) and a “private” key to decrypt the message (or keys) and sign digital certificates. There are certificate authorities and a number of other elements used in PKI, but the asymmetric cryptography at its heart rests on the difficulty of factoring large numbers, and those large numbers need to be built from large, randomly chosen prime factors.

True randomness is hard

Just some of the recently donated seeds that are being added to the Reading Food Growing Network seed swap boxes, including some Polish gherkin seeds.

The problem starts with generating truly random numbers on a digital computer. Digital algorithms typically depend on a computer performing the same set of instructions, in the same way and sequence, so as to get the same answer every time we run the algorithm.

But if you want random numbers, this predictability of always coming up with the same answer results in non-random numbers (or rather, random numbers that are the same each time you run the algorithm). So to get around this, most random number generators can make use of a (random) seed which is used as an input to the algorithm to generate random numbers.

However, this seed needs to be a random number itself. But to create a random number, it needs to be generated not with instructions but with something outside the digital computer. One approach, noted below, is to use the timing of a human typing keys to generate a random number to be used as a seed.
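
A small sketch of why the seed matters: two devices that boot with the same predictable seed produce identical “random” values, while seeding from an OS entropy source does not. The seed values and sizes below are illustrative:

```python
import os
import random
import secrets

# Predictable seed (e.g., a fixed firmware constant or a low-resolution clock):
# every device gets the same "random" numbers.
device_a = random.Random(42)
device_b = random.Random(42)
print(device_a.getrandbits(64) == device_b.getrandbits(64))   # True

# Seeding from the operating system's entropy pool gives different streams.
device_c = random.Random(int.from_bytes(os.urandom(32), "big"))
device_d = random.Random(int.from_bytes(os.urandom(32), "big"))
print(device_c.getrandbits(64) == device_d.getrandbits(64))   # almost certainly False

# For real key generation, use a cryptographic source directly:
key_material = secrets.token_bytes(32)
```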

The researchers exploited the fact that most IoT devices don’t use a random (enough) seed for their PKI key generation. And they were able to use the GCD trick to figure out the factors behind the devices’ PKI keys.

But the lack of true randomness (or entropy) is the real problem. Somehow, these devices need to have a cheap and effective way to generate a random seed. Until this can be found, they will be subject to these sorts of attacks.

… but not impossible to obtain

I remember, in times past, when tasked to create a public key-private key pair, I had to type some random characters. The public key encryption algorithm used the inter-character time intervals of my typing to generate a random seed that was then used to generate the key pair used in the public key. I believe the key generation also depends on finding two large (random) prime numbers.


Perhaps a better approach would be to assign them keys from a centralized key distributor. That way the randomness could be controlled by the (key) distributor.

There are other approaches that depend on the sensors available to an IoT device. If the device has a camera or mic, taking raw data from the camera or sound sensor and doing a numerical transform on it may suffice. Strain gauges, liquid level, temperature, humidity or wind speed sensors: all of these devices have something which senses the world around them, and many of these are, at their base, analog sensors. Reading and converting some portion of these analog signals from raw analog to a digital random seed could be a very effective way to generate true(r) randomness.
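
As a rough sketch of that idea (my own illustration, not anything from the paper), the noisy low-order bits of many raw analog readings could be mixed through a hash to produce a seed. read_adc_sample() below is a hypothetical stand-in for whatever raw sensor interface a device actually has:

```python
import hashlib
import time

def read_adc_sample() -> int:
    """Placeholder for a raw analog-to-digital converter reading (0-4095)."""
    # On real hardware this would read the microphone, strain gauge, etc.
    return int(time.perf_counter_ns()) & 0x0FFF   # stand-in only

def harvest_seed(n_samples: int = 1024) -> bytes:
    """Mix many noisy low-order sensor bits through a hash to get a 256-bit seed."""
    h = hashlib.sha256()
    for _ in range(n_samples):
        sample = read_adc_sample()
        h.update((sample & 0x0F).to_bytes(1, "big"))   # keep only the noisy low bits
    return h.digest()

seed = harvest_seed()
print(seed.hex())
```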

~~~~

The paper has much more information about the attack and their results if you’re interested. They said that ~50% of the compromised devices were from a large network supplier. Such suppliers probably also have a vast majority of the devices deployed. Still, it’s troubling nonetheless.

Until changes are made to IoT devices, they will continue to be insecure. Not as much of a problem when they are read only sensors but when the information they sense is used by robots or other automation to make decisions about actions, then having insecure IoT becomes a safety issue.

This is not the first time such an attack was attempted and each time, it’s been very successful. That alone should be cause for alarm. But IoT and similar devices are hard to patch in the field and their continuing insecurity may be more of a result of the difficulty of updating a large install base than anything else.

Photo Credit(s):

Internet of Tires

Read an article a couple of weeks back (An internet of tires?… IEEE Spectrum) and can’t seem to get it out of my head. Pirelli, a European tire manufacturer, was demonstrating a smart tire or, as they call it, their new Cyber Tyre.

The Cyber Tyre includes accelerometer(s) in its rubber that can be used to sense pavement/road surface conditions. The Cyber Tyre can communicate surface conditions to the car and, using the car’s 5G, to other cars (of the same make) to tell them of problems with surface adhesion (hydroplaning, ice, other traction issues).

Presumably the accelerometers in the Cyber Tyre measure acceleration changes of individual tires as they rotate. Any rapid acceleration change could potentially be used to determine whether the car has lost traction and why.
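
As a purely illustrative sketch (not Pirelli’s algorithm), one could imagine flagging a slip event when the variance of per-revolution acceleration samples jumps well above a rolling baseline; the window size and threshold below are made up:

```python
from collections import deque
from statistics import pvariance

class TractionMonitor:
    def __init__(self, window: int = 64, slip_ratio: float = 4.0):
        self.samples = deque(maxlen=window)
        self.baseline = None          # long-run variance under normal rolling
        self.slip_ratio = slip_ratio  # how many times the baseline counts as a slip

    def update(self, accel_g: float) -> bool:
        """Feed one radial-acceleration sample (in g); return True if slip is suspected."""
        self.samples.append(accel_g)
        if len(self.samples) < self.samples.maxlen:
            return False
        var = pvariance(self.samples)
        if self.baseline is None:
            self.baseline = var
            return False
        slipping = var > self.slip_ratio * self.baseline
        if not slipping:
            # Slowly adapt the baseline while rolling normally.
            self.baseline = 0.99 * self.baseline + 0.01 * var
        return slipping

# Usage: monitor = TractionMonitor(); alert = monitor.update(sample) for each sample.
```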

They tested the new tires out at a (1/3rd mile) test track on top of a Fiat factory, using Audi A8 automobiles and 5G. Unclear why this had to wait for 5G but it’s possible that using 5G, the Cyber Tyre and the car could possibly log and transmit such information back to the manufacturer of the car or tire.

Accelerometers have become dirt cheap over the last decade as smart phones have taken off. So, it was only a matter of time before they found use in new and interesting applications and the Cyber Tyre is just the latest.

Internet of Vehicles

Presumably the car, with Cyber Tyres on it, communicates road hazard information to other cars using 5G and vehicle to vehicle (V2V) communication protocols or perhaps to municipal or state authorities. This way highway signage could display hazardous conditions ahead.

Audi has a website devoted to Car-to-X communications and has embedded cellular communications, cameras and other sensors in certain Audi vehicles (A4, A5 & Q7), used to identify (recognize) signage, hazards and other information and communicate this data to other Audi vehicles. This way, owning an Audi would plug you into this information flow.

Pirelli’s Cyber Car Concept

Prior to the Cyber Tyre, Pirelli introduced a Cyber Car concept that is supposedly rolling out this year. This version has tyres with real time pressure, temperature, (static) vertical load and a Tyre ID. Pirelli has been working with car manufacturers to roll out Cyber Car functionality.

The Tyre ID seems to be a file that can include anything the tyre or automobile manufacturer wants. It sort of reminds me of blockchain data blocks that could be used to validate tyre manufacturing provenance.

The vertical load sensor seems more important to car and tire manufacturers than to consumers. But for electric car owners, knowing car weight could help determine current battery load and thereby more precisely estimate how much charge is left in a battery.

Pirelli uses a proprietary algorithm to determine tread wear. This makes use of the other tyre sensors to predict wear and perhaps uses an AI DL algorithm to do this.

~~~

ABS has been around for decades now and tire pressure sensors for over 10 years or so. My latest car has enough sensors to pretty much drive itself on the highway but not quite park itself as of yet. So it was only a matter of time before something like smart tires would show up.

But given their integration with car electronics systems, it would seem that this would only make sense for new cars that included a full set of Cyber Tyres. That is, until all tire AND car manufacturers agree on a standard protocol to communicate such information. When that happens, consumers could choose any tire manufacturer and obtain similar, if not the same, functionality from them.

I suppose someone had to be first to identify just what could be done with the electronics available today. Pirelli just happens to be it for now in the tire industry.

I just don’t want to have to upgrade tires every 24 months. And, if I have to wait a long time for my car to boot up and establish communications with my tires, I may just take a (dumb) bike.

Photo Credit(s):

Two paths to better software

Read an article last week in the Atlantic, The coming software apocalypse, about some of the problems in how we develop software today.

Most software development today is editing text files. Some of these text files have 1,000s of lines and are connected to other text files with 1,000s of more lines which are connected to other text files with 1,000s of lines, etc. Pretty soon you have millions of lines of code all interacting with one another.

The problem

Been there, done that, and it’s not pretty. We even spent some time trying to reduce the code bloat by macro-izing some of it, and that just made it harder to understand, though it did reduce the lines of code.

The problem is much worse now, where we have software everywhere you look, from the escalator/elevator you take up and down between floors, to the cars you drive around town, to the trains and airplanes you travel between cities on.

All of these literally have millions of lines of code controlling them, and many more are added each year. How can they all possibly be correct?

Well, you can test the s&*t out of them. But you can’t cover every path in a lifetime (or ten) of testing a million-line program. And even if you could, changing a single line would generate another 100K or more paths to test. So testing was never a true answer.

Two solutions

The article talks about two approaches that have some merit to solve the real problem.

  • Model based development, a new development and coding environment. In this approach you’re not so much coding as playing with a model of the behavior you’re looking for. Say you were coding robot control logic: rather than editing 1000s of lines of Java text, you work with a model of your robot and its environment on half a screen and, on the other half, model parameters (dials, sliders, arrow keys, etc.) and logic (sequences) that you manipulate to do what the robot needs to do. Sort of like Scratch on steroids (see my post on 10 years of Scratch), with the sprite being whatever you need to code for, be it a jet engine, automobile, elevator, whatever. The playground would be a realtime/real-life simulation of the entity under control of the code and you would code by setting parameters and defining sequences. But the feedback would be immediate!
  • TLA+, a formal design verification approach. Formal methods have been around since the early 70s. They are used to rigorously specify the design of some code or a whole system. The idea is that if you can specify a provably correct design, then the code (derived from that design) has the potential to be more correct. Yes, there’s still the translation from design to code that’s error prone, but the likelihood is that these errors will be smaller in scope than having a design that’s wrong.

Model based  development

One can find model based development already in Apple’s new application development language Swift, in the ANSYS SCADE suite based on Esterel Technologies, and in the Light Table software development environment.

I have never used any of them but they all look interesting. Esterel was developed for safety-critical, real-time aerospace applications. Light Table was a Kickstarter project started by a leading engineer of Microsoft’s Visual Studio, the leading IDE. Apple Swift was developed to make it much easier to develop iOS apps.

TLA+

TLA+ takes a bit of getting used to. All formal methods depend on advanced mathematics and sophisticated logic and require an adequate understanding of these in order to be used properly. TLA+ was developed by Leslie Lamport and stands for Temporal Logic of Actions.

TLA+ specifications identify the set of all correct system actions. I would call it a formal pseudo code.

There’s apparently a video course, a hyperbook and a book on the language. It’s being used at AWS and on Microsoft Xbox and Azure. (See the Wikipedia TLA+ article for more information.)

There’s the PlusCal algorithm (specification) language, which is translated into a TLA+ specification that can then be checked by the automated TLC model checker. There’s also TLAPS, an automated TLA+ proof system, although it doesn’t support all of the TLA+ primitives. There’s a whole TLA+ toolbox that has these and other tools that can make TLA+ easier to use.
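
TLC itself is an explicit-state model checker: it enumerates every state reachable from the initial states via the next-state relation and checks an invariant on each one. Here’s a tiny Python illustration of that idea (not TLA+ or TLC, just the underlying mechanism), modeling two processes that each increment a shared counter once, with the invariant that the counter never exceeds 2:

```python
from collections import deque

def initial_states():
    # (pc0, pc1, counter): both processes at "start", counter at 0
    return [("start", "start", 0)]

def next_states(state):
    pc0, pc1, counter = state
    out = []
    if pc0 == "start":
        out.append(("done", pc1, counter + 1))   # process 0 increments once
    if pc1 == "start":
        out.append((pc0, "done", counter + 1))   # process 1 increments once
    return out

def invariant(state):
    return state[2] <= 2

def model_check():
    """Breadth-first exploration of every reachable state, checking the invariant."""
    seen, queue = set(), deque(initial_states())
    while queue:
        state = queue.popleft()
        if state in seen:
            continue
        seen.add(state)
        if not invariant(state):
            return f"Invariant violated in state {state}"
        queue.extend(next_states(state))
    return f"Invariant holds over {len(seen)} reachable states"

print(model_check())
```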

~~~~

We dabbled in formal specification methods on our million+ line storage system at a former employer. It worked well and cleaned up an integrity-critical area of the product. Alas, we didn’t expand its use to other areas of the product and it sort of fell out of favor. But it worked when and where we applied it.

Of course this was before automated formal methods of today, but even manual methods of specification precision can be helpful to think out what a design has to do to be correct.

I have no doubt that both TLA+ formal methods and model based development approaches and more are required to truly vanquish the coming software apocalypse.

At least until artificial intelligence starts developing all our code for us.

Comments?

Photo Credits: Six easy pieces of quantitatively analyzing open source, SAP Research;

Spaghetti code still existed, Toolbox.com;

How to write apps with Swift, MacWorld;

Modeling the dining philosophers problem in TLA+, Metadata blog

 

A college course on identifying BS

Read an article the other day from Recode (These University of Washington professors teaching a course on Calling BS) that seems very timely. The syllabus is online (Calling Bullshit — Syllabus) and it looks like a great start on identifying falsehood wherever it can be found.

In the beginning, what’s BS?

The course syllabus starts out referencing Brandolini’s Bullshit Asymmetry Principle (Law): the amount of energy needed to refute BS is an order of magnitude bigger than that needed to produce it.

Then it goes into a rather lengthy definition of BS from Harry Frankfurt’s 1986 On Bullshit article. In sum, it starts out reviewing a previous author’s discussion of humbug and ends up at the OED. Suffice it to say Frankfurt’s description of BS runs the gamut from deceptive misrepresentation to just short of lying.

The course syllabus goes on to reference two lengthy discussions/comments on Frankfurt’s seminal On Bullshit article, but both Cohen’s response, Deeper into BS, and Eubank & Schaeffer’s A kind word for BS: … are focused more on academic research than on everyday life and news.

How to mathematically test for BS

The course then goes into mathematical tests for BS that range from Fermi questions, to the GRIM test, to Benford’s 1936 Law of Anomalous Numbers. These tests are all ways of looking at data and numbers and estimating whether they are bogus or not. Benford’s paper talks about how the first pages of logarithm tables are always more worn than the others, because numbers that start with 1 are more frequent than numbers starting with any other digit.
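
As a quick illustration of one of these numerical tests, here’s a sketch of a Benford’s-law check: under Benford’s law the leading digit d of naturally occurring numbers appears with probability log10(1 + 1/d), so 1 leads roughly 30% of the time and 9 under 5%. The sample data below is made up, and deviation from Benford alone doesn’t prove anything; it just flags numbers worth a closer look:

```python
import math
from collections import Counter

def leading_digit_distribution(values):
    """Observed frequency of the leading (non-zero) digit across a dataset."""
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v]
    counts = Counter(digits)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

def benford_expected():
    """Benford's law: P(leading digit = d) = log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Compare a dataset's observed first-digit frequencies to the Benford expectation.
data = [1234, 18.2, 0.023, 911, 2048, 37, 150, 6.1, 19, 112]
observed = leading_digit_distribution(data)
for d, expected in benford_expected().items():
    print(f"{d}: observed {observed.get(d, 0):.2f}  expected {expected:.2f}")
```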

How rumors propagate

The next section of the course (week 4) talks about the natural ecology of BS.

Here there’s a reference to an article by Friggeri, et al., on Rumor Cascades, which discusses the frequency with which true, false and partially true/partially false rumors are “shared” on social media (Facebook).

The professors look at a website called Snopes.com, which evaluates the veracity of published rumors, and use its assessments to classify the veracity of rumors. Next they examine how these rumors are shared over time on Facebook.

Summarizing their research, both false and true rumors propagate sporadically on Facebook. But even verified false or mixed true/mixed false rumors (identified by Snopes.com) continue to propagate on Facebook. This seems to indicate that rumor sharers are ignoring the rumor’s truthfulness or are just unaware of the Snopes.com assessment of the rumor.

Other topics on calling BS

The course syllabus goes on to cover causality (correlation is not causation, a common misconception used in BS), statistical traps and trickery (used to create BS), data visualization (which can be used to hide BS), big data (GIGO leads to BS), publication bias (e.g., most published research presents positive results; where’s all the negative-results research?), predatory publishing and scientific misconduct (organizations that work to create BS for others), the ethics of calling BS (the line between criticism and harassment), fake news and refuting BS.

Fake news

The section on fake news is very interesting. They reference an article in the NYT, The Agency, about how a group in Russia has been wreaking havoc across the internet with fake news and bogus news sites.

But there’s more: another article on the NYT website, Inside a fake news sausage factory, details how multiple websites started publishing bogus news and then used advertisement revenue to tell them which bogus news generated more ad revenue. Apparently there’s money to be made in advertising fake news. (Sigh, probably explains why I can’t seem to get any sponsors for my websites…)

Improving the course

How to improve their course? I’d certainly take a look at what Facebook and others are doing to identify BS/fake news and see if these are working effectively.

Another area to add might be a historical review of fake rumors, news or information. This is not a new phenomenon. It’s been going on since time began.

In addition, there’s little discussion of the consequences of BS on life, politics, war, etc. The world has been irrevocably changed in the past on account of false information. Knowing how bad this has been might lend some urgency to studying how to better identify BS.

There’s a lot of focus on Academia in the course and although this is no doubt needed, most people need to understand whether the news they see every day is fake or not. Focusing more on this would be worthwhile.

~~~~

I admire the University of Washington professors for putting this course together. It’s really something that everyone needs to understand nowadays.

They say the lectures will be recorded and published online – good for them. Also, the current course syllabus is for a one credit hour course but they would like to expand it to a three to four credit hour course – another great idea.

Comments?

Photo credit(s): The Donation of Constantine; New York World – Remember the Maine, Public Domain; Benjamin Franklin’s Bag of Scalps letter; fake-news-rides-sociales by Portal GDA

Insecure SHA-1 imperils Internet security, PKI, and most password systems

safe 'n green by Robert S. Donovan (cc) (from flickr)
safe ‘n green by Robert S. Donovan (cc) (from flickr)

I suppose it’s inevitable but surprising nonetheless. A recent article, Faster computation will damage the Internet’s integrity, in MIT Technology Review indicates that by 2018, SHA-1 will be crackable by any determined large organization. Similarly, just a few years later, perhaps by 2021, a much smaller organization will have the computational power to crack SHA-1 hash codes.

What’s a hash?

Cryptographic hash functions like SHA-1 are designed such that, when a string of characters is “hash”ed they generate a binary value which has a couple of great properties:

  • Irreversibility – given a text string and a “hash_value” generated by hashing “text_string”, there is no way to determine what the “text_string” was from its hash_value.
  • Uniqueness – given two or more text strings, “text_string1” and “text_string2”, they should generate two unique hash values, “hash_value1” and “hash_value2” (both properties are illustrated in the sketch below).
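
A quick illustration of both properties using Python’s hashlib (SHA-1 because it’s the subject of this post, and SHA-256 as the kind of SHA-2 replacement discussed below); the input strings are arbitrary examples:

```python
import hashlib

h1 = hashlib.sha1(b"my secret text string").hexdigest()
h2 = hashlib.sha1(b"my secret text string!").hexdigest()   # one character different

print(h1)            # SHA-1's 160-bit digest rendered as 40 hex characters
print(h2)            # a completely different digest
print(h1 == h2)      # False: distinct inputs should give distinct hash values

# There is no inverse function: recovering the input from h1 means guessing
# candidate inputs and hashing each one (a brute force attack), which is
# exactly what cheaper computation makes feasible against SHA-1.
sha2_digest = hashlib.sha256(b"my secret text string").hexdigest()  # 256-bit replacement
```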

Although hash functions are designed to be irreversible that doesn’t mean that they couldn’t be broken via a brute force attack. For example, if one were to try every known text string, sooner or later one would come up with a “text_string1” that hashes to “hash_value1”.

But perhaps even more serious, the SHA-1 algorithm is prone to hash collisions, which fails the uniqueness property above. That is, there are a few “text_string1”s that hash to the same “hash_value1”.

All this wouldn’t be much of a problem except that with Moore’s law in force and continuing for the next 6 years or so we will have processing power in chips capable of doing a brute force attack against SHA-1 to find text_strings that match any specific hash value.

So what’s the big deal?

Well, it turns out that SHA-1 algorithms underpin almost all secure data transmissions today. That is, most Public-Key Infrastructure (PKI) depends on SHA-1 to sign digital certificates. And although that’s pretty bad, what’s even worse is that Secure Socket Layer/Transport Layer Security (SSL/TLS), used by “https://” websites the world over, also depends on SHA-1 to send key information used to encrypt/decrypt secure Internet transactions.

On top of all that, many of today’s secure systems with passwords, use SHA-1 to hash passwords and instead of storing actual passwords in plain-text on their password files, they only store the SHA-1 hash of the passwords.  As such, by 2021, anyone that can read the hashed password file can retrieve any password in plain text.

What all this means is that by 2018 for some, and 2021 or thereabouts for just about anybody else, today’s secure internet traffic, PKI and most system passwords will no longer be secure.

What needs to be done

It turns out that the NSA knew about the failings of SHA-1 quite a while ago and, as such, NIST released SHA-2 as a new hash algorithm and its functional replacement. Probably just in time, this month NIST announced a winner for a new SHA-3 algorithm as a functional replacement for SHA-2.

This may take a while. What needs to be done is to have all digital certificates that use SHA-1 invalidated, with new ones generated using SHA-2 or SHA-3. And of course, TLS and SSL Internet functionality all has to be re-coded to recognize and use SHA-2 or SHA-3 instead of SHA-1.

Finally, for most of those password systems, users will need to re-login and have their password hashes changed over from SHA-1 to SHA-2 or SHA-3.
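
One hedged sketch of how that changeover might work at the user’s next login: verify against the stored SHA-1 hash and, on success, re-hash the password with SHA-256 and update the record. Real systems should also use per-user salts and a slow key-derivation function; this only illustrates the migration step, and the username/password are made up:

```python
import hashlib

password_db = {   # username -> (algorithm, hex digest)
    "alice": ("sha1", hashlib.sha1(b"correct horse").hexdigest()),
}

def verify_and_upgrade(user: str, password: str) -> bool:
    """Check the login and, if the stored hash is SHA-1, upgrade it to SHA-256."""
    algo, stored = password_db[user]
    digest = hashlib.new(algo, password.encode()).hexdigest()
    if digest != stored:
        return False
    if algo == "sha1":   # successful login with the old hash: re-hash with SHA-256
        password_db[user] = ("sha256", hashlib.sha256(password.encode()).hexdigest())
    return True

print(verify_and_upgrade("alice", "correct horse"))   # True, entry now stored as SHA-256
print(password_db["alice"][0])                        # "sha256"
```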

Naturally, in order to use SHA-2 or SHA-3 many systems may need to be upgraded to later levels of code.  Seems like Y2K all over again, only this time it’s security that’s going to crash.  It’s good to be in the consulting business, again.

~~~~

But the real problem IMHO, is Moore’s law.  If it continues to double processing power/transistor density every two years or so, how long before SHA-2 or SHA-3 succumb to same sorts of brute force attacks?  Given that, we appear destined to change hashing, encryption and other security algorithms every decade or so until Moore’s law slows down or god forbid, stops altogether.

Comments?

 

Shingled magnetic recording disks

A couple of weeks ago I attended a day of the SNIA Storage Developers Conference (SDC) where Garth Gibson of Carnegie Mellon University’s Parallel Data Lab (CMU PDL) and Panasas was giving a talk about what they are up to at CMU’s storage lab. His talk at the conference was on shingled magnetic recording (SMR) disks. We have discussed this topic before in our posts on Sequential only disks?! and in Disk trends revisited. SMR may require a re-thinking of how we currently access disk storage.

Recall that shingled magnetic recording uses a write head that overwrites multiple tracks at a time (see graphic above), with one track being properly written and the adjacent (inward) tracks being overwritten. As the head moves to the next track, that track can be properly written but more adjacent (inward) tracks are overwritten, etc. In this fashion data can be written sequentially, on overlapping write passes.  In contrast, read heads can be much narrower and are able to read a single track.

In my post, I assumed that this would mean that the new shingled magnetic recording disks would need to be accessed sequentially not unlike tape. Such a change would need a massive rewrite to only write data sequentially.  I had suggested this could potentially work if one were to add some SSD or other NVRAM to the device to help manage the mapping of the data to the disk.  Possibly that plus a very sophisticated drive controller, not unlike SSD wear leveling today, could handle mapping a physically sequentially accessed disk to a virtually randomly accessed storage protocol.

Garth’s approach to the SMR dilemma

Garth and his team of researchers are taking another tack on the problem. In his view there are multiple groups of tracks on an SMR disk (zones or bands). Each band can be written either sequentially or randomly, but all bands can be read randomly. One can break up the disk to include sections of multiple shingled bands that are sequentially written and smaller, non-shingled bands that can be randomly written. Of course there would be a gap between the shingled bands in order not to overwrite adjacent bands. And there would also be gaps between the randomly written tracks in a non-shingled partition to allow for the wider track writing that occurs with the SMR write head.

His pitch at the conference dealt with some characteristics of such a multi-band disk device, such as:

  • How to determine the density for a device that has multiple bands of both shingled write data and randomly written data.
  • How big or small a shingled band should be in order to support “normal” small block and randomly accessed file IO.
  • How many randomly written tracks or what the capacity of the non-shingled bands would need to be to support “normal” file IO activity.

For maximum areal density one would want large shingled bands.  There are other interesting considerations that were not as obvious but I won’t go into here.

SCSI protocol changes for SMR disks

The other, more interesting section of Garth’s talk was on recent proposed T10 and T13 changes to support SMR disks that supported shingled and non-shingled partitions and what needed to be done to support SMR devices.

The SCSI protocol changes being considered to support SMR devices include:

  • A new write cursor for each shingled write band that indicates the next LBA to be written. The write cursor starts out at a relative band address of 0 and, as each LBA is written consecutively in the band, it’s incremented by one (see the sketch after this list).
  • A write cursor can be reset (to zero) indicating that the band has been erased
  • Each drive maintains the band map and current cursor position within each band and this can be requested by SCSI drivers to understand the configuration of the drive.
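
Here’s a small sketch (my own illustration, not the T10/T13 proposal text) of the band and write-cursor behavior described in the list above: sequential-only writes within a shingled band, a cursor at the next writable LBA, random reads, and a reset that effectively erases the band:

```python
class ShingledBand:
    def __init__(self, band_id: int, size_lbas: int):
        self.band_id = band_id
        self.size = size_lbas
        self.cursor = 0                 # relative LBA of the next write
        self.data = {}

    def write(self, rel_lba: int, block: bytes):
        if rel_lba != self.cursor:      # only the LBA at the cursor may be written
            raise ValueError(f"band {self.band_id}: write at {rel_lba}, cursor at {self.cursor}")
        if self.cursor >= self.size:
            raise ValueError(f"band {self.band_id} is full")
        self.data[rel_lba] = block
        self.cursor += 1                # cursor advances with each sequential write

    def read(self, rel_lba: int) -> bytes:
        return self.data[rel_lba]       # reads can be random within the band

    def reset(self):
        self.cursor = 0                 # resetting the cursor marks the band as erased
        self.data.clear()

band = ShingledBand(band_id=0, size_lbas=4)
band.write(0, b"blk0"); band.write(1, b"blk1")
print(band.read(0), band.cursor)        # b'blk0' 2
band.reset()                            # band "erased", cursor back to 0
```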

Probably other changes are required as well but these seem sufficient to flesh out the problem.

SMR device software support

Garth and his team implemented an SMR device, emulated in software using real randomly accessed devices. They then implemented an SMR device driver that used the proposed standards changes and, finally, implemented a ShingledFS file system that used this emulated SMR disk, to see how it would all work. (See their report on Shingled Magnetic Recording for Big Data Applications for more information.)

The CMU team implemented a log structured file system for the ShingledFS that only wrote data to the emulated SMR disk shingled partition sequentially, except for mapping and meta-data information which was written and updated randomly in a non-shingled partition.

You may recall that a log structured file system is essentially written as a sequential stream of data (not unlike a log).  But there is additional mapping required that indicates where file data is located in the log which allows for randomly accessing the file data.
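
A minimal sketch of that log-structured idea (an illustration of mine, not CMU’s ShingledFS): file data is only ever appended to a sequential log, which suits a shingled band, while a small, randomly updated index maps each file block to its offset in the log:

```python
class TinyLogFS:
    def __init__(self):
        self.log = bytearray()     # sequentially written data (would live in a shingled band)
        self.index = {}            # (filename, block#) -> (offset, length): mapping metadata
                                   # kept in the randomly written, non-shingled band

    def write_block(self, name: str, block_no: int, data: bytes):
        offset = len(self.log)     # always append at the tail of the log
        self.log.extend(data)
        self.index[(name, block_no)] = (offset, len(data))  # small random metadata update

    def read_block(self, name: str, block_no: int) -> bytes:
        offset, length = self.index[(name, block_no)]
        return bytes(self.log[offset:offset + length])

fs = TinyLogFS()
fs.write_block("a.txt", 0, b"hello ")
fs.write_block("a.txt", 1, b"world")
fs.write_block("a.txt", 0, b"HELLO ")   # a rewrite just appends; the index points to the new copy
print(fs.read_block("a.txt", 0) + fs.read_block("a.txt", 1))   # b'HELLO world'
```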

In their report and at the conference, Garth presented some benchmark results for a big data application called Terasort (essentially Teragen, Terasort and Teravalidate), which uses Hadoop to sort a large body of data. Not sure I can replicate this information here, but suffice it to say that at the moment the emulated SMR device with ShingledFS did not beat a base EXT3 or FUSE setup using the same hardware for these applications.

Now the CMU project was done by a bunch of smart researchers but it’s still relatively new and not necessarily that optimized. Thus, there’s probably some room for improvement in the ShingledFS and maybe even in the emulated SMR device and/or the SMR device driver.

At the moment Garth and his team seem to believe that SMR devices are certainly feasible and would take only modest changes to the SCSI protocols to support. As for file system support, there is plenty of history surrounding log structured file systems, so these are certainly doable, but they would probably require extensive development to be implemented in the various OSes that would support an SMR device. The device driver changes don’t seem to be as significant.

~~~~

It certainly looks like there are going to be SMR devices in our future. It’s just a question of whether they will ever be as widely supported as the randomly accessed disk devices we know and love today. Possibly, this could all be hidden behind a storage subsystem that makes the technology available as networked storage capacity, and over time maybe SMR devices could be supported in more standard OS device drivers and file systems. Nevertheless, to keep capacity and areal density on their current growth trajectory, SMR disks are coming; it’s just a matter of time.

Comments?

Image: (c) 2012 Hitachi Global Storage Technologies, from IEEE SCV Magnetics Society presentation by Roger Wood