Google vs. National Information Exchange Model

Information Exchange Package Documents (IEPI) lifecycle from www.niem.gov
Information Exchange Package Documents (IEPI) lifecycle from http://www.niem.gov

Wouldn’t the National information exchange be better served by deferring the National Information Exchange Model (NIEM) and instead implementing some sort of Google-like search of federal, state, and municipal text data records.  Most federal, state and local data resides in sophisticated databases using their information management tools but such tools all seem to support ways to create a PDF, DOC, or other text output for their information records.   Once in text form, such data could easily be indexed by Google or other search engines, and thus, searched by any term in the text record.

Now this could never completely replace NIEM, e.g., it could never offer even “close-to” real-time information sharing.  But true real-time sharing would be impossible even with NIEM.  And whereas NIEM is still under discussion today (years after its initial draft) and will no doubt require even more time to fully  implement, text based search could be available today with minimal cost and effort.

What would be missing from a text based search scheme vs. NIEM:

  • “Near” realtime sharing of information
  • Security constraints on information being shared
  • Contextual information surrounding data records,
  • Semantic information explaining data fields

Text based information sharing in operation

How would something like a Google type text search work to share government information.  As discussed above government information management tools would need to convert data records into text.  This could be a PDF, text file, DOC file, PPT, and more formats could be supported in the future.

Once text versions of data records were available, it would need to be uploaded to a (federally hosted) special website where a search engine could scan and index it.  Indexing such a repository would be no more complex than doing the same for the web today.  Even so it will take time to scan and index the data.  Until this is done, searching the data will not be available.  However, Google and others can scan web pages in seconds and often scan websites daily so the delay may be as little as minutes to days after data upload.

Securing text based search data

Search security could be accomplished in any number of ways, e.g., with different levels of websites or directories established at each security level.   Assuming one used different websites then Google or another search engine could be directed to search any security level site at your level and below for information you requested. This may take some effort to implement but even today one can restrict a Google search to a set of websites.  It’s conceivable that some script could be developed to invoke a search request based on your security level to restrict search results.

Gaining participation

Once the upload websites/repositories are up and running, getting federal, state and local government to place data into those repositories may take some persuasion.  Federal funding can be used as one means to enforce compliance.  Bootstrapping data loading into the searchable repository can help insure initial usage and once that is established hopefully, ease of access and search effectiveness, can help insure it’s continued use.

Interim path to NIEM

One loses all contextual and most semantic information when converting a database record into text format but that can’t be helped.   What one gains by doing this is an almost immediate searchable repository of information.

For example, Google can be licensed to operate on internal sites for a fair but high fee and we’re sure Microsoft is willing to do the same for Bing/Fast.  Setting up a website to do the uploads can take an hour or so by using something like WordPress and file management plugins like FileBase but other alternatives exist.

Would this support the traffic for the entire nation’s information repository, probably not.  However, it would be an quick and easy proof of concept which could go a long way to getting information exchange started. Nonetheless, I wouldn’t underestimate the speed and efficiency of WordPress as it supports a number of highly active websites/blogs.  Over time such a WordPress website could be optimized, if necessary, to support even higher performance.

As this takes off, perhaps the need for NIEM becomes less time sensitive and will allow it to take a more reasoned approach.  Also as the web and search engines start to become more semantically aware perhaps the need for NIEM becomes less so.  Even so, there may ultimately need to be something like NIEM to facilitate increased security, real-time search, database context and semantics.

In the mean time, a more primitive textual search mechanism such as described above could be up and available for download within a day or so. True, it wouldn’t provide real time search, wouldn’t provide everything NIEM could do, but it could provide viable, actionable information exchange today.

I am probably over simplifying the complexity to provide true information sharing but such a capability could go a long way to help integrate governmental information sharing needed to support national security.

Latest SPC-2 results – chart of the month

SPC-2* benchmark results, spider chart for LFP, LDQ and VOD throughput
SPC-2* benchmark results, spider chart for LFP, LDQ and VOD throughput

Latest SPC-2 (Storage Performance Council-2) benchmark resultschart displaying the top ten in aggregate MBPS(TM) broken down into Large File Processing (LFP), Large Database Query (LDQ) and Video On Demand (VOD) throughput results. One problem with this chart is that it really only shows 4 subsystems: HDS and their OEM partner HP; IBM DS5300 and Sun 6780 w/8GFC at RAID 5&6 appear to be the same OEMed subsystem; IBM DS5300 and Sun 6780 w/ 4GFC at RAID 5&6 also appear to be the same OEMed subsystem; and IBM SVC4.2 (with IBM 4700’s behind it).

What’s interesting about this chart is what’s going on at the top end. Both the HDS (#1&2) and IBM SVC (#3) seem to have found some secret sauce for performing better on the LDQ workload or conversely some dumbing down of the other two workloads (LFP and VOD). According to the SPC-2 specification

  • LDQ is a workload consisting of 1024KiB and 64KiB transfers whereas the LFP consists of 1024KiB and 256KiB transfers and the VOD consists of only 256KiB, so transfer size doesn’t tell the whole story.
  • LDQ seems to have a lower write proportion (1%) while attempting to look like joining two tables into one, or scanning data warehouse to create output whereas, LFP processing has a read rate of 50% (R:W of 1:1) while executing a write-only phase, read-write phase and a read-only phase, and apparently VOD has a 100% read only workload mimicking streaming video.
  • 50% of the LDQ workload uses 4 I/Os outstanding and the remainder 1 I/O outstanding. The LFP uses only 1 I/O outstanding and VOD uses only 8 I/Os outstanding.

These seem to be the major differences between the three workloads. I would have to say that some sort of caching sophistication is evident in the HDS and SVC systems that is less present in the remaining systems. And I was hoping to provide some sort of guidance as to what that sophistication looked like but

  • I was going to say they must have a better sequential detection algorithm but the VOD, LDQ and LFP workloads have 100%, 99% and 50% read ratios respectively and sequential detection should perform better with VOD and LDQ than LFP. So thats not all of it.
  • Next I was going to say it had something to do with I/O outstanding counts. But VOD has 8 I/Os outstanding and the LFP only has 1, so the if this were true VOD should perform better than LFP. While LDQ having two sets of phases with 1 and 4 I/Os outstanding should have results somewhere in between these two. So thats not all of it.
  • Next I was going to say stream (or file) size is an important differentiator but “Segment Stream Size” for all workloads is 0.5GiB. So that doesn’t help.

So now I am a complete loss as to understand why the LDQ workloads are so much better than the LFP and VOD workload throughputs for HDS and SVC.

I can only conclude that the little write activity (1%) thrown into the LDQ mix is enough to give the backend storage a breather and allow the subsystem to respond better to the other (99%) read activity. Why this would be so much better for the top performers than the remaining results is not entirely evident. But I would add that, being able to handle lots of writes or lots of reads is relatively straight forward, but handling a un-ballanced mixture is harder to do well.

To validate this conjecture would take some effort. I thought it would be easy to understand what’s happening but as with most performance conundrums the deeper you look the more confounding the results often seem to be.

The full report on the latest SPC results will be up on my website later this year but if you want to get this information earlier and receive your own copy of our newsletter – email me at SubscribeNews@SilvertonConsulting.com?Subject=Subscribe_to_Newsletter.

I will be taking the rest of the week off so Happy Holidays to all my readers and a special thanks to all my commenters. See you next week.

ESRP results over 5K mbox-chart of the month

ESRP Results, over 5K mailboxr, normalized (per 5Kmbx) read and write DB transfers as of 30 October 2009
ESRP Results, over 5K mailbox, normalized (per 5Kmbx) read and write DB transfers as of 30 October 2009

In our quarterly study on Exchange Solution Reviewed Program (ESRP) results we show a number of charts to get a good picture of storage subsystem performance under Exchange workloads. The two that are of interest to most data centers are both the normalized and un-normalized database transfer (DB xfer) charts. The problem with un-normalized DB xfer charts is that the subsystem supporting the largest mailbox count normally shows up best, and the rest of the results are highly correlated to mailbox count. In contrast, the normalized view of DB xfers tends to discount high mailbox counts and shows a more even handed view of performance.

 

We show above a normalized view of ESRP results for the category that were available last month. A couple of caveats are warranted here:

  • Normalized results don’t necessarily scale – results shown in the chart range from 5,400 mailboxes (#1) to 100,000 mailboxes (#6). While normalization should allow one to see what a storage subsystem could do for any mailbox count. It is highly unlikely that one would configure the HDS AMS2100 to support 100,000 mailboxes and it is equally unlikely that one would configure the HDS USP-V to support 5,400 mailboxes.
  • The higher count mailbox results tend to cluster when normalized – With over 20,000 mailboxes, one can no longer just use one big Exchange server and given the multiple servers driving the single storage subsystem, results tend to shrink when normalized. So one should probably compare like mailbox counts rather than just depend on normalization to iron out the mailbox count differences.

There are a number of storage vendors in this Top 10. There are no standouts here, the midrange systems from HDS, HP, and IBM seem to hold down the top 5 and the high end subsystems from EMC, HDS, and 3PAR seem to own the bottom 5 slots.

However, Pillar is fairly unusual in that their 8.5Kmbx result came in at #4 and their 12.8Kmbx result came in at #8. In contrast, the un-normalized results for this chart appear exactly the same. Which brings up yet another caveat, when running two benchmarks with the same system, normalization may show a difference where none exists.

The full report on the latest ESRP results will be up on our website later this month but if you want to get this information earlier and receive your own copy of our newsletter – just subscribe by emailing us.