Wednesday, March 21, 2007

a programming post, odd.

At work, I've been working on a data processing pipeline. The meta-data for the pipeline (e.g. status of pipeline phases, data locations, etc.) is held at runtime in an LRU cache and flushed to persistent storage. Until a couple of weeks ago, the storage was just a bunch of files on the filesystem. It worked, but collecting aggregate information out of it was pretty slow: if you wanted a global view of all status within the pipeline, you had to traverse the meta-data graph, loading each individual file as the traversal progressed.

I'm allergic to relational databases, and we're already using Marklogic/XQuery to store and transform our data, so we decided to try a new implementation of the LRU cache with the storage in Marklogic. Performance wasn't impacted much (maybe an extra couple of minutes for a 1500+ file pipeline that takes 10+ hours), and now status reports take a couple of seconds rather than a few minutes. Hooray!

Since that turned out to be fairly painless, I decided I'd also try getting some performance statistics out of the meta-data in the database. There are a few helper functions to do sum/count/average, but I wanted to avoid iterating over the list multiple times. I found some pseudocode on Wikipedia for computing the variance in a single pass, and implemented an XQuery version:

(:
This is an XQuery implementation of the variance algorithm on this page:
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

credited to Donald Knuth, The Art of Computer Programming, vol 2: Seminumerical Algorithms, 3rd edn., p. 232 Boston: Addison-Wesley.

It relies on Marklogic's extensions to XQuery, so it probably won't
work on other XQuery implementations.
:)
define function putil:getSumMeanVariance($srcvals as xs:double*) as node()* {
  let $n := 0
  let $mean := 0
  let $S := 0
  let $sum := 0

  (: single pass over the values; xdmp:set updates the running totals in place :)
  let $throwAway :=
    for $x in $srcvals
    let $delta := $x - $mean
    return (
      xdmp:set($n, $n + 1),
      xdmp:set($mean, $mean + ($delta div $n)),
      xdmp:set($S, $S + $delta * ($x - $mean)),
      xdmp:set($sum, $sum + $x)
    )
  (: sample variance; defined as 0 when there are fewer than two values :)
  let $variance :=
    if ($n gt 1) then
      $S div ($n - 1)
    else 0
  return (
    <count>{$n}</count>,
    <sum>{$sum}</sum>,
    <mean>{$mean}</mean>,
    <variance>{$variance}</variance>,
    <stdev>{math:sqrt($variance)}</stdev>
  )
}


let $elapsedTimes :=
  for $pnode in collection($collName)[ .... uninteresting XPath predicate stuff here .... ]
  return $pnode/stats/time/end - $pnode/stats/time/start
return <stats>{putil:getSumMeanVariance($elapsedTimes)}</stats>
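
For anyone without Marklogic handy, the same single-pass update can be sketched in Python. This is only an illustration of the algorithm as I read it from the XQuery above: the function name sum_mean_variance, the dictionary keys, and the sample values are made up for the example and are not part of the pipeline.

```python
import math


def sum_mean_variance(values):
    """Return count, sum, mean, sample variance and stdev in one pass."""
    n = 0
    mean = 0.0
    s = 0.0      # running sum of squared deviations from the current mean
    total = 0.0

    for x in values:
        n += 1
        delta = x - mean          # deviation from the mean *before* this value
        mean += delta / n         # mean updated to include this value
        s += delta * (x - mean)   # old deviation times new deviation
        total += x

    # Sample variance (n - 1 in the denominator); 0 for fewer than two values,
    # matching the XQuery version above.
    variance = s / (n - 1) if n > 1 else 0.0

    return {
        "count": n,
        "sum": total,
        "mean": mean,
        "variance": variance,
        "stdev": math.sqrt(variance),
    }


# Hypothetical elapsed times, purely for illustration.
stats = sum_mean_variance([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

The point of the delta trick is that the mean and the sum of squared deviations are both updated per value, so the list never has to be walked a second time to compute the variance.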