I'm allergic to relational databases, and we're already using Marklogic/XQuery to store and transform our data, so we decided to try a new implementation of the LRU cache with the storage in Marklogic. Performance wasn't impacted much (maybe an extra couple of minutes for a 1500+ file pipeline that takes 10+ hours), and now status reports take a couple of seconds rather than a few minutes. Hooray!
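For the curious, the cache entries are just small XML documents in a collection. Here's a minimal sketch of what one insert might look like; the URI scheme, element names, collection name, and numeric timestamp format are invented for illustration, but the stats/time structure mirrors what the reporting query further down expects:

(: hypothetical example: record one pipeline step as a document in the cache collection :)
let $fileName := "example.xml"   (: placeholder values, just for illustration :)
let $start := 1234567.25         (: start/end stored as numeric seconds, an assumption :)
let $end := 1234890.75
return
  xdmp:document-insert(
    concat("/cache/", $fileName),
    <entry>
      <file>{$fileName}</file>
      <stats>
        <time>
          <start>{$start}</start>
          <end>{$end}</end>
        </time>
      </stats>
    </entry>,
    xdmp:default-permissions(),
    "pipeline-cache"             (: stand-in for the real collection name :)
  )

Status reports then become a quick query over the collection instead of re-scanning the pipeline output.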
Since that turned out to be less painful than expected, I decided I'd also try getting some performance statistics out of the metadata in the database. There are a few helper functions to do sum/count/average, but I wanted to avoid iterating over the list multiple times. I found some pseudocode here, and implemented an XQuery version:
(:
This is an XQuery implementation of the variance algorithm on this page:
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
credited to Donald Knuth, The Art of Computer Programming, vol. 2: Seminumerical Algorithms, 3rd edn., p. 232. Boston: Addison-Wesley.
It relies on Marklogic's extensions to XQuery, so it probably won't
work on other XQuery implementations.
:)
define function putil:getSumMeanVariance($srcvals as xs:double*) as node()* {
  let $n := 0
  let $mean := 0
  let $S := 0
  let $sum := 0
  (: single pass over the values: xdmp:set updates the running count,
     mean, sum, and sum of squared deviations from the current mean :)
  let $throwAway :=
    for $x in $srcvals
    let $delta := $x - $mean
    return (
      xdmp:set($n, $n + 1),
      xdmp:set($mean, $mean + ($delta div $n)),
      xdmp:set($S, $S + $delta * ($x - $mean)),
      xdmp:set($sum, $sum + $x)
    )
  (: sample variance; guard against sequences with fewer than two values :)
  let $variance :=
    if ($n gt 1) then
      $S div ($n - 1)
    else 0
  return (
    <count>{$n}</count>,
    <sum>{$sum}</sum>,
    <mean>{$mean}</mean>,
    <variance>{$variance}</variance>,
    <stdev>{math:sqrt($variance)}</stdev>
  )
}
let $elapsedTimes :=
  for $pnode in collection($collName)[ .... uninteresting XPath predicate stuff here .... ]
  return $pnode/stats/time/end - $pnode/stats/time/start
return <stats>{putil:getSumMeanVariance($elapsedTimes)}</stats>
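As a quick sanity check (not part of the actual pipeline query), the function gives the expected numbers on a small literal sequence:

putil:getSumMeanVariance((2, 4, 4, 4, 5, 5, 7, 9))
(: returns count 8, sum 40, mean 5, sample variance 32 div 7 (about 4.57), stdev about 2.14 :)

The xdmp:set calls are what make the single pass work; standard XQuery bindings are immutable, which is why the header comment warns that this won't port to other XQuery engines without rework.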