As I mentioned in my last post I'm working on a data-processing pipeline, and have spent the last couple days investigating why the performance (records/second) of a particular pipeline stage drops off like it does.
The software we're working on analyzes bibliographic records and creates a representative sample. Our users create mappings between one XML schema and another. They have to go through much fewer iterations if they've got a good sample to work with. And iterations are much shorter if they can test their mapping against a sample that's <500 records, rather than 16+ million records.
The oddly performing pipeline stage normalizes XML entities by replacing them with a numeric reference to their Unicode equivalent. Yeah, the DTD should do that... but it doesn't help when the data provider escapes the entities. Or when their DTD replaces the entity... with the entity... argh.
The graph below shows the records/second performance of three stages of a pipeline (y-axis). The x-axis is time. The yellow nodes are each file's entity-replacement performance. The blue nodes are the files getting split into 5000-record chunks and inserted into a Marklogic database. The pink nodes are each 5000-record chunk being analyzed. The first 500 or so files all have 30,000 records; the rest of the files are daily updates that vary in size -- that's why the tail end of each gets a little kooky.The initial steep drop-off for the first few files was understandable. It's one of the first stages of the pipeline, and the threads doing entity-replacement don't have to compete for I/O with other pipeline stages until the following stages have entity-replaced-output to work on. But the continued decline made me worry that there was a problem in the code.
Reviewed the code. Each file is getting processed by its own thread. Each thread should create SAX handling objects that aren't shared between threads. No static fields or other stuff to interact between threads...
But! I notice that the SAX handler uses a StringBuffer to accumulate a records-worth of text as it replaces entities. When endElement is reached it calls sbuf.setLength(0), which my fancy new book learnin' suggested could cause memory/performance problems.
And I set off on my non-adventure....
There are four series on the graph above.
- Dark blue is the original run.
- Pink is the result of replacing sbuf.setLength(0) with sbuf = new StringBuffer(). No fabulous improvement. Stupid book learnin'.
- I decide that 'new StringBuffer()' meant that memory is getting re-allocated from initial the default StringBuffer size, which isn't as efficient as it could be. Yellow is the result replacing sbuf = new StringBuffer(INIT_LENGTH). INIT_LENGTH is a couple thousand bytes. Still no major improvement.
- Re-review the code. Realize that the FileInputStream/FileOutputStream objects aren't wrapped with BufferedInputStream/BufferedOutputStream. The light blue nodes are the result of changing that... Still no major improvement. I assume this was because these input files are in Zip files, and output to GZIP files -- so were already buffered (I did increase their buffer sizes along with adding the buffering to the uncompressed-file handling).
I fiddle with my statistics exporting code to include the file-size in the exported spreadsheet.
*@$*&!
And the KB/second graph looks like:Moral of the story: Get the most complete picture of a performance problem you can, and don't be swayed by the kooky little performance trivia you read.