Friday, April 06, 2007

lesson of the day: just assume Co-Worker #1 is correct

I've been reading Java 2 Performance and Idiom Guide, it's quite good. But, I think I'm too interested in experimenting with the tweaks it suggests.

As I mentioned in my last post I'm working on a data-processing pipeline, and have spent the last couple days investigating why the performance (records/second) of a particular pipeline stage drops off like it does.

The software we're working on analyzes bibliographic records and creates a representative sample. Our users create mappings between one XML schema and another. They have to go through much fewer iterations if they've got a good sample to work with. And iterations are much shorter if they can test their mapping against a sample that's <500 records, rather than 16+ million records.

The oddly performing pipeline stage normalizes XML entities by replacing them with a numeric reference to their Unicode equivalent. Yeah, the DTD should do that... but it doesn't help when the data provider escapes the entities. Or when their DTD replaces the entity... with the entity... argh.

The graph below shows the records/second performance of three stages of a pipeline (y-axis). The x-axis is time. The yellow nodes are each file's entity-replacement performance. The blue nodes are the files getting split into 5000-record chunks and inserted into a Marklogic database. The pink nodes are each 5000-record chunk being analyzed. The first 500 or so files all have 30,000 records; the rest of the files are daily updates that vary in size -- that's why the tail end of each gets a little kooky.The initial steep drop-off for the first few files was understandable. It's one of the first stages of the pipeline, and the threads doing entity-replacement don't have to compete for I/O with other pipeline stages until the following stages have entity-replaced-output to work on. But the continued decline made me worry that there was a problem in the code.

Reviewed the code. Each file is getting processed by its own thread. Each thread should create SAX handling objects that aren't shared between threads. No static fields or other stuff to interact between threads...

But! I notice that the SAX handler uses a StringBuffer to accumulate a records-worth of text as it replaces entities. When endElement is reached it calls sbuf.setLength(0), which my fancy new book learnin' suggested could cause memory/performance problems.

And I set off on my non-adventure....
There are four series on the graph above.
  1. Dark blue is the original run.
  2. Pink is the result of replacing sbuf.setLength(0) with sbuf = new StringBuffer(). No fabulous improvement. Stupid book learnin'.
  3. I decide that 'new StringBuffer()' meant that memory is getting re-allocated from initial the default StringBuffer size, which isn't as efficient as it could be. Yellow is the result replacing sbuf = new StringBuffer(INIT_LENGTH). INIT_LENGTH is a couple thousand bytes. Still no major improvement.
  4. Re-review the code. Realize that the FileInputStream/FileOutputStream objects aren't wrapped with BufferedInputStream/BufferedOutputStream. The light blue nodes are the result of changing that... Still no major improvement. I assume this was because these input files are in Zip files, and output to GZIP files -- so were already buffered (I did increase their buffer sizes along with adding the buffering to the uncompressed-file handling).
Then I remember the comment from Co-Worker #1 when I showed the graph to her before I began my journey to nowhere: "Maybe its just file-size"

I fiddle with my statistics exporting code to include the file-size in the exported spreadsheet.
*@&#$*&!

And the KB/second graph looks like:Moral of the story: Get the most complete picture of a performance problem you can, and don't be swayed by the kooky little performance trivia you read.

2 comments:

Anonymous said...

Ha-ha. Your co-worker had to clue you in. I would never admit to something so imbecellic...the key word there being "admit."

I actually just went through
something similar with a simple PHP upload application I was working on. I dug through server settings, trapped the variables being set at every step of the process, scoured the internet for ideas that might point me in the right direction...when it all turned out to be an attribute of the FORM tag on the calling page. Duh. But at least nobody had to tell me I was being dumb. I figured it out all on my own. Ha!!!

Have you or the little woman read Dress Your Family in Corduroy and Denim? I just picked it up from Borders tonight when I saw it on the bargain table. I like to keep my "to read" shelf impossibly overloaded with books. And speaking of books, what does the author of The Road have against punctuation?

And tell my sister that her hotmail account is dead.

neeber said...

Some people would say its the mark of a mature person to admit their mistakes, and give credit where credit is due.

Those people are suckers.

Also, I think I'm immune to typographic errors. They roll off me like ducks off a water buffalo's back.

Her hotmail is kaput. Try... her abbreviated middle name... followed by the initial of her married name... followed by @xmission.com

I feel so clandestine.

And yes, we've read all of David Sedaris' stuff. Even got the audio books for road trips. And saw him at Kingsbury Hall twice.