Wednesday, July 02, 2008

further lessons in patience

I've been tracking down why a run of the rules engine on Windows produced different results for a particular bill than the same run on Unix.

After debugging the problem, I tell the business peeps the same thing the other 2 developers have been saying for the past month:
  • I can't reproduce that unexpected behavior on Windows or Unix, using my builds or the official release builds.
  • Also like the other 2 devs, I tell them there's a bug in the code we need to fix. The unexpected behavior is actually the desired behavior and fixing the bug will result in the desired results all the time.
Through pure chance... I realize the order in which bills work their way through the system affects the outcome of the regression test. Say bill A is part of claim X, and claim X has one or more bills that haven't yet been reprocessed when bill A is pulled off the FIFO bill processing queue. When the rules engine queries the database for the bills that are part of claim X, it won't retrieve the yet-to-be-processed bills.
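Here's a toy sketch of that ordering effect in Python. None of it is the real engine: the `run_queue` function, the `(bill_id, claim_id)` pairs, and the in-memory dict standing in for the database are all made up for illustration. The only thing it assumes is what's described above, namely that a bill's evaluation can only see the bills on its claim that have already been reprocessed.

```python
from collections import deque


def run_queue(bills):
    """Process bills in FIFO order and record what each one could "see".

    `bills` is a list of (bill_id, claim_id) pairs. These names are
    hypothetical stand-ins, not the real engine's schema.
    """
    queue = deque(bills)
    processed = {}  # claim_id -> bill_ids already reprocessed (the "database")
    seen = {}       # bill_id -> sibling bills visible when it was evaluated
    while queue:
        bill_id, claim_id = queue.popleft()
        # The query only returns bills on the claim that were already reprocessed.
        seen[bill_id] = sorted(processed.get(claim_id, []))
        processed.setdefault(claim_id, []).append(bill_id)
    return seen


# Same three bills on claim X, two different queue orders.
first = run_queue([("A", "X"), ("B", "X"), ("C", "X")])
last = run_queue([("B", "X"), ("C", "X"), ("A", "X")])

print(first["A"])  # [] : bill A is evaluated before its siblings exist
print(last["A"])   # ['B', 'C'] : bill A sees the whole claim
```

Same bills, same rules, different answer for bill A, depending only on where it sat in the queue.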

This fits a big symptom of our inconsistent results for this and other bills. We see an unexpected result, and when we try to reproduce it after the fact... no luck, because by then every bill on the claim has been processed and the engine sees all of them.

I can finally reproduce the original problem, but only if I run the regression test prep scripts, and then manually process a handful of bills from the claim and leave some stuck in limbo.

I naively ask the person who runs the test suite, "I think I found the problem, but to confirm I need to look at the log files from the regression test run on Windows. Do we still have the logs?"

I'm told, "I don't know, I don't think there is one. We were told that on Windows the log files aren't produced consistently."

I look into it for 5 minutes. It turns out no one bothered to configure a log file in the INI file. So, no logs were produced.

I'm dumbstruck. It's the latest in a long string of stupendous tales the previous developers have told the business people and test team. Lots of mythology around why the system acts oddly... memory leaks, bad casts and pointer references, 'Windows can't do the log files', 'Windows sorts differently', 'You need to enter the password in all caps... No, not with the shift key, you have to enter it with CAPS LOCK'.

WTF?

To be fair... I am hearing all these anecdotes second-hand and filtered through the non-technical team. But taken as a whole, I can't decide whether it lessens my confidence in the BAs and dev management, in that they couldn't successfully call the engineers on their bullshit... or whether it just confirms my opinion that the engineers were lazy and outright lied about how things worked because they didn't want to take the time to correct the problems.