Tuesday, September 30, 2008

lessons in conflation : compare != sort

For months, the business people have complained that the rules engine sporadically produces different results on Windows than on HP-UX. They also complained that previous developers had told them, "sorting is different on Windows and Unix". I did my best to restrain any reply more caustic than, "That's silly."

Until this month they couldn't point me to bills that had the problem at the same time as we had log files from our regression tests (or find me any archived e-mails from the original developers). Given that we're probably the only people running the application on both HP-UX and Windows -- during our regression testing on the two platforms -- the bug hasn't been that high a priority.

I finally tracked the issue down to differences in the qsort() implementation of each platform's C runtime lib -- specifically how qsort() handles two elements that compare as equal (!). The C standard leaves the resulting order of equal elements unspecified, so neither platform is actually wrong. For some reason, the Windows qsort() performs some extra comparisons and one extra swap.

The list we're sorting looks something like :

1. $23.96
2. $23.96
3. $23.96
4. $23.96
5. $40.95

On Unix, qsort() and this particular compare (descending order by amount) results in the following :

5. $40.95
1. $23.96
2. $23.96
3. $23.96
4. $23.96

On Windows, qsort() and the same compare function results in:

5. $40.95
3. $23.96
1. $23.96
2. $23.96
4. $23.96

The values we're comparing are equal... but what we're actually sorting is the value and a reference to where it started in the list. Because the sorting produces different results we end up annotating the bill lines differently on Windows than Unix.

Changing how we do the comparison fixed the problem. Now we compare the left and right values and, if they are equal, we compare the originating line numbers of the values.
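For the curious, here is a rough sketch of that kind of tie-breaking compare. It is not our actual code: the BillLine struct, its field names, and sort_bill_lines() are all made up for illustration, and I'm assuming amounts are held as integer cents. The point is only that the compare function never returns 0 for two distinct lines, so qsort() has no ties left to order however it pleases.

```c
#include <stdlib.h>

/* Hypothetical record: the amount plus the line it came from on the bill. */
typedef struct {
    long amount_cents;  /* assumed representation, e.g. 2396 for $23.96 */
    int  line_number;   /* original position on the bill, 1-based */
} BillLine;

/* Descending by amount; ties broken by ascending original line number. */
static int compare_bill_lines(const void *lhs, const void *rhs)
{
    const BillLine *a = lhs;
    const BillLine *b = rhs;

    if (a->amount_cents != b->amount_cents)
        return (a->amount_cents < b->amount_cents) ? 1 : -1;

    /* Amounts are equal: fall back to where each value started. */
    return (a->line_number > b->line_number) - (a->line_number < b->line_number);
}

/* Hypothetical wrapper around qsort() for an array of bill lines. */
void sort_bill_lines(BillLine *lines, size_t count)
{
    qsort(lines, count, sizeof lines[0], compare_bill_lines);
}
```

Calling sort_bill_lines() on the five-line list above should give 5, 1, 2, 3, 4 on any conforming qsort(), because line numbers are unique and the ordering is therefore total.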

And now we get consistent results no matter the platform:

5. $40.95
1. $23.96
2. $23.96
3. $23.96
4. $23.96


Next time, in Steve's job rants:
  • Typedefs on top of typedefs on top of typedefs. Oh yeah, and because of typedefs A, B, C, and D it turns out 95% of calls within the system look like pass-by-value but are in fact pass-by-reference. It took me ~2 months to realize that was going on. 4 months later and it still gives me the heebie jeebies.