Tuesday, July 21, 2009
argh
Monday, July 13, 2009
further adventures in horrible code -- effin assert
Saturday, January 10, 2009
best weird debugging experience ever
typedef struct RMAP
{
    bill* pbill;
    line* pline;
    rule* prule;
} rmap;
I'm not sure what I find the most interesting... The fact that somebody thought it was fine to copy that struct definition throughout 15+ .c files.
Or, the fact that somebody slightly modified a couple of those instances.
typedef struct RMAP
{
    bill* pbill;
    line* pline;
    rule* prule;
    Bool overrideFlg;
} rmap;
typedef struct RMAP
{
    bill* pbill;
    line* pline;
    rule* prule;
    double accum;
} rmap;
Or... the best subtle weirdness, in one instance someone reordered the pointers.
typedef struct RMAP
{
    rule* prule;
    bill* pbill;
    line* pline;
} rmap;
So... when you debug the following code in any of the _other_ .c files...
void myCrazyFunction(rmap* a_rmap)
{
    rule* r = a_rmap->prule;
    double reduction = r->reduction;
    ... blahblahblahblahblah ...
}
At runtime, everything works -- the definition of rmap used by the compiler is the definition within the .c file, and the third set of pointer-size bits within the a_rmap struct are assigned to 'r'.
But, when I try to debug the code in Visual Studio and hover over the a_rmap->prule I see a bunch of garbage. I hover over 'r', and I see it's got the values I expect.
I'm guessing the debugger finds the first instance of the rmap type in the .pdb file, and the one oddly ordered struct just so happens to be in the first .c file alphabetically (and also first in build order).
But, that realization doesn't come until about 2 hours of other wild goose chases through the rest of the call stack to eliminate the other 'more likely' possibilities.
Good times.
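The mismatch is easy to reproduce outside of C. Here's a minimal sketch in Python with ctypes, assuming plain integers as stand-ins for the pointers. The class names RmapLocal and RmapReordered and the addresses are made up for illustration. It fills in the struct using the bill/line/rule layout, then reinterprets the same bytes through the reordered definition, which is roughly what the debugger appeared to be doing.

```python
import ctypes


class RmapLocal(ctypes.Structure):
    """Layout most of the .c files compile against: bill, line, rule."""
    _fields_ = [
        ("pbill", ctypes.c_void_p),
        ("pline", ctypes.c_void_p),
        ("prule", ctypes.c_void_p),
    ]


class RmapReordered(ctypes.Structure):
    """The one oddball definition: same members, rule pointer first."""
    _fields_ = [
        ("prule", ctypes.c_void_p),
        ("pbill", ctypes.c_void_p),
        ("pline", ctypes.c_void_p),
    ]


# Hypothetical addresses standing in for real bill/line/rule pointers.
local = RmapLocal(pbill=0x1000, pline=0x2000, prule=0x3000)

# Reinterpret the exact same bytes using the other definition.
debugger_view = RmapReordered.from_buffer_copy(bytes(local))

print("compiled code sees prule =", hex(local.prule))
print("reordered view sees prule =", hex(debugger_view.prule))
```

Both definitions have the same size, so nothing crashes. The reordered view just reads the first slot (really pbill) and calls it prule.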
Tuesday, November 25, 2008
further adventures in wasting time
Monday, October 27, 2008
2 days I'll never get back
Very large as in 123,456,789,236
Guess which horrible application uses a 'long' type to hold those IDs? Guess which horrible application is supported on both HP-UX and Windows? Guess what type might be 32bit on Windows and 64bit on HP-UX? Ding! Ding! Ding!
Spent all day Friday and today rejiggering the code to use 'long long'. Most of today was spent un-rejiggering and re-rejiggering the code to work around Oracle Pro*C's inability to use 'long long' as a host variable -- today I had to swap all that out of the embedded SQL in order to use character strings and strtoll/sprintf to move values from the strings to 'long long' and vice-versa.
I had thought the code was 'OK' on HP-UX... after all these recent changes, I'm not so sure. The code to transfer the values from the query results to their 'long' variable instances seems wrong. In any case... it'll work the same on both platforms now.
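For the curious, here's a rough sketch of both halves of the problem in Python. It is not the actual Pro*C code. The assumptions: ctypes.c_int32 stands in for a 32-bit 'long', ctypes.c_int64 stands in for 'long long', and the two helper functions (names are hypothetical) stand in for the sprintf/strtoll round trip through a character-string host variable.

```python
import ctypes

bill_id = 123456789236  # the kind of ID mentioned in the post

# c_int32 stands in for a 32-bit 'long'. ctypes truncates silently
# instead of raising, much like the C assignment would.
as_32bit = ctypes.c_int32(bill_id).value

# c_int64 stands in for 'long long' (or a 64-bit 'long').
as_64bit = ctypes.c_int64(bill_id).value


def id_to_host_string(value):
    """Stand-in for the sprintf() side: ID -> decimal string."""
    return "%d" % value


def host_string_to_id(text):
    """Stand-in for the strtoll() side: decimal string -> ID."""
    return int(text, 10)


round_tripped = host_string_to_id(id_to_host_string(bill_id))

print("32-bit view :", as_32bit)
print("64-bit view :", as_64bit)
print("via string  :", round_tripped)
```

The 32-bit view comes back as a different number with no error raised, which is what makes this kind of bug so quiet until the IDs get big enough.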
I take some solace from finding the problem before any customer. We have at least 3 customers nearing 1-billion bills -- although the horrible app is 10+ years old, the rate they're adding bills to their systems grows every year. They can also start/change their sequences whenever they feel like it. So, maybe we had another 1-5 years before it became a major emergency.
Tuesday, September 30, 2008
lessons in conflation : compare != sort
Until this month they couldn't point me to bills that had the problem at the same time as we had log files from our regression tests (or find me any archived e-mails from the original developers). Given that we're probably the only people running the application on both HP-UX and Windows -- during our regression testing on the two platforms -- the bug hasn't been that high a priority.
Finally tracked down the cause of the issue to be differences in the qsort() algorithm of the C runtime lib -- specifically how qsort() handles two elements that compare as equal (!). For some reason, the Windows qsort() performs some extra comparisons and 1 extra swap.
The list we're sorting looks something like :
1. $23.96
2. $23.96
3. $23.96
4. $23.96
5. $40.95
On Unix, qsort() and this particular compare (descending order by amount) results in the following :
5. $40.95
1. $23.96
2. $23.96
3. $23.96
4. $23.96
On Windows, its qsort() and the same compare function resulted in:
5. $40.95
3. $23.96
1. $23.96
2. $23.96
4. $23.96
The values we're comparing are equal... but what we're actually sorting is the value and a reference to where it started in the list. Because the sorting produces different results we end up annotating the bill lines differently on Windows than Unix.
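Changing how we do the comparison fixed the problem. Now we compare left/right values; if they are equal, we compare the originating line numbers of the values. With no two elements ever comparing as equal, the sort order no longer depends on how a particular qsort() handles ties.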
And now we get consistent results no matter the platform:
5. $40.95
1. $23.96
2. $23.96
3. $23.96
4. $23.96
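A small sketch of the tie-breaking compare, in Python rather than the real C. The tuple layout (line number, amount in cents) and the function name compare_lines are assumptions for illustration. Python's own sort is stable, so to mimic "any qsort() implementation" the sketch sorts every possible starting order of the list and checks that they all land on the same result.

```python
from functools import cmp_to_key
from itertools import permutations


def compare_lines(left, right):
    """Descending by amount; ties broken by ascending original line number."""
    if left[1] != right[1]:
        return -1 if left[1] > right[1] else 1
    # Amounts are equal: fall back to where the line started in the list.
    return (left[0] > right[0]) - (left[0] < right[0])


# (originating line number, amount in cents), the list from the post.
lines = [(1, 2396), (2, 2396), (3, 2396), (4, 2396), (5, 4095)]

# Sort every starting order; a total ordering leaves only one possible result.
results = {
    tuple(sorted(order, key=cmp_to_key(compare_lines)))
    for order in permutations(lines)
}

for line_no, cents in next(iter(results)):
    print("%d. $%d.%02d" % (line_no, cents // 100, cents % 100))
```

Because the compare never returns 0 for two different lines, there's exactly one valid ordering and the sort algorithm has no ties left to resolve its own way.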
Next time, in Steve's job rants:
- Typedefs on top of typedefs on top of typedefs. Oh yeah, and because of typedefs A, B, C, and D it turns out 95% of calls within the system look like pass-by-value but are in fact pass-by-reference. It took me ~2 months to realize that was going on. 4 months later and it still gives me the heebie jeebies.
Saturday, September 27, 2008
loving/hating Fit
It's typically some error-prone manual process -- run some SQL in Toad, copy results to Excel, run the bill, re-run SQL, compare to the old results with eyeballs, pray.
I read about Fit a few months ago, and it seemed like it'd solve a lot of our problems. Blackbox testing would help build tests up around our legacy codebase. It'd also help bridge the communication gaps between onshore/offshore development, and development/business.
Finally took the time to play with it this week -- specifically, DbFit. Without any customization, I was able to write some simple SQL in a wiki page to generate a new test case for the issue I was working on this week.
I'm in love. It seems like with some minor customization we can make some huge leaps in our productivity. But I really hate trying to navigate the Fit/Fitnesse web sites for reference information. Ugh.
additional offshoring adventures
Monday this week, I realize our configuration editor doesn't give the user the ability to edit a certain set of port #s. I ask the offshore team to add that ability to the config UI.
When I come in the next day the offshore lead has decided the existing way we do things is too error prone. He's made a great change so that the configurable settings are just a range of port #'s and the application(s) will pick unused ports within the range.
After reviewing the changes, the only issue I found was in some output files created by the apps once they pick a port number. Once the port # is selected, an output file will be written. It was implemented with a DataOutputStream to write an int primitive to the file, and read back in with DataInputStream. I'd prefer it to be human readable, so I respond with some notes saying to make the file a text file rather than a data file.
I come in the following day, and find that the only thing he's changed is to give the file a ".txt" file extension.
Wah?
It's the 'little' things like this that drive me crazy. Things I assume don't need much specification. It's particularly annoying because this engineer is so awesome -- he's always quick to turn around any solution, and always asks great questions on our calls. And, he did a great job of taking the initiative to fix a major flaw just a day earlier.
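To spell out what I meant by "text file rather than a data file", here's a sketch in Python (the real code is Java). Java's DataOutputStream.writeInt() writes the int as four bytes, high byte first, which struct.pack(">i", ...) mimics. The port number and the helper names are made up for illustration.

```python
import struct

port = 8080  # hypothetical port picked from the configured range

# What the "data file" holds: 4 raw big-endian bytes, same as writeInt().
binary_payload = struct.pack(">i", port)

# What I was asking for: the digits themselves, readable in any editor.
text_payload = ("%d\n" % port).encode("ascii")


def read_binary_port(payload):
    """Counterpart of DataInputStream.readInt()."""
    return struct.unpack(">i", payload)[0]


def read_text_port(payload):
    """Parse the human-readable form back into an int."""
    return int(payload.decode("ascii").strip())


print("binary file bytes:", binary_payload)
print("text file bytes  :", text_payload)
```

Renaming the first form to ".txt" doesn't change a single byte of it. The writer and the reader both have to switch to the second form.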
Tuesday, September 16, 2008
further adventures in offshoring
aiiiiiiiieeeee.
- bitwise '&' operator used instead of boolean '&&' operator
- '==' operator used for Long/Integer/many-other-reference-types
- String.equals() comparing a String to a DO object
- Enum classes with their own extra-special equals() methods. Ick.
- Classes with equals() but no hashCode()
- Who implements finalize()?
- many, many more
Saturday, August 16, 2008
an afternoon I'll never get back, thanks Eclipse!
Finally managed the right phrase in Google that found me the answer: http://dev.eclipse.org/newslists/news.eclipse.platform/msg74562.html
Apparently, Eclipse plugins may update one of the preferences that tells the builders which file extensions they shouldn't copy to the build output path. To fix it, you have to go to 'Preferences... -> Java -> Compiler -> Building' and then check all the extensions entered into the Filtered Resources text field.
I'm not sure if it was a plugin change, or (probably more likely) changes I was making to my launch configurations to not run out of the 'dist' directory into which the inter-project Ant build packages everything.
I'm not sure if I should be happy that the offshore team I've now inherited at least got the Ant build right. Or, if I should be concerned about whether they've actually been running the code / configuration they think they are when running under Eclipse.
One more in a long line of complaints I've had regarding how this set of applications is configured. Normally, I'm the one on whatever-team-I'm-on pushing for extensive runtime configurability. I don't know why this project's configuration setup -- not the Spring/Hibernate, the home-grown .properties file reading and command-line handling stuff -- rubs me the wrong way. The nonsensical or nonexistent documentation, the obfuscated search paths, the unhelpful error messages, or the fragility of the whole system when you're missing one piece... all of the above?
Even with that lost afternoon, I still love Eclipse. After all, it's free. Oooh, free. One plugin that now I can't live without: Remote System Explorer.
I've got to run Windows at work, and the corporation-approved SSH client sucks balls. When you resize a terminal window... what would you expect to happen? More columns/rows? Bah! That's too obvious and old-school for Attachmate Reflection's SSH client-- instead the mother f*&#$*#&er resizes the font. Badly.
PuTTY is mediocre. The corporation's security agent can't be disabled, and it prevents the cygwin install from running completely.
Enter Remote System Explorer. Multiple terminal windows. Does the 'right' thing when resizing or scrolling. Gives me a file-tree view of the remote system via SFTP. I can edit remote files within Eclipse and it automagically saves them to the remote system.
Eclipse is a beast, and a bit of overkill for a shell window. But, I've already got it open. And, unlike my Ubuntu system running under VMWare... I don't feel like the corporation's jackbooted thugs will re-educate me if/when they find I'm using it.
Tuesday, August 05, 2008
so many resignations, none of them mine.
One friend at old employer resigned. Finally. Sucks for those that are left behind, but it's a great move for him.
As for me, I'm still trying to decide how long to stick it out at the current employer. Finally feel like I'm getting the hang of my current project. We're continuing to refine our processes, and we've got two new engineers that seem to be very good. Now that I've settled into the rut, I've been knocked out of it to take over as tech lead for one of the people that's leaving. More responsibility, higher profile, more interesting work... I should want that, right? And not the cozy rut?
In less wishy-washy-I-am-lame news, things I'm enjoying:
- Dexter
- Dr. Horrible
- 'A Fire Upon the Deep' by Vernor Vinge. I couldn't put it down. Read until 4am night before last, and 2am last night. Very good.
Wednesday, July 02, 2008
further lessons in patience
After debugging the problem, I tell the business peeps the same thing the other 2 developers have been saying for the past month:
- I can't reproduce that unexpected behavior on Windows or Unix, using my builds or the official release builds.
- Also like the other 2 devs, I tell them there's a bug in the code we need to fix. The unexpected behavior is actually the desired behavior and fixing the bug will result in the desired results all the time.
This fits a big symptom of our inconsistent results for this and other bills. We see an unexpected result, and we try to reproduce it after-the-fact... no luck.
I can finally reproduce the original problem, but only if I run the regression test prep scripts, and then manually process a handful of bills from the claim and leave some stuck in limbo.
I naively ask the person that runs the test suite, "I think I found the problem, but to confirm I need to look at the log files from the regression test run on Windows. Do we still have the logs?"
I'm told, "I don't know, I don't think there is one. We were told that on Windows the log files aren't produced consistently."
I look into it for 5 minutes. It turns out no one bothered to configure a log file in the INI file. So, no logs were produced.
I'm dumbstruck. It's the latest in a long string of stupendous tales the previous developers have told the business people and test team. Lots of mythology around why the system acts oddly ... memory leaks, bad casts and pointer references, 'Windows can't do the log files', 'Windows sorts differently', 'You need to enter the password in all caps... No, not with the shift key, you have to enter it with CAPS LOCK'
WTF?
To be fair... I am hearing all these anecdotes second-hand and filtered through the non-technical team. But, taken as a whole I can't decide whether it lessens my confidence in the BAs and dev management, in that they couldn't successfully call the engineers on their bullshit... or if it just confirms my opinion that the engineers were lazy and outright lied about how things worked because they didn't want to take the time to correct the problems.
Saturday, June 21, 2008
triumph! ... oh, dammit
On our horrible project-released-every-month, they are frequently late starting their regression test, and very often late with the release. In 2007, it was only released on-time twice.
As bad as that is, the good(?) thing is they don't release until their regression test says the only differences are the ones they expect. Unfortunately, that typically means a team of 12+ BA's, developers and testers work the weekend.
But not this month. We started the regression test when we were scheduled to on Monday. And, rather than posting the release on the drop-dead Monday date, we passed the release off to the customer-facing site on Friday. This is the first time in years that that has happened.
The process-improvements Jeff and I have been insisting on deserve some of the credit.
Odd things. You know, completely out-of-left-field things like code-reviews and communicating. It's unclear how much of this month's success was the amount of work included in this release, and how much of it was our process changes.
Unfortunately, the day of triumph was made bittersweet by Jeff's resignation. Probably not accurate to call it a resignation, since he was a contractor and decided not to pursue extending his contract. Or, to be precise... not wait until the final day of his contract to find out if they were going to extend his contract.
I don't blame him for finding something else, and I may follow him soon enough. The monthly release meat-grinder can be made less painful... but it'll never be fun.
Wednesday, June 18, 2008
ready. set. go?
More mixed feelings today after a meeting discussing our project's migration from usage under the current application framework to its replacement over the next 2-3 years. Felt good that our hard work to stabilize the crumbling infrastructure is recognized by the business people. Felt worse when I realized that 2+ years into a multi-million dollar project, they're finally discussing the actual mechanics and workflow of how both systems are going to be developed and maintained together. "Oh yeah... we want to release in October. We need to figure that out."
Nearly all the positives about the new job have been snuffed out. There is the potential for it to get better, but it's hard to hold on to the glimmer of hope. Other than feeling like I'm giving up, it's getting harder to find reasons to stay.
Friends at 4 different employers are looking for people... positions with varying levels of awesomeness. Or, I could strike out on my own and find something different.
But, what do I want? That's the million dollar question.
Someplace more engineer-y than my most recent work in content publishing or healthcare. Some of the work was great and challenging... other parts have been mind-numbing or downright eeeeeeeeviiiiiil. Not that the previous work in simulation was fluffy kittens.
Someplace that views R&D as a vital part of their business plan, rather than whining by engineers for 'fun' work. Maintenance can be fun -- debugging an issue, finding the cause of the crazy and esoteric problems is like unlocking a puzzle and can be very rewarding. But, it's not something that you look forward to doing as your sole activity for the next 3+ years.
Something more back-end rather than UI or web-applications. I'm pretty sure that new fangled interwebernet thing will never work. It's all TUBES! TUBES I TELLS YA!
It'd also be nice to land someplace where I could use Python for more than just throw-away utilities, and not be viewed as a rabble-rouser.
Thursday, May 29, 2008
waaaaaah?
So far I've resisted the urge to go fix all the potential instances of the problems, and only fixed the problems exposed by the regression suite. Maybe for next month's release.
Then I spent about 4 hours in conference calls. At about the third hour in I realize the crick in my neck that I'd blamed on sleeping oddly was in fact from the previous day's conference calls. Jeff helpfully points out over IM, "Yes, this meeting is a pain in the neck."
Then I finally get a response from technical support on 2 of my tickets, after they'd spent ~3 weeks in a black hole.
"Due to ... blahblahlbha... we cannot install open source applications on any servers."
The offending applications? Zip. Unzip. GNU Make.
I can't decide whether to laugh or cry. The HP-UX server in question already has the GCC toolchain, GNU tar, ant, CVS, CruiseControl and the list goes on. Not to mention widespread use of Hibernate, Spring, ehcache, and tons of Apache and Jakarta projects throughout our Java and .Net products. WTF?
I've learned my lesson. Attempts at following procedure will only be made as a last resort. I need anti-action-item-Wonder-Woman-bracelets to deflect IT-related action items onto my coworkers.
Speaking of playing dress up...
I have to keep reminding myself I have better things to do than spend $250 on t-shirts.
Sunday, May 25, 2008
further lessons in not believing anyone
More whining about work.
It's becoming increasingly apparent that the previous development staff of Project A were lazy. Definitely not stupid; only people confident in their brilliance could produce such a poorly documented and nightmarish code base.
Lazy in terms of, "Hey, that's acting weird... it seems hard to figure out too. I'll blame X and the business people will live with it."
The latest example is the supposedly 'inconsistent' regression test results. When Jeff and I joined the project, we were repeatedly warned in reverent tones of the odd non-deterministic regression test. The first time the 10k+ bills in the regression suite are run, some bills fail. If you run those bills through again, they work.
On the face of it, it seems very odd. The way it was presented to us it seemed that the failing bills were random, and on re-run the identical bill would produce different results. Various things were blamed: memory leaks, buffer overrun, bad casts.
As we dug into it, it became clear that wasn't the case:
- the bill processing engine wasn't re-started between the failure and success.
- for at least one large set of the failing bills, the same bills fail every time the regression test is run.
- the regression test's preparatory script actually modifies the bills in such a way that they become illegal bills
- the engine sees the illegal bill, and tweaks it such that it works again
- but, many of the failing bills are only partially tweaked because they have manual override codes. The partial tweaking sets the bill up to fail the first time, but corrects things enough that it will work the second time through the system.
Definitely a bug in the engine. The data that's reset by the prep script shouldn't make it behave that badly.
Saturday, May 24, 2008
python + elixir + pyyaml == yay!
At the moment, the developers manually run SQL scripts to peek at various bits of data. Definitely a huge waste of time.
So, I'm playing with Python + Elixir + PyYAML to get a script that I can just pass in the bill ID and it'll query the 5-6 tables and serialize the bill to something more human readable. If I play my cards right, I'll never have to deal with Toad again.
Elixir works its magic and I can query my database. The autoload didn't work, and there are a couple tables with 50-100+ columns. Serializing it to YAML without repeating myself is now the trick.
I'm sure there's a better way to do this, but here's what I came up with... applied to the Elixir tutorial. Not the most exciting thing ever. But, it's the start of building a better testing framework. Doing the same for the real tables will make it much easier to compare a bill before and after running through our application.
#!/usr/bin/python
# -*- coding: latin-1 -*-
from elixir import *
from yaml import load, dump
try:
    from yaml import CLoader as Loader
    from yaml import CDumper as Dumper
except ImportError:
    from yaml import Loader, Dumper

def _toYamlRep(ent):
    """
    Given an elixir entity, query the entity's members via its __dict__
    and return a dict
    """
    ret = {}
    for (k,v) in ent.__dict__.items():
        if k.startswith('_') or k == 'row_type':
            # don't print out the 'hidden' keys
            continue
        if v:
            ret[k] = str(v)
    return ret

class YamlEntity(Entity):
    def toYamlRep(self):
        """
        Wrap the _toYamlRep() dict in another dict, use the class' name as the header.
        """
        return {self.__class__.__name__ : _toYamlRep(self) }

class Movie(YamlEntity):
    title = Field(String(30))
    year = Field(Integer())
    description = Field(Text())
    director = ManyToOne('Director')
    def __repr__(self):
        return '<Movie: "%s" (%d)>'%(self.title,self.year)

class Director(YamlEntity):
    name = Field(String(60))
    movies = OneToMany('Movie')
    def __repr__(self):
        return '<Director: "%s">'%(self.name)

def main():
    metadata.bind = "sqlite://"
    setup_all()
    create_all()
    rscott = Director(name="Ridley Scott")
    glucas = Director(name="George Lucas")
    alien = Movie(title="Alien", year=1979, director=rscott)
    swars = Movie(title="Star Wars", year=1977, director=glucas)
    brunner = Movie(title="Blade Runner", year=1982, director=rscott)
    session.flush()
    for m in Movie.query().all():
        print dump(m.toYamlRep(),Dumper=Dumper,default_flow_style=False)
    cleanup_all()

if __name__ == '__main__':
    main()
And, here's the output:
Movie:
  director: '<Director: "George Lucas">'
  director_id: '2'
  id: '3'
  title: Star Wars
  year: '1977'
Movie:
  director: '<Director: "Ridley Scott">'
  director_id: '1'
  id: '4'
  title: Alien
  year: '1979'
Movie:
  director: '<Director: "Ridley Scott">'
  director_id: '1'
  id: '5'
  title: Blade Runner
  year: '1982'
cruisecontrol == teh suck
Wednesday, May 21, 2008
can it be? finally done? ... WTF?
after the following loop through the tech support desk:
- go to system A
- go to system B
- go to system C
- go to system D
- go to system F
- no, go to system C. enter text 'XYZ A'
- FAIL
- no, go to system C. enter text 'XYZ B'
- SUCCESS
The admin rights have been granted.
It seems obvious now, but if you're not given a definitive answer when you ask "Doesn't Umbrella Corp already have a pool of licenses we can pull from?", keep asking.
Turns out, we had the licenses all along. SWEET ZOMBIE JESUS, that was good times.
Monday, May 19, 2008
progress?
- I reply, "I don't know what group name you're asking for? How do I find that information?"
- Tech-support person #6 replies, "Global group used by your team to access servers."
- I reply, "I have a 'global group to access servers'? I log in using my XYZ domain account. I don't know what group you're asking for. Is the global group one of the access groups listed in service request system F?"
- Tech-support person #6 replies, "Yes, global group is a access group."
I don't know if this is progress or not.
Further complicating matters... the process improvements Jeff and I have suggested appear to have helped us meet this month's release. The group's communication is improving, the code reviews caught some subtle problems, and it seems like meeting the deadline may not only be because of the handful of items they dropped off the release plan. And, Jeff and I are spitballing some ideas for a unit testing framework so the developers can test their modifications without manually running SQL statements to verify the result of their changes.
It seems like we could be successful. If only we're not ground up by the monthly release chaos.
But, is it worth it? I'd hate to run away after only 2 months, but it gets very discouraging. There are at least two friends who have jobs to fill, one unknown but potentially awesome and another known and good-to-awesome.
I don't feel like I'm growing technically. But, I'm definitely growing in terms of dealing with projects, customers, and other developers onshore-and-offshore. Maybe the pain is worth it. On the plus side, Jeff and I have the complete support of the business-side to make the changes we're planning. They're surprisingly enthusiastic.