Improving your error handling
I started thinking about the sorts of error messages that I report to vendors on an almost daily basis. It reminded me of when I used to play Dungeons and Dragons – a thief character to be exact.
If you knock on a door and no one inside answers, does that mean no one is home, that someone inside didn’t hear you, or that someone inside is ignoring you? We simply don’t know, so we need another method besides knocking on a door to get more information. In DnD we’d just roll the dice a few times and the Game Master would usually tell us what we needed to know based on our skill level.
I find in I.T. we have similar issues where we’re given tools to run some software, connect to a server, or perform some other function, and either nothing happens or something unusual happens. What does that mean? The thing that amazes me is how often I need to petition and debate with engineers to get the information I need in order to continue.
If no one answers a door, do we knock louder? Walk away? Wait a while longer? For software, do we continue/retry/quit? We’re given the option but rarely are we given enough information to tell us the probable outcome of any one of the three choices. I loathe "do you wish to continue?" options because I have no idea of the ramifications of continuing after most errors have occurred. If the error was benign enough that there are no ramifications, then why does the software bother to give me an error? Does "connection refused" mean the server is down, that no process is listening on a socket, or that we haven’t properly authenticated or authorized? In some cases a terse response is desirable – often not.
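As a rough sketch of what a less terse answer could look like, here is a small Python example that maps the low-level exception from a TCP connection attempt to a message saying what most likely happened. The function names `explain_connect_error` and `try_connect` are made up for this illustration, and the wording of the messages is only a suggestion. Note that authentication and authorization failures normally show up at a higher protocol layer rather than as a TCP-level refusal, so they are not covered here.

```python
import socket


def explain_connect_error(exc: OSError, host: str, port: int) -> str:
    """Turn a connection exception into a message a human can act on."""
    if isinstance(exc, socket.gaierror):
        # Name lookup failed: we never even reached a machine.
        return (f"Could not resolve host name {host!r} ({exc}). "
                "Check the spelling of the host and your DNS settings.")
    if isinstance(exc, ConnectionRefusedError):
        # The machine answered, but nothing accepted the connection.
        return (f"{host}:{port} refused the connection. The host is reachable, "
                "but no process is listening on that port (or a firewall rejected it).")
    if isinstance(exc, socket.timeout):
        # No answer at all within the timeout.
        return (f"No response from {host}:{port} before the timeout. The host may be "
                "down, unreachable, or silently dropping packets.")
    # Anything else: at least pass along the operating system's own detail.
    return f"Could not connect to {host}:{port}: {exc!r}"


def try_connect(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connection and describe the outcome in plain language."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return f"Connected to {host}:{port}; a process is listening."
    except OSError as exc:
        return explain_connect_error(exc, host, port)
```

A caller would print or log the result of `try_connect("some-host", 5432)` instead of a bare "connection refused", so the person reading it knows whether to check the host name, the service, or the network.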
If some data in a transmission seems inconsistent enough to be refused by one end of a connectivity pipe, what’s wrong with it? Show me some bad bytes. Give me a checksum. Toss me a bone or give me a clue so that I don’t need to spend hours re-trying in futility to diagnose an issue with nothing but cryptic text. There are many times that we are asked by vendors "did you do this, that, and the other thing" (often implying "RTFM", and sometimes one gets the feeling that each question is followed with "…you idiot…"). So for those of us who actually do attempt to vet issues, it helps to have all of the information possible to either diagnose the problem ourselves or to send to the vendor for quicker resolution.
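To make "show me some bad bytes, give me a checksum" concrete, here is a minimal Python sketch. It assumes a made-up scenario in which each incoming frame arrives with a CRC32 value; `FrameRejected` and `check_frame` are hypothetical names, and a real protocol may well use a different checksum. The point is only that the rejection message carries the expected and actual checksum, the length, and the first few bytes.

```python
import zlib


class FrameRejected(ValueError):
    """Raised when an incoming frame fails validation."""


def check_frame(payload: bytes, expected_crc: int) -> None:
    """Validate a frame and, on failure, say exactly what was wrong with it."""
    actual_crc = zlib.crc32(payload)
    if actual_crc != expected_crc:
        # Show the first 16 bytes in hex so the reader has something to compare.
        preview = payload[:16].hex(" ")
        raise FrameRejected(
            f"Frame rejected: CRC32 mismatch "
            f"(expected {expected_crc:#010x}, got {actual_crc:#010x}); "
            f"length={len(payload)} bytes; first bytes: {preview}"
        )
```

With a message like that, whoever receives the report can tell a truncated frame from a corrupted one without asking the user to "do it again".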
When you folks are writing your error handling code, ask yourself if the error message you’re giving the user, or the developer building upon your tools, provides enough information to diagnose the underlying problem. "The spooler is in a truly bizarre state", "DEAD BEEF", "An unambiguous problem has occurred". These messages would have been good targets for a follow-up to minimize head scratching time spent by Support and software users. Oh yes, and I got a Windows installation error the other day. I’d tell you what the error message was but there was none. It was a dialog box with a standard exclamation mark in a yellow triangle symbolizing "Warning" – but there was absolutely no text. I relied on my uncanny ability to infer complex information from minimal data to conclude that this meant something was wrong with the installation.
If you’re not giving people enough info right there when you’re writing code, then at least add a comment for yourself or someone else to come back and enhance the error handling later. Do something that turns an error message into something more constructive toward resolving problems. One sort of enhancement would be to log details of errors, encrypted if you wish, and maybe give users the option to transmit errors to a support provider (yourself?). The more you know about common errors, the more of them you can fix. You can get permission from your users to transmit internal errors that they never see – even from your character apps. Please wait as I put on my salesman shoes … OK, now: I can help with that if you wish. Just pass error messages and local data to a subroutine and we can call a web service that will store the info on a website for you or email it to you. Your customers may appreciate getting a patch for an issue they had which they didn’t need to report.
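Here is one way the "pass error messages and local data to a subroutine" idea might look, sketched in Python. Everything named here (`build_error_report`, `report_error`, the `send` callback, the `user_consented` flag) is an assumption made for illustration, not an existing service or API. The sketch always writes the details to a local log and only hands the report to a transmitter when the user has agreed. Encryption of the log is left out to keep it short.

```python
import datetime
import json
import platform
import traceback


def build_error_report(exc: BaseException, context: dict) -> dict:
    """Collect the details someone will need later to diagnose this error."""
    return {
        "when": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "error_type": type(exc).__name__,
        "message": str(exc),
        "traceback": traceback.format_exception(type(exc), exc, exc.__traceback__),
        # repr() keeps the report serializable whatever the local data contains.
        "context": {name: repr(value) for name, value in context.items()},
        "platform": platform.platform(),
    }


def report_error(exc, context, log_path, send=None, user_consented=False):
    """Append the report to a local log; transmit it only with permission."""
    report = build_error_report(exc, context)
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(report) + "\n")
    if send is not None and user_consented:
        # 'send' is a hypothetical callable, e.g. one that posts to a web service.
        send(report)
    return report
```

In an exception handler this would be called as `report_error(exc, {"customer_id": customer_id}, "errors.log")`, passing whatever local data triggered the issue.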
Another thing you can do is to provide web pages which you update based on information that you gather from prior issues. Does the period-end close get out of balance once in a while? The next time the close is out of balance, rather than just showing unequal debit and credit values, show a link to a web page that provides possible causes and remedies for the problem. Still wearing my salesman shoes, I’ll tell you that it’s possible to push a patch back to an end-user’s system without any manual intervention at all – update links on end-user systems as you get more information about things that can go wrong.
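A minimal Python sketch of the out-of-balance example follows. The error code, the `HELP_URLS` table, the `example.com` address, and the function name `close_balance_message` are all placeholders invented for this illustration. In practice the table of links would be the part you keep updating as you learn more about what goes wrong.

```python
from decimal import Decimal

# Hypothetical lookup table from error code to a help page you maintain.
HELP_URLS = {
    "CLOSE_OUT_OF_BALANCE": "https://example.com/help/close-out-of-balance",
}


def close_balance_message(debits: Decimal, credits: Decimal):
    """Return None when the close balances, otherwise a message with a help link."""
    if debits == credits:
        return None
    difference = debits - credits
    return (
        f"Period-end close is out of balance: debits {debits} vs credits {credits} "
        f"(difference {difference}). Possible causes and remedies: "
        f"{HELP_URLS['CLOSE_OUT_OF_BALANCE']}"
    )
```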
One other area where I find error management to be very light has to do with testing specific, difficult-to-resolve issues. I think we’ve all been there – you report an error and the vendor can’t reproduce it. Unfortunately that’s where a lot of vendors stop providing support: "works for me, you must be imagining things" or "I don’t doubt that you’re having problems but I can’t do anything about it." To put it gently: stuff that! I have never found a piece of software that could not be retrofitted with diagnostic code to take special action when specific errors occur. If you’re displaying "Something bad just happened", modify the code to provide more helpful information and get it on the client’s system! Why does it take months to resolve issues when the symptoms occur frequently enough to be annoying without being repeatable on demand? Most developers I deal with are looking right at the code that exhibits problems, but because they don’t see anything wrong in the code, I can’t get a fix for weeks or months at a time. Is that code literally carved in stone? Will recompiling it with diagnostics cost anything but CPU slices? Ahem – I think you get the message.
What do I do personally? My programs have lots of inactive debug code built in. If you’re having an issue I send a time-stamped encrypted key that unlocks the debug code. It generates an encrypted log that allows me to see exactly what the code is doing. Over time I have had to improve on some of that code (and update client systems immediately with the enhancements), but the diagnostics are as much a part of the code as the functional code itself. As time goes on from one release to another and specific pieces of code become "rock solid" I’ll remove some of the code (usually by just commenting it out) to avoid having the precautionary code become a performance issue.
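For readers who want to try something similar, here is a simplified Python sketch of a time-limited key that switches on dormant debug code. It is not the scheme described above: it uses an HMAC signature over an expiry timestamp rather than encryption, and the names `SECRET`, `make_debug_key`, and `debug_enabled` are invented for this example. The encrypted log is also left out.

```python
import hashlib
import hmac
import time

# Hypothetical shared secret; in real use it would not live in the source code.
SECRET = b"replace-with-a-real-secret"


def make_debug_key(expires_at: int, secret: bytes = SECRET) -> str:
    """Vendor side: issue a key that unlocks debugging until 'expires_at' (Unix time)."""
    signature = hmac.new(secret, str(expires_at).encode(), hashlib.sha256).hexdigest()
    return f"{expires_at}.{signature}"


def debug_enabled(key: str, now=None, secret: bytes = SECRET) -> bool:
    """Client side: True only if the key is genuine and has not yet expired."""
    now = time.time() if now is None else now
    try:
        expires_text, signature = key.split(".", 1)
        expires_at = int(expires_text)
    except ValueError:
        return False  # malformed key
    expected = hmac.new(secret, expires_text.encode(), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through comparison timing.
    genuine = hmac.compare_digest(signature.encode(), expected.encode())
    return genuine and now < expires_at
```

The diagnostic code then sits behind a check such as `if debug_enabled(key): ...`, so it costs next to nothing until someone actually needs it.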
Your mileage may vary of course. I don’t recommend that every piece of your code have extensive diagnostics embedded in case something should happen. But we all know the more critical areas where things can fail, and frankly it’s usually pretty obvious what segments of code are good candidates for extended error handling and what segments don’t need it. The thing I’m barking about here is the obvious, the common, the bit of code that’s likely to break even if it hasn’t done so in your normal testing. This is what Unit Testing is all about in the rest of the world – something almost never done in MV. (Yes, I do write unit tests for BASIC code, and maybe that should be an article by itself.)
If anything sticks with you from this blog entry, let it be this: Look at an error message in your code. Picture a scenario where someone calls and tells you they got that error. Will you have to ask for more information? Will you have to ask the user to "do it again" so you can see what data triggered the issue? If so then add that helpful info into the code ahead of time so that you don’t force your software users to unnecessarily endure ongoing pain.