Diagnosing Connectivity Errors
When a UniObjects.NET operation fails, how do you know what happened? What about mv.NET? Heck, it doesn’t matter what tool you use. Sometimes things just stop working. Here are some tips that may help in the detection and recovery process for any MV DBMS or connectivity tools…
In the U2 forum, the question was asked: “Is there any way for UniObjects to know if the database is paused?” I’ll focus my comments on UO.NET, but the answers really apply to any scenario where a client using any connectivity library is unable to access the server.
Suggestion 1: Use the exception to help determine what happened
UO.NET returns different errors depending on whether the DBMS is unavailable or whether the uvcs/udcs service is unavailable. The errors are also different for connected and unconnected sessions. (I’m not sure if this applies to UO .NOT, errrr, UO COM.) Other tools may behave similarly.
When an unconnected client attempts to connect to a stopped DBMS, it gets socket exception 10061 (WSAECONNREFUSED), stating that the server refused the connection. When it attempts to connect to a dead socket, it gets socket exception 10060 (WSAETIMEDOUT) instead. This is a timeout error.
When a session that is already connected attempts an operation on its UniSession object, in most cases one of the IBMU2.UODOTNET.UniZZZ exceptions will be thrown. Which one depends on exactly what the code is doing. One library might assume that it has a valid account and file, so an error on Read just throws an unhandled exception. Other libraries will provide a unique exception type, or at least a unique message, for every type of exception that can be reasonably expected.
Using that information, you can Catch an exception, check the Source, NativeErrorCode, and Message, and that should give you some idea of where the problem is.
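As a rough illustration of that triage, here is a sketch in Python rather than real UO.NET code. The two Winsock codes are the ones mentioned above; the function name and the wording of the diagnoses are invented for the example.

```python
# Winsock error codes discussed above.
WSAETIMEDOUT = 10060     # connection attempt timed out
WSAECONNREFUSED = 10061  # server actively refused the connection


def classify_connect_error(native_code):
    """Map a native socket error code to a rough diagnosis (hypothetical labels)."""
    if native_code == WSAECONNREFUSED:
        # The host answered, but nothing accepted the connection.
        return "connection refused: DBMS or its RPC service is probably stopped"
    if native_code == WSAETIMEDOUT:
        # Nothing answered at all before the timeout expired.
        return "timed out: dead socket, host down, or network blocking traffic"
    return "unrecognized code: check Source and Message for more detail"
```

In a real client the code would come from whatever the library exposes (NativeErrorCode in the case above), and the result would feed the log entry or the message shown to the user.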
But how do you know if the DBMS is unavailable because it crashed, or if it was intentionally taken down?
A digression for Exception handling…
Error handling should be robust enough to catch failures and then get the connection back in sync. So for example:
if ( Connected() )
{
    // do something
}
The Connected() method should attempt to connect if the connection fails and the result of the method should always be true unless things really have fallen apart. Ideally, the Connected() method should be smart enough to recognize a case, for example, when you logged into an account and opened some required files, and then suddenly the connection was dropped. The method should reconnect, open the files, and return True, so that the “do something” code has the proper context available to do what’s required.
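Here is a minimal sketch of that idea, in Python for brevity. Everything in it is hypothetical: the connect callable, the is_alive() and open() methods, and the retry count stand in for whatever your connectivity library actually provides.

```python
class ManagedSession:
    """Keeps a session and its required files in sync (illustrative only)."""

    def __init__(self, connect, required_files):
        self._connect = connect                  # hypothetical: returns a new session
        self._required_files = list(required_files)
        self.session = None
        self.files = {}

    def connected(self, retries=2):
        """Return True once a usable session with all required files exists."""
        for _ in range(retries + 1):
            try:
                if self.session is None or not self.session.is_alive():
                    # Rebuild the whole context, not just the socket.
                    self.session = self._connect()
                    self.files = {name: self.session.open(name)
                                  for name in self._required_files}
                return True
            except Exception as exc:
                # Log for the sysadmin, then retry with a clean slate.
                print(f"connection problem: {exc}")
                self.session = None
                self.files = {}
        # Things really have fallen apart.
        return False
```

The caller only ever writes "if manager.connected(): do something", and the file handles it needs are waiting in manager.files whether or not a reconnect happened behind the scenes.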
Some might say “if there is an error, stop and report it”. What good does that do for the end-user entering a transaction? Sure, it would be a good idea for Connected() to log an error for the sysadmin as a part of its error handling, but unless the error is really critical, I’d say handle it internally and keep moving.
Here’s another tip regarding exception handling: Try/Catch everything! Let’s say you have a file operation nested within a Try/Catch where you open a file and just read a single record. You open the file, and right before you read the record the network drops, so the Read fails. You catch the error, and let’s say you don’t want to re-attempt the read, you just want to get out of there. So you have a Finally block where you execute some housekeeping like file.Close(). Well, that’s good coding if no errors occurred, but the network is down and the Close() method is going to throw its own exception. So yes, even within Catch and Finally blocks you should be nesting Try/Catch blocks.
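The pattern looks something like this. It is a Python sketch, not UO.NET code, and the open_file callable and the read()/close() methods are placeholders for your library's own file object.

```python
def read_one_record(open_file, record_id, log=print):
    """Read a single record; never let cleanup raise a second exception."""
    handle = None
    try:
        handle = open_file()
        return handle.read(record_id)
    except Exception as exc:
        # No retry here: note the failure and get out.
        log(f"read failed: {exc}")
        return None
    finally:
        if handle is not None:
            try:
                handle.close()
            except Exception as exc:
                # With the network down, close() is expected to fail too.
                log(f"close failed (ignored): {exc}")
```

Without the inner try, the exception from close() would replace the tidy "return None" and bubble up to the caller, which is exactly what the outer handler was trying to prevent.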
OK, so let’s say you’ve encountered an error, either the DBMS or the network is down. How does the client code figure out what’s up?
Suggestion 2: Check status with a different service
Yup, if you’re not sure if the connection you have is valid, it probably won’t be productive to connect back to the same server to ask it why another connection broke.
Create a web service or other socket server and run it as a Windows service. When you get a dead connection, poll that server as part of your error handling to see what’s up. If you can’t connect to that process either then chances are good that the server has gone down or that the network has suddenly started to block connections.
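On the client side, the poll can be very small. This Python sketch assumes a hypothetical status service that accepts a plain TCP connection, sends one line of text, and hangs up; the host, port, and protocol are all made up for illustration.

```python
import socket


def poll_status(host, port, timeout=3.0):
    """Ask the side-channel status service what is going on.

    Returns the service's one-line message, or None if it cannot be reached.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            data = conn.recv(1024)
            return data.decode("utf-8", errors="replace").strip()
    except OSError:
        # Covers refused connections and timeouts alike: the status
        # service is unreachable, so the server or network is likely down.
        return None
```

A None result is itself useful information: if neither the DBMS nor the status service answers, the problem is probably bigger than a paused database.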
But assuming you can connect to that service, how does it know “what’s up?”
Let’s assume you’re pausing the DBMS for a snapshot (for products that support that feature). Right before you pause the DBMS, set the text in some status.txt file like “DBMS is Paused” or “DBMS is down for maintenance”, then clear that file after you restart the DBMS. You can set and clear that file by executing a simple BAT macro, or you can write something more sophisticated.
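The service side of this can be just as simple. The sketch below, in Python, reads the status file and sends its contents as one line to whoever connects. The file location, the default message, and the port number are assumptions for the example, and a production version would want logging and would need to be wrapped to run as a Windows service.

```python
import socketserver
from pathlib import Path

STATUS_FILE = Path("status.txt")   # assumed location of the status file


def current_status(path=STATUS_FILE):
    """Return the maintenance message, or a default when none is set."""
    try:
        text = path.read_text(encoding="utf-8").strip()
    except FileNotFoundError:
        text = ""
    return text or "No maintenance message set"


class StatusHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # Send one line and hang up; the client only needs the message.
        self.request.sendall((current_status() + "\n").encode("utf-8"))


def serve(port=8765):
    """Run the status service until interrupted (port number is arbitrary)."""
    with socketserver.TCPServer(("0.0.0.0", port), StatusHandler) as server:
        server.serve_forever()
```

Because the file is read on every request, nothing has to be restarted when the message changes: writing or erasing status.txt is enough.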
Some of you use BAT files to take down your services. Here is an example of such a file and how it would update the status file:
ECHO DBMS is paused > status.txt
net stop "UniVerse Telnet Service"
net stop "UniVerse Resource Service"
net stop "UniData Telnet Service 7.2"
net stop "UniData Database Service 7.2"
net stop "Uni RPC Service"
And to restart services, the Uni RPC service (stopped last) is started first, followed by the others:
net start "Uni RPC Service"
net start "UniVerse Telnet Service"
net start "UniVerse Resource Service"
net start "UniData Telnet Service 7.2"
net start "UniData Database Service 7.2"
erase status.txt
I’d think the above solution would only take about 30 minutes to code and test.
If you think about it, the server status components don’t even need to be set on the same server. Rather than loading text to a local file, the process that sets the message might actually call a web service on a remote server, and clients that have difficulties can poll that remote server. This way you can even reboot the OS and your clients will still be able to get some indication of why their DBMS transactions aren’t going through.
Indeed, the above suggestions may not be the best way to detect that a DBMS has been paused, or to detect other platform-specific conditions, but they should help. Someone might trump all of this and say there’s a flag that gets set when the DBMS is paused or shut down. OK, when the DBMS is paused, run a proggie that reads that flag and notifies a status server. Remember that it doesn’t matter what information is available if the client can’t get at it, and we can’t always assume that the code accessing the DBMS is on the same server or even in the same subnet as the DBMS. So some setting on the DBMS server does nothing unless someone can get at that information … and that’s why I proposed the above solution.
Summary
As you see, the process of handling errors can take on a little life of its own. Hey, it’s important code and it shouldn’t be ignored or left for a day when you don’t have anything better to do … which is how a lot of error handling code is managed.
My personal approach is never to let unhandled exceptions bubble up to the client. Handle everything, either by refreshing the application state, or by messaging the admin and maybe the end-user. But end-user transactions should never be interrupted unless it’s really necessary, and end-users should never be left hanging with ugly messages or without any indication that for some reason they can’t continue. The only thing that tempers all of this is budget constraints on time and funding.
There are no absolutes. There are any number of ways to solve the problems discussed here and I’m not attempting to propose the “best” practices, only some ideas for people who aren’t used to this sort of coding. Some approaches to error handling are more elegant than others. Please feel free to suggest alternatives.