Keys to success … or rather … successful keys
A discussion in the OpenQM forum got me thinking about keys. I’d like to share some quick and mostly unedited thoughts with you on the topic. I always wonder if I’ll regret an off-the-cuff posting like this…
Over the years I’ve learned the “prime directive”, that the ID identifies a record as a unique database entity, that’s all. IDs should not contain meaningful application data. While this directive has occasional exceptions, I think most MV developers regard the exception as the rule, and because this frequently leads to problems I think personal practices need to be more thoroughly considered.
The following are examples of common scenarios. I’m sure many of my colleagues would argue with each example. But there are no absolutes, and development based on an assumption of absolutes will absolutely lead to problems. I’m just trying to demonstrate how very common key mechanisms can be problematic, and I’m encouraging you to think more about these mechanisms before building an application around them.
A customer or vendor number should generally be a unique numeric sequence. If it includes chars that visually relate to the client’s name, phone number, location, or anything else, the key will need to change if and when any of that data changes – or when two entities happen to share the same key qualifiers. USA Social Security Number and Driver’s License ID can also change despite insistence to the contrary in initial development.
ID’s should not be limited to N digits or based on a hardcoded leading digit.
Example: NEW.ID=”100000″+(NEXT.ID)
I’ve had to do a major app update to accommodate more orders, invoices, customers, etc. A key like that embeds “meaningful” significance to the ID, as in “the ID is defined as the digit 1 followed by a number”. That violates the prime directive.
Same goes for fixed-length fields beginning with a zero: 000002134. Most of us have encountered the issue where some data operation may coerce these keys into a number which then loses the significant digits. Actually, this is why people prepend a digit like “1” as seen above. If the other points above are considered and the key is not defined as being “meaningful” then this scenario can be avoided.
A purely sequential ID isn’t ideal either as it doesn’t allow for efficient hashing. A key creation algorithm should allow for “random” first characters with a table or other mechanism to avoid duplication.
The ID you present to end-users and trading partners doesn’t need to be the actual key for the file. While most of us do this as a matter of course, an invoice number like 0283461934 might be stored in an invoice record like 835273737395. Again, the ID used on documents is meaningful to the data consumer but this data may not be the best choice in the context of a database file key – these are two different concepts.
User-entered data should not be in the key. For example, a customer purchase order number might be mis-keyed or even non-unique amongst a number of trading partners. To combat this many developers use CUST*PO as a key but this may need to be corrected if you get a customer who manually generates their own PO numbers or they do a lot of “verbal” POs. If you follow the prime directive there aren’t any gotcha’s that need to be addressed with later (costly) code updates.
Finally, when it comes to ID=ThisIsMyItem (as seen in the OpenQM discussion), that’s very wrong unless the file itself is an index intended to refer to another unique record or to identify duplicate records. Even then, as we see (in the forum discussion), long keys are problematic, and if you don’t base the key on the application data you don’t have to worry about the possibility that data will mess up your key system. This could also be used for optimistic locking checks. But in that case, it’s probably better to keep a lock table with unique keys and a single attribute with the MD5, then index that file. But don’t use the MD5 as a unique key because there is a chance that it may not be unique depending on the application data used to generate the value. I’m honestly not 100% sure that’s accurate, but “randomly” generated keys always subject us to a potential conflict where generated keys should not. If you simply don’t use the MD5 as a key, you don’t need to worry about key conflicts. You may still need to consider code to resolve a case where the index returns more than one item for a search, but at least that won’t require a revision of the key system.
Well, there ya have it. That’s not scientific but I hope some of these hard-learned lessons will be of use to someone – or at least fodder for a good argument. 🙂