Tuesday, June 21, 2005

Natural vs Synthetic keys

Primary keys should be:
1. Unique
2. Never changed (or very infrequently)

This is because they are used in other tables to reference a single row.

(Definitions from www.orafaq.com)
Natural Key: A key made from existing attributes.
Surrogate Key: A system generated key with no business value. Usually implemented with database generated sequences.

This topic has been discussed at great length many times. We recently re-hashed it on the Dizwell Forum.


EDIT: Please note: There are some potentially important corrections and clarifications to the below. Please see Howard's comments and follow the links to his blog or the forum discussion.

Natural keys:
- May require the concatentation of many columns, so it gets very long
- Sometimes a natural key can be hard to find (in example: names can change)
- In a sense, you are duplicating data
- Can lead to future conflicts when the database expands beyond original requirements
- Care must be taken to avoid block-splitting
- Care must be taken to avoid index range scans
- Tend to change, albeit infrequently

"You nearly always end up having to append a number to guarantee both uniqueness and existence. If I always have to append a guaranteed unique existing number to ensure I meet my requirements, why not make it the key itself."
- Niall Litchfield

Synthetic/Surrogate keys:
- Meaningless by itself (ie. no relationship to the data to which it is assigned)
- Implementation is database-dependent
- Care must be taken to avoid contention (monotically increasing and smaller indexes)
- Similar rows would not have similar keys

Gauss suggested considering using a GUID for the surrogate key. No trigger is required, uniqueness is guaranteed over space and time, but it is 16 bytes.

2 a_str VARCHAR2(32));

Table created.

SQL> INSERT INTO atable (a_str) VALUES ('One');

1 row created.

SQL> INSERT INTO atable (a_str) VALUES ('Two');

1 row created.

SQL> SELECT * FROM atable;

-------------------------------- --------------------------------
5C3BCF77D55B41E78DE4016DFBE25FFA One
D3B959F3010745D3854F8FC2B09A18F3 Two

Howard Rogers responded to this discussion on his blog, mostly to assert that there is legitimacy to the "natural keys" opinion (some people felt there wasn't).

The item "Care must be taken to avoid index range scans" is wrong. The point is that if you use a synthetic key, you will tend to have a 'right-hand index', for which the "usual" cure is to rebuild the index as a reverse-key index.

But if you then issue a query against the contents of that index (unlikely, if the synthetic key is a meaningless sequence number, but it has been done), the reversing of the index means that what ought to have been a quick range scan turns into a full scan on the index... though the optimiser at that point likely gives up and just scans the table instead.

A range scan on a natural selection of data is *not* a problem at all, of course.

So, you need to remove that item from your 'disadvantages with natural keys' selection, and add it to the disadvantages of synthetic keys. (A section which, unaccountably, seems to be missing from your summation!)

Similarly, the block-splitting item is not a legitimate entry. The issue there was that a reversed synthetic index will block split, whereas a non-reversed synthetic index will not. It is an issue with synthetic keys that you "must carefully decide whether to reverse them or not", because block splitting is either a non-event or an issue depending on your decision.

It is never an issue with natural keys, because they inevitably and intrinsically block-split anyway, assuming only that data arrives at your table in a fairly random order. There is no "will it/won't it" dilemma, therefore, with natural keys. The answer is "It will". Sure, you then have to set a PCTFREE to deal with that. But a natural key-er knows that. The concern is that a synthetic key-er might not know, or realise that it's important to care, that the decision to reverse or not has major implications for the setting of PCTFREE.

I also disagree with your "You are duplicating data" statement. That is again a disadvantage of synthetic keys, not natural ones. Synthetic keys require the introduction of an ID column into the table which wasn't "naturally" there. Extra data in the table for a start. Then there's an index on that column because it's the primary key. Extra data storage requirement Number 2. And *then* you probably have to put a unique constraint on the natural columns, because otherwise there's no protection against duplication. So now you have a new index containing large amounts of data. That's two indexes to both provide a primary key and protect against duplication of natural data.

In a natural key setting, you don't have a spurious ID column for starters. There's no index on that extra column, therefore. And the natural primary key intrinsically indexes the natural columns and protects them from data duplication. There is one table, one index.

So which scenario requires more storage?

I'm not seeking to re-open the debate in the pages of your blog, just trying to point out areas of the Forum debate which I think have been misunderstood, or mischaracterised.
Thanks for the corrections, Howard, Your clarifications are very welcome.

Rather than try to explain them myself, I will add a note to the blog pointing people to your comments, and your related blog (there is already a pointer to the forum discussion).

There is already a summation of the disadvantages of synthetic/surrogate keys immediately following Niall's quote. You probably got too hot to read that far. :)

This is what I meant by duplication of data:

Last_name: Gretzky
First_name: Wayne
DOB: 26-Jan-61
ID: Gretzky_Wayne_260161

The ID just duplicated 3 columns worth of data.
I really opened the pickle jar this time!

David Aldridge has a contrasting viewpoint to Howard's. Here is his blog article:

I think that the topics of the size of the required index and the rate of block splits are related, are they not? Whether the index is built on a natural key or on a synthetic key (RKI there for the purpose of this, of course) you increase the chances of block splits by increasing the amount of free space per index block, either by reducing the size of the indexed values or by spreading them over more blocks (hence the index itself is larger).

Or have I been thinking about this too hard?
Here is Pete's take on the topic, and look in the comments for a link to Tom Kyte's.

One disadvantage of synthetic keys I don't see mentioned here: In many cases they require an extra join. Suppose, for example, I have a table of U.S. states, where a natural key would be two-letter state abbreviation. I have another table of tax rates with a foreign key that ties it back to the state. If I want a query that gives information about each tax rate including the state it applies to, but I don't care about any other information from the state table, with a natural key I would just display the key because it is already a human-comprehensible identification of the state. With a synthetic key I would have to join on the state table to find the state abbreviation. For production queries this can be a major performance hit. For ad hoc queries it can be a big hassle to have to construct more complex queries and make sure they're correct.

Another advantage of natural keys is that they make it much easier to resolve database corruption problems. Suppose in the above example I used a synthetic key, and we find that due to database corruption state records 17, 19, and 24 have been lost. When we try to repair the problem, we can't just add back in those three records in any order. We have to study the records with the state-id posted as a foreign key to figure out which state was #17, which was #19, and which was #24, so that we can keep everything pointed to the correct record. It may not be at all obvious looking at the tax rate table what state any given record should apply to. But if we use a natural key, this would be zero problem.

And I must object to the unfairness of your "Gretzky_Wayne_260161" example. If this was a political debate I think I'd be screaming "propaganda tactics". If the question is, "Do database designers who prefer natural keys sometimes do things that are totally stupid and pointless in a vain attempt to avoid having to create a synthetic key in situations when there is no natural key available?" then okay, this is an example that proves that the answer is "yes". But the fact that some people attempt to apply a principle in an incompetent way doesn't mean the principle is flawed. I think every introductory database design book ever written points out that a name is not a suitable primary key because we cannot be sure it is unique: we might well have many people named "Jim Smith" in our database. And about 90% of them point out that concatenating multiple fields in order to create some combination that is unlikely to be duplicated is inadequate, because "unlikely" isn't good enough in database design. Even if we were talking about a set of fields that would make a suitable key, duplicating data by concatenating several fields into one is just ... why would you want to do this? If your answer is, "So I have a single key field rather than having to mess with multiple fields", then I think you failed the class on data normalization. I tend to prefer natural keys, but I readily recognize that there are cases where there is no natural key, or where the only available natural key is impractical to use (like, it requires six fields totalling 500 bytes) and so the only practical choice is a synthetic key.
Isn't the two-letter abbreviation for a state a synthetic key?

Isn't it, too, a quasi-duplication of data since it is based on the state's name (which would be a natural key)?
"Isn't the two-letter abbreviation for a state a synthetic key?"

No, not in this context. A natural key isn't necessarily all that natural: it's a value in a domain defined and controlled outside the boundaries of the system we are designing.
This comment has been removed by a blog administrator.
Nice article! I also liked the one right here on natural keys:

Natural versus surrogate keys
If you have any problems or just want to deal then only text loans urgent tax work can really help in such situations, is to give from 100 to 1000 pounds of funding to clarify their needs quick funds. To help find some quick money, you can get free on these loans to borrow money urgently to 100-1000 pounds for the repayment tenure of a couple of days, which can range from 7 days to 30 days. Mobile Text loans are a division of payday loans that is provided by you make less effort.
Post a Comment

<< Home

This page is powered by Blogger. Isn't yours?