Reprinted with Permission by Quest Software Nov.  2001

Data Modeling: Common Mistakes and Their Impact

 By Bert Scalzo, PhD

“Dew knot trussed yore spell chequer two fined awl yore mistakes.” – Brendan Hills

Although data modeling has been around for nearly 25 years, it ranks among the top areas from whence database application problems arise. Moreover, the severity of the problems ranges from totally incorrect functionality to freakishly miserable performance. How can such an established technique yield such terrible results? The answer is quite unnerving.

Today’s data modeling tools are amazingly good. Unfortunately, the same is quite often not true about the people using them. This may sound harsh at first, but would you let a person draw the blueprints for your new home if all they had was one semester of formal architectural training? How about if they had no training, but had been a contractor who built homes for the past five years? What if they had attended a three-day architecture class and had used that training on occasion? Nervous yet – it’s only your home.

The point is that even though data modeling has been around for some time and is now being used extensively, many practitioners have had limited training or mentoring in the formal techniques. Think this is an unfair assessment? Then the next time a data modeler hands you a supposedly third normal form ERD, simply ask them to define first, second and third normal form or why that’s important to your business requirements? Still not convinced? Then why aren’t we all great novelists? We all speak the language and have Microsoft Word on our PC’s. Beginning to see the analogy?

Just remember that using data modeling tools does not automatically guarantee success. Yes, today’s ER tools support good relational design. Yes, today’s ER tools have model checking utilities. And yes, today’s ER tools generate fairly good DDL. But give me any language compiler and I can write an infinite loop. The same is true with data modeling. The “garbage in, garbage out” principle applies – you just end up with a pretty picture of the garbage as well. And garbage begets garbage. So the programs cannot compensate for the bad database design – in fact, they usually make a bad thing even worse.

Each issue, we’ll look at some of the more common modeling mistakes and their impacts. This month we’ll start with my favorite: primary, unique and foreign keys (just a little for now). Then in the next issue we’ll dig even further into the complexities of foreign keys.

1.      Implement all candidate keys

Look at the three tables modeled in Figure 1 below. PERSON_1 represents what we’re told as the business requirements. Notably, that a person can be uniquely identified by either their SSN or the concatenation of their First Name, Last Name, Gender and Birth Date. Thus PERSON_1 has two candidate keys: AK1 and AK2, respectively.

Figure 1

PERSON_2 represents the logical choice of choosing the shortest candidate key as the primary key, with all remaining candidate keys as alternate keys. While PERSON_3 represents adding a meaningless surrogate key as the primary key, with each candidate key implemented as an alternate key.

The problem is that many modelers or DBA’s will choose not to implement the candidate keys to minimize the number of indexes thereby saving space and minimizing overhead for inserts, updates and deletes. But this tradeoff also negates the business requirements!

PERSON_2 without the AK permits people with duplicate concatenations of First Name, Last Name, Gender and Birth Date. And PERSON_3 without both AK1 and AK2 is even worse as it permits people with duplicates of SSN. Less space and faster performance are truly moot points if the data is incorrect! Always implement all candidate keys.

2.      Foreign keys can also be candidate keys

In effect, this is a restatement of the above that stresses that being a foreign key does not negate the ability to be either a primary or alternate key. It may seem superfluous to state this, but I’ve seen this problem more than most. Let’s look at the concept of marriage as shown in Figure 2 below.

Figure 2

MARRIAGE_1 demonstrates how two identifying relationships from PERSON_2 with the marriage start date form the primary key. But note the two additional alternate keys. What’s up with that? Well the business rule might be that a particular person can only get married once per day. So SSN_1 and SSN_2 are parts of the primary, unique and foreign keys – and this is legal. But most ER tools will generate suboptimal index DDL for this construct – five indexes for a total of 75 bytes.

MARRIAGE_2 is an attempt to correct this index design flaw in the model by adding a meaningless surrogate key as the primary key, with each candidate key implemented as an alternate key. This too will also cause most ER tools to generate suboptimal index DDL – five indexes for a total of 59 bytes (assuming that the surrogate key consumes 9 bytes) – saving 16 bytes or 21%.

As before, the problem is that many modelers or DBA’s will choose not to implement the candidate keys to minimize the number of indexes thereby saving space and minimizing overhead for inserts, updates and deletes. Once again this tradeoff negates the business requirements! As before, always implement all candidate keys.

3.      Separate the concept of keys from indexes

In both the above issues, the real culprit is peoples’ inability to differentiate keys from indexes. Keys enforce business rules – it’s a logical concept. Indexes speed up database access – it’s purely a physical concept. Yes primary and unique keys require indexes in order to affect the uniqueness, but foreign keys do not require indexes – though DBA’s very often index them for performance and locking reasons.

Let’s go back to Figure 2. Assume we model it as shown in MARRIAGE_2 and pass it on to a DBA who’s quite comfortable with both data modeling and index design. We can implement merely two indexes in just 41 bytes that support all the defined keys – saving 34 bytes or 45%. Thus we get correct business rules, fast access and minimum space.

Index 1 would be SSN_1 + Start Date + SSN_2. Index 2 would be SSN_2 + Start Date. By using the DBA’s knowledge of the databases leading column index capabilities, we can keep separate the concept of five keys over nine columns and the two indexes over five columns that support their needs.

Likewise, that DBA could also optimize MARRIAGE_2 to use merely three indexes in just 34 bytes  – saving 41 bytes or 55%. Clearly, keys do no equal indexes.

4.      Keys do one job and business rules do all the others

Look again back at Figure 2. Let’s assume we’re using MARRIAGE_1 as our solution. Can the primary key of SSN_1, SSN_2 and Start Date really sufficiently differentiate one marriage from another? For the sake of argument, let’s say yes. But does this primary key guarantee valid or legal instances? The answer is a resounding no. It is entirely possible to have unique but illegal instances of data if all you have are the keys.

Let’s assume our business analyst tells us that for this application, the following marriage rules are true (note: these rules are for demonstration purpose only, and do no reflect any kind of commentary or point of view regarding the concept of marriage).

So where’s that displayed in our model? The answer is nowhere.  These requirements are known as business rules, and should be documented in modeling tools so as to generate either constraints or triggers in the database. If not, then all your application developers must instinctually know all these rules and consistently program them throughout the entire application. Yeah, right – then I’ve got some great swampland to sell you too.

In Figure 3 below, we have a screen snapshot showing how to define business rules at the entity/table and attribute/column level in QDesigner – Quest Software’s state of the art data modeling tool. QDesigner has complete trigger editing and template facilities as well. The DDL below was entirely generated from QDesigner and meets all the above business rules.

create table PERSON_2  (
   SSN                  CHAR(9)                          not null,
   FIRSTNAME            VARCHAR(20)                      not null,
   LASTNAME             VARCHAR(30)                      not null,
   GENDER               CHAR(1)                          not null,
   BIRTHDATE            DATE                             not null,
   STREET               VARCHAR(40),
   CITY                 VARCHAR(30),
   STATE                CHAR(2),
   ZIP                  CHAR(5),
   constraint PK_PERSON_2 primary key (SSN),
   constraint AK_AK_PERSON_2 unique (FIRSTNAME, LASTNAME, GENDER, BIRTHDATE)
)
/
create table MARRIAGE_1  (
   SSN_1                CHAR(9)                          not null
         constraint CKC_SSN_1_MARRIAGE check (SSN_1 between '000000001' and '999999999'),
   SSN_2                CHAR(9)                          not null
         constraint CKC_SSN_2_MARRIAGE check (SSN_2 between '000000001' and '999999999'),
   STARTDATE            DATE                             not null,
   ENDDATE              DATE,
   RELIGION             VARCHAR(20)                      
         constraint CKC_RELIGION_MARRIAGE check (RELIGION is null or ( RELIGION in ('PROTESTANT','CATHOLIC','BAPTIST','METHODIST','HINDU','BUDDHIST','ISLAM') )),
   constraint PK_MARRIAGE_1 primary key (SSN_1, SSN_2, STARTDATE),
   constraint AK_AK1_MARRIAGE unique (SSN_1, STARTDATE),
   constraint AK_AK2_MARRIAGE unique (SSN_2, STARTDATE),
   constraint FK_SSN_1 foreign key (SSN_1)
         references PERSON_2 (SSN),
   constraint FK_SSN_2 foreign key (SSN_2)
         references PERSON_2 (SSN),
   constraint CKT_MARRIAGE_1 check ((SSN_1 <> SSN_2) and (StartDate >= EndDate))
)
/
create trigger CHECK_PEOPLE
before insert
on MARRIAGE_1
for each row
declare
  -- Declare User Defined Exceptions
  too_young     exception;
  pragma exception_init(too_young,-20001);
  is_married    exception;
  pragma exception_init(is_married,-20002);
  no_address    exception;
  pragma exception_init(no_address,-20003);
  v_too_young   integer;
  v_is_married  integer;
  v_no_address  integer;
begin
  -- See if either person is under 18 years old
  select  count(*)
    into  v_too_young
    from  person_2
    where ssn in (:new.ssn_1, :new.ssn_2)
      and birthdate < sysdate-(365*18);
  if (v_is_married > 0) then
    raise is_married;
  end if;
  -- See if either person is currently married
  select  count(*)
    into  v_is_married
    from  marriage_1
    where (ssn_1 in (:new.ssn_1, :new.ssn_2)
           or
           ssn_2 in (:new.ssn_1, :new.ssn_2))
      and (enddate is null
           or
           enddate > sysdate);
  if (v_is_married > 0) then
    raise is_married;
  end if;
  -- Verify that at least one address is known
  select  count(*)
    into  v_no_address
    from  person_2
    where ssn in (:new.ssn_1, :new.ssn_2)
      and street is not null
      and city is not null
      and state is not null
      and zip is not null;
  if (v_no_address = 0) then
    raise no_address;
  end if;
exception
  when too_young   then
    raise_application_error(-20001, 'Illegal Marriage: both parties must be >= age 18!');
  when is_married  then
    raise_application_error(-20002, 'Illegal Marriage: both parties must be single!!!');
  when no_address  then
    raise_application_error(-20003, 'Illegal marriage: at least one address required!');
end;
/