Reprinted with Permission by RevealNet, Inc.  June  2001

Implementing DB2 Star Joins
(Extracted from the Knowledge Base for DB2)

A star schema is sometimes used in decision support applications. It consists of a fact table and multiple dimension tables as shown below. The schema in the example has three dimension tables containing information on products, time, and location. The sales fact table must have an ID column for each dimension. In our example, the column names are S.PRODUCT, S.LOCATION, and S.TIME. The product dimension table can have information on product item, class, inventory, etc. The location dimension table can have information on city, state, country, etc. The time dimension table can have information on the month, quarter, year, etc.

A star join of the fact and dimension tables can be written as:

    SELECT *
    FROM   SALES S, PRODUCT P, LOCATION L, TIME T, 
    WHERE  S.PRODUCT  = P.ID
    AND    S.LOCATION = L.ID
    AND    S.TIME     = T.ID
    AND    P.CLASS    = 'Coats'
    AND    L.CITY     = 'Omaha' 
    AND    T.YEAR     = 2000;

A distinguishing characteristic of a star join is that there are no join predicates between the dimension tables. Star join processing (V6 APAR PQ28813 and PQ36206) can be identified in a plan table. The following are partial results from a plan table. Notice that the JT=JOIN_TYPE=S for a star join with the index indicated.

   CREATE INDEX SPLTX ON SALES (PRODUCT,
                LOCATION, TIME) CLUSTER;


   QN_B  MT  TNAME    PRE  AT  MC  XNAME  JT UJOG  UJOC 
   ---------+---------+---------+---------+---------+-- 
   1  1   0  Product  S    I   0   PRODX  S  NYNN  NNNN 
   1  1   1  Location S    R   0          S  NYNN  NNNN 
   1  1   1  Time     S    R   0          S  NYNN  NNNN 
   1  1   1  Sales    S    I   3   SPLTX  S  NNNN  NNNN

It is important to create a composite index on fact table columns that corresponds to each dimension table ID. Star join processing is performed in the order of columns in the composite index. The star join processing is done at stage 1 with the composite indexed created as specified.

Star join processing uses a work file for each column of the sales fact composite index created in DSNDB07 tablespaces which is used to store qualifying column values from corresponding dimension tables. A matching index scan is done into the sales table using values in the work file. A technique of next key feed back provides for the highest column combination in the index to be used to avoid unnecessary matching index scans into the sales table. This avoids generating duplicate rows and eliminating duplicates to return to the user and improves performance dramatically. In one test case star join processing required 5.10 and 4.34 seconds of CPU and elapsed time compared to 3007.82 and 126.41 seconds of CPU and elapsed time without star join processing.

There are a number of requirements for star join processing including:

The systems administrator can disallow and influence DB2s usage of a star join using hidden DSNZPARM DSN6SPRC parameter SPMSJR with the values shown below

 
Value Description
0 Enable star join (default).
-1 Disable star join.
1 Fact table will be the largest table in star join. No Fact/Dimension ratio checking. More star join processing is exploited if ratio is 1.
1 Value used as the star join fact table and the largest dimension root table ratio. The higher the ratio, the less chance the star join will be invoked.

Order of columns in composite index: A composite index on the fact table columns that correspond to each dimension table ID should be created. The order of columns in the index is important to achieve good performance. Star join processing is performed in the order of columns in the composite index. We will describe a technique that can be used to determine that order. Assume that the cardinality of fact table columns are:

   S.TIME     = 2,000
   S.PRODUCT  =   433
   S.LOCATION =   100

The cardinality of dimension tables are:

   T.ID = 2,000
   P.ID =   500
   L.ID =   100

The cardinality of column pairs in fact table are:

   S.TIME    and S.PRODUCT  = 625,000
   S.TIME    and S.LOCATION = 196,000
   S.PRODUCT and S.LOCATION =     994

The density of all pairs of columns in fact table can be calculated as follows. The density is a measure of the correlation of the columns.

   Density (S.TIME and S.PRODUCT)  =
     625,000 / (2,000 * 500)=0.625
   Density (S.TIME and S.LOCATION) = 
     196,000 / (2,000 * 100)=0.98 
   Density (S.PRODUCT and S.LOCATION) =
     994 / (  500 * 100)=0.01988

By reviewing the prior results we can see that the pair of columns with the lowest density is S.PRODUCT and S.LOCATION with a value of 0.01988. The column not in the pair should be used as the third column of the composite index (S.TIME).
Repeat the process to determine:

   Density of S.PRODUCT  = 433 / 500 = 0.866
   Density of S.LOCATION = 100 / 100 = 1.0

The S.PRODUCT column has the lowest density of 0.866 and therefore should be made the first column of the composite index. The remaining column, S.LOCATION, should be made the second column of the index. The index can be created like this:

   CREATE INDEX SPLTX ON SALES (PRODUCT, LOCATION, TIME) CLUSTER;

The usage of the data overrides the order suggested by the formula. For example, if the PRODUCT table does not participate in most joins requested by the user, it should not be the first column in the index. If the user frequently narrows the search on LOCATION, consideration should be made to making this the first column in the index. An alternative is to have an index on LOCATION in the dimension table to allow DB2 to narrow the search to specific locations requested by the user. The dimension tables should be indexed for anticipated user requests using the guidelines in the section on index design. The dimension table indexes are not used in star join processing.