Reprinted with Permission by RevealNet, Inc.  March 2001

Getting a random sample of records
the good the bad and the ugly….

One common requirement that crops up from time to time is that someone will want a random sample of records from your database. A common example of this is where a marketing department makes what at first may seem like a simple request. "Could we have a list of 1000 potential customer addresses matching criteria x and y and z. Oh, and if there are more than 1000 people matching the criteria then can we choose those people randomly from the total set of people."

At first sight you may thing that this is easily done by not ordering the output from whatever query you use and then reading as many records as you require from the output. This is often a mistake made by the novice as Oracle tends to return the rows in the same order each time if you don’t give an order-by clause. The next thing you may do is to reach for the manual to look at the dbms_random package without really thinking through what it is that you want to do. Its certainly what I did.

The two solutions that I considered and rapidly dismissed were as follows:

The bad solution

This was the easiest to code but was immensely inefficient. It went like this "If you need 100 records, get 100 random numbers using dbms_random package and read until the rownum equals the random number". I dismissed this idea in less time than it took to describe it as this is obviously just too inefficient. The inefficiency lay in the fact that if your random number generator selected rows 500, 10000 and 2000 you need to read 12500 records to get at those 3 random records. As the number of random records you require gets bigger, the overhead associated with this method becomes massive.

The ugly solution.

This grew from the previous idea. I wanted to cut down the number of records required to be read in order to get at those random records. So where is the overhead – in the repeated reading of all those records – if you read 1 to 10000 to get the 10000th record you don’t want to read 1 to 500 to get the 500th record as you have already read them once. The solution I came up with was to load all of the records into a plsql table and then use the dbms_random package again to give me random numbers. These I then could use as the index to the record in the plsql table. The benefits of this solution are that it is relatively quick and it reduces the number of records read. The number of records read is (number of rows in table) + (number of records read from plsql table). As the main component is reading all the records to the plsql table, the response time for the query should change very little. Now this solution works but it is not very pretty – the reason being that what you have is a procedure which writes output to a file perhaps or to a temporary table. This is probably acceptable but not necessarily the nicest way we can do this. It would be nicer if the solution was simpler and could be incorporated into simple sql.

The good solution.

The solution I finally landed on came out of a conversation with one of my workmates. Basically he pointed me in the direction of a function in dbms_utility called GET_HASH_VALUE. I think this was new in 8.15 although it may predate this release. At its simplest, all this function does is accept a string and return a number associated with it. The nice thing about this is that if you select this function as part of a sql statement you can then order your returned rows by something that bears no relationship to the order of the data. Using the solution I am about to explain, you end up with a view where the records are returned in a different order every time you query it!

The DBMS_UTILITY.GET_HASH_VALUE function

The specification for the function is as follows

DBMS_UTILITY.GET_HASH_VALUE

(

name VARCHAR2,

base NUMBER,

hash_size NUMBER

)RETURN NUMBER;

 

PRAGMA restrict_references(GET_HASH_VALUE, WNDS, RNDS, WNPS, RNPS);

The really nice thing about the function, the thing that makes it useful for us is that PRAGMA. The bottom line is that we can now use it as part of a sql statement or as part of a view.

Parameters

There are three IN parameters to the function

Name :- Text string to be used as the basis of the hashing.

Base :- The lowest number you want for the hash value.

Hash_size:- Size of the hash table.

Returns

A numeric value which is derived from the IN parameter ‘name’.

How we use the function to get random records.

What I want to do here is create a view onto the table(s) that I want to look at. In the following example you can see that I have used the ‘emp’ table – lets say we want to find 5 random employees in our organisation to ask about how they feel about working conditions.... Basically I will create a view onto the emp table, including a call to the get_hash_value function and making the view order on that pseudocolumn.

CREATE OR REPLACE VIEW random_employees AS

SELECT DBMS_UTILITY.GET_HASH_VALUE

(

TO_CHAR(dbms_utility.get_time)||ename

,2

,1048576

) "RANDOM_ORDER"

, emp.*

FROM EMP

ORDER BY RANDOM_ORDER

In this example I chose 1048576 for the size of the hash table as oracle recommends that the size be a power of 2. Given that the lowest value is only 2, the highest will be 1048578 (2 + 1048576). You may not have seen dbms_utility.get_time function before – this is a nice (and also pragma’d) function that returns elapsed time in 100th of a second.

When you have created the view, select from it where the rownum is less than your desired maximum number or required records. Because the hash value for each row is dependant upon the time in hundredths of a second concatenated to the surname, the hash value for each row will be different each time you select from the view.

In this way the order or the records changes with time and hence if you select only the first 4 records you will get different records each time.

SQL> select ename from random_employees where rownum < 5;

 

ENAME

----------

WARD

KING

ALLEN

JAMES

 

SQL> select ename from random_employees where rownum < 5;

 

ENAME

----------

JONES

TURNER

WARD

CLARK

 

SQL> select ename from random_employees where rownum < 5;

 

ENAME

----------

JONES

JAMES

CLARK

ADAMS

As there are only 14 rows in the table in total, some employees will turn up in each select as we are picking 28% of the rows in this particular table at random.

Hopefully from this you can see how easy it is to get a random set of records from a query. While the example is pretty superficial, the same technique could be applied to your list of customers for mailing, a list of financial accounts for random checking, getting a random list of batch numbers for random quality control checks.