Tuesday, May 09, 2006

Finding Nearby Rows

Today's question is how do you find rows that have values nearest the value of a given row?

Although there are many different variations on this central theme, in order to solve problems of this nature, you'll need to know:

1. The given column's value for the given row
2. The difference between each row's value and the value of the given row
3. That row's difference's relationship to all the other rows.

Step 1 is easy, 2 and 3 are more difficult.

For starters, let's say my requirement is to find the 2 battles that started nearest the beginning of the battle of Batoche. Here's how I first attacked that problem.
Note: You can find my test data at the end of this blog if you want to try it for yourself.

SELECT loc, difference FROM
(SELECT a.loc, ABS(a.start_date - b.start_date) difference
FROM battles a,
(SELECT start_date FROM battles WHERE loc = 'BATOCHE') b
ORDER BY difference)
WHERE rownum < 3;

LOC                  DIFFERENCE             
-------------------- ----------------------
BATOCHE 0
CUT KNIFE 7
FISH CREEK 15


Despite my use of ROWNUM to get just the two closest rows, I will admit that my result is not satisfying, because I don't like going through the same data more than once. Once to find the value of the given row, and another to figure out the differences. Blech!

I do have a better solution, but first let's say my requirements are a little tougher, to the point where this solution won't even work. For example:

1. In the above example, the two nearest rows both happened to be prior to the given row. Let's say my requirements are to find the rows immediately preceding the given row, and the rows immediately succeeding it.

AND let's also say

2. I want the TWO rows immediately before and immediately after instead of just one.

What then?

RANK

For starters, we can use the analytic function RANK instead of ROWNUM for an easier way to get the order in which the rows are presented.

Once again, the 'b' query gives us the rank of our given row, and now the 'a' query will get a list of all the locations and their respective RANK. Now we can take as many rows as we want from that centrepoint.

SELECT loc FROM
(SELECT a.loc, ABS(a.battle_num - b.battle_num) distance FROM
(SELECT loc, RANK() OVER (ORDER BY start_date) battle_num FROM battles) a,
(SELECT loc, RANK() OVER (ORDER BY start_date) battle_num FROM battles) b
WHERE b.loc = 'BATOCHE')
WHERE distance BETWEEN 1 AND 2;

LOC                  
--------------------
FISH CREEK
CUT KNIFE
FRENCHMAN'S BUTTE
LOON LAKE


I'm still annoyed about the inelegance of the solution, because we're still going through my data more than once. Is there a better way?

LEAD and LAG

With LEAD and LAG you can access multiple rows without a self-join (joining a table to itself). With that, we can do all sorts of fancy things, such as figure out how many days have passed since the last battle started, and how long until the next one began.

SELECT loc, start_date,
start_date - LAG(start_date, 1, NULL) OVER (ORDER BY start_date) AS last_battle,
LEAD(start_date, 1, NULL) OVER (ORDER BY start_date) - start_date AS next_battle
FROM battles;

Back to the task at hand, armed with LAG and LEAD, we can create a list of all rows including the value of the previous row, and the value of the next one. From that, its pretty trivial to select the rows we want.

SELECT loc FROM (SELECT loc,
LAG(loc, 1, NULL) OVER (ORDER BY start_date) previous_battle,
LEAD(loc, 1, NULL) OVER (ORDER BY start_date) next_battle
FROM battles)
WHERE previous_battle = 'BATOCHE' OR next_battle = 'BATOCHE';

LOC                  
--------------------
CUT KNIFE
FRENCHMAN'S BUTTE


That's very nice! But what about the requirement to get the TWO rows immediately preceding and succeeding the given row, can we still do that with LAG and LEAD?

The SQL Reference describes how to use LAG and LEAD, under Chapter 6, Functions. There it is explained that the second parameter to LAG and LEAD is the number of rows to look. So we can look 2 battles before and after Batoche, and use DECODE to make it a single column.

SELECT loc FROM
(SELECT loc, DECODE(LAG(loc, 1, NULL) OVER (ORDER BY start_date), 'BATOCHE', 1, 0) +
DECODE(LAG(loc, 2, NULL) OVER (ORDER BY start_date), 'BATOCHE', 1, 0) +
DECODE(LEAD(loc, 1, NULL) OVER (ORDER BY start_date), 'BATOCHE', 1, 0) +
DECODE(LEAD(loc, 2, NULL) OVER (ORDER BY start_date), 'BATOCHE', 1, 0) near_batoche
FROM battles)
WHERE near_batoche > 0;

LOC                  
--------------------
FISH CREEK
CUT KNIFE
FRENCHMAN'S BUTTE
LOON LAKE


Depending how many rows you want, this query will get bigger and uglier, so you may want to revert to using the RANK example instead. If you like the looks of LAG and LEAD, Tim Hall has a good introduction article worth reviewing.

Performance

I guess the big question is related to the performance implications of each approach. Granted I have a small amount of data, but here are the differences I observed

ROWNUM example, 2 nearest rows:
Slowest of the three, had more parse, executes and queries than the other two combined. The query came in three pieces, the first two of which had full table accesses, the last of which having a nested loop with a table access, a sort, and a view.

RANK example, looking 1 or 2 rows:
No real difference between 1 or 2 rows. Slower than LAG/LEAD with double the parse/execute/fetch. Came in two pieces, the first of which had a full access, the second of which had 2 nested loops, 2 full accesses, and 2 sorts, including one that was cartesian (49 rows).

LAG/LEAD example, looking 1 or 2 rows:
No real difference between 1 or 2 rows. Faster of the three, lowest parse/execute/fetch. Came in two pieces, both of which had a full access, but the second of which also had a sort and a view.

In terms of performance, it looks like LAG/LEAD wins, on this sample data.

Sample data:
CREATE TABLE battles (loc VARCHAR2(20), start_date DATE, canadian VARCHAR2(20), rebel VARCHAR2(20));
INSERT INTO battles VALUES ('DUCK LAKE', '26-March-1885', 'Crozier', 'Dumont');
INSERT INTO battles VALUES ('FORT PITT', '15-April-1885', 'Dickens', 'Big Bear');
INSERT INTO battles VALUES ('FISH CREEK', '24-April-1885', 'Middleton', 'Dumont');
INSERT INTO battles VALUES ('CUT KNIFE', '2-May-1885', 'Otter', 'Poundmaker');
INSERT INTO battles VALUES ('BATOCHE', '9-May-1885', 'Middleton', 'Dumont');
INSERT INTO battles VALUES ('FRENCHMAN''S BUTTE', '28-May-1885', 'Strange', 'Wandering Spirit');
INSERT INTO battles VALUES ('LOON LAKE', '3-June-1885', 'Steele', 'Wandering Spirit');

Bonus: My favourite recent blog article was James Koopmann's recent write-up on storing Word Documents in Oracle.

Comments:
Hi Robert,

Thanks for nice sum-up. I would like to mention another solution. I will also use a bit more representative test case (sorry for flooding your "historical" table with random pseudo-events).

So let's create table (note that I added PK which must be there anyway):
CREATE TABLE battles
(loc VARCHAR2(20),
start_date DATE,
canadian VARCHAR2(20),
rebel VARCHAR2(20),
PRIMARY KEY (loc)
);

Now let's populate with 100000 rows:
INSERT INTO battles (loc, start_date, canadian, rebel)
SELECT lvl, trunc(sysdate)-dbms_random.value(0,1000000),
'dummy field 1', 'dummy field 2'
FROM (select level lvl FROM DUAL CONNECT BY level<=100000);
COMMIT;

And build an index on start_date:
CREATE INDEX ind_battles_start_date ON battles(start_date);

Don't forget our good friend statistics:
BEGIN
SYS.DBMS_STATS.GATHER_TABLE_STATS (
OwnName => USER
,TabName => 'BATTLES'
,Method_Opt => 'FOR ALL COLUMNS SIZE 1 ');
END;
/

Now let's get the 2 previous events with autotrace on:
WITH point as (SELECT start_date FROM battles where loc='42876')
SELECT * FROM (
SELECT b.loc, s.start_date-b.start_date diff
FROM point s, battles b
WHERE b.start_date < s.start_date
ORDER BY b.start_date DESC
) WHERE rownum <= 2;
----------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 36 | 2 (0)| 00:00:01 |
|* 1 | COUNT STOPKEY | | | | | |
| 2 | VIEW | | 2 | 36 | 2 (0)| 00:00:01 |
| 3 | NESTED LOOPS | | 2 | 84 | 2 (0)| 00:00:01 |
| 4 | TABLE ACCESS BY INDEX ROWID | BATTLES | 1 | 14 | 1 (0)| 00:00:01 |
|* 5 | INDEX UNIQUE SCAN | SYS_C004647 | 1 | | 1 (0)| 00:00:01 |
| 6 | TABLE ACCESS BY INDEX ROWID | BATTLES | 2 | 28 | 1 (0)| 00:00:01 |
|* 7 | INDEX RANGE SCAN DESCENDING| IND_BATTLES_START_DATE | 2 | | 1 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------------
...
Statistics
...
8 consistent gets
...

In the same way we can get the next 2 nearest events. If we need more than 2 - just change where clause. This was done in 10gR2 and if I am not mistaken it uses new 10g feature for optimization of Top-N type queries SELECT ... FROM (SELECT ... ORDER BY...) WHERE ROWNUM < n.

We can also put these queries together via UNION if needed, however, for puzzling reason Oracle (10.2.0.2) changes execution plan in this case and does full index scans. I think it maybe worked around if needed.

Hope it might add a bit to you collection of ways to find nearby rows. ;-)

Regards,
Alex
 
Hello Robert,

can we produce this result by using the following query with the help of oracle LAG or LEAD function.

select dname from dept

ACCOUNTING,RESEARCH,SALES,OPERATIONS

Thank you
Khurram Naseem
 
Post a Comment

<< Home

This page is powered by Blogger. Isn't yours?