Tuesday, May 30, 2006

Windowing Clauses

Answer: "Analytic Functions"

Question: "I have become quite comfortable, perhaps even adept, with basic SQL queries. I am ready to learn how to do more with my data. What do you recommend I study?"

Given how often I write about analytic functions, my answer may not come as a surprise. To those who have been studying along with me, I would like to add one more tool to our toolbelts. Following up on my recent article about finding nearby rows, I would like to touch on windowing clauses.

Life Without Windowing Clauses

Quick, find me the average of a value for all rows in a table. No problem, right? Use the aggregate function AVG.

Note: Sample data can be found at the end of the article.

SELECT AVG(votes) FROM elections;


Great. Now, find me that average only for a certain range of data, say, elections in the 1990s only. Again, no problem.

SELECT AVG(votes) FROM elections WHERE election BETWEEN 1990 and 1999;


OK. Now, for each row, find me the average over a 10-year range before and after it.

Um... errr...

How about simply the average of the current row and its immediately preceding and succeeding rows?

Um ... errr ... what was the link to your article on finding nearby rows again?

To the Rescue

Well, you can relax. I am about to show you how to answer these questions, and others like them. We'll answer them using windowing clauses: the part of an analytic function that defines the set of rows you wish to work with.

First, let's review analytic functions. You can read Chapter 6 of the SQL Reference, but in general, an analytic function takes the following form:

analytic_function([ arguments ]) OVER ([ query_partition_clause ] [ order_by_clause [ windowing_clause ] ])

Using a windowing clause, you define which row your window starts with and which row it ends with, relative to the current row, either by a range of values or by a number of rows.
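
For instance, here are a couple of common window definitions. These are fragments, not complete queries, and they assume a numeric ordering column such as the election year in the sample data at the end of this article:

-- the current row and every row before it (a running window)
AVG(votes) OVER (ORDER BY election ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

-- every row whose ordering value falls within 5 of the current row's
AVG(votes) OVER (ORDER BY election RANGE BETWEEN 5 PRECEDING AND 5 FOLLOWING)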

Now to let you off the hook. Here is an example of how to use windowing clauses to determine the average of a column for a 10-year range, and an average including the rows immediately before and after.

SELECT election, leader, votes,
CEIL(AVG(votes) OVER (ORDER BY election RANGE BETWEEN 10 PRECEDING AND 10 FOLLOWING)) ten_year,
CEIL(AVG(votes) OVER (ORDER BY election ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)) before_after
FROM elections;

ELECTION LEADER       VOTES        TEN_YEAR     BEFORE_AFTER
-------- ------------ ------------ ------------ -------------
1959 Kirby 98730 93184 75004
1963 Harradance 51278 144122 93184
1967 Lougheed 129544 189250 159252
1971 Lougheed 296934 251124 265414
1975 Lougheed 369764 358565 358265
1979 Lougheed 408097 399552 455449
1982 Lougheed 588485 420075 454455
1986 Getty 366783 434118 440838
1989 Getty 367244 442418 391336
1993 Klein 439981 457035 430380
1997 Klein 483914 467097 517049
2001 Klein 627252 492060 509420
2004 Klein 417092 509420 522172

Armed with the knowledge of the windowing clause, we are finally beginning to unleash the full power of analytic functions. With that in mind, a closer look at the other two clauses is in order.

Query Partition Clause

Strictly speaking, the windowing clause is not the only way to define the set of rows you wish to work with. The query partition clause can achieve that same goal by grouping your data on a single, specific value (or set of values).

For example, say we wanted the average number of votes by leader. No problem using GROUP BY, right?

SELECT leader, AVG(votes) avg_votes FROM elections GROUP BY leader;

LEADER                           AVG_VOTES              
-------------------------------- ----------------------
Getty 367013.5
Harradance 51278
Kirby 98730
Klein 492059.75
Lougheed 358564.8

By using the partitioning clause we can achieve the same result, but also use it in other ways: for example, to compare each row against its group's average.

SELECT election, leader, votes,
CEIL(votes - AVG (votes) OVER (PARTITION BY leader)) diff
FROM elections ORDER BY diff desc;

ELECTION LEADER        VOTES     DIFF                   
-------- ------------- --------- ---------
1982 Lougheed 588485 229921
2001 Klein 627252 135193
1979 Lougheed 408097 49533
1975 Lougheed 369764 11200
1989 Getty 367244 231
1963 Harradance 51278 0
1959 Kirby 98730 0
1986 Getty 366783 -230
1997 Klein 483914 -8145
1993 Klein 439981 -52078
1971 Lougheed 296934 -61630
2004 Klein 417092 -74967
1967 Lougheed 129544 -229020

Putting the partitioning clause and the windowing clause together is like mixing chocolate and peanut butter. Alone, they're great; together, they're even better. We could solve such complex problems as "For each row, show me the average that includes the immediately preceding and succeeding rows that share the same id."
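
For example, here is a sketch of that kind of query against our elections table, using leader in place of the hypothetical id, averaging each row with its neighbouring elections under the same leader:

SELECT election, leader, votes,
CEIL(AVG(votes) OVER (PARTITION BY leader ORDER BY election ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)) leader_avg
FROM elections;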

Further Reading

Other than the SQL Reference, and Chapter 19 "SQL for Analysis" of the Data Warehousing Guide, my primary sources include Chapter 12 of Tom Kyte's "Expert One-on-One Oracle" and Daniel Morgan's on-line reference.

Beyond that, there are a lot of fine articles on specific applications. For beginners, I would review an anonymous contribution to Howard Rogers' wiki introduction to analytic functions, which also includes a great example.

I will close with the final ingredient of analytic functions, the order by clause.

Creating Rolling Averages

By using the order by clause within the context of the analytic function, we can create rolling averages.

SELECT election, leader, votes,
CEIL(AVG (votes) OVER (ORDER BY election)) rolling_avg
FROM elections;

ELECTION LEADER        VOTES      ROLLING_AVG              
-------- ------------- ---------- -------------
1959 Kirby 98730 98730
1963 Harradance 51278 75004
1967 Lougheed 129544 93184
1971 Lougheed 296934 144122
1975 Lougheed 369764 189250
1979 Lougheed 408097 225725
1982 Lougheed 588485 277548
1986 Getty 366783 288702
1989 Getty 367244 297429
1993 Klein 439981 311684
1997 Klein 483914 327342
2001 Klein 627252 352334
2004 Klein 417092 357316
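
Incidentally, the rolling average works because when you supply an order by clause without a windowing clause, Oracle applies the default window of RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. The query above is therefore equivalent to spelling the window out:

SELECT election, leader, votes,
CEIL(AVG(votes) OVER (ORDER BY election RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) rolling_avg
FROM elections;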


Sample data used in this article:

CREATE TABLE elections (election NUMBER(4), party VARCHAR2(32), leader VARCHAR2(32), seats NUMBER(2), votes NUMBER(7), tot_seats NUMBER(2), tot_votes NUMBER(7));

INSERT INTO elections VALUES (2004, 'PC', 'Klein', 61, 417092, 83, 890700);
INSERT INTO elections VALUES (2001, 'PC', 'Klein', 74, 627252, 83, 1013152);
INSERT INTO elections VALUES (1997, 'PC', 'Klein', 63, 483914, 83, 845713);
INSERT INTO elections VALUES (1993, 'PC', 'Klein', 51, 439981, 83, 989025);
INSERT INTO elections VALUES (1989, 'PC', 'Getty', 59, 367244, 83, 829189);
INSERT INTO elections VALUES (1986, 'PC', 'Getty', 61, 366783, 83, 713654);
INSERT INTO elections VALUES (1982, 'PC', 'Lougheed', 75, 588485, 79, 944936);
INSERT INTO elections VALUES (1979, 'PC', 'Lougheed', 74, 408097, 79, 710963);
INSERT INTO elections VALUES (1975, 'PC', 'Lougheed', 69, 369764, 75, 590200);
INSERT INTO elections VALUES (1971, 'PC', 'Lougheed', 49, 296934, 75, 639862);
INSERT INTO elections VALUES (1967, 'PC', 'Lougheed', 6, 129544, 65, 498351);
INSERT INTO elections VALUES (1963, 'PC', 'Harradance', 0, 51278, 63, 403444);
INSERT INTO elections VALUES (1959, 'PC', 'Kirby', 1, 98730, 65, 413516);

Tuesday, May 09, 2006

Finding Nearby Rows

Today's question: how do you find the rows whose values are nearest to the value of a given row?

Although there are many different variations on this central theme, in order to solve problems of this nature, you'll need to know:

1. The given column's value for the given row
2. The difference between each row's value and the value of the given row
3. How that row's difference compares to all the other rows' differences.

Step 1 is easy; steps 2 and 3 are more difficult.

For starters, let's say my requirement is to find the 2 battles that started nearest the beginning of the battle of Batoche. Here's how I first attacked that problem.
Note: You can find my test data at the end of this blog if you want to try it for yourself.

SELECT loc, difference FROM
(SELECT a.loc, ABS(a.start_date - b.start_date) difference
FROM battles a,
(SELECT start_date FROM battles WHERE loc = 'BATOCHE') b
WHERE a.loc <> 'BATOCHE'
ORDER BY difference)
WHERE rownum < 3;

LOC                  DIFFERENCE
-------------------- ----------------------
CUT KNIFE            7
FISH CREEK           15

Despite my use of ROWNUM to get just the two closest rows, I will admit that my solution is not satisfying, because I don't like going through the same data more than once: once to find the value of the given row, and again to figure out the differences. Blech!

I do have a better solution, but first let's say my requirements are a little tougher, to the point where this solution won't even work. For example:

1. In the above example, the two nearest rows both happened to be prior to the given row. Let's say my requirements are to find the rows immediately preceding the given row, and the rows immediately succeeding it.

AND let's also say

2. I want the TWO rows immediately before and immediately after instead of just one.

What then?


For starters, we can use the analytic function RANK instead of ROWNUM for an easier way to get the order in which the rows are presented.

Once again, the 'b' query gives us the rank of our given row, and now the 'a' query will get a list of all the locations and their respective RANK. Now we can take as many rows as we want from that centrepoint.

SELECT loc, distance FROM
(SELECT a.loc, ABS(a.battle_num - b.battle_num) distance FROM
(SELECT loc, RANK() OVER (ORDER BY start_date) battle_num FROM battles) a,
(SELECT loc, RANK() OVER (ORDER BY start_date) battle_num FROM battles) b
WHERE b.loc = 'BATOCHE')
WHERE distance BETWEEN 1 AND 2;


I'm still annoyed about the inelegance of the solution, because we're still going through my data more than once. Is there a better way?


With LEAD and LAG you can access multiple rows without a self-join (joining a table to itself). With that, we can do all sorts of fancy things, such as figure out how many days have passed since the last battle started, and how long until the next one began.

SELECT loc, start_date,
start_date - LAG(start_date, 1, NULL) OVER (ORDER BY start_date) AS last_battle,
LEAD(start_date, 1, NULL) OVER (ORDER BY start_date) - start_date AS next_battle
FROM battles;

Back to the task at hand: armed with LAG and LEAD, we can create a list of all rows, including the value of the previous row and the value of the next one. From that, it's pretty trivial to select the rows we want.

SELECT loc FROM
(SELECT loc,
LAG(loc, 1, NULL) OVER (ORDER BY start_date) previous_battle,
LEAD(loc, 1, NULL) OVER (ORDER BY start_date) next_battle
FROM battles)
WHERE previous_battle = 'BATOCHE' OR next_battle = 'BATOCHE';


That's very nice! But what about the requirement to get the TWO rows immediately preceding and succeeding the given row? Can we still do that with LAG and LEAD?

The SQL Reference describes how to use LAG and LEAD in Chapter 6, Functions. There it is explained that the second parameter to LAG and LEAD is the number of rows to look back or ahead. So we can look 2 battles before and after Batoche, and use DECODE to combine the results into a single column.

SELECT loc FROM
(SELECT loc, DECODE(LAG(loc, 1, NULL) OVER (ORDER BY start_date), 'BATOCHE', 1, 0) +
DECODE(LAG(loc, 2, NULL) OVER (ORDER BY start_date), 'BATOCHE', 1, 0) +
DECODE(LEAD(loc, 1, NULL) OVER (ORDER BY start_date), 'BATOCHE', 1, 0) +
DECODE(LEAD(loc, 2, NULL) OVER (ORDER BY start_date), 'BATOCHE', 1, 0) near_batoche
FROM battles)
WHERE near_batoche > 0;


Depending on how many rows you want, this query will get bigger and uglier, so you may want to revert to the RANK example instead. If you like the looks of LAG and LEAD, Tim Hall has a good introductory article worth reviewing.
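
As an aside, one way to keep the query from growing as the range does is to count the matches inside a window frame instead. Here is a sketch of that idea; note that each row's window includes the given row itself, hence the extra filter:

SELECT loc FROM
(SELECT loc,
COUNT(DECODE(loc, 'BATOCHE', 1)) OVER (ORDER BY start_date ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) near_batoche
FROM battles)
WHERE near_batoche > 0 AND loc <> 'BATOCHE';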


I guess the big question relates to the performance implications of each approach. Granted, I have a small amount of data, but here are the differences I observed:

ROWNUM example, 2 nearest rows:
Slowest of the three, with more parses, executes, and queries than the other two combined. The query came in three pieces: the first two had full table accesses, and the last had a nested loop with a table access, a sort, and a view.

RANK example, looking 1 or 2 rows:
No real difference between 1 or 2 rows. Slower than LAG/LEAD, with double the parses/executes/fetches. Came in two pieces: the first had a full access; the second had 2 nested loops, 2 full accesses, and 2 sorts, including one that was Cartesian (49 rows).

LAG/LEAD example, looking 1 or 2 rows:
No real difference between 1 or 2 rows. Fastest of the three, with the lowest parses/executes/fetches. Came in two pieces, both with a full access, though the second also had a sort and a view.

In terms of performance, it looks like LAG/LEAD wins, on this sample data.

Sample data:
CREATE TABLE battles (loc VARCHAR2(20), start_date DATE, canadian VARCHAR2(20), rebel VARCHAR2(20));
INSERT INTO battles VALUES ('DUCK LAKE', TO_DATE('26-March-1885', 'DD-Month-YYYY'), 'Crozier', 'Dumont');
INSERT INTO battles VALUES ('FORT PITT', TO_DATE('15-April-1885', 'DD-Month-YYYY'), 'Dickens', 'Big Bear');
INSERT INTO battles VALUES ('FISH CREEK', TO_DATE('24-April-1885', 'DD-Month-YYYY'), 'Middleton', 'Dumont');
INSERT INTO battles VALUES ('CUT KNIFE', TO_DATE('2-May-1885', 'DD-Month-YYYY'), 'Otter', 'Poundmaker');
INSERT INTO battles VALUES ('BATOCHE', TO_DATE('9-May-1885', 'DD-Month-YYYY'), 'Middleton', 'Dumont');
INSERT INTO battles VALUES ('FRENCHMAN''S BUTTE', TO_DATE('28-May-1885', 'DD-Month-YYYY'), 'Strange', 'Wandering Spirit');
INSERT INTO battles VALUES ('LOON LAKE', TO_DATE('3-June-1885', 'DD-Month-YYYY'), 'Steele', 'Wandering Spirit');

Bonus: My favourite recent blog article was James Koopmann's recent write-up on storing Word Documents in Oracle.
