### Tuesday, May 30, 2006

## Windowing Clauses

**Answer**: "Analytic Functions"

**Question**: "I have become quite comfortable, perhaps even adept, with basic SQL queries. I am ready to learn how to do more with my data. What do you recommend I study?"

Given how often I write about analytic functions, my answer may not come as a surprise. To those who have been studying along with me, I would like to add one more tool to our toolbelts. Following up on my recent article about finding nearby rows, I would like to touch on windowing clauses.

**Life Without Windowing Clauses**

Quick, find me the average of a value across all rows in a table. No problem, right? Use the AVG function.

*Note: Sample data can be found at the end of the article.*

```sql
SELECT AVG(votes) FROM elections;

AVG(VOTES)
----------------------
357315.2307692307692307692307692307692308
```

Great. Now find me that average for only a certain range of the data, say from elections in the 1990s only. Again, no problem.

```sql
SELECT AVG(votes) FROM elections WHERE election BETWEEN 1990 AND 1999;

AVG(VOTES)
----------------------
461947.5
```

OK. Now, for each row, find me the average over a 10-year range before and after.

*Um... errr...*

How about simply the average including the immediately preceding and succeeding rows?

*Um ... errr ... what was the link to your article on finding nearby rows again?*

**To the Rescue**

Well, you can relax. I am about to show you how to answer these questions, and others like them. We'll answer them using windowing clauses, the part of an analytic function that defines the set of rows you wish to work with.

First, let's review analytic functions. You can go read Chapter 6 of the SQL Reference, but generally an analytic function takes the following form:

```
analytic_function(arguments) OVER ( [query_partition_clause] [order_by_clause [windowing_clause]] )
```

**Using a windowing clause, you can define which row your window starts with and which row it ends with, relative to the current row, either by a range of values or by a number of rows.**

Now to let you off the hook. Here is an example that uses windowing clauses to determine the average of a column over a 10-year range, and an average including the rows immediately before and after.

```sql
SELECT election, leader, votes,
       CEIL(AVG(votes) OVER (ORDER BY election
                             RANGE BETWEEN 10 PRECEDING AND 10 FOLLOWING)) ten_year,
       CEIL(AVG(votes) OVER (ORDER BY election
                             ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)) before_after
  FROM elections;

ELECTION LEADER       VOTES TEN_YEAR BEFORE_AFTER
-------- ---------- ------ -------- ------------
    1959 Kirby       98730    93184        75004
    1963 Harradance  51278   144122        93184
    1967 Lougheed   129544   189250       159252
    1971 Lougheed   296934   251124       265414
    1975 Lougheed   369764   358565       358265
    1979 Lougheed   408097   399552       455449
    1982 Lougheed   588485   420075       454455
    1986 Getty      366783   434118       440838
    1989 Getty      367244   442418       391336
    1993 Klein      439981   457035       430380
    1997 Klein      483914   467097       517049
    2001 Klein      627252   492060       509420
    2004 Klein      417092   509420       522172
```
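To make the two windows concrete, here is a small Python sketch (illustrative only, not from the original article) that reproduces the TEN_YEAR and BEFORE_AFTER columns by hand. It shows the key distinction: a RANGE window is defined by the ORDER BY *values*, while a ROWS window is defined by physical row positions.

```python
import math

# (election, votes) pairs from the sample data, ordered by election
rows = [(1959, 98730), (1963, 51278), (1967, 129544), (1971, 296934),
        (1975, 369764), (1979, 408097), (1982, 588485), (1986, 366783),
        (1989, 367244), (1993, 439981), (1997, 483914), (2001, 627252),
        (2004, 417092)]

def ten_year(i):
    # RANGE BETWEEN 10 PRECEDING AND 10 FOLLOWING: membership depends on
    # the ORDER BY *value* (the election year), not on row position
    lo, hi = rows[i][0] - 10, rows[i][0] + 10
    window = [v for e, v in rows if lo <= e <= hi]
    return math.ceil(sum(window) / len(window))

def before_after(i):
    # ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING: membership depends on
    # physical position in the ordered result set
    window = [v for _, v in rows[max(i - 1, 0):i + 2]]
    return math.ceil(sum(window) / len(window))
```

For the 1959 row, `ten_year(0)` considers 1959, 1963, and 1967 (all within ten years) and returns 93184, while `before_after(0)` averages only the first two rows and returns 75004, matching the query output above.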

Armed with the knowledge of the windowing clause, we are finally beginning to unleash the full power of analytic functions. With that in mind, a closer look at the other two clauses is in order.

**Query Partition Clause**

Strictly speaking, the windowing clause is not the only way to define the set of rows you wish to work with. The query partition clause can achieve that same goal by grouping your data on a single, specific value (or set of values).

For example, say we wanted the average number of votes by leader. No problem using GROUP BY, right?

SELECT leader, AVG(votes) avg_votes FROM elections GROUP BY leader;

LEADER AVG_VOTES

-------------------------------- ----------------------

Getty 367013.5

Harradance 51278

Kirby 98730

Klein 492059.75

Lougheed 358564.8

By using the query partition clause we can achieve the same result, and also use it in other ways. For example, we can compare every row against the average of its group.

```sql
SELECT election, leader, votes,
       CEIL(votes - AVG(votes) OVER (PARTITION BY leader)) diff
  FROM elections ORDER BY diff DESC;

ELECTION LEADER      VOTES    DIFF
-------- ---------- ------- -------
    1982 Lougheed    588485  229921
    2001 Klein       627252  135193
    1979 Lougheed    408097   49533
    1975 Lougheed    369764   11200
    1989 Getty       367244     231
    1963 Harradance   51278       0
    1959 Kirby        98730       0
    1986 Getty       366783    -230
    1997 Klein       483914   -8145
    1993 Klein       439981  -52078
    1971 Lougheed    296934  -61630
    2004 Klein       417092  -74967
    1967 Lougheed    129544 -229020
```
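The same partition-by-leader arithmetic can be sketched in a few lines of Python (illustrative only): one average per leader, with each row keeping its identity so it can be compared against its partition's average.

```python
import math
from collections import defaultdict

# (election, leader, votes) from the sample data
rows = [(1959, 'Kirby', 98730), (1963, 'Harradance', 51278),
        (1967, 'Lougheed', 129544), (1971, 'Lougheed', 296934),
        (1975, 'Lougheed', 369764), (1979, 'Lougheed', 408097),
        (1982, 'Lougheed', 588485), (1986, 'Getty', 366783),
        (1989, 'Getty', 367244), (1993, 'Klein', 439981),
        (1997, 'Klein', 483914), (2001, 'Klein', 627252),
        (2004, 'Klein', 417092)]

# PARTITION BY leader: group the votes, one average per partition
by_leader = defaultdict(list)
for _, leader, votes in rows:
    by_leader[leader].append(votes)
avg = {leader: sum(v) / len(v) for leader, v in by_leader.items()}

# unlike GROUP BY, every row survives, each paired with its partition's average
diff = {(election, leader): math.ceil(votes - avg[leader])
        for election, leader, votes in rows}
```

Here `avg['Getty']` is 367013.5, exactly the GROUP BY result, and `diff[(1989, 'Getty')]` is 231, matching the DIFF column.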

Putting the partitioning clause and the windowing clause together is like mixing chocolate and peanut butter: alone they're great; together they're even better. We could solve such complex problems as "For each row, show me the average that includes the immediately preceding and succeeding rows *that share the same id*."

**Further Reading**

Other than the SQL Reference, and Chapter 19 "SQL for Analysis" of the Data Warehousing Guide, my primary sources include Chapter 12 of Tom Kyte's "Expert One-on-One Oracle" and Daniel Morgan's on-line reference.

Beyond that, there are a lot of fine articles on specific applications. For beginners, I would review an anonymous contribution to Howard Rogers' wiki introduction to analytic functions, which also includes a great example.

I will close with the final ingredient of analytic functions: the order by clause.

**Creating Rolling Averages**

By using the order by clause within the context of the analytic function, we can create rolling averages. When you supply an order by clause but no explicit windowing clause, the default window is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which is exactly what a rolling average needs.

```sql
SELECT election, leader, votes,
       CEIL(AVG(votes) OVER (ORDER BY election)) rolling_avg
  FROM elections;

ELECTION LEADER      VOTES ROLLING_AVG
-------- ---------- ------ -----------
    1959 Kirby       98730       98730
    1963 Harradance  51278       75004
    1967 Lougheed   129544       93184
    1971 Lougheed   296934      144122
    1975 Lougheed   369764      189250
    1979 Lougheed   408097      225725
    1982 Lougheed   588485      277548
    1986 Getty      366783      288702
    1989 Getty      367244      297429
    1993 Klein      439981      311684
    1997 Klein      483914      327342
    2001 Klein      627252      352334
    2004 Klein      417092      357316
```
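A rolling average is easy to mimic outside the database. This Python one-liner (illustrative only) reproduces the ROLLING_AVG column by averaging everything from the first row up to and including the current row:

```python
import math

# votes from the sample data, ordered by election
votes = [98730, 51278, 129544, 296934, 369764, 408097, 588485,
         366783, 367244, 439981, 483914, 627252, 417092]

# default window with an order by clause: everything from the start
# of the partition up to and including the current row
rolling = [math.ceil(sum(votes[:i + 1]) / (i + 1)) for i in range(len(votes))]
```

The first entry is simply the first row's votes (98730), and the last entry (357316) is the ceiling of the whole-table average we computed at the start.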

---

Sample data used in this article:

```sql
CREATE TABLE elections (
  election  NUMBER(4),
  party     VARCHAR2(32),
  leader    VARCHAR2(32),
  seats     NUMBER(2),
  votes     NUMBER(7),
  tot_seats NUMBER(2),
  tot_votes NUMBER(7)
);

INSERT INTO elections VALUES (2004, 'PC', 'Klein', 61, 417092, 83, 890700);
INSERT INTO elections VALUES (2001, 'PC', 'Klein', 74, 627252, 83, 1013152);
INSERT INTO elections VALUES (1997, 'PC', 'Klein', 63, 483914, 83, 845713);
INSERT INTO elections VALUES (1993, 'PC', 'Klein', 51, 439981, 83, 989025);
INSERT INTO elections VALUES (1989, 'PC', 'Getty', 59, 367244, 83, 829189);
INSERT INTO elections VALUES (1986, 'PC', 'Getty', 61, 366783, 83, 713654);
INSERT INTO elections VALUES (1982, 'PC', 'Lougheed', 75, 588485, 79, 944936);
INSERT INTO elections VALUES (1979, 'PC', 'Lougheed', 74, 408097, 79, 710963);
INSERT INTO elections VALUES (1975, 'PC', 'Lougheed', 69, 369764, 75, 590200);
INSERT INTO elections VALUES (1971, 'PC', 'Lougheed', 49, 296934, 75, 639862);
INSERT INTO elections VALUES (1967, 'PC', 'Lougheed', 6, 129544, 65, 498351);
INSERT INTO elections VALUES (1963, 'PC', 'Harradance', 0, 51278, 63, 403444);
INSERT INTO elections VALUES (1959, 'PC', 'Kirby', 1, 98730, 65, 413516);
```

### Tuesday, May 09, 2006

## Finding Nearby Rows

Today's question is how do you find rows that have values nearest the value of a given row?


Although there are many different variations on this central theme, in order to solve problems of this nature, you'll need to know:

1. The given column's value for the given row

2. The difference between each row's value and the value of the given row

3. How that row's difference compares to every other row's

Step 1 is easy; steps 2 and 3 are more difficult.

For starters, let's say my requirement is to find the 2 battles that started nearest the beginning of the battle of Batoche. Here's how I first attacked that problem.

*Note: You can find my test data at the end of this blog if you want to try it for yourself.*

```sql
SELECT loc, difference FROM
  (SELECT a.loc, ABS(a.start_date - b.start_date) difference
     FROM battles a,
          (SELECT start_date FROM battles WHERE loc = 'BATOCHE') b
    ORDER BY difference)
 WHERE rownum < 4; -- the given row itself ranks first (difference 0), so take three rows

LOC                  DIFFERENCE
-------------------- ----------
BATOCHE                       0
CUT KNIFE                     7
FISH CREEK                   15
```

Despite my use of ROWNUM to get just the closest rows, I will admit that my solution is not satisfying, because I don't like going through the same data more than once: once to find the value of the given row, and again to figure out the differences. Blech!

I do have a better solution, but first let's say my requirements are a little tougher, to the point where this solution won't even work. For example:

1. In the above example, the two nearest rows both happened to be prior to the given row. Let's say my requirements are to find the row immediately preceding the given row, and the row immediately succeeding it.

AND let's also say

2. I want the TWO rows immediately before and immediately after instead of just one.

What then?

**RANK**

For starters, we can use the analytic function RANK instead of ROWNUM for an easier way to get the order in which the rows are presented.

Once again, the 'b' query gives us the rank of our given row, and now the 'a' query will get a list of all the locations and their respective RANK. Now we can take as many rows as we want from that centrepoint.

```sql
SELECT loc FROM
  (SELECT a.loc, ABS(a.battle_num - b.battle_num) distance FROM
     (SELECT loc, RANK() OVER (ORDER BY start_date) battle_num FROM battles) a,
     (SELECT loc, RANK() OVER (ORDER BY start_date) battle_num FROM battles) b
   WHERE b.loc = 'BATOCHE')
 WHERE distance BETWEEN 1 AND 2;

LOC
--------------------
FISH CREEK
CUT KNIFE
FRENCHMAN'S BUTTE
LOON LAKE
```
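The rank arithmetic is worth spelling out. Since every battle has a distinct start date, RANK() simply numbers the rows 1 through 7, and the query keeps rows within two rank positions of Batoche. A quick Python sketch of that distance test (illustrative, not from the article):

```python
# battle locations from the sample data, ordered by start_date
battles = ['DUCK LAKE', 'FORT PITT', 'FISH CREEK', 'CUT KNIFE',
           'BATOCHE', "FRENCHMAN'S BUTTE", 'LOON LAKE']

# RANK() OVER (ORDER BY start_date): with no ties, just 1, 2, 3, ...
rank = {loc: i + 1 for i, loc in enumerate(battles)}

# distance BETWEEN 1 AND 2: within two rank positions, excluding the row itself
near = [loc for loc in battles if 1 <= abs(rank[loc] - rank['BATOCHE']) <= 2]
```

`near` contains exactly the four locations the query returns.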

I'm still annoyed about the inelegance of the solution, because we're still going through my data more than once. Is there a better way?

**LEAD and LAG**

With LEAD and LAG you can access multiple rows without a self-join (joining a table to itself). With that, we can do all sorts of fancy things, such as figure out how many days have passed since the last battle started, and how long until the next one began.

```sql
SELECT loc, start_date,
       start_date - LAG(start_date, 1, NULL) OVER (ORDER BY start_date) AS last_battle,
       LEAD(start_date, 1, NULL) OVER (ORDER BY start_date) - start_date AS next_battle
  FROM battles;
```
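The gap calculation is equivalent to this Python sketch (dates taken from the sample data; illustrative only): subtract the previous row's date for last_battle and the next row's date for next_battle, with None standing in for the NULL that LAG and LEAD return at the edges.

```python
from datetime import date

# (loc, start_date) from the sample data, ordered by start_date
battles = [('DUCK LAKE', date(1885, 3, 26)), ('FORT PITT', date(1885, 4, 15)),
           ('FISH CREEK', date(1885, 4, 24)), ('CUT KNIFE', date(1885, 5, 2)),
           ('BATOCHE', date(1885, 5, 9)), ("FRENCHMAN'S BUTTE", date(1885, 5, 28)),
           ('LOON LAKE', date(1885, 6, 3))]

def gaps(i):
    # LAG: previous row's date; LEAD: next row's date; None at the boundaries
    last = (battles[i][1] - battles[i - 1][1]).days if i > 0 else None
    nxt = (battles[i + 1][1] - battles[i][1]).days if i < len(battles) - 1 else None
    return last, nxt
```

For Batoche, `gaps(4)` returns (7, 19): seven days after Cut Knife, nineteen days before Frenchman's Butte.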

Back to the task at hand: armed with LAG and LEAD, we can create a list of all rows, including the value of the previous row and the value of the next one. From that, it's pretty trivial to select the rows we want.

```sql
SELECT loc FROM
  (SELECT loc,
          LAG(loc, 1, NULL) OVER (ORDER BY start_date) previous_battle,
          LEAD(loc, 1, NULL) OVER (ORDER BY start_date) next_battle
     FROM battles)
 WHERE previous_battle = 'BATOCHE' OR next_battle = 'BATOCHE';

LOC
--------------------
CUT KNIFE
FRENCHMAN'S BUTTE
```

That's very nice! But what about the requirement to get the TWO rows immediately preceding and succeeding the given row? Can we still do that with LAG and LEAD?

The SQL Reference describes how to use LAG and LEAD in Chapter 6, Functions. There it is explained that the second parameter to LAG and LEAD is the number of rows to look back or ahead. So we can look two battles before and after Batoche, and use DECODE to collapse the result into a single column.

```sql
SELECT loc FROM
  (SELECT loc,
          DECODE(LAG(loc, 1, NULL)  OVER (ORDER BY start_date), 'BATOCHE', 1, 0) +
          DECODE(LAG(loc, 2, NULL)  OVER (ORDER BY start_date), 'BATOCHE', 1, 0) +
          DECODE(LEAD(loc, 1, NULL) OVER (ORDER BY start_date), 'BATOCHE', 1, 0) +
          DECODE(LEAD(loc, 2, NULL) OVER (ORDER BY start_date), 'BATOCHE', 1, 0) near_batoche
     FROM battles)
 WHERE near_batoche > 0;

LOC
--------------------
FISH CREEK
CUT KNIFE
FRENCHMAN'S BUTTE
LOON LAKE
```
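The DECODE sum is just counting how many of the four offsets (one or two rows back, one or two rows ahead) land on the given row. A Python sketch of the same idea (illustrative only):

```python
# battle locations from the sample data, ordered by start_date
battles = ['DUCK LAKE', 'FORT PITT', 'FISH CREEK', 'CUT KNIFE',
           'BATOCHE', "FRENCHMAN'S BUTTE", 'LOON LAKE']

def lag(i, n):   # value n rows back, or None (LAG's NULL default)
    return battles[i - n] if i - n >= 0 else None

def lead(i, n):  # value n rows ahead, or None (LEAD's NULL default)
    return battles[i + n] if i + n < len(battles) else None

# near_batoche > 0: at least one of the four offsets hits BATOCHE
near = [battles[i] for i in range(len(battles))
        if any(f(i, n) == 'BATOCHE' for f in (lag, lead) for n in (1, 2))]
```

Each additional row you want means two more offsets to check, which is why the SQL version gets bigger and uglier as the neighbourhood grows.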

Depending how many rows you want, this query will get bigger and uglier, so you may want to revert to using the RANK example instead. If you like the looks of LAG and LEAD, Tim Hall has a good introduction article worth reviewing.

**Performance**

The big question, I suppose, is the performance implications of each approach. Granted, I have a small amount of data, but here are the differences I observed:

ROWNUM example, 2 nearest rows:

Slowest of the three; it had more parses, executes, and queries than the other two combined. The query came in three pieces: the first two had full table accesses, and the last had a nested loop with a table access, a sort, and a view.

RANK example, looking 1 or 2 rows:

No real difference between 1 or 2 rows. Slower than LAG/LEAD, with double the parses, executes, and fetches. Came in two pieces: the first had a full access; the second had 2 nested loops, 2 full accesses, and 2 sorts, including one that was cartesian (49 rows).

LAG/LEAD example, looking 1 or 2 rows:

No real difference between 1 or 2 rows. The fastest of the three, with the lowest parse, execute, and fetch counts. Came in two pieces, both of which had a full access; the second also had a sort and a view.

In terms of performance, it looks like LAG/LEAD wins, on this sample data.

Sample data:

```sql
CREATE TABLE battles (
  loc        VARCHAR2(20),
  start_date DATE,
  canadian   VARCHAR2(20),
  rebel      VARCHAR2(20)
);

-- explicit date formats, so the inserts do not depend on your NLS_DATE_FORMAT setting
INSERT INTO battles VALUES ('DUCK LAKE',  TO_DATE('26-03-1885', 'DD-MM-YYYY'), 'Crozier',   'Dumont');
INSERT INTO battles VALUES ('FORT PITT',  TO_DATE('15-04-1885', 'DD-MM-YYYY'), 'Dickens',   'Big Bear');
INSERT INTO battles VALUES ('FISH CREEK', TO_DATE('24-04-1885', 'DD-MM-YYYY'), 'Middleton', 'Dumont');
INSERT INTO battles VALUES ('CUT KNIFE',  TO_DATE('02-05-1885', 'DD-MM-YYYY'), 'Otter',     'Poundmaker');
INSERT INTO battles VALUES ('BATOCHE',    TO_DATE('09-05-1885', 'DD-MM-YYYY'), 'Middleton', 'Dumont');
INSERT INTO battles VALUES ('FRENCHMAN''S BUTTE', TO_DATE('28-05-1885', 'DD-MM-YYYY'), 'Strange', 'Wandering Spirit');
INSERT INTO battles VALUES ('LOON LAKE',  TO_DATE('03-06-1885', 'DD-MM-YYYY'), 'Steele',    'Wandering Spirit');
```

Bonus: My favourite recent blog article was James Koopmann's recent write-up on storing Word Documents in Oracle.