### Tuesday, May 30, 2006

## Windowing Clauses

**Answer**: "Analytic Functions"

**Question**: "I have become quite comfortable, perhaps even adept, with basic SQL queries. I am ready to learn how to do more with my data. What do you recommend I study?"

Given how often I write about analytic functions, my answer may not come as a surprise. To those who have been studying along with me, I would like to add one more tool to our toolbelts. Following up on my recent article about finding nearby rows, I would like to touch on windowing clauses.

**Life Without Windowing Clauses**

Quick, find me the average of a value across all rows in a table. No problem, right? Use the AVG function.

*Note: Sample data can be found at the end of the article.*

```sql
SELECT AVG(votes) FROM elections;

AVG(VOTES)
----------------------
357315.2307692307692307692307692307692308
```

Great. Now find me that average for only a certain range of the data, say from elections in the 1990s only. Again, no problem.

```sql
SELECT AVG(votes) FROM elections WHERE election BETWEEN 1990 AND 1999;

AVG(VOTES)
----------------------
461947.5
```

OK. Now, for each row, find me the average over a 10-year range before and after.

*Um... errr...*

How about simply the average including the immediately preceding and succeeding rows?

*Um ... errr ... what was the link to your article on finding nearby rows again?*

**To the Rescue**

Well, you can relax. I am about to show you how to answer these questions, and others like them. We'll answer them using windowing clauses, the part of an analytic function that defines the set of rows you wish to work with.

First, let's review analytic functions. You can go read Chapter 6 of the SQL Reference, but generally an analytic function takes the following form:

```
analytic_function(arguments) OVER ( [query_partition_clause] [order_by_clause [windowing_clause]] )
```

**Using a windowing clause, you can define which row your window starts with and which row it ends with, relative to the current row, either by a range of values or by a number of rows.**

Now to let you off the hook. Here is an example that uses windowing clauses to determine the average of a column over a 10-year range, and an average including the rows immediately before and after.

```sql
SELECT election, leader, votes,
       CEIL(AVG(votes) OVER (ORDER BY election
                             RANGE BETWEEN 10 PRECEDING AND 10 FOLLOWING)) ten_year,
       CEIL(AVG(votes) OVER (ORDER BY election
                             ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)) before_after
  FROM elections;

ELECTION LEADER       VOTES TEN_YEAR BEFORE_AFTER
-------- ---------- ------ -------- ------------
    1959 Kirby       98730    93184        75004
    1963 Harradance  51278   144122        93184
    1967 Lougheed   129544   189250       159252
    1971 Lougheed   296934   251124       265414
    1975 Lougheed   369764   358565       358265
    1979 Lougheed   408097   399552       455449
    1982 Lougheed   588485   420075       454455
    1986 Getty      366783   434118       440838
    1989 Getty      367244   442418       391336
    1993 Klein      439981   457035       430380
    1997 Klein      483914   467097       517049
    2001 Klein      627252   492060       509420
    2004 Klein      417092   509420       522172
```
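To make the two windows concrete, here is a small Python sketch (illustrative only, not from the original article) that reproduces the TEN_YEAR and BEFORE_AFTER columns by hand. It shows the key distinction: a RANGE window is defined by the ORDER BY *values*, while a ROWS window is defined by physical row positions.

```python
import math

# (election, votes) pairs from the sample data, ordered by election
rows = [(1959, 98730), (1963, 51278), (1967, 129544), (1971, 296934),
        (1975, 369764), (1979, 408097), (1982, 588485), (1986, 366783),
        (1989, 367244), (1993, 439981), (1997, 483914), (2001, 627252),
        (2004, 417092)]

def ten_year(i):
    # RANGE BETWEEN 10 PRECEDING AND 10 FOLLOWING: membership depends on
    # the ORDER BY *value* (the election year), not on row position
    lo, hi = rows[i][0] - 10, rows[i][0] + 10
    window = [v for e, v in rows if lo <= e <= hi]
    return math.ceil(sum(window) / len(window))

def before_after(i):
    # ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING: membership depends on
    # physical position in the ordered result set
    window = [v for _, v in rows[max(i - 1, 0):i + 2]]
    return math.ceil(sum(window) / len(window))
```

For the 1959 row, `ten_year(0)` considers 1959, 1963, and 1967 (all within ten years) and returns 93184, while `before_after(0)` averages only the first two rows and returns 75004, matching the query output above.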

Armed with the knowledge of the windowing clause, we are finally beginning to unleash the full power of analytic functions. With that in mind, a closer look at the other two clauses is in order.

**Query Partition Clause**

Strictly speaking, the windowing clause is not the only way to define the set of rows you wish to work with. The query partition clause can achieve that same goal by grouping your data on a single, specific value (or set of values).

For example, say we wanted the average number of votes by leader. No problem using GROUP BY, right?

SELECT leader, AVG(votes) avg_votes FROM elections GROUP BY leader;

LEADER AVG_VOTES

-------------------------------- ----------------------

Getty 367013.5

Harradance 51278

Kirby 98730

Klein 492059.75

Lougheed 358564.8

By using the query partition clause we can achieve the same result, and also use it in other ways. For example, we can compare every row against the average of its group.

```sql
SELECT election, leader, votes,
       CEIL(votes - AVG(votes) OVER (PARTITION BY leader)) diff
  FROM elections ORDER BY diff DESC;

ELECTION LEADER      VOTES    DIFF
-------- ---------- ------- -------
    1982 Lougheed    588485  229921
    2001 Klein       627252  135193
    1979 Lougheed    408097   49533
    1975 Lougheed    369764   11200
    1989 Getty       367244     231
    1963 Harradance   51278       0
    1959 Kirby        98730       0
    1986 Getty       366783    -230
    1997 Klein       483914   -8145
    1993 Klein       439981  -52078
    1971 Lougheed    296934  -61630
    2004 Klein       417092  -74967
    1967 Lougheed    129544 -229020
```
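The same partition-by-leader arithmetic can be sketched in a few lines of Python (illustrative only): one average per leader, with each row keeping its identity so it can be compared against its partition's average.

```python
import math
from collections import defaultdict

# (election, leader, votes) from the sample data
rows = [(1959, 'Kirby', 98730), (1963, 'Harradance', 51278),
        (1967, 'Lougheed', 129544), (1971, 'Lougheed', 296934),
        (1975, 'Lougheed', 369764), (1979, 'Lougheed', 408097),
        (1982, 'Lougheed', 588485), (1986, 'Getty', 366783),
        (1989, 'Getty', 367244), (1993, 'Klein', 439981),
        (1997, 'Klein', 483914), (2001, 'Klein', 627252),
        (2004, 'Klein', 417092)]

# PARTITION BY leader: group the votes, one average per partition
by_leader = defaultdict(list)
for _, leader, votes in rows:
    by_leader[leader].append(votes)
avg = {leader: sum(v) / len(v) for leader, v in by_leader.items()}

# unlike GROUP BY, every row survives, each paired with its partition's average
diff = {(election, leader): math.ceil(votes - avg[leader])
        for election, leader, votes in rows}
```

Here `avg['Getty']` is 367013.5, exactly the GROUP BY result, and `diff[(1989, 'Getty')]` is 231, matching the DIFF column.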

Putting the partitioning clause and the windowing clause together is like mixing chocolate and peanut butter: alone they're great; together they're even better. We could solve such complex problems as "For each row, show me the average that includes the immediately preceding and succeeding rows *that share the same id*."

**Further Reading**

Other than the SQL Reference, and Chapter 19 "SQL for Analysis" of the Data Warehousing Guide, my primary sources include Chapter 12 of Tom Kyte's "Expert One-on-One Oracle" and Daniel Morgan's on-line reference.

Beyond that, there are a lot of fine articles on specific applications. For beginners, I would review an anonymous contribution to Howard Rogers' wiki introduction to analytic functions, which also includes a great example.

I will close with the final ingredient of analytic functions: the order by clause.

**Creating Rolling Averages**

By using the order by clause within the context of the analytic function, we can create rolling averages. When you supply an order by clause but no explicit windowing clause, the default window is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which is exactly what a rolling average needs.

```sql
SELECT election, leader, votes,
       CEIL(AVG(votes) OVER (ORDER BY election)) rolling_avg
  FROM elections;

ELECTION LEADER      VOTES ROLLING_AVG
-------- ---------- ------ -----------
    1959 Kirby       98730       98730
    1963 Harradance  51278       75004
    1967 Lougheed   129544       93184
    1971 Lougheed   296934      144122
    1975 Lougheed   369764      189250
    1979 Lougheed   408097      225725
    1982 Lougheed   588485      277548
    1986 Getty      366783      288702
    1989 Getty      367244      297429
    1993 Klein      439981      311684
    1997 Klein      483914      327342
    2001 Klein      627252      352334
    2004 Klein      417092      357316
```
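A rolling average is easy to mimic outside the database. This Python one-liner (illustrative only) reproduces the ROLLING_AVG column by averaging everything from the first row up to and including the current row:

```python
import math

# votes from the sample data, ordered by election
votes = [98730, 51278, 129544, 296934, 369764, 408097, 588485,
         366783, 367244, 439981, 483914, 627252, 417092]

# default window with an order by clause: everything from the start
# of the partition up to and including the current row
rolling = [math.ceil(sum(votes[:i + 1]) / (i + 1)) for i in range(len(votes))]
```

The first entry is simply the first row's votes (98730), and the last entry (357316) is the ceiling of the whole-table average we computed at the start.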

---

Sample data used in this article:

```sql
CREATE TABLE elections (
  election  NUMBER(4),
  party     VARCHAR2(32),
  leader    VARCHAR2(32),
  seats     NUMBER(2),
  votes     NUMBER(7),
  tot_seats NUMBER(2),
  tot_votes NUMBER(7)
);

INSERT INTO elections VALUES (2004, 'PC', 'Klein', 61, 417092, 83, 890700);
INSERT INTO elections VALUES (2001, 'PC', 'Klein', 74, 627252, 83, 1013152);
INSERT INTO elections VALUES (1997, 'PC', 'Klein', 63, 483914, 83, 845713);
INSERT INTO elections VALUES (1993, 'PC', 'Klein', 51, 439981, 83, 989025);
INSERT INTO elections VALUES (1989, 'PC', 'Getty', 59, 367244, 83, 829189);
INSERT INTO elections VALUES (1986, 'PC', 'Getty', 61, 366783, 83, 713654);
INSERT INTO elections VALUES (1982, 'PC', 'Lougheed', 75, 588485, 79, 944936);
INSERT INTO elections VALUES (1979, 'PC', 'Lougheed', 74, 408097, 79, 710963);
INSERT INTO elections VALUES (1975, 'PC', 'Lougheed', 69, 369764, 75, 590200);
INSERT INTO elections VALUES (1971, 'PC', 'Lougheed', 49, 296934, 75, 639862);
INSERT INTO elections VALUES (1967, 'PC', 'Lougheed', 6, 129544, 65, 498351);
INSERT INTO elections VALUES (1963, 'PC', 'Harradance', 0, 51278, 63, 403444);
INSERT INTO elections VALUES (1959, 'PC', 'Kirby', 1, 98730, 65, 413516);
```

### Tuesday, May 09, 2006

## Finding Nearby Rows

Today's question is how do you find rows that have values nearest the value of a given row?


Although there are many different variations on this central theme, in order to solve problems of this nature, you'll need to know:

1. The given column's value for the given row

2. The difference between each row's value and the value of the given row

3. How that row's difference compares to every other row's

Step 1 is easy; steps 2 and 3 are more difficult.

For starters, let's say my requirement is to find the 2 battles that started nearest the beginning of the battle of Batoche. Here's how I first attacked that problem.

*Note: You can find my test data at the end of this blog if you want to try it for yourself.*

```sql
SELECT loc, difference FROM
  (SELECT a.loc, ABS(a.start_date - b.start_date) difference
     FROM battles a,
          (SELECT start_date FROM battles WHERE loc = 'BATOCHE') b
    ORDER BY difference)
 WHERE rownum < 4; -- the given row itself ranks first (difference 0), so take three rows

LOC                  DIFFERENCE
-------------------- ----------
BATOCHE                       0
CUT KNIFE                     7
FISH CREEK                   15
```

Despite my use of ROWNUM to get just the closest rows, I will admit that my solution is not satisfying, because I don't like going through the same data more than once: once to find the value of the given row, and again to figure out the differences. Blech!

I do have a better solution, but first let's say my requirements are a little tougher, to the point where this solution won't even work. For example:

1. In the above example, the two nearest rows both happened to be prior to the given row. Let's say my requirements are to find the row immediately preceding the given row, and the row immediately succeeding it.

AND let's also say

2. I want the TWO rows immediately before and immediately after instead of just one.

What then?

**RANK**

For starters, we can use the analytic function RANK instead of ROWNUM for an easier way to get the order in which the rows are presented.

Once again, the 'b' query gives us the rank of our given row, and now the 'a' query will get a list of all the locations and their respective RANK. Now we can take as many rows as we want from that centrepoint.

```sql
SELECT loc FROM
  (SELECT a.loc, ABS(a.battle_num - b.battle_num) distance FROM
     (SELECT loc, RANK() OVER (ORDER BY start_date) battle_num FROM battles) a,
     (SELECT loc, RANK() OVER (ORDER BY start_date) battle_num FROM battles) b
   WHERE b.loc = 'BATOCHE')
 WHERE distance BETWEEN 1 AND 2;

LOC
--------------------
FISH CREEK
CUT KNIFE
FRENCHMAN'S BUTTE
LOON LAKE
```
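The rank arithmetic is worth spelling out. Since every battle has a distinct start date, RANK() simply numbers the rows 1 through 7, and the query keeps rows within two rank positions of Batoche. A quick Python sketch of that distance test (illustrative, not from the article):

```python
# battle locations from the sample data, ordered by start_date
battles = ['DUCK LAKE', 'FORT PITT', 'FISH CREEK', 'CUT KNIFE',
           'BATOCHE', "FRENCHMAN'S BUTTE", 'LOON LAKE']

# RANK() OVER (ORDER BY start_date): with no ties, just 1, 2, 3, ...
rank = {loc: i + 1 for i, loc in enumerate(battles)}

# distance BETWEEN 1 AND 2: within two rank positions, excluding the row itself
near = [loc for loc in battles if 1 <= abs(rank[loc] - rank['BATOCHE']) <= 2]
```

`near` contains exactly the four locations the query returns.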

I'm still annoyed about the inelegance of the solution, because we're still going through my data more than once. Is there a better way?

**LEAD and LAG**

With LEAD and LAG you can access multiple rows without a self-join (joining a table to itself). With that, we can do all sorts of fancy things, such as figure out how many days have passed since the last battle started, and how long until the next one began.

```sql
SELECT loc, start_date,
       start_date - LAG(start_date, 1, NULL) OVER (ORDER BY start_date) AS last_battle,
       LEAD(start_date, 1, NULL) OVER (ORDER BY start_date) - start_date AS next_battle
  FROM battles;
```
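The gap calculation is equivalent to this Python sketch (dates taken from the sample data; illustrative only): subtract the previous row's date for last_battle and the next row's date for next_battle, with None standing in for the NULL that LAG and LEAD return at the edges.

```python
from datetime import date

# (loc, start_date) from the sample data, ordered by start_date
battles = [('DUCK LAKE', date(1885, 3, 26)), ('FORT PITT', date(1885, 4, 15)),
           ('FISH CREEK', date(1885, 4, 24)), ('CUT KNIFE', date(1885, 5, 2)),
           ('BATOCHE', date(1885, 5, 9)), ("FRENCHMAN'S BUTTE", date(1885, 5, 28)),
           ('LOON LAKE', date(1885, 6, 3))]

def gaps(i):
    # LAG: previous row's date; LEAD: next row's date; None at the boundaries
    last = (battles[i][1] - battles[i - 1][1]).days if i > 0 else None
    nxt = (battles[i + 1][1] - battles[i][1]).days if i < len(battles) - 1 else None
    return last, nxt
```

For Batoche, `gaps(4)` returns (7, 19): seven days after Cut Knife, nineteen days before Frenchman's Butte.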

Back to the task at hand: armed with LAG and LEAD, we can create a list of all rows, including the value of the previous row and the value of the next one. From that, it's pretty trivial to select the rows we want.

```sql
SELECT loc FROM
  (SELECT loc,
          LAG(loc, 1, NULL) OVER (ORDER BY start_date) previous_battle,
          LEAD(loc, 1, NULL) OVER (ORDER BY start_date) next_battle
     FROM battles)
 WHERE previous_battle = 'BATOCHE' OR next_battle = 'BATOCHE';

LOC
--------------------
CUT KNIFE
FRENCHMAN'S BUTTE
```

That's very nice! But what about the requirement to get the TWO rows immediately preceding and succeeding the given row? Can we still do that with LAG and LEAD?

The SQL Reference describes how to use LAG and LEAD in Chapter 6, Functions. There it is explained that the second parameter to LAG and LEAD is the number of rows to look back or ahead. So we can look two battles before and after Batoche, and use DECODE to collapse the result into a single column.

```sql
SELECT loc FROM
  (SELECT loc,
          DECODE(LAG(loc, 1, NULL)  OVER (ORDER BY start_date), 'BATOCHE', 1, 0) +
          DECODE(LAG(loc, 2, NULL)  OVER (ORDER BY start_date), 'BATOCHE', 1, 0) +
          DECODE(LEAD(loc, 1, NULL) OVER (ORDER BY start_date), 'BATOCHE', 1, 0) +
          DECODE(LEAD(loc, 2, NULL) OVER (ORDER BY start_date), 'BATOCHE', 1, 0) near_batoche
     FROM battles)
 WHERE near_batoche > 0;

LOC
--------------------
FISH CREEK
CUT KNIFE
FRENCHMAN'S BUTTE
LOON LAKE
```
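The DECODE sum is just counting how many of the four offsets (one or two rows back, one or two rows ahead) land on the given row. A Python sketch of the same idea (illustrative only):

```python
# battle locations from the sample data, ordered by start_date
battles = ['DUCK LAKE', 'FORT PITT', 'FISH CREEK', 'CUT KNIFE',
           'BATOCHE', "FRENCHMAN'S BUTTE", 'LOON LAKE']

def lag(i, n):   # value n rows back, or None (LAG's NULL default)
    return battles[i - n] if i - n >= 0 else None

def lead(i, n):  # value n rows ahead, or None (LEAD's NULL default)
    return battles[i + n] if i + n < len(battles) else None

# near_batoche > 0: at least one of the four offsets hits BATOCHE
near = [battles[i] for i in range(len(battles))
        if any(f(i, n) == 'BATOCHE' for f in (lag, lead) for n in (1, 2))]
```

Each additional row you want means two more offsets to check, which is why the SQL version gets bigger and uglier as the neighbourhood grows.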

Depending how many rows you want, this query will get bigger and uglier, so you may want to revert to using the RANK example instead. If you like the looks of LAG and LEAD, Tim Hall has a good introduction article worth reviewing.

**Performance**

The big question, I suppose, is the performance implications of each approach. Granted, I have a small amount of data, but here are the differences I observed:

ROWNUM example, 2 nearest rows:

Slowest of the three; it had more parses, executes, and queries than the other two combined. The query came in three pieces: the first two had full table accesses, and the last had a nested loop with a table access, a sort, and a view.

RANK example, looking 1 or 2 rows:

No real difference between 1 or 2 rows. Slower than LAG/LEAD, with double the parses, executes, and fetches. Came in two pieces: the first had a full access; the second had 2 nested loops, 2 full accesses, and 2 sorts, including one that was cartesian (49 rows).

LAG/LEAD example, looking 1 or 2 rows:

No real difference between 1 or 2 rows. The fastest of the three, with the lowest parse, execute, and fetch counts. Came in two pieces, both of which had a full access; the second also had a sort and a view.

In terms of performance, it looks like LAG/LEAD wins, on this sample data.

Sample data:

```sql
CREATE TABLE battles (
  loc        VARCHAR2(20),
  start_date DATE,
  canadian   VARCHAR2(20),
  rebel      VARCHAR2(20)
);

-- explicit date formats, so the inserts do not depend on your NLS_DATE_FORMAT setting
INSERT INTO battles VALUES ('DUCK LAKE',  TO_DATE('26-03-1885', 'DD-MM-YYYY'), 'Crozier',   'Dumont');
INSERT INTO battles VALUES ('FORT PITT',  TO_DATE('15-04-1885', 'DD-MM-YYYY'), 'Dickens',   'Big Bear');
INSERT INTO battles VALUES ('FISH CREEK', TO_DATE('24-04-1885', 'DD-MM-YYYY'), 'Middleton', 'Dumont');
INSERT INTO battles VALUES ('CUT KNIFE',  TO_DATE('02-05-1885', 'DD-MM-YYYY'), 'Otter',     'Poundmaker');
INSERT INTO battles VALUES ('BATOCHE',    TO_DATE('09-05-1885', 'DD-MM-YYYY'), 'Middleton', 'Dumont');
INSERT INTO battles VALUES ('FRENCHMAN''S BUTTE', TO_DATE('28-05-1885', 'DD-MM-YYYY'), 'Strange', 'Wandering Spirit');
INSERT INTO battles VALUES ('LOON LAKE',  TO_DATE('03-06-1885', 'DD-MM-YYYY'), 'Steele',    'Wandering Spirit');
```

Bonus: My favourite recent blog article was James Koopmann's recent write-up on storing Word Documents in Oracle.