SQL Analytic Functions

Yağmur Bali
5 min readOct 10, 2022

--

If you are doing or planning to do a data-related job, you need to know SQL very well. Especially when analyzing data, we often have to write nested and complex queries. In this article, I will mention analytical SQL functions that you can use to eliminate code confusion. These functions are also called window functions.

First of all, let’s start by explaining the main difference between analytic functions and group/aggregation functions. Group functions manipulate the data in a column and return a single value, these are sum, count, avg, max, and min.

The common difference between group functions and analytical functions is that group functions go through all the data and return a single line result, while analytical functions take the data to be grouped according to rows and return results as many as the number of rows. Analytical functions are frequently used in data warehouses.

As for the analytical function, OVER() is definitely used and its formula is as follows:

Analytic_function_name ([numeric_expression] OVER PARTITION BY column_name ORDER BY column ROWS/RANGE BETWEEN expression)
  • PARTITION BY splits data into groups.
  • ORDER BY allows sorting the rows for each group.
  • ROWS/RANGE BETWEEN allows calculating the data in a specific field.

In this article, I used the data from the order entry and human resources schemas in ORACLE Live SQL.

FIRST_VALUE()

FIRST_VALUE() returns the value of the specified column from the first row of the table.

What are the lowest-priced products in each category?

SELECT CATEGORY_ID, PRODUCT_ID, PRODUCT_NAME, LIST_PRICE, FIRST_VALUE(PRODUCT_NAME) 
OVER (PARTITION BY CATEGORY_ID ORDER BY LIST_PRICE) FIRST_PRODUCT
FROM OE.PRODUCT_INFORMATION;

LAST_VALUE()

LAST_VALUE() is opposite to the FIRST_VALUE. It returns the value of the specified column from the last row of the table. In other words, it returns the last value for the entire dataset and is used like the example for the FIRST_VALUE above.

NTH_VALUE()

You can return not only the first or last value but also the Nth value in a dataset using NTH_VALUE() function. The syntax of it is similar to FIRST_VALUE and LAST_VALUE.

RANK() and DENSE_RANK()

RANK() and DENSE_RANK() calculate the rank of a value in a dataset. The difference between them is ranking using RANK would be as (1,1,3,1,2) while ranking using DENSE_RANK() would be as (1,1,1,2,3). These two are very useful functions to find the highest and lowest value from the table.

What are the income levels of top-3 customers with the highest credit limit?

WITH CUST_LIM AS (
SELECT CUST_FIRST_NAME, INCOME_LEVEL, CREDIT_LIMIT,
RANK() OVER(PARTITION BY INCOME_LEVEL ORDER BY CREDIT_LIMIT DESC) RNK_LIM
FROM OE.CUSTOMERS
)
SELECT
CUST_FIRST_NAME,
INCOME_LEVEL,
CREDIT_LIMIT,
RNK_LIM
FROM CUST_LIM
WHERE RNK_LIM <=3;

LAG() and LEAD()

LAG() is used to get the value in the row before a row in the data set. Its syntax is LAG(column_name, n) OVER() . The default value of n is 1. You can write any number instead of n, for instance, if you want to print the value in the second previous lines, you should write 2 instead of n.

LEAD() fetches data in the following row and returns the fetched value in the current row. Its syntax is LEAD(column_name, n) OVER(). The default value of n is 1. LEAD() is the opposite of the LAG().

Let’s fetch the salary information of the employees whose JOB_ID is IT_PROG.

SELECT EMPLOYEE_ID,
FIRST_NAME,
JOB_ID,
SALARY,
LAG (SALARY, 1) OVER (ORDER BY SALARY DESC) AS PREV_SALARY
FROM HR.EMPLOYEES
WHERE JOB_ID='IT_PROG';

As you can see below, the salary information of the previous line was added next to the salary information of each line.

CUME_DIST()

CUME_DIST() calculates the cumulative distribution of values in the specified partition. The result of it is less than or equal to the current row. This function may be helpful if you would like to get the top n% of the results from the dataset.

What is the percentile of orders for each customer in June 2007?

SELECT 
CUSTOMER_ID,
ORDER_DATE,
ORDER_TOTAL,
ROUND(CUME_DIST() OVER (
PARTITION BY EXTRACT(Month FROM ORDER_DATE)
ORDER BY ORDER_TOTAL DESC
) * 100,2) || '%' cume_dist
FROM
OE.ORDERS
WHERE
EXTRACT(Month FROM ORDER_DATE) = 6
AND EXTRACT(Year FROM ORDER_DATE) = 2007;

NTILE()

NTILE() is basically a function that groups the dataset according to the parameter entered from outside.

To understand how the NTILE() function works, let’s split the data for each department into 2 buckets in employees dataset, and then get the results for department ID 30.

SELECT EMPLOYEE_ID,
FIRST_NAME,
DEPARTMENT_ID,
SALARY,
NTILE(2) OVER (PARTITION BY DEPARTMENT_ID ORDER BY SALARY) AS GROUPED_SAL
FROM HR.EMPLOYEES
WHERE DEPARTMENT_ID=30;

What are the mid-level salaries for each department?

For finding mid-level salaries, we would split data for each department in our employees dataset into 3 buckets and then get results for the bucket equal to 2.

SELECT EMPLOYEE_ID,
FIRST_NAME,
DEPARTMENT_ID,
SALARY,
NTILE(3) OVER (PARTITION BY DEPARTMENT_ID ORDER BY SALARY) AS GROUPED_SAL
FROM HR.EMPLOYEES
WHERE GROUPED_SAL=2;

To sum up, analytical functions prevent writing complex queries, making it easier to conduct target-driven analytic research. Thank you for reading!

--

--

Yağmur Bali
Yağmur Bali

No responses yet