| Executive Summary:
Percentiles and percentile ranks are values used in statistical analysis of data; you can easily use T-SQL to implement statistical calculations such as percentile and percentile rank for Microsoft SQL Server database tables. |
Percentiles and percentile ranks are values
used in statistical analysis of data. Given
a set of values representing certain observations
or scores (e.g., students’ marks in an exam),
a percentile is a value below which a certain
percent of the scores fall. For example, for the
following exam scores: 40, 55, 70, 70, 70, 75,
75, 80, 80, 95, and 100, the 50th percentile (also
known as the median) is the mark below which
50 percent of the scores fall (in this case, 75). The
25th percentile (also known as the first quartile)
is 70, and so on.
A percentile rank is the percent of values in
a certain set of observations or scores that are
lower than a certain value. For example, using
the same scores, the percentile rank of the score
75 is 50 percent because 50 percent of the scores
are lower than 75. The percentile rank of the
score 80 is 70 percent, and so on.
After I explain how to calculate percentile
ranks using T-SQL, I’ll show you how to calculate
percentiles. In my examples I’ll use a table
called Marks. To follow my demonstration,
you’ll need to run the code in Listing 1 to create
the Marks table in the tempdb database and
populate it with sample data.
Note that no standard algorithm exists for
calculating percentiles. I use algorithms similar
to the ones used by Microsoft Excel. If you want
to implement other algorithms, you’ll probably
need to adjust my techniques.
Percentile Ranks
As I already mentioned, a percentile rank is the
percent of values from a certain data set (call it
S) that are smaller than a certain value (call it v).
The PERCENTRANK(S, v) function in Excel
uses different calculations based on whether the
given value appears in the given data set. If it does,
the function uses the following formula: (rnk-1)/
(cnt-1), where rnk represents the rank of the value
(number of values in S that are smaller than v, plus
1) and cnt represents the count of values in S.
The implementation of the percentile rank
calculation for a value that appears in the set is
simple; all you need is the rank of the value and
the count of the values in the set. The rank can
be calculated in T-SQL using the RANK function,
and the count can be calculated using the
COUNT function. The code in Listing 2 calculates the percentile rank of each mark
from the Marks table, generating the output
shown in Table 1.
The query that defines the common table
expression (CTE) called MarksRnkCnt
calculates, for each mark, the rank of
the mark (column rnk) using the RANK
function and the COUNT of rows in the
table (column cnt) using the COUNT
function with the OVER clause. The outer
query implements the formula for percentile rank using the expression
1.*(rnk-1)/(cnt-1). The use of
1. (one dot) here is to cause
implicit conversion of the
operands of the expression
to numeric values, otherwise
T-SQL would use integer
division between the integer
operands.
If the given value doesn’t
appear in the set, the PERCENTRANK
function in Excel interpolates to return the
result. In such a case, the formula the function
uses is:
pctrnk_smaller_val
+ (v - smaller_val) /
(larger_val - smaller_val)
* (pctrnk_larger_val
- pctrnk_smaller_val)
where v represents the
given value, smaller_val
and larger_val represent
the two closest values
from the data set that are
smaller and larger than
v, and pctrnk_larger_val
and pctrnk_smaller_val
represent the percentile ranks
of those two values. As an
example, 85 is a mark that
doesn’t appear in the Marks
table. The two closest marks
smaller and larger than 85 are
80 and 95, and their percentile ranks are 0.7 and
0.9 (see Table 1). The percentile rank of the mark
85 is therefore calculated as:
0.7 + (85 - 80) / (95 - 80) * (0.9 - 0.7) = 0.766667
This means that with interpolation, 76.6667 percent
of the marks are lower than the mark 85.
To add the interpolation logic to the calculation,
you first need to pair adjacent marks from
the Marks table, and calculate their percentile
ranks. This is achieved with the code in Listing
3, generating the output shown in Table 2.
The code that defines the CTE called Marks
RnkCnt calculates a rank, a dense rank, and
a count for each mark from the Marks table.
The code that defines the CTE called PctRanks
queries MarksRnkCnt, calculates the percentile
ranks as I previously explained, and removes
duplicate rows. The dense rank becomes a row
number after the removal of duplicates. The
outer query joins two instances of PctRanks
aliased as Cur and Nxt to pair adjacent marks
and their respective percentile ranks.
The code in Listing 4 shows how to add logic
that for an input value (stored in the variable
@mark) calculates the percentile rank of the
input. The code defines a CTE called PctRank
Ranges based on the last query from Listing
3. The outer query in Listing 4 identifies
the relevant pair of marks and percentile ranks from PctRankRanges. This is achieved with
the predicate:
(@mark > mark_from AND @mark <= mark_to)
OR (rownum = 1 AND @mark = mark_from)
Continue to page 2
Prev. page  
[1]
2
next page