Friday, October 02, 2009

F*in Percentiles in SQL

The thousands of readers of this world class blog recognize my affinity for math and unconditional love for those who proclaim it's value. This week was a banner week in this department as I was able to dedicate portions of several days to working with percentiles and debating the utter failures of various methods with people who care only for equations, not if they make any logical sense. All work fairly well enough with large datasets, each having their quirks. Unfortunately for me, many times I am asked to calculate p90 on a 4 record set. The interpolative options skew significantly to the outer bands in these cases IMNSHO. Excel method is the closest to representing the actual dataset.

In case any developers are out there who are asked to create percentiles in sql, and have reached the end of their rope by finding this heinous blog, here are some of your options. Wish me luck, I am guessing that next week may involve creating the other 4 sas methods.

This is my least favorite. It appears to be known as SAS v4 or MiniTab
create PROCEDURE [dbo].[sp_CalcPercentile_Minitab]
@PVal float,
@PType varchar(255),
@JobID int,
@ReturnVal float OUTPUT
AS
BEGIN
SET NOCOUNT ON;
create table #ptemp
(id int IDENTITY,
val decimal(8,2)
)
declare @RecordCount as int
declare @TargetRow as float
declare @Difference as decimal(8,2)
declare @m as decimal(8,2)
declare @val as float
declare @val2 as float
declare @intcheck as int
declare @sqlCommand as varchar(max)

--set the records up since identity is needed.
insert into #ptemp (val) select x from table Y
set @RecordCount = @@identity

set @TargetRow = (@RecordCount + 1) * @Pval
set @intcheck = convert(int, @TargetRow)

If @TargetRow = convert(float, @intcheck) begin --no need to make up numbers, get crazy and use an actual value
set @ReturnVal = (select val from #ptemp where id = @intcheck)
end
else begin --time make stuff up
set @val = (select val from #ptemp where id = @intcheck)
set @val2 = (select val from #ptemp where id = @intcheck + 1)
set @Difference = @val2 - @val
set @m = @targetrow - @intcheck
set @Difference = @Difference * @m
set @returnVal = @val + @Difference
end
end


Version 2 is known as the excel method. Math nerds hate this. Perhaps because it makes the most sense and allows for a way to check your work?

create PROCEDURE [dbo].[sp_CalcPVal_Excel]
@PVal float,
@ReturnVal float OUTPUT
AS
BEGIN
SET NOCOUNT ON;

DECLARE @percentile FLOAT
SELECT @percentile = @pval;
WITH emp_sal(base, prev_rank, curr_rank, next_rank)
AS
(
SELECT base,
(ROW_NUMBER() OVER ( ORDER BY base ) - 2.0) / ((SELECT COUNT(*) FROM #somewhere) - 1) [prev_rank],
(ROW_NUMBER() OVER ( ORDER BY base ) - 1.0) / ((SELECT COUNT(*) FROM #somewhere) - 1) [curr_rank],
(ROW_NUMBER() OVER ( ORDER BY base ) + 0.0) / ((SELECT COUNT(*) FROM #somewhere) - 1) [next_rank]
FROM #somewhere
)
SELECT @ReturnVal =
CASE
WHEN t1.base = t2.base THEN t1.base
ELSE t1.base + (t2.base - t1.base) * ((@percentile - t1.curr_rank) / (t2.curr_rank - t1.curr_rank))
END
FROM emp_sal t1, emp_sal t2
WHERE (t1.curr_rank = @percentile OR (t1.curr_rank < @percentile AND t1.next_rank > @percentile))
AND (t2.curr_rank = @percentile OR (t2.curr_rank > @percentile AND t2.prev_rank < @percentile))

end


My personal favorite is this one (seriously, no sarcasm here). I like it because it uses actual numbers to slot results, not "interpreted" numbers which by definition skews results. I realize it is slightly less useful, but at least it isn't a lie.

alter PROCEDURE [dbo].[sp_CalcPVal_ActualPercentile]
@PVal float,
@ReturnVal float OUTPUT
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;

Set @ReturnVal = (select max( expr1 )from (select top (@pval) percent val as expr1 FROM #somewhere where base is not null order by val asc) as expr2)

END

No comments: