Query plan and statistics

In the last post, we looked at statistics very briefly. Now that we know what statistics are, let's see how they help the optimizer find an optimal plan, and when they don't work well. We will need the table from the last post, so let's create it again. I am using SQL Server 2019; the database compatibility level is 150 (SQL Server 2019).

Create the table

Execute the query below to create a table and indexes that we are going to use.

DROP TABLE IF EXISTS statistics_simple_test;
CREATE TABLE statistics_simple_test (id INT IDENTITY(1,1), a_text NVARCHAR(1000), a_column INT);
GO
INSERT INTO statistics_simple_test (a_text, a_column)
SELECT TOP 1000000 m1.text, 1
FROM sys.messages AS m1 CROSS JOIN sys.messages AS m2
GO

CREATE CLUSTERED INDEX clx_statistics_simple_test ON statistics_simple_test ([id] ASC);
GO
CREATE NONCLUSTERED INDEX ix_statistics_simple_test_bad_parameter ON statistics_simple_test (a_column);
GO

Execute queries

Update the data:

UPDATE statistics_simple_test SET a_column = Id WHERE id = 2
GO

Now we have two different values in the 'a_column' column. Let's update the statistics and look at them:

UPDATE STATISTICS statistics_simple_test WITH FULLSCAN, ALL
DBCC SHOW_STATISTICS (statistics_simple_test, ix_statistics_simple_test_bad_parameter)
GO

Here is the result:
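With a fullscan on a column that holds only two distinct values, the histogram portion of the DBCC SHOW_STATISTICS output should look roughly like this (a sketch of the expected rows, not captured output):

```sql
DBCC SHOW_STATISTICS ('statistics_simple_test', ix_statistics_simple_test_bad_parameter);
-- Histogram (expected, approximately):
-- RANGE_HI_KEY  RANGE_ROWS  EQ_ROWS  DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS
-- 1             0           999999   0                    1
-- 2             0           1        0                    1
```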

Everything is predictable here. Two values: one value (1) in 999,999 rows, the other (2) in 1 row.

1 Plain value

Run the query:

DBCC FREEPROCCACHE
GO
select *
from statistics_simple_test
where a_column = 1
go
DBCC FREEPROCCACHE
GO
select *
from statistics_simple_test
where a_column = 2
GO

We are simply selecting all rows where 'a_column' = 1 and then all rows where 'a_column' = 2. The DBCC command clears the plan cache to prevent SQL Server from reusing a cached plan. If we look at the query plans:

We can see that SQL Server used the statistics successfully: 999,999 rows and 1 row estimated, respectively. A good start!

2 Variables

Change the query a little bit, introducing variables:

DBCC FREEPROCCACHE
GO
declare @a_column int = 1;
select *
from statistics_simple_test
where a_column = @a_column
go
DBCC FREEPROCCACHE
GO
declare @a_column int = 2;
select *
from statistics_simple_test
where a_column = @a_column
GO

Should be the same, right?

No! Variables change the query plan drastically: an index scan in both cases, because SQL Server now estimates 500,000 rows for each query. Why? Because at compile time SQL Server doesn't know the variable's value, so it uses the density of 'a_column' instead of the histogram: 0.5 (density) * 1,000,000 (number of rows) = 500,000.

The fact that the variable's value is hard-coded in the script doesn't help SQL Server make a correct estimate.
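The density SQL Server falls back to can be inspected directly with the documented DENSITY_VECTOR option (a sketch; the exact output layout may vary by version):

```sql
DBCC SHOW_STATISTICS ('statistics_simple_test', ix_statistics_simple_test_bad_parameter)
WITH DENSITY_VECTOR;
-- With two distinct values in a_column, "All density" for the a_column
-- prefix should be 0.5 (1 / number of distinct values)
```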

3 Stored Procedure

Let's introduce a stored procedure and find out how it behaves.

DROP PROCEDURE IF EXISTS dbo.usp_select_rows;
GO
CREATE PROCEDURE dbo.usp_select_rows @a_column INT
AS
BEGIN
SELECT * FROM statistics_simple_test p
WHERE a_column = @a_column
END
GO
DBCC FREEPROCCACHE
GO
exec usp_select_rows 1
GO
DBCC FREEPROCCACHE
GO
exec usp_select_rows 2
GO

The query plan is:

Looks good! The number of rows is estimated correctly.

4 Parameter sniffing

But what if we execute the stored procedure with parameter = 1 without cleaning the cache?

exec usp_select_rows 1
GO

This is not good. Now SQL Server 'thinks' there should be only one row instead of 999,999. This happened because the stored procedure was last compiled with parameter = 2, when the number of rows was 1. SQL Server stored that query plan in the plan cache and will now reuse it for every parameter we pass to the stored procedure. Why is that? We need to check the plan cache:
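One way to inspect the cache is via the documented DMVs sys.dm_exec_cached_plans and sys.dm_exec_sql_text (a sketch; requires VIEW SERVER STATE permission):

```sql
SELECT cp.objtype, cp.usecounts, st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE st.text LIKE '%usp_select_rows%'
  AND st.text NOT LIKE '%dm_exec_cached_plans%';
```

For the stored procedure you should see a single 'Proc' entry whose usecounts grows with each execution, confirming the plan is being reused.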

Only one plan in the cache: the stored procedure itself. The plan is independent of the parameter value and is not recompiled on each execution. This is 'parameter sniffing'. When the number of rows returned by the query differs significantly depending on the parameter, as in our artificial case, it can, and in most cases will, cause performance issues.

5 Table-valued functions

Now let’s look at a table-valued function. Creating one:

DROP FUNCTION IF EXISTS dbo.fn_SelectRows;
GO
CREATE FUNCTION dbo.fn_SelectRows (@a_column INT)
RETURNS TABLE 
AS
RETURN 
(
	SELECT * FROM statistics_simple_test p
	WHERE a_column = @a_column
)
GO

Clear the plan cache:

DBCC FREEPROCCACHE
GO

Executing the first query.

SELECT * FROM dbo.fn_SelectRows(1)
GO

The plan is correct.

Clean the cache and execute two queries:

DBCC FREEPROCCACHE
GO
SELECT * FROM dbo.fn_SelectRows(2)
GO
SELECT * FROM dbo.fn_SelectRows(1)
GO

And again, everything is correct.

What’s in the cache?
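The same plan-cache query as before, pointed at the function (a sketch using the documented DMVs; requires VIEW SERVER STATE):

```sql
SELECT cp.objtype, cp.usecounts, st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE st.text LIKE '%fn_SelectRows%'
  AND st.text NOT LIKE '%dm_exec_cached_plans%';
```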

Two plans, one for each parameter. That's how a table-valued function differs from a stored procedure: the function is not compiled ahead of execution; it is inlined into the calling statement as a parameterized query. It works better than a plain query with a variable, as we saw in section #2.

6 In conclusion

In conclusion, what do we see? The same query, wrapped differently, can perform differently because of different query plans, and sometimes the difference is significant. We need to keep this in mind when doing performance tuning. For example, testing the performance of a stored procedure by running the plain query will not give the correct result. It should be tested as a stored procedure with different parameters because, as we saw, the first parameter used can create a query plan that is not optimal for other parameters. 'Parameter sniffing' can be a significant performance problem, and it should be addressed when it happens.

I'll explain possible ways to fix it in one of the next posts!