Very simple about statistics

Very simple about statistics

Prepare the table

Statistics are vital for the SQL Server optimization process. They are playing a critical role in the process of finding a correct query plan. To assume what operation to use, SQL Server needs to know how many rows the query returns. For example, if the query selects all the rows from a table, the scan probably is more appropriate and faster than seek + lookup.
To make this assumption, SQL Server needs to rely on up-to-date statistics. To keep it up-to-date on DB level SQL Server use the ‘Auto Update Statistics’ option. It is better always to keep this option True. Nevertheless, it doesn’t mean that nothing else can and should be done. You can find many recommendations to execute ‘update statistics’ more often manually, and it makes sense to do so, at least for some intensively used tables.
This post will show how statistics look, the meaning of numbers in the statistics, and how it changes depending on the values. Also, I will show an example when ‘auto update statistics’ doesn’t work and explain why.
I used the ‘StackOverflow‘ database in compatibility level 2019, but you can use any other database because we’ll create a new table.
Let’s create a table:

CREATE TABLE statistics_simple_test (id INT IDENTITY(1,1),a_text NVARCHAR(1000), a_column INT)
GO

populate it with 1,000,000 rows:

INSERT INTO statistics_simple_test(a_text,a_column)
SELECT TOP 1000000 m1.text,1 
FROM sys.messages AS m1 CROSS JOIN sys.messages AS m2
GO

Now we have a table with 1,000,000 rows and ‘a_column’ = 1 in all of them.The last thing is to create indexes:

CREATE CLUSTERED INDEX clx_statistics_simple_test ON statistics_simple_test ([id] ASC);
GO
CREATE NONCLUSTERED INDEX ix_statistics_simple_test_bad_parameter ON statistics_simple_test (a_column);
GO

It’s time to update statistics and look into it.

UPDATE STATISTICS statistics_simple_test WITH FULLSCAN,ALL
DBCC SHOW_STATISTICS(statistics_simple_test,ix_statistics_simple_test_bad_parameter)
GO

What the statistics consist of

Here is the result of the last query:

Let’s break down these 3 tables column by column.STAT_HEADER (First table)

  • Name of the object (ix_statistics_simple_test_bad_parameter)
  • When updated last time (Apr 15 2021 12:17PM)
  • Number of rows in the object = 100000
  • The number of rows used to calculate statistics =100000 because we used the FULLSCAN option in UPDATE STATISTICS.
  • The number of steps in the histogram = 1. It is the number of rows in the HISTOGRAM table.
  • Density = 1/distinct values. = 0. In this case no distinct values excepting RANGE_HI_KEY.
  • The average number of bytes per value = 8. This value includes clustered index key.
  • Reflects the fact of creating ‘string summary statistics’ = NO.
  • NULL for non-filtered statistics
  • The number of unfiltered rows = 1000000 because no filter applied.
  • Persisted sample percentage used for statistics updates when a sample wasn’t provided = 0

DENSITY_VECTOR

  • Density = 1/distinct values for all involved columns (see below) = 1 for ‘a_column’ because all values are the same and 0.000001 for a combination of ‘a_column’ and ‘id’ because ‘id’ is unique.
  • The average number of bytes per value = 4 and =8 bytes, respectively.
  • Columns’ names

HISTOGRAM

  • The upper value of the range = 1
  • Number of rows (estimated) in the range including upper value = 0
  • Number of rows (estimated) in the range = Upper Value = 1000000
  • Number of rows (estimated) in the range <> Upper Value = 0
  • Equal RANGE_ROWS /DISTINCT_RANGE_ROWS if DISTINCT_RANGE_ROWS <> 0 or equals 1 otherwise = 1

everything is clear in this example since it is only one column and only one value for all rows.Now execute the query to add one more value to the a_column.

UPDATE statistics_simple_test SET a_column = Id WHERE id = 2

If we recheck statistics, we won’t see any changes. Why? It is simple. SQL Server cannot update statistics after every single change. There are rules that the process of auto-updating statistics follows. It eliminates the overhead of calculating statistics after each update. At the same time, statistics do not always have actual data. Now update statistics and check what changed.

We can see that number of steps is changed to reflect a new value = 2. Also changed density for a_column = 0.5.Let’s update the table even more.

UPDATE statistics_simple_test SET a_column = Id WHERE id <= 500000

Here are the new statistics.

An example used in this post is very simple and, in most cases, doesn’t reflect real-life situations. But it helps to understand how statistics generate and what data is inside them.

See you in the next post!


StackOverflow databases can be downloaded from here, but the original copy is here.