Page Life Expectancy (PLE), where to start?
I have inherited a SQL server {2012 (SP3), but this question is intended to be generic} . We are using SCOM to monitor it. Previously I was getting an alert once or twice a month for PLE < 300>
There are multiple blog posts about PLE, a few tools you can get to monitor it, and many differing opinions about what is good, bad, or indifferent. In the end, there are a lot of variables. No solutions are one size fits all. Low PLE is not a problem so much as it is a symptom, with lots of potential causes, and related measures to consider.
{this paragraph might not add value to the question, I am open to removing it} I think everyone can agree that PLE falling to 299 once a month during an overnight report creation, is a symptom that does not need to be addressed (assuming the report completes before business hours). Most can also agree that PLE consistently at 350, is not good. There are a handful of causes to look at before making hardware changes, with queries and index being near the top.
After reading about a dozen blog posts about PLE. I have tried to narrow down the key symptoms to get a good picture of what is going on. The query below is what I came up with. It gives values for 4 Buffer Manager items that interconnect with PLE 'Page life expectancy 'Free list stalls/sec' 'Lazy writes/sec' 'Buffer cache hit ratio'
SELECT [object_name], [counter_name], [cntr_value] FROM sys.dm_os_performance_counters -- https://docs.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-os-performance-counters-transact-sql WHERE [counter_name] = 'Page life expectancy' --if multiple NUMA on a server should return multiple Nodes, OR [counter_name] = 'Free list stalls/sec' -- Number of requests per second that had to wait for a free page https://docs.microsoft.com/en-us/sql/relational-databases/performance-monitor/sql-server-buffer-manager-object OR [counter_name] = 'Lazy writes/sec' --Flushes of dirty pages before a checkpoint runs. OR [counter_name] = 'Buffer cache hit ratio' --percentage of pages found in the buffer cache without having to read from disk you want this ratio to be high Order by [counter_name] DESC, [object_name];
Additionally, if you are looking at Lazy Writes on an inherited server you should check Recovery IntervalEXEC sp_configure @configname='recovery interval (min)'; --The 'config_value' default 0 indicates SQL is applying Checkpoints completely automatically https://docs.microsoft.com/en-us/sql/database-engine/configure-windows/configure-the-recovery-interval-server-configuration-optionIf this first query does not return values:SELECT COUNT(*) FROM sys.dm_os_performance_counters; --If no values from the first query, a value of 0 here indicates a separate issue https://docs.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-os-performance-counters-transact-sql
I have a pretty good idea of what all these values represent, and how they work together. I have included comments and sources in my code above.My question is two-part Is my list of buffer items/values above adequate for a starting place when examining PLE? (i.e. values that will always be helpful to consider together, should something excluded or included) How to put the values in good context with each other? (i.e. there is a good answer here saying "Check the Free List Stalls/sec value as well. If above 2, consider adding memory to the server" while the body of the answer is helpful, I don't think a value of 2 for 'Free List Stalls/sec' is a problem on most instances)
You basically asked, "What should I do when Page Life Expectancy changes?"
My answer: nothing. I don't start by looking at SQL server page life expectancy. That metric made sense in the SQL Server 7/2000 days when it was all that we had, but today, in 2018, we can do better.
Start by looking at wait stats - that tells you what SQL Server is waiting on. I don't care whether PLE is 300 or 3,000 - tell me what you're WAITING on, SQL Server, and then I'll go troubleshoot that metric. My personal favorite way to check waits is to use the open-source sp_BlitzFirst (disclaimer: I wrote it.) By default, it takes a 5-second sample of your server's metrics and gives you a few guesses as to why it's slow right now.
Because you like writing long questions, you'll probably also like these:
sp_BlitzFirst @SinceStartup = 1;
The first result set gives you your waits since startup, and:
sp_Blitz @ExpertMode = 1, @Seconds = 60;
Take a longer sample, and tell your waits over that time range.
If PLE is dropping due to queries reading a lot of data pages from disk, for example, you might see PAGEIOLATCH% wait types. If it's dropping due to queries getting huge memory grants, you might see RESOURCE_SEMAPHORE. If PLE isn't the problem, then you'll see different wait types altogether.