What are the Page Life Expectancy (PLE), where to start?

451 Asked by Aalapprabhakaran in Salesforce , Asked on Aug 13, 2021

I have inherited a SQL server {2012 (SP3), but this question is intended to be generic} we are using SCOM to monitor it. Previously I was getting an alert once or twice a month for PLE < 300>There are multiple blog posts about PLE, a few tools you can get to monitor it, and many differing opinions about what is good, bad or indifferent. In the end there are a lot of variables. No solutions are one size fits all. Low PLE is not a problem so much as it is a symptom, with lots of potential causes, and related measures to consider.{this paragraph might not add value to the question, I am open to removing it} I think everyone can agree that PLE falling to 299 once a month during an overnight report creation, is a symptom that does not need to be addressed (assuming the report completes before business hours). Most can also agree that PLE consistently at 350, is not good. There are a handful of cause to look at before making hardware change, with queries and index being near the top. After reading about a dozen blog posts about PLE. I have tried to narrow down the key symptoms to get a good picture of what is going on. The query below is what I came up with. It gives values for 4 Buffer Manager items that interconnect with PLE

'Page life expectancy' 'Free list stalls/sec' 'Lazy writes/sec' 'Buffer cache hit ratio'

SELECT [object_name], [counter_name], [cntr_value] FROM sys.dm_os_performance_counters -- https://docs.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-os-performance-counters-transact-sql WHERE [counter_name] = 'Page life expectancy' --if multiple NUMA on a server should return multiple Nodes, OR [counter_name] = 'Free list stalls/sec' -- Number of requests per second that had to wait for a free page https://docs.microsoft.com/en-us/sql/relational-databases/performance-monitor/sql-server-buffer-manager-object OR [counter_name] = 'Lazy writes/sec' --Flushes of dirty pages before a checkpoint runs. OR [counter_name] = 'Buffer cache hit ratio' --percentage of pages found in the buffer cache without having to read from disk you want this ratio to be high Order by [counter_name] DESC, [object_name]; Additionally if you are looking at Lazy Writes on an inherited server you should check Recovery Interval EXEC sp_configure @configname='recovery interval (min)'; --The 'config_value' default 0 indicates SQL is applying Checkpoints completely automatically https://docs.microsoft.com/en-us/sql/database-engine/configure-windows/configure-the-recovery-interval-server-configuration-option If this first query does not return values: SELECT COUNT(*) FROM sys.dm_os_performance_counters; --If no values from the firs query, an value of 0 here indicates a seperate issue https://docs.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-os-performance-counters-transact-sql I have a pretty good idea what all these values represent, and how they work together. I have included comments and sources in my code above.

My question is two part

Is my list of buffer items/values above adequate for a starting place when examining PLE? (i.e. values that will always be helpful to consider together, should something excluded or included)How to put the values in good context with each other? (i.e. there is a good answer here saying "Check the Free List Stalls/sec value as well. If above 2, consider adding memory to the server" while the body of the answer is helpful, I don't think a value of 2 for 'Free List Stalls/sec' is a problem on most instances)

NOTE: This question is not about solving the PLE issue, it is about how/where to start looking when assessing the symptoms. Your Doctor checks your Pules, Blood Pressure, Respirations, and Temperature at the start of every exam.Edit 4/13/2018; Attempt to Clarify This is not about knee jerk reactions like checking indexes or waits. This is about identifying other native SQL performance data that should always be examined with PLE. PLE is one of the Buffer Management Objects, what other Buffer Management Objects or Performance Counters should or should not always be part of queries when you really do want to look at buffer management?

Answered by Anusha Acharya

You basically asked, "What should I do when Page Life Expectancy changes?"

My answer: nothing. I don't start by looking at Page Life Expectancy. That metric made sense in the SQL Server 7/2000 days when it was all that we had, but today, in 2018, we can do better.

Start by looking at wait stats - that tells you what SQL Server is waiting on.

I don't care whether PLE is 300 or 3,000 - tell me what you're WAITING on, SQL Server, and then I'll go troubleshoot that metric.

My personal favorite way to check waits is to use the open source sp_BlitzFirst (disclaimer: I wrote it.) By default, it takes a 5-second sample of your server's metrics and gives you a few guesses as to why it's slow right now.

Because you like writing long questions, you'll probably also like these:

sp_BlitzFirst @SinceStartup = 1;

The first result set gives you your waits since startup, and:

sp_Blitz @ExpertMode = 1, @Seconds = 60;

Takes a longer sample, and tells your waits over that time range.

Wait stats can be kind of cryptic, so next to every wait type, I link to the SQLskills wait stats repository for that wait type. You can just copy/paste out the name of your top wait type, go to their site, and learn more about what causes that wait and how to fix it. If PLE is dropping due to queries reading a lot of data pages from disk, for example, you might see PAGEIOLATCH% wait types. If it's dropping due to queries getting huge memory grants, you might see RESOURCE_SEMAPHORE. If PLE isn't the problem, then you'll see different wait types altogether.

What are the Page Life Expectancy (PLE), where to start?

Your Answer