Developing Matt

My Technical Journal

Wait Stats and File Stats by Andrew Kelly part 2

leave a comment »

This is part 2 of a class  taught by Andrew Kelly in the sqlpass 24 hour free training hosted by sql pass.  Wait Stats and File stats are the core of Andrew Kelly’s tools to check performance. 

 WAIT STATS

Definition: Any time sql server has to wait to do something it records the type of wait (wait type) and the time it had to wait in milliseconds (wait time).  In sql server 2005 there are 194 different wait types and 485 different types in 2008.   If it has to read rows…and go to the disk and come back to return your data it will count in milliseconds how long it takes to get this data.

2008 has new waits: PREEMPTIVE waits.  These are waits that are waiting on something else.  So if it is a procedure that is waiting on another procedure somewhere else then it is tracked as a preemptive wait because it is waiting on an external process.

Wait Stats are cumulative from the last time the database server was restarted (they can be manually reset).

 

Here’s how the entire things works. 

This is the view sys.dm_os_wait_stats (in sql 2000 it was dbcc sqlperf).  21: 58:

 AndrewKelly2

When server attempts to do something it goes into a wait state..it starts keeping track at that point, indicated by the red line (T0).  It says I’m waiting on ____(x resource)_____ and then it keeps track of the time in milliseconds.  The T1 wait time (the signal wait time) starts when the resource becomes available and measures how long that resource takes (the data is ready…lets see how long it takes for the data to return). T0’s entire wait time is inclusive of T1’s wait time. 

if you have lots of signal wait time that means you are (in a general sense) cpu bound.  You are waiting on the resource.  But if your overall wait time is more significant than your signal wait time that means you are waiting on the resource to become available.  You are waiting on the disk or i/o of some sort. 

Signal wait time means you are ready to do it but you are waiting for it to be dome…usually this means that you don’t have enough processors to do what you need to do. 

Other points about Wait Stats

Get a snapshot using sys.dm_os_wait_stats dmv.  Get a before and and an after and get the difference between the two (a delta).  Take a snapshot at noon and then a snapshot at 2.  That delta is the amount of waits between noon and two.  To reset the counters you use dbcc sqlperf(‘sys.dm_os_wait_stats’, CLEAR)…although I’m not clear as to why you would want to/need to.

What you are looking for in the end with the delta you can generate a report to show what you are waiting on.

Most common waits :

cxPacket:  the result of inefficient parallel processing…multiple threads going to the same process and is waiting on another thread.  to fix, adjust MAXDOP at server or query level (hyper-threading).  setting it to 1 eliminates parallelism.

lck_m_xx and LCK_M_SCH_xx  results from locking and blocking…it is long running transactions and is due to improper ot lack of indexes, whcih reduces the amount of locking, which removes the amount of blocking.

async_networkio: usually indicates that the client is not absorbing the data as fast as sql server can send it.  It may be a network but most likely it is just the client holding up the server from being able to pass the data.  Client is written poorly or the hardware it is running on is poorly configured or not able to keep up.  Sql has sent the data but client can’t keep up.

Disk related: pageiolatch_xx, io_completion, writelog: these are i/o issues.  You cannot write or read to disk as fast as you need to.  It is a problem with the stored subsystem, not enough disk, a driver, etc.   (writelog waits should be kept as low as possible (if you cannot write transaction a to disk, transaction b has to wait, transaction c has to wait, etc). 

pagelatch_xx (this is different than pageiolatch.  not related to physical io).  Can indicate contention for internal resources other than the bufferpooll  (if you see pagelatch_up this indicates contention  the files, most commonly in tempdb). heaps a lobs can cause latching…heavy inserts into the same pages, or page splits.

demo:  he has a job that every once in a while he inserts file stats to a table and analyzed that data.  He also had a wait stat job that did the same and then he had a stored procedure that gave the delta for two different time periods.

Q/A:

Q. How can i tie wait time aggregate data to individual queries?
A. There are a number of dmvs that can help with this (replaces sysprocesses in 2000)

Q. How often take a snapshot?  
A. When he goes into a situation initially he does it every hour.  After a while just once a day unless there is a problem of course.

He also recommended reading the whitepaper on Performance Tuning Waits Queues by Tom Davidson from Microsoft.  He goes into how to capture the stat and what each statistic is…and explains some of the process as well as some of the wait types that can be ignored (waitfor is one example).  Apparently there is a script that filters some of these out that Davidson has.

Click here or here for that 98 page document.  It looks like a great read so far.

(images used by permission in part1 and part2: thank you!)

Advertisements

Written by matt

September 14, 2009 at 9:54 am

Posted in Sql Performance, Sql Server

Tagged with

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: