Developing Matt

My Technical Journal

Corruption Survival Techniques

leave a comment »

CRW_0184DevConnect1kimpaulsmall

I attended a class titled ‘Corruption Survival Techniques (aka dbcc checkdb: the definitive guide)’ taught by Paul Randal

99% of the corruptions are caused by the i/o subsystem.  People that don’t monitor for corruption don’t know when they have corruption.  When it is discovered it leads to more data loss than necessary.  I talked to Paul after the class to find out how corruption is manifested in the data  His response: anything.  I didn’t know if it would always manifest itself in corrupted unreadable data or if it could actually look correct.  A recent situation at work caused incorrect data across multiple rows after day X.  There is an index on this particular column and no one knows why the data was suddenly wrong.  In some instances it was just off by a row or two.  In other words, the data looked right but it was wrong.  I suspected at the time that it was due to an expired stored procedure/job being ran against the table, but now I wonder if it was actually corruption.  I’ll let you know what I find out (if I do find out anything) when I get back home.

Run DBCC checkdb (BOL)

  • It is the only way you can force all allocated pages to be read.  it will be able to fix some problems and when it can’t it stops.  Paul was one of the individuals behind this command so if you have corruption questions then check out his blog (linkage above).
  • It is completely online (no locks are made)
  • It stamps when the last time it was successfully ran. dbcc page and dbcccinfo shows this (see his blog)
  • In 2000 it requires 2 runs to determine if it there is actually corruption.  Not a problem after (it uses snapshots now)

What does checkdb do? (for more indepth than this please check out CheckDB from every angle)

  • Primitive checks of critical system tables (it reads each of the clustered indexes as well).  If it cannot run then it can’t run a repair…then you are required to do a backup.
  • Allocation checks
  • Logical checks in system table (linkages between keys and indexes, checking row linkages, etc)
  • CheckCatalog is run (after 2000)
  • if it doesnt’ have any problems with tables it then checks views,  index checks, xml, and rapairs (but only with EXTENDED_LOGICAL_CHECKS_ OPTION IN SQL SERVER 2008

What gets checked for a leaf index page?

  • PAGE AND RECORD AUDITS
  • b-tree checks?  (page linkages, key ranges, partitioning, ey ordering between records)
  • per = record checks (complex columns,computed columns, dat apurity, etc, 1 to 1 mapping with base table row)
    • persisted computed column: it will actually regenerate it to check it
  • Page counts

How to run it

  • by default it will only return the first 200 errors, so use DBCC CHECKDB (DBS) WITH ALL_ERRORMSGS, NO_INFOMSGS   (the last switch actually removes the long drawn out messages)
  • If it takes longer it found corruption (check log for message 5268 for sql server 2005 sp2 onwards to see if it is rescanning some data)
  • Wait for it to complete.

How often should I run it?

  • It depends on the the stability of io subsystem.  It depends on how comfortable you feel about your OS.  Seems like 1x weekly would satiate any discomfort I have.  The quicker you find it the less data loss.

How do you tell something went wrong?

  • you set up a regular job to run dbcc checkdb  it will fail the agent job if it fails.  use with tableresults to get all the output (as the agent doesn’t show them all)
  • example:  dbcc checkdb(northwind) with no_infomsgs, all_errormsgs,tableresults
  • 825: look for that error (or alert, rather) in the buffer pool (see creating alerts for agents on his blog)
  • Practice restoring. and plan a restore strategy and tailor your backups to match.
  • Create alerts for severity >= 19, user defined error (flag that the checkdb job ‘failed’)

The output is very complicated (“this is the difficult part”).  Se his blog post for more detail.

  • Did it fail?  If it didn’t complete successfully then it couldn’t even finish.  so you won’t be able to do a repair.  It is a fatal error that causes checkdb to fail.
  • examples of fatal (to checkdb) errors
  • 7984 – 7988: corruption in critical system tables
    8967: invalid states within CHECKDB itself
    8930: corrupt metadata in the database such that CHECKDB
    could not run
    See ‘Understanding DBCC Error Messages’ in the BOL for
    DBCC CHECKDB for more details

You found the corruption.  What next?

  • Are the corruptions only in nonclustered indexes?  if the recommended repair level is repair_rebuild then yes it is a nonclustered index.  If all of the ids in the errors are all greater than 1?  Also, the answer is that it is a nonclustered index (tip: if you pipe out the errors to a table this is easier to determine).  The rebuild might not work, so check the OS.  His example was that a reorg didn’t work but a rebuild did.
  • Was there an un-repairable error found?  Did checkdb fail?  If so you will need to restore from backup.  that is your only choice.  (i think there is a way to check backup logs for errors.)
  • page header corruption (8909, 8938, 8939)?  just to delete the record
  • 8992 is a metadata mismatch (data doesn’t match datatype.  you might see this when bringing in legacy systems due to the difference between bulkcopy from sql2000 and sql 2005 (2005 more strongly enforced all datatypes)
  • He went through an example where he looked at the page level of the data using traceon and found the key and updated the bad data. 
  • if you are restoring system databases you might have to start the server in single user mode. and you migth need to log in as admin to view certain system databases: admin:databaseserver

Things people do that they shouldn’t:

  • Restart server (pointless)
  • Running repair (last resort option that will cause data loss)
  • Rebuilding transaction log(last resort option that will cause data loss)
  • Detach: many times you won’t be able to re-attach and then things get real ugly (he had a workaround, check his blog…but it was essentially creating a new database, renaming it, do three michael jackson knee kicks, and something else I can’t remember with a balloon)

So, have fun and download a corrupted database from his blog here.  Thanks again Paul for the great information.

Advertisements

Written by matt

March 25, 2009 at 8:36 am

Posted in DevConnect2009, Sql Server

Tagged with

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: