Thursday 3 January 2013

When Maintenance Plans Go Wrong

I wrote in an earlier blog about implementing a maintenance plan.  I promised then to look at how to set up warnings for when things go wrong - and they will!

You can look every day at the history of your SQL Server Agent jobs, and see if any have failed.  The problem, of course, is that there are probably lots of servers, and lots of jobs to look at on each one.  The best way is to get SQL Server Database Mail to send a message if something goes wrong.  Assuming that you have managed to get this up and running, here's how:

Set up an Operator

First of all, set up a DBAteam group in Active Directory, listing everyone on the team, if this hasn't been done already.  (Depending on how your firm works, it may be worth adding the Helpdesk team to that list, although this is probably only as a backup, unless of course you have trained them in how to fix DBA things...)

Here’s a stored procedure that will list all the operators who currently exist. This will probably be your own monitoring service, but you will occasionally find that a message will be sent to a vendor in the event of a problem. You may or may not be happy about this.

-- list existing operators
exec msdb.dbo.sp_help_operator



If your DBATeam is already in there, well and good.  But if not, set up a new Operator by expanding SQL Server Agent, selecting Operators, and then right clicking on New Operator.  If you still use pagers, you can get it to contact your pager.


Or you can do this with a script, which is handy if you have lots of servers.

USE [msdb]
GO
EXEC msdb.dbo.sp_add_operator @name=N'DBAteam',
              @enabled=1,
              @pager_days=0,
              @email_address=N'DBAteam@companyname.co.uk'
GO

Maintenance Plan Jobs

If you have followed my guidelines on implementing a maintenance plan, there are probably a number of maintenance plan jobs.  You can get a list with this handy stored procedure:

-- list jobs
select * from msdb..sysjobs

  - something like this:
  • DB Admin User DB Maintenance.Hourly Tasks
  • DB Admin User DB Maintenance.DailyTasks
  • DB Admin User DB Maintenance.WeeklyTasks
  • DB Admin System DB Maintenance.DailyTasks
  • DB Admin System DB Maintenance.WeeklyTasks

Right click on the first one, select Properties, then Notifications.  Tick the email box, select DBAteam from the first pulldown menu, and select "When the job fails" from the second.  (The other options are when it succeeds and when it completes - you probably don't want either of these options).

Tick the Write to the event log box, and select "When the job completes".  I don't mind the log containing lots of messages - that's what logs are for.

The dialog box should now look like this:



This can be done with a script, which is probably better if you have lots of jobs on lots of servers.  For further reading, check here.  

EXEC msdb.dbo.sp_update_job
      @job_name = 'DB Admin User DB Maintenance.Daily Tasks'
     , @owner_login_name = 'SA'
     , @notify_email_operator_name = 'DBAteam'
     , @notify_level_email = 2            -- on failure
     , @notify_level_eventlog = 3         -- on completion 

You want it to email the DBAteam when the job fails, and write to the event log when it completes (success or failure).  You could send an email when it completes too, but that will mean you get lots of emails every day, almost all of them saying "the job succeeded".  I think that it's better to report the exceptions.  If a job succeeds, you don't need to do anything, except bask in the warm glow of a job well done.  If a job fails, that's when you need to spring into action - drop the newspaper, drain the coffee mug, have a gentle scratch and fix the problem.

There's a problem with this approach - if SQL Server Agent fails, or Database Mail fails, you can get a backup failure without a warning email.  If you set it up to send an email on Completion, you might possibly notice the lack of emails.  But chances are that you would not notice one missing email out of fifty, so my advice is to be aware of this possibility and check the job history every now and then.

Is there a better way?  Please let me know.
EDIT: This may help - Heartbeat

Enable Alerting

You've set up Database Mail; you've set up an operator; you've set up a notification in the event of failure.  All you have to do is sit back and wait for the error messages to arrive.  When none come in, you can sit back and bask.

Sadly no.  I once found a job which had been failing for a month without anyone being at all aware of it.   And that's because by default SQL Server doesn't tell you about these problems - you have to enable alerts of this type.  You may think that it would be better if it told you by default - I couldn't possibly comment.

Anyway, right click on SQL Server Agent and select Properties.  Pick Alert System.  Notice that the mail profile is NOT enabled?  <sigh>


OK, tick the box to enable the mail profile; make sure the mail system is set to Database Mail; select a suitable mail profile.

Surely that must be it?  Er, no.

The change you just made won't take effect until the SQL Server Agent service is restarted.  Take a deep breath, and try and unclench your fists!  Put the hammer down.  Put it down now!

Right then.

Right click on SQL Server Agent, and select Stop.  It'll ask you to confirm.  Before you do, you might just make sure that an important job isn't about to kick off...

Right click on SQL Server Agent, and select Start.  And confirm.


Set up a test job

You could now wait for a job to fail.  Assuming that you have done everything right, you should then receive an email to say that it failed, and then you can fix it.  Problem is that you might have missed something.  So rather than screwing up a working job, we'll create a job which will fall over deliberately.

Go to SQL Server Agent, right click on Jobs, and select New Job.

The Name is compulsory - I like to call all my DBA jobs "DB Admin something" but it's up to you.  The Category and the Description are optional, but it's worth taking a few minutes to say what it's about, and initial and date it.  Trust me, in a couple of years time your future self, or your successor, will be glad you did this.


Now Steps - there's only one, and I call it "Deliberate Error" because that's what comes out on the email message.  And a little bit of code to cause a deliberate error - division by zero will do nicely:


--declare variables
declare @i int
declare @j int
declare @k int

--set values
Set @i = 1
set @j = 0

-- force division by zero
set @k = @i / @j







You don't need to schedule this - you'll probably only need to run it once (let's hope so anyway!)
But you do need to set up the notification - back to the top of the page, or just run this script:


EXEC msdb.dbo.sp_update_job
      @job_name = 'DB Admin Test Job Fail'
     , @owner_login_name = 'SA'
     , @notify_email_operator_name = 'DBAteam'
     , @notify_level_email = 2            -- on failure
     , @notify_level_eventlog = 3         -- on completion 


Run the job - it should fail more or less immediately.



Go and check your inbox for a failure message.  Assuming this arrives safely - well done!  Next time a job fails, it won't be deliberate - it'll be a real issue.  And you can be reasonably confident that you will know about it straight away.