Monday 16 December 2013

The Curse of SQL Server Embedded Edition


Help!  The database is writing a log file which has filled up drive C of server XYZMGT02!

Huh?  That isn’t one of our database servers – in fact I’ve never even heard of it!  Not only that, I don’t even have permission to log onto it!  Nuffink to do with me, guv!

It turned out that there was a database involved, sure enough, which is why the DBA team got called. But it wasn't something that we had ever set up. Windows Server Update Services, or WSUS, downloads updates from Microsoft and sends them out to the computers in the corporate network. It runs under a freebie cut-down version of SQL Server called Embedded Edition - SSEE for short - and, not unlike Express Edition, when you want to manage it, the things you need have more often than not been disabled.

The underlying problem in this case was that normally, updates get distributed to the network and can then be purged from the WSUS system. But if for some reason a computer on the network is unavailable, the update cannot be delivered, and therefore it is not purged. Drive F, which contains the WSUS data, had filled up. The software then writes a message to the log on drive C, saying something like:
"Could not allocate space for object 'dbo.tbXml'.'PK__tbXml__0E6E26BF' in database 'SUSDB' because the 'PRIMARY' filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup."

53,881 error messages - all but a dozen say exactly that. Keep on writing that message for long enough, and you fill up 10 GB of drive C, which then grinds to a halt, bringing the whole server down.
Now in an ideal world I would have configured that log so that it gets located somewhere else - drive D has twice the space on it, and even if it filled right up, it wouldn't give the server heart failure.  But as far as I can tell, there is no way to change the destination drive - the edit option has been disabled.  Alternatively I might get SQL Server to send an email message to the WSUS administrator - but email has been disabled too. 

Hmm, tricky.  Let's think about those error logs for a minute.  By default, SQL Server carries on writing an error log until it gets restarted - which might mean forever.   This can mean that the error log gets very large indeed, and slow to open if you ever want to have a look at the contents.   So on most of the servers I work with, I like to create a new log every month, by setting up an agent job to run this:
exec master..sp_cycle_errorlog

exec msdb..sp_cycle_agent_errorlog

That's one for the error log, and one for the agent error log - which of course doesn't exist in SSEE (duh, because it has been disabled).

Again by default, SQL Server keeps the current log, plus the six previous logs.  This seems very sensible  - you are probably never going to want to check further back than six months.  And you can change that default if you do.  
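If you do want to change it, the number of logs SQL Server keeps lives in the instance registry, and you can set it with a quick bit of T-SQL. Here's a sketch, with 12 as an arbitrary example - valid values are 6 to 99:

USE master
GO
-- keep 12 error logs instead of the default 6
EXEC xp_instance_regwrite N'HKEY_LOCAL_MACHINE',
     N'Software\Microsoft\MSSQLServer\MSSQLServer',
     N'NumErrorLogs', REG_DWORD, 12

(It's the same thing as right clicking on SQL Server Logs in Management Studio and choosing Configure - just easier to roll out to lots of servers.)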

But in this case we don't have room on the disk to save all that stuff, and since every error message is in effect identical, we don't really care.  So what I did was set up a scheduled task to cycle the error logs daily.  That way it retains the error messages for the past seven days, and then slings them.

A scheduled task is a Windows option, and not nearly as flexible as SQL Server Agent - but if you can't use Agent, it can come in handy.

So - I created a folder called scripts on drive C.  
I created a text file called Cycle_Errorlog.sql which contains exec master..sp_cycle_errorlog
 
I created a text file called Cycle_Errorlogs.bat which changes to Drive C, goes to the correct directory, and runs SQLCMD with the SQL script above.  Notice that the connection string to the embedded edition is a bit weird - full details here

C:
cd \Program Files\Microsoft SQL Server\90\Tools\binn\
sqlcmd -E -S \\.\pipe\MSSQL$MICROSOFT##SSEE\sql\query -i "c:\scripts\Cycle_Errorlog.sql"


And I set up a scheduled task to run the batch file daily.
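If you prefer the command line to the Task Scheduler GUI, something like this will create it - the task name and the 2am start time are my own choices:

schtasks /create /tn "Cycle SSEE Errorlog" /tr "c:\scripts\Cycle_Errorlogs.bat" /sc daily /st 02:00 /ru SYSTEM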

Three months on, WSUS is still filling up Drive F with updates that can't be deployed, the WSUS Administrator is tearing his hair out, but drive C has plenty of room, and the server isn't crashing. 

Saturday 14 December 2013

Book Review - The Phoenix Project

A parable of life for IT folk, told from the point of view of mild-mannered Bill Palmer who is suddenly promoted out of his comfortable middle-management niche to Vice President of IT Operations. Then everything starts to go wrong.  

The payroll fails. This is a BAD thing. Trying to fix it, they mess up the SAN (storage area network), another bad thing. Bill and his team sit down to create a change management system to stop this from happening in the future. Then the auditors strike - to comply with the rules, they have to do something about a stack of issues six inches high. But they can't do that because the number one priority is Phoenix, which will save the company from bankruptcy (yet another bad thing).


Luckily Bill has the advice of Erik his mentor to fall back on, as well as his common sense. They beat back the dreaded auditors, help Phoenix limp into production, and introduce far better ways of doing things which rapidly overtake Phoenix and leave their competitors struggling in their wake. I say a parable, rather than a novel - the authors want you to behave in a certain way with your IT and so they show the mistakes to avoid and good practices for you to follow. And surprisingly, I rather enjoyed it.

Size Property is NULL so indexing fails

Here’s an oddball story.  As part of the weekly maintenance plan, I rebuild the indexes.  Usually this is fine – even on big databases – they have a whole weekend to sort themselves out in, after all.

But then one day I got a message to say the job had failed:

Executing the query "ALTER INDEX [IX_Events_eventId] ON [dbo].[Events] REBUILD WITH ( PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY  = OFF, ONLINE = OFF )
" failed with the following error: "Could not allocate a new page for database ‘Gandalf01' because of insufficient disk space in filegroup 'PRIMARY'. Create the necessary space by dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.
The statement has been terminated.". Possible failure reasons: Problems with the query, "ResultSet" property not set correctly, parameters not set correctly, or connection not established correctly.


Insufficient disk space?  Thinking that 2 terabytes ought to be enough disk space to turn round in for most DBs, I right clicked on the database and asked for Properties - it told me that the Size property is not available.


One of the joys of SQL Server is that there are almost always at least two ways to skin a cat.  I ran this code:

SELECT SUM(size)*1.0/128 AS [size in MB] FROM [Gandalf01].sys.database_files

It worked and told me the database size (and as I suspected, it wasn't much).

I tried the GUI again and this time the property dialog came up fine and showed me the size, as expected.
I tried my re-index again and it worked. 

I've seen something like this before - the GUI refuses to tell me the database properties because the database owner has somehow got set to NULL. 

EXEC sp_helpdb   --reveals that the database owner is NULL and the GUI refuses to work
EXEC sp_changedbowner [SA] -- changes the owner to SA and the GUI now works.
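If you want to hunt down any other ownerless databases before they bite, this little query lists every database whose owner SID no longer maps to a login:

-- databases with no valid owner
SELECT name, SUSER_SNAME(owner_sid) AS owner_name
FROM sys.databases
WHERE SUSER_SNAME(owner_sid) IS NULL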

Conceivably the owner is someone who has now left - given that Fred has left and his access removed, it doesn't seem unreasonable that his database might no longer have an owner.  But how can it no longer have a size?  I would be interested to know if anyone has an explanation (other than "It's a bug, Jack") 


Friday 13 September 2013

So, Farewell Then, Port 1433

The default port for SQL Server is 1433.  So what port will a hacker breaking in to your system try first?  Got it in one. 

Change the port, goes the theory, and you make life more difficult for a hacker to get at your systems.  It might not prevent a determined attacker, who just starts with port 1 and goes on to 65535 (the theoretical maximum), but it will put off the casual browser.  Discussion here.

It isn't difficult to do, although if you have lots of servers it can be a bit boring.  Here's a quick summary:
  • Log onto server and run SQL Server Configuration manager to set new port; restart services. 
  • Turn on firewall and allow access to that new port.
  • In SSMS registered servers, specify Servername,Port e.g. SQL2012Dev,4321

Here's a step by step procedure with a little bit more detail. 

1)   First of all check which port is currently in use:
exec master..xp_readerrorlog 0, 1, N'Server is listening on' 
GO
The answer will probably be 1433, but just check that your predecessor hasn't done this exercise for you already! 


2)   Log on to the server using Remote Desktop Connection


3)   Run SQL Server Configuration Manager.  There are a couple of places where it specifies the port number.  First select SQL Native Client 11.0 Configuration / Client Protocols / TCP/IP
– and set the default port to your chosen value:



Next one is SQL Server Network Configuration / Protocols / TCP/IP / IP Addresses.  There are six ports - set them all to your chosen value.




Your changes will not work until you stop and restart the SQL Server Services:



4)   The next step is to modify the Windows Firewall so that it allows your chosen port to get through the firewall. On the Administrative Tools menu, select Server Manager, and then Configuration and Windows Firewall. The Firewall should be on, but if it isn't you had better turn it on.

This is what it looks like if the Firewall is OFF:


Right click on Windows Firewall on the left of the screen, and select Properties from the pull down menu that appears.  If the Firewall State is Off, set it to On (recommended) for each of the three profile tabs. 

 

This is what it looks like when the Firewall is ON:
 

5)   Select the Inbound Rules from the menu on the left hand side.  SQL Server is probably close to the top, but you may need to sort the columns. 


In the Properties, select the Protocols and Ports tab, and specify the port of your choice. 


That was the Inbound Rules - now repeat that for the Outbound Rules. 
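Incidentally, if you have lots of servers, the firewall rules can be scripted rather than clicked. Something like this does it - 4321 is the example port, and the rule names are my own:

netsh advfirewall firewall add rule name="SQL Server 4321 in" dir=in action=allow protocol=TCP localport=4321
netsh advfirewall firewall add rule name="SQL Server 4321 out" dir=out action=allow protocol=TCP localport=4321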


6)   Now go back to SQL Server Management Studio, and edit the Registered Server information.  It doesn't need much, just enter a comma after the server name and then the new port number
e.g. SQL2012Dev,4321



7)   And then just to be sure, log on to that server and go back to the code we started with:

exec master..xp_readerrorlog 0, 1, N'Server is listening on'
GO

The answer should be the new port number. 

Saturday 3 August 2013

Check your backups - or else!

You make backups.

You restore backups.

Not all of them, necessarily - but every now and then, restore a backup just to check that it's OK.  Find a location on a test server - maybe even on your own laptop, if it has enough grunt.  Restore it.  Check it - maybe run:

dbcc checkdb(vanilla) with no_infomsgs

It should think for a while and then say Command(s) completed successfully.

If it restores OK and DBCC finds no issues, then you can be reassured that things are probably OK with the backups for that database.  Drop it from the laptop - you don't need it anymore.  
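As for the restore step itself, something along these lines does the trick. The paths and logical file names here are only examples - run RESTORE FILELISTONLY against the backup first to get the real logical names, and use WITH MOVE because your test box won't have the same drive layout as production:

-- restore a production backup onto a test box, relocating the files
RESTORE DATABASE vanilla
FROM DISK = 'D:\Backups\vanilla.bak'
WITH MOVE 'vanilla' TO 'D:\TestData\vanilla.mdf',
     MOVE 'vanilla_log' TO 'D:\TestData\vanilla_log.ldf',
     STATS = 10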

Why "with no_infomsgs"?  If you omit this bit, DBCC Checkdb will come back with page after page of guff - this is OK, that's OK - this sort of thing:

DBCC results for 'Production.BillOfMaterials'.
There are 2679 rows in 20 pages for object "Production.BillOfMaterials".
DBCC results for 'sys.ifts_comp_fragment_610101214_4'.
There are 351 rows in 2 pages for object "sys.ifts_comp_fragment_610101214_4".
On and on it goes

And you have to read through it all  looking for anything which isn't OK.  Sling the stuff which is there "just for information" and life is much simpler.  

What if it is corrupt?
  1. Don't panic
  2. Don't detach the database
  3. Don't restart SQL
  4. Don't just run repair.
  5. Run an integrity check
  6. Afterwards, do a root-cause analysis
This is from Gail Shaw - read more of her words of wisdom here.  In fact save that link in your favourites, memorise it, get it tattooed upon your left buttock - whatever it takes.  You will need it on the day you find a problem.  I once found four corrupt databases.  The previous guy had been running this:

dbcc checkdb(Vanilla) with physical_only

According to Microsoft, this...
"Limits the checking to the integrity of the physical structure of the page and record headers and the allocation consistency of the database. This check is designed to provide a small overhead check of the physical consistency of the database, but it can also detect torn pages, checksum failures, and common hardware failures that can compromise a user's data."
It's quicker too.  But as I say, it missed four corrupt databases.  Gail's advice helped me recover all of them without any data loss.  So - thanks Gail.

Now most of the time you take backups to disk, but you can't keep everything on disk - in fact you probably don't keep more than three days' backup on disk.  More likely you use something like Symantec Backupexec to copy your backups from Disk to Tape overnight.  

You can keep every single backup ever made on tape if you want to, although it will need a lot of tapes and it's probably overkill.  A good compromise is to do something like this:
  • Keep every daily backup for a week
  • Keep your weekly backups for a month (a four week lunar month works for me).
  • Keep each monthly backup for a year.  
Ask yourself - will I ever need a backup that's older than a year?  Probably not, but it's up to you - you know your business, and maybe your business has a legal duty to keep stuff for seven years.  You can keep everything for ever if you want, but it will cost a lot of tapes and management.  You will almost certainly want to store them offsite - firms like Iron Mountain will provide this kind of service - they collect and drop off too.  Tell them Jack sent you (you never know).  

But here's the rub - every now and then you want to check that you can retrieve data from those tapes.  Pick a date - maybe the monthly backup for three months ago.  Pick a database.  And request that tape, get it copied to your test area, and restore the backup.  I tried that today.  Guess what?  There was some sort of problem with the tape.  I asked for Friday the 19th July, a couple of weeks ago now.  But what the tape actually contained was data from March.  For some reason (and we are frantically looking for that reason now) the backups were not overwriting the existing data.

When we find the problem, it will probably be something so trivial it's not true - a missing comma in the command used, perhaps; maybe a worn patch on a $4.99 tape.  And now that we've found it, it won't go wrong again.  But it was found, and it will be fixed, only because I checked.  

So check.  

Wednesday 3 July 2013

Colour Coding Your Servers

Dozens of servers, all with similar names – and you’re new; it’s a recipe for confusion!  Suppose you accidentally update the Production server instead of the Development server when trying out some new code?  That would probably be a Bad Thing, and lead to your colleagues mocking you, not to mention the company maybe going bust.  So, being a DBA, you get a bit paranoid, and check again.  And again, a bit later. 

This is a handy hint for Mr/Ms Paranoia.  Take a look at the list of Registered Servers and pick one of the development boxes.  The Properties look something like this:



But there’s another tab - click on the Connection Properties.  



Tick the box to the left of Use Custom Color, and then click Select.  


This gives you 48 basic colours, plus the chance to define your own custom colours, if you are into interior design.  Don’t get carried away – you only need Red (for Production) and Green (for Development).  Okay, maybe Amber for Test/UAT.  Click on Green for Development, and then OK.  

This takes you back to the Server Registration dialog box.  Just click on SAVE




And now, when you open your development server – ta da!  It’s clearly marked with a nice safe-looking green stripe


Thursday 7 March 2013

Getting Data out of Active Directory and into SQL


Active Directory is a directory in which you can store information about people, things (computers, printers, etc), applications, domains, services, security access permissions, and more. Here's a good introduction.  But unfortunately it's in its own wacky format, and it isn't easy to get at it.  But if you do want to run SQL queries, here's how.  

First job is to set up a Linked Server to the AD server.  That's reasonably straightforward.  


execute  sp_addlinkedserver
       'ADSI',
       'Active Directory Service Interfaces',
       'ADSDSOObject',
       'adsdatasource'



After that you can wave goodbye to Mr and Mrs Straightforward.  Fortunately others have been here before me, and I'm indebted to this article by Brendan Tompkins who explains how to do it.  Here's Brendan's basic query:


SELECT *
FROM OpenQuery(ADSI, 'SELECT title, displayName, sAMAccountName, givenName, telephoneNumber, facsimileTelephoneNumber, sn
FROM
       ''LDAP://DC=DomCon1,
       DC=domain,
       DC=org''
       where objectClass =''User''')


So you select the fields you want from the Active Directory system, and in fact there are at least 40 fields you might want.  Some of them you won't know you want until you look at the data, so unless things are very organised in your company you may want to read in everything, and then decide which ones matter to you by setting up a view.  LDAP stands for Lightweight Directory Access Protocol - an application protocol for querying and modifying directory services, developed at the University of Michigan in the early 1990s. An LDAP directory tree is a hierarchical structure of organizations, domains, trees, groups, and individual units.  I hadn't come across it before, but it is explained in the intro to AD that I linked earlier.

You will probably have at least two domain controllers, in case one fails, and maybe lots more than that.  When you log on, you log on to whichever domain controller happens to be available.  Each one contains a copy of the AD database, and if information on one DC changes (e.g. a user changes their password), it sends a signal to the other domain controllers to begin a pull replication of the data to ensure they are all up to date. This is true for most of the fields you want, but sadly not all.

One weird feature is that you can only read in a thousand records at a time, which might be a challenge if you have 100,000 records in the AD database.  But in that same article, in the comments, there's a chunk of code by Hermann Croucamp to tackle that limit. Hermann loops through the records until they have all been read in.  Without that code I would still be reading in chunks of 1000 records and trying to stick them together with sellotape.

So, I created a temporary table and read all the stuff out of Active Directory into my temp table.  Then I merged it into a permanent table - updating if the record exists, inserting a new row if it doesn't.  Alternatively you could just truncate the permanent table every night and load it up fresh - up to you.  
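For the merge route, the statement looks something like this - a sketch with a staging-table name of my own invention, and just a couple of the forty-odd columns:

-- upsert the AD extract into the permanent table
MERGE dbo.ADUserInfo AS target
USING #ADStaging AS source
   ON target.sAMAccountName = source.sAMAccountName
WHEN MATCHED THEN
   UPDATE SET target.displayName = source.displayName,
              target.userAccountControl = source.userAccountControl
WHEN NOT MATCHED THEN
   INSERT (sAMAccountName, displayName, userAccountControl)
   VALUES (source.sAMAccountName, source.displayName, source.userAccountControl);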

Here's another weird feature.  The lastlogon field is useful because you might want to find the people who never log on to their computer.  Perhaps they have left, or changed their name and been given a new login instead of an update, or maybe they just don't need to use a computer in their work - whatever, if they don't need the login, it can be disabled.  But the lastlogon field, for some wacky reason, is NOT replicated from one domain controller to another.  So if you have two domain controllers - DomCon1 and DomCon2, it may be that someone has never logged on to DomCon1 but they log on to DomCon2 every day.  

Microsoft have actually deprecated Lastlogon, and they recommend you use LastLogonTimestamp instead.  Now this one IS replicated to all the domain controllers.  That's the good news.  The bad news is that with the default replication settings, this only happens every two weeks.  And changing it to replicate every day gives you a nice performance hit, I'm told (although I haven't witnessed this myself, it seems plausible - after all, everything useful has a performance hit).  Which you probably won't want.

So you have a choice - read in LastLogonTimestamp and be aware that it could be up to two weeks out of date, or read in LastLogon from all the domain controllers and use the latest one.  Which is what I did, since I was merging data anyway; I added a clause to read in the Lastlogon date only if it was more recent than the existing one.
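That clause is just a comparison in the update - again a sketch with my own names, and with lastLogon held as the raw number that comes out of AD:

-- only overwrite lastLogon if the incoming value is more recent
UPDATE target
SET target.lastLogon = source.lastLogon
FROM dbo.ADUserInfo AS target
JOIN #ADStaging AS source
  ON source.sAMAccountName = target.sAMAccountName
WHERE source.lastLogon > ISNULL(target.lastLogon, 0)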

Er, sorry.  When I say LastLogon "date", that isn't entirely accurate.  It isn't a date, it's a timestamp.  It can be turned into a date though, and if you read this article it'll tell you how.  What they have done is start counting from the 1st of January, 1601 (why, for goodness sake?) in steps of 100 nanoseconds.  So all you have to do is calculate the number of seconds per day, the number of 100-nanosecond intervals per second, and the number of days since 1601.  Not forgetting leap years.

864000000000 (86400 * 10000000) is the number of 100ns intervals per day,
which is significant as the AD timestamp is the number of 100ns intervals since 1601-01-01 00:00:00.000.

    86400    == number of seconds per day
    10000000 == number of 100ns intervals per second

109207 (299 * 365 + 72) is the offset in days from 1601-01-01 (ANSI 0) to 1900-01-01 (SQL 0).

    299      == number of years between 1601 and 1900
    365      == standard days per year
    72       == number of leap years between 1601 and 1900

Simple, huh?  Fortunately computers are good at this kind of madness, and MWVisa1 (the author of the article) suggests setting up a user defined function to do it.  Follow his advice - do not attempt to do this in your head.  
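For what it's worth, here's a minimal sketch of such a function, using the arithmetic above (the function name is my own):

CREATE FUNCTION dbo.ADTimestampToDatetime (@adTime bigint)
RETURNS datetime
AS
BEGIN
    -- whole days since 1601-01-01, shifted by 109207 days to the SQL zero date of 1900-01-01
    DECLARE @days int = CAST(@adTime / 864000000000 AS int) - 109207
    -- the leftover 100ns intervals, converted to seconds within the day
    DECLARE @secs int = CAST((@adTime % 864000000000) / 10000000 AS int)
    RETURN DATEADD(second, @secs, DATEADD(day, @days, CAST('19000101' AS datetime)))
END

Feed it a raw lastLogon value and it hands back a sensible datetime.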

OK, ready for some more wackiness?  Let's talk about the UAC - the User Account Control.  By the standards of Active Directory, this is simply a number that has been assigned a meaning - a UAC of 512 means Enabled, while 514 means Disabled.  But there are lots of them, so I set up a CASE statement to give them some meaning:

UserAccountControlDescription =
 CASE
   WHEN userAccountControl = '512' then 'Enabled'
   WHEN userAccountControl = '514' then 'ACCOUNTDISABLE'
   WHEN userAccountControl = '528' then 'Enabled - LOCKOUT'
   WHEN userAccountControl = '530' then 'ACCOUNTDISABLE - LOCKOUT'
   WHEN userAccountControl = '544' then 'Enabled - PASSWD_NOTREQD'
   WHEN userAccountControl = '546' then 'ACCOUNTDISABLE - PASSWD_NOTREQD'
   ELSE 'Unknown'
 END

514 is actually the sum of 512 (normal), and 2 (disabled)
528 is 512 (normal) + 16 (lockout)
etc, etc.
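Since it's a bit mask, you can also test the bits directly rather than enumerate every combination. A sketch - and note it assumes you have stored userAccountControl as an int rather than a string:

-- decode the interesting bits of UAC with bitwise AND
SELECT sAMAccountName,
       CASE WHEN userAccountControl & 2 = 2 THEN 'Disabled' ELSE 'Enabled' END AS AccountState,
       CASE WHEN userAccountControl & 16 = 16 THEN 'Locked out' ELSE '' END AS LockoutState
FROM dbo.ADUserInfo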

All this bit arithmetic leads us to this little chunk of Hermann's code:
(!(UserAccountControl:1.2.840.113556.1.4.803:=2))

The exclamation mark is another way of saying NOT.  The long number is a bitwise AND matching rule, so the clause matches records where the disabled bit (2) is not set - in other words, Enabled accounts only.  Take out the !, and you are asking for the Disabled records only.  Take out the whole thing - and it doesn't work at all.  So if you want to read in both Enabled and Disabled records, you need to run the code twice.



So - to put it all together I set up a SQL Agent Job, with one step each for my three domain controllers DOMCON1, 2 and 3.  Then three more steps, this time reading in the Disabled records for each DC.  Then another step to Update UAC Descriptions in my ADUserInfo table.  Set that to run overnight.  (Mine takes less than 10 minutes - your mileage may vary.)

Sounds complicated?  Well, it is - but now I have a SQL table, updated every night, which I can query to identify people who haven't logged in for ages, people who need a password change, whatever I want to know.


Thursday 3 January 2013

When Maintenance Plans Go Wrong

I wrote in an earlier blog about implementing a maintenance plan.  I promised then to look at how to set up warnings for when things go wrong - and they will!

You can look every day at the history of your SQL Server Agent jobs, and see if any have failed.  The problem, of course, is that there are probably lots of servers, and lots of jobs to look at on each one.  The best way is to get SQL Server Database Mail to send a message if something goes wrong.  Assuming that you have managed to get this up and running, here's how:

Set up an Operator

First of all, set up a DBAteam group in Active Directory, listing everyone on the team, if this hasn't been done already.  (Depending on how your firm works, it may be worth adding the Helpdesk team to that list, although this is probably only as a backup, unless of course you have trained them in how to fix DBA things...)

Here’s a stored procedure that will list all the operators that currently exist. These will probably belong to your own monitoring setup, but you will occasionally find that a message gets sent to a vendor in the event of a problem. You may or may not be happy about this.

-- list existing operators
exec msdb.dbo.sp_help_operator



If your DBATeam is already in there, well and good.  But if not, set up a new Operator by expanding SQL Server Agent, selecting Operators, and then right clicking on New Operator.  If you still use pagers, you can get it to contact your pager.


Or you can do this with a script, which is handy if you have lots of servers.

USE [msdb]
GO
EXEC msdb.dbo.sp_add_operator @name=N'DBAteam',
              @enabled=1,
              @pager_days=0,
              @email_address=N'DBAteam@companyname.co.uk'
GO

Maintenance Plan Jobs

If you have followed my guidelines on implementing a maintenance plan, there are probably a number of maintenance plan jobs.  You can get a list with this handy stored procedure:

-- list jobs
select * from msdb..sysjobs

You will get something like this:
  • DB Admin User DB Maintenance.Hourly Tasks
  • DB Admin User DB Maintenance.DailyTasks
  • DB Admin User DB Maintenance.WeeklyTasks
  • DB Admin System DB Maintenance.DailyTasks
  • DB Admin System DB Maintenance.WeeklyTasks

Right click on the first one, select Properties, then Notifications.  Tick the email box, select DBAteam from the first pulldown menu, and select "When the job fails" from the second.  (The other options are when it succeeds and when it completes - you probably don't want either of these options).

Tick the Write to the event log box, and select "When the job completes".  I don't mind the log containing lots of messages - that's what logs are for.

The dialog box should now look like this:



This can be done with a script, which is probably better if you have lots of jobs on lots of servers.  For further reading, check here.  

EXEC msdb.dbo.sp_update_job
      @job_name = 'DB Admin User DB Maintenance.Daily Tasks'
     , @owner_login_name = 'SA'
     , @notify_email_operator_name = 'DBAteam'
     , @notify_level_email = 2            -- on failure
     , @notify_level_eventlog = 3         -- on completion 

You want it to email the DBAteam when the job fails, and write to the event log when it completes (success or failure).  You could send an email when it completes too, but that will mean you get lots of emails every day, almost all of them saying "the job succeeded".  I think that it's better to report the exceptions.  If a job succeeds, you don't need to do anything, except bask in the warm glow of a job well done.  If a job fails, that's when you need to spring into action - drop the newspaper, drain the coffee mug, have a gentle scratch and fix the problem.

There's a problem with this approach - if SQL Server Agent fails, or Database Mail fails, you can get a backup failure without a warning email.  If you set it up to send an email on Completion, you might possibly notice the lack of emails.  But chances are that you would not notice one missing email out of fifty, so my advice is to be aware of this possibility and check the job history every now and then.
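As a belt-and-braces measure, here's a quick query to run against the job history from time to time - it lists the steps that have failed in the past week:

-- job steps that failed in the last 7 days
SELECT j.name, h.step_name, h.run_date, h.run_time, h.message
FROM msdb.dbo.sysjobhistory h
JOIN msdb.dbo.sysjobs j ON j.job_id = h.job_id
WHERE h.run_status = 0   -- 0 = failed
  AND h.run_date >= CONVERT(int, CONVERT(char(8), DATEADD(day, -7, GETDATE()), 112))
ORDER BY h.run_date DESC, h.run_time DESC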

Is there a better way?  Please let me know.
EDIT: This may help - Heartbeat

Enable Alerting

You've set up Database Mail; you've set up an operator; you've set up a notification in the event of failure.  All you have to do is sit back and wait for the error messages to arrive.  When none come in, you can sit back and bask.

Sadly no.  I once found a job which had been failing for a month without anyone being at all aware of it.   And that's because by default SQL Server doesn't tell you about these problems - you have to enable alerts of this type.  You may think that it would be better if it told you by default - I couldn't possibly comment.

Anyway, right click on SQL Server Agent and select Properties.  Pick Alert System.  Notice that the mail profile is NOT enabled?  <sigh>


OK, tick the box to enable the mail profile; make sure the mail system is set to Database Mail; select a suitable mail profile.

Surely that must be it?  Er, no.

The change you just made won't take effect until the SQL Server Agent service is restarted.  Take a deep breath, and try and unclench your fists!  Put the hammer down.  Put it down now!

Right then.

Right click on SQL Server Agent, and select Stop.  It'll ask you to confirm.  Before you do, you might just make sure that an important job isn't about to kick off...

Right click on SQL Server Agent, and select Start.  And confirm.


Set up a test job

You could now wait for a job to fail.  Assuming that you have done everything right, you should then receive an email to say that it failed, and then you can fix it.  Problem is that you might have missed something.  So rather than screwing up a working job, we'll create a job which will fall over deliberately.

Go to SQL Server Agent, right click on Jobs, and select New Job.

The Name is compulsory - I like to call all my DBA jobs "DB Admin something" but it's up to you.  The Category and the Description are optional, but it's worth taking a few minutes to say what it's about, and initial and date it.  Trust me, in a couple of years' time your future self, or your successor, will be glad you did this.


Now Steps - there's only one, and I call it "Deliberate Error" because that's what comes out on the email message.  And a little bit of code to cause a deliberate error - division by zero will do nicely:


--declare variables
declare @i int
declare @j int
declare @k int

--set values
Set @i = 1
set @j = 0

-- force division by zero
set @k = @i / @j

You don't need to schedule this - you'll probably only need to run it once (let's hope so anyway!)
But you do need to set up the notification - back to the top of the page, or just run this script:


EXEC msdb.dbo.sp_update_job
      @job_name = 'DB Admin Test Job Fail'
     , @owner_login_name = 'SA'
     , @notify_email_operator_name = 'DBAteam'
     , @notify_level_email = 2            -- on failure
     , @notify_level_eventlog = 3         -- on completion 


Run the job - it should fail more or less immediately.



Go and check your inbox for a failure message.  Assuming this arrives safely - well done!  Next time a job fails, it won't be deliberate - it'll be a real issue.  And you can be reasonably confident that you will know about it straight away.