Thursday 7 March 2013

Getting Data out of Active Directory and into SQL


Active Directory is a directory in which you can store information about people, things (computers, printers, etc), applications, domains, services, security access permissions, and more. Here's a good introduction.  But unfortunately it's in its own wacky format, and it isn't easy to get at it.  But if you do want to run SQL queries, here's how.  

First job is to set up a Linked Server to the AD server.  That's reasonably straightforward.  


execute  sp_addlinkedserver
       'ADSI',
       'Active Directory Service Interfaces',
       'ADSDSOObject',
       'adsdatasource'



After that you can wave goodbye to Mr and Mrs Straightforward.  Fortunately others have been here before me, and I'm indebted to this article by Brendan Tompkins who explains how to do it.  Here's Brendan's basic query:


SELECT *
FROM OpenQuery(ADSI, 'SELECT title, displayName, sAMAccountName, givenName, telephoneNumber, facsimileTelephoneNumber, sn
FROM
       ''LDAP://DC=DomCon1,
       DC=domain,
       DC=org''
       where objectClass =''User''')


So you select the fields you want from the Active Directory system, and in fact there are at least 40 fields you might want.  Some of them you won't know you want until you look at the data, so unless things are very organised in your company you may want to read in everything, and then decide which ones matter to you by setting up a view.  LDAP stands for Lightweight Directory Access Protocol – an application protocol for querying and modifying directory services developed at the University of Michigan in the early 1990s. An LDAP directory tree is a hierarchical structure of organizations, domains, trees, groups, and individual units.  I've never heard of it myself, but it is explained in the intro to AD that I linked earlier.

You will probably have at least two domain controllers, in case one fails, and maybe lots more than that.  When you log on, you log on to whichever domain controller happens to be available.  Each one contains a copy of the AD database, and if information on one DC changes (e.g. a user changes their password), it sends a signal to the other domain controllers to begin a pull replication of the data to ensure they are all up to date. This is true for most of the fields you want, but sadly not all.

One weird feature is that you can only read in a thousand records at a time, which might be a challenge if you have 100,000 records in the AD database.  But in that same article, in the comments, there's a chunk of code by Hermann Croucamp to tackle that limit. Herman loops through each record until they have all been read in.  Without that code I would still be reading in chunks of 1000 records and trying to stick them together with sellotape.

So, I created a temporary table and read all the stuff out of Active Directory into my temp table.  Then I merged it into a permanent table - updating if the record exists, inserting a new row if it doesn't.  Alternatively you could just truncate the permanent table every night and load it up fresh - up to you.  

Here's another weird feature.  The lastlogon field is useful because you might want to find the people who never log on to their computer.  Perhaps they have left, or changed their name and been given a new login instead of an update, or maybe they just don't need to use a computer in their work - whatever, if they don't need the login, it can be disabled.  But the lastlogon field, for some wacky reason, is NOT replicated from one domain controller to another.  So if you have two domain controllers - DomCon1 and DomCon2, it may be that someone has never logged on to DomCon1 but they log on to DomCon2 every day.  

Microsoft have actually deprecated Lastlogon, and they recommend you to use LastLogonTimestamp instead.  Now this one IS replicated to all the domain controllers.  That's the good news.  The bad news is that if you have the default settings for replication set, this only happens every two weeks.  And changing it to replicate every day gives you a nice performance hit, I'm told (although I haven't witnessed this myself, it seems plausible - after all, everything useful has a performance hit)..  Which you probably won't want.  

So you have a choice - read in LastLogonTimestamp and be aware that it could be up to two weeks out of date, or read in LastLogon from all the domain controllers and use the latest one.  Which is what I did, since I was merging data anyway; I added a clause to read in the Lastlogon date only if it was more recent than the existing one.

Er, sorry.  When I say LastLogon "date", that isn't entirely accurate.  It isn't a date, it's a timestamp.  It can be turned into a date though, and if you read this article it'll tell you how.  What they have done is start counting from the 1st of January, 1600 (why, for goodness sake?) in steps of 100 nanoseconds.  So all you have to do is calculate the number of seconds per day, the number of 100 nanoseconds per second, and the number of days since 1600.  Not forgetting leap years.

    864000000000 (86400 * 10000000) is the number of 100ns intervals per day,
    which is significant as AD timestamp is the number of 100ns since 1601-01-01 00:00:00.000.
   
   :86400       == number of seconds per day
   :10000000 == number of 100ns per second
   
    109207 (299 * 365 + 72) is:
     the offset number of days from 1601-01-01 (ANSI 0) to 1900-01-01 (SQL 0).

   :299      == number of years between 1601 and 1900
   :365      == standard days per year
   :72       == number of leap years between 1601 and 1900 

Simple, huh?  Fortunately computers are good at this kind of madness, and MWVisa1 (the author of the article) suggests setting up a user defined function to do it.  Follow his advice - do not attempt to do this in your head.  

OK, ready for some more wackiness?  Let's talk about the UAC - the User Account Control.  By the standards of Active Directory, this is simple a number that has been assigned a meaning - a UAC of 512 means Enabled, while 514 means Disabled.  But there are lots of them, so I set up a case statement to give them some meaning:

UserAccountControlDescription =
 CASE
   WHEN userAccountControl = '512' then 'Enabled'
   WHEN userAccountControl = '514' then 'ACCOUNTDISABLE'
   WHEN userAccountControl = '528' then 'Enabled - LOCKOUT
   WHEN userAccountControl = '530' then 'ACCOUNTDISABLE - LOCKOUT' 
   WHEN userAccountControl = '544' then 'Enabled - PASSWD_NOTREQD' 
   WHEN userAccountControl = '546' then 'ACCOUNTDISABLE - PASSWD_NOTREQD'

514 is actually the sum of 512 (normal), and 2 (disabled)
528 is 512 (normal) + 16 (lockout)
etc, etc.

Which leads us to this little chunk of Hermann's code:
(!(UserAccountControl:1.2.840.113556.1.4.803:=2)

The exclamation mark is another way of saying <> or not equal to. If all those bits of UAC code do not equal to 2, then the record is Enabled.  Take out the !, and you are asking for the Disabled records only.  Take out the whole thing - and it doesn't work at all.  So if you want to read in both Enabled and Disabled records, you need to run the code twice.



So - to put it all together I set up a SQL Agent Job, with one step each for my three domain controllers DOMCON1, 2 and 3.  Then three more steps, this time reading in the Disabled records for each DC.  Then another step to Update UAC Descriptions in my ADUserInfo table.  Set that to run overnight.  (Mine takes less than 10 minutes - your mileage may vary.)

Sounds complicated?  Well, it is - but now I have a SQL table updated every night, and which I can query to identify people who haven't logged in for ages, people who need a password change, whatever I want to know.