How to Recover a Journal Wrap Error (JRNL_WRAP_ERROR) and a Corrupted FRS SYSVOL from a Good DC – What option do I use, D4 or D2? What’s the Difference between D4 and D2?

Original: 11/21/2013
Updated 8/30/2014

Errata

Ace here again. I’ working on updating all of my blogs. If you see any inconsistencies, please let email me and let me know.

Prologue

Are you seeing Event ID 13508, 13568, and anything else related to SYSVOL, JRNL_WRAPS, or NTFRS?

Note – I will not address Event ID 2042 or 1864. That’s an issue with replication not working beyond the AD tombstone. If you are seeing them, you’re best bet is to forcedemote the machine, run a metadata cleanup, and re-promote it, and make sure you configure your firewall and/or AV to allow replication traffic or stop using the ISP’s or router as a DNS address, or disable IP routing and WINS Proxy, to prevent this in the future. And while you’re at it bump up your AD tombstone to 180 days,

As for the NTFRS, after talking to numerous folks whether directly assisting a customer, or through the TechNet forums, there seems to be some confusion associated with how to handle Journal Wrap errors, what caused them, and what are the differences between the D2 and D4 options. I’ll try to quell this confusion in this blog, as well as provide an easy step-step and providing an explanation for the steps, to get out of this error. Note: The steps are from Microsoft KB290762. I just thought to further break it down so a layman will understand them.

Reference KB: Using the BurFlags registry key to reinitialize File Replication Service Replica Sets
http://support.microsoft.com/kb/290762

For Windows 2008/2008 R2/2012/2012 R2 with DFSR

Follow this KB to fix it:

How to force an authoritative and non-authoritative synchronization for DFSR-replicated SYSVOL (like “D4/D2” for FRS)
http://support.microsoft.com/kb/2218556

Backing Up and Restoring an FRS-Replicated SYSVOL Folder
http://msdn.microsoft.com/en-us/library/windows/desktop/cc507518(v=vs.85).aspx 

What Caused the Journal Wrap?

First you have to ask yourself, what caused this error on my DC? What did I do to get here? In a nutshell, JRNL_WRAPS are caused by SYSVOL corruption.

The usual culprit can be a number of things:

  • Abrupt shutdown/restart. I don’t usually see this unless there are power issues in the building with not power protection or UPS battery system.
  • Disk errors – corrupted sectors. This is a common issue with a DC on older hardware.
  • AV not configured to exclude SYSVOL, NTDS and the AD processes. This is the typical culprit I’ve seen in many cases.

Ok, So what do I have to do to fix this?

To get yourself out of this quandary, it’s rather simple. Yea, you might say yea, right, this is not so simple, but it really isn’t that hard. It just requires a little understanding of what you have to do, which is all it’s doing is simply copying a good SYSVOL folder and subfolders from a good DC to the bad DC (the one with the errors.

Basically, you first choose which DC is the good DC to be your “source” DC for the SYSVOL folder. Then you you stop the NTFRS service on all DCs. Yes, NTFRS must to be stopped on all DCs to perform this. Then set the registry key on the good DC and the bad DC. That’s it. The process will take care of itself and reset the keys back to default after it’s done.

  • If you only have one DC, such as an SBS server, and SYSVOL  appears ok, or restore just the SYSVOL from a backup. Then just follow the “Specific” steps I’ve outlined below.
  • If more than one DC, but not that many where you can’t shutdown the NTFRS on all of them, such as if you have 40 DCs, pick and choose the best one and set Burflags to D2 on the bad and D4 on the good.
  • If there are numerous DCs, such as a large infrastructure, simply run dcpromo /forcedemote the DC with the error, run a metadata cleanup, then re-promote to a DC back into the domain. If you unplug the DC and run a metadata cleanup, then you will have to rebuild the DC from scratch. The forcedemote switch removes the AD binaries off the machine allowing you to re-promote it.

 

To summarize:

You have two choices as to a restore from a good DC using FRS:

  1. D2 is set on the bad DC: Non-Authoritative restore: Use the D2 option on the DC with the empty SYSVOL folder, or the SYSVOL folder with the incorrect data. This way it will get a copy of the current SYSVOL and other folders from the good DC that you set the BurFlags D4 option on.
  2. D4 is set on the good DC: Authoritative restore: Use the BurFlags D4 option on the DC that has a copy of the current policies and scripts folder (a good, not corrupted folder).

 

The BurFlags option – D4 or D2? What do I use?

The steps refer to changing a registry setting called the BurFlags value. If the BurFlags key does not exist, simply create it. It’s a DWORD key.

More importantly, it references change the BurFlags to one of two options: D4 or D2. Therefore, before going further, I would like to squelch the confusion on what the D2 and D4 settings mean:

D2/D4 – Which is which?

  • D2, also known as a Nonauthoritative mode restore – this gets set on the DC with the bad or corrupted SYSVOL
  • D4, also known as an Authoritative mode restore – use this on the DC with the good copy of SYSVOL.
  • You must shut the NTFRS service down on ALL DCs while you’re doing this until instructed to start it.
  • You’ll probably want to copy the current SYSVOL structure on the good DC to another folder as a backup prior to doing this.

The D2 option on the bad DC will do two things:

  1. Copies the current stuff in the SYSVOL folder and puts it in a folder called “Pre-existing.” That folder is exactly what it says it is, it is your current data. This way if you have to revert back to it, you can use the data in this folder.
  2. Then it replicates (copies) good data from the GOOD DC (D4) to the bad guy (D2).

Once again, simply put:

  • The BurFlags D4 setting is “the Source DC” that you want to copy its good SYSVOL folder from, to the bad DC.
  • The bad DC BurFlags is set to D2, which tells it to pull from the source DC, the one you set D4 on.

 

Here are the steps summarized:

  1. For an Authoritative Restore you must stop the NTFRS services on all of your DCs
  2. In the registry location: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters\Backup/Restore\Process
    1. Set the BurFlags setting to HEX “D4” on a known DC that has a good SYSVOL (or at this time restore SYSVOL data from backup then set the Burflag to D4)
    2. Then start NTFRS on this  server.
    3. You may want to rename the old folders with .old extensions prior to restoring good data.
  3. Clean up the folders on all the remaining servers (Policies, Scripts, etc) – renamed them with .old extensions.
  4. Set the BurFlags to D2 on all remaining servers and then start NTFRS.
  5. Wait for FRS to replicate.
  6. Clean up the .old stuff if things look good.
  7. If the “D4” won’t solve the problem try the “D2” value.

 

So circling back, to fix this and make it work, just copy the contents of SYSVOL to another location, then follow the KB, which simply states you must stop the NTFR service on ALL DCs. Then pick a good one to be the “Source DC.”

Of course, as I’ve stated above, if you have a large number of DCs, the best bet is to forcedemote the bad DC, run a metadata cleanup to remove its reference from AD, then re-promote it.

If you have a small number of DCs, and if you have a good DC and a bad DC, on the good DC, you would set the BurFlags to D4, and on the BAD DC you would set the Burflags to D2.

Example run:

In the example below, if you set BurFlags to D4 on a single domain controller and set BurFlags to D2 on all other domain controllers in that domain, you can rebuild the SYSVOL from the D4 DC (the source DC).

I’ve also heard of admins manually copying the SYSVOL folder, then set the BurFlags options as mentioned, which works too. But no, I haven’t tested it. That would be for a lab on another day. 🙂

Authoritative Restore Example

Use the BurFlags D4 option on the DC that has a copy of the current policies and scripts folder (a good, not corrupted folder).

  1. Stop the FRS service on all DCs. To do this to all DCs from one DC, you can download PSEXEC and run “psexec \\otherDC net stop ntfrs” one at a time for each DC.
  2. On a good DC that you want to be the source, run regedit and go to the following key:
    HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\NtFrs\Parameters\Backup/Restore\Process at Startup
    In the right pane, double-click “BurFlags.” (or Rt-click, Edit DWORD)
       Type D4 and then click OK.
  3. On the bad DC, run regedit and go to the following key:   HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\NtFrs\Parameters\Backup/Restore\Process at Startup
       In the right pane, double-click “BurFlags.” (or Rt-click, Edit DWORD)
       Type D2 and then click OK.
  4. Quit Registry Editor, and then switch to the Command Prompt (which you still have opened).
  5. On the good DC, start the FRS service, or in a command prompt, type in “net start ntfrs” and hit <enter>
  6. On the bad DC, start the FRS service, or in a command prompt, type in “net start ntfrs” and hit <enter>
  7. On the bad DC, check the Sysvol folder to see if it started populating.
  8. Check for EventID 13565 which shows the process started
  9. Check for EventID 13516, which shows it’s complete
  10. Start FRS on the other DCs.

The following occurs after running the steps above after you start the FRS service (NTFRS):

  • The value for BurFlags registry key returns to 0.
  • Files in the reinitialized FRS folders are moved to a <var>Pre-existing</var> folder.
  • An event 13565 is logged to signal that a nonauthoritative restore is started.
  • The FRS database is rebuilt.
  • The member replicates (copies) the SYSVOL folder from the GOOD DC.
  • The reinitialized computer runs a full replication of the affected replica sets when the relevant replication schedule begins.
  • When the process is complete, an event 13516 is logged to signal that FRS is operational. If the event is not logged, there is a problem with the FRS configuration.
     
    Note: The placement of files in the <var>Pre-existing</var> folder on reinitialized members is a safeguard in FRS designed to prevent accidental data loss. You can copy this stuff back if it didn’t work, but I have not yet seen when this has not worked!

Summary

I hope this helps cleaning up your FRS and SYSVOL replication issues.

Ace Fekay
MVP, MCT, MCSE 2012, MCITP EA & MCTS Windows 2008/R2, Exchange 2013, 2010 EA & 2007, MCSE & MCSA 2003/2000, MCSA Messaging 2003
Microsoft Certified Trainer
Microsoft MVP – Directory Services
Complete List of Technical Blogs: http://www.delawarecountycomputerconsulting.com/technicalblogs.php

This blog is provided AS-IS with no warranties or guarantees and confers no rights.

AD Site Design and Auto Site Link Bridging, or Bridge All Site Links (BASL)

By Ace Fekay, MCT, MVP, MCSE 2012/Cloud, MCITP EA, MCTS Windows 2008/R2, Exchange 2007, 2010 & 2013, Exchange 2013, Exchange 2010 Enterprise Administrator, MCSE 2003/2000, MCSA Messaging 2003
  Microsoft Certified Trainer
  Microsoft MVP: Directory Services
  Active Directory, Exchange and Windows Infrastructure Engineer

Updated 12/12/2013

Preface

Ace here again with something I really would like to discuss, since this topic comes up from time to time.

To properly designed an AD multi-site infrastructure, there are a few things that need to be taken into account. I won’t bore you with all the background techno babble, rather I’m going to discuss a no-nonsense, get down to business on why you need to either keep Auto Site Link Bridging enabled, or why you need to disable it, both of which depends on your physical routed topology design.

AD Sites

First, a basic understanding of Active Directory Sites is important to understand before I go further.

Some of the biggest questions I hear about AD Sites are:

  • What are AD Sites?
  • What are AD Sites for?
  • Why can’t I create an AD Site without a domain controller in the Site?

These are all valid questions. A little research will usually result in an answer, but you may have to dig through piles of technical details to get to it. Let’s address each one:

What are AD Sites?

An AD Site defines a highly-connected, physical network locations in Active Directory. We define them by IP subnet or subnets. And yes, you can have multiple subnets that are highly-connected by routers within a location. In some cases, for example, if you have a very high-speed backbone, such as an OC-1 (51.84Mbps or higher), between locations, you can put all those subnets in one AD Site. However, in many cases, we probably don’t want to do that. Hang in there, I’ll be getting to that in a few minutes.

What are AD Sites for?

AD sites are basically used for two things:

  1. To facilitate service localization. In simple English, this means to control logon and authentication traffic to DCs in a specific location, or Site. After all, we don’t want a client in NYC to pick a DC in Seattle, Japan, or somewhere else, to send its logon or authentication request (such as when accessing a folder), do we? Nope. The client side DC Locator process will find a DC in its own Site by using the client’s IP address.
  2. To manage DC replication traffic. In simple English, this means to control DC replication traffic across WAN links between the bridgeheads (the DCs in a Site that communicate with DCs in other Sites). By default, replication between Sites are compressed down to 15% of total traffic. And with Sites, we can control frequency of replication, and when we’re allowing it to happen.’

Why can’t I create an AD Site without a DC in the Site?

Good question. If you look at what AD Sites are for, then it should be pretty obvious that you need a DC in it. After all, if there is no DC in it, and a client picks a Site based on its IP subnet, then looks for a DC, it won’t find any, will it? Nope, so it may wind up randomly picking a DC in another location, such as that DC in Seattle or Japan.

DC Locator Process

I don’t want to dwell on this, but I will briefly mention it because this is part of the reason why we want to create AD Sites anyway.

There is a process that a client uses to pick a DC. Here’s a quick view of how a client picks a DC. I should add a #9 to the list in a scenario when no DC exists in the Site, then it uses Automatic Site Coverage, and this ONLY if you created an IP Site link to another Site that you want the DCs to cover that site.

If you didn’t create an IP Site link for a Site that has no DCs, then it will pretty much become a random process, sort of, by using other factors, such as subnet netmask ordering and Round Robin. If you want to read up on this subject, here are two good TechNet Forum discussions on it:

Briefly, here are the DC Locator process steps, and these steps were directly quoted from How Domain Controllers Are Located in Windows XP

  1. Client does a DNS search for DC’s in _LDAP._TCP.dc._msdcs.domainname
  2. DNS server returns list of DC’s.
  3. Client sends an LDAP ping to a DC asking for the site it is in based on the clients IP address (IP address ONLY! The client’s subnet is NOT known to the DC).
  4. DC returns…
    1. The client’s site or the site that’s associated with the subnet that most matches the client’s IP (determined by comparing just the client’s IP to the subnet-to-site table Netlogon builds at startup).
    2. The site that the current domain controller is in.
    3. A flag (DSClosestFlag=0 or 1) that indicates if the current DC is in the site closest to the client.
  5. The client decides whether to use the current DC or to look for a closer option.
    1. Client uses the current DC if it’s in the client’s site or in the site closest to the client as indicated by DSClosestFlag reported by the DC.
    2. If DSClosestFlag indicates the current DC is not the closest, the client does a site specific DNS query to: _LDAP._TCP.sitename._sites.domainname (_LDAP or whatever service you happen to be looking for) and uses a returned domain controller.

Brief overview:

For a full-sized image, click on the images.

Let me point out again, that if there are no DCs in a Site, then Automatic Site Coverage will take over.

To me, it’s a process to “find” a DC that will authenticate a user in a Site without a DC. However, my take on it is I would rather associate the location’s subnet to a current Site so as to not make the client go through this process. Besides, there may be scenarios that not having a DC in a Site can directly affect directory enabled applications and services such as DFS site referrals, SCCM or Exchange with it’s high dependency on GCs and DSAccess.

Here’s the DC Locator process, directly quoted from the Technet article, “How DNS Support for Active Directory Works:”

  1. Build a list of target sites — sites that have no domain controllers for this domain (the domain of the current domain controller).
  2. Build a list of candidate sites — sites that have domain controllers for this domain.
  3. For every target site, follow these steps:
    1. Build a list of candidate sites of which this domain is a member. (If none, do nothing.)
    2. Of these, build a list of sites that have the lowest site link cost to the target site. (If none, do nothing.)
    3. If more than one, break ties (reduce this list to one candidate site) by choosing the site with the largest number of domain controllers.
    4. If more than one, break ties by choosing the site that is first alphabetically.
    5. Register target-site-specific SRV records for the domain controllers for this domain in the selected site.

If there are no DCs in a Site, you can use PowerShell to figure out which DC in which Site will be picked. If you like, you can further read up on the commands used to figure this out in Sean Ivey’s blog:

Sites Sites Everywhere…, By Sean Ivey, Microsoft DS PFE
http://blogs.technet.com/b/askds/archive/2011/04/29/sites-sites-everywhere.aspx

So wouldn’t you want your clients to pick a DC in its own Site?

Moving forward, do we really want a client to pick a DC in some other site or go through the Automatic Site Coverage process? Would you want that? I ‘m sure you already know the answer to that.

Therefore, if you have a location that have no DCs, then simply create an IP subnet object, and associate the subnet object to an existing AD Site that you want those users to use. In this case, you may base your own pick on a site linked by the fastest WAN link, or the only WAN link.

And if there are any subnets that are not associated with an AD site, then any DC is game to authenticate a client, as seen in the process above. To check for clients which subnets are not configured to AD Sites & Services, among other things, enable Netlogon logging, and check the system32\config\netlogon.log file. Here’s more info:

Enabling debug logging for the Net Logon service, Last Review: May 3, 2011 – Revision: 11.0, Applies to: all operating systems.
 http://support.microsoft.com/kb/109626

Auto Site Link Bridging

This now brings us to bridging, what it is, etc.

Within an AD Site, the KCC (Knowledge Consistency Checker) will automatically assume that all DCs can directly reach each other, and create Intrasite replication partnerships between the DCs in the Site. The one point that I want to be clear about that no matter how many DCs are in a Site, and there can be hundreds of DCs in a Site, the KCC will make sure that the  partnerships created are done so that all DCs in a Site will have an updated replication set for any changes by any of the DCs in the site, within 15 minutes. If you add a new DC to the Site, the KCC jumps in and evaluates the new guy and adds it so it gets updated data from other DCs under 15 minutes. How does it do that? It follows a set algorithm, but that is beyond this discussion.

When there are multiple Sites, and more specifically three or more Sites, and keeping in mind that by default AD automatically assumes that all the Sites have direct physically connectivity and communications between each other. This means you can literally ping a DC from in any Site to any other Site.

Here’s where the ISTG (Intersite Topology Generator) kicks in. The ISTG is a component of the KCC. It evaluates the overall topology, and builds connection objects between servers in each of the sites to enable Intersite replication— DC replication between sites.

Here’s a fully routed infrastructure, For the full-sized image, click here.

 

If remote sites cannot directly communicate with each other and only to the hub site

However, if your physical network topology is designed where each site does not have direct communications with each other, and you leave all the default “Auto Site Link Bridge” setting enabled as is, then lots of things will go wrong, such as replication problems, duplicate AD integrated zones, and more … keep reading. But I won’t address duplicate zones. You can click the link in the previous sentence for more on that.

If the network topology was a hub and spoke and BASL wasn’t disabled and individual sites links between the hub and each site weren’t created until recently, then there may be replication problems. This is a whole different subject. What I can say, besides checking to see if there are duplicate zones, as I mentioned in the previous paragraph, I would also run the Active Directory Replication Status Tool to check replication status. It will provide a report, and anything amiss will show up in Red. Pretty cool tool. Download it here:

Download The Active Directory Replication Status Tool (ADREPLSTATUS):
   http://www.microsoft.com/en-us/download/details.aspx?id=30005
     Note: This tool requires .Net Framework 4. If it’s not installed, download and install it:
       Microsoft .NET Framework 4 (Web Installer)
       http://www.microsoft.com/en-us/download/details.aspx?id=17851

Remember, by default, the KCC assumes all sites can directly communicate, therefore it will create partnerships between Bridgeheads in all sites. And any DC in a site can automatically become a bridgehead.

So if corporate headquarters is in NYC, and you have three remote locations, Miami, Chicago and Seattle, and direct communications does not exist, meaning that each remote location can only communicate with headquarters, and IP routing has not been configured between the remote locations, and the KCC creates a connection object (partnership) between a DC in Miami and a DC in Seattle, what will happen?

Since they can’t directly communicate, then replication fails. And if the Seattle partnership is the only connection object Miami may have, but Seattle happens to have one to NYC, and keeping in mind, replication is a PULL request, then Seattle will receive replication from NYC, but Miami can’t pull anything from Seattle, because there is no direct or indirect communications. So Miami winds up being in a secluded island.

In Miami’s DC’s view, it thinks no one wants to talk to it, so it will complain (you will see multiple event log errors) that others having replicated with it. And according to the DCs in the other sites, they will all think the same thing about Miami.

So who’s right? Of course, they all are. If the lack of replication goes beyond the AD Tombstone, then Miami would need to be demoted. Then again, you can’t even do that because it doesn’t have direct communications with its partner. Then if it does pick a DC in headquarters to demote, you will see an error stating that the headquarters DC already thinks the DC no longer exists. In the case of trying to demote it, or even forcedemoting it beyond the Tombstone, then your only option is to unplug it, run a metadata cleanup and re-promote it. But wait, then the same thing will occur if you don’t disable BASL.

So is it right that we do the same thing over and over and expect different results? Nope. Let’s configure AD to make sure it will not happen again, by disabling BASL.

Here’s a non-fully routed infrastructure. For the full-sized image, click here.

Disable BASL

Simply put, what we need to do is disable BASL (Bridge All Site Links) in a non-fully routed infrastructure to tell the KCC to only partner DCs across a specific site link.

Yes, that means you also have to create specific IP site links between headquarters in NYC to each site, as the image above shows.

And even if you have 20 sites all fully routed EXCEPT for one of them, then the same thing goes. You must disable it all because of that one site, otherwise the KCC will partner with a DC that it may not have direct communications with.

How to disable BASL. For the full-sized image, click here.

Summary

If you want to make sure your AD infrastructure is properly purring along and doing its job, then by all means let’s design it properly, make the necessary modifications, and other changes, to get it going in the right direction.

Oh, and you can’t forget to bone up on your DNS knowledge and how it supports AD. All Sites get registered in DNS by the netlogon service. Read more:

How DNS Support for Active Directory Works
http://technet.microsoft.com/en-us/library/cc759550(WS.10).aspx

And to understand the DNS SRV records registered by a DC’s Netlogon service, read Sean Dubey’s blog, with DNS SRV records examples, once again, I refer you to Sean Ivey’s blog:

Sites Sites Everywhere…, By Sean Ivey, Microsoft DS PFE
http://blogs.technet.com/b/askds/archive/2011/04/29/sites-sites-everywhere.aspx

References

Designing the Site Topology
http://technet.microsoft.com/en-us/library/cc787284(WS.10).aspx

Detailed branch office deployment guide (downloadable doc)
http://www.microsoft.com/downloads/details.aspx?FamilyId=9353A4F6-A8A8-40BB-9FA7-3A95C9540112&displaylang=en

Best Practice Active Directory Design for Managing Windows Networks
http://technet.microsoft.com/en-us/library/bb727085.aspx

You may want to take a look at the design IPD guide (Infrastructure Planning and Design) for AD – Download Details: IPD guide for Active Directory Domain Services – version 1.0
http://www.microsoft.com/en-us/download/details.aspx?id=732

Download the complete Infrastructure Planning and Design (IPD) Guide Series v2.0 including links for AD IPD, SCCM IPD, and more.
http://technet.microsoft.com/en-us/library/cc196387.aspx

Comments & Corrections are welcomed.

Ace Fekay