The Compound Case of the Outlook Hangs

This case was shared with me by a friend of mine, Andrew Richards, a Microsoft Exchange Server Escalation Engineer. It’s a really interesting case because it highlights the use of a Sysinternals tool I wrote specifically for use by Microsoft support services, and because it’s actually two cases in one.

The case unfolds with a systems administrator at a corporation contacting Microsoft support to report that users across their network were complaining of Outlook hangs lasting up to 15 minutes. The fact that multiple users were experiencing the problem pointed at an Exchange issue, so the call was routed to Exchange Server support services.

The Exchange team has developed a Performance Monitor data collector set that includes several hundred counters that have proven useful for troubleshooting Exchange issues, including LDAP, RPC and SMTP message activity, Exchange connection counts, memory usage and processor usage. Exchange support had the administrator collect a log of the server’s activity in 12-hour cycles, the first from 9pm until 9am the next morning. When the Exchange support engineers viewed the log, two patterns were clear despite the heavy density of the plots: first, as expected, the Exchange server’s load increased during the morning as users came into work and started using Outlook; and second, the counter graphs showed a difference in behavior between about 8:05 and 8:20am, a duration that corresponded exactly to the long delays users were reporting:





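Collecting a log like that doesn’t require working in the Performance Monitor UI; a counter log can be created and started from the command line with logman. A minimal sketch, assuming the Exchange counter list has been saved to a file named counters.txt (the collector name and file name here are made up for illustration):

logman create counter ExchangePerf -cf counters.txt -si 15 -f bincirc -max 512
logman start ExchangePerf

The -si switch sets the sample interval in seconds, and -f bincirc with -max caps the log as a circular binary file so that an overnight capture can’t fill the disk.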
The support engineers zoomed in and puzzled over the counters in that timeframe. They could see Exchange’s CPU usage drop, the active connection count go down, and outbound response latency increase drastically, but they were unable to identify a cause:



They escalated the case to the next level of support and it was assigned to Andrew. Andrew studied the logs and concluded that he needed additional information about what Exchange was doing during an outage. Specifically, he wanted a process memory dump of Exchange when it was in the unresponsive state. This would contain the contents of the process address space, including its data and code, as well as the register state of the process’s threads. Dump files of the Exchange process would allow Andrew to look at Exchange’s threads to see what was causing them to stall.

One way to obtain a dump is to “attach” to the process with a debugger like Windbg from the Debugging Tools for Windows package (included with the Windows Software Development Kit) and execute the .dump command, but downloading and installing the tools, launching the debugger, attaching to the right process, and saving dumps is an involved procedure. Instead, Andrew directed the administrator to download the Sysinternals Procdump utility, a single executable that can be run without installing any software on the server. Procdump makes it easy to obtain dumps of a process and includes options that create multiple dumps at a specified interval. Andrew asked the administrator to run Procdump the next time the server’s CPU usage dropped so that it would generate five dumps of the Exchange Server engine process, Store.exe, spaced three seconds apart:

procdump -n 5 -s 3 store.exe c:\dumps\store_mini.dmp
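For comparison, the manual route described above would go roughly like this once the Debugging Tools are installed (a sketch; 4280 stands in for Store.exe’s process ID, and the output path is made up):

cdb -p 4280
.dump /ma c:\dumps\store_manual.dmp
qd

Here cdb, the console version of Windbg, attaches to the process; .dump /ma writes a full dump (omit /ma for a minidump); and qd quits and detaches, leaving Store.exe running.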

The next day the problem reproduced and the administrator sent Andrew the dump files Procdump had generated. When a process temporarily hangs, it’s often because one thread in the process acquires a lock protecting data that other threads need to access, and holds the lock while performing some long-running operation. Andrew’s first step was therefore to check for held locks. The most commonly used intra-process synchronization lock is a critical section, and the !locks debugger command lists the critical sections in a dump that are locked, the thread ID of the thread owning the lock, and the number of threads waiting to acquire it. Andrew used a similar command, !critlist from the Sieext.dll debugger extension (the public version of which, Sieextpub.dll, is available as a separate download). The output showed that multiple threads were piled up waiting for thread 223 to release a critical section:



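Producing that list takes only two commands once the extension is available; a sketch, assuming Sieextpub.dll has been copied into the debugger’s extension search path:

.load sieextpub.dll
!critlist

The built-in !locks command reports similar information if the extension isn’t at hand.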
His next step was to see what the owning thread was doing, which might point at the code responsible for the long delays. He switched to the owning thread’s register context using the ~ command and then dumped the thread’s stack with the k command:
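Concretely, those two steps look like this (a sketch that treats 223 as the debugger’s thread number for the owning thread; if 223 were instead the OS thread ID, the switch command would be ~~[223]s):

~223s
k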



As we’ve seen in previous Case of the Unexplained posts, the debugger was unsure how to interpret the stack when it came across a stack frame pointing into Savfmsevsapi, an image for which it couldn’t obtain symbols. Most Windows images have their symbols posted on the Microsoft symbol server, so this was likely a third-party DLL loaded into the Store.exe process and therefore a suspect in the hangs. The list modules (“lm”) command dumps version information for loaded images, and the image’s path made it obvious that Savfmsevsapi was part of Symantec’s mail security product:
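The verbose, filtered form of the command narrows the output to a single module; a sketch:

lmvm savfmsevsapi

This variant prints the image path, timestamp and version resource for just the named module, which is usually enough to identify the vendor.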



Andrew checked the other dumps and they all had similar stack traces. With the evidence seeming to point at a Symantec issue, Andrew forwarded the dumps and his analysis, with the administrator’s permission, to Symantec technical support. Several hours later they reported that the dumps indeed revealed a problem with the mail application’s latest antivirus signature distribution and forwarded a patch to the administrator that would fix the bug. He applied it and continued to monitor the server to verify the fix. Sure enough, the server’s performance counters settled into fairly regular activity levels and the long delays disappeared.

However, over the subsequent days the administrator started to receive complaints, albeit at a lower rate, from several users that Outlook was sporadically hanging for up to a minute. Andrew asked the administrator to send a correlating 12-hour Performance Monitor capture taken with the Exchange data collector set, but this time there was no obvious anomaly:



Wondering if the hangs would be visible in the CPU usage history of Store.exe, he removed all the counters except for Store’s processor usage counter. When he zoomed in on the morning hours, when users began to log in and load on the server increased, he noticed three spikes around 8:30am:



Because the server has eight cores, the processor usage counter for an individual process has a possible range of 0 to 800, so the spikes were far from taxing the server, but they were definitely higher than Exchange’s typical range on that system. Zooming in further and setting the graph’s vertical scale to make the spikes more distinct, he observed that, outside the spikes, CPU usage stayed below about 75% of a single core, and that the spikes themselves were 15-30 seconds long:



What was Exchange doing during the spikes? They were too short-lived and random for the administrator to run Procdump as he had before and reliably capture dumps when they occurred. Fortunately, I designed Procdump with this precise scenario in mind. It supports several trigger conditions that, when met, cause it to generate a dump. For example, you can configure Procdump to generate a dump of a process when the process terminates, when its private memory usage exceeds a certain value, or even based on the value of a performance counter you specify. Its most basic trigger, though, is the CPU usage of the process exceeding a specified threshold for a specified length of time.
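For example, command lines roughly like these (sketches only; the paths, thresholds and counter are illustrative, not taken from this case) would dump Store.exe when it exits, when its memory commit passes 2,000 MB, or when an Exchange performance counter crosses a threshold:

procdump -t store.exe c:\dumps\store_exit.dmp
procdump -m 2000 store.exe c:\dumps\store_commit.dmp
procdump -p "\MSExchangeIS\RPC Requests" 70 store.exe c:\dumps\store_rpc.dmp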

The Performance Monitor log gave Andrew the information he needed to craft a Procdump command line that would capture dumps for future CPU spikes:

procdump.exe -n 20 -s 10 -c 75 -u store.exe c:\dumps\store_75pc_10sec.dmp

The arguments configure Procdump to generate a dump of the Store.exe process when Store’s CPU usage exceeds 75% (-c 75) relative to a single core (-u) for 10 seconds (-s 10), to generate up to 20 dumps (-n 20) and then exit, and to save the dumps in the C:\Dumps directory with names that begin with “store_75pc_10sec”. The administrator executed the command before leaving work and when he checked on its progress the next morning it had finished creating 20 dump files. He emailed them to Andrew, who proceeded to study them in the Windbg debugger one by one.

When Procdump generates a dump because the CPU usage trigger is met, it sets the thread context in the dump file to the thread that was consuming the most CPU at the time of the dump. Since the debugger’s stack-dumping commands are relative to the current thread context, simply entering the stack dumping command shows the stack of the thread most likely to have caused a CPU spike. Over half the dumps were inconclusive, apparently captured after the spike that triggered the dump had already ended, or with threads that were executing code that obviously wasn’t directly related to a spike. However, several of the dumps had stack traces similar to this one:
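Reviewing each dump amounted to little more than opening it and entering the stack command; a sketch of that per-dump routine (the file name is a placeholder for one of the twenty dumps):

cdb -z c:\dumps\<dump file>.dmp
k
q

The k and q lines are entered at the debugger prompt: k dumps the stack of the current thread, which Procdump has already set to the busiest one, and q exits.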



The stack frame that stuck out listed Store’s EcFindRow function, which implied that the spikes were caused by lengthy database queries, the kind that execute when Outlook accesses a mailbox folder with thousands of entries. With this clue in hand, Andrew suggested the administrator create an inventory of large mailboxes and pointed him at an article the Exchange support team had written that describes how to do this for each version of Exchange:



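On Exchange 2007 or 2010, for example, a query along these lines in the Exchange Management Shell surfaces the offending folders (a sketch, not the script from the referenced article; the 5,000-item cutoff is illustrative):

Get-Mailbox -ResultSize Unlimited | Get-MailboxFolderStatistics |
    Where-Object { $_.ItemsInFolder -gt 5000 } |
    Sort-Object ItemsInFolder -Descending |
    Select-Object Identity, FolderPath, ItemsInFolder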
Sure enough, the script identified several users with folders containing tens of thousands of items. The administrator asked the users to reduce their item count to well below 5000 (the Exchange 2003 recommendation – this has been increased in each version with a recommendation of 100,000 in Exchange 2010) by archiving the items, deleting them, or organizing them into subfolders. Within a couple of days they had reorganized the problematic folders and user complaints ceased entirely. Ongoing monitoring of the Exchange server over the following week confirmed that the problem was gone.

With the help of Procdump, the compound case of the Outlook hangs was successfully closed.
