Saturday, April 21, 2012

Case of the Unexplained Donut of Death

A few weeks ago, I had dropped in at the church-house to bring down some updates to the PC we run the services on. I also took advantage of the time to do some drafting work on that Sunday’s service material.

As I was working, one of the other ministers asked me if the worship-leader/sysadmin had gotten with me to look at one of the church administrative assistant’s PC’s that was not running smoothly.

I hadn’t gotten the message yet but had free time, so I popped into her office to take a look.

The user explained that a few weeks prior, another church member had their PC infected with some malware. One consequence was that system mailed out malware link spam from their email client. The church admin got one of those emails, it looked legit, followed the link, and ended up with a malware infection on her own system.

In the end the system had to be paved and reloaded from scratch.

Ever-since, the user reported it was constantly locking up at random times, though launching Internet Explorer sessions seemed to aggrieve it the most.

The dreaded Windows 7 Donut of Death had appeared. Luckily I’ve got an appetite for donuts.

First thing I did was take a look at the system hardware; i7 core processor, Windows 7 x64 Pro, 12 GB system RAM (wow!), really big SATA HDD. This PC was probably one of the most well-built ones in our church. After some checking and tests, I couldn’t find any issues that could be hardware related.

Next, I fired up Process Explorer.  Since the PC had once been infected, but was wiped/reloaded, I didn’t expect to find any unfamiliar processes and that was indeed the case. Everything running looked legit. No one process seemed to be showing evidence of a “CPU hog.”

With Process Explorer running on the 2nd Monitor and some changes made to the default columns, I started running some applications. Once I fired up IE there appeared the donut of death. The system seemed “locked” an unresponsive for almost a minute before control restored and applications became responsive again. Looking over at the Process Explorer, I didn’t see anything jump in terms of CPU hogging. I clicked around and ran some additional apps with the donut of death appearing again from time to time, more frequently than not. The behavior didn’t seem limited to IE. Other apps triggered the same result. Also, IE could run fine for a while and then suddenly when clicking a new page-link, the system freeze occurred again.

Next I fired up Process Monitor and began a capture. I wasn’t paying close attention to the time, but was able to get the donut of death appear a few more times before closing the capture and saving the file. When I was done I had acquired a 7GB Process Monitor trace file. Yikes!

While I was there, I also grabbed the available System Logs with Nir Sofer’s MyEventViewer tool and exported them into a txt tile for carryout consumption.

A quick review of the Process Monitor log saw a few tool-bar-related processes (Ask toolbar, Bing toolbar, etc.). These seemed to show some time-jumps after execution. The time jumps were a few seconds, not the up-to-a-minute differentials I was expecting from the donut time. The user didn’t use them, so I uninstalled them. Unfortunately, while the donut subjectively seemed to take a bit longer to appear, it still was there.

I off-loaded both these bits to my USB stick for more analysis when I got home.

I was stumped but donut hungry.

Initial Observations: While it was possible there was some hardware/driver problem, my initial feeling was that we had a rogue process taking control of the CPU’s. Since I couldn’t see any cpu spikes during the lockup in the temporarily frozen Process Explorer I wasn’t sure which one it could be.

Back at the homestead, I poured over the system logs. While they were quite interesting, I failed to find any smoking guns.

Next I analyzed the Process Monitor data. Poring over a 7GB file was very tedious.

Part of the problem was that I didn’t ever get an error dialog, BSOD, or some other “failure” to help me narrow down the search. The system would lock up and then unlock after some time and keep running “fine”. Hmm.

I was hoping to focus on a period when I saw the time-counter take a significant jump. Unfortunately, I didn’t find any obvious time jumps.  So while it appeared the system locked up significantly I didn’t find any time-jumps to point to a rouge process.  However, I did make a few notes. Suspiciously, there were a whole lot of logging going on with “coreServiceShell.exe” and “TmListen.exe”.  These are processes related to Trend Micro AV. Looking at the Process Monitor activity related to “coreServiceShell” it was a busy little child while “TmListen” seemed to be looking for related Trend Micro log(s). uiWnMgr.exe was also present and related to Trend Micro and seemed to be focused on certificates.

I discussed this with the sysadmin who was doubtful as this behavior wasn’t seen on any of the additional systems he had used Trend Micro on, including other systems in the church. So I set it aside.

Next week I stopped by the church-house again, but the sysadmin was out. My hope was to try to disable Trend Micro to see if that banished the dreaded donut.  Unfortunately, Trend Micro is protected with an administrative layer which requires a password to disable/turn-off. I contacted the sysadmin by phone but he couldn’t remember it and couldn’t access the location where he had it stored. I was on my own again.

This time I was a bit more prepared.

Resource Monitor

First I tried using the built in Windows 7 “Resource Monitor” tool.

Resource Monitor_2012-04-15_13-20-32

The sample shot above shows some of the data it collects and displays. Running it is very simple. Just type Resmon.exe in the run-line.

While I did get the system to lock up with the donut of death again, I still couldn’t find anything obvious here. My suspect processes still seemed to be behaving themselves in terms of CPU usages.

More on using Resource Monitor:

Performance Monitor

I also considered using Windows Performance Monitor as it also comes native on the systems. Just type perfmon.exe in the run-line to execute.

Performance Monitor_2012-04-15_13-25-14

While this is a good and powerful tool as well, it just wasn’t providing me the data I was looking for. The image above shows a sample view (not related to this particular case). It can be powerful in the right hands, but is not quite as intuitive to use “out-of-the-box” in a meaningful way.

More on using Windows Performance Monitor

Windows Performance Analysis Tools (Xperf)

However, all was not lost.  I had come better prepared and now had a powerful tool in my performance troubleshooting arsenal loaded up on my USB stick; Xperf.

Basically, I had previously downloaded the Windows Performance Toolkit at home on my own Windows 7 system. I then copied the C:\Program Files\Microsoft Windows Performance Toolkit folder over to my USB stick for deployment and usage in the field.

I practiced using it first at home, but mysteriously kept getting the following error:

C:\Program Files\Microsoft Windows Performance Toolkit>xperf.exe -on DiagEasy
xperf: error: NT Kernel Logger: Cannot create a file when that file already exists. (0xb7).

That was weird as I didn’t have any traces set to run on my home system (that I was aware of at least).

That took a bit of troubleshooting but I eventually found the issue:

Basically it was tripping over Process Explorer which was running by default on my system at all times. Process Explorer uses NT Kernel Logging to capture data it uses and they were fighting. Once I disabled Process Explorer temporarily, Xperf worked fine.

Anyway, on the target system back at the church I copied the folder to the local drive, then opened a DOS box and pointed to the location and ran the following command.

C:\Microsoft Windows Performance Toolkit>xperf.exe -on DiagEasy

I then launched some applications, Internet Explorer, loaded a few web sites, ran paint, notepad, calc, etc.  Each time I would get the donut of death. Good data!

I would then stop the trace and export the file with an appropriate file name.

xperf -d testdata1.etl

I noticed complaints about dropped trace elements with a recommendation to increase buffer size. I probably should have heeded the advice, but didn’t since I was using the default diagnostic capture mode.

With three traces in hand now with sizes of 48MB, 50 MB, and a whopper of 301MB, I felt I had a pretty good sample set. During the testing, I made written notes of what I was doing (which app launched) and when the donut appeared. I hoped to correlate these events.

Now that I had the files back home, I fired up the GUI “Windows Performance Analyzer” tool, xperfview.exe.  Actually, since I had installed it on my home system the “ETL” file extension was pre-associated with that application.

CPU Scheduling Aggregate Summary Table - C__Users_Tatiana_Desktop_CM PC Logs_CM-_2012-04-15_13-32-18

You can select different “Frames” to view data from which then load into the horizontal panes. From here you can compare events as well as hone-in on specific time-slices. You can also display the results in a tabular form. The image above is a screen-capture from a real trace from the problem system.

Poking around in this data from the three captures I had taken showed time-after-time that (overall) coreServiceShell.exe was the heaviest user of CPU processes far-and-wide.

This information, coupled with additional data with filtered Process Monitor session data I had captured observing the behavior of that process made me feel much more certain that Trend Micro was in fact the cause of the donut of death.

More on using Xperf and the Windows Performance Toolkit

Bonus Xperf Material:

Today while poking away at the CodePlex - Open Source Project Hosting site I found three projects that may leverage the Xperf data in helpful ways:

  • XPerfUI - This is a GUI wrapper for the Xperf command-line performance analysis tool.  I I mentioned, my own performance traces dropped (lost) anywhere from 26301 to 194970 events during the trace capture process. The CLI arguments are too tough once you have your confidence up, but this tool might make the process a bit easier.
  • Xperf123 - Xperf perf data collection made as easy as 1-2-3 - This project provides a Wizard-based method of selecting a trace profile (it does the CLI arguments automagically for you) and it creates and fires off the command. Then you can analyze the results.

As the warning dialog I mentioned said when loading my trace files in xperfview, “This is usually caused by insufficient disk bandwidth for ETW logging. Please try increasing the minimum and maximum number of buffers and/or the buffer size. Doubling these values would be a good first attempt. Please note that this action increased the amount of memory reserved for ETW buffers, increasing memory pressure on your scenario. See “xperf -help start” for the associated command line options.”  I probably should have set -maxbuffers 1024. Will try that next time.

Getting the Windows Performance Analysis Toolkit bits

As I said earlier, I had to install the toolkit on my home Windows system. However, once I did so, I was able to copy the installation folder in the Program Files folder to a USB stick for deployment and usage on a per-system basis. As I understand it, because you can download the entire package, you could (in theory) extract the downloaded file-set and snag them that way. I found the download/install/copy-to-USB method painless myself.

Here are some links to guide you through the download/installation process.

So, back to the donut of death…I diverged for a moment.

I had some great trace data, and my eyes on some culprits, but while the output in the xperfview.exe application (and table views) was helpful, I kept feeling I was still not able to see the big-picture as clearly as I wanted. After all, I’ve got more than a few Windows process/internals books but I’m no Windows programmer or systems engineer so I was still digging around slowly in the data.

Then I found this.

Windows Performance Analysis Toolkit (WPT) for Windows (SDK 8)


This is the next-generation WPT and has some fantastic power and Xperf analysis toolsets!

Windows Perfmance Analyzer SDK 8

Compare the data views I had selected and was working with above in the Windows Performance Analyzer (wpa.exe) as compared to the prior screenshot in xperfview.exe. The difference is almost night-to-day. The image above is a screen-capture from a real trace from the problem system.

I found this interface much more powerful and intuitive to use.

xperf.exe, xperfview.exe and all the previous tools are still present but wpa.exe is much more user-friendly for analysis work IMHO. Also added to the mix is the tool Windows Performance Recorder (wpr.exe) as well as a GUI-based wizard WPRUI.exe to help set your performance recording sessions. I didn’t use this tool in this troubleshooting but will be working to figure out what neat things I can do with it beyond xperf.

Getting and installing this one on my Windows 7 x64 bit system was a bit more challenging getting started. It doesn’t seem to have a full download package set to just extract these bits out of. Rather you seem to download a web-based pre-installer to pick your packages and it then fetches and installs the bits. All that said, once I figured out what I needed, the process went smoothly, and I was also able to copy the installed folder location to my USB to take with me in the field for captures on other Windows 7 systems.

Here are the links I studied to download and install it:

With the additional views and drill-downs I was able to do in the Windows Performance Analyzer, I had all the proof I needed to convince me that the most likely culprit for the donut-of-death on the church’s system was Trend Micro in general and the “coreServiceShell” process in particular.

Apparently “coreServiceShell.exe” thinks it’s always time to make the donuts!

Armed with this information, I now invested myself in the Google with “coreServiceShell” and found some interesting stuff. Lots of complaints were seen about CPU hogging with Trend Micro and coreServiceShell in particular.

Turns out that process in particular is a busy little bugger (as I’ve seen in my 7 GB Process Monitor capture). It appears to be the main scan-engine for the AV product, not only checking files upon access and execution, but also acting as a web-proxy while pulling in pages in the web-browsers. I could find and locate the proxy activity in the Process Monitor traces. Very interesting thing.

In fact, the issues were so bad that Trend Micro offers a Hot Fix for just this performance issue.

I copied the Hot Fix down to the sysadmin’s network share folder with a few more useful links to the problem.

Since I couldn’t get admin rights to install the Hot Fix myself on the user’s system (yet) I’m waiting for feedback if the Hot Fix has been applied and if it fixes the issue. Based on all my work I’m confident this is the trick.

I suspect that while the CPU activity loads looked normal while I was watching them in Process Explorer -- just before donut of death -- when it went overboard Process Explorer couldn’t keep up and froze so I didn’t see the jump in CPU’s while it hogged/chugalug’ed away. Once it was done, things returned to normal and there I was at normal CPU levels again…until the next lockup.

Additionally, since I didn’t set my Xperf CLI to -maxbuffers 1024, I probably dropped a good portion of the trace capture events during the lockup process as well. That said, the tools above gave me sufficient information to say pretty confidently this was the donut maker.

While this isn’t for the average Joe, these tools and techniques are extremely powerful and once mastered, can give the sysadmin tremendous confidence in working your way through performance issues on a Windows system. Since they can tag-along on a USB stick, you can deploy them as needed as long as you have them handy.

Additional Windows Performance and Analysis Linkage

Here are a few more links and blog and resources I thought were very insightful to getting up to speed tracking out Windows Performance issues. More than a few of these have been added to my RSS feed list.

Hopefully this super-long post and linkage has provided some good comparison views between different tools and techniques tracing out Windows performance problems.

I learned a lot in the process of this one system, and it just another reason why I find benefit in taking a look at problem systems. I usually walk away with a more honed skill-set in the process.

I know I will be coming back to this post myself for some time to come re-reviewing the linkage and information here.


--Claus V.


Patrick Olsen said...

Awesome run through. I'll have to PDF this post and review it again. A lot of information. Good stuff. Keep it up!

Bozo said...

One common area where I've seen procexp appear to freeze without recording CPU activity that is disk I/O related is during scans of huge directory listings (>200k files). This shows up clearly in ProcMon when you filter by PID. The offending process will have long sequences of directory queries, each taking several seconds. Was there any evidence of coreServiceShell.exe creating (or scanning) large numbers of files in a single folder?

Claus said...

@ Bozo - Thanks for the tip. What the traces are telling me is that the TrendMicro service is a real resource-hungry bugger. However the System (PID4) apparently is doing something even more mighty as during the apparent "lockup" periods, the processes are still churning away nicely on the CPU, but the poor HDD is getting a real thrashing from read/write/flushing activity by that System process.

I want to go back and correlate (filter as you suggested) the PID (4) in my Process Monitor trace to see if I can figure out what it is telling me and why it wants to read/write/flush so much.

I also need to work out a more sophisticated XPERF capture argument to better capture the trace into buffer/RAM rather than what it is doing now. Need to experiment with that a bit.

I've not had an opportunity to go hands-on on the system since that post. Maybe this coming weekend? I remain convinced that the TrendMicro service isn't helping in the least, but am not wondering if something specific is happening under the sheets between the System process and the poor suffering HDD.

The mystery deepens!

Thanks for the tips and guidance.

This post will have some followups!


--Claus V.