You should probably be using ETW tracing, which in its (really) easiest form uses the TraceLogging APIs. If the system crashes and writes a crash dump (i.e., you're past the boot-critical OS startup), you can extract any unwritten part of an ETW log from the memory image. ETW is usually good up to a million events/sec or so, and the OS uses per-core log buffers, so you're not thrashing a shared lock. Having a debugger connected to see kernel DbgPrints is usually the easiest way to get small amounts of data right up to a crash.
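To give a flavor of how little ceremony TraceLogging needs, here's a minimal kernel-mode sketch. The provider name, GUID, and event fields are all made up for illustration; generate your own GUID:

#include <wdm.h>
#include <TraceLoggingProvider.h>

// Placeholder provider name and GUID -- substitute your own.
TRACELOGGING_DEFINE_PROVIDER(
    g_hMyProvider,
    "MyCompany.MyDriver",
    (0x7f54ca45, 0x93ac, 0x4a7e, 0x9b, 0x1c, 0x20, 0x3a, 0x4d, 0x5e, 0x6f, 0x70));

// In DriverEntry:  TraceLoggingRegister(g_hMyProvider);
// In DriverUnload: TraceLoggingUnregister(g_hMyProvider);

// Anywhere in the driver, emit a structured event -- no manifest,
// no message compiler; field names and types travel with the event.
VOID LogIoStart(PIRP Irp, ULONG Length)
{
    TraceLoggingWrite(
        g_hMyProvider,
        "IoStart",                          // event name
        TraceLoggingPointer(Irp, "Irp"),
        TraceLoggingUInt32(Length, "Length"));
}

Adding a new event is just another TraceLoggingWrite call, which is why I find it so much lower friction than the manifest flavor.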
If your system fully boots but will not write a crash dump, there's a fair chance you have a hung processor or something similar. Crash dumps run with all interrupts disabled, with a single processor polling the disk controller for completions (one reason the storage miniport model is strange is so it still works in this extremely limited execution environment).
A plus of an ETW trace is you can have a bunch of structured data, and if you only want to see the data from just before a crash, you can specify circular in-memory buffers: new events keep overwriting old events, and the circular buffers land in the crash dump. You can run a system for days or weeks logging to the circular buffer. You can also flip a switch and log many gigabytes of events to disk, saving every event. You will also need some of the tools to decode ETW traces. Even though it's a bit slow, Microsoft's Message Analyzer has among the best filtering (expressions). MA is kind of memory inefficient too: million-event traces are OK, but 100-million-event traces aren't. The Windows Performance Toolkit tools also analyze ETW traces; sometimes they have a really useful view of the data, and other times the views seem almost useless.
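Starting a circular in-memory (buffering-mode) session from a user-mode controller looks roughly like this; the session name and buffer sizes here are arbitrary choices, not anything canonical:

#include <windows.h>
#include <wmistr.h>
#include <evntrace.h>
#include <stdlib.h>

void StartBufferingSession(const GUID *providerId)
{
    // Room for the properties struct plus the session name ETW appends.
    ULONG size = sizeof(EVENT_TRACE_PROPERTIES) + 1024;
    EVENT_TRACE_PROPERTIES *props = calloc(1, size);
    props->Wnode.BufferSize = size;
    props->Wnode.Flags = WNODE_FLAG_TRACED_GUID;
    props->Wnode.ClientContext = 1;                   // 1 = QPC timestamps
    props->LogFileMode = EVENT_TRACE_BUFFERING_MODE;  // circular, memory-only
    props->BufferSize = 64;                           // KB per buffer
    props->MinimumBuffers = 64;                       // 64 x 64KB resident
    props->MaximumBuffers = 64;
    props->LoggerNameOffset = sizeof(EVENT_TRACE_PROPERTIES);

    TRACEHANDLE session = 0;
    StartTraceW(&session, L"MyDriverSession", props); // name is made up
    EnableTraceEx2(session, providerId, EVENT_CONTROL_CODE_ENABLE_PROVIDER,
                   TRACE_LEVEL_VERBOSE, 0, 0, 0, NULL);
    free(props);
}

After a crash, in the debugger, !wmitrace.strdump lists the sessions present in the dump, and !wmitrace.logsave <LoggerId> <file.etl> writes the in-memory buffers out as a normal .etl you can open in your viewer of choice.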
A HUGE plus of ETW traces is also that the OS has hundreds of event providers built in, so you can get unified traces of OS events along with your driver events. I'm a huge fan of ETW tracing, and even though the TraceLogging flavor has more overhead than the manifest flavor, I generally use TraceLogging because it's just a lot easier to add new events. ETW also includes the activity (correlation) ID feature, which is not well explained but incredibly useful for making logical sense of the raw events (like associating a specific interrupt with its originating I/O operation).
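As a sketch of that correlation (reusing the hypothetical g_hMyProvider from earlier): create an activity ID per request, stash it in your per-request context, and stamp every related event with it:

// At request submission: mint an ID and tag the first event with it.
VOID LogRequestStart(PIRP Irp, LPGUID ActivityId)
{
    EtwActivityIdControl(EVENT_ACTIVITY_CTRL_CREATE_ID, ActivityId);
    TraceLoggingWriteActivity(
        g_hMyProvider,
        "IoSubmitted",
        ActivityId,
        NULL,                               // no parent activity
        TraceLoggingPointer(Irp, "Irp"));
}

// Later, in the DPC for the completing interrupt, with the same GUID
// fetched back out of the request context:
VOID LogRequestDone(NTSTATUS Status, LPGUID ActivityId)
{
    TraceLoggingWriteActivity(
        g_hMyProvider,
        "IoCompleted",
        ActivityId,
        NULL,
        TraceLoggingNTStatus(Status, "Status"));
}

A viewer that understands activity IDs can then group the submission and the completion even with thousands of unrelated events in between.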
Forcing a crash dump on command is also important, and you know about the keyboard sequence (enabled via the CrashOnCtrlScroll registry value under the keyboard driver's Parameters key). Many server motherboards can also generate an NMI (non-maskable interrupt), which sometimes can be initiated from a server management interface (like on many Dell servers). Some servers also have the hardware to generate an NMI, but you have to add a little switch to the motherboard header to trigger it (many SuperMicro motherboards). Some motherboards know how to generate an NMI on a watchdog timeout too. An NMI has the advantage that it will punch through an interrupt storm or a looping IPI, although it will not interrupt a processor hung on a bus access (like an infinite PCIe config retry). It would take a PCIe analyzer to verify an infinite config retry, and only if the PCIe device is in a slot you can put the PCIe interposer on. At one point, there were JTAG-like debug interfaces for special x86 motherboards. It's always seemed odd that a $2 ARM microcontroller has a better low-level debugger interface (using a $15 interface tool) than a $2000 Intel server processor.
If you're looking for CHEAP ways to get really low-level debugging info out: 1) if your motherboard supports an old legacy parallel or serial interface, these were accessed with port I/O instructions; you can figure out the port address, which often would not change between boots, and make your driver toggle pins on the port (like DTR for a serial port). I suppose you could even do bit-banged serial and use an oscilloscope ($350, or a $15 Arduino) to read the serial stream. 2) PCs also had a debug port at 0x80, and for PCI (and maybe PCIe too) you could get cheap little boards that would read the debug port value and display it on a seven-segment LED. 3) I've also connected logic analyzers to parallel ports. These methods all require that your code somehow gets a CPU, which can be via an interrupt or a crash dump callback. Sometimes you can also monopolize a processor at high IRQL for a while: make a worker thread that raises IRQL to some higher level for a millisecond, and periodically lowers it back down to PASSIVE_LEVEL to avoid hanging any other threads scheduled on that core. If you monopolize a CPU at elevated IRQL forever, you will get a DPC watchdog crash. Monopolizing a CPU is not good in a production driver (proper power management is toast), but for debugging it does give you control of a processor a large percentage of the time. Sketches of both tricks follow.
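Both of these are debugging hacks, not production code. A rough kernel-mode sketch, assuming your platform actually decodes port 0x80 and picking an arbitrary 1 ms slice length:

#include <wdm.h>

// 1) One-byte breadcrumbs out the legacy 0x80 POST/debug port, visible
//    on a POST card or logic analyzer even if no log ever hits disk.
VOID EmitBreadcrumb(UCHAR code)
{
    WRITE_PORT_UCHAR((PUCHAR)(ULONG_PTR)0x80, code);
}

// 2) Worker-thread body that owns its CPU at DISPATCH_LEVEL in ~1 ms
//    slices, dropping back to the thread's PASSIVE_LEVEL between
//    slices so the scheduler and the DPC watchdog stay happy.
VOID MonopolizeSliceLoop(PVOID Context)
{
    UNREFERENCED_PARAMETER(Context);
    for (;;) {
        KIRQL oldIrql;
        KeRaiseIrql(DISPATCH_LEVEL, &oldIrql);
        // ... poll your hardware / emit breadcrumbs here ...
        KeStallExecutionProcessor(1000);   // busy-wait ~1000 microseconds
        KeLowerIrql(oldIrql);              // brief passive-level window
    }
}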
ETW tracing has an option to use the TSC timestamp instead of the QPC timestamp, although I don't remember if that actually works on current OS versions and CPUs. My fuzzy memory is the TSC timestamp works but is not so useful, because writing the event takes a few hundred cycles, and even if it used a synchronizing RDTSCP you'd end up with only a little better resolution, not 1024x better.
If you need really fast timing measurements, Intel has a great white paper on how to get the very best resolution from the TSC. It talks about the pros and cons of using RDTSC, which does not serialize the pipeline, vs RDTSCP, which waits for prior instructions to execute. I've read that even with RDTSCP the processor will start executing new instructions before the timer read finishes, so you can't fully escape the timing fuzziness of out-of-order execution.
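The fencing pattern that paper recommends looks roughly like this with the MSVC intrinsics (my paraphrase; check the paper for the authoritative version):

#include <intrin.h>

unsigned __int64 MeasureCycles(void (*fn)(void))
{
    int regs[4];
    unsigned int aux;

    __cpuid(regs, 0);                      // serialize: drain earlier instructions
    unsigned __int64 start = __rdtsc();

    fn();                                  // code under test

    unsigned __int64 end = __rdtscp(&aux); // waits for prior instructions to execute
    __cpuid(regs, 0);                      // keep later instructions from moving up
    return end - start;
}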
I think one of the reasons for QPC being TSC/1024 is that on modern processors, with out-of-order and speculative execution and potentially close to 100 instructions in some state of execution, TSC/1024 is a number that is above the jitter. RDTSC is NOT a synchronizing instruction, so the value is read at some arbitrary point in the out-of-order queue. The OS will also choose an "optimal" high-resolution time source on a particular system. On some processor models, the TSC tracks power-management clock-speed changes; on those systems, the OS may choose a global hardware timer for QPC, like the high-resolution event timer (HPET) in the chipset. I've often found QPC is sufficient resolution for performance analysis of most C code. For tight inner loops, even the TSC is insufficient, and you need to look at things more statistically (like timing artificially large test cases). For inner-loop optimization, it's still useful to look at the instruction stream and manually add up instruction throughput cycles, result latencies, and dependency stalls. You generally don't write assembler directly, but you certainly can carefully tune your C code to persuade the compiler to generate the code you want.
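For completeness, the plain QPC version, which is what I usually reach for first:

#include <windows.h>

double ElapsedMicroseconds(void (*fn)(void))
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);      // counts per second
    QueryPerformanceCounter(&t0);
    fn();
    QueryPerformanceCounter(&t1);
    return (t1.QuadPart - t0.QuadPart) * 1e6 / (double)freq.QuadPart;
}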
For fine-grain performance analysis, Intel's VTune is usually my tool of choice; it answers questions like: how long does that interlocked memory operation really take? (An enormous amount of time if it's a cache miss to a different CPU package.)
Jan
-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of xxxxx@rolle.name
Sent: Saturday, March 3, 2018 10:07 PM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] Tips on diagnosing system hangs?
Correction – I meant R-CTL+ScrLock+ScrLock to create a bug check.
Another question…
7. If I put DbgPrint calls in my driver, can you show me an easy tutorial on how to use the output. In particular, I’d want to save the data QUICKLY to disk in case of a crash or hangup.