Quirky would match my experience of TraceLogging. At times I curse it and at times I really appreciate it. The stack space use is a little strange, and I’ve seen some strange errors reported by code analysis (cpp only). As TraceLogging is implemented as all macros, syntax errors can give pretty meaningless messages.
A thing that’s not apparent in TraceLogging is the use of the activity id fields, by using the TraceLoggingWriteActivity instead of TraceLoggingWrite. The event is always written with one activity id anyway, all zeros by default, so always making this a meaningful value has no significant impact on performance.
What’s very fuzzy is how to generate activity ids, which are defined as GUIDs, any 128-bit value seems to be fine. Currently I have a little function that does InterlockedIncrement64 on a counter in the device context, giving the low 64-bits, and I append the device context address as the high 64-bits, making the set of events for each device instance unique. A higher performance implementation could have a per core counter, with the initial values spread, so each core gets a huge range of unique counter values. The second activity id field can be used to create an association between groups of activity’s, and normally the second activity field is not written. I have also added fields to some of my data structures to store the activity id for a groups of events.
There is supposed to be an activity id field associated with a thread context, but I don’t see how to tell TraceLogging in a driver to use it instead of zeros. The concept is you are supposed to be able to set a new activity id in a user mode application, and that id flows down into the kernel, and kernel components can associate work with a parent activity id. In theory, if everybody properly set activity ids and related activity ids, you could follow a user mode event down though all the driver layers and associate the I/O completion with the original user mode work. This would be very useful for debugging and performance analysis, as it could connect things like the hardware accesses to an application HTTP request.
To give an example of how activity id’s fit together, on the NIC driver I have:
- Tx packets get a context structure allocated to hold tx state information, this is not extra for tracing but there is a field to store an activity id. If tracing of the tx path is enabled (there is a function to test keywords) I generate a unique activity id for the specific packet, and may write a begin transmit event, using the TraceLoggingOpcode macro to tag this event as the beginning of an operation. You can tag events as begin/end and use that to calculate the elapsed time. I believe Windows Performance Analyzer understands these Opcode tags, although I’ve never figured out exactly what It wants. Microsoft should write a document called “Writing events formatted to get the best use of Windows Performance Analyzer”.
- Events are written for tx activity, like triggering the mailbox register for the hw, all tagged with the activity of the matching packet.
- The tx ISR can generate a new activity id for the interrupt, and stores it in an interrupt context. My hardware has a mask bit for each interrupt, and the first time the tx interrupt fires, I set the mask bit (and software flag) and generate a new activity id. My hardware takes a bit of time for disabling the interrupt to happen (posted write), and I can see spurious interrupts (MSIX writes already inflight). My ISR notices if an interrupt happens when the software flag says they are disabled, and generates a trace event, using the activity id from the first interrupt. This allows me to easily see when analyzing the trace, the initial interrupt and any spurious interrupts after that initial one.
- My DPC uses the activity id generated in the ISR to tag the ring descriptors processed, making it easy to see which descriptors were associated with a specific interrupt. As descriptors get processed, the associated tx packet context can also be used to write a completion event for the original packet. This is one of the places I use the second activity id. This tags events such that I can find the interrupt activity id that caused the completion of a specific tx packet, and can find all the tx packets that were completed as a result of a single interrupt, or the time from when you wrote the tx hw mailbox to the time you got the completion interrupt for that packet.
- My driver also needs to avoid triggering the DPC watchdog, so it may pass work off to a thread to continue processing tx completion descriptors. I generate a new activity id when this thread starts up, and specify the original interrupt activity id and the new thread invocation activity id. My thread uses the thread activity id when processing descriptors, so I can separately group the descriptors processed in the DPC and the thread, and by following the activity id chains, can determine which packets were completed by the DPC, and which by the thread, and associate them all back to the original interrupt. All these events have timestamps, and it’s very useful see the time interval between ISR, and DPC activation, and handing work off to a passive thread.
I frequently generate traces with a few (or tens) million events and it becomes really difficult to find the correct events for debugging some issue. When viewing, subsetting events based on activity ids that are stored in some context, generally eliminates the need to subset events per processor. The full trace in timestamp order, across many processors, can be pretty hard to make sense of. Having a number of keywords, and subsetting for one keyword and one core makes the trace a lot more readable without any activity ids.
A limitation I find of this tracing systems is it’s not zero overhead. The system seems to use a clock of processor timestamp counter / 1024, so I get about 0.3 microsecond timestamp resolution on a 3 Ghz processor, and Message Analyzer seems to displays timestamps to 100 nanosecond resolution. Back to back events sometimes have the same timestamp, so I believe the event processing time is in the 0.1 to 0.3 microsecond range. I believe it would be possible to have a tracing system with an order of magnitude (or more) better resolution, and lower overhead. I can trace a few hundred thousand interrupts/sec but can’t trace 20 million descriptors/sec, unless I write summary events with the current system. There are TraceLogging macros for arrays of values, so you can sometimes write summary events with much of the same data, although the viewing tools don’t know that one event is really data from multiple events, and you only get one timestamp, unless you capture your own.
Something that would be super useful would be a tool to view the events that understood how activity ids can form chains. Using Message Analyzer, you can build up complex filter expressions by looking at the activity id field (right click add to filter), which is one of the ETW event header fields. It would be nicer to have a display mode that allows traversing forward and backward across the network of activity id associations. It’s like you’re trying to get a display of all packets at step 7, that had a common activity at step 4. This is really useful because often when debugging, the root cause problem fans out, and you only detect the problem after its spread. You really want to interactively navigate this node network, expanding/contracting future/past branches on demand.
Windows Performance Analyzer seems to understand TraceLogging events, but I’ve never had much success getting it to do what I want. Perhaps I’m just holding it wrong. I do almost always use Windows Performance Recorder to capture the traces, although the control files you have to make are a pain. Message Analyzer 1.4 can do event captures, but seems too slow and drops events compared to Performance Recorder.
I also use perfview, see https://github.com/Microsoft/perfview Perfview is fast, and can really easily filter on the event name and time range, and can do text pattern filters, but can’t easily build compound filter expressions like you get in Message Analyzer. It also has a bunch of .Net specific stuff you need to ignore.
Jan
On 11/6/16, 5:38 AM, “xxxxx@lists.osr.com on behalf of xxxxx@osr.com” wrote:
It’s quirky (in some really weird and unexpected ways … like its use of stack space) but it’s also very handy. Definitely a good tool to have in the chest.