PCIe over Thunderbolt, here are good guidelines

This document is the best set of guidelines I’ve seen for supporting PCIe over Thunderbolt. It covers all the things that I learned while updating a PCIe driver and PCIe hardware for use on Thunderbolt.
Thunderbolt Device Driver Programming Guide
Yes, it is from Apple, so you can ignore all the macOS-specific stuff, but most of the issues are the same on any OS.

These sections cover the topics I would emphasize to anyone adding support for Thunderbolt:

Tolerating PCI Latency: This is the area where we also made hardware changes to the PCIe device along with driver changes.

Using Hot Plug Operation with PCI Devices: Check for device gone everywhere.
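To make "check for device gone everywhere" concrete, here is a minimal sketch in plain C. It assumes the usual PCIe behavior that a read from a removed device comes back as all ones, and it assumes the device has some register (an ID register, say) that never legitimately reads as all ones. The names `device_t`, `device_read32`, `fake_hw_t`, and `fake_read` are hypothetical and exist only for illustration; in a real driver the function pointer would be replaced by the platform's MMIO read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REG_ALL_ONES 0xFFFFFFFFu

/* Hypothetical register accessor: stands in for a real MMIO read so the
 * sketch can run without hardware. */
typedef uint32_t (*reg_read_fn)(void *ctx, uint32_t offset);

typedef struct {
    reg_read_fn read;
    void *ctx;
    uint32_t id_offset; /* assumption: this register never reads all ones */
    bool gone;          /* latched once removal has been detected */
} device_t;

/* Read a register, returning false if the device appears to be unplugged.
 * An all-ones value may be legitimate data, so it is confirmed against the
 * ID register before the device is declared gone. */
static bool device_read32(device_t *dev, uint32_t offset, uint32_t *value)
{
    assert(dev != NULL && value != NULL);
    if (dev->gone)
        return false;

    uint32_t v = dev->read(dev->ctx, offset);
    if (v == REG_ALL_ONES &&
        dev->read(dev->ctx, dev->id_offset) == REG_ALL_ONES) {
        dev->gone = true; /* remember it so later calls fail fast */
        return false;
    }
    *value = v;
    return true;
}

/* Simulated hardware used only to exercise the sketch. */
typedef struct {
    uint32_t regs[4];
    bool unplugged;
} fake_hw_t;

static uint32_t fake_read(void *ctx, uint32_t offset)
{
    fake_hw_t *hw = ctx;
    return hw->unplugged ? REG_ALL_ONES : hw->regs[offset / 4];
}
```

Every caller then has to handle a `false` return (bail out of loops, complete requests with an error) instead of acting on the value, which is the part that takes the real work in an existing driver.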

Hopefully someone finds this helpful.

Eric

Do you have any experience with the Kernel DMA Protection introduced in Windows 10 1803? I think we are encountering a failure due to this with a legacy WDM PCI driver. We are getting a DRIVER_VERIFIER_DMA_VIOLATION bugcheck with Arg1 0x26 (IOMMU detected DMA violation) when running the hardware over Thunderbolt. My suspicion is that this driver does not use the Windows DMA APIs and thus does correctly deal with DMA remapping. I found it interesting that it reports as a DRIVER_VERIFIER violation even though Driver Verifier is not running on this system. I have this vague nagging feeling that I’ve encountered one other bugcheck in the past that had this behavior (i.e., saying it was a DV bugcheck when DV was not running), but it still caught me by surprise.

My suspicion is that this driver does not use the Windows DMA APIs and thus does correctly deal with DMA remapping

That would, of course, cause the problem.

Are you SURE verifier isn’t running at the time of the crash? You know Windows WILL automatically enable it following certain crashes.

Peter

@“Peter_Viscarola_(OSR)” said:

My suspicion is that this driver does not use the Windows DMA APIs and thus does correctly deal with DMA remapping

That would, of course, cause the problem.

Yeah, I need to dig through the (maze of twisty passages) code and make sure of this, but from what I know about it it’s not a bad hunch. (I just noticed that I put “does correctly” instead of “does not correctly” in the OP. :/)

Are you SURE verifier isn’t running at the time of the crash? You know Windows WILL automatically enable it following certain crashes.

I haven’t had my hands on the system yet, but it is with a colleague that I trust and he said that DV was not enabled. I didn’t realize that it would automatically be enabled following certain crashes. Are those documented somewhere, or is it just tribal knowledge?

Are those documented somewhere, or is it just tribal knowledge?

I dunno. I’m just telling you what I’ve experienced. You can check from the dump using !verifier.

Peter

@“Peter_Viscarola_(OSR)” said:
I dunno. I’m just telling you what I’ve experienced. You can check from the dump using !verifier.

!verifier 0x1 doesn’t show any drivers being verified, and the only flag set is (0x00000000) Automatic Checks. Notably, (0x00000080) DMA checking is not enabled.

That is super interesting.

Google says there are other folks seeing issues like the one you’re reporting. One guy, like you, is very clear that Verifier is not running. In some cases, people have solved their problem by flashing the latest BIOS. In other cases, the problem came from errors in the Dell Thunderbolt dock driver.

So, in addition to your observation (which is definitive) there’s additional evidence that the IOMMU checks are being done even when Verifier is not enabled. I guess this makes sense… they can figure out if the IOMMU isn’t being used (properly) without having to go to the extreme of forcing data to be double-buffered (which is what the DMA Verification option does). Assuming this is the case, I can see how they might just use the Driver Verifier bugcheck code to indicate that the error results from checking on the activity of an errant driver (there aren’t an unlimited number of bugcheck codes, after all). But what they’ve done is make things confusing to us devs… as you’ve pointed out.

They can fix this in the documentation… and it’d be nice if they told us SOMEthing about what this check is and what a violation means.

Peter

Last year we were debugging a crash with DRIVER_VERIFIER_DMA_VIOLATION, and I got verification from Microsoft folks that this is indeed poorly named - the check is done irrespective of Verifier settings.

Diane

@Diane said:
Last year we were debugging a crash with DRIVER_VERIFIER_DMA_VIOLATION, and I got verification from Microsoft folks that this is indeed poorly named - the check is done irrespective of Verifier settings.

Diane,

Thanks for the confirmation. Did they say that was true of all DRIVER_VERIFIER_DMA_VIOLATION subtypes, or just specific failure modes?

Gabe

Hi Gabe,

Sorry, I don’t know about all subtypes. I was focused on my particular failure.

Diane

Thank you @Diane! That’s very helpful.

Did they say that was true of all DRIVER_VERIFIER_DMA_VIOLATION

I can tell you for sure that “ordinary” DMA verification is absolutely not enabled without Verifier. The overhead would be untenable.

Peter

I did a little assembly searching for fun on whatever Win10 version of nt/hal I’m running. So, this shouldn’t be taken as definitive, but it’s something.

From what I see, the HAL appears to generate two different DRIVER_VERIFIER_DMA_VIOLATION bugchecks, regardless of whether or not Verifier is enabled:

  1. From HalpDmaControllerFlushChannel with Parameter1 == 0x23 - “Cannot flush a channel that hasn’t been completed or cancelled”.
  2. From IvtHandleInterrupt with Parameter1 == 0x26 - “IOMMU detected DMA violation”. This appears to be in response to certain interrupts generated by the IOMMU.

The kernel is a bit more confusing… Most of the bugchecks come from DMA Verifier being enabled. However, if you’re running on a system where DMA is not cache coherent (as controlled by the nt!KiSystemFullyCoherent global) you might see a DRIVER_VERIFIER_DMA_VIOLATION from KeFlushIoBuffers with a Parameter1 == 4 (“Driver has freed too many simultaneous adapter channels”) or 5 (“Freed too many map registers”) even without Verifier enabled.
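For anyone triaging dumps, the cases above can be collected into a small lookup table. This is only a sketch of the observations in this post (one Win10 build, found by reading disassembly), not a definitive or complete list. The type and function names (`dma_violation_note`, `find_note`, `describe_dma_violation`) are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One DRIVER_VERIFIER_DMA_VIOLATION case reported above as reachable
 * without Driver Verifier enabled. */
typedef struct {
    uint32_t param1;
    const char *origin;  /* routine seen raising the bugcheck */
    const char *meaning; /* description quoted in the post */
} dma_violation_note;

/* Entries 0x04 and 0x05 were only seen as possible on systems where DMA
 * is not cache coherent. */
static const dma_violation_note k_notes[] = {
    { 0x04, "KeFlushIoBuffers",
      "Driver has freed too many simultaneous adapter channels" },
    { 0x05, "KeFlushIoBuffers", "Freed too many map registers" },
    { 0x23, "HalpDmaControllerFlushChannel",
      "Cannot flush a channel that hasn't been completed or cancelled" },
    { 0x26, "IvtHandleInterrupt", "IOMMU detected DMA violation" },
};

/* Return the note for a Parameter1 value, or NULL if it is not listed. */
static const dma_violation_note *find_note(uint32_t param1)
{
    for (size_t i = 0; i < sizeof(k_notes) / sizeof(k_notes[0]); ++i) {
        if (k_notes[i].param1 == param1)
            return &k_notes[i];
    }
    return NULL;
}

/* Write a one-line summary into buf. Returns 1 if the value is one of the
 * cases listed above, 0 otherwise. */
static int describe_dma_violation(uint32_t param1, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    const dma_violation_note *n = find_note(param1);
    if (n == NULL) {
        snprintf(buf, len, "0x%02X: not among the cases listed here",
                 (unsigned)param1);
        return 0;
    }
    snprintf(buf, len, "0x%02X: %s (from %s), seen without Verifier",
             (unsigned)param1, n->meaning, n->origin);
    return 1;
}
```

Passing Arg1 from the bugcheck (for example 0x26) to `describe_dma_violation` gives a quick hint as to whether it is worth double-checking `!verifier` before assuming Verifier was on.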

Soooo, yeah, they should have picked a different crash code. Probably seemed like a good idea at the time :)
