MSI: Multiple Message Enable

Hello all,

I’m writing a KMDF driver for a PCIe device that is capable of sending 16 different MSIs.

Inside the My_EvtDriverDeviceAdd() callback, I do:
WDF_INTERRUPT_CONFIG_INIT(&InterruptConfig, My_EvtInterruptIsr, My_EvtInterruptDpc);
InterruptConfig.EvtInterruptEnable = My_EvtInterruptEnable;
InterruptConfig.EvtInterruptDisable = My_EvtInterruptDisable;
InterruptConfig.AutomaticSerialization = TRUE;
for (Index = 0; Index < 16; Index++)
{
    Status = WdfInterruptCreate(DeviceContext->Device, &InterruptConfig, WDF_NO_OBJECT_ATTRIBUTES, &(DeviceContext->Interrupt[Index]));
    if (!NT_SUCCESS(Status)) {
        break;
    }
}
Status is always success.

Inside the My_EvtDevicePrepareHardware() callback, I walk the CM_PARTIAL_RESOURCE_DESCRIPTOR structures (extracted from ResourcesTranslated) and I see only one CmResourceTypeInterrupt descriptor.
I expected to see 16 CmResourceTypeInterrupt descriptors, but I’m not sure about that.
The Flags field equals (CM_RESOURCE_INTERRUPT_LATCHED | CM_RESOURCE_INTERRUPT_MESSAGE), which is correct.

Here is a TraceView extract when I activate the driver:
01 My_EvtDriverDeviceAdd(---->)
02 My_EvtDriverDeviceAdd(<----)
03 My_EvtDevicePrepareHardware(---->)
04 PartialResourceDescriptor->Type == 0x3
05 PartialResourceDescriptor->ShareDisposition == 0x1
06 PartialResourceDescriptor->Flags == 0x84
07 PartialResourceDescriptor->u.Memory.Start == 0xfcaff000
08 PartialResourceDescriptor->u.Memory.Length == 0x1000
09 PartialResourceDescriptor->Type == 0x81
10 PartialResourceDescriptor->ShareDisposition == 0x1
11 PartialResourceDescriptor->Flags == 0x0
12 PartialResourceDescriptor->Type == 0x3
13 PartialResourceDescriptor->ShareDisposition == 0x1
14 PartialResourceDescriptor->Flags == 0x84
15 PartialResourceDescriptor->u.Memory.Start == 0xfcafe000
16 PartialResourceDescriptor->u.Memory.Length == 0x1000
17 PartialResourceDescriptor->Type == 0x81
18 PartialResourceDescriptor->ShareDisposition == 0x1
19 PartialResourceDescriptor->Flags == 0x0
20 PartialResourceDescriptor->Type == 0x2
21 PartialResourceDescriptor->ShareDisposition == 0x1
22 PartialResourceDescriptor->Flags == 0x3
23 PartialResourceDescriptor->u.MessageInterrupt.Raw.MessageCount == 0x0
24 PartialResourceDescriptor->u.MessageInterrupt.Raw.Vector == 0x62
25 PartialResourceDescriptor->u.MessageInterrupt.Raw.Affinity == 0x1
26 PartialResourceDescriptor->u.MessageInterrupt.Translated.Level == 0x5
27 PartialResourceDescriptor->u.MessageInterrupt.Translated.Vector == 0x62
28 PartialResourceDescriptor->u.MessageInterrupt.Translated.Affinity == 0x1
29 My_EvtDevicePrepareHardware(<----)
30 My_EvtInterruptEnable(---->)
31 My_EvtInterruptEnable(<----)

Here is a TraceView extract when I receive an interrupt (the device sends a message number equal to 0x3, but the Message ID received is always 0x0):
32 My_EvtInterruptIsr(---->)
33 MessageID = 0x0
34 My_EvtInterruptIsr(<----)
35 My_EvtInterruptDpc(---->)
36 My_EvtInterruptDpc(<----)

When I read the MSI capability structure of the PCI configuration space, I see:
00896005 (Message Control + Next Pointer + Capability ID)
FFE01000 (Message Address)
00000000 (Message Upper Address)
xxxx40B0 (Message Data)
The Capability ID (0x05) is correct (corresponds to MSI per PCI specification).
The Next Pointer (0x60) is strange because it doesn’t seem to point to another capability structure…
The Message Control value (0x0089) means:

  • Per-vector masking capable = 0
  • 64-bit address capable = 1
  • Multiple Message Enable = 000 (ONLY 1 MESSAGE ALLOCATED BY WINDOWS)
  • Multiple Message Capable = 100 (the 16 messages requested by the device)

The Message Address seems correct.
The Message Data doesn’t mean anything to me :-(

In summary, my device asks for 16 MSIs and Windows allocates only 1 MSI.
And I don’t understand why, since this is the only PCIe device in the system, and there aren’t many other devices in this tiny portable system.
Do you have any idea that could help me understand this issue?

Sorry for this huge post, but I want you to have as much information as possible.

Thanks a lot.
Best regards,
Vincent

I tried with 4 messages, and it works!
I will try with 8 now.
What is the rule for knowing how many messages Windows can allocate for a given device?

It is highly dependent on the system and the peripherals (I am sure Jake
Oshins can give a more detailed explanation). Basically you have to
assume you may or may not get the number of MSI messages you want, and
program accordingly.

Don Burn (MVP, Windows DDK)
Windows Filesystem and Driver Consulting
Website: http://www.windrvr.com
Blog: http://msmvps.com/blogs/WinDrvr

“xxxxx@cea.fr” wrote in
message news:xxxxx@ntdev:

> I tried with 4 messages, and it works!
> I will try with 8 now.
> What is the rule for knowing how many messages Windows can allocate for a given device?

Why do you need that many MSI interrupts? One interrupt should be enough for a reasonable design. Remember that ONE interrupt can be directed to as many processors as your hardware supports. You don’t need 16 MSI interrupts to support RSS across 16 processors.

In your original post, are you sure you were getting an MSI and not a conventional line-based interrupt (LBI) as a fallback? I can’t really tell from your output… u.MessageInterrupt.Raw.MessageCount == 0x0 looks a lot like you’re getting ZERO MSIs.

And to clarify your follow-up post: You changed your DEVICE to request a different number of MSI? That is, you tried this with your device asking for 16 MSIs, and with your device asking for 4 MSIs, on the SAME host system?

Your INF isn’t limiting the number of MSIs by any chance, is it?
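For reference, MSI support is opted into (and can be capped) via registry values that an INF sets from its HW section. A sketch — the install and AddReg section names below are placeholders, but the key names are the documented ones:

```ini
; [MyDevice_Install.NT.HW] and MyDevice_MSI_AddReg are hypothetical names.
[MyDevice_Install.NT.HW]
AddReg = MyDevice_MSI_AddReg

[MyDevice_MSI_AddReg]
HKR, "Interrupt Management\MessageSignaledInterruptProperties", MSISupported, 0x00010001, 1
; MessageNumberLimit caps how many messages the PCI driver will grant the device.
HKR, "Interrupt Management\MessageSignaledInterruptProperties", MessageNumberLimit, 0x00010001, 16
```

If MessageNumberLimit were set to 1 (or left at a low default by a copied INF), that alone would explain a single granted message.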

You’ve read the MSI document from here? http://download.microsoft.com/download/5/7/7/577a5684-8a83-43ae-9272-ff260a9c20e2/MSI.doc

Peter
OSR

Alex wrote …
Why do you need that many MSI interrupts? One interrupt should be enough for a reasonable design. Remember that ONE interrupt can be directed to as many processors as your hardware supports. You don’t need 16 MSI interrupts to support RSS across 16 processors.

… and I reply …
Multiple MSI interrupts are very useful in two situations: virtualization and very high-speed data exchange applications. For virtualization (and assuming the HW has been designed to segment its universe this way), you can have the device dedicate n regions, one for each VM, and dedicate an MSI interrupt to each VM. When the driver (running on the host) gets an interrupt, it can channel it directly via a hypercall to the VM based on the MSI table, without needing to decode the destination – very handy, and it gets the host driver out from between the guest driver and the hardware.

For very high-speed data exchange applications, such as InfiniBand (which utilizes various methods to implement zero-copy data transfers at up to 90% of possible PCIe Gen2 x16 bandwidth, typically 2.1 to 2.4 GBytes/sec), the latency involved in decoding an interrupt type in the ISR and dispatching the DPC from there is unacceptable – by using the MSI message number to determine the “type” of interrupt (DMA complete, DMA empty, error, etc.) and handling the event directly in the ISR for that MSI, you can keep up with the HW even at 2.4 GBytes/sec.

As for the OP: even though there is a registry method that allows up to 32 MSI interrupts to be programmed (and even assuming the HW can support that), in the field I never got it working reliably above 8 MSI entries. For further info, I would look at the InfiniBand OpenFabrics work, specifically the open-source Mellanox drivers, to see how they implement MSI in their network drivers. Google for WinOF, follow the link to the SVN codebase, and you’re set.

Cheers!

If you read Volume 3, Chapter 8, Section 11 of Intel’s Software Developer’s
Manual, you’ll see the interaction of the APIC and the interrupt
message format. The short form of it is that you need a block of IDT
entries that are naturally aligned, that is to say that the starting IDT
entry must be a multiple of the number of messages that you’re claiming, in
order to support multi-message MSI.

It’s very rarely the case that Windows can allocate a block of 16
consecutive, aligned IDT entries on a single-processor system with a lot of
devices. (I’m assuming from your other post that a “tiny portable system”
has only one processor.)

4 or 8 is much more likely to be successfully allocated. The fact that the
space tends to be fragmented is another artifact of the way an APIC works.
The older APICs could only queue two interrupts per priority band (which,
under NT, is an IRQL). So the old code would allocate a single IDT entry at
one IRQL, and then the next one would come from a different IRQL. (IRQLs, or
APIC priorities, are 16 IDT entries wide.) Once every usable device IRQL
had one of its 16 entries allocated, the next device would claim from
the first IRQL again.

Consequently, after starting just a few devices in a uniprocessor system,
there are no IRQLs which have 16 entries left.

When I re-wrote the code to support MSI, I thought about changing this. But
I was more worried that I’d inadvertently change behavior of older drivers
and break a lot of systems than I worried that people would build a lot of
devices with 16 MSI messages, particularly since MSI-X isn’t subject to
these restrictions. So many-message MSI only really works well if you have
a larger ratio of cores/threads to devices.

In the end, I think I made the right choice. You’re about the second person
I’ve ever heard of having a problem with this.

Jake Oshins
Hyper-V I/O Architect (former interrupt guy)
Windows Kernel Group

This post implies no warranties and confers no rights.


wrote in message news:xxxxx@ntdev…

I tried with 4 messages, and it works!
I will try with 8 now.
What is the rule for knowing how many messages Windows can allocate for a given
device?

Jake O:

I saw that for an MSI-X “vector” requested by the device, Windows generates an InterruptMessageTable with the first message having “all processors” affinity, a few messages that look like “all processors in a node” affinity, and then one message per processor. Is this documented anywhere? Also, for systems with >64 processors, how is affinity encoded in that table?

Does such “vector” require only one IDT entry?

If that’s true, that’s an aspect of the code that’s changed since the last
time I read it. I suspect that your driver or your INF is applying policy
for the messages.

The last time I read that part of the code, every message got the default
policy for the entire machine unless you overrode that. The policy for the
machine depends on the processor architecture and whether the chipset
supports MSI at all. X64 policy for a machine that supports MSI is that, in
the absence of policy from the driver or the INF, interrupts are sent to any
one of a group of processors. This was chosen to maintain compatibility
with drivers written before the introduction of MSI. It’s still useful for
network drivers that don’t support RSS.

In practice, with today’s processors, it’s likely that you have more than 8
hardware threads, which forces the OS to go into APIC cluster mode, where
interrupts can only target processors in groups of four. So your interrupt
will be assigned to some group of four processors. If there are fewer than
9 hardware threads (cores, HyperThreads, whatever) then the machine will be
in flat mode, where the interrupt target group is all of the processors.

Each processor has a separate IDT. It would be possible to use a single IDT
for more than one processor, but Windows doesn’t do this for various
reasons.

For machines with more than 64 but fewer than 128 processors, nothing much
changes. The local APIC in a processor core or thread belongs to a cluster.
There are typically 4 processors in each cluster.

For machines with more than 127 processors, we depend on X2APIC mode and
VT-d to target interrupts at processor numbers which couldn’t be represented
in the old APIC modes. In this case, clusters are more fluid and can have
more processors in them. For machines that support this mode, the BIOS
picks the clusters. You would typically see clusters that are either an
entire socket or half of a socket. For instance, I was debugging a machine
the other day that had six-core, twelve-thread processors in it. The X2APIC
clusters mostly had 6 threads apiece, though some had only two.

Typically, though, people using MSI-X will choose a policy of “one processor”
for their messages. Thus they get to use MSI-X to route an interrupt to a
specific thread. In this case, none of the above matters.

Jake Oshins
Hyper-V I/O Architect (former interrupt guy who’s been spending some time on
it lately)
Windows Kernel Group

This post implies no warranties and confers no rights.


wrote in message news:xxxxx@ntdev…

Jake O:

I saw that for an MSI-X “vector” requested by the device, Windows generates
an InterruptMessageTable with the first message having “all processors” affinity, a
few messages that look like “all processors in a node” affinity, and then one
message per processor. Is this documented anywhere? Also, for systems
with >64 processors, how is affinity encoded in that table?

Does such “vector” require only one IDT entry?

Thanks to all for your answers.

In fact, I code both the firmware of the device and the driver for that device.
And since the platform is controlled by us (both the software and the hardware), and since the device is not meant to be used on another platform, I can do “what I want” (limited only by the OS and the hardware around the device).

Because we have strong latency requirements, I’m trying to use as many MSIs as possible.
But for the moment, it’s more about curiosity than anything else.

For information, it works with 8 MSIs, but not 16.

Vincent

wrote in message news:xxxxx@ntdev…
> Thanks to all for your answers.
>
> In fact, I code both the firmware of the device and the driver for that
> device.
> And since the platform is controlled by us (both the software and the
> hardware), and since the device is not meant to be used on another
> platform, I can do “what I want” (limited only by the OS and the hardware
> around the device).
>
> Because we have strong latency requirements, I’m trying to use as many
> MSIs as possible.
> But for the moment, it’s more about curiosity than anything else.
>
> For information, it works with 8 MSIs, but not 16.
>
> Vincent

Well, all it takes to get 16 MSIs working is just to code your own kernel
:-) – if I understood Mr. Oshins’ posting correctly.

–pa

> Well, all it takes to get 16 MSIs working is just to code your own kernel
> :-) – if I understood Mr. Oshins’ posting correctly.

I don’t think so :-)

I think Mr. Oshins told us that we must design our own APIC to bypass this limitation.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

And that’s exactly what Intel (and AMD) did with VT-d (and the IOMMU). The
problem is that no existing, shipping Windows uses those.

Jake Oshins
Hyper-V I/O Architect
Windows Kernel Group

This post implies no warranties and confers no rights.


“Maxim S. Shatskih” wrote in message news:xxxxx@ntdev…

Well, all it takes to get 16 MSIs working is just to code your own kernel
:-) – if I understood Mr. Oshins’ posting correctly.

I don’t think so :-)

I think Mr. Oshins told us that we must design our own APIC to bypass this
limitation.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

With all due respect, it’s not clear to me how using more MSIs will help reduce latency problems in practice. An initial analysis might suggest it COULD (by keeping 16 separate ISRs running on 16 separate processors, perhaps?)… but when you think this through carefully, I’d suggest that unless you’re doing a ton of work in your ISR (such as retrieving the data from a set of PIO device registers), having all those separate interrupt sources probably isn’t helpful.

What WILL help is carefully coding your DpcForIsr and servicing multiple device events per DPC invocation.

Sorry… I realize that wasn’t your question, but it’s the sort of issue we often spend time thinking about here at OSR.

Peter
OSR