> Perhaps this means that 10GbE adapters are intended for high-end systems
> that have high CPU and DRAM speed, and don't have any "bad" devices/drivers
> that may be OK in PCs or laptops.
Actually, these days desktops and laptops are "not-so-low-end" systems either. If you go to a shop, you are very unlikely to see anything on display that doesn't have multiple cores and at least 2 GB of RAM…
Anton Bassov
> Not until Vista…
Under the earlier OSes, a DPC routine had to be executed on the same CPU where
it got queued. However, Vista allows DPCs to target a particular processor
(oops, this looks like a probable explanation for the whole thing - according
to the OP, this problem arises only under Vista)…
I think even NT4 allowed CPU-targeted DPCs.
By default, a DPC is queued to the same CPU on which KeInsertQueueDpc was
called; that way, no interprocessor interrupt is required within
KeInsertQueueDpc.
Targeted DPCs require an IPI.
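For reference, a minimal sketch of targeting a DPC at a specific CPU with the documented KeSetTargetProcessorDpc routine (the routine and context names are hypothetical, and CPU 1 is an arbitrary example):

#include <ntddk.h>

KDPC g_Dpc; /* sketch only; normally this lives in the device extension */

VOID MyDpcRoutine(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
{
    UNREFERENCED_PARAMETER(Dpc);
    UNREFERENCED_PARAMETER(Context);
    UNREFERENCED_PARAMETER(Arg1);
    UNREFERENCED_PARAMETER(Arg2);
    /* runs at DISPATCH_LEVEL on the target CPU */
}

VOID QueueTargetedDpc(PVOID Context)
{
    KeInitializeDpc(&g_Dpc, MyDpcRoutine, Context);
    KeSetTargetProcessorDpc(&g_Dpc, 1);   /* always run on CPU 1 */
    KeInsertQueueDpc(&g_Dpc, NULL, NULL); /* needs an IPI if CPU 1 isn't the queuing CPU */
}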
–
Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com
> From the software side of things, our URB completion DPC copies the data out
Why do you need your own explicit DPC? Just do this from the usual
IO_COMPLETION_ROUTINE called by the USB stack (from the HCI driver’s DpcForIsr).
With your current design, your DPCs are competing for a queue/CPU with the HCI
driver’s DpcForIsr DPCs, which is a very bad idea, I think - delaying DpcForIsr
can possibly cause timing issues with the HC, which will make the whole picture
a lot worse.
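A minimal sketch of that approach, assuming a WDM driver sending URB IRPs down the USB stack (routine and variable names here are hypothetical):

#include <ntddk.h>

/* Called by the USB stack at DISPATCH_LEVEL, in the context of the
   host controller driver's DpcForIsr - no private DPC needed. */
NTSTATUS UrbCompletion(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Context)
{
    UNREFERENCED_PARAMETER(DeviceObject);
    UNREFERENCED_PARAMETER(Context);

    if (NT_SUCCESS(Irp->IoStatus.Status))
    {
        /* copy the data out right here */
    }
    return STATUS_MORE_PROCESSING_REQUIRED; /* we reuse or free the IRP ourselves */
}

NTSTATUS SubmitUrbIrp(PDEVICE_OBJECT LowerDeviceObject, PIRP Irp, PVOID Context)
{
    IoSetCompletionRoutine(Irp, UrbCompletion, Context,
                           TRUE, TRUE, TRUE); /* invoke on success, error, cancel */
    return IoCallDriver(LowerDeviceObject, Irp);
}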
–
Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com
Michal Vodicka wrote:
Hardware is more and more powerful every day, and dual-core CPUs have already become standard. What about a quad-core CPU with one or more cores dedicated just for this purpose? I’m just speculating; I’m not sure if processor affinity can be set in a way that would allow completely separating networking from the rest of the OS.
This is not a processor issue. It’s an I/O bandwidth issue. Even a
4-processor system generally has a bottleneck at the south bridge.
Think about the numbers here. A continuous 10Gb stream is going to fill
most of a 16-lane PCI Express connection; an 8-lane link is not enough. It’s
way more than even a 66MHz 64-bit PCI bus could possibly do.
If your bridges are busy routing 10Gb of bus traffic, do they have any
bandwidth left for routing memory traffic?
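Rough back-of-the-envelope figures (my numbers, assuming first-generation
PCI Express, counting one direction only): 64-bit/66MHz PCI peaks at
64 bits x 66 MHz ≈ 4.2 Gb/s, shared by every device on the bus. A PCIe 1.x
lane carries 2.5 Gb/s raw, about 2 Gb/s after 8b/10b encoding, so x8 gives
roughly 16 Gb/s before TLP headers, descriptor fetches, and doorbell writes
are subtracted - not much headroom for a full-duplex 10Gb stream.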
–
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.
“Tim Roberts” wrote in message news:xxxxx@ntdev…
> Michal Vodicka wrote:
>> Hardware is more and more powerful every day, and dual-core CPUs have already become standard. What about a quad-core CPU
>> with one or more cores dedicated just for this purpose? I’m just speculating; I’m not sure if processor affinity can be set
>> in a way that would allow completely separating networking from the rest of the OS.
>>
>
> This is not a processor issue. It’s an I/O bandwidth issue. Even a 4-processor system generally has a bottleneck at the
> south bridge.
>
Well, in this specific case it may be an I/O bandwidth limitation, but
Michal’s idea looks attractive. If we can think of dedicating a CPU to
some application, why not dedicate a CPU to processor-intensive
kernel drivers?
One objection could be that CPU-intensive tasks don’t belong in the kernel,
but as we’ve seen, the border between kernel and user mode is eroding;
OTOH, there are CPU-hungry drivers (crypto etc.) that can’t go to UMDF.
If a driver could monopolize a CPU, it could behave very differently,
and the system as a whole could benefit more (fewer interrupts - polling
instead - fewer context switches, etc.).
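For illustration, a minimal sketch (assuming a WDM driver; all names here are
hypothetical) of a dedicated kernel thread pinned to one CPU that polls
instead of taking interrupts:

#include <ntddk.h>

KEVENT g_StopEvent; /* signaled at unload to stop the poller */

VOID PollLoop(PVOID Context)
{
    LARGE_INTEGER zero;
    UNREFERENCED_PARAMETER(Context);
    zero.QuadPart = 0;

    /* Pin this system thread to CPU 1; the affinity is a bit mask. */
    KeSetSystemAffinityThread((KAFFINITY)(1 << 1));

    /* Spin-poll until the stop event is signaled (zero timeout = poll). */
    while (KeWaitForSingleObject(&g_StopEvent, Executive, KernelMode,
                                 FALSE, &zero) == STATUS_TIMEOUT)
    {
        /* poll the device rings/registers here instead of taking interrupts */
    }
    PsTerminateSystemThread(STATUS_SUCCESS);
}

NTSTATUS StartPoller(VOID)
{
    HANDLE thread;
    NTSTATUS status;

    KeInitializeEvent(&g_StopEvent, NotificationEvent, FALSE);
    status = PsCreateSystemThread(&thread, THREAD_ALL_ACCESS, NULL,
                                  NULL, NULL, PollLoop, NULL);
    if (NT_SUCCESS(status))
        ZwClose(thread);
    return status;
}

Note this only pins the driver’s own thread; it doesn’t keep the scheduler
from running other threads on that CPU, which is exactly Michal’s open
question about truly separating a core from the rest of the OS.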
Regards,
–PA
Pavel A. wrote:
Well, in this specific case it may be an I/O bandwidth limitation, but
Michal’s idea looks attractive. If we can think of dedicating a CPU to
some application, why not dedicate a CPU to processor-intensive
kernel drivers?
One objection could be that CPU-intensive tasks don’t belong in the kernel,
but as we’ve seen, the border between kernel and user mode is eroding;
OTOH, there are CPU-hungry drivers (crypto etc.) that can’t go to UMDF.
If a driver could monopolize a CPU, it could behave very differently,
and the system as a whole could benefit more (fewer interrupts - polling
instead - fewer context switches, etc.).
This is quite true. Seymour Cray had this idea in 1960, and it
manifested itself in the “peripheral processors” that distinguished the
CDC 6000 and Cyber mainframes. The CPUs were dedicated to straight
number crunching (in his vision), while the 10 (or 20) PPs worried about
operating system overhead and I/O. The PPs ran at 1/10th the processor
clock (later 1/5th), but had total control and were uninterruptible. PP
programming was fun, although difficult to debug.
This worked great in an environment where memory (magnetic core, at the
time) was horribly slow, because you could let the PPs do the waiting
while the CPU crunched. As memory speeds caught up, the PPs became a
bottleneck in the memory path, and more and more of the operating system
migrated to the CPU.
The architecture you describe would be a great project for some master’s
candidate in computer science at a university. My hunch is that a
4-general-processor system would outperform a
3-general-plus-1-IO-processor system on the typical workload.
–
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.