32-bit driver much faster than 64-bit

I have written a UMDF driver for a USB device, using the ATL-based template provided by VS 2012, with the guts largely patterned on the OSR FX2 driver. The board it supports is a simple endpoint 1 in-and-out device (a microcontroller programmer). For some reason I haven’t been able to determine, the 32-bit version of the driver runs significantly faster (about 35%) than the 64-bit version. The 64-bit version is running on a high-performance modern laptop (i7-2860QM @ 2.5 GHz). The 32-bit version is running on a six-year-old, much slower machine (Core 2 Duo T8300 @ 2.4 GHz).

Any ideas what might be going on here, or how I can try to track down the issue?

– Bob Ammerman
RAm Systems

Hmm.

How are you timing this/what are you measuring?

Also, just to double-check the low-hanging fruit: you’re not accidentally
comparing debug vs. release?

mm

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Bob Ammerman
Sent: Sunday, September 15, 2013 5:15 PM
To: Windows System Software Devs Interest List
Subject: [ntdev] 32-bit driver much faster than 64-bit

I have written a UMDF driver for a USB device, using the ATL-based template
provided by VS 2012, with the guts largely patterned on the OSR FX2 driver.
The board it supports is a simple endpoint 1 in-and-out device (a
microcontroller programmer). For some reason I haven’t been able to
determine, the 32-bit version of the driver runs significantly faster (about
35%) than the 64-bit version. The 64-bit version is running on a high
performance modern laptop (i7-2860QM @2.5Ghz). The 32-bit version is running
on a six year old, much slower, machine (Core 2 Duo T8300 @2.4GHz).

Any ideas what might be going on here, or how I can try to track the issue?

– Bob Ammerman
RAm Systems


NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

Is it a USB 1.1 device?

If you install the 32-bit flavor of the OS on your laptop, do you still see the speed difference?

If you limit the RAM to 3 GB on your 64-bit SUT, do you still see the speed difference?

I am measuring the time it takes to program the target microcontroller as reported by the application.

I dragged out my trusty beagle and saw some interesting numbers (both for the same operation):

32-Bit:

IN 301077
OUT 312328
ACK 7407
NAK 605999
STALL 0
DATA 319747

64-bit:
IN 397509
OUT 397492
ACK 6388
NAK 788614
STALL 0
DATA 401073

I am not quite sure how to interpret these numbers, but it looks like the 64-bitter is so fast that it is resulting in extra packets on the USB bus?!?
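One rough way to compare the two captures is to compute the relative change in each packet type and the share of IN/OUT tokens that ended in a NAK. The sketch below is only illustrative arithmetic on the counts quoted above; the helper names are made up for this example, and treating NAK / (IN + OUT) as a polling indicator is an assumption, not something the analyzer reports.

```python
# Packet counts copied from the two captures quoted above.
counts_32 = {"IN": 301077, "OUT": 312328, "ACK": 7407,
             "NAK": 605999, "STALL": 0, "DATA": 319747}
counts_64 = {"IN": 397509, "OUT": 397492, "ACK": 6388,
             "NAK": 788614, "STALL": 0, "DATA": 401073}


def percent_change(before, after):
    """Change from `before` to `after` in percent, or None if `before` is 0."""
    if before == 0:
        return None
    return (after - before) / before * 100.0


def nak_fraction(counts):
    """Rough proxy (an assumption): NAK handshakes per IN/OUT token."""
    return counts["NAK"] / (counts["IN"] + counts["OUT"])


# Relative change of each packet type, 32-bit capture -> 64-bit capture.
changes = {name: percent_change(counts_32[name], counts_64[name])
           for name in counts_32}

for name, change in changes.items():
    label = "n/a" if change is None else f"{change:+.1f}%"
    print(f"{name:5s} {counts_32[name]:>7d} -> {counts_64[name]:>7d}  {label}")

print(f"NAK per token, 32-bit: {nak_fraction(counts_32):.3f}")
print(f"NAK per token, 64-bit: {nak_fraction(counts_64):.3f}")
```

If nearly every token is NAKed in both captures, the higher totals on the 64-bit machine may simply reflect a longer run with more polling rather than more useful traffic, but that reading would need to be checked against the capture timestamps.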

Not a debug vs. release issue, although you remind me that both drivers are debug right now. I’ll try recompiling as release and try them again.

– Bob Ammerman
RAm Systems

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of Martin O’Brien
Sent: Sunday, September 15, 2013 8:21 PM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] 32-bit driver much faster than 64-bit

Hmm.

How are you timing this/what are you measuring?

Also, just to double check the low hanging fruit - you’re not accidentally comparing debug v. release?

mm


-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of xxxxx@broadcom.com
Sent: Sunday, September 15, 2013 8:27 PM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] 32-bit driver much faster than 64-bit

> Is it a USB 1.1 device?

USB 2.0 full speed.

> If you install the 32-bit flavor of the OS on your laptop, do you still see the speed difference?

Can’t really do that too easily. The high-speed machine is my daily driver. I do have another, slower, 64-bit machine available to test on. I’ll try that later tonight.

> If you limit the RAM to 3 GB on your 64-bit SUT, do you still see the speed difference?

Dumb question: how do I do that?

As mentioned in my last message: I am going to run release drivers and see if that makes much difference. I don’t expect it will.

– Bob Ammerman
RAm Systems

What is the selected power profile on the laptop? Make sure you test with "High performance" or whatever doesn’t throttle the CPU and lets it run full blast.
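As a quick check before timing anything, the active power plan can be read with `powercfg /getactivescheme`. The following is a minimal sketch, assuming an English-language Windows where the command prints a line of the form `Power Scheme GUID: <guid>  (<name>)`; the helper names and the sample line are illustrative only.

```python
import re
import subprocess
import sys

# GUID in 8-4-4-4-12 form, followed by the plan name in parentheses.
_SCHEME_RE = re.compile(
    r"([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})\s*\((.*?)\)"
)


def parse_active_scheme(output):
    """Return (guid, name) parsed from powercfg output, or None if no match."""
    match = _SCHEME_RE.search(output)
    if match is None:
        return None
    return match.group(1).lower(), match.group(2).strip()


def active_power_scheme():
    """Ask Windows for the active power plan; returns None on other platforms."""
    if sys.platform != "win32":
        return None
    result = subprocess.run(
        ["powercfg", "/getactivescheme"],
        capture_output=True, text=True, check=True,
    )
    return parse_active_scheme(result.stdout)


# Sample line in the assumed output format (not captured from the thread).
sample = "Power Scheme GUID: 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c  (High performance)"
print(parse_active_scheme(sample))
print(active_power_scheme())
```

Running this on both test machines just before each timing run would at least rule out a mismatched power plan as the difference.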

Thx
d

Bent from my phone


From: Bob Ammerman
Sent: 9/15/2013 5:35 PM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] 32-bit driver much faster than 64-bit

I am measuring the time it takes to program the target microcontroller as reported by the application.

I dragged out my trusty beagle and saw some interesting numbers (both for the same operation):

32-Bit:

IN 301077
OUT 312328
ACK 7407
NAK 605999
STALL 0
DATA 319747

64-bit:
IN 397509
OUT 397492
ACK 6388
NAK 788614
STALL 0
DATA 401073

I am not quite sure how to interpret these numbers, but it looks like the 64-bitter is so fast that it is resulting in extra packets on the USB bus?!?

Not a debug vs. release issue, although you remind me that both drivers are debug right now. I’ll try recompiling as release and try them again.

– Bob Ammerman
RAm Systems


> I have written a UMDF driver for a USB device, using the ATL-based
> template provided by VS 2012, with the guts largely patterned on the OSR
> FX2 driver. The board it supports is a simple endpoint 1 in-and-out device
> (a microcontroller programmer). For some reason I haven’t been able to
> determine, the 32-bit version of the driver runs significantly faster
> (about 35%) than the 64-bit version. The 64-bit version is running on a
> high performance modern laptop (i7-2860QM @2.5Ghz). The 32-bit version is
> running on a six year old, much slower, machine (Core 2 Duo T8300
> @2.4GHz).

Not enough information here to draw any conclusions.

(a) what is your measurement method, and why do you trust it?

(b) is the device capable of generating 64-bit addresses on a 64-bit bus,
or is it restricted to 32bits?

(c) How much memory is on each machine?

(d) is the app a 32-bit app running on a 64-bit machine?

(e) are you certain Driver Verifier is off on both machines?

(f) [just for completeness, since this question was already asked] are
both compiled as debug or both compiled as release?

(g) are there any passive-level threads in your completion path?

(h) does the device have exclusive use of its interrupt level on both
machines, or does it share interrupts on one of them?

(i) does the 64-bit system have any device that runs at a higher DIRQL
than your device and therefore may be stealing cycles?

(j) does a lower-DIRQL device, by queueing DPCs, force the 64-bit system
into a priority-inversion condition?

(k) is there a lot of locking, and is lock contention higher on the faster
processor?

(l) is it something else I haven’t thought of, but might yet, that
represents a system difference between the two platforms? [note: when I
first wrote this point, it was (h); while writing the following text, and
fixing typos, I thought of a few more]

When you see major differences like this, you need to make sure you have
first eliminated all knowable artifacts that may impact the validity of
your measurements. You might be measuring not the performance of two
drivers so much as the performance of two systems, and there may be
artifacts that impact the behavior at the system level. My first reaction
to numbers like this is “what is the measurement tool, what is it
measuring, and why do you trust it?” and my next reaction is that it is
not a 32/64 bit issue, but a much more global hardware and/or software
issue that is causing the differences.
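To make the "why do you trust it?" question concrete, here is a minimal sketch of a repeatable measurement: run the same operation several times with a monotonic clock and look at the spread before comparing machines. `program_target` is a hypothetical stand-in for whatever actually drives the programmer; nothing here comes from the original application.

```python
import statistics
import time


def time_runs(operation, repeats=5):
    """Run operation() `repeats` times; return wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        operation()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of durations to a few numbers worth comparing."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "spread": max(durations) - min(durations),
    }


def program_target():
    """Hypothetical placeholder for one full programming cycle."""
    time.sleep(0.01)


summary = summarize(time_runs(program_target, repeats=5))
print({key: round(value, 4) for key, value in summary.items()})
```

If the spread on one machine is a large fraction of its median, a single timing from that machine probably says more about the system than about the driver.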

btw, when I have run some compute-intensive single-core apps on my 32-bit
and 64-bit systems, which happen to have the same clock speed, I get
t(64) ~= 0.75*t(32) in debug mode. With full optimizations on, I’ve found
a few where t(64) ~= 0.1*t(32), so what is being measured in that case is
not the quality of the execution engine or the memory system, but the
performance of the compiler (it’s only about 0.5*t(32) without /LTCG, for
example). I believe a lot of the 0.75 factor comes from larger caches and
TLB.

So when I hear of an inverse situation, I’m more inclined to credit it to
artifacts other than code.
joe

> Any ideas what might be going on here, or how I can try to track the
> issue?
>
> – Bob Ammerman
> RAm Systems



> I am measuring the time it takes to program the target microcontroller as
> reported by the application.
>
> I dragged out my trusty beagle and saw some interesting numbers (both for
> the same operation):
>
> 32-Bit:
>
> IN 301077
> OUT 312328
> ACK 7407
> NAK 605999
> STALL 0
> DATA 319747
>
> 64-bit:
> IN 397509
> OUT 397492
> ACK 6388
> NAK 788614
> STALL 0
> DATA 401073
>
> I am not quite sure how to interpret these numbers, but it looks like the
> 64-bitter is so fast that it is resulting in extra packets on the USB
> bus?!?

OHO! It is a USB device! So the only way to reliably measure this is to
make sure this device is the only one plugged into the root hub, or that
both systems have identical devices plugged into the root hub which are
behaving identically. You also need to make sure it is not an artifact of
the implementations of the USB stack by measuring the performance of other
USB devices; if you see comparable degradations, it is not your problem.

I haven’t counted, but I think this takes me up to (p) in my previous list.

You have still not produced numbers we can draw any conclusions from.
joe

> Not a debug vs. release issue, although you remind me that both drivers
> are debug right now. I’ll try recompiling as release and try them again.
>
> – Bob Ammerman
> RAm Systems


Ahhh…forgot that trick…

[a very old joke: young apprentice comes to martial arts master, asks to
study with him. Apprentice is 20, Master is 40. For 20 years, apprentice
is pummeled daily. Finally, the Master says to the Apprentice, “I have
taught you everything I know, go forth, and take on an apprentice of your
own.” The Apprentice thinks, “Aha! I know everything he knows. I can
counter every move, so after 20 years I can take him!”

Several hours later, the Apprentice wakes up. The master is looking down
at him and says, “Ahh…forgot that trick.”]
joe

> What is the selected power profile on the laptop? Make sure you test
> with "High performance" or whatever doesn’t throttle the CPU and lets it
> run full blast
>
> Thx
> d
>
> Bent from my phone



Checked: the device is the only device on the root hub on both machines.

I could try to investigate whether there is an issue with the USB stack which slows it down on the x64 machine, but I would expect that if there were performance issues with Win7-64 vs. Win7-32 we would have heard about it by now :-)

What numbers do you want to see?

I’d still like to hear from a USB bus-level guru about the differences in the packet counts on the bus between the two systems. That looks pretty fishy to me.

– Bob Ammerman
RAm Systems

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of xxxxx@flounder.com
Sent: Sunday, September 15, 2013 9:17 PM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] 32-bit driver much faster than 64-bit


OHO! It is a USB device! So the only way to reliably measure this is to make sure this device is the only one plugged into the root hub, or that both systems have identical devices plugged into the root hub which are behaving identically. You also need to make sure it is not an artifact of the implementations of the USB stack by measuring the performance of other USB devices; if you see comparable degradations, it is not your problem.

I haven’t counted, but I think this takes me up to (p) in my previous list.

You have still not produced numbers we can draw any conclusions from.
joe

Joe, thanks for your time. I have tried to answer your questions as best I can, although many of them just bring up more questions.

Quoting you: “So when I hear of an inverse situation, I’m more inclined to credit it to artifacts other than code.”

Yep! Just need to figure out what the artifacts are.

– Bob Ammerman
RAm Systems

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of xxxxx@flounder.com
Sent: Sunday, September 15, 2013 9:08 PM
To: Windows System Software Devs Interest List
Subject: Re: [ntdev] 32-bit driver much faster than 64-bit

> I have written a UMDF driver for a USB device, using the ATL-based
> template provided by VS 2012, with the guts largely patterned on the OSR
> FX2 driver. The board it supports is a simple endpoint 1 in-and-out
> device (a microcontroller programmer). For some reason I haven’t been
> able to determine, the 32-bit version of the driver runs significantly
> faster (about 35%) than the 64-bit version. The 64-bit version is
> running on a high performance modern laptop (i7-2860QM @2.5Ghz). The
> 32-bit version is running on a six year old, much slower, machine
> (Core 2 Duo T8300 @2.4GHz).

Not enough information here to draw any conclusions

(a) what is your measurement method, and why do you trust it?

Clock time as reported by the application program. It seems to match my perception of elapsed time :-)

(b) is the device capable of generating 64-bit addresses on a 64-bit bus, or is it restricted to 32bits?

That is in layers way below where I am, somewhere in the guts of the Windows USB stack and controllers. I don’t really know the answer to this.

(c) How much memory is on each machine?

Fast 64-bit machine: 16GB, slow 32-bit machine: 4GB

(d) is the app a 32-bit app running on a 64-bit machine?

Yes

(e) are you certain Driver Verifier is off on both machines?

Yes

(f) [just for completeness, since this question was already asked] are both compiled as debug or both compiled as release?

Yes: both are debug. I was able to get the release driver going on the 64-bit machine. There is no noticeable difference in times between debug and release. I am having an issue getting the release driver to install on the 32-bit machine. Windows keeps insisting that the driver is up to date when I try to update it using Device Manager. I suppose I’ll have to get down and do it manually.

(g) are there any passive-level threads in your completion path?

UMDF driver: the whole thing runs as passive-level, doesn’t it?

(h) does the device have exclusive use of its interrupt level on both machines, or does it share interrupts on one of them?

It looks like the USB controller is sharing its interrupt with other USB controllers on both machines. Either way, there isn’t much else going on on either machine. I did test with the beagle unplugged.

(i) does the 64-bit system have any device that runs at a higher DIRQL than your device and therefore may be stealing cycles?

I don’t know, and I don’t know how to find out. The machine isn’t doing much, though.

(j) does a lower-DIRQL device, by queueing DPCs, force the 64-bit system into a priority-inversion condition?

I don’t think so, but I can’t be sure. I suppose I could look at DPC counts in perfmon.

(k) is there a lot of locking, and is lock contention higher on the faster processor?

Not sure why that would happen. The machine is more or less idle.

(l) is it something else I haven’t thought of, but might yet, that represents a system difference between the two platforms? [note: when I first wrote this point, it was (h); while writing the following text, and fixing typos, I thought of a few more]

When you see major differences like this, you need to make sure you have first eliminated all knowable artifacts that may impact the validity of your measurements. You might not be measuring the performance of two drivers as the performance of two systems, and there may be artifacts that impact the behavior at the system level. My first reaction to numbers like this is “what is the measurement tool, what is it measuring, and why do you trust it?” and my next reaction is that it is not a 32/64 bit issue, but a much more global hardware and/or software issue that is causing the differences.

btw, when I have run some compute-intensive single-core apps on my 32-bit and 64-bit systems, which happen to have the same clock speed, then for debug mode, t(64) ~= 0.75*t(32), but I’ve found a few that, for full optimizations on, t(64) ~= 0.1*t(32), so what is being measured in the second case is not the quality of the execution engine or the memory system, but the performance of the compiler (it’s only about 0.5*t(32) without /LTCG, for example). I believe a lot of the 0.75 factor comes from larger caches and TLB.

So when I hear of an inverse situation, I’m more inclined to credit it to artifacts other than code.
joe

Any ideas what might be going on here, or how I can try to track the
issue?

– Bob Ammerman
RAm Systems



Further input…

I have now tested on three systems:

(1) my fast Win7-x64 development laptop: 7.7 seconds

(2) my old Win7-x86 laptop: 5.2 seconds

(3) a rather slow Win7-x64 desktop box: 4.4 seconds
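For reference, the three timings can be put on a common scale by dividing each by the fastest one. This is only a sketch of the arithmetic on the numbers listed above; the dictionary keys are shorthand labels chosen for this example.

```python
# Elapsed times in seconds, as listed above.
timings = {
    "fast Win7-x64 laptop": 7.7,
    "old Win7-x86 laptop": 5.2,
    "slow Win7-x64 desktop": 4.4,
}

fastest = min(timings.values())

# Ratio of each machine's time to the fastest machine's time.
relative = {name: seconds / fastest for name, seconds in timings.items()}

for name, ratio in sorted(relative.items(), key=lambda item: item[1]):
    print(f"{name:24s} {timings[name]:4.1f} s  x{ratio:.2f}")
```

Seen this way, both the quickest and the slowest results come from x64 systems, which fits the suspicion that the laptop is an outlier rather than evidence of a 32-bit versus 64-bit effect.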

This is really driving me crazy. It looks like (2) and (3) make sense. I just have to deal with the outlier (1). It looks like I must have some kind of debugging/tracing slowing things down there.

What would that be likely to be?

– Bob Ammerman
RAm Systems

-----Original Message-----
From: /o=Copeland Data Systems/ou=CDS/cn=Recipients/cn=rammerman
Sent: Sunday, September 15, 2013 9:56 PM
To: ‘Windows System Software Devs Interest List’
Subject: RE: [ntdev] 32-bit driver much faster than 64-bit

Checked: Device is only device on root hub on both machines

I could try to investigate whether there is an issue with the USB stack which slows it down on the x64 machine, but I would expect that if there were performance issues with Win7-64 vs. Win7-32 we would have heard about it by now :-)

What numbers do you want to see?

I’d still like to hear from a USB bus-level guru about the differences in the packet counts on the bus between the two systems. That looks pretty fishy to me.

– Bob Ammerman
RAm Systems


-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of Bob Ammerman
Sent: Sunday, September 15, 2013 9:56 PM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] 32-bit driver much faster than 64-bit

Joe, thanks for your time. I have tried to answer your questions as best I can, although many of them just bring up more questions.

Quoting you: “So when I hear of an inverse situation, I’m more inclined to credit it to artifacts other than code.”

Yep! Just need to figure out what the artifacts are.

– Bob Ammerman
RAm Systems

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of xxxxx@flounder.com
Sent: Sunday, September 15, 2013 9:08 PM
To: Windows System Software Devs Interest List
Subject: Re: [ntdev] 32-bit driver much faster than 64-bit

I have written a UMDF driver for a USB device, using the ATL-based
template provided by VS 2012, with the guts largely patterned on the
OSR FX2 driver. The board it supports is a simple endpoint 1 in-and-out
device (a microcontroller programmer). For some reason I haven’t been
able to determine, the 32-bit version of the driver runs significantly
faster (about 35%) than the 64-bit version. The 64-bit version is
running on a high performance modern laptop (i7-2860QM @2.5Ghz). The
32-bit version is running on a six year old, much slower, machine
(Core 2 Duo T8300 @2.4GHz).

Not enough information here to draw any conclusions

(a) what is your measurement method, and why do you trust it?

Clock time as reported by application program. It seems to match my perception of elapsed time :-)

(b) is the device capable of generating 64-bit addresses on a 64-bit bus, or is it restricted to 32bits?

That is in layers way below where I am, somewhere in the guts of the Windows USB stack and controllers. I don’t really know the answer to this.

(c) How much memory is on each machine?

Fast 64-bit machine: 16GB, slow 32-bit machine: 4GB

(d) is the app a 32-bit app running on a 64-bit machine?

Yes

(e) are you certain Driver Verifier is off on both machines?

Yes

(f) [just for completeness, since this question was already asked] are both compiled as debug or both compiled as release?

Yes: both are debug. I was able to get the release driver going on the 64-bit machine. There is no noticeable difference in times between debug and release. I am having an issue getting the release driver to install on the 32-bit machine. Windows keeps insisting that the driver is up-to-date when trying to update it using device manager. I suppose I’ll have to get down and do it manually.

(g) are there any passive-level threads in your completion path?

UMDF driver: the whole thing runs as passive-level, doesn’t it?

(h) does the device have exclusive use of its interrupt level on both machines, or does it share interrupts on one of them?

It looks like the USB controller is sharing its interrupt with other USB controllers on both machines. In any case, there isn’t much else going on in either case. I did test with the beagle unplugged.

(i) does the 64-bit system have any device that runs at a higher DIRQL than your device and therefore may be stealing cycles?

I don’t know, and I don’t know how to find out. The machine isn’t doing much, though.

(j) does a lower-DIRQL device, by queueing DPCs, force the 64-bit system into a priority-inversion condition?

I don’t think so, but I can’t be sure. I suppose I could look at DPC counts in perfmon.

(k) is there a lot of locking, and is lock contention higher on the faster processor?

Not sure why that would happen. The machine is more or less idle.

(l) is it something else I haven’t thought of, but might yet, that represents a system difference between the two platforms? [note: when I first wrote this point, it was (h); while writing the following text, and fixing typos, I thought of a few more]

When you see major differences like this, you need to make sure you have first eliminated all knowable artifacts that may impact the validity of your measurements. You might not be measuring the performance of two drivers so much as the performance of two systems, and there may be artifacts that impact the behavior at the system level. My first reaction to numbers like this is “what is the measurement tool, what is it measuring, and why do you trust it?” and my next reaction is that it is not a 32/64 bit issue, but a much more global hardware and/or software issue that is causing the differences.

btw, when I have run some compute-intensive single-core apps on my 32-bit and 64-bit systems, which happen to have the same clock speed, then for debug mode, t(64) ~= 0.75*t(32), but I’ve found a few that, for full optimizations on, t(64) ~= 0.1*t(32), so what is being measured in the second case is not the quality of the execution engine or the memory system, but the performance of the compiler (it’s only about 0.5*t(32) without /LTCG, for example). I believe a lot of the 0.75 factor comes from larger caches and TLB.

So when I hear of an inverse situation, I’m more inclined to credit it to artifacts other than code.
joe

Any ideas what might be going on here, or how I can try to track the
issue?

– Bob Ammerman
RAm Systems



> Joe, thanks for your time. I have tried to answer your questions as best I
> can, although many of them just bring up more questions.

Quoting you: “So when I hear of an inverse situation, I’m more inclined to
credit it to artifacts other than code.”

Yep! Just need to figure out what the artifacts are.

– Bob Ammerman
RAm Systems

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of
xxxxx@flounder.com
Sent: Sunday, September 15, 2013 9:08 PM
To: Windows System Software Devs Interest List
Subject: Re: [ntdev] 32-bit driver much faster than 64-bit

> I have written a UMDF driver for a USB device, using the ATL-based
> template provided by VS 2012, with the guts largely patterned on the
> OSR
> FX2 driver. The board it supports is a simple endpoint 1 in-and-out
> device (a microcontroller programmer). For some reason I haven’t been
> able to determine, the 32-bit version of the driver runs significantly
> faster (about 35%) than the 64-bit version. The 64-bit version is
> running on a high performance modern laptop (i7-2860QM @2.5Ghz). The
> 32-bit version is running on a six year old, much slower, machine
> (Core 2 Duo T8300 @2.4GHz).
>
Not enough information here to draw any conclusions

(a) what is your measurement method, and why do you trust it?

Clock time as reported by application program. It seems to match my
perception of elapsed time :-)

Note that you have to look at the granularity of clock time. The clock
only has 15ms resolution. If you are measuring total duration by
comparing two clock reads, and you see 35%, it is very likely correct. If
you are summing up a set of shorter durations to arrive at 35%, remember
that there would be some “error creep” in each iteration, of up to 15ms,
so the number is suspect. (Note, I didn’t say “wrong”.)
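The difference between the two measurement styles can be sketched with a small simulation. This is only an illustration: the 15 ms tick is the figure quoted above, and the workload (300 operations of 4 ms, one every 10 ms) and all function names are made up for the example.

```python
TICK_MS = 15  # assumed clock granularity, taken from the figure quoted above


def coarse_clock(t_ms, tick_ms=TICK_MS):
    """Return the last tick boundary at or before t_ms, as a coarse clock would."""
    return (t_ms // tick_ms) * tick_ms


def measure_end_to_end(starts, duration_ms, tick_ms=TICK_MS):
    """One clock read before the first operation and one after the last."""
    first = coarse_clock(starts[0], tick_ms)
    last = coarse_clock(starts[-1] + duration_ms, tick_ms)
    return last - first


def measure_summed(starts, duration_ms, tick_ms=TICK_MS):
    """Two clock reads around every operation, then add up the differences."""
    return sum(
        coarse_clock(s + duration_ms, tick_ms) - coarse_clock(s, tick_ms)
        for s in starts
    )


# Hypothetical workload: 300 operations of 4 ms each, one starting every 10 ms.
starts = [i * 10 for i in range(300)]
true_span_ms = starts[-1] + 4 - starts[0]
true_busy_ms = 4 * len(starts)

print("true span      (ms):", true_span_ms)
print("end-to-end read (ms):", measure_end_to_end(starts, 4))
print("true busy time (ms):", true_busy_ms)
print("summed reads   (ms):", measure_summed(starts, 4))
```

The end-to-end reading can only be off by less than one tick in total, while the summed reading can be off by up to a tick per operation. With this particular spacing every operation happens to fall inside a single tick, so the summed reading badly undercounts; other spacings would overcount instead.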

(b) is the device capable of generating 64-bit addresses on a 64-bit bus,
or is it restricted to 32bits?

That is in layers way below where I am, somewhere in the guts of the
Windows USB stack and controllers. I don’t really know the answer to this.

The device in this case is the USB controller: can it generate 64-bit
addresses? If the controller can only generate 32-bit addresses, then
there is a distinct possibility that there will be a lot of copying of
data from the actual buffers (which may lie above the 32-bit address
boundary) to internal buffers that are reserved below the 4GB limit. This
can add some overhead.

(c) How much memory is on each machine?

Fast 64-bit machine: 16GB, slow 32-bit machine: 4GB

Which increases the probability that the user buffers will be above 4GB in
physical memory; see the previous query.

(d) is the app a 32-bit app running on a 64-bit machine?

Yes

There is some additional overhead in Wow64. I don’t know what it might
be. I don’t think it could account for 35%, but it might contribute to
it.

(e) are you certain Driver Verifier is off on both machines?

Yes

(f) [just for completeness, since this question was already asked] are
both compiled as debug or both compiled as release?

Yes: both are debug. I was able to get the release driver going on the
64-bit machine. There is no noticeable difference in times between debug
and release. I am having an issue getting the release driver to install on
the 32-bit machine. Windows keeps insisting that the driver is up-to-date
when trying to update it using device manager. I suppose I’ll have to get
down and do it manually.

If there is no noticeable difference between debug and release, you can
eliminate this as an issue. It also suggests that the problems in
performance are not in your code at all.

(g) are there any passive-level threads in your completion path?

UMDF driver: the whole thing runs as passive-level, doesn’t it?

Which versions of Windows are you using? Note that newer versions have
changed how thread preemption and restoration are handled, especially with
respect to the timer ticks (see the Windows Internals book for details).
Note also that the performance glitches of the older versions turn out to
look like the summing-fine-intervals-over-coarse-intervals summation
errors (even a microsecond too late can cause your thread to be delayed by
an entire timer tick).

Note that as soon as you end up in UMDF, you are a potential victim of all
other threads in the machine; hence your measurements may be including all
kinds of other threads. You’d have to eliminate all other competition for
your UMDF thread, including other kernel threads, from your measurements.
This is not easy, by the way, but it is something to consider.

(h) does the device have exclusive use of its interrupt level on both
machines, or does it share interrupts on one of them?

It looks like the USB controller is sharing its interrupt with other USB
controllers on both machines. In any case, there isn’t much else going on
in either case. I did test with the beagle unplugged.

OK.

(i) does the 64-bit system have any device that runs at a higher DIRQL
than your device and therefore may be stealing cycles?

I don’t know, and I don’t know how to find out. The machine isn’t doing
much, though.

You can find this out in a variety of ways, including the device
manager. It’s tedious, though.

(j) does a lower-DIRQL device, by queueing DPCs, force the 64-bit system
into a priority-inversion condition?

I don’t think so, but I can’t be sure. I suppose I could look at DPC
counts in perfmon.

Well, it would be a number. Its significance could not be determined,
though, because it wouldn’t indicate if any of those DPCs added latency to
your device handling.

(k) is there a lot of locking, and is lock contention higher on the faster
processor?

Not sure why that would happen. The machine is more or less idle.

“Idle” has many meanings. For example, just because it isn’t dispatching
user threads in your login session doesn’t mean that no other threads are
running. You’ve got all kinds of background tasks, system services,
kernel threads (such as file system threads) and goodness knows how many
others. Factoring these out of your numbers can be difficult.

(l) is it something else I haven’t thought of, but might yet, that
represents a system difference between the two platforms? [note: when I
first wrote this point, it was (h); while writing the following text, and
fixing typos, I thought of a few more]

Well, you’ve eliminated several possibilities, but introduced new sets of
issues. Sorting this out is not always easy. Generally, one technique is
to run many experiments and look at the statistical distribution so you
have some measurable confidence in the numbers (mean/standard deviation).
This will not tell you where the 35% is, but if sometimes you get 10% and
sometimes you get 35%, it tells you there is something unaccountable going
on, which has a lot of variability.
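As a rough sketch of the mean/standard deviation bookkeeping, the following keeps only three running totals (count, sum, sum of squares) rather than every sample. The class name and the run times fed into it are hypothetical placeholders, not measurements from this thread.

```python
import math


class RunningStats:
    """Mean and sample standard deviation from three running totals."""

    def __init__(self):
        self.n = 0         # number of samples seen
        self.total = 0     # sum of samples
        self.total_sq = 0  # sum of squared samples

    def add(self, x):
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def mean(self):
        return self.total / self.n

    def stdev(self):
        """Sample standard deviation (n - 1 in the denominator)."""
        if self.n < 2:
            return 0.0
        variance = (self.total_sq - self.total * self.total / self.n) / (self.n - 1)
        return math.sqrt(max(variance, 0.0))  # guard against tiny negative rounding


# Hypothetical elapsed times in milliseconds for five runs of the same job.
runs_ms = [5200, 5150, 5300, 5250, 5100]

stats = RunningStats()
for t in runs_ms:
    stats.add(t)

print("mean  (ms):", stats.mean())
print("stdev (ms):", round(stats.stdev(), 1))
print("stdev/mean:", round(stats.stdev() / stats.mean(), 4))
```

Feeding integer milliseconds (or microseconds) keeps the totals exact. A large stdev/mean ratio across repeated runs is the sign that something variable is in the measurement.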

You also need to characterize the app. For example, if the app is
shipping N requests, fixed, then you should see the same number of packets
going out and coming back. If you see more packets, then it sounds like
your app is doing more. You also need to determine how much data is being
shipped on the USB channel. You should be sending the same number of
bits, and seeing the same number of bits in response. If this is
different between the two, then you are conducting two different
experiments that superficially resemble each other, and you are trying to
compare them as if they are the same experiment.

The trick in performance measurement is to either eliminate all other
potential perturbations, or account for them. You hold constant what you
can, and then perturb the experiment on your own to see if your
perturbations make expected changes. For example, does running the
program for 2N requests run twice as long (including the 35% difference)?
If the twice-as-many experiment doesn’t run twice-as-long, you have an
interesting result.

Note also that cache differences, TLB differences, and other such
considerations have the potential for making differences; even if the
caches and TLBs on the 64-bit are bigger, they also respond to different
patterns of access than the 32-bit version. So there might be an issue
here, I don’t know.

Overall, never think that performance measurement is an easy task. I
spent a lot of years doing it, and while I could often make statements
about one of the parameters of the performance, it is hard to make
definitive statements about overall evaluation of complex systems. For
example, “small code” used to equate with “fast code”, but in modern
systems it is not at all clear that the minimalist approach to code
generation produces the best code. Concepts such as register lifetimes,
when there is a pool of invisible registers lurking behind the
programmer-visible names like “EAX” or “RAX”, mean that certain kinds of
analysis for optimization have to come up with different answers. Code
sequences that “break the pipe” are no longer best-case scenarios.

When you put multiple layers between you and reality, layers whose
behavior you can neither understand nor control, you have to either
eliminate them from the data, or account for them. And you may find your
performance issues lie with those lower layers. In the 15 years I did
performance measurement, I learned a lot about deceptive practices of
operating systems, in terms of not telling you what they are really
doing. When your measured data doesn’t match what you expect, you have to
figure out why. This can take weeks.
joe
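One way to read the N versus 2N experiment described above is to assume the elapsed time is a fixed overhead plus a constant cost per request, and solve for both from the two runs. That linear model, the function name and the sample times below are all assumptions for illustration only.

```python
def split_fixed_and_per_request(n, t_n, t_2n):
    """Fit t = fixed + per_request * requests through (n, t_n) and (2n, t_2n).

    Returns (fixed, per_request). A large 'fixed' term means the run time
    does not simply double when the request count doubles.
    """
    per_request = (t_2n - t_n) / n
    fixed = t_n - per_request * n  # algebraically the same as 2 * t_n - t_2n
    return fixed, per_request


# Hypothetical timings: 1000 requests take 5.0 s, 2000 requests take 9.0 s.
fixed_s, per_request_s = split_fixed_and_per_request(1000, 5.0, 9.0)
print("fixed overhead (s):", round(fixed_s, 3))
print("per request   (ms):", round(per_request_s * 1000, 3))
```

If the fixed term differs a lot between the two machines while the per-request cost is similar, the difference is more likely in setup or teardown than in the steady-state transfer path, and the reverse suggests the per-transfer path.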


> Further input…

I have now tested on three systems:

(1) my fast Win7-x64 development laptop: 7.7 seconds

(2) my old Win7-x86 laptop: 5.2 seconds

(3) a rather slow Win7-x64 desktop box: 4.4 seconds

This is really driving me crazy. It looks like (2) and (3) make sense. I
just have to deal with the outlier (1). It looks like I must have some
kind of debugging/tracing slowing things down there.

What would that be likely to be?

You have to measure code in the absence of any perturbations. Having
DbgPrint statements would be a significant perturbation that you would
have to eliminate.

Ultimately, the definitive performance measurements involve doing
instruction traces using passive bus monitors. For multicore systems,
this gets to be nightmarish. I know people who now do this, and their
“trade secrets” are the nature of their code that analyzes these traces.
It isn’t easy.
joe

– Bob Ammerman
RAm Systems

-----Original Message-----
From: /o=Copeland Data Systems/ou=CDS/cn=Recipients/cn=rammerman
Sent: Sunday, September 15, 2013 9:56 PM
To: ‘Windows System Software Devs Interest List’
Subject: RE: [ntdev] 32-bit driver much faster than 64-bit

Checked: Device is only device on root hub on both machines

I could try to investigate whether there is an issue with the USB stack
which slows it down on the x64 machine, but I would expect that if there
were performance issues with Win7-64 vs. Win7-32 we would have heard about
it by now :-)

What numbers do you want to see?

I’d still like to hear from a USB bus-level guru about the differences in
the packet counts on the bus between the two systems. That looks pretty
fishy to me.

– Bob Ammerman
RAm Systems


Bob Ammerman wrote:

32-Bit:

IN 301077
OUT 312328
ACK 7407
NAK 605999
STALL 0
DATA 319747

64-bit:
IN 397509
OUT 397492
ACK 6388
NAK 788614
STALL 0
DATA 401073

I am not quite sure how to interpret these numbers, but it looks like the 64-bitter is so fast that it is resulting in extra packets on the USB bus?!?

No, that doesn’t make sense. USB is all scheduled, and controlled by a
hardware timer independent of the processor. Also remember that your
two processors have essentially the same clock speed; the laptop isn’t
going to be that much faster at the assembly level.

Are you quite sure both of these were performing the exact same
operation? If you are sending and receiving a fixed-size buffer, then
the number of IN and OUT packets should be absolutely identical. How
much data are you sending, exactly? Is it a bulk pipe? Are you sending
one large buffer, or many small buffers?


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.
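To make the fixed-size-buffer argument concrete, here is a small sketch of how many data packets a transfer of a given size should need. The endpoint's maximum packet size is not stated in this thread, so the 64-byte value used below is an assumption (a common size for a full-speed bulk endpoint), and the function name and the zero-length-terminator option are illustrative.

```python
def expected_data_packets(transfer_bytes, max_packet_size, zero_length_terminator=False):
    """Number of data packets needed to move transfer_bytes successfully.

    Counts only packets that carry the transfer, not NAKed retries.
    zero_length_terminator models protocols that end an exact-multiple
    transfer with an extra empty packet.
    """
    if transfer_bytes < 0 or max_packet_size <= 0:
        raise ValueError("sizes must be non-negative and packet size positive")
    if transfer_bytes == 0:
        return 1  # a zero-length transfer is still one (empty) packet
    full, remainder = divmod(transfer_bytes, max_packet_size)
    packets = full + (1 if remainder else 0)
    if remainder == 0 and zero_length_terminator:
        packets += 1
    return packets


# Assumed 64-byte maximum packet size; the sizes are arbitrary examples.
for size in (64, 100, 4096, 4097):
    print(size, "bytes ->", expected_data_packets(size, 64), "packets")
```

For the same buffers and the same endpoint this number does not depend on the host, which is why differing IN and OUT counts suggest either a different workload or a different number of retried (NAKed) attempts.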

since these numbers were presented without explanation, I wasn’t sure what
to make of them

32-bit: 1,546,558 events
64-bit: 1,991,076 events

Ratio: 1.28:1

You say the 64-bit is 35% slower. But it is doing 28% more work. Why?
joe
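The totals and the per-packet-type ratios can be recomputed directly from the counts Bob posted. The sketch below uses only those numbers; the variable and function names are illustrative.

```python
# Packet counts as posted earlier in the thread (bus analyzer capture).
CAPTURE_32 = {"IN": 301077, "OUT": 312328, "ACK": 7407,
              "NAK": 605999, "STALL": 0, "DATA": 319747}
CAPTURE_64 = {"IN": 397509, "OUT": 397492, "ACK": 6388,
              "NAK": 788614, "STALL": 0, "DATA": 401073}


def total_events(capture):
    """Sum of all packet counts in one capture."""
    return sum(capture.values())


def per_type_ratio(base, other):
    """Ratio other/base for each packet type; None where base is zero."""
    return {k: (other[k] / base[k] if base[k] else None) for k in base}


total_32 = total_events(CAPTURE_32)
total_64 = total_events(CAPTURE_64)
print("32-bit total:", total_32)
print("64-bit total:", total_64)
print("overall ratio: %.3f" % (total_64 / total_32))

for kind, ratio in per_type_ratio(CAPTURE_32, CAPTURE_64).items():
    print(kind, "n/a" if ratio is None else "%.3f" % ratio)
```

Breaking the ratio down by type shows whether the extra traffic is spread evenly or concentrated in a particular packet type such as IN tokens and NAKs.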


>since these numbers were presented without explanation, I wasn’t sure what to make of them

32-bit: 1,546,558 events
64-bit: 1,991,076 events

Ratio: 1.28:1

You say the 64-bit is 35% slower. But it is doing 28% more work. Why?
joe

Joe,

You ask, “Why?”. That is indeed the question.

You probably missed the earlier messages on the thread. The statistics I presented were the result of testing a UMDF USB driver I wrote. It was much faster (35%) on an old 32-bit machine than on a new high-power 64-bit machine. I was looking for an explanation of what might be happening. More importantly, I was looking for a solution to make my customer happy.

The issue became semi-moot when I tested against another 64-bit Win7 system and got somewhat better performance than the 32-bitter. It is still a mystery why the first machine, which is my daily driver and development machine, is so slow. I am now wondering if some sort of tracing or validation is going on. I commented out all TraceEvent calls in the source code, but that didn’t seem to matter one bit.

Bob Ammerman
– RAm Systems

Bob Ammerman wrote:
> 32-Bit:
>
> IN 301077
> OUT 312328
> ACK 7407
> NAK 605999
> STALL 0
> DATA 319747
>
> 64-bit:
> IN 397509
> OUT 397492
> ACK 6388
> NAK 788614
> STALL 0
> DATA 401073
>
> I am not quite sure how to interpret these numbers, but it looks like
> the 64-bitter is so fast that it is resulting in extra packets on the
> USB bus?!?

No, that doesn’t make sense. USB is all scheduled, and controlled by
a hardware timer independent of the processor. Also remember that
your two processors have essentially the same clock speed; the laptop
isn’t going to be that much faster at the assembly level.

Are you quite sure both of these were performing the exact same
operation? If you are sending and receiving a fixed-size buffer, then
the number of IN and OUT packets should be absolutely identical. How
much data are you sending, exactly? Is it a bulk pipe? Are you
sending one large buffer, or many small buffers?


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.



The app is doing the same thing, but more packets are sent on the USB
channel on the 64-bit machine…so the gating function on performance is
probably wait time for the USB transfers. Questions and approaches I would
consider:

Maintain counters in your driver of how many of each kind of transaction occur.
Unlike numbers that come from USB sniffers (which give you another
metric), these numbers would tell you what your driver is doing. If it is
doing the same thing for both architectures, you know the problem is in
the lower levels.

Use a “hidden IOCTL” to retrieve these numbers.
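
The counters-plus-snapshot idea can be sketched in portable C11 as follows. This is only an illustration under stated assumptions: the structure, field, and function names are hypothetical, and the private IOCTL handler that would call drv_counters_read is not shown. In an actual Windows driver the increments would more likely use something like InterlockedIncrement64 rather than <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-device event counters; the field names are illustrative. */
typedef struct {
    atomic_uint_fast64_t reads_submitted;   /* read requests sent down      */
    atomic_uint_fast64_t writes_submitted;  /* write requests sent down     */
    atomic_uint_fast64_t completions;       /* completion callbacks entered */
    atomic_uint_fast64_t resubmits;         /* requests sent down again     */
} drv_counters;

/* Plain copy that a (hypothetical) private IOCTL would return to the caller. */
typedef struct {
    uint64_t reads_submitted;
    uint64_t writes_submitted;
    uint64_t completions;
    uint64_t resubmits;
} drv_counters_snapshot;

/* Zero all counters; call once before the device starts doing I/O. */
void drv_counters_reset(drv_counters *c)
{
    atomic_init(&c->reads_submitted, 0);
    atomic_init(&c->writes_submitted, 0);
    atomic_init(&c->completions, 0);
    atomic_init(&c->resubmits, 0);
}

/* Bump one counter; safe to call from any thread. */
void drv_count(atomic_uint_fast64_t *counter)
{
    atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
}

/* Copy the current values out. Each field is read atomically, but the
   snapshot as a whole is not a single atomic operation. */
void drv_counters_read(drv_counters *c, drv_counters_snapshot *out)
{
    out->reads_submitted  = atomic_load_explicit(&c->reads_submitted,  memory_order_relaxed);
    out->writes_submitted = atomic_load_explicit(&c->writes_submitted, memory_order_relaxed);
    out->completions      = atomic_load_explicit(&c->completions,      memory_order_relaxed);
    out->resubmits        = atomic_load_explicit(&c->resubmits,        memory_order_relaxed);
}
```

Running the same test on both machines and comparing the two snapshots would show whether the driver itself is issuing the same number of requests in each case.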

If you have a completion routine, particularly one that re-transmits the
IRP back down, count how many times it is activated. There may be more
splitting than you think.

For your completion routine, you probably pass a data structure to the
completion routine. Add a 64-bit timestamp to it. On the completion of the
IRP, compute the delta-T spent in the lower levels (note that depending on
intercore clock skews, these values may have errors, and it might be
worthwhile to do mean-and-standard-deviation analysis). You only need
three scalar variables to do this. Integer computations using fixed-point
64-bit integers will suffice; this avoids the need for floating-point in
the kernel.
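
The three-scalar approach might look like the sketch below: a sample count, a running sum, and a running sum of squares, all 64-bit integers. The names are hypothetical, and delta is assumed to be a tick difference from whatever timer the driver uses (for example, two QueryPerformanceCounter readings). It works in whole ticks; for fractional precision the values could be scaled by a fixed constant first. It also assumes the deltas are small enough that the squared terms fit in 64 bits.

```c
#include <stdint.h>

/* The three scalars: sample count, sum, and sum of squares. */
typedef struct {
    uint64_t n;
    uint64_t sum;
    uint64_t sum_sq;
} dt_stats;

/* Record one delta-T sample (in timer ticks). */
void dt_add(dt_stats *s, uint64_t delta)
{
    s->n      += 1;
    s->sum    += delta;
    s->sum_sq += delta * delta;   /* assumes delta*delta does not overflow */
}

/* Integer mean, truncated toward zero. */
uint64_t dt_mean(const dt_stats *s)
{
    return s->n ? s->sum / s->n : 0;
}

/* Population variance: (n*sum_sq - sum^2) / n^2, using integers only. */
uint64_t dt_variance(const dt_stats *s)
{
    if (s->n == 0)
        return 0;
    uint64_t num = s->n * s->sum_sq - s->sum * s->sum;  /* never negative */
    return num / (s->n * s->n);
}

/* Integer square root (floor), bit-by-bit; no floating point needed. */
uint64_t isqrt64(uint64_t x)
{
    uint64_t r = 0;
    uint64_t bit = (uint64_t)1 << 62;
    while (bit > x)
        bit >>= 2;
    while (bit != 0) {
        if (x >= r + bit) {
            x -= r + bit;
            r = (r >> 1) + bit;
        } else {
            r >>= 1;
        }
        bit >>= 2;
    }
    return r;
}

/* Standard deviation in ticks, truncated. */
uint64_t dt_stddev(const dt_stats *s)
{
    return isqrt64(dt_variance(s));
}
```

A completion routine would call dt_add with (completion timestamp minus submit timestamp), and the mean and standard deviation could be reported alongside the other counters.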

If it is in the lower levels (USBD and friends) you can blame Microsoft,
unless you have done something that causes the lower layers to work harder.

That’s what I would do if it were my driver. Forgive the typos…I’m
using my iPad under less-than-ideal conditions (the sun is shining in my
eyes, which it wasn’t doing when I sat down) and a horrible bug in Safari
makes it impossible to correct more than one typo, after which I have to
do several things to get it to respond again. Painful. The most common
error is that I seem to keep missing the “h” key when traversing from the “t”
key.
joe


The app is doing the same thing, but more packets are sent on the USB channel on the 64-bit machine…so the gating function on performance is probably wait time for the USB transfers. Questions and approaches I would consider:

Maintain counters in your driver of how many of each kind of transaction occur. Unlike numbers that come from USB sniffers (which give you another metric), these numbers would tell you what your driver is doing. If it is doing the same thing for both architectures, you know the problem is in the lower levels.

** The app is issuing the same number of reads and writes as seen at the app level. Each of these turns into a single I/O at the driver level (for certain), and becomes a single USB “transaction” (I am pretty sure). The difference on the USB bus seems to be more NAKs and retries. I am wondering if that could happen due to a faster host.

Use a “hidden IOCTL” to retrieve these numbers.

** Good idea. I can use that to verify my assumptions in the paragraph above.

If you have a completion routine, particularly one that re-transmits the IRP back down, count how many times it is activated. There may be more splitting than you think.

** There shouldn’t be any splitting at all, but I can check.

For your completion routine, you probably pass a data structure to the completion routine. Add a 64-bit timestamp to it. On the completion of the IRP, compute the delta-T spent in the lower levels (note that depending on intercore clock skews, these values may have errors, and it might be worthwhile to do mean-and-standard-deviation analysis). You only need three scalar variables to do this. Integer computations using fixed-point 64-bit integers will suffice; this avoids the need for floating-point in the kernel.

If it is in the lower levels (USBD and friends) you can blame Microsoft, unless you have done something that causes the lower layers to work harder.

** I don’t know if you saw my message where I said it wasn’t an issue on a second Win7-64 machine. Stranger and stranger.

That’s what I would do if it were my driver. Forgive the typos…I’m using my iPad under less-than-ideal conditions (the sun is shining in my eyes, which it wasn’t doing when I sat down) and a horrible bug in Safari makes it impossible to correct more than one typo, after which I have to do several things to get it to respond again. Painful. The most common error is that I seem to keep missing the “h” key when traversing from the “t” key.
Joe

** Again, thanks for all your time on this. I really am trying to verify the things you mention, but right now most of my efforts are directed to fighting fires in other projects.