Re: Flushing DMA Buffer Allocated with AllocateCommonBuffer

Chuck,

You offer some solutions to some of the problems posed, and I agree that
with the right structure for things, some of the potential problems you
suggest are avoidable.

As to the performance advantages of having a smaller miniport:

  1. The miniport does some performance-critical work, but the actual drawing
    is done in the display driver proper (in our case 3DLDD.DLL as opposed to
    3DLMP.SYS), so that is where you make sure that everything is
    cache-aligned, compact, etc. (Oh, and this file is probably a few
    megabytes for most manufacturers…) So from a “Half-Life frames per
    second” perspective, the miniport is essentially unimportant, or at least
    the performance of its code is. Of course, if the miniport decided to
    “not enable AGP8X, use PCI instead”, you’d surely see a BIG difference in
    performance. But this is a once-and-for-all type of setup, so how fast the
    setup itself runs is not particularly critical.

  2. In most cases, runtime checks can be avoided with clever tricks. This
    applies to miniport and display driver both. Just set things up at the start
    of day, and then you haven’t got too many runtime checks after that. Just as
    an example, if you need to know how much memory a particular board has, you
    store that as a member of the device structure, so you don’t need to check
    every time “is it card X, then video memory = 128M”.
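
A minimal sketch of the idea in item 2, assuming a hypothetical device
structure. The names (BoardId, DeviceState, probe_video_memory_mb) and the
64 MB figure for the second board are made up for illustration and are not
taken from any real driver:

```c
#include <stdint.h>

/* Hypothetical board identifiers, for illustration only. */
typedef enum { BOARD_X, BOARD_Y } BoardId;

/* Hypothetical per-device state, filled in once at start of day. */
typedef struct {
    BoardId  board;
    uint32_t video_memory_mb;
} DeviceState;

/* The "is it card X?" question is asked here, once, during setup. */
static uint32_t probe_video_memory_mb(BoardId board)
{
    switch (board) {
    case BOARD_X: return 128;
    case BOARD_Y: return 64;   /* made-up value */
    }
    return 0;                  /* unknown board */
}

static void device_init(DeviceState *dev, BoardId board)
{
    dev->board = board;
    dev->video_memory_mb = probe_video_memory_mb(board);
}

/* Later code reads the cached member; there is no per-call board check. */
static int surface_fits(const DeviceState *dev, uint32_t size_mb)
{
    return size_mb <= dev->video_memory_mb;
}
```

After device_init() has run, callers such as surface_fits() never look at
the board type again.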

The biggest factor, as Calvin explained, is the need to keep a large number
of fair-sized tables to track (for instance) video modes for different
display settings. This can expand quite quickly, and depending on how you do
things, you may end up with quite a lot more data than the obvious X, Y, BPP
and VFreq. For instance, you may want to encode in this table some of the
hard-to-calculate parameters for the video unit, rather than calculating
them at the time you need them. I don’t know how much of our driver is
tables, but there are certainly some pretty large tables in quite a few
places. Mostly, these are “use once” tables, so they are only of interest to
the setup code when the graphics mode is initialized. Once the mode has been
set up, the tables are not used again until the next time the screen mode is
changed (for instance when you start Half-Life in full-screen mode). Of
course, having a quarter meg of tables will not make much of a difference to
the performance in Half-Life, as the tables are all out of the cache by the
time you start playing, and while the code in the miniport may or may not be
efficient, very little of it will actually be used when playing Half-Life
anyway.
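
As a rough illustration of such a “use once” mode table, here is a sketch
under stated assumptions. The ModeEntry layout, the find_mode helper and the
pll_setting values are hypothetical placeholders, not real register data;
only the pitch column is actually derived (width * bpp / 8):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: the obvious X, Y, BPP, VFreq plus values
 * that are precomputed so the setup code need not derive them. */
typedef struct {
    uint16_t width;
    uint16_t height;
    uint8_t  bpp;
    uint8_t  refresh_hz;
    uint32_t pitch_bytes;  /* precomputed: width * bpp / 8 */
    uint32_t pll_setting;  /* placeholder for a hard-to-calculate value */
} ModeEntry;

static const ModeEntry mode_table[] = {
    {  640, 480, 16, 60, 1280, 0x11 },  /* pll_setting values are made up */
    {  800, 600, 16, 75, 1600, 0x22 },
    { 1024, 768, 32, 85, 4096, 0x33 },
};

/* Linear search is fine here: this runs only when the mode changes. */
static const ModeEntry *find_mode(uint16_t width, uint16_t height,
                                  uint8_t bpp, uint8_t refresh_hz)
{
    size_t i;
    for (i = 0; i < sizeof mode_table / sizeof mode_table[0]; ++i) {
        const ModeEntry *m = &mode_table[i];
        if (m->width == width && m->height == height &&
            m->bpp == bpp && m->refresh_hz == refresh_hz) {
            return m;
        }
    }
    return NULL;  /* mode not supported */
}
```

The table costs space, not frame rate: it is consulted at mode-set time
and then left alone.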

Finally, if you have a reasonably fast processor, the limiting factor for
graphics speed is most likely the graphics processor itself. Of course, for
really complex games, this may not be true. But a significant amount of time
can easily be spent “waiting for the graphics processor” if you don’t have
lots of math to do before drawing the next frame. This is why some
benchmarks end up giving exactly the same score whatever the processor once
you get to a certain level of performance. Obviously, optimising the code in
the driver (whether miniport or display driver) gains you nothing in this
case.


Mats

-----Original Message-----
From: Chuck Batson [mailto:xxxxx@cbatson.com]
Sent: Thursday, November 20, 2003 3:45 PM
To: Windows System Software Devs Interest List
Subject: [ntdev] Re: Flushing DMA Buffer Allocated with AllocateCommonBuffer

Hi Mats, thanks for the thorough answer. I know absolutely nothing
about Windows video drivers, so I’m mostly just curious what the reasons
are for doing this.

> The other reason would be maintaining the different builds in some sensible
> way, and avoiding errors. If you split the source up into different source
> files, some update somewhere will be missed in a different version.

Presumably any code that needed to be shared among the different
versions would be in a common place (perhaps a library) to avoid “copy &
paste” updates and other modification propagation errors.

> If you
> have lots of #if in the code, it becomes hairy for other reasons. Having one
> large source file that contains all different variations without any
> conditional compile makes it relatively easy to maintain.

This is true, although in those places where you need different code for
different hardware, it has to be done somehow, be it a compile-time
#if or a run-time if (meaning a conditional of some sort).

> There’s also the fact that for each package you have, you need separate WHQL
> certification, which means that if you do a different build for a particular
> board/SKU etc, you need to run the same tests on this build as you did on
> the main build, so for each variation you add one lot of WHQL runs. A run of
> WHQL for Display driver takes around a day to run, assuming all goes well…
> Add to this that you have to send the logs to MS, get them to certify you,
> and wait for the results from MS.

This is probably the most compelling of all the reasons you mention. =^)

> If the driver is updated often (during Beta stage when you’re developing a
> new board + ASIC for instance), you also get problems with tracking which
> versions of which variation of the driver has which fixes included, as
> someone may have updated something during the daily build stage, and the
> second of the builds for that day has a different set of fixes than the
> first build.

These are of course real issues, but there are procedures and processes
you can use to alleviate them. For example, having a dedicated build
engineer; building from a source “snapshot” (so you don’t get builds
when the source code is in an “intermediate” state); and building all
binaries during a single build from the source snapshot.

> Now multiply all this by the number of OS’s that you support (WinNT, 2K, XP,
> 9X etc) and you start seeing why having one driver is a real nice thing.

This is also true. I have written application code that handles the
various OS flavors at run-time as opposed to compile-time. I’m just
surprised that something as performance-critical as a video driver
would do this too. Having a single binary implies at least three
potentially performance-sapping side-effects: (1) run-time conditionals
(extra instructions for tests, comparisons, branches; CPU performance
with regard to branches, such as incorrectly predicted branches,
instruction fetch queue and pipeline flushes, etc.); (2) increased cache
misses due to lower spatial proximity as well as larger footprint; and
(3) swapping or memory resource consumption due to a larger binary
footprint. Do you have any thoughts on how, in a practical “real world”
situation, having a unified driver binary affects performance? For
example, if I had a driver compiled specifically for my hardware, how
many more frames per second would I see playing Half-Life? =^)

Chuck

> > -----Original Message-----
> > From: Chuck Batson [mailto:xxxxx@cbatson.com]
> > Sent: Thursday, November 20, 2003 2:48 PM
> > To: Windows System Software Devs Interest List
> > Subject: [ntdev] Re: Flushing DMA Buffer Allocated with AllocateCommonBuffer
> >
> >
> > Perhaps this is a dumb question, but is there any particular reason why
> > you don’t segregate into different builds? Why is it necessary to cram
> > everything into a single driver binary?
> >
> > Chuck
> >
> > ----- Original Message -----
> > From: “Calvin Guan”
> > > To: “Windows System Software Devs Interest List”
> > > Sent: Wednesday, November 19, 2003 2:26 AM
> > > Subject: [ntdev] Re: Flushing DMA Buffer Allocated with AllocateCommonBuffer
> > >
> > >
> > > > To add to what Alberto said, our miniport has to support a huge list of
> > > > desktop and mobile ASICs. Each ASIC has a different video BIOS to handle.
> > > > The biggest headache for me is the mobile ASICs on notebooks. Different
> > > > OEMs have different LCD panels, and different OEMs require different
> > > > features. Also, there are many awesome features implemented in the
> > > > miniport.
> > > >
> > > > Instead of “miniport”, I would call it a giantport. It’s even larger than
> > > > ntfs.sys in size. I really miss the days when I was working on an NDIS
> > > > miniport, where I wrote every single line of code for my driver :-)
> > > >
> > > > -----Original Message-----
> > > > From: Moreira, Alberto [mailto:xxxxx@compuware.com]
> > > > Sent: Tuesday, November 18, 2003 10:40 AM
> > > > To: Windows System Software Devs Interest List
> > > > Subject: [ntdev] Re: Flushing DMA Buffer Allocated with AllocateCommonBuffer
> > > >
> > > >
> > > > There’s a lot of functionality in a Miniport: it does most of the
> > > > non-time-critical functions of driving a graphics subsystem. Some people
> > > > put support for several different chips in the same piece of code, but
> > > > even if you only have one chip, your Miniport may end up being pretty big.
> > > > Some of the actual space is taken by tables. For example, every graphics
> > > > driver supports several resolutions and bit depths, and one must keep
> > > > tables of register settings that set up your chip to the corresponding
> > > > video mode. There are also tables with configuration and capability
> > > > settings, and they take space. You must handle initialization,
> > > > capabilities, mode changes, power management, multiple screens, resource
> > > > management, you name it. You must also manage the retrace interrupt. In
> > > > WinXP there’s even new support for DMA. BTW, Calvin, do you guys implement
> > > > and use the new DMA calls that WinXP added to the Miniport?
> > > >
> > > > Alberto.
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: xxxxx@lists.osr.com
> > > > [mailto:xxxxx@lists.osr.com]On Behalf Of Maxim S. Shatskih
> > > > Sent: Monday, November 17, 2003 10:14 PM
> > > > To: Windows System Software Devs Interest List
> > > > Subject: [ntdev] Re: Flushing DMA Buffer Allocated with AllocateCommonBuffer
> > > >
> > > >
> > > > Wow! Am I right that this huge amount of code is due to supporting all
> > > > videocard hardware models and maintaining the backward compatibility, so
> > > > that the newest binary can work with even the old hardware?
> > > >
> > > > Maxim Shatskih, Windows DDK MVP
> > > > StorageCraft Corporation
> > > > xxxxx@storagecraft.com
> > > > http://www.storagecraft.com
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: Calvin Guan
> > > > To: Windows System Software Devs Interest List
> > > > Sent: Tuesday, November 18, 2003 4:02 AM
> > > > Subject: [ntdev] Re: Flushing DMA Buffer Allocated with AllocateCommonBuffer
> > > >
> > > >
> > > > Well, video miniport is a lot of code :-).
> > > > Our Radeon x86 free build miniport (ati2mtag.sys) is more than 600k. The
> > > > chk build doesn’t fit onto a floppy…
> > > >
> > > > Calvin Guan, Software Developer xxxxx@nospam.ati.com
> > > > SW2D-Radeon NT Core Drivers
> > > > ATI Technologies Inc.
> > > > 1 Commerce Valley Drive East
> > > > Markham, Ontario, Canada L3T 7X6
> > > > Tel: (905) 882-2600 Ext. 8654
> > > > Find a driver: http://www.ati.com/support/driver.html
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Maxim S. Shatskih [mailto:xxxxx@storagecraft.com]
> > > > > Sent: Monday, November 17, 2003 7:20 PM
> > > > > To: Windows System Software Devs Interest List
> > > > > Subject: [ntdev] Re: Flushing DMA Buffer Allocated with AllocateCommonBuffer
> > > > >
> > > > >
> > > > > > Miniport. For example, look at the Permedia P3 sample in the DDK, the DMA
> > > > > > rendering is handled in the driver and not in the Miniport. There’s not
> > > > >
> > > > > Then why is nVidia’s miniport THIS huge (500KB or such)?
> > > > >
> > > > > Maxim Shatskih, Windows DDK MVP
> > > > > StorageCraft Corporation
> > > > > xxxxx@storagecraft.com
> > > > > http://www.storagecraft.com
> > > > >
> > > > >
> > > > > —
> > > > > Questions? First check the Kernel Driver FAQ at
> > > > > http://www.osronline.com/article.cfm?id=256
> > > > >
> > > > > You are currently subscribed to ntdev as: xxxxx@ati.com
> > > > > To unsubscribe send a blank email to xxxxx@lists.osr.com
> > > > >
> > > >
> > > >
> > > > The contents of this e-mail are intended for the named addressee only. It
> > > > contains information that may be confidential. Unless you are the named
> > > > addressee or an authorized designee, you may not copy or use it, or disclose
> > > > it to anyone else. If you received it in error please notify us immediately
> > > > and then destroy it.

We have a daily build which starts at 2:30 am and is supposed to yield a
working system the next morning. Whoever breaks the build is supposed to
give away donuts to the whole team. We have a daily labeling system that
increments the build number every new build, so that our release engineers
can recreate any build from that source control label. And long live
Perforce!

Alberto.

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com]On Behalf Of Chuck Batson
Sent: Thursday, November 20, 2003 10:45 AM
To: Windows System Software Devs Interest List
Subject: [ntdev] Re: Flushing DMA Buffer Allocated with AllocateCommonBuffer

> If the driver is updated often (during Beta stage when you’re developing a
> new board + ASIC for instance), you also get problems with tracking which
> versions of which variation of the driver has which fixes included, as
> someone may have updated something during the daily build stage, and the
> second of the builds for that day has a different set of fixes than the
> first build.

These are of course real issues, but there are procedures and processes
you can use to alleviate them. For example, having a dedicated build
engineer; building from a source “snapshot” (so you don’t get builds
when the source code is in an “intermediate” state); and building all
binaries during a single build from the source snapshot.


In a lot of cases, the effect of the OS is relatively minor. Our OpenGL
code, for example, had two or three .c files where we put all of our Windows
9x support, and that was mostly composed of implementations of NT calls such
as EngXxxYyyyy at the lowest level, so that the rest of the system could act
as if it was talking to WinNT. However, the chip issues are tough, not only
because some chips are different enough that we must take those differences
into account at run time, but also because chips aren’t perfect and they
have bugs. I would have two revs of the same chip: one would have bug A, the
other would have bug B but with bug A fixed, and both bugs would be masked
away by some clever software technique. So a runtime check was unavoidable,
because we didn’t know which rev of the chip and/or board was in the
machine.
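
One way such a revision check might look, as a sketch only: the revision
numbers, quirk names and ChipState structure below are hypothetical, and a
real driver would read the revision from the hardware rather than take it
as a parameter.

```c
#include <stdint.h>

/* Hypothetical workaround flags, one bit per known chip bug. */
enum {
    QUIRK_BUG_A = 1u << 0,
    QUIRK_BUG_B = 1u << 1
};

typedef struct {
    uint8_t  revision;
    uint32_t quirks;   /* decided once, when the chip is identified */
} ChipState;

/* Assumed mapping: rev 1 has bug A; rev 2 fixed A but introduced B. */
static uint32_t quirks_for_revision(uint8_t revision)
{
    switch (revision) {
    case 1:  return QUIRK_BUG_A;
    case 2:  return QUIRK_BUG_B;
    default: return 0;
    }
}

static void chip_init(ChipState *chip, uint8_t revision_from_hw)
{
    chip->revision = revision_from_hw;
    chip->quirks   = quirks_for_revision(revision_from_hw);
}

/* The run-time check that remains is a single bit test. */
static int chip_has_quirk(const ChipState *chip, uint32_t quirk)
{
    return (chip->quirks & quirk) != 0;
}
```

The check still happens at run time, as described above, but the question
of which revision is present is answered only once.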

Your suggestion is good, because it allows isolating chip-dependent modules
in a few places, but abstracting the hardware isn’t that simple in the case
of graphics, and it may be too slow in some cases. If the processor is fast
enough, no problem, but if the processor is slow relative to the graphics
chip, additional interfaces eat up performance. In extreme cases, I’ve seen
people replacing whole chunks of the GDI with chip-dependent code just to
gain a few Winmarks! Also, it may be a lot of trouble to produce a
module just to accommodate one or two hot fixes for a chip bug that’s
only present in a specific version of a chip. My recollection is that the
code started out pretty well structured, but then, as the bugs and their
workarounds piled up, things had a tendency to spaghetti away, and it was
pretty hard to organize the code in a coherent module structure.

Alberto.

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com]On Behalf Of Phil Barila
Sent: Thursday, November 20, 2003 11:32 AM
To: Windows System Software Devs Interest List
Subject: [ntdev] Re: Flushing DMA Buffer Allocated with AllocateCommonBuffer

“Chuck Batson” wrote in message news:xxxxx@ntdev…

[snip]

> This is also true. I have written application code that handles the
> various OS flavors at run-time as opposed to compile-time. I’m just
> surprised that something somewhat performance critical as a video driver
> would do this too. Having a single binary implies at least three
> potentially performance-sapping side-effects: (1) run-time conditionals
> (extra instructions for tests, comparisons, branches; CPU performance
> with regard to branches, such as incorrectly predicted branches,
> instruction fetch queue and pipeline flushes, etc.); (2) increased cache
> misses due to lower spatial proximity as well as larger footprint; and
> (3) swapping or memory resource consumption due to a larger binary
> footprint. Do you have any thoughts on how, in a practical “real world”
> situation, having a unified driver binary affects performance? For
> example, if I had a driver compiled specifically for my hardware, how
> many more frames per second would I see playing Half Life? =^)

I’m not going to offer anything for Mats, but I can describe one technique
to ameliorate the first two of the three: Init-time specialization.

That means that you abstract the hardware differences across a common
interface, then you implement a separate module for each hardware variant
against that interface. At init time, you select the module that matches
the hardware, pop the function pointers into your dispatch table, and then
just call them unconditionally when you need them. No run-time decisions
required. In fact, that’s pretty much what the various miniports do, but
the miniport loads first, and the common part (the port) is linked to it by
the loader. The technique I described is the same thing, only you are doing
the linking explicitly, instead of letting the OS do it for you.
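
The technique described above could be sketched roughly as follows. All
names here (HwOps, HwVariant, the two variants and their alignment rules)
are hypothetical and chosen only to show the shape of the dispatch table:

```c
#include <stdint.h>

/* Common interface that every hardware variant implements. */
typedef struct {
    const char *name;
    uint32_t  (*align_pitch)(uint32_t pitch_bytes);
} HwOps;

/* Variant A: assumed to need 64-byte pitch alignment. */
static uint32_t align_pitch_a(uint32_t pitch_bytes)
{
    return (pitch_bytes + 63u) & ~63u;
}

/* Variant B: assumed to need 256-byte pitch alignment. */
static uint32_t align_pitch_b(uint32_t pitch_bytes)
{
    return (pitch_bytes + 255u) & ~255u;
}

static const HwOps ops_variant_a = { "variant-a", align_pitch_a };
static const HwOps ops_variant_b = { "variant-b", align_pitch_b };

typedef enum { HW_VARIANT_A, HW_VARIANT_B } HwVariant;

typedef struct {
    const HwOps *ops;   /* selected once at init time */
} Driver;

/* The only place that decides which hardware is present. */
static void driver_init(Driver *drv, HwVariant variant)
{
    drv->ops = (variant == HW_VARIANT_A) ? &ops_variant_a : &ops_variant_b;
}

/* Hot path: an unconditional indirect call, no per-call hardware test. */
static uint32_t driver_surface_pitch(const Driver *drv,
                                     uint32_t width, uint32_t bytes_per_pixel)
{
    return drv->ops->align_pitch(width * bytes_per_pixel);
}
```

The cost that remains is one indirect call per operation, which is the
trade-off discussed elsewhere in this thread when the processor is slow
relative to the graphics chip.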

Phil

Philip D. Barila Windows DDK MVP
Seagate Technology, LLC
(720) 684-1842
As if I need to say it: Not speaking for Seagate.

