I am working on some drivers that are layered above a fabric I/O device.
I noticed the fabric drivers are pretty much ported Linux drivers (yuck). I
also was unhappy to notice there are no calls to IoGetDmaAdapter (or the bus
interface variant), even though the device is a complex bus master device
and will be doing transfers to buffers I supply.
I refreshed my memory on the Windows DMA model by reading dma.doc. No
surprises, just a reminder of the correct way to do DMA.
Can people give me some concrete examples of when bus-master DMA will fail
for a 64-bit-address-capable, scatter/gather-capable device if someone just
calls MmGetPhysicalAddress and uses those addresses for the DMA transfer?
The document dma.doc gave a potential issue of cache coherence on IA64
systems. I’d prefer to be able to say “on machines X, Y, and Z” or “on
machines with AMD chipset R, this architecture is broken,” instead of just
“this architecture is in theory broken on some machine,” which doesn’t
reflect any machine you can buy today (so it’s actually ok to bet our
company on this).
Thanks.
Jan
Many years ago (back in NT 3.51) I wrote a driver
for a bus-mastering device on a DEC Alpha system. I was
naive enough to use MmGetPhysicalAddress to do DMA and
got a nice BSOD complaining about a bus parity error. I
learned that the physical address the MMU sees may not
equal what the device sees on architectures that implement
real (hardware) map registers; SPARC is another example.
OTOH, nowadays video drivers or AGP bridge drivers do use
MmAllocateContiguousMemorySpecifyCache or its variants
for DMA in some cases.
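For reference, a minimal sketch of the kind of allocation Calvin mentions,
assuming (purely for illustration) a device that can only reach the low
4 GB; the helper name and the limit are not from his post:

    #include <ntddk.h>

    /* Allocate a physically contiguous DMA buffer, capping the highest
       acceptable physical address so the device can reach it. */
    PVOID AllocateDeviceReachableBuffer(SIZE_T Length)
    {
        PHYSICAL_ADDRESS low, high, boundary;

        low.QuadPart      = 0;           /* anywhere from 0 ...            */
        high.QuadPart     = 0xFFFFFFFF;  /* ... up to 4 GB (assumed limit) */
        boundary.QuadPart = 0;           /* no boundary restriction        */

        /* MmCached is the usual choice on cache-coherent x86/x64 systems.
           Note the CPU physical address of this buffer is still not
           guaranteed to be what the device sees on every architecture. */
        return MmAllocateContiguousMemorySpecifyCache(Length, low, high,
                                                      boundary, MmCached);
    }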
> (so it’s actually ok to bet our company on this).
Why bet? What prevents you from using the “DMA model”?
–
Calvin Guan (Windows DDK MVP)
NetXtreme Longhorn Miniport Prime
Broadcom Corp. www.broadcom.com
> > (so it’s actually ok to bet our company on this).
>
> Why bet? What prevents you from using the “DMA model”?
We didn’t write the code, and have no control over it; it came from
the vendor of the fabric hardware. It’s a fairly large set of drivers for
doing fabric I/O, and we do get to see the source code. I think the question
is: can we depend on this stuff working, or do we have a serious problem?
If I were starting at ground zero, I absolutely would be using the
proper DMA model.
I don’t know a whole lot about DMA, but I do know this reason for using
the DMA adapter on 64-bit machines: not all cards can see the full
physical address range of the machine. Additionally, I think not all
bridges can see the full address range either. The DMA adapter + HAL
know this for the current hardware configuration and will only translate
a system VA into a device-visible (logical) address that the card can
actually reach, while MmGetPhysicalAddress can easily give you an address
the device cannot access.
d
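To make that concrete, here is a minimal sketch of the packet-based
DMA-adapter path, assuming a 64-bit PCI bus master and a transfer already
described by an MDL; the function and parameter names are illustrative:

    #include <ntddk.h>

    /* Build the adapter object once, typically while handling
       IRP_MN_START_DEVICE. */
    static PDMA_ADAPTER
    GetAdapter(PDEVICE_OBJECT Pdo, PULONG MapRegisters)
    {
        DEVICE_DESCRIPTION dd;

        RtlZeroMemory(&dd, sizeof(dd));
        dd.Version           = DEVICE_DESCRIPTION_VERSION;
        dd.Master            = TRUE;
        dd.ScatterGather     = TRUE;
        dd.Dma64BitAddresses = TRUE;        /* device decodes 64-bit addresses */
        dd.InterfaceType     = PCIBus;
        dd.MaximumLength     = 1024 * 1024; /* illustrative transfer limit */

        return IoGetDmaAdapter(Pdo, &dd, MapRegisters);
    }

    /* Per transfer (called at DISPATCH_LEVEL): let the HAL hand back
       addresses the device can actually use. */
    static NTSTATUS
    StartTransfer(PDMA_ADAPTER Adapter, PDEVICE_OBJECT Fdo, PMDL Mdl,
                  PDRIVER_LIST_CONTROL Callback, PVOID Context,
                  BOOLEAN WriteToDevice)
    {
        return Adapter->DmaOperations->GetScatterGatherList(
                   Adapter, Fdo, Mdl,
                   MmGetMdlVirtualAddress(Mdl),
                   MmGetMdlByteCount(Mdl),
                   Callback,            /* receives a PSCATTER_GATHER_LIST */
                   Context,
                   WriteToDevice);
        /* The callback programs the hardware from ScatterGather->Elements[]
           and later calls PutScatterGatherList, which also takes care of
           any map registers or bounce buffers the HAL used. */
    }

The SCATTER_GATHER_LIST elements are bus-relative (logical) addresses, which
is exactly what MmGetPhysicalAddress does not promise to give you.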
I have also inherited a driver that straddles Windows
and different versions of Unix. This driver has a core library
of OS-independent functions, and DMA is done inside it - hence
the driver doesn’t use the Windows DMA model at all. The
way it works is, the driver builds a scatter-gather list for the
buffer by hand, using MmGetPhysicalAddress in its Windows code,
and from that point on the DMA is OS-independent. This seems to
work fine, except that in some cases under some Unixes it may be
safer to copy the user buffer to/from a kernel buffer and
generate the scatter-gather list for the kernel buffer instead.
It all seems to work fine, but I’m always on the prowl for cases
where it could break!
Alberto.
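For concreteness, a hand-built scatter-gather path of the kind Alberto
describes usually looks roughly like the sketch below. This is a guess at
the pattern, not his actual code; SG_ENTRY and MAX_SG are made-up names.
Its built-in assumption is the one under discussion in this thread: that
the CPU physical address MmGetPhysicalAddress returns is also the address
the device should use.

    #include <ntddk.h>

    typedef struct _SG_ENTRY {       /* hypothetical device-format element */
        ULONGLONG Address;
        ULONG     Length;
    } SG_ENTRY;

    #define MAX_SG 64

    static ULONG
    BuildSgByHand(PMDL Mdl, SG_ENTRY Sg[MAX_SG])
    {
        PUCHAR va   = (PUCHAR)MmGetSystemAddressForMdlSafe(Mdl,
                                                           NormalPagePriority);
        ULONG  left = MmGetMdlByteCount(Mdl);
        ULONG  n    = 0;

        if (va == NULL) {
            return 0;
        }
        while (left != 0 && n < MAX_SG) {
            ULONG run = PAGE_SIZE - BYTE_OFFSET(va); /* stay within one page */
            if (run > left) {
                run = left;
            }
            /* Assumption: CPU physical address == bus address.  This is
               exactly what breaks on Alpha-style map registers, I/O MMUs,
               and the virtualized systems discussed later in the thread. */
            Sg[n].Address = (ULONGLONG)MmGetPhysicalAddress(va).QuadPart;
            Sg[n].Length  = run;
            n++;
            va   += run;
            left -= run;
        }
        return n;   /* number of elements filled in */
    }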
Hi, Calvin,
I wonder if you could elaborate? The driver I inherited also
uses MmGetPhysicalAddress, and it seems to work fine. However, I
do have the option of copying the user buffer to/from a kernel
buffer, and doing the DMA - including building the
Scatter-Gather list - on the kernel buffer. So, any specific
cases will be highly welcome!
Alberto.
> I wonder if you could elaborate? The driver I inherited also
> uses MmGetPhysicalAddress, and it seems to work fine. However, I
> do have the option of copying the user buffer to/from a kernel
> buffer, and doing the DMA - including building the
> Scatter-Gather list - on the kernel buffer. So, any specific
> cases will be highly welcome!
I’ve been pondering what architectures will have a problem and had a few
thoughts.
1) The new Intel (and future AMD) cpu virtualization in hardware must create
the situation where a physical address as seen by the processor running the
OS in a virtual machine != a physical address as seen by ALL busses. Things
like VMWare must also have this issue, but also will not run arbitrary
devices. I’m planning on finding the Intel virtualization specs to
understand this better. I’m curious if part of the driving force behind the
virtual bus driver/virtual function driver architecture is to allow a path
that works on virtualized copies of the OS. Properly designed, it offhand
seems like you would run one copy of the virtual bus driver on a hypervisor
and then each instance of the OS just runs instances of the virtual function
drivers. It seems like you would need some sort of virtual resources passed
to the function driver AddDevice routine to describe how the function driver
and bus driver communicate.
2) In the recent past, CPU memory caches were automatically kept coherent by
hardware. When I think about things like SMP AMD systems (essentially a NUMA
architecture), it seems extremely inefficient for EVERY processor to have to
snoop its cache on EVERY cache-line DMA. It seems very desirable to just DMA
data into one of the memory groups without creating snoop traffic across the
HyperTransport. I don’t know if it’s REQUIRED for PCI(-X) and PCI-e to handle
cache coherence in hardware, or if it’s just how many systems happen to be
built. If hardware doesn’t handle this, it seems like a range of cache on
EVERY processor will need to get flushed before a DMA transfer happens. The
OS can just generate inter-CPU interrupts (or perhaps there is a way to
generate special bus cycles that processors can snoop on to flush caches in
parallel). The Windows DMA model would just automatically do whatever is
needed (see the flush-call sketch after this post); doing this in a driver
explicitly might be hard. Any hardware engineers out there who know what
modern bus specs say about cache coherence?
3) The document on Windows DMA mentions IA64 processors having some issues
if you don’t use the Windows DMA model, although, not being an IA64 expert,
I don’t know the details.
4) What does it mean to say something “works”? I’d personally be very
unhappy if my server corrupted a cache line of data once a month. Testing
for this seems especially difficult. It seems possible you could create a
program that generated a test I/O load with predictable data. After running
for some time period, like 1 month or 6 months, you could then verify that
the data stored on disk matched what it should be. I assume you’d have to
have test systems and control systems that differ in only one component.
I’m not really a believer that you can test quality into software; my
experience is that software quality is significantly determined by the
process. I know things like disk drives have reliability data available on
uncorrected errors per X gigabytes transferred. The real question comes
down to this: all computer hardware has some unavoidable level of data
corruption, so do the drivers + hardware components we add significantly
degrade the system-wide level of data corruption? There is also the
question: will customers KNOW about data corruption? As an engineer, data
corruption is VERY serious to me, although at some companies management will
just view it as the loss of a few customers. I actually think the whole open
source movement, where basically nobody is legally responsible for anything,
is going in the wrong direction in terms of making computers and their
makers accountable and responsible. But that’s a whole other discussion.
It’s hard to say how many of these might be a problem in the immediate
future on current OSes (e.g. W2K3) vs. future OSes (e.g. Longhorn Server in
2007 or 2008). It does seem like a problem for a company to sell a
potentially expensive product that will not evolve and function as customers
assume in the near future, although the computer industry also seems not to
worry much about obsolescence.
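For the cache-coherence point in 2), the calls the DMA model wraps around a
single mapped transfer are roughly the ones below. A minimal sketch of the
map-register (AllocateAdapterChannel) path; the function name is made up and
error handling is omitted:

    #include <ntddk.h>

    /* MapRegisterBase is the value handed to the AdapterControl callback. */
    static VOID
    OneMappedTransfer(PDMA_ADAPTER Adapter, PMDL Mdl, PVOID MapRegisterBase,
                      BOOLEAN WriteToDevice)
    {
        PHYSICAL_ADDRESS logical;
        ULONG            length = MmGetMdlByteCount(Mdl);

        /* Flush CPU caches for the buffer before the device touches it.
           A no-op on fully coherent hardware, essential where it isn't. */
        KeFlushIoBuffers(Mdl, (BOOLEAN)!WriteToDevice /* ReadOperation */, TRUE);

        /* Returns a device-usable (logical) address, possibly via map
           registers or bounce buffers. */
        logical = Adapter->DmaOperations->MapTransfer(
                      Adapter, Mdl, MapRegisterBase,
                      MmGetMdlVirtualAddress(Mdl), &length, WriteToDevice);

        /* ... program the device with 'logical', start the transfer, and
           wait for the completion interrupt ... */

        /* After the device is done: flush map registers / bounce buffers
           back toward memory before the driver looks at the data. */
        Adapter->DmaOperations->FlushAdapterBuffers(
            Adapter, Mdl, MapRegisterBase,
            MmGetMdlVirtualAddress(Mdl), length, WriteToDevice);
    }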
I was about to reply to your original post. But you’ve covered the issues
nicely for yourself.
To summarize, your drivers will break on chipsets that need extra cache
coherence help and on virtualized systems where there is an I/O MMU.
Neither of these is particularly common today, but they’ll be much more
common in the near future. The drivers will also break on non-x86 machines
where the DMA address doesn’t equal the CPU-relative physical address, but
those machines have become very uncommon in the last five years.
–
Jake Oshins
Windows Kernel Group
This posting is provided “AS IS” with no warranties, and confers no rights.
When I say “seems to work”, I mean the shipping product is kind
of mature and I haven’t yet heard of an issue from that
general direction. I’ll deal with VMs when I have to cross
that bridge. Processor performance is not an issue; this is such
a chip-intensive business that I can afford to wallow and
splurge in processor power. Some configurations, especially those
running AquariusNet, may have two or four chips in a system, so,
again, we’re talking about a lot of I/O. Also, we do handle
cache coherence on machines where we need to. Now, this is a
64-bit peripheral on a 64-bit bus, and my major worry is not
being able to see some RAM address on a bus far to my north. Yet
I thought that PCI translation went a long way towards handling
differences between I/O and system bus addressing, or am I
wrong?
Alberto.
Hmmm, the plot thickens.
It was pointed out to me that the Windows Sockets Direct interface may do
data transfers entirely from user mode. There will be no opportunity to make
the appropriate start- or end-of-DMA calls if the hardware isn’t cache
coherent. The WSD docs specifically talk about maintaining the bus address
mapping across I/O requests. It was also mentioned that Driver Verifier will
break WSD, as the mapping may be done before the data is placed in the
buffers.
Am I missing something in the WSD service provider API, or is there a
significant conflict between the WSD architecture and “proper” Windows DMA?
A related question: is the CacheEnabled flag on AllocateCommonBuffer just a
hint of your preference? If it’s set and your system is cache coherent, you
will get cached memory; if not, you won’t. It’s the adapter object that
knows whether this bit should be set, not us driver developers.
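For reference, the call in question looks like this; a minimal sketch, with
the wrapper name being an assumption (and whether the HAL silently overrides
CacheEnabled is exactly the open question above):

    #include <ntddk.h>

    /* Allocate a common buffer shared by driver and device.  'Adapter'
       is the PDMA_ADAPTER returned by IoGetDmaAdapter; *DeviceAddress
       receives the address the device should use. */
    static PVOID
    GetCommonBuffer(PDMA_ADAPTER Adapter, ULONG Length,
                    PPHYSICAL_ADDRESS DeviceAddress)
    {
        /* CacheEnabled = TRUE requests cached memory; on hardware that
           is not cache coherent the HAL may need it to be FALSE, which
           is why the adapter object, not the driver, really knows. */
        return Adapter->DmaOperations->AllocateCommonBuffer(Adapter,
                                                            Length,
                                                            DeviceAddress,
                                                            TRUE);
    }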
I also looked into hardware-based virtualization. One of the things Intel’s
(and I assume AMD’s) hardware virtualization extensions do is trap accesses
to CR3, which holds the physical address of the root page directory. This
allows a virtualization hypervisor to substitute root page directory entries
different from the ones the OS thinks are being used, essentially mapping
the high physical address bits to hypervisor-controlled values. This means
in a virtualized environment you may have bus address == processor physical
address != OS-virtualized physical address. MmGetPhysicalAddress would
return the virtualized OS physical address, even though bus address ==
processor physical address.