A less cynical rationale would be that designers at Microsoft decided
that, with increasing RAM available on users’ computers, more than one
program could be resident concurrently, so there was no need to make users
wait for paging every time they pressed Alt+Tab. (The time saving is even
bigger when they switch back a few seconds later.)
I suspect that the OP is developing some specialized HW+software as some
kind of advanced research project, the goal of which is to eliminate or
dramatically reduce inter-node traffic, so telling him that he is wasting
his time is unproductive. I assume that he is trying to engraft some
cell-processing-like concepts but finds it difficult because the NT kernel
is fundamentally an SMP design with optimizations for HT & NUMA in specific
circumstances.
wrote in message news:xxxxx@ntdev…
This feature may be related to killing the market for the software Mark
Russinovich called “Fraudware”, where these behaviors were used by
programs that claimed to “improve” performance but charged you $39.95 for
something that did just the opposite: it forced everything to page out,
making Task Manager display numbers that looked good, but only to the
naive.
Also, it is not clear, in retrospect, why minimizing the command window in
the debugger (on a host machine) would have much impact on the behavior
of the target; I hadn’t thought that one through. But there was so little
information that it was hard to guess what might be the problem. See my
previous posts.
In my experience, it has always been hardware/driver underdesign that was
the root cause of data overruns, but clever application-level programming
has, in all cases I’ve had, been able to compensate for this. Key here is
to eliminate all multiple-orders-of-magnitude problems before worrying
about factors of two, or improvements of 10%.
There was no good performance data presented. For example, I suspect
latencymon is some kind of filter driver that timestamps the IRP going
down, and which computes the latency in its completion routine. If so,
delays caused by the I/O Manager causing page faults would not be
measured, but could have profound impact on the total trip time of an IRP
from application space. A more useful measure might be the delta-T
between successive IRPs to the device. Without this critical piece of
information, it strikes me as profoundly silly to worry about NUMA
adjacency. Note that this number would account for application time,
scheduler overheads, thread preemption by kernel threads, ISRs and DPCs,
and page fault overheads. I suspect that such an analysis will show some
huge delta-T right before an overrun occurs. Of course, this is useful
only if the app does synchronous I/O.
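As a rough illustration of the delta-T measurement described above, here is a small C sketch of the reduction step. It is only a sketch under assumptions: the function name `max_delta_t` is hypothetical, and it presumes the submission timestamps have already been collected somewhere as ascending tick counts. It is not how LatencyMon or the OP's driver works.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of any Windows API).
 * t[]   : timestamps, in ticks, of successive I/O requests to the device,
 *         assumed to be in ascending order.
 * n     : number of timestamps.
 * where : if non-NULL, receives the index of the request that ended the
 *         largest gap (0 if there are fewer than two timestamps).
 * Returns the largest delta-T between two successive requests. */
uint64_t max_delta_t(const uint64_t *t, size_t n, size_t *where)
{
    uint64_t worst = 0;

    assert(t != NULL || n == 0);
    if (where != NULL)
        *where = 0;

    for (size_t i = 1; i < n; i++) {
        uint64_t d = t[i] - t[i - 1];   /* gap to the previous request */
        if (d > worst) {
            worst = d;
            if (where != NULL)
                *where = i;
        }
    }
    return worst;
}
```

If the worst gap turns out to be comparable to, or larger than, the FIFO depth expressed in time, that gap is the first place to look.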
If NUMA latency were on trial, I don’t think you could even get a grand
jury to indict on the evidence presented; should such a miscarriage of
justice occur, the prosecution would be shredded at trial.
To understand a result from a measuring tool, you need to know what it is
measuring, and how accurately it reflects what is going on. For example,
what is the clock skew of different cores when KeQueryPerformanceCounter
is executed? I am not sure there is even a way to answer this question.
But it taints any high-resolution numbers obtained from multicore systems.
I can even visualize the code. Top-level driver creates a <…, timestamp>
pair and attaches it via SetCompletionRoutine. Completion routine gets a
pair. I’d also record the
IoStatus.Status value. I’d keep a large ring buffer which the monitoring
program would read-and-clear from time to time, and it would do the data
reduction. Lots of SMOP (Small Matter Of Programming) left as an Exercise
For The Reader. Note that the raw timestamps for both entry and
completion time are kept. Now the data reduction can compute both latency
and inter-IRP times, give you graphs, statistical reliability of the data,
etc. Sometimes I just write out CSV files and I let Excel do all the work.
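To make the ring-buffer idea above a little more concrete, here is a user-mode C sketch of the record layout, the read-and-clear operation, and the CSV data reduction. Everything named here (`irp_record`, `irp_ring`, `ring_put`, `ring_drain`, `write_csv`, `RING_SIZE`) is hypothetical and chosen for illustration. A real driver would also need proper synchronization (a spin lock or interlocked operations) between the completion routine and the reader, which this sketch leaves out.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define RING_SIZE 1024u   /* arbitrary capacity; must be a power of two */

typedef struct {
    uint64_t entry_ticks;     /* raw timestamp when the IRP went down    */
    uint64_t complete_ticks;  /* raw timestamp in the completion routine */
    int32_t  status;          /* IoStatus.Status seen at completion      */
} irp_record;

typedef struct {
    irp_record slots[RING_SIZE];
    size_t head;    /* next slot to write                       */
    size_t count;   /* valid records, capped at RING_SIZE       */
} irp_ring;

/* Append a record, overwriting the oldest one when the ring is full. */
void ring_put(irp_ring *r, irp_record rec)
{
    r->slots[r->head] = rec;
    r->head = (r->head + 1) & (RING_SIZE - 1);
    if (r->count < RING_SIZE)
        r->count++;
}

/* Read-and-clear: copy up to max records, oldest first, into out and
 * empty the ring. Records that do not fit in out are discarded.
 * Returns the number of records copied. */
size_t ring_drain(irp_ring *r, irp_record *out, size_t max)
{
    size_t n = r->count < max ? r->count : max;
    size_t start = (r->head + RING_SIZE - r->count) & (RING_SIZE - 1);

    for (size_t i = 0; i < n; i++)
        out[i] = r->slots[(start + i) & (RING_SIZE - 1)];
    r->head = 0;
    r->count = 0;
    return n;
}

/* Data reduction: one CSV row per record with the raw timestamps, the
 * latency (completion minus entry) and the inter-IRP time (entry minus
 * the previous entry; 0 for the first row). */
void write_csv(FILE *f, const irp_record *recs, size_t n)
{
    fprintf(f, "entry,complete,latency,inter_irp,status\n");
    for (size_t i = 0; i < n; i++) {
        uint64_t latency = recs[i].complete_ticks - recs[i].entry_ticks;
        uint64_t gap = i ? recs[i].entry_ticks - recs[i - 1].entry_ticks : 0;
        fprintf(f, "%llu,%llu,%llu,%llu,%ld\n",
                (unsigned long long)recs[i].entry_ticks,
                (unsigned long long)recs[i].complete_ticks,
                (unsigned long long)latency,
                (unsigned long long)gap,
                (long)recs[i].status);
    }
}
```

Keeping the raw timestamps rather than precomputed differences is what lets the same dump produce both the latency and the inter-IRP columns; the resulting file can be opened directly in Excel.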
joe
> IMHO this used to be a great trick to page out leaked memory in buggy apps
> and prevent various crashes caused by other buggy code that didn’t check
> for allocation failure! Too bad the ‘feature’ has been removed
> (please read this as sarcasm)
>
>
> “Doron Holan” wrote in message
> news:xxxxx@ntdev…
> Reducing working set on minimize is no longer done on w7, maybe on Vista
> as well
>
> d
>
> sent from my phone
>
>
> --------------------------------------------------------------------------------
> From: xxxxx@flounder.com
> Sent: 10/20/2011 8:18 AM
> To: Windows System Software Devs Interest List
> Subject: Re:[ntdev] DMA latency
>
>
> Note that minimizing windows tends to tell the scheduler that the process
> is not terribly important, and its pages are candidates for page-out.
> Thus, its working set requirement is reduced, leaving more pages available
> in physical memory. This could reduce the paging behavior of your app,
> see my previous posts.
>
> Until you have eliminated all possible causes > 2 orders of magnitude,
> there is no point in trying to eke out the last drop of performance
> possible. Paging is six to seven orders of magnitude performance hit, per
> page fault. For a large buffer, you might take multiple page faults
> during MmProbeAndLockPages (hence my reference to eight orders of
> magnitude).
>
> It would be useful to have performance data on this device such as:
>
> Input data rate
> FIFO buffer size (ideally expressed in units of time of the input data)
> Interrupt rate
> I/O buffer size on your read request
>
> Then there are the architecture questions:
>
> Do you do internal buffering in your driver? Or do you rely on the MDL in
> the IRP?
> How much do you do in the ISR vs. the DPC?
> What priority boost do you give on IRP completion?
> How many instructions does your app execute between I/O read calls?
> Note that if there is a kernel call (other than the read request or
> an asynchronous inter-thread queue request)
> in the loop, you probably need to rewrite the loop.
> Note that if the kernel call is graphics-related (including
> SendMessage to controls), you need to rewrite the loop.
> Putting a simple loop in a separate thread can often help.
>
> I have been solving these problems for about 15 years now. The typical
> causes are
> Underdesigned hardware
> Underdesigned driver
> Underdesigned app
>
> That’s it. You can often compensate for an underdesigned driver/hardware
> combination by expending more effort on the app. But hardware should be
> robust under operating system delays (large FIFO) and the driver needs to
> be robust under operating system delays. Using some of the application
> tricks I mentioned can help make the whole thing less sensitive to the
> hardware/driver issue, and since I’m primarily an application-level
> programmer this is where I put all my effort, and have several notable
> successes, and no failures yet.
>
> And yes, to get one of those successes I had to play with scheduler
> priorities and thread affinity. You can use these very carefully with
> success, but you don’t start out saying “Well, if I just tweak this thread
> priority and change this affinity, all will be well”.
> joe
>
>
>
>> Am 19.10.2011 18:37, schrieb xxxxx@eircom.net:
>>> Complete newbie here. I’m working with a modified reference design for a
>>> PCI-e Xilinx eval board which uses DMA to transfer data from the board’s
>>> FIFO to main memory. I can monitor the size of the FIFO and I see that it
>>> is empty most of the time, i.e. the DMA operations are successfully
>>> clearing out the FIFO in good time.
>>>
>>> Unfortunately, every so often the DMA loop is being delayed and the FIFO
>>> is overflowing. I need to find out what’s causing this interference. I’ve
>>> disabled as much hardware and software as I can but no luck. Two questions:
>>>
>>
>> Russel, what platform are you using (Core <i_something>, Atom…)? With all
>> Intel processors one thing is very important: make sure your transfers are
>> a multiple of 64 bytes and are aligned to 64-byte boundaries. I would
>> recommend setting the driver alignment requirements to 64 bytes, or some
>> multiple thereof (64 bytes is the cache line size). Observing this rule
>> works wonders. On an Atom application it gave me a performance increase of
>> over 30%. It’s also in the IA manuals somewhere; unfortunately I only found
>> the note myself after I had experimented for ages.
>>
>> Also, when writing to memory, try to always transfer <max_payload_size>
>> blocks. This depends on your chipset but is typically 128 bytes or 256
>> bytes. This gives you the best usage of available credits.
>>
>> If you are using Intel Atom, you may have to disable deep-sleep S6 states.
>> This can usually be done in the BIOS. I had a major issue with the credit
>> update (UpdateFC) frequency here which caused a problem very like the one
>> you described.
>>
>> Slan,
>> Charles
>>
>>> 1. I installed LatencyMon.exe to see what might be causing the
>>> interference and I can see that on an idle system, ataport.sys is handling
>>> a lot of interrupts and DPCs BUT there are no hard page faults and there
>>> are no applications running. What could be causing this activity?
>>>
>>> 2. I’ve noticed that I get much less interference when I minimise the
>>> command window displaying application debug output, even though there is
>>> no debug output during the DMA loop itself. Am I correct in assuming that
>>> Windows would not write to the graphics chip (x4500) when there is no
>>> change to the desktop screen?
>>>
>>> 3. Does Windows use DMAs to communicate screen updates to the graphics
>>> chip (that might interfere with my application DMAs)?
>>>
>>> 4. Does the graphics chip use DMAs to access its frame buffer (which is
>>> using system memory) during a screen refresh?
>>>
>>> Thanks.
>>>
>>
>>
>> —
>> NTDEV is sponsored by OSR
>>
>> For our schedule of WDF, WDM, debugging and other seminars visit:
>> http://www.osr.com/seminars
>>
>> To unsubscribe, visit the List Server section of OSR Online at
>> http://www.osronline.com/page.cfm?name=ListServer
>>