Optimizing

Hi,

I am wondering whether RtlCopyMemory uses only one core on a multicore CPU. If
so, is there a way to use all of the cores in an optimized, custom-written
copy operation? I also have an underlying reason for asking: I need to throw
away every n bytes as I copy from source to destination.
I was also wondering whether something like OpenMP can be used in driver
code, where a for loop would be executed in parallel provided that its
iterations are independent of each other.

Thanks,
Ashok Bruno

RtlCopyMemory will not farm the copy out to each CPU. If the thread doing the copy is context switched to another CPU, then yes, it will use more than one core (but not at the same time :wink: ). You can achieve this on your own by targeting DPCs at specific processors, or by creating your own thread pool and affinitizing each thread to a particular CPU. Not that I recommend you do either, because I think for this case it is a jackhammer approach to hammering in a nail.
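
For what it's worth, either do-it-yourself route above starts with the same step: splitting the buffer into one contiguous slice per worker. A minimal sketch of just that partitioning step in plain C follows. `split_range` and its parameters are hypothetical names, not WDK routines. In a driver, each slice would then be handed to a DPC targeted with KeSetTargetProcessorDpc or to a system thread pinned with KeSetSystemAffinityThread, and that part is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: compute the slice of a `total`-byte buffer that
 * worker `index` (0-based) out of `workers` should copy. The first
 * (total % workers) workers get one extra byte, so the slices tile the
 * buffer exactly, with no gaps and no overlap. */
void split_range(size_t total, size_t workers, size_t index,
                 size_t *offset, size_t *length)
{
    assert(workers > 0 && index < workers);
    size_t base = total / workers;   /* bytes every worker gets */
    size_t extra = total % workers;  /* leftover bytes to hand out */
    *length = base + (index < extra ? 1 : 0);
    *offset = index * base + (index < extra ? index : extra);
}
```

Each worker would then copy `length` bytes starting at `offset` in both source and destination. Whether this beats a single RtlCopyMemory call would have to be measured.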

I am 99.9% positive that OpenMP code is not compatible with kernel programming. IIRC, OpenMP needs a runtime underneath it to abstract threads, and there is no such runtime for kernel mode.

d

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of Ashok Bruno
Sent: Thursday, July 24, 2008 8:01 AM
To: Windows System Software Devs Interest List
Subject: [ntdev] Optimizing

NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

Having been a compiler guy who had to handle structure assignments, it amazes
me how many people ask “why is this not multi-threaded?” when the real
question is “why would you multi-thread the common case?” Bottom line: the
overhead of figuring out the “best” way to do a block copy can cost as much
as moving a fair number of bytes. So if you are copying multi-MB blocks all
the time, yes, you want to do this; otherwise you are just adding complexity
for the many cases that do not need it.
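
One cheap way to keep the common case simple is a size threshold: small copies stay on the calling CPU, and only large ones are split. The sketch below assumes a hypothetical policy function `workers_for_copy`. The `min_chunk` and `max_workers` values are placeholders that would have to be tuned by measurement on the target machine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical policy: use one worker per `min_chunk` bytes, capped at
 * `max_workers`, and never fewer than one. Small copies stay on the
 * calling CPU and pay no dispatch overhead. */
size_t workers_for_copy(size_t bytes, size_t min_chunk, size_t max_workers)
{
    assert(min_chunk > 0 && max_workers > 0);
    size_t n = bytes / min_chunk;
    if (n < 1)
        n = 1;
    if (n > max_workers)
        n = max_workers;
    return n;
}
```

With a 1 MB `min_chunk` and four workers, a 4 KB copy uses one worker and a 64 MB copy uses all four.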


Don Burn (MVP, Windows DDK)
Windows 2k/XP/2k3 Filesystem and Driver Consulting
Website: http://www.windrvr.com
Blog: http://msmvps.com/blogs/WinDrvr
Remove StopSpam to reply

Surely one CPU will drive the memory as fast as it can go anyway? M

Thanks much all.
I was almost certain that it would not be multithreaded; the only
reason I would consider such a possibility is that I am doing multi-MB
transfers.

On Thu, Jul 24, 2008 at 5:49 PM, Mike Kemp wrote:
> Surely one CPU will drive the memory as fast as it can go anyway? M

Not in the case of NUMA, but that probably doesn’t apply here when
discussing a single buffer (as it probably would only reside inside
one memory node).


Rune

Why do you believe you will be doing multi-MB copies often? That sounds
like a design flaw to me for most drivers.


Don Burn (MVP, Windows DDK)

Well, I have hardware that transfers data from a digitizer, which is
acquiring data at 40 GS/s, to the PC at around 500 MB/s. Of course there is
lots of dead time while this data is displayed etc., but there is also a mode
where data comes across at much slower sample rates than what I mentioned.
Even then, the amount of data is pretty significant, so that the system is not
bothered with too many interrupts. This data has to be pulled out even though
it is stored elsewhere, because the intermediate buffer fills up and
overflows if the PC is not able to keep up. Of course you could argue that
PCs were not meant to be operated in RTOS mode, but in reality the PC does a
pretty good job of keeping up. Also, the system that the driver runs on is
ours, i.e. the environment is known.

Ashok

If you are throwing away some of the bytes, it’s no longer a block copy.

My guess is that a tight loop of appropriate SSE (2, 3, 4) code would do the
byte filtering and perhaps run at memory access bandwidth (a few
giga-instructions/sec at 128 bits wide).
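
To make the "no longer a block copy" point concrete, here is a scalar sketch of such a filtering copy in portable C. It is not the SSE loop suggested above. It assumes one reading of the original question: keep `keep` bytes, then discard `drop` bytes, and repeat. `copy_discarding` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy from src to dst, keeping `keep` bytes and then discarding `drop`
 * bytes, repeating until `src_len` bytes have been consumed. Returns the
 * number of bytes written to dst. dst must have room for all kept bytes
 * and must not overlap src. */
size_t copy_discarding(void *dst, const void *src, size_t src_len,
                       size_t keep, size_t drop)
{
    assert(keep > 0);
    unsigned char *out = (unsigned char *)dst;
    const unsigned char *in = (const unsigned char *)src;
    size_t pos = 0;

    while (pos < src_len) {
        /* Copy the next run of kept bytes (shorter at the tail). */
        size_t run = src_len - pos;
        if (run > keep)
            run = keep;
        memcpy(out, in + pos, run);
        out += run;
        pos += run;

        /* Skip the discarded bytes without running past the end. */
        size_t remaining = src_len - pos;
        pos += (drop < remaining) ? drop : remaining;
    }
    return (size_t)(out - (unsigned char *)dst);
}
```

For example, with `keep` = 3 and `drop` = 1, the ten bytes "abcdefghij" become the eight bytes "abcefgij". When `keep` is small, the per-run memcpy call overhead dominates, which is where a hand-written vector loop would come in.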

Since yours is a controlled environment, you might consider using a NUMA
architecture machine (e.g. multiprocessor AMD). You could let your
hardware pour the data into buffers, which then get filtered by little worker
threads with defined processor affinity. You could carefully allocate the
memory such that a set of buffers is on a specific NUMA node, and that
memory gets processed by the processor on that node. Multiple cores in the
same chip don’t improve chip-to-memory bandwidth, although spreading the work
across cores in different processors on different NUMA nodes might improve
bandwidth.

You might also want to use large memory pages (2M as opposed to 4k), as at
higher data rates processor TLB thrashing starts to hurt performance. It
sounds like you need the processor to touch the memory, and can’t just DMA
it in from your capture hardware and DMA it back out to disk/network.
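
A quick back-of-the-envelope for the large-page point: the number of pages a buffer spans is the number of distinct translations the TLB has to juggle while you walk it. The sketch below just does that division. `pages_spanned` is a hypothetical helper, the buffer is assumed to start on a page boundary, and the 64 MB buffer size used in the note afterwards is an illustrative assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Number of pages (and therefore distinct TLB translations) needed to
 * map `bytes` of page-aligned buffer with pages of `page_size` bytes,
 * rounding up. */
size_t pages_spanned(size_t bytes, size_t page_size)
{
    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;
}
```

A 64 MB buffer spans 16384 pages of 4k but only 32 pages of 2M.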

If you need to do some more serious calculation on the data (say, an
FFT) before sending it to a disk/network, you might also have a look at the
Nvidia CUDA stuff, which uses the graphics processor (GPU). I believe a
modern GPU can achieve something like 500 gigaflops, and multiple GPU cards
can achieve over a teraflop on your desktop.

Another strategy would be to have the raw data flow into a central node, and
use something like InfiniBand RDMA transfers to spread it across a cluster
for processing and storage. For large transfers, you will get nearly wire
speed (20 Gbps) out of InfiniBand. There are some pretty capable 10G Ethernet
cards too, which I believe can do offloaded TCP transfers at wire speed.

It seems that if the issue is that your buffers overflow, copying the buffers
to different buffers is not the answer. You need to either make better use of
the available memory, or free up the memory by sending the data
someplace else. It might be useful to think about the problem in terms of
memory bandwidth. Modern systems have memory bandwidth of at least 5-6
Gbytes/sec, and can get to something like 12-24 Gbytes/sec. A memory copy
consumes bandwidth both to read and to write, so a memory copy rate of about
2.5-3 Gbytes/sec can equal 100% CPU utilization.
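
The arithmetic behind that paragraph can be written down directly. This is a rough sketch that ignores cache effects. The function names are hypothetical, and all rates are in Mbytes/sec.

```c
#include <assert.h>

/* A copy reads every byte once and writes it once, so the best-case copy
 * rate is half the memory bandwidth. Rates are in Mbytes/sec. */
unsigned max_copy_rate(unsigned mem_bandwidth)
{
    return mem_bandwidth / 2;
}

/* Percentage of memory bandwidth consumed by copying at `copy_rate`
 * (read traffic plus write traffic), rounded down. */
unsigned bandwidth_share_percent(unsigned copy_rate, unsigned mem_bandwidth)
{
    assert(mem_bandwidth > 0);
    return (2u * copy_rate * 100u) / mem_bandwidth;
}
```

On a 5 Gbytes/sec memory system this gives a ceiling of 2.5 Gbytes/sec for copies, and a single extra copy of a 500 Mbytes/sec stream already eats about 20% of the memory bandwidth.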

You said your data comes in at 500 Mbytes/sec, but you didn’t say where the
data goes or how big the data sets are. Do you need to store intermittent
bursts of 500 Mbytes on disk, or is it a continuous process that will
stream 500 Mbytes/sec for a week, with all of that data needing to go
someplace?
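
To show why the burst-versus-continuous question matters, here is the volume calculation as a small C sketch. `stream_volume_mb` is a hypothetical helper, and Mbytes are treated as decimal units.

```c
#include <assert.h>
#include <limits.h>

/* Total volume in Mbytes for a stream of `rate_mb_per_s` Mbytes/sec that
 * lasts `seconds`. The assert guards against 64-bit overflow. */
unsigned long long stream_volume_mb(unsigned long long rate_mb_per_s,
                                    unsigned long long seconds)
{
    assert(seconds == 0 || rate_mb_per_s <= ULLONG_MAX / seconds);
    return rate_mb_per_s * seconds;
}
```

A one-second burst at 500 Mbytes/sec is 500 Mbytes, which fits in RAM. A week of continuous streaming is 500 x 604800 = 302,400,000 Mbytes, roughly 300 Tbytes, which is a storage problem rather than a copy problem.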

Jan

On Thu, Jul 24, 2008 at 10:50 PM, Jan Bottorff wrote:
> Since yours is a controlled environment, you might consider using a NUMA
> architecture machine (e.g. multiprocessor AMD). You could let your
> hardware pour the data into buffers which then get filtered by little worker
> threads with defined processor affinity. You could carefully allocate the
> memory such that a set of buffers was on a specific NUMA node, and that
> memory would get processed by the processor on that node. Multi-cores in the

In that case, when the hw transfers to the buffer in question, he
should make sure that the memory copy is executed on the CPU
socket that is hooked up to the memory node that was targeted by the
hw. On my Tyan board, different devices are handled by each CPU socket
(if I pull my second CPU, I no longer have access to my second NIC).
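
The rule above (run the copy on the socket that owns the memory the hardware wrote to) comes down to a lookup from buffer node to worker. A minimal sketch of that lookup follows, assuming the node number of each buffer and of each pinned worker is already known. `pick_worker_for_node` and the `worker_nodes` table are hypothetical, and how the node numbers are obtained is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping: worker_nodes[i] is the NUMA node of the CPU
 * that worker i is pinned to. Returns the index of the first worker on
 * `buffer_node`, or -1 if no worker lives there, so the caller can fall
 * back to any worker. */
int pick_worker_for_node(const unsigned *worker_nodes, size_t worker_count,
                         unsigned buffer_node)
{
    assert(worker_nodes != NULL || worker_count == 0);
    for (size_t i = 0; i < worker_count; i++) {
        if (worker_nodes[i] == buffer_node)
            return (int)i;
    }
    return -1;
}
```

With workers pinned to nodes {0, 0, 1, 1}, a buffer on node 1 goes to worker 2, and a buffer on a node with no worker gets -1.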


Rune

And why are the copies needed? Isn’t DMA enough?


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

“Ashok Bruno” wrote in message news:xxxxx@ntdev…
> well. I have hardware that transfers data from a digitizer which is
> acquiring data at 40GS/s to the PC at around 500 MB/s. Of course there is