SSE2 v/s SSE3 - Performance

All,

I am trying to improve the performance of copying data from an Intel processor
to a memory-mapped area of a PCIe device. I have the following things in mind:

  1. Make the memory mapped region “Cached”.

  2. Use SSE2 instructions

Please suggest any other options I could use.

Does anybody know how large the effective throughput difference is
between SSE2 and SSE3 optimizations on an Atom processor?

Thanks & Regards,

Abhishek Joshi

You wrote:

I am trying to improve the performance of copying data from an Intel processor
to a memory-mapped area of a PCIe device.

What are you getting, and what do you expect?

I have the following things in mind:

  1. Make the memory mapped region “Cached”.

This will only help for copying data FROM the device. For copying TO the device, you can try “write combining”, but you’ll need to do experiments to make sure it works for you.
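
(For what it's worth, a minimal kernel-mode sketch of requesting a write-combined
mapping of the BAR with MmMapIoSpace; the helper name is illustrative, the BAR
address and length are assumed to come from the device's translated memory
resource, and the fallback behavior would need testing on your hardware.)

#include <ntddk.h>

/* Hypothetical helper: map a device BAR as write-combined so that stores
   to it can be merged into larger PCIe write transactions.
   BarPhysical/BarLength are assumed to come from the translated
   CmResourceTypeMemory resource for the device. */
PVOID MapBarWriteCombined(PHYSICAL_ADDRESS BarPhysical, SIZE_T BarLength)
{
    PVOID va = MmMapIoSpace(BarPhysical, BarLength, MmWriteCombined);
    if (va == NULL) {
        /* Fall back to an uncached mapping if the WC mapping fails. */
        va = MmMapIoSpace(BarPhysical, BarLength, MmNonCached);
    }
    return va;
}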

  2. Use SSE2 instructions

This is silly. The basic REP MOVSD runs WAAAAY faster than the PCIExpress bus. CPU instructions are not your bottleneck.
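
(For reference, REP MOVSD is what the MSVC __movsd intrinsic emits; a minimal
sketch of such a baseline copy, assuming the destination is the mapped BAR and
the length is a multiple of 4. The function name is illustrative.)

#include <intrin.h>
#include <stddef.h>

/* Baseline copy to the mapped BAR with REP MOVSD (4 bytes at a time). */
void CopyToBarRepMovsd(void *barVa, const void *src, size_t bytes)
{
    __movsd((unsigned long *)barVa, (const unsigned long *)src, bytes / 4);
}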

If you want maximum throughput, you must use bus mastering. There is no alternative.

Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Thanks Tim. I am interested in copying data TO the device using the PC's CPU.
Regarding SSE2, I have already implemented it and I see almost double the
throughput while writing TO the device with SSE2 (a sketch of the usual shape
of such a copy appears after this message). Do you think it is happening
because of something else?
When you say bus mastering, do you mean device DMA?

Thanks & Regards,
Abhishek Joshi
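
(For reference, the usual way SSE2 helps when writing to uncached or
write-combined MMIO is non-temporal 16-byte stores; a minimal sketch, not
necessarily what was implemented here, assuming 16-byte-aligned buffers and a
length that is a multiple of 16.)

#include <emmintrin.h>
#include <stddef.h>

/* SSE2 copy to a mapped BAR using non-temporal (streaming) stores.
   Assumes dst and src are 16-byte aligned and bytes is a multiple of 16. */
void CopyToBarSse2(void *dst, const void *src, size_t bytes)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < bytes / 16; i++) {
        __m128i v = _mm_load_si128(&s[i]);   /* aligned load from RAM     */
        _mm_stream_si128(&d[i], v);          /* non-temporal store to BAR */
    }
    _mm_sfence();  /* make the streaming stores globally visible */
}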


> Regarding SSE2, I have already implemented it and I see almost double the
> throughput while writing “TO” the device with SSE2.

Have you tried using write combining?


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

Tim Roberts wrote on 10/14/2009 07:45:04 PM:

[snip]

>
> >2. Use SSE2 instructions
>
> This is silly. The basic REP MOVSD runs WAAAAY faster than the
> PCIExpress bus. CPU instructions are not your bottleneck.
>

This depends. If the system can combine the writes into larger PCIe
packets, then this could be true. But if each MOVD results in a separate
PCIe packet, then it is actually quite slow. Looking at the overhead,
sending a single 4-byte payload using 32-bit addressing over the PCIe bus
requires a 16-byte packet across the bus. That is only a 25% utilization
factor. To get better performance, you have to increase the payload size.
If you assume that the maximum payload size supported is the minimum
required by the PCIe spec, then that is 128 bytes; with 12 bytes of
overhead, you have 91.4% utilization of the PCIe bandwidth.
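
(The utilization figures above follow from efficiency = payload / (payload +
overhead); a quick check in C, using the 12 bytes of per-TLP overhead stated
above.)

#include <stdio.h>

/* PCIe write efficiency = payload / (payload + per-TLP overhead).
   The 12 bytes of overhead (header/framing with 32-bit addressing)
   is the figure quoted above. */
int main(void)
{
    const double overhead = 12.0;
    printf("4-byte payload:   %.1f%%\n", 100.0 * 4.0 / (4.0 + overhead));     /* 25.0%  */
    printf("128-byte payload: %.1f%%\n", 100.0 * 128.0 / (128.0 + overhead)); /* ~91.4% */
    return 0;
}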

> If you want maximum throughput, you must use bus mastering. There
> is no alternative.

The only way to guarantee that you can utilize larger packets and get
maximum throughput is to have the PCIe device implement its own DMA.

As an example, I am researching PCIe bus options for a project I am
working on. I have an Intel Xeon 5420 processor on a Tyan MB with a PLX
8619 based RDK (this has 4 DMA engines built in) driving data to a Xilinx
Virtex 5 FPGA based endpoint using its PCIe hard IP core (4 lanes, gen 1),
and I am getting actual throughput rates of 800 MBytes/sec (measured using
a PCIe protocol analyzer) since it is able to drive 128-byte payloads. This
is utilizing 90+% of the PCIe bandwidth to the FPGA. The 8619 is capable
of 8 lanes gen 2, which is theoretically 4 times faster, but I don't yet
have a faster endpoint to see if I can squeeze more out of the
architecture.

By contrast, a program that delivers the same data using CPU instructions
produces only 4-byte payload packets and achieves only 60 MBytes/sec.

Russ Poffenberger
xxxxx@ltx-credence.com

Abhishek Joshi wrote:

Thanks Tim. I am interested in copying data TO the device using the PC's CPU.
Regarding SSE2, I have already implemented it and I see almost double the
throughput while writing “TO” the device with SSE2.

What were you using before? If you were using RtlMoveMemory or
READ_REGISTER_BUFFER_ULONG, then the bottleneck will be the bus, and I
find it impossible to believe you could improve on it just with CPU
instructions, unless the transfers were all so short that the overhead
swamped the transfer.
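
(Tim mentions READ_REGISTER_BUFFER_ULONG; for writes TO the device the
analogous baseline would be WRITE_REGISTER_BUFFER_ULONG. A minimal sketch of
that kind of baseline copy; the function name is illustrative.)

#include <ntddk.h>

/* Baseline copy of Bytes (a multiple of 4) from Src to the mapped BAR
   using the stock WDK helper, one ULONG at a time. */
VOID CopyToBarBaseline(volatile ULONG *BarVa, PULONG Src, ULONG Bytes)
{
    WRITE_REGISTER_BUFFER_ULONG(BarVa, Src, Bytes / sizeof(ULONG));
}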

Do you think it is happening because of something else?
When you say bus mastering, do you mean device DMA?

Yes. It’s possible to get 130 MB/s on a 1-lane PCIExpress card using
bus mastering.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Tim Roberts wrote:

Yes. It’s possible to get 130 MB/s on a 1-lane PCIExpress card
using bus mastering.

Glug glug?

> Regarding SSE2, I have already implemented it and I see almost double the
> throughput while writing “TO” the device with SSE2. Do you think it is
> happening because of something else?

You need to give more details about your tests. How did you measure the
throughput? How long were the PCIe packets in both cases?
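
(For reference, one common way to time such a copy from user mode is
QueryPerformanceCounter around the copy loop; a minimal sketch, not
necessarily how the numbers above were measured, with the copy routine
passed in as a hypothetical function pointer.)

#include <windows.h>

/* Time a copy routine and report MBytes/sec.  'copy' is whatever routine
   is being measured (REP MOVSD, SSE2, etc.). */
double MeasureCopyMBps(void (*copy)(void *dst, const void *src, size_t n),
                       void *dst, const void *src, size_t bytes, int iterations)
{
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    for (int i = 0; i < iterations; i++) {
        copy(dst, src, bytes);
    }
    QueryPerformanceCounter(&end);
    double seconds = (double)(end.QuadPart - start.QuadPart) / (double)freq.QuadPart;
    return ((double)bytes * iterations) / (1024.0 * 1024.0) / seconds;
}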

Igor Sharovar