WRITE_REGISTER_BUFFER_UCHAR vs WRITE_REGISTER_BUFFER_ULONG

Chris_Troester · December 2, 2024, 2:20pm

I have a PCI Express card (PCIe Gen1 x1, 2.5 GT/s) and mapped its BARs with MmMapIoSpaceEx.

lspci shows in DevCap (Device Capabilities): MaxPayload 128 bytes.
"Maximum Payload Size supported by the Function. Can be configured as 000 (128 bytes) or 001 (256 bytes)"

I want to write and read a chunk of memory to/from the BAR via programmed I/O. The typical size is 256 bytes. Which of the following is more efficient, i.e. faster?

A

WRITE_REGISTER_BUFFER_UCHAR(dest, src, 256);

B

WRITE_REGISTER_BUFFER_ULONG(dest, src, 64);

C

WRITE_REGISTER_BUFFER_ULONG64(dest, src, 32);

Are different TLPs created?
AFAIK, the PCIe core buffers write requests and combines several of them into one. But at some point the buffer is full.

Pavel_A1 · December 2, 2024, 11:36pm

Just try it and measure the transfer rate? Start with ULONG64. PCI BARs are usually mapped non-cacheable, and an aligned write should translate to an "atomic" transfer of the same size. So yes, you can likely expect TLPs of different sizes. Whether write-combining kicks in depends, IIRC, on the BAR properties (prefetchable bit).
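To illustrate the point about same-size transfers, here is a rough user-mode model of options A, B and C from the question. The model_* functions are hypothetical stand-ins, not the WDK routines. They assume an uncached mapping in which every aligned store becomes one write request of that width, and they simply count the stores. Note that the count argument is in elements, not bytes, as in the original calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for WRITE_REGISTER_BUFFER_UCHAR:
   one volatile 1-byte store per element, to consecutive addresses. */
size_t model_write_buffer_u8(volatile uint8_t *reg, const uint8_t *buf, size_t count)
{
    size_t stores = 0;
    for (size_t i = 0; i < count; i++) {
        reg[i] = buf[i];
        stores++;
    }
    return stores;
}

/* Hypothetical stand-in for WRITE_REGISTER_BUFFER_ULONG:
   one volatile 4-byte store per element; the target must be 4-byte aligned. */
size_t model_write_buffer_u32(volatile uint32_t *reg, const uint32_t *buf, size_t count)
{
    size_t stores = 0;
    assert(((uintptr_t)reg % sizeof(uint32_t)) == 0);
    for (size_t i = 0; i < count; i++) {
        reg[i] = buf[i];
        stores++;
    }
    return stores;
}

/* Hypothetical stand-in for WRITE_REGISTER_BUFFER_ULONG64:
   one volatile 8-byte store per element; the target must be 8-byte aligned. */
size_t model_write_buffer_u64(volatile uint64_t *reg, const uint64_t *buf, size_t count)
{
    size_t stores = 0;
    assert(((uintptr_t)reg % sizeof(uint64_t)) == 0);
    for (size_t i = 0; i < count; i++) {
        reg[i] = buf[i];
        stores++;
    }
    return stores;
}
```

Under these assumptions a 256-byte block takes 256, 64 and 32 write requests for A, B and C respectively. Whether that is what actually appears on the link still has to be measured on the real hardware.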

Tim_Roberts · December 3, 2024, 2:41am

YOU do not need to worry about this. Assuming it does write-combining, the root complex will divide your large transfer into smaller blocks, exactly like the Internet does with TCP.

Note that write-combining isn't typically as helpful as you'd think. Virtually the only way to get TLPs larger than 8 bytes is with bus-master DMA in the device. Think about it from the perspective of the root complex. It has no way of knowing that you're doing a long buffer write. All it sees is individual write requests from the CPU.
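A back-of-the-envelope sketch of why small TLPs hurt. The function names are made up for illustration, and the 20-byte per-TLP overhead is an assumption for a Gen1 link with a 3DW header (framing, sequence number, header, LCRC). A 64-bit address adds 4 bytes. ACKs and flow-control traffic are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed fixed cost per memory-write TLP on a Gen1 link, in bytes:
   1 start framing + 2 sequence number + 12 header (3DW) + 4 LCRC + 1 end framing. */
enum { TLP_OVERHEAD_BYTES = 20 };

/* Number of TLPs needed when each write request carries `payload` bytes.
   Assumes the block is a whole multiple of the payload size. */
size_t tlp_count(size_t total_bytes, size_t payload)
{
    assert(payload > 0 && total_bytes % payload == 0);
    return total_bytes / payload;
}

/* Approximate bytes on the wire for the whole block.
   TLP payload is carried in 4-byte units, so a 1-byte write still costs one DW. */
size_t wire_bytes(size_t total_bytes, size_t payload)
{
    size_t payload_on_wire = (payload + 3) / 4 * 4;
    return tlp_count(total_bytes, payload) * (payload_on_wire + TLP_OVERHEAD_BYTES);
}
```

With these numbers, 256 bytes sent as 8-byte writes cost wire_bytes(256, 8) = 896 bytes in 32 TLPs, while two 128-byte TLPs, as bus-master DMA could produce, cost wire_bytes(256, 128) = 296 bytes. Byte-wide writes come out at 6144 bytes.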