I have a PCI Express card (PCIe Gen1 x1, 2.5 GT/s) and mapped its BARs with MmMapIoSpaceEx.
lspci shows in DevCap (Device Capabilities): MaxPayload 128 bytes.
"Maximum Payload Size supported by the Function. Can be configured as 000 (128 bytes) or 001 (256 bytes)"
I want to write and read a chunk of memory to/from the BAR via programmed I/O. The typical size is 256 bytes. Which is more efficient, and which is faster?
A
WRITE_REGISTER_BUFFER_UCHAR(dest, src, 256);
B
WRITE_REGISTER_BUFFER_ULONG(dest, src, 64);
C
WRITE_REGISTER_BUFFER_ULONG64(dest, src, 32);
Are different TLPs created?
AFAIK, the PCIe core buffers write requests and combines several of them into one. But at some point, the buffer is full.
Why not just try it and measure the transfer rate? Start with ULONG64. PCI BARs are usually mapped non-cacheable, and an aligned write should translate into a single "atomic" transfer of the same size, so yes, you can likely expect TLPs of different sizes. Whether write-combining kicks in depends, IIRC, on the BAR properties (prefetchable bit).
YOU do not need to worry about this. Assuming write-combining is in effect, the root complex will divide your large transfer into smaller blocks, much like TCP splits a stream into segments on the Internet.
Note that write-combining isn't typically as helpful as you'd think. Virtually the only way to get TLPs larger than 8 bytes is with bus-master DMA in the device. Think about it from the perspective of the root complex. It has no way of knowing that you're doing a long buffer write. All it sees is individual write requests from the CPU.
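To put rough numbers on "individual write requests", here is a minimal back-of-the-envelope model in C. It assumes every CPU store becomes its own memory-write TLP (no write-combining) and that each TLP carries about 20 bytes of overhead on a Gen1 link (framing, sequence number, 3DW header, LCRC). The names wire_bytes, efficiency_percent and TLP_OVERHEAD_BYTES are made up for this illustration; ACK/flow-control DLLPs and 8b/10b coding are ignored, so treat the results as an optimistic estimate, not a measurement.

```c
#include <stddef.h>

/* Assumed per-TLP overhead on a Gen1 link: 1 B STP framing + 2 B sequence
 * number + 12 B header (3DW, 32-bit address) + 4 B LCRC + 1 B END = 20 B.
 * ECRC, DLLPs (ACKs, flow-control updates) and 8b/10b coding are ignored. */
enum { TLP_OVERHEAD_BYTES = 20 };

/* Bytes on the wire when `total` bytes are written as separate accesses of
 * `access` bytes, each becoming one memory-write TLP (hypothetical model). */
static size_t wire_bytes(size_t total, size_t access)
{
    size_t tlps = (total + access - 1) / access;  /* one TLP per access */
    size_t payload = ((access + 3) / 4) * 4;      /* payload is whole DWORDs */
    return tlps * (TLP_OVERHEAD_BYTES + payload);
}

/* Payload efficiency in percent (integer, rounded down). */
static unsigned efficiency_percent(size_t total, size_t access)
{
    return (unsigned)(100 * total / wire_bytes(total, access));
}
```

Under these assumptions, wire_bytes(256, 1) is 6144, wire_bytes(256, 4) is 1536 and wire_bytes(256, 8) is 896, compared with 296 for two full 128-byte TLPs (wire_bytes(256, 128)). So of the three variants, ULONG64 should waste the least link bandwidth, but it still stays far below what MaxPayload-sized TLPs would achieve.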