Limiting number of SG elements

Hello,

I am trying to limit the number of SG elements sent by the NDIS stack. Below is the scenario:

I am using iperf3 to send TSO traffic. The miniport driver supports LSO V2 with a maximum payload of 256K bytes. Once the miniport driver receives a NET_BUFFER from the NDIS stack, it calls NdisMAllocateNetBufferSGList() so that NDIS prepares and provides the SG list to the driver. The number of SG elements reflects the non-contiguous pages/memory in the MDL chain. For a 128K-byte packet, the driver receives about 33 SG elements (1 for the header and the rest for the payload, each at most 4K bytes). Each SG element is mapped to a transmit ring descriptor, so a larger payload means more SG elements and hence more descriptors consumed. I want to limit the number of SG elements to, say, 20, with each element carrying more than 4K of payload.
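
For context, the relevant part of my send path looks roughly like the sketch below (simplified; TX_CONTEXT, TX_DESCRIPTOR, and the ring helpers are stand-ins for my hardware-specific code, and error handling is omitted):

#include <ndis.h>

// Simplified sketch of the send path described above. TX_CONTEXT,
// TX_DESCRIPTOR, TxRingGetNextDescriptor, TxRingDoorbell, and the ADAPTER
// fields are placeholders for driver-specific code; error handling and
// header/offload handling are omitted.

MINIPORT_PROCESS_SG_LIST MiniportProcessSGList;

VOID
MiniportProcessSGList(
    PDEVICE_OBJECT DeviceObject,
    PVOID Reserved,
    PSCATTER_GATHER_LIST SgList,
    PVOID Context)                // per-NET_BUFFER context passed below
{
    PTX_CONTEXT Tx = (PTX_CONTEXT)Context;

    UNREFERENCED_PARAMETER(DeviceObject);
    UNREFERENCED_PARAMETER(Reserved);

    // One SG element consumes one transmit ring descriptor, which is why
    // the element count drives descriptor usage.
    for (ULONG i = 0; i < SgList->NumberOfElements; i++) {
        PTX_DESCRIPTOR Desc = TxRingGetNextDescriptor(Tx->Ring);
        Desc->BufferPhysicalAddress = SgList->Elements[i].Address.QuadPart;
        Desc->BufferLength          = SgList->Elements[i].Length;
    }

    TxRingDoorbell(Tx->Ring);
}

// For each NET_BUFFER in MiniportSendNetBufferLists:
NDIS_STATUS
QueueNetBufferForDma(PADAPTER Adapter, PNET_BUFFER NetBuffer, PTX_CONTEXT Tx)
{
    // NDIS maps the NET_BUFFER's MDL chain and calls MiniportProcessSGList
    // (possibly synchronously) once the SG list is ready.
    return NdisMAllocateNetBufferSGList(
               Adapter->DmaHandle,              // from NdisMRegisterScatterGatherDma
               NetBuffer,
               Tx,                              // Context handed to the callback
               NDIS_SG_LIST_WRITE_TO_DEVICE,
               Tx->SgListStorage,               // preallocated buffer of
               Adapter->ScatterGatherListSize); // ...ScatterGatherListSize bytes
}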

  1. Is there a way to tell the NDIS stack the maximum number of SG elements to provide, with each element larger than 4K?

  2. I researched a bit and found no way to do #1. So another question: is there a way to allocate a large (physically contiguous) buffer and tell NDIS to store the payload there and build the SG list from it? Since the pages would be contiguous, the number of SG elements would drop. Any pointers here would be great.

  3. Another option I considered: once the SG elements are provided, the driver copies them into large private buffers and passes the buffer addresses to the descriptors. My concern is that the copy operation will hurt performance, and I would like to keep that hit to a minimum.

  4. Another possibility I was considering: is there a way to split the NET_BUFFER into two NBs, each with half of the original payload, and pass each of them separately to create an SG list? How complicated would this be?

Is there any other way to limit the number of SG elements?

Thanks

I want to limit the number of SG elements to, say, 20, with each element carrying more than 4K of payload.

That’s not something you get to choose. NDIS is mapping the user’s buffer, and the user’s buffer is mapped wherever it happened to get mapped.

Why do you need to limit the number of SG elements? Is your hardware defective? It is certainly possible to allocate a common buffer at startup time and copy the data there, but that defeats the purpose of mapping the user buffer directly.
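
To make that concrete, the common-buffer approach looks roughly like the sketch below (a minimal illustration only; the ADAPTER fields and the single bounce region are hypothetical, and all error and ring handling is omitted):

#include <ndis.h>

// Sketch of the "preallocated common buffer + copy" approach. ADAPTER and
// its BounceVa/BouncePa fields are hypothetical, there is only one bounce
// region here instead of one per in-flight packet, and error/ring handling
// is omitted.

#define TX_BOUNCE_SIZE (256 * 1024)        // large enough for one LSO send

// MiniportInitializeEx (after the SG DMA registration): allocate DMA-able,
// physically contiguous memory once.
NDIS_STATUS AllocateBounceBuffer(PADAPTER Adapter)
{
    NdisMAllocateSharedMemory(
        Adapter->MiniportAdapterHandle,
        TX_BOUNCE_SIZE,
        TRUE,                              // cached
        &Adapter->BounceVa,                // virtual address
        &Adapter->BouncePa);               // physical address (contiguous)

    return (Adapter->BounceVa != NULL) ? NDIS_STATUS_SUCCESS
                                       : NDIS_STATUS_RESOURCES;
}

// Send path: flatten the NET_BUFFER's MDL chain into the bounce buffer so
// the hardware sees one contiguous region starting at Adapter->BouncePa.
ULONG CopyNetBufferToBounce(PADAPTER Adapter, PNET_BUFFER NetBuffer)
{
    ULONG  remaining = NET_BUFFER_DATA_LENGTH(NetBuffer);
    ULONG  offset    = NET_BUFFER_CURRENT_MDL_OFFSET(NetBuffer);
    PMDL   Mdl       = NET_BUFFER_CURRENT_MDL(NetBuffer);
    PUCHAR Dst       = (PUCHAR)Adapter->BounceVa;
    ULONG  copied    = 0;

    while (Mdl != NULL && remaining != 0) {
        PUCHAR Src = (PUCHAR)MmGetSystemAddressForMdlSafe(Mdl, NormalPagePriority);
        ULONG  len = MmGetMdlByteCount(Mdl) - offset;

        if (Src == NULL) {
            return 0;                      // low-resource failure
        }
        if (len > remaining) {
            len = remaining;
        }

        NdisMoveMemory(Dst + copied, Src + offset, len);

        copied    += len;
        remaining -= len;
        offset     = 0;                    // only the first MDL carries an offset
        Mdl        = Mdl->Next;
    }

    return copied;                         // contiguous bytes staged for the NIC
}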

Are you sure you’re seeing only 4K max per SG element? Because, at least in the core DMA stack, physically adjacent elements should be combined (unless you specify a max SG element size that would prevent this).

Peter

Are you sure you’re seeing only 4K max per SG element? Because, at least in the core DMA stack, physically adjacent elements should be combined (unless you specify a max SG element size that would prevent this).

Yes, I see a max of 4K bytes. I, too, wondered why there isn't at least one physically adjacent page being combined, but in my logs I always see a max of 4K per SG element.
2020-10-26T22:45:38.032Z| sg[0]: addr 13fff0ee0 len 288
2020-10-26T22:45:38.032Z| sg[1]: addr 13ffb9000 len 4096
2020-10-26T22:45:38.033Z| sg[2]: addr 13ffb6000 len 4096
2020-10-26T22:45:38.033Z| sg[3]: addr 13ff6f000 len 4096
2020-10-26T22:45:38.033Z| sg[4]: addr 13ff6c000 len 4096
2020-10-26T22:45:38.034Z| sg[5]: addr 13ff69000 len 4096
2020-10-26T22:45:38.034Z| sg[6]: addr 13ff66000 len 4096
2020-10-26T22:45:38.034Z| sg[7]: addr 13ff59000 len 4096
2020-10-26T22:45:38.035Z| sg[8]: addr 13ff52000 len 4096
2020-10-26T22:45:38.035Z| sg[9]: addr 13ff1b000 len 4096

(unless you specify a max SG element size that would prevent this).
I'm not sure about this. Where/when/how can we specify a max SG element size?

Thanks

2020-10-26T22:45:38.032Z| sg[0]: addr 13fff0ee0 len 288
2020-10-26T22:45:38.032Z| sg[1]: addr 13ffb9000 len 4096
2020-10-26T22:45:38.033Z| sg[2]: addr 13ffb6000 len 4096
2020-10-26T22:45:38.033Z| sg[3]: addr 13ff6f000 len 4096
2020-10-26T22:45:38.033Z| sg[4]: addr 13ff6c000 len 4096
2020-10-26T22:45:38.034Z| sg[5]: addr 13ff69000 len 4096
2020-10-26T22:45:38.034Z| sg[6]: addr 13ff66000 len 4096
2020-10-26T22:45:38.034Z| sg[7]: addr 13ff59000 len 4096
2020-10-26T22:45:38.035Z| sg[8]: addr 13ff52000 len 4096
2020-10-26T22:45:38.035Z| sg[9]: addr 13ff1b000 len 4096

Hmmmm… do you see any physically adjacent pages in that list?

Peter

But SG elements don't have a minimum size, do they?
If there's no minimum, a mere 4K payload could give you 4K SG elements. In theory, of course. Or if you run under ndistest :wink:
– pa

Hmmmm… do you see any physically adjacent pages in that list?
No, but I haven't seen any SG element longer than 4K; isn't there ever even one pair of physically contiguous pages? I have run hundreds of traffic tests and the result is always the same.

2020-10-26T22:45:38.093Z| sg[13]: addr 13ff0d000 len 4096
2020-10-26T22:45:38.094Z| sg[14]: addr 13ff0a000 len 4096
2020-10-26T22:45:38.094Z| sg[15]: addr 13ff07000 len 4096
2020-10-26T22:45:38.094Z| sg[16]: addr 13ff04000 len 4096
2020-10-26T22:45:38.095Z| sg[17]: addr 13ff01000 len 4096
2020-10-26T22:45:38.095Z| sg[18]: addr 13fefe000 len 4096
2020-10-26T22:45:38.095Z| sg[19]: addr 13fefb000 len 4096
2020-10-26T22:45:38.096Z| sg[20]: addr 13fef8000 len 4096
2020-10-26T22:45:38.096Z| sg[21]: addr 13fef5000 len 4096
2020-10-26T22:45:38.096Z| sg[22]: addr 13fef2000 len 4096

Also, most of the time I see that consecutive SG elements have addresses 12K apart, i.e., 3 pages apart given a 4K page size. Is there some relationship?

I faced a very similar dilemma when writing an NDIS driver from scratch for Windows Embedded Compact (NDIS is about the only area where the two share the same driver model). I was obsessing about not doing any copying at all and trying to force the use of SG lists, MDLs, and bus-mastering DMA directly from user buffers.

Another driver developer from this group pointed out:

  • A single frame will never have more than 1500+/- bytes unless jumbo frames are used.  I’m assuming you are doing that.
  • The CPU overhead to probe and lock memory pages and to clean the caches for those pages is FAR greater than the overhead to copy those packets into a preallocated, locked-down, shared memory buffer.

At first I didn't believe it, until I did the work and implemented it that way.  My driver achieves 90%+ of the theoretical max of the hardware at gigabit speed with a LOT less complexity.  (Unnecessary complexity is a killer in a driver.)

This was on an i.MX6 (Arm Cortex-A9) running at about 900 MHz, so it was in no way a CPU powerhouse.

Now, to your situation: again, as was explained to me, you will almost never see adjacent physical pages being passed in a NET_BUFFER.  NDIS makes absolutely no attempt to arrange that; the overhead of trying would be greater than just processing physically noncontiguous pages.  What you see is, for all intents and purposes, random pages.  Anything to the contrary is pure coincidence.

Have you determined that this is actually causing a problem?  Premature optimization quite often leads to an overly complex implementation that, in the end, is less performant.

Just my experience and $0.02 worth.

Greg
 

  • The CPU overhead to probe and lock memory pages and to clean the caches for those pages is FAR greater than the overhead to copy those packets into a preallocated, locked-down, shared memory buffer.

This is an extremely important point. The real cost when you use Direct I/O is in the “TLB shootdown” as Mr. Dyess notes. This is a pretty sophisticated argument, but it basically asks you to actively understand tradeoffs that involve overall system impact vs your specific DMA performance.

Awesome contribution, Mr. Dyess.

Peter

Thank you, Peter and Greg, for the explanation; it has been helpful. The reason for doing this is that the hardware wants to limit the number of SG elements for efficiency. For a 256K packet, NDIS will send a minimum of 64 SG elements (65 counting the header), given that the pages will most likely be non-contiguous. Thanks again; I will try to evaluate the performance hit of limiting it in software and compare that with the hit the hardware would take.

One more question I had, regarding point 4 in my original post:

Another possibility I was considering: is there a way to split the NET_BUFFER into two NBs, each with half of the original payload, and pass each of them separately to create an SG list? How complicated would this be?

Is there a way or an API to split a NET_BUFFER into two, so that they are treated as two different packets?

What kind of device do you have that uses a 256K packet size?
Are you sure that's a 256K PACKET (i.e., an Ethernet frame) and not just a 256K transfer?

As to your question, think of it this way: NET_BUFFER = single Ethernet frame.  You can't split an Ethernet FRAME at the driver level; NDIS has to do that.

What you CAN probably do is report to NDIS that the MAX frame size is something smaller, like 64K.  That would force NDIS to create a series of fragmented packets which you can send semi-independently (but in order).  You would have a NET_BUFFER_LIST consisting of a linked list of NET_BUFFER objects, one for each frame.
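
If the MAX size in question is the LSO V2 maximum offload size, that gets advertised during MiniportInitializeEx, roughly as in the sketch below (only the LSO fields are shown; the NDIS_OFFLOAD header, the checksum fields, and the attribute registration via NdisMSetMiniportAttributes are assumed to be handled elsewhere in the driver):

#include <ndis.h>

// Sketch only: advertise a smaller LSO V2 maximum offload size so the host
// never hands the miniport more than 64K of TCP payload per NET_BUFFER.
// The rest of the NDIS_OFFLOAD structure and the
// NDIS_MINIPORT_ADAPTER_OFFLOAD_ATTRIBUTES registration are assumed to be
// filled in elsewhere in MiniportInitializeEx.

#define MY_LSO_MAX_OFFLOAD_SIZE  (64 * 1024)
#define MY_LSO_MIN_SEGMENT_COUNT 2

static VOID FillLsoV2Capabilities(PNDIS_OFFLOAD Offload)
{
    Offload->LsoV2.IPv4.Encapsulation   = NDIS_ENCAPSULATION_IEEE_802_3;
    Offload->LsoV2.IPv4.MaxOffLoadSize  = MY_LSO_MAX_OFFLOAD_SIZE;
    Offload->LsoV2.IPv4.MinSegmentCount = MY_LSO_MIN_SEGMENT_COUNT;

    Offload->LsoV2.IPv6.Encapsulation   = NDIS_ENCAPSULATION_IEEE_802_3;
    Offload->LsoV2.IPv6.MaxOffLoadSize  = MY_LSO_MAX_OFFLOAD_SIZE;
    Offload->LsoV2.IPv6.MinSegmentCount = MY_LSO_MIN_SEGMENT_COUNT;
    Offload->LsoV2.IPv6.IpExtensionHeadersSupported = NDIS_OFFLOAD_NOT_SUPPORTED;
    Offload->LsoV2.IPv6.TcpOptionsSupported         = NDIS_OFFLOAD_SUPPORTED;
}

Note that this caps how much payload arrives per NET_BUFFER in the first place, rather than splitting a NET_BUFFER the miniport has already been handed.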

I’m still curious what kind of network device can transmit 1/4 MB FRAMES.

Greg

The reason for doing this is that the hardware wants to limit the number of SG elements for efficiency.

That’s not an argument for efficiency. That’s an argument for laziness.

Is there a way or an API to split a NET_BUFFER into two, so that they are treated as two different packets?

You can certainly do that yourself. Just send half of the SG entries at first, and in your completion interrupt, go submit the other half.
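
In sketch form, with placeholder names for the driver's own bookkeeping (TX_CONTEXT, NextElement, the ring helpers) and no error or ring-full handling:

// Sketch: submit the SG list to the ring in chunks, continuing from the
// send-completion DPC. All structure and helper names are placeholders for
// the driver's own code.

#define MAX_ELEMENTS_PER_POST 20

static VOID PostNextSgChunk(PTX_CONTEXT Tx)
{
    PSCATTER_GATHER_LIST Sgl = Tx->SgList;   // saved in MiniportProcessSGList
    ULONG end = Tx->NextElement + MAX_ELEMENTS_PER_POST;

    if (end > Sgl->NumberOfElements) {
        end = Sgl->NumberOfElements;
    }

    for (ULONG i = Tx->NextElement; i < end; i++) {
        PTX_DESCRIPTOR Desc = TxRingGetNextDescriptor(Tx->Ring);
        Desc->BufferPhysicalAddress = Sgl->Elements[i].Address.QuadPart;
        Desc->BufferLength          = Sgl->Elements[i].Length;
    }

    Tx->NextElement = end;
    TxRingDoorbell(Tx->Ring);                // tell the NIC new descriptors are ready
}

// Called from the send-completion DPC:
static VOID OnSendComplete(PTX_CONTEXT Tx)
{
    if (Tx->NextElement < Tx->SgList->NumberOfElements) {
        PostNextSgChunk(Tx);                 // submit the remaining elements
    } else {
        // All elements are done: free the SG list, then complete the
        // NET_BUFFER_LIST back to NDIS as usual.
        NdisMFreeNetBufferSGList(Tx->Adapter->DmaHandle, Tx->SgList, Tx->NetBuffer);
    }
}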