Various NDIS questions...

Hi all,

My NDIS miniport is mostly complete; however, before I get into WHQL
testing, I’m trying to finish a few optimizations.

Recall that I am writing a virtual NIC and bus that uses a shared memory
ring buffer on top of a Linux kvm host.

At the moment I’m very concerned about my rx performance, which is a
surprise since essentially all I do is pull frames off the ring and call
*Indicate*(). Currently, tx performance is 3x better than rx
performance.

I’m using netperf for performance testing at the moment. netperf merely
sends packets to a netperf server instance for a specified period of
time and records results.

My questions…

  1. Currently I set the TCP/UDP checksum flags (as appropriate) on rx
    since the host’s NIC driver performs the checksums. I assumed that if I
    do not, the protocol stack above me would perform the checksums,
    correct?

  2. Based on a previous email, I chain together the incoming frames as a
    list of NET_BUFFER_LISTs and forward that along to
    NdisMIndicateReceiveNetBufferLists(). Supposedly this would be faster
    than calling *indicate* for each NBL separately. However, one test I
    ran changed the processing to forward individual NBLs to indicate and I
    got slightly better rx results. It’s not clear to me that these results
    are statistically significant; however, I want to verify that chaining is
    ‘better’ from a Microsoft point of view. Is it?

  3. I currently set OID_802_3_MAXIMUM_LIST_SIZE to 0 (zero) for
    simplicity’s sake. This specific test is not using multicast addressing;
    however, would it be better to add multicast addressing code to the
    driver? Since the upper layers need to deal with overflow, it’s unclear
    to me how it would be better to include multicast address handling, but
    I am unsure.

  4. Should I consider chaining NET_BUFFERs instead of chaining
    NET_BUFFER_LISTS?

  5. I have the possibility of a currently executing DPC re-queuing itself
    and I strongly suspect this happens. IIUC, you cannot *queue* the same
    DPC more than once, however once the DPC is in execution, you can queue
    another instance. Is this correct? (Yes, I am aware of smp issues wrt
    same)

  6. During a PAUSE, I may (I have to think about this some) have
    the situation of an ‘inflight’ tx multi-NET_BUFFER NET_BUFFER_LIST being
    uncompleted. Is this an error? (I suspect so) If so, how would this
    manifest itself?
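
To make question 5 concrete, here is a toy user-mode model of the queuing rule as I understand it: a DPC object that is already queued cannot be queued again, but it is taken off the queue before its routine runs, so the routine can queue it again. None of this is kernel code. The `toy_*` names are made up for illustration; only the "returns false when already queued" convention is meant to mirror KeInsertQueueDpc, and the model is single-threaded, so it says nothing about the SMP cases.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a DPC object; not the real KDPC layout. */
typedef struct toy_dpc {
    bool queued;          /* true while the object sits in the queue */
    int  runs;            /* how many times the routine has executed */
    int  requeue_budget;  /* how many times the routine re-queues itself */
} toy_dpc;

/* Mirrors the return convention of KeInsertQueueDpc:
 * false if the object is already queued, true if it was inserted. */
static bool toy_queue_dpc(toy_dpc *d)
{
    if (d->queued)
        return false;
    d->queued = true;
    return true;
}

/* The deferred routine. It may re-queue its own object because the
 * dispatcher has already removed the object from the queue. */
static void toy_dpc_routine(toy_dpc *d)
{
    d->runs++;
    if (d->requeue_budget > 0) {
        d->requeue_budget--;
        bool ok = toy_queue_dpc(d);
        assert(ok);   /* succeeds: we are running, not queued */
        (void)ok;
    }
}

/* Toy dispatcher: dequeue first, then run the routine. */
static void toy_dispatch(toy_dpc *d)
{
    while (d->queued) {
        d->queued = false;
        toy_dpc_routine(d);
    }
}
```

In this model a second queue attempt before dispatch is refused, while a re-queue from inside the routine is accepted and simply causes another pass.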

Thanks,
-PWM

Ok, a dense set of questions.

RE (4): No. I am pretty sure that in the Rx Path, you may only indicate a single ‘frame’ (one NB per NBL).

Regarding the asymmetry in performance - how does your receive processing become aware of frames in the ring buffer to be delivered? Is it polling for them with a DPC and a timer, or does the driver get an ‘interrupt’ to indicate the presence of rx frames? When you get such an indication (or discover frames to be delivered), do you ‘drain’ the ring buffer entirely and indicate all frames?

Do you check for RX frames during (at the end of or before) TX frame processing?
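
To illustrate what I mean by draining, here is a rough user-mode sketch, not driver code. It pulls every pending frame off a ring, wraps each one in its own list entry (one frame per entry, matching the one NB per NBL point above), and links the entries into a single chain. The `sketch_*` types, `RING_SLOTS`, and the function names are stand-ins I made up; a real miniport would use NET_BUFFER_LIST structures from its own pool and hand the chain head plus the count to NdisMIndicateReceiveNetBufferLists() in one call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SLOTS 8u   /* hypothetical ring size; must be a power of two */

/* Stand-in for a NET_BUFFER_LIST: one frame per entry, linked via next. */
typedef struct sketch_nbl {
    struct sketch_nbl *next;
    uint32_t frame_id;
} sketch_nbl;

/* Stand-in for the shared-memory ring; indices only ever increase. */
typedef struct {
    uint32_t frames[RING_SLOTS];
    uint32_t head;   /* consumer index (the miniport) */
    uint32_t tail;   /* producer index (the host) */
} sketch_ring;

/* Producer side, here only so the sketch can be exercised.
 * Returns 0 when the ring is full, 1 on success. */
static int ring_push(sketch_ring *r, uint32_t frame_id)
{
    if (r->tail - r->head == RING_SLOTS)
        return 0;
    r->frames[r->tail % RING_SLOTS] = frame_id;
    r->tail++;
    return 1;
}

/* Drain every pending frame (bounded by the pool size) into one chain.
 * Returns the chain head, or NULL if nothing was pending; *count gets
 * the number of entries, which is what a single indication would report. */
static sketch_nbl *ring_drain(sketch_ring *r, sketch_nbl *pool,
                              size_t pool_len, size_t *count)
{
    sketch_nbl *first = NULL;
    sketch_nbl *last = NULL;
    size_t n = 0;

    assert(count != NULL);
    while (r->head != r->tail && n < pool_len) {
        sketch_nbl *nbl = &pool[n];
        nbl->frame_id = r->frames[r->head % RING_SLOTS];
        nbl->next = NULL;
        r->head++;
        if (last != NULL)
            last->next = nbl;   /* append to keep arrival order */
        else
            first = nbl;
        last = nbl;
        n++;
    }
    *count = n;
    return first;
}
```

The point is that the loop runs until the ring is empty (or the pool is exhausted) and the indication happens once per drain rather than once per frame. If your path stops after one frame per wakeup, that alone could account for a large rx/tx gap.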

Regards,
Dave Cattley

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of Peter W. Morreale
Sent: Wednesday, March 10, 2010 10:24 AM
To: Windows System Software Devs Interest List
Subject: [ntdev] Various NDIS questions…


