High-Performance ACK Generation in an NDIS Filter Driver for 10 GbE FPGA Traffic

I'm developing an NDIS Filter Driver that handles high-throughput UDP traffic from an FPGA connected via a 10 GbE NIC. The workload is as follows:

  1. The FPGA sends UDP packets (1448-byte payload) at line rate.

  2. My filter driver intercepts these packets in FilterReceiveNetBufferLists.

  3. For each matching packet, the driver must generate and send a small ACK response immediately.
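
To put the workload in numbers, here is a rough back-of-the-envelope sketch of the packet rate this implies. It assumes IPv4 without options and no VLAN tag (I have not stated either above), and the function names are purely illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Per-frame overhead in bytes.
 * Assumptions: IPv4 without options, no VLAN tag. */
enum {
    UDP_HEADER_BYTES         = 8,
    IPV4_HEADER_BYTES        = 20,
    ETH_HEADER_BYTES         = 14,
    ETH_FCS_BYTES            = 4,
    ETH_PREAMBLE_SFD_BYTES   = 8,
    ETH_INTERFRAME_GAP_BYTES = 12
};

/* Bytes one UDP datagram occupies on the wire, including the
 * preamble and the inter-frame gap. */
uint32_t wire_bytes_per_packet(uint32_t udp_payload_bytes)
{
    return udp_payload_bytes + UDP_HEADER_BYTES + IPV4_HEADER_BYTES +
           ETH_HEADER_BYTES + ETH_FCS_BYTES +
           ETH_PREAMBLE_SFD_BYTES + ETH_INTERFRAME_GAP_BYTES;
}

/* Packets per second for a link rate given in bits per second. */
double packets_per_second(double link_bps, uint32_t udp_payload_bytes)
{
    assert(link_bps > 0.0);
    return link_bps / (8.0 * wire_bytes_per_packet(udp_payload_bytes));
}

/* Time available per packet at line rate, in nanoseconds. */
double ns_per_packet(double link_bps, uint32_t udp_payload_bytes)
{
    return 1e9 / packets_per_second(link_bps, udp_payload_bytes);
}
```

Under these assumptions, a 1448-byte payload occupies 1514 bytes on the wire, so packets_per_second(10e9, 1448) comes out at roughly 825,000 packets per second, which leaves about 1.2 microseconds per packet (and per ACK) if everything is handled on a single CPU.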

However, profiling with Windows Performance Analyzer (WPA) shows that calling NdisFSendNetBufferLists() directly from the receive path introduces significant overhead and becomes a bottleneck under load.

My code:

 // Send all queued ACKs as a single NBL chain.
 if (!NdisIsNblCountedQueueEmpty(&ackQ)) {
   UINT32 ackQCnt = (UINT32)ackQ.NblCount;
   // The ETW start/stop events bracket only the send call.
   RC_ETW_NBL_CHAIN_ACK_START(PortNumber, ackQCnt, cpuIndex);
   NdisFSendNetBufferLists(pFilter->FilterHandle,
                           NdisGetNblChainFromNblCountedQueue(&ackQ),
                           PortNumber,
                           sendFlags);
   RC_ETW_NBL_CHAIN_ACK_STOP(PortNumber, ackQCnt, cpuIndex);
 }

Question:
Is there a better way to send ACKs efficiently in this scenario?

Sample ETW trace from WPA for one ACK chain:

RevealerFilterDriver-Trace/NblChainAck/Start NBL Chain Ack Started: Port=0, Count=10, CPU=10 10 1 0.785557000
RevealerFilterDriver-Trace/NblChainAck/Stop NBL Chain Ack Completed: Port=0, Count=10, CPU=10 10 1 0.785566500
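
For reference, the cost per ACK can be read off the Start/Stop pair with a little arithmetic. This is only a sketch: it assumes the last column of each trace line is the event timestamp in seconds, and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed time between two trace timestamps, in nanoseconds.
 * Assumption: timestamps are in seconds with nanosecond resolution,
 * so the result is rounded to the nearest nanosecond. */
int64_t elapsed_ns(double start_s, double stop_s)
{
    assert(stop_s >= start_s);
    return (int64_t)((stop_s - start_s) * 1e9 + 0.5);
}

/* Average time spent in the send call per ACK in the chain. */
double ns_per_ack(double start_s, double stop_s, uint32_t ack_count)
{
    assert(ack_count > 0);
    return (double)elapsed_ns(start_s, stop_s) / ack_count;
}
```

With the values from the trace, elapsed_ns(0.785557000, 0.785566500) gives 9500 ns for the chain of 10 NBLs, so the NdisFSendNetBufferLists() call costs roughly 950 ns per ACK in this sample. A single Start/Stop pair is of course not a statistically meaningful measurement.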