Best approach for a dynamic buffer used to send received packets from an NDIS driver to user mode?

I want to write an NDIS driver that stores every received packet in a buffer (the simplest way, for example, is to store each of them in a linked list),

and then I want a user-mode application to periodically query the NDIS driver and receive these packets, check them, drop the ones that are not OK, and send on the rest of them.

My questions are:

  1. What is the best way to store these received NBLs (or just the packet contents)? One simple approach is to insert each of them into a linked list, but that means we have to loop over the list every time the user sends a receive request, instead of handing all of them over at once (see the sketch after these questions).

  2. What is the best way to implement the user-mode checking part? I cannot implement the packet checking inside the NDIS driver, so the user somehow needs to get the packets, check them, and then tell the driver which of them are OK. Would this be possible with an FltCommunicationPort? Is periodically sending an IOCTL to get and check all the received packets the best approach?
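To make question 1 concrete, here is a minimal sketch of the linked-list idea, assuming each packet is copied out of its NET_BUFFER into a nonpaged allocation. All of the names (PACKET_ENTRY, g_PacketQueue, QueueReceivedPacket) are hypothetical and error handling is trimmed:

```c
#include <ndis.h>

// Hypothetical queue entry: a length-prefixed copy of one received packet.
typedef struct _PACKET_ENTRY {
    LIST_ENTRY Link;
    ULONG      Length;
    UCHAR      Data[1];              // variable-length payload copy
} PACKET_ENTRY, *PPACKET_ENTRY;

static LIST_ENTRY g_PacketQueue;     // init with InitializeListHead
static KSPIN_LOCK g_PacketQueueLock; // init with KeInitializeSpinLock

VOID QueueReceivedPacket(_In_ PNET_BUFFER Nb)
{
    ULONG len = NET_BUFFER_DATA_LENGTH(Nb);
    PPACKET_ENTRY e = (PPACKET_ENTRY)ExAllocatePool2(
        POOL_FLAG_NON_PAGED, FIELD_OFFSET(PACKET_ENTRY, Data) + len, 'qPdn');
    if (e == NULL) return;           // drop on allocation failure

    // NdisGetDataBuffer copies into Data if the payload isn't contiguous;
    // otherwise it returns a pointer into the NBL and we copy ourselves.
    PVOID p = NdisGetDataBuffer(Nb, len, e->Data, 1, 0);
    if (p == NULL) { ExFreePoolWithTag(e, 'qPdn'); return; }
    if (p != e->Data) RtlCopyMemory(e->Data, p, len);

    e->Length = len;
    ExInterlockedInsertTailList(&g_PacketQueue, &e->Link, &g_PacketQueueLock);
}
```

The drain side is then the per-request loop mentioned above: pop entries under the lock and copy each one into the user buffer.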

I want to write an NDIS driver that stores every received packet in a buffer

Have you even for a moment stopped to consider the performance implications of what you’re suggesting? Every packet? On a 10 Gb/s local network, you could be sucking up a gigabyte every second. How do you think kernel/user/kernel communication is going to work with that kind of load?

Getting the data to UM for analysis is probably the easy part. The biggest problem is the latency of whatever analysis you plan to do on these packets.

Presuming that most of your network traffic is TCP based, and that, in addition to the KM-to-UM latency, whatever ‘scanning’ you plan to do both takes time and, more importantly, requires a significant window of reconstructed TCP stream, the main problem is going to be bandwidth and latency.

You can certainly help your problem to a great degree by fanning out the packets to be processed by multiple threads, in groups based on correlated streams. But your overall performance will vary to a great degree based on the profile of the traffic being passed.
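As a sketch of that fan-out (with invented names — FLOW_KEY, WORKER_COUNT, PickWorker — and assuming a fixed pool of worker threads fed by per-worker queues): hashing the 4-tuple keeps every packet of a stream on the same worker, which preserves per-stream ordering while spreading the load.

```c
// Illustrative only: pin each flow to one worker so a TCP stream is always
// inspected in order by the same thread. Names and hash are hypothetical.
#define WORKER_COUNT 4

typedef struct _FLOW_KEY {
    ULONG  SrcIp, DstIp;
    USHORT SrcPort, DstPort;
} FLOW_KEY;

ULONG PickWorker(const FLOW_KEY* k)
{
    // Cheap multiplicative hash; any stable per-flow hash works.
    ULONG h = k->SrcIp ^ (k->DstIp * 2654435761u)
            ^ ((ULONG)k->SrcPort << 16) ^ k->DstPort;
    return h % WORKER_COUNT;
}
```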

Getting gigabytes of data from the card to usermode is trivial; InfiniBand drivers do it all the time with zero-copy RDMA techniques for high-performance computing [https://en.wikipedia.org/wiki/Zero-copy] across compute clusters, and that’s in the terabyte range or better … here’s a GitHub repo for a similar type of project [https://github.com/AmbrSb/KUP] …

That’s the easy part; the harder part, as @MBond2 mentioned, is what you’re going to be doing with those packets …

So, rather than ask “the wings on my pig keep falling off, should I use a polyurethane glue to attach them, or weld them on?”, a better question would be “I need my pig to be able to travel from point to point quickly; what are my options?”

Rather than ask “I want to write an NDIS driver that stores all packets received in a linked list”, I would rephrase it as the actual problem you’re trying to solve … packet introspection? Quality-of-service tuning? Who is the target user, and what kind of machine is this running on? What kind of network speeds and traffic types? Start with “here’s the problem I’m trying to solve and here’s what I have to work with” rather than “… is solution A better than solution B?”

…

@Tim_Roberts said:

I want to write an NDIS driver that stores every received packet in a buffer

Have you even for a moment stopped to consider the performance implications of what you’re suggesting? Every packet? On a 10 Gb/s local network, you could be sucking up a gigabyte every second. How do you think kernel/user/kernel communication is going to work with that kind of load?

Well, it’s not THAT bad in terms of memory usage when user mode keeps emptying the list a few times a second. I already implemented this using a linked list, and as you guessed, it reduces throughput by around 75%. I store received packets in a linked list and empty the list every time the user queries the driver: the driver sends everything to the user, the user sends back the ones that are good to go, and the driver sends them on.

But the performance hit is not caused by user mode’s packet inspection; that part is really optimized and fast. So I think the main problem is the linked-list approach. Is there any better, more optimized way of doing this? If you had to solve this problem, how would you do it, if you really had to inspect every TCP/UDP packet and had to use user mode to inspect them?

@craig_howard said:
Getting gigabytes of data from the card to usermode is trivial; InfiniBand drivers do it all the time with zero-copy RDMA techniques for high-performance computing [https://en.wikipedia.org/wiki/Zero-copy] across compute clusters, and that’s in the terabyte range or better … here’s a GitHub repo for a similar type of project [https://github.com/AmbrSb/KUP] …

That’s the easy part; the harder part, as @MBond2 mentioned, is what you’re going to be doing with those packets …

So, rather than ask “the wings on my pig keep falling off, should I use a polyurethane glue to attach them, or weld them on?”, a better question would be “I need my pig to be able to travel from point to point quickly; what are my options?”

Rather than ask “I want to write an NDIS driver that stores all packets received in a linked list”, I would rephrase it as the actual problem you’re trying to solve … packet introspection? Quality-of-service tuning? Who is the target user, and what kind of machine is this running on? What kind of network speeds and traffic types? Start with “here’s the problem I’m trying to solve and here’s what I have to work with” rather than “… is solution A better than solution B?”

…

The speed is 1-10 Gb/s, and the goal is packet inspection for malicious packets. And as I said, the user-mode inspection is optimized and fast; I need to optimize the process of storing the packets, sending them to the user to inspect, and then sending on the ones that are good to go from the NDIS driver. I have never heard of zero-copy RDMA techniques; should I use that technique in my NDIS driver to send packets to the user? I couldn’t find anything on MSDN about using zero-copy RDMA in NDIS, and the GitHub project you shared is for FreeBSD?

@MBond2 said:
Getting the data to UM for analysis is probably the easy part. The biggest problem is the latency of whatever analysis you plan to do on these packets.

Presuming that most of your network traffic is TCP based, and that, in addition to the KM-to-UM latency, whatever ‘scanning’ you plan to do both takes time and, more importantly, requires a significant window of reconstructed TCP stream, the main problem is going to be bandwidth and latency.

The user-mode inspection part is optimized and does not account for much of the performance hit. The problem is the process of storing the packets in a linked list, emptying it and sending everything to the user each time it queries, and then sending on the ones the user marked as OK from the NDIS driver.

You can certainly help your problem to a great degree by fanning out the packets to be processed by multiple threads, in groups based on correlated streams. But your overall performance will vary to a great degree based on the profile of the traffic being passed.

How exactly should I implement this stream-grouped, multithreaded approach in my NDIS receive callback? Or are you talking about the user-mode inspection part?

@brad_H said:

…

The speed is 1-10 Gb/s, and the goal is packet inspection for malicious packets. And as I said, the user-mode inspection is optimized and fast; I need to optimize the process of storing the packets, sending them to the user to inspect, and then sending on the ones that are good to go from the NDIS driver. I have never heard of zero-copy RDMA techniques; should I use that technique in my NDIS driver to send packets to the user? I couldn’t find anything on MSDN about using zero-copy RDMA in NDIS, and the GitHub project you shared is for FreeBSD?

Google is your friend; Google-fu those terms and enlightenment will follow … and the FreeBSD project is provided as an example of RDMA techniques. It’s about as far away from copy/paste code as you can get; it’s there so you can understand how it’s done under one OS, not as a guide for Windows …

How have you concluded that your scanning is not a significant cause of performance reduction? What specific measurements have you made, or analysis done?

Perhaps your logic is that that part can’t be made to go any faster, so it must not be a problem. If the analysis is for TCP data, and a significant window into the stream is required, you are going to introduce significant latency. That’s going to have a significant effect on the way that the layers above you respond.

Also note that RDMA is almost certainly irrelevant to your problem. It requires coordination from multiple levels of the stack that you probably have no control over.


And as I said, the user-mode inspection is optimized and fast,

Doesn’t matter how fast it is if it’s only getting called “a few times a second”. Every interval you delay adds latency to the packets. Also, you are ignoring the time it takes to transition from kernel mode, where the packets arrive, to user mode, where they get reviewed, and back to kernel mode, where they’ll be approved or declined.

@MBond2 said:
How have you concluded that your scanning is not a significant cause of performance reduction? What specific measurements have you made, or analysis done?

Perhaps your logic is that that part can’t be made to go any faster, so it must not be a problem. If the analysis is for TCP data, and a significant window into the stream is required, you are going to introduce significant latency. That’s going to have a significant effect on the way that the layers above you respond.

@Tim_Roberts said:

And as I said, the user-mode inspection is optimized and fast,

Doesn’t matter how fast it is if it’s only getting called “a few times a second”. Every interval you delay adds latency to the packets. Also, you are ignoring the time it takes to transition from kernel mode, where the packets arrive, to user mode, where they get reviewed, and back to kernel mode, where they’ll be approved or declined.

I measured it with ProcMon; the user-mode checking takes around 10% of the time, and most of the time is spent in the kernel.

My main question right now is: what is the most efficient way of storing these packets and transmitting them to user mode, and then sending on the ones that user mode marked as OK from the NDIS driver?

@brad_H said:

I measured it with ProcMon; the user-mode checking takes around 10% of the time, and most of the time is spent in the kernel.

My main question right now is: what is the most efficient way of storing these packets and transmitting them to user mode, and then sending on the ones that user mode marked as OK from the NDIS driver?

ProcMon isn’t the way to measure kernel-space timings … it’s limited to user-mode activity. There are ways to get timings in kernel mode; I’d do some searching on this list for “precision timings”.
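If it helps, bracketing the code path with KeQueryPerformanceCounter is about the simplest kernel-mode timing there is (a sketch only; in a hot path you’d aggregate counters rather than DbgPrint):

```c
// Time a kernel code path; KeQueryPerformanceCounter works at any IRQL.
LARGE_INTEGER freq, t0, t1;
t0 = KeQueryPerformanceCounter(&freq);

// ... region under measurement ...

t1 = KeQueryPerformanceCounter(NULL);
ULONGLONG micros = (ULONGLONG)(t1.QuadPart - t0.QuadPart) * 1000000ull
                   / (ULONGLONG)freq.QuadPart;
DbgPrint("section took %llu us\n", micros);
```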

The answer to your question isn’t as clear-cut as it looks, because it’s not clear who’s doing what …

You’ve got data coming in from the NIC, which is going to be handled in the context of the OS as it works its way from ISR to DPC. Now the data is sitting in a preSniffed packet buffer queue in the driver, waiting for some thread to scoop it out …

You’ve got a usermode service running, with some threads waiting for something to do …

So what it sounds like you’re going to do (and this is all a wild guess) is have a thread in the service post an inverted call to the driver, waiting for the packet buffer to have something to look at. The driver moves a (single) packet from the packet buffer into the inverted-call thread’s buffer and that call completes, moving the packet into the service, where it is looked at. If all is good, the thread makes another inverted call into the driver (this time passing the packet, or a packet ID, back) and waits for the next packet to be looked at.

The driver takes the sniffed packet (or packet ID) from the inverted call and puts it into a postSniffed packet queue.

At some point another driver in the network stack (and it needs to be at the kernel level, because kernel Winsock exists, and if you don’t handle that then bypassing your packet sniffer is not only trivial, it’s expected behaviour) pulls packets from the postSniffed packet queue.
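Roughly, the inverted-call leg of that journey looks like this (a sketch assuming METHOD_BUFFERED IOCTLs; g_PendingIrp, OnPacketQueued, and PACKET_ENTRY are invented names, and real code needs a cancel-safe queue plus output-buffer size checks):

```c
static PIRP       g_PendingIrp;     // at most one parked request (simplified)
static KSPIN_LOCK g_PendingLock;

// Service thread posts the inverted call; we park it until a packet arrives.
NTSTATUS IoctlGetNextPacket(PDEVICE_OBJECT Dev, PIRP Irp)
{
    KIRQL irql;
    KeAcquireSpinLock(&g_PendingLock, &irql);
    g_PendingIrp = Irp;
    KeReleaseSpinLock(&g_PendingLock, irql);
    IoMarkIrpPending(Irp);
    return STATUS_PENDING;
}

// Receive path: hand one packet to the parked request and complete it.
VOID OnPacketQueued(PPACKET_ENTRY Pkt)
{
    KIRQL irql;
    PIRP irp;
    KeAcquireSpinLock(&g_PendingLock, &irql);
    irp = g_PendingIrp;
    g_PendingIrp = NULL;
    KeReleaseSpinLock(&g_PendingLock, irql);
    if (irp == NULL) return;        // nobody waiting; leave the packet queued

    RtlCopyMemory(irp->AssociatedIrp.SystemBuffer, Pkt->Data, Pkt->Length);
    irp->IoStatus.Status = STATUS_SUCCESS;
    irp->IoStatus.Information = Pkt->Length;
    IoCompleteRequest(irp, IO_NO_INCREMENT);
}
```

Every packet pays two of these user/kernel round trips (one to fetch it, one to return the verdict), which is exactly the cost being described.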

Do you see how long and tortuous a journey every single packet is going to have to make, and how long all of those kernel-to-usermode transitions are going to take?

Most (actually all, not most) packet introspection happens entirely in the kernel, and that’s where you’re going to have to put your sniffing. Most (actually all) use system thread pools, and most (actually all) don’t do any packet copying; they work on packets right where they are DMA’ed from the NIC or the offload engine.
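For contrast, the all-kernel, no-copy shape in an NDIS lightweight filter looks roughly like this (Inspect and MY_FILTER are placeholders; list bookkeeping and the low-resources path are simplified):

```c
VOID FilterReceiveNetBufferLists(
    NDIS_HANDLE FilterModuleContext, PNET_BUFFER_LIST NetBufferLists,
    NDIS_PORT_NUMBER PortNumber, ULONG NumberOfNetBufferLists, ULONG ReceiveFlags)
{
    PMY_FILTER f = (PMY_FILTER)FilterModuleContext;
    PNET_BUFFER_LIST pass = NULL, drop = NULL, next;
    PNET_BUFFER_LIST *passTail = &pass, *dropTail = &drop;
    ULONG passCount = 0;

    UNREFERENCED_PARAMETER(NumberOfNetBufferLists);

    // Classify each NBL in place - no copy - and build pass/drop chains.
    for (PNET_BUFFER_LIST nbl = NetBufferLists; nbl != NULL; nbl = next) {
        next = NET_BUFFER_LIST_NEXT_NBL(nbl);
        NET_BUFFER_LIST_NEXT_NBL(nbl) = NULL;
        if (Inspect(nbl)) {
            *passTail = nbl; passTail = &NET_BUFFER_LIST_NEXT_NBL(nbl); passCount++;
        } else {
            *dropTail = nbl; dropTail = &NET_BUFFER_LIST_NEXT_NBL(nbl);
        }
    }

    if (pass != NULL)
        NdisFIndicateReceiveNetBufferLists(f->FilterHandle, pass,
                                           PortNumber, passCount, ReceiveFlags);
    // If we may pend, return dropped NBLs to the miniport; under the
    // low-resources flag, simply not indicating them is the drop.
    if (drop != NULL && !NDIS_TEST_RECEIVE_CANNOT_PEND(ReceiveFlags))
        NdisFReturnNetBufferLists(f->FilterHandle, drop, 0);
}
```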

tl;dr: Over in LinuxLand (where, because of its nature, you can read the source code) there are things that do packet sniffing; I would really strongly recommend you see what they are doing and try to emulate that in your product … you can’t cut and paste the code and declare victory [love that, Doron, I’m starting to use that!], and you can’t lock a processor at ISR level and keep it there like Linux can, but it will give you an idea of what works for packet scanning. Your approach, unfortunately, is at best going to turn a 10 Gb/s network connection into a 1980s 300-baud modem connection …

@craig_howard said:

Most (actually all, not most) packet introspection happens entirely in the kernel, and that’s where you’re going to have to put your sniffing. Most (actually all) use system thread pools, and most (actually all) don’t do any packet copying; they work on packets right where they are DMA’ed from the NIC or the offload engine.

Yes, one solution would be to move all the inspection to kernel mode. But if that were not possible, what is the next best option?

I thought that maybe, instead of using a linked list, we could store all the received packets in one contiguous buffer, so that when the user asks for the new batch of packets, we just do one large copy of that buffer into the user buffer (instead of looping over the linked list, removing packets one by one and moving them into the user buffer). Do you agree? There are obviously many ways to do this, but any tips on the most optimized way?
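Something like this append buffer is what I have in mind (sizes and names invented; a real version would double-buffer so the receive path can keep appending while user mode drains, and would allocate from nonpaged pool rather than declaring the array statically):

```c
#define PKTBUF_SIZE (4 * 1024 * 1024)

typedef struct _PKT_BUF {
    UCHAR      Data[PKTBUF_SIZE];
    ULONG      Head;               // next free offset
    KSPIN_LOCK Lock;
} PKT_BUF;

// Receive path: length-prefixed append.
BOOLEAN PktBufAppend(PKT_BUF* b, const UCHAR* pkt, ULONG len)
{
    KIRQL irql;
    KeAcquireSpinLock(&b->Lock, &irql);
    if (b->Head + sizeof(ULONG) + len > PKTBUF_SIZE) {
        KeReleaseSpinLock(&b->Lock, irql);
        return FALSE;              // full; caller decides what to do
    }
    RtlCopyMemory(b->Data + b->Head, &len, sizeof(ULONG));
    RtlCopyMemory(b->Data + b->Head + sizeof(ULONG), pkt, len);
    b->Head += sizeof(ULONG) + len;
    KeReleaseSpinLock(&b->Lock, irql);
    return TRUE;
}

// IOCTL path: one RtlCopyMemory of Data[0..Head) into the user buffer,
// then Head = 0 - a single large copy instead of the per-packet loop.
```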

@brad_H said:

I thought that maybe, instead of using a linked list, we could store all the received packets in one contiguous buffer… we just do one large copy of that buffer into the user buffer… do you agree?

You have a different issue here … recall that TCP traffic is a stream of continuous packets, not like UDP, where packets stand alone. Suppose you find the “one bad apple” packet and remove it; that’s packet 20 out of a sequence of 30 that is now missing. The first thing the downstream endpoint does is request a retransmit of packet 20 (if you’re lucky; if you’re not, it will ask for a retransmit of the whole sequence), and now that makes the rest of the stored buffer pointless (nothing cares about packets 1-19 or 21-30 anymore). That’s why you look at things one packet at a time, because then you aren’t in the sequencing business …

That of course assumes that the next time packet 20 shows up it’s not another “bad apple”, in which case you’ve now denial-of-serviced yourself …

…we could store all the received packets in one contiguous buffer…

Storing is one issue, but communication is another. I don’t often recommend this kind of approach, but you may want to consider allocating a big buffer in your user app, then passing it to the driver to be mapped into kernel space. Whether it’s a circular buffer or a linked list, there are ways you can make that work without requiring yet another copy from kernel memory to user memory. The app can periodically scan the buffer for new data and mark its decisions without requiring any ioctls at all. Is that tricky to synchronize? Yes, no doubt, but it's hard to beat the performance.
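A sketch of how the driver side of that mapping might look, assuming the app hands its big buffer down once through a METHOD_OUT_DIRECT IOCTL (so the I/O manager probes, locks, and builds the MDL for us); g_Shared is invented, and the cancel/cleanup handling that real code absolutely needs is omitted:

```c
static struct { PVOID Base; ULONG Length; } g_Shared;  // hypothetical session state

NTSTATUS IoctlMapSharedBuffer(PIRP Irp)
{
    PMDL mdl = Irp->MdlAddress;            // built by the I/O manager
    if (mdl == NULL) return STATUS_INVALID_PARAMETER;

    // Get a kernel-space view of the app's locked pages.
    PVOID kernelVa = MmGetSystemAddressForMdlSafe(mdl, NormalPagePriority);
    if (kernelVa == NULL) return STATUS_INSUFFICIENT_RESOURCES;

    g_Shared.Base   = kernelVa;
    g_Shared.Length = MmGetMdlByteCount(mdl);

    // Keep the IRP pended so the pages stay locked for the session.
    IoMarkIrpPending(Irp);
    return STATUS_PENDING;
}
```

With that in place, producer and consumer can coordinate through interlocked head/tail indices stored inside the shared region itself, which is how the no-ioctl polling works.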


Again, presuming TCP traffic. Maybe the OP can clarify.

Procmon is an ineffective method of testing for anything; there are no analytical results possible from this tool’s output.

Presumably, if something ‘bad’ is detected, the TCP connection will be reset or abandoned. Abandoning it would involve ‘black holing’ all future packets in the stream and controlling the responses sent, so that the presumably malicious sender of those packets wastes as many resources as possible after you have detected their malice.
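A minimal sketch of the black-hole bookkeeping (FLOW_KEY, the table size, and the no-aging policy are all illustrative): the receive path checks each packet’s 4-tuple against a table of condemned flows before doing anything else.

```c
#define BH_SLOTS 1024

typedef struct _FLOW_KEY { ULONG SrcIp, DstIp; USHORT SrcPort, DstPort; } FLOW_KEY;

static FLOW_KEY   g_BlackHole[BH_SLOTS];   // all-zero slot == empty (simplified)
static KSPIN_LOCK g_BhLock;

static ULONG FlowHash(const FLOW_KEY* k)
{
    return ((k->SrcIp ^ k->DstIp ^ ((ULONG)k->SrcPort << 16) ^ k->DstPort)
            * 2654435761u) % BH_SLOTS;
}

// Called per packet in the receive path: TRUE means eat the packet silently.
BOOLEAN BhContains(const FLOW_KEY* k)
{
    KIRQL irql;
    BOOLEAN found = FALSE;
    KeAcquireSpinLock(&g_BhLock, &irql);
    for (ULONG i = FlowHash(k), n = 0; n < BH_SLOTS; i = (i + 1) % BH_SLOTS, n++) {
        const FLOW_KEY* s = &g_BlackHole[i];
        if (s->SrcIp == 0 && s->DstIp == 0) break;   // empty slot: not present
        if (RtlEqualMemory(s, k, sizeof(*s))) { found = TRUE; break; }
    }
    KeReleaseSpinLock(&g_BhLock, irql);
    return found;
}
```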

Exactly what Mr. Roberts said. I’ve been busy teaching, but that is exactly what I was going to write.

Peter

@MBond2 said:
Again, presuming TCP traffic. Maybe the OP can clarify.

Procmon is an ineffective method of testing for anything; there are no analytical results possible from this tool’s output.

Presumably, if something ‘bad’ is detected, the TCP connection will be reset or abandoned. Abandoning it would involve ‘black holing’ all future packets in the stream and controlling the responses sent, so that the presumably malicious sender of those packets wastes as many resources as possible after you have detected their malice.

We want to monitor both TCP and UDP packets. Is there any open-source NDIS driver that implements your approach of black-holing the future packets and sending the proper responses for TCP? All the projects I have access to just drop the NET_BUFFER when it matches something malicious; will that approach cause any real problem?

@Tim_Roberts said:
Is that tricky to synchronize? Yes, no doubt, but it’s hard to beat the performance.

Any suggestions on the best, most optimized approach to synchronizing the NDIS driver with the user-mode service?