Best approach for a dynamic buffer that we can use to send the received packets from NDIS up to user mode?

If you’re using the shared buffer approach between the service and the kernel (which I agree with, and which is getting closer to how RDMA works) then having two shared “something happened here” events and two shared ‘what happened’ regions (one for the service, one for the driver) works well … you’re essentially duplicating the DMA interrupt transaction, but in software between the driver and the service instead of between hardware and the driver …
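Just to make that concrete, here is a purely illustrative layout for the shared region and its two rings … every name, size, and field below is hypothetical, a sketch of the idea rather than a working design:

```c
/*
 * Purely illustrative layout for the shared control block described above;
 * all names and sizes are hypothetical. One ring per direction, plus one
 * event per direction (created separately) that plays the role of the
 * "DMA interrupt".
 */
#include <stdint.h>

#define RING_SLOTS 256
#define SLOT_BYTES 2048                 /* one copied packet per slot */

typedef struct PacketRing {
    volatile int32_t head;              /* next slot the producer fills  */
    volatile int32_t tail;              /* next slot the consumer drains */
    uint8_t          slot[RING_SLOTS][SLOT_BYTES];
} PacketRing;

typedef struct SharedControlBlock {
    PacketRing driverToService;         /* "what happened" region the driver writes */
    PacketRing serviceToDriver;         /* verdicts / reinjects the service writes  */
} SharedControlBlock;

/*
 * Protocol sketch: the producer copies a packet into slot[head % RING_SLOTS],
 * publishes the new head, then signals its event; the consumer wakes, drains
 * slots until tail == head, and goes back to waiting on the event.
 */
```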

A big part of how this can be done depends on whether the traffic is TCP and/or UDP, and whether the packets originate from “out there” or on the local machine … you mentioned both TCP and UDP; that’s going to get very, very complicated very quickly, because most TCP packets these days are HTTP, and most HTTP streams are SSL encrypted.

How do you plan on examining the contents of a TCP HTTPS packet?

@Tim_Roberts said:

…we can somehow store all the received packets in a continuous buffer…

Storing is one issue, but communication is another issue. I don’t often recommend this kind of approach, but you may want to consider allocating a big buffer in your user app, then passing that to the driver, to be mapped into kernel space. Whether it’s a circular buffer or a linked list, there are ways you can make that work without requiring yet another copy from kernel mem to user mem. The app can periodically scan the buffer for new data and mark its decisions without requiring any ioctls at all. Is that tricky to synchronize? Yes, no doubt, but it’s hard to beat the performance.
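In case it helps, here is a minimal sketch of the driver side of that idea, assuming a WDM IOCTL whose input is just the user buffer’s address and length; SHARED_BUFFER_DESC, MAPPED_BUFFER, and MapUserBuffer are hypothetical names, not a real API:

```c
/*
 * Minimal sketch (WDM, hypothetical names): the app passes the address and
 * size of a large buffer it allocated; the driver locks it down and obtains
 * a kernel-mode mapping so both sides can touch the same pages.
 */
#include <ntddk.h>

typedef struct _SHARED_BUFFER_DESC {   // what the app sends in the IOCTL
    PVOID  UserAddress;
    SIZE_T Length;
} SHARED_BUFFER_DESC, *PSHARED_BUFFER_DESC;

typedef struct _MAPPED_BUFFER {        // what the driver keeps around
    PMDL   Mdl;
    PVOID  KernelAddress;
    SIZE_T Length;
} MAPPED_BUFFER, *PMAPPED_BUFFER;

NTSTATUS MapUserBuffer(_In_ PSHARED_BUFFER_DESC Desc, _Out_ PMAPPED_BUFFER Mapped)
{
    Mapped->Mdl = IoAllocateMdl(Desc->UserAddress, (ULONG)Desc->Length,
                                FALSE, FALSE, NULL);
    if (Mapped->Mdl == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    __try {
        // Must run in the context of the requesting process.
        MmProbeAndLockPages(Mapped->Mdl, UserMode, IoWriteAccess);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        IoFreeMdl(Mapped->Mdl);
        return GetExceptionCode();
    }

    Mapped->KernelAddress =
        MmGetSystemAddressForMdlSafe(Mapped->Mdl,
                                     NormalPagePriority | MdlMappingNoExecute);
    if (Mapped->KernelAddress == NULL) {
        MmUnlockPages(Mapped->Mdl);
        IoFreeMdl(Mapped->Mdl);
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    Mapped->Length = Desc->Length;
    return STATUS_SUCCESS;
}
```

The IOCTL has to be processed in the context of the process that owns the buffer, and the mapping has to be torn down (MmUnlockPages, then IoFreeMdl) before that process goes away.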

Any suggestion on the best/most optimized approach to synchronize the NDIS and the user mode app?

One more question:

If we switch to WFP, will this help the process of sending these packets to user mode for checking and the overall speed of user-kernel communication?

Basically, if we switch to WFP, will it help us in this case in any way?

WFP certainly looks like the way to go for what you’re trying to do … I’d do some research into this, the “inspect” sample on GitHub looks really similar to what you’re trying to do …

@craig_howard said:
WFP certainly looks like the way to go for what you’re trying to do … I’d do some research into this, the “inspect” sample on GitHub looks really similar to what you’re trying to do …

But will it help much?

We are concerned with TCP/UDP packets, and we can even move the entire signature checking into the kernel if it is really necessary, but will switching to WFP make any difference in terms of reducing the bandwidth loss?

Also for the signature checking part, just assume that we need to go over packets and inspect them for any malicious content such as exploits by matching them against signatures.

So if we move the entire signature checks to the kernel, will it help THAT much? What about switching to WFP?
Because right now bandwidth is reduced by more than 60% because of these user-mode checks…

A good rule of thumb is keep your code and your data as close together as possible … if your data lives in kernel space then your business logic should too.

Even more important, it’s better to use the MS samples when possible as you know that they work (but please use current samples, I had a client trying desperately to get an audio driver to work properly, turns out it was based on MSVAD rather than SYSVAD … sigh …). I worked on a file system minifilter driver a few years ago enhancing the MS file cache for Win10 embedded (which has a readonly memory space, so any writes actually get cached elsewhere … client wanted that improved) and discovered that the minifilters that MS introduced a few years back are an order of magnitude improvement over anything I could spin on my own …

And especially after reading the post about NDIS 6.x packet allocations and NBLs and such, I would be all over using a WFP sample for anything that needed to touch packets (like you’re trying to do) … remember, you want to spend your time writing the business logic for packet introspection/scanning/whatever, not spend it trying to dig out NBLs and manage user-mode to kernel-mode transitions …

You’ll save bandwidth since it’s all in kernel space, it will run faster (the slowest system thread will run faster than the fastest thread in your service), you don’t need to decompose NDIS 6.x or NDIS 5.x packets, the program is (almost) guaranteed to fit into the NDIS ecosystem, seems like a no-brainer …


it will run faster (the slowest system thread will run faster than the fastest thread in your service)

Mr @craig_howard …. I’m sure you’re going to think I’m picking on you, but this is not correct. Threads running in kernel mode (at IRQL PASSIVE_LEVEL) have absolutely no scheduling advantage over threads running in user mode. Threads are threads. The process owning the thread (system process or a user process) doesn’t impact the thread’s ongoing scheduling.

The notion that “things run faster in kernel mode” is a common misconception that I have to correct among clients. They want to “move work to the OS” because it’ll “run faster” – except it doesn’t. Threads are threads, regardless of the mode in which they happen to be running at the time.

To the OP: The rule of thumb for Best Practice is: if you can practically do it in user mode, do it in user mode. In general, leave policy decisions and business logic out of kernel mode, where the consequence of failure is higher and the code is harder to revise as needs change.

Peter


@“Peter_Viscarola_(OSR)” said:

it will run faster (the slowest system thread will run faster than the fastest thread in your service)
To the OP: The rule of thumb for Best Practice is: if you can practically do it in user mode, do it in user mode. In general, leave policy decisions and business logic out of kernel mode, where the consequence of failure is higher and the code is harder to revise as needs change.

Peter

Alright, I changed the communication from BUFFERED to OUT_DIRECT as suggested in the other thread, and unfortunately it still didn’t help; the bandwidth still gets reduced by around 60-70%…

Here’s how packet inspection is happening right now:

  1. Sent and received packet contents are stored in a linked list.

  2. The user-mode service sends OUT_DIRECT IOCTLs frequently, and every linked-list member is removed and copied into the user-mode buffer directly (since it’s OUT_DIRECT), one by one (see the sketch after this list).

  3. User mode inspects all the packets, then sends an OUT_DIRECT IOCTL back to the NDIS driver with a continuous buffer that contains all the packets; the kernel iterates over them, creates the proper NBL for each one, and sends it down the proper channel (based on whether it’s a receive or a send packet).
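For reference, step 2 on the driver side looks roughly like the sketch below (this is not my actual code; the PACKET_ENTRY type, pool tag, and helper name are made up). With METHOD_OUT_DIRECT the I/O manager describes the caller’s output buffer with the MDL at Irp->MdlAddress:

```c
/*
 * Sketch of step 2 (WDM, hypothetical names): drain the driver's linked list
 * of copied packets into the METHOD_OUT_DIRECT output buffer.
 */
#include <ntddk.h>

typedef struct _PACKET_ENTRY {
    LIST_ENTRY Link;
    ULONG      Length;
    UCHAR      Data[1];               // variable-length packet copy follows
} PACKET_ENTRY, *PPACKET_ENTRY;

ULONG DrainPacketList(_Inout_ PIRP Irp, _Inout_ PLIST_ENTRY PacketList,
                      _Inout_ PKSPIN_LOCK Lock, _In_ ULONG OutputLength)
{
    PUCHAR out = (PUCHAR)MmGetSystemAddressForMdlSafe(Irp->MdlAddress,
                                                      NormalPagePriority);
    ULONG  used = 0;
    KIRQL  irql;

    if (out == NULL) {
        return 0;
    }

    // A production driver would keep the time spent under the lock shorter.
    KeAcquireSpinLock(Lock, &irql);
    while (!IsListEmpty(PacketList)) {
        PPACKET_ENTRY entry = CONTAINING_RECORD(PacketList->Flink, PACKET_ENTRY, Link);

        if (used + sizeof(ULONG) + entry->Length > OutputLength) {
            break;                    // leave the rest for the next IOCTL
        }
        RemoveHeadList(PacketList);

        RtlCopyMemory(out + used, &entry->Length, sizeof(ULONG));  // length prefix
        RtlCopyMemory(out + used + sizeof(ULONG), entry->Data, entry->Length);
        used += sizeof(ULONG) + entry->Length;

        ExFreePoolWithTag(entry, 'tkPn');
    }
    KeReleaseSpinLock(Lock, irql);

    return used;                      // caller sets Irp->IoStatus.Information
}
```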

I know… it’s messy, and obviously there is plenty of room for improvement.

But the main problem actually occurs on large file transfers on 1Gb/s+ networks; on slower networks it works just fine, but on 1Gb/s+ the bandwidth gets reduced by around 60-70% when copying large files…

Note that even when I remove the signature check from user mode (so every packet is passed), it doesn’t change anything, so the problem is not with the packet-checking part.

So what do you guys think the main problem is? How can I improve this and keep the reduction in bandwidth below 20-30%?
Should I bring the signature checks into the kernel? Switch to WFP? Or somehow copy the received packet contents directly to the user-mode service instead of using a linked list and iterating over it?

Although I am starting to feel like moving everything to kernel mode won’t help much either,

because I wrote a test NDIS driver that basically just reads the received packets (using NdisGetDataBuffer) and compares them against 15 signatures (each of them around 100 bytes), and if a packet is OK then it is allowed to pass. But even with this simple NDIS driver the bandwidth gets reduced by around 80%, so when copying large files from shares to the local disk the speed drops from 100MB/s to 20MB/s!!
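Roughly, the check looks like this (a simplified reconstruction, not the actual code; the signature bytes are placeholders and the usual NDIS build settings are omitted):

```c
/*
 * Simplified reconstruction of the per-packet check: flatten each NET_BUFFER
 * with NdisGetDataBuffer and compare it against a small signature table.
 */
#include <ndis.h>

#define SIG_COUNT 15
#define SIG_LEN   100
#define MAX_SCAN  1514          // one Ethernet frame; anything bigger needs pool storage

static const UCHAR g_Signatures[SIG_COUNT][SIG_LEN];   // placeholder patterns

static BOOLEAN NbMatchesAnySignature(_In_ PNET_BUFFER Nb)
{
    ULONG  dataLen = NET_BUFFER_DATA_LENGTH(Nb);
    UCHAR  storage[MAX_SCAN];
    PUCHAR data;
    ULONG  sig, offset;

    if (dataLen < SIG_LEN || dataLen > MAX_SCAN) {
        return FALSE;
    }

    // Returns a direct pointer if the payload is contiguous, otherwise copies
    // it into 'storage'; either way 'data' points at dataLen contiguous bytes.
    data = NdisGetDataBuffer(Nb, dataLen, storage, 1, 0);
    if (data == NULL) {
        return FALSE;
    }

    // Brute-force scan: every signature at every offset of every packet.
    // This is O(offsets x signatures x SIG_LEN) per packet, which by itself
    // can eat a CPU core at gigabit rates.
    for (offset = 0; offset + SIG_LEN <= dataLen; offset++) {
        for (sig = 0; sig < SIG_COUNT; sig++) {
            if (RtlEqualMemory(data + offset, g_Signatures[sig], SIG_LEN)) {
                return TRUE;
            }
        }
    }
    return FALSE;
}
```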

And because of this I am starting to think the only solution is moving to WFP, right? Or am I missing something here?

I would definitely build the “inspect” WFP sample and have it do the same kind of packet scan (or even just a short delay, to simulate a scan) and see what the result is …

What Mr. @craig_howard said. Exactly. For a quick check, right?

Or, you know, get into some performance measurement. Which is really where you’re going to have to go eventually. You can’t fix it, until you know what “it” is. If the vast majority of the overhead isn’t in getting the data back to user-mode, then no matter what clever scheme you use to do that, it’s not going to speed the overall process significantly.

Peter

Effective design for UDP traffic is radically different from effective design for TCP traffic, both from an I/O-model and from a ‘scanning’ point of view.

UDP traffic inherently calls for packet-by-packet analysis. Yes, higher-level protocols that run on top of UDP can generate correlated streams of packets, and yes, UDP packets can be fragmented into multiple Ethernet (or other media) frames, but they can be arbitrarily lost or delivered out of order, so packet-by-packet analysis is the natural fit - and the time taken to analyze one packet has no effect on the ability to deliver another packet to the upper level.

TCP traffic is inherently a stream of data. Think paper tape. There could be out of order packets and retransmissions and all sorts of other stuff required to get that stream from one end of the connection to the other, but it is always presented to the application as a stream of bytes in order. That means that to scan for anything, you have to be scanning over some window of data in that stream. When I say window, I mean a range of bytes that have to be held back from the upper level while enough data arrives in order for any kind of scanning to be effective. And that means that an artificial latency will be introduced. The size of that latency depends on the data rate, the effective delivery rate (how many packets arrive out of order or need to be retransmitted) and most importantly, the size of that window. And the drop in throughput you observe in your unsystematic testing can be directly attributed to this effect.
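To make the ‘window’ idea concrete, here is an illustrative sketch (all names are hypothetical, and ScanBuffer stands in for whatever pattern matcher is actually used) that carries the tail of one in-order TCP segment into the scan of the next, so a signature straddling two segments is still seen:

```c
/*
 * Sliding-window scan across segment boundaries: keep the last SIG_LEN-1
 * bytes of data seen so far and scan them glued to the front of each new
 * segment.
 */
#include <string.h>

#define SIG_LEN 100                       /* longest signature, in bytes */

typedef struct StreamScanState {
    unsigned char carry[SIG_LEN - 1];     /* tail of data seen so far */
    size_t        carryLen;
} StreamScanState;

/* Stand-in matcher: a real implementation would run the signature scan. */
static int ScanBuffer(const unsigned char *p, size_t n)
{
    (void)p; (void)n;
    return 0;
}

/* Scan one in-order TCP segment; returns nonzero if any signature matched. */
int ScanSegment(StreamScanState *st, const unsigned char *seg, size_t segLen)
{
    unsigned char splice[2 * (SIG_LEN - 1)];
    int hit = 0;

    /* 1. Scan the carried tail glued to the start of the new segment, so a
     *    signature crossing the boundary is not missed. */
    if (st->carryLen > 0) {
        size_t head = segLen < SIG_LEN - 1 ? segLen : SIG_LEN - 1;
        memcpy(splice, st->carry, st->carryLen);
        memcpy(splice + st->carryLen, seg, head);
        hit |= ScanBuffer(splice, st->carryLen + head);
    }

    /* 2. Scan the new segment itself. */
    hit |= ScanBuffer(seg, segLen);

    /* 3. Remember up to SIG_LEN-1 trailing bytes for the next call. */
    if (segLen >= SIG_LEN - 1) {
        st->carryLen = SIG_LEN - 1;
        memcpy(st->carry, seg + segLen - st->carryLen, st->carryLen);
    } else {
        size_t total = st->carryLen + segLen;
        size_t drop  = total > SIG_LEN - 1 ? total - (SIG_LEN - 1) : 0;
        memmove(st->carry, st->carry + drop, st->carryLen - drop);
        memcpy(st->carry + st->carryLen - drop, seg, segLen);
        st->carryLen = total - drop;
    }
    return hit;
}
```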

There is another way to approach this problem though. Instead of scanning data in advance of forwarding it up, you can copy it and scan it after it has been forwarded. If anything of interest is found, then you can flag the connection and take whatever remedial action seems good to you (black-holing and tar-pit schemes are especially effective, as they waste the resources of your attacker).

This may seem very unsafe, since it allows content that should be blocked to reach the application before any action is taken on the connection, but it is probably highly effective.

Understand, and set aside for a moment, the problems of encrypted traffic that others have alluded to, and assume that you can somehow solve them. That’s a whole other topic that is even more expansive than the current one.

It is highly unlikely that your solution is some kind of protocol sanitizer. There are millions of protocols that run on top of TCP, including innumerable proprietary ones, so that would be an infeasible task. Instead, I suspect that you are looking for specific byte patterns within the stream. These ‘signatures’ would then indicate the presence of undesired content. That means that, at best, some amount of the content has been delivered before you detect the pattern, and the remediation you take is already about preventing further harm rather than providing absolute protection. This design just expands on that by moving the intervention into an asynchronous context while minimally interfering with the throughput of valid streams of data - which I would think are expected to be far more numerous than the nefarious ones.
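As a rough sketch of the queueing side of that idea (all names here are hypothetical, and the worker thread that drains the queue, scans, and flags the flow is omitted), the hot path only copies and queues:

```c
/*
 * Sketch of "scan after forwarding": the receive path copies the payload and
 * queues it; a separate worker thread scans it later and, on a hit, marks the
 * flow so subsequent traffic can be blocked.
 */
#include <ntddk.h>

typedef struct _SCAN_ITEM {
    LIST_ENTRY Link;
    PVOID      FlowContext;   // whatever identifies the connection
    ULONG      Length;
    UCHAR      Data[1];
} SCAN_ITEM, *PSCAN_ITEM;

// Initialized in DriverEntry with InitializeListHead / KeInitializeSpinLock /
// KeInitializeSemaphore(&g_ScanReady, 0, MAXLONG).
static LIST_ENTRY g_ScanQueue;
static KSPIN_LOCK g_ScanLock;
static KSEMAPHORE g_ScanReady;    // one count per queued item

// Hot path: called after the packet has already been indicated upward.
VOID QueueCopyForScan(_In_ PVOID FlowContext,
                      _In_reads_bytes_(Len) const UCHAR *Data, _In_ ULONG Len)
{
    PSCAN_ITEM item = (PSCAN_ITEM)ExAllocatePool2(
        POOL_FLAG_NON_PAGED, FIELD_OFFSET(SCAN_ITEM, Data) + Len, 'ncSq');
    if (item == NULL) {
        return;                   // scanning is best-effort here
    }

    item->FlowContext = FlowContext;
    item->Length = Len;
    RtlCopyMemory(item->Data, Data, Len);

    ExInterlockedInsertTailList(&g_ScanQueue, &item->Link, &g_ScanLock);
    KeReleaseSemaphore(&g_ScanReady, IO_NO_INCREMENT, 1, FALSE);
}
```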


Exactly what Mr. Roberts said. I’ve been busy teaching, but that is exactly what I was going to write.

The “only” problem is that we are speaking about NDIS packets here…

Once these packets (as well as their corresponding buffers) get allocated by the KM code, you still have to copy data from the NDIS packets into this buffer somehow, right?

Therefore, although this approach allows you to share a buffer (i.e. it is meant to work as direct I/O), in this particular case it happens to be, for all practical purposes, nothing more than a buffered one, because you still have to copy data from one buffer to another, and do it with the CPU.

Certainly, you can try it the other way around, i.e. map driver-allocated NDIS buffers to the userland, but the security “implications” of this approach seem pretty obvious.

I think the best way to go here is simply to use WFP, which has been specifically designed for this purpose, rather than an NDIS filter.

Anton Bassov

@craig_howard said:
I would definitely build the “inspect” WFP sample and have it do the same kind of packet scan (or even just a short delay, to simulate a scan) and see what the result is …

Well, I tried the “inspect” WFP sample that Microsoft has written, to see how well WFP performs.

But to my surprise it’s even worse than NDIS. After I removed the IP-based filter (because the default code only monitors one IP), file transfer speed from shares gets reduced from 80MB/s to 20MB/s, and this is without me implementing any packet content checking! Note that in the NDIS case it only got reduced after I implemented a simple packet check.

So is this normal? Why is WFP so much worse than NDIS in terms of reducing the transfer speed of files through shares?

Note that the only thing I changed in the Inspect source code was the IP-based filter.

So… what should I test next? Does this mean NDIS is better for deep packet inspection?

(We want to monitor every IP and port, so adding port/IP filters is not gonna work)

@MBond2 said:
There is another way to approach this problem though. Instead of scanning data in advance of forwarding it up, you can copy it and scan it after it has been forwarded. if anything of interest is found, then you can flag the connection and take whatever remedial action seems good to you (black holing and tar pit schemes are especially effective as they waste the resources of your attacker).

This is not gonna work, because in the case of many exploits, if you let one packet slide then the attacker all of a sudden has kernel code execution.

Could there be a workaround for the reduction in speed in the case of SMB-based share file transfers? Obviously we can’t just stop looking at port 445/SMB packets, and we can’t inspect only the first few hundred packets of an SMB transfer, since, for example, the attacker/malware might try to exploit after many packets have been sent/received, and it’s never determined how many packets we are required to inspect before we can consider the connection to be safe…

I also found an interesting old thread in OSR:

https://community.osr.com/discussion/290695/wfp-callout-driver-layer2-filtering

In it @NILC says:

However, I am encapsulating packets and needed the ability to be able to create NBL chains in order to improve performance when dealing with large file transfers and the like (i.e. typically for every 1 packet during an SMB file transfer one needs to generate at least 2 packets per 1 original packet because of MTU issues)

But I don’t get why “for every 1 packet during an SMB file transfer one needs to generate at least 2 packets per 1 original packet because of MTU issues”? And what does it mean to encapsulate packets to overcome this issue?

@brad_H said:

@craig_howard said:
I would definitely build the “inspect” WFP sample and have it do the same kind of packet scan (or even just a short delay, to simulate a scan) and see what the result is …

Well, I tried the “inspect” WFP sample that Microsoft has written, to see how well WFP performs.

But to my surprise it’s even worse than NDIS. After I removed the IP-based filter (because the default code only monitors one IP), file transfer speed from shares gets reduced from 80MB/s to 20MB/s, and this is without me implementing any packet content checking! Note that in the NDIS case it only got reduced after I implemented a simple packet check.

So is this normal? Why is WFP so much worse than NDIS in terms of reducing the transfer speed of files through shares?

Note that the only thing I changed in the Inspect source code was the IP-based filter.

So… what should I test next? Does this mean NDIS is better for deep packet inspection?

(We want to monitor every IP and port, so adding port/IP filters is not gonna work)

Hmm … OK, my interest has been piqued so I’m going to load up the WFP filter and see what’s up … so that we’re on the same sheet of music, could you verify some of my assumptions …

  • You’re using the “inspect” WFP sample on the github repo, which was updated two months ago
  • You’re using the latest VS2019 … which you’ll know is the latest because the sample won’t compile without putting the line “#define _NO_CRT_STDIO_INLINE” in the main header before anything else due to a bug that the MS folks introduced in the latest update
  • You’re doing all of the packet introspection in the context of the filter; no passing data into userland, no IOCTL’s, nothing, everything entirely done in the kernel context. I plan on simply sniffing the packet header looking for the IP of a machine on my network and putting out a debug message when that hits, so as to minimize any performance impact from the introspection
  • You’re examining raw TCP and UDP packets, without any SSL decodes
  • You’re running in debug mode under the KMDF verifier

Are these assumptions accurate?

We want to monitor every IP and port, so adding port/IP filters is not gonna work

What are you going to do with this info in your NDIS driver? Unless we are speaking about requests on some port that is well known in advance and associated with a certain well-known service, the address/port combination alone may not always be sufficient to identify your target recipient/sender process, which means you need a WFP part of the solution anyway. This “upper” WFP part is going to relate the address/port combination to a particular process, which allows the “lower” NDIS filter to make actual use of this info.

In fact, the WFP part alone should normally suffice. Introducing an NDIS filter is nothing more than an optimisation that allows you to block the “undesirable” packets before they have even had a chance to reach the protocol stack, effectively improving performance. However, the opposite is not true - as long as you want to relate your target packets of interest to some particular process, the WFP-level part of the solution is an absolute must.
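As a small illustration of that “relate to a process” point: in a WFP callout at an ALE layer (for example FWPM_LAYER_ALE_AUTH_CONNECT_V4), the classify metadata already carries the owning process ID, so the callout can tie the 5-tuple to a process without any extra lookup. A sketch, with the usual build details omitted and the helper name made up:

```c
/*
 * Sketch: pull the owning process ID out of the WFP classify metadata at an
 * ALE layer. GetClassifyProcessId is a hypothetical helper, typically called
 * from a callout's classifyFn.
 */
#include <ntddk.h>
#pragma warning(push)
#pragma warning(disable:4201)   // nameless struct/union in fwpsk.h
#include <fwpsk.h>
#pragma warning(pop)

// Returns the PID associated with the classify call, or 0 if not available.
UINT64 GetClassifyProcessId(_In_ const FWPS_INCOMING_METADATA_VALUES0 *MetaValues)
{
    if (FWPS_IS_METADATA_FIELD_PRESENT(MetaValues, FWPS_METADATA_FIELD_PROCESS_ID)) {
        return MetaValues->processId;
    }
    return 0;
}
```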

This is not gonna work, because in the case of many exploits, if you let one packet slide then the attacker all of a sudden has kernel code execution.

Well, if this is the case, it means that the target machine has already been compromised right at the kernel level, right? This, in turn, automatically implies that it is already too late to do anything about it. Just to give you an idea, what holds the KM attacker back from simply modifying the callback chain, effectively disabling your driver(s) and turning them into a piece of dead code that never actually gets executed, despite being physically loaded in RAM???

Anton Bassov