Sharing large volumes of data between user and kernel

I know this topic has been discussed here, and I read almost everything I could find in previous OSR threads about it before posting. However, I am still unsure of the best approach.

My goal:
Create a fast communication channel between user mode and kernel mode. The kernel dumps data to it very frequently, and user mode needs to consume it ASAP to free up space for the next set of records. A typical producer/consumer scenario.

The approaches I have seen:

  1. User allocates memory and sends it down via IOCTL; the kernel locks it down until the process exits.
  2. Kernel allocates memory and maps it into user space.
  3. A device that implements IRP_MJ_READ, with data returned via asynchronous reads and I/O completion ports (a user-mode sketch follows below).
  4. METHOD_NEITHER.

These methods all have different advantages and disadvantages. However, since security is critical, I am looking for a ranking of these from best to worst, with some explanation of why (it need not be detailed; pointers for my own research are fine). That explanation is what I couldn’t find in the earlier discussions.

EDIT1: The record size is constant; the records can potentially be CPU perf counters (potentially, because another driver gives us the data to steward, so the collection point is a bit opaque). The chunks are not large, in the range of a few KB at a time.
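For concreteness, here is roughly what approach 3 looks like from the user-mode side: overlapped reads on the device handle, with completions drained through an I/O completion port. This is only a sketch; the device name, chunk size, and single-read-in-flight flow are illustrative assumptions, not something prescribed in this thread.

    /*
     * Illustrative consumer for approach 3: overlapped ReadFile on the
     * device, completions collected through an I/O completion port.
     * "\\.\MyDataDevice" and RECORD_CHUNK_SIZE are placeholders.
     */
    #include <windows.h>
    #include <stdio.h>

    #define RECORD_CHUNK_SIZE 4096  /* "a few KB at a time" */

    int main(void)
    {
        HANDLE hDevice = CreateFileW(L"\\\\.\\MyDataDevice", GENERIC_READ,
                                     0, NULL, OPEN_EXISTING,
                                     FILE_FLAG_OVERLAPPED, NULL);
        if (hDevice == INVALID_HANDLE_VALUE) return 1;

        /* Associate the device handle with a completion port. */
        HANDLE hPort = CreateIoCompletionPort(hDevice, NULL, 0, 0);
        if (hPort == NULL) return 1;

        BYTE buffer[RECORD_CHUNK_SIZE];
        OVERLAPPED ov;

        for (;;) {
            ZeroMemory(&ov, sizeof(ov));

            /* Post a read; it will normally pend in the driver. */
            if (!ReadFile(hDevice, buffer, sizeof(buffer), NULL, &ov) &&
                GetLastError() != ERROR_IO_PENDING) {
                break;
            }

            DWORD bytes;
            ULONG_PTR key;
            OVERLAPPED *pOv;
            if (!GetQueuedCompletionStatus(hPort, &bytes, &key, &pOv,
                                           INFINITE)) {
                break;
            }

            /* Consume 'bytes' bytes of records from 'buffer' here. */
            printf("got %lu bytes\n", bytes);
        }

        CloseHandle(hPort);
        CloseHandle(hDevice);
        return 0;
    }

A real consumer would keep several reads in flight so the driver always has a buffer to complete into, but even this single-read loop is enough to measure a baseline.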

The first question is: what do you consider a large volume of data, and in what size chunks are you talking? It is both the rate of data you need and the size of the pieces being requested that determine the best model.

@Don_Burn said:
The first question is: what do you consider a large volume of data, and in what size chunks are you talking? It is both the rate of data you need and the size of the pieces being requested that determine the best model.

The record size is constant; the records can potentially be CPU perf counters (potentially, because another driver gives us the data to steward, so the collection point is a bit opaque). The chunks are not large, in the range of a few KB at a time.

You should probably first just try IOCTLs with buffered I/O if each transfer is less than 4 KB. See what you get for performance, then start tuning. I did this with a somewhat different I/O model; our initial results were about 100,000 requests per minute, and we got that up to 750,000 with the right tweaks. But until you know what you are getting, and what you need, start easy.
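As a rough sketch of what “IOCTLs with buffered I/O” means on the KMDF side; the IOCTL code and the MydevCopyRecords helper are invented names, assuming the driver buffers records internally:

    /*
     * Minimal KMDF EvtIoDeviceControl for a buffered (METHOD_BUFFERED)
     * IOCTL. IOCTL_MYDEV_GET_RECORDS and MydevCopyRecords() are
     * hypothetical names, not from this thread.
     */
    #include <ntddk.h>
    #include <wdf.h>

    #define IOCTL_MYDEV_GET_RECORDS \
        CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_READ_DATA)

    /* Hypothetical helper that drains up to 'Length' bytes of collected
     * records into 'Buffer' and returns the number of bytes copied. */
    extern size_t MydevCopyRecords(_Out_writes_bytes_(Length) void *Buffer,
                                   size_t Length);

    VOID
    MydevEvtIoDeviceControl(
        _In_ WDFQUEUE Queue,
        _In_ WDFREQUEST Request,
        _In_ size_t OutputBufferLength,
        _In_ size_t InputBufferLength,
        _In_ ULONG IoControlCode
        )
    {
        UNREFERENCED_PARAMETER(Queue);
        UNREFERENCED_PARAMETER(InputBufferLength);

        NTSTATUS status = STATUS_INVALID_DEVICE_REQUEST;
        size_t copied = 0;

        if (IoControlCode == IOCTL_MYDEV_GET_RECORDS) {
            void *outBuf;

            /* With METHOD_BUFFERED, KMDF hands us the system copy of the
             * user buffer; no probing or locking is needed here. */
            status = WdfRequestRetrieveOutputBuffer(Request,
                                                    1, /* minimum size */
                                                    &outBuf,
                                                    NULL);
            if (NT_SUCCESS(status)) {
                copied = MydevCopyRecords(outBuf, OutputBufferLength);
            }
        }

        WdfRequestCompleteWithInformation(Request, status, copied);
    }

Because METHOD_BUFFERED gives the driver a system-space copy of the user buffer, there is nothing to probe, lock, or map, which is a large part of why it is the safe default.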

The approaches I have seen:

  1. User allocates memory and sends it down via IOCTL; the kernel locks it down until the process exits.
  2. Kernel allocates memory and maps it into user space.
  3. A device that implements IRP_MJ_READ, with data returned via asynchronous reads and I/O completion ports.
  4. METHOD_NEITHER.

These methods all have different advantages and disadvantages. However, since security is critical, I am looking for a ranking of these from best to worst, with some explanation of why (it need not be detailed; pointers for my own research are fine). That explanation is what I couldn’t find in the earlier discussions.

Well, if you are concerned about security, option 2 has to go out of the window right on the spot, and option 4 should follow suit shortly, while options 1 and 3 are indeed worth considering. Assuming that the data rate is indeed high, you can combine these two into a quite efficient and safe solution.

For example, you can send a METHOD_DIRECT IOCTL of type X with a large buffer, and pend it in the driver. As a result, you get a buffer that is shared by the driver and the app until the driver completes that IOCTL. However, unlike a manually implemented shared buffer, the system takes care of everything you would otherwise have to handle yourself (for example, abnormal termination of the target application). To signal data arrival to the app, you can use another IOCTL of type Y. The app submits a few IOCTLs of type Y to the driver, and the driver pends them. When the driver wants to inform the app of the event of interest (i.e., data arrival), it simply writes the data to the shared buffer and completes one of the outstanding type-Y IOCTLs with the appropriate info. At that point the app can read the data from the shared buffer using the info the driver provided with the completed IOCTL.
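A minimal KMDF sketch of the type-Y half of this inverted-call pattern, assuming a manual-dispatch queue for parking the notification IOCTLs; the routine names and the MYDEV_DATA_READY layout are invented for illustration:

    /*
     * Inverted-call sketch: the app keeps a few "type Y" IOCTLs pended in
     * a manual queue; when data lands in the shared buffer, the driver
     * completes one of them with the offset/length of the new records.
     * NotificationQueue and MYDEV_DATA_READY are hypothetical names.
     */
    #include <ntddk.h>
    #include <wdf.h>

    typedef struct _MYDEV_DATA_READY {
        ULONG Offset;   /* where in the shared buffer the records start */
        ULONG Length;   /* how many bytes arrived */
    } MYDEV_DATA_READY;

    /* In EvtIoDeviceControl: park type-Y requests instead of completing. */
    VOID
    MydevPendNotificationRequest(
        _In_ WDFREQUEST Request,
        _In_ WDFQUEUE NotificationQueue /* manual-dispatch queue */
        )
    {
        NTSTATUS status = WdfRequestForwardToIoQueue(Request,
                                                     NotificationQueue);
        if (!NT_SUCCESS(status)) {
            WdfRequestComplete(Request, status);
        }
    }

    /* Called after new records have been written to the shared buffer. */
    VOID
    MydevSignalDataArrival(
        _In_ WDFQUEUE NotificationQueue,
        _In_ ULONG Offset,
        _In_ ULONG Length
        )
    {
        WDFREQUEST request;
        NTSTATUS status = WdfIoQueueRetrieveNextRequest(NotificationQueue,
                                                        &request);
        if (!NT_SUCCESS(status)) {
            return; /* no pended IOCTL right now; note the data for later */
        }

        MYDEV_DATA_READY *info;
        status = WdfRequestRetrieveOutputBuffer(request, sizeof(*info),
                                                (PVOID *)&info, NULL);
        if (NT_SUCCESS(status)) {
            info->Offset = Offset;
            info->Length = Length;
            WdfRequestCompleteWithInformation(request, STATUS_SUCCESS,
                                              sizeof(*info));
        } else {
            WdfRequestComplete(request, status);
        }
    }

The attraction of this layout is that cancellation and process death are handled by the I/O manager and KMDF: if the app dies, the pended requests are cancelled and no kernel mapping is left dangling.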

Anton Bassov

@anton_bassov Thanks!
I believe what you described is called the inverted-call I/O model.

But what is wrong with #2?

  1. The driver has more control over the memory buffer, and can allocate it from non-executable (NX) pool with its own tag.
  2. Since it is allocated in the kernel, it can come from non-paged pool directly, whereas with a user-mode buffer we have to VirtualLock() it, or on the kernel side lock the pages and map them into a safe system range.
  3. No need for pended IOCTLs, and no concern that the app crashes and takes the buffer away, leading to leaks or BSODs.
  4. No need to check the page protection and make the buffer non-executable, lest the app writer decide to send an executable buffer our way.

In short, the kernel simply has more control over it (a rough sketch of this pattern follows below).
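For reference, here is roughly the sequence option 2 reduces to. This is a hedged sketch only, with invented names, showing the allocate/MDL/map steps whose teardown and process-context requirements are exactly where this approach gets dangerous:

    /*
     * Option 2 sketch: driver allocates non-paged memory and maps it into
     * the requesting process. Must run in the context of that process,
     * and every error/teardown path must unmap and free. Names and the
     * pool tag are illustrative.
     */
    #include <ntddk.h>

    #define MYDEV_POOL_TAG 'vdyM'

    NTSTATUS
    MydevMapSharedBuffer(
        _In_ SIZE_T Size,
        _Out_ PVOID *KernelVa,   /* driver's view of the buffer */
        _Out_ PVOID *UserVa,     /* app's view of the same pages */
        _Out_ PMDL *Mdl
        )
    {
        /* POOL_FLAG_NON_PAGED yields non-executable pool memory. */
        PVOID va = ExAllocatePool2(POOL_FLAG_NON_PAGED, Size,
                                   MYDEV_POOL_TAG);
        if (va == NULL) return STATUS_INSUFFICIENT_RESOURCES;

        PMDL mdl = IoAllocateMdl(va, (ULONG)Size, FALSE, FALSE, NULL);
        if (mdl == NULL) {
            ExFreePoolWithTag(va, MYDEV_POOL_TAG);
            return STATUS_INSUFFICIENT_RESOURCES;
        }
        MmBuildMdlForNonPagedPool(mdl);

        PVOID userVa = NULL;
        __try {
            /* Map into the *current* process; this must be the consuming
             * app, and the mapping must stay non-executable. */
            userVa = MmMapLockedPagesSpecifyCache(
                         mdl, UserMode, MmCached, NULL, FALSE,
                         NormalPagePriority | MdlMappingNoExecute);
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            userVa = NULL;
        }
        if (userVa == NULL) {
            IoFreeMdl(mdl);
            ExFreePoolWithTag(va, MYDEV_POOL_TAG);
            return STATUS_INSUFFICIENT_RESOURCES;
        }

        *KernelVa = va;
        *UserVa = userVa;
        *Mdl = mdl;
        return STATUS_SUCCESS;

        /* Teardown (on last handle close, in the same process context):
         * MmUnmapLockedPages(userVa, mdl); IoFreeMdl(mdl);
         * ExFreePoolWithTag(va, MYDEV_POOL_TAG);
         */
    }

Every failure path here (the map raising an exception, the process dying with the mapping live, teardown running in the wrong process context) is yours to handle by hand, which is what the pended-IOCTL approach gets from the system for free.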

Just as a general rule, METHOD_NEITHER is never the right answer.

You are prematurely optimizing something that is unlikely to be worth much effort. You have declined to answer Mr. Burn’s reasonable question about the size and frequency of the data to be transferred.

In short, you are wasting your (and our) time.

Peter

As I said in an earlier comment, do something simple first and get a handle on what is needed. As Peter pointed out, optimizing early just wastes time. Assuming this is a KMDF driver and you have put together the support for the data collection, testing the interface to user land, if it is an IOCTL, will take less than an hour. Right now you could spend days deciding on the “best solution” only to find that a simple and safe approach would have met the needs of the system from the beginning.

I have used all four of your techniques over the years (though METHOD_NEITHER probably not in the way you are approaching it), but even with WDM it was worth doing the simple approach first; with KMDF it is ridiculous not to test the driver and see what the performance is with the simple approach before you go further.

But what is wrong with #2?

Check the archives - we have discussed this many times in this NG. For example, the following thread may be of interest:

https://community.osr.com/discussion/292714/allocatecontiguousmemoryspecifycache-freecontiguousmemory

Anton Bassov

@“Peter_Viscarola_(OSR)” said:
You are prematurely optimizing something that is unlikely to be worth much effort.

But I did not. I merely asked a question here to find the best way, and @Don_Burn did give his feedback!

You have declined to answer Mr. Burn’s reasonable question about the size and frequency of the data to be transferred.

But I did: I responded to @Don_Burn in two places and also added an EDIT to the original post so future readers get the gist.

In short, you are wasting your (and our) time.

I won’t hide it: this pisses me off. There is no reason to be impolite or rude. I know you are a wizard of the kernel, and perhaps a busy person; if your time is so valuable, you needn’t read or answer the post at all.

@anton_bassov said:

But what is wrong with #2?

Check the archives - we have discussed this many times in this NG. For example, the following thread may be of interest:

https://community.osr.com/discussion/292714/allocatecontiguousmemoryspecifycache-freecontiguousmemory

Anton Bassov

Thanks, @anton_bassov, this is a very good read indeed.

@Don_Burn said:
As I said in an earlier comment, do something simple first and get a handle on what is needed. As Peter pointed out, optimizing early just wastes time. Assuming this is a KMDF driver and you have put together the support for the data collection, testing the interface to user land, if it is an IOCTL, will take less than an hour. Right now you could spend days deciding on the “best solution” only to find that a simple and safe approach would have met the needs of the system from the beginning.

I have used all four of your techniques over the years (though METHOD_NEITHER probably not in the way you are approaching it), but even with WDM it was worth doing the simple approach first; with KMDF it is ridiculous not to test the driver and see what the performance is with the simple approach before you go further.

I hear you. The goal was to hit the ground running with the best solution rather than testing each approach, which is why the collective experience of this forum helps. I will definitely follow your suggestions.

to hit the ground running with the best solution rather than testing each approach

That is not a good goal, because it can never be achieved. You can’t know what’s “best” without actual data. As we have told you, start with a simple implementation, and if it does not meet your performance goals, trade off complexity for performance.

I merely asked a question here to find the best way

And we gave you your answer: code it up using buffered I/O and see if it’s fast enough. If not, THEN and only then, consider other approaches. Even after we told you this, you persisted in asking “how about method #2”…

But I did: I responded to @Don_Burn in two places

Yeah? But we still don’t know the THROUGHPUT you need. You said “a few KB”, but HOW FREQUENTLY? This is a key part of the equation, because the cost of moving “a few KB” of data is likely to be dwarfed by the per-operation overhead. A few KB once per second is different from a few KB a hundred times each millisecond, right?
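To put rough numbers on that (4 KB chunks are an assumption for illustration only):

    4 KB ×   1/s  =   4 KB/s
    4 KB × 100/ms = 400 MB/s   (100,000 chunks per second)

Those two workloads call for very different designs.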

Am I missing the place where you told us the total aggregate throughput you need?

this pisses me off. There is no reason to be impolite or rude

Careful. You come into MY house and you complain about the free dinner I serve you? Hmmmm…

If you have further complaints you can email me directly to rail about my perceived injustices. But this thread is now locked.

Peter