How Many I/Os Per Second

Mr. @Don_Burn raised a question in my mind about the number of I/O Requests per second one can push through a KMDF driver, using one thread doing synchronous I/O.

I decided to do a very quick and simple test. Here’s my code:

#include <windows.h>
#include <cstdio>
#include <thread>
#include <chrono>

constexpr auto READ_BUFFER_SIZE = 4096;

void
SendIOs();

void
CountIOs();

UCHAR _ReadBuffer[READ_BUFFER_SIZE];
HANDLE _DeviceHandle;
volatile ULONGLONG _CompletedRequests;
constexpr auto MS_BETWEEN_CHECKS = 5'000;
ULONGLONG _OpsLastPeriod = 0;

BOOL
WINAPI ConsoleHandler(DWORD signal) {

    if (signal == CTRL_C_EVENT) {

        printf("^C Exits:\n");

        ExitProcess(1);
    }

    return TRUE;
}

int main()
{
    ULONG code;

    if (!SetConsoleCtrlHandler(ConsoleHandler, TRUE)) {

        printf("\nERROR: Could not set control handler\n");

        exit(1);
    }

    //
    // Open the nothing device by name
    //
    _DeviceHandle = CreateFile(L"\\\\.\\NOTHING",
//    _DeviceHandle = CreateFile(L"F:\\_work\\x.txt",
                              GENERIC_READ | GENERIC_WRITE,
                              0,
                              nullptr,
                              OPEN_EXISTING,
                              0,
                              nullptr);

    if (_DeviceHandle == INVALID_HANDLE_VALUE) {

        code = GetLastError();

        printf("CreateFile failed with error 0x%lx\n",
               code);

        return (code);
    }

    std::thread SendThread(SendIOs);
    std::thread CountThread(CountIOs);

    SendThread.join();
    CountThread.join();

    ExitProcess(0);
}

void
CountIOs()
{
    using timer = std::chrono::high_resolution_clock;
    timer::time_point clock_start;
    timer::time_point clock_end;
    timer::duration   elapsed_time;
    std::chrono::milliseconds elapsed_ms;

    //
    // Prime the clock so the first interval isn't measured from the epoch
    //
    clock_start = timer::now();

    while(TRUE) {
        
        Sleep(MS_BETWEEN_CHECKS);

        clock_end = timer::now();

        elapsed_time = clock_end - clock_start;

        clock_start = timer::now();

        elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(elapsed_time);

        printf("Total IOs: %llu\n", _CompletedRequests);
        printf("IOs last period: %llu\n", _CompletedRequests-_OpsLastPeriod);

        printf("IOs/Second: %lu\n", (ULONG)( (_CompletedRequests-_OpsLastPeriod) / (elapsed_ms.count()/1000)));

        _OpsLastPeriod = _CompletedRequests;

    }

    return;
}

void
SendIOs()
{
    DWORD bytesRead;
    DWORD code;

    while(true) {

        //
        // Send a read
        //
        if (!ReadFile(_DeviceHandle,
                      _ReadBuffer,
                      sizeof(_ReadBuffer),
                      &bytesRead,
                      nullptr)) {

            code = GetLastError();

            printf("ReadFile failed with error 0x%lx\n",
                   code);

            ExitProcess(code);
        }

        _CompletedRequests++;

    } // while TRUE

    return;
}

Using the code above, sending 4K reads, and using the NOTHING_KMDF driver that we use in our WDF Seminar from GitHub (which uses SEQUENTIAL dispatching and completes every read with success and zero bytes)… I ~~get 310K I/Os per second. If I make the buffer size on the read zero (and DO NOT change the Queuing to allow zero-length Requests) I get about 410K I/Os per second. This last number pretty much represents just the time through the I/O Manager and the Framework (in and out), as the driver never gets called.~~ (see below for an update)
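
For context, the driver side of this test is about as simple as KMDF gets. Here's a minimal sketch (my reconstruction, not the actual NOTHING_KMDF source from GitHub) of the relevant parts: a default Queue using SEQUENTIAL dispatching, and an EvtIoRead that completes every Request with success and zero bytes:

VOID
NothingEvtIoRead(WDFQUEUE Queue,
                 WDFREQUEST Request,
                 size_t Length)
{
    UNREFERENCED_PARAMETER(Queue);
    UNREFERENCED_PARAMETER(Length);

    //
    // Complete every read immediately: STATUS_SUCCESS, zero bytes
    //
    WdfRequestCompleteWithInformation(Request, STATUS_SUCCESS, 0);
}

// In EvtDriverDeviceAdd, after WdfDeviceCreate:

WDF_IO_QUEUE_CONFIG queueConfig;

WDF_IO_QUEUE_CONFIG_INIT_DEFAULT_QUEUE(&queueConfig,
                                       WdfIoQueueDispatchSequential);

queueConfig.EvtIoRead = NothingEvtIoRead;

status = WdfIoQueueCreate(device,
                          &queueConfig,
                          WDF_NO_OBJECT_ATTRIBUTES,
                          WDF_NO_HANDLE);

(Switching to Parallel dispatching, mentioned below, is just a matter of passing WdfIoQueueDispatchParallel instead.)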

If I change the dispatching to Parallel there’s no significant change in the throughput (strangely, it seems like it’s a tiny bit lower if anything).

While this is higher than what Mr. Burn reported, it is significantly lower than I expected.

The test hardware was an Intel Core i7-9850HE CPU @ 2.70GHz – The system was connected to the Internet during the tests, has some custom hardware installed (but not active), but was otherwise stock Windows 10 Pro 20H2 (19042.1466).

If I have the time, I’ll run some more tests with some more sophisticated user-mode schemes. But no guarantees.

I’d be very curious if other folks are able to repeat these numbers – or provide a critique of my test code above.

Peter

For the archives, I should state:

  • This is just about the most naive way to send reads to a driver that you could possibly devise. There’s a lot of “dead time” on the device here, while we wait for the one request that we send to finish before we send another.
  • There was no attempt at real “test hygiene” – The system under test happens to be one I had already connected to the debugger for another project, and I made no attempt at all to be sure it was clean, quiet, or otherwise prepared for the test.
  • I didn’t do a pile of repeated test runs. I ran the test a few times, I changed the app or the driver, and ran the test a few more times… then I posted here.

So, the test wasn’t intended to be scientific or to demonstrate the maximum throughput that a WDF driver is capable of. It’s just some quick data.

OK?

Peter

As I am sure Peter is aware, the above test code will have a low degree of accuracy. I would not expect it to materially affect the results reported, but the multiple accesses of _CompletedRequests, clock_end, and clock_start in the CountIOs thread will reduce the precision of the values reported.

Also, given this pattern, a reduction in the results from parallel dispatching would be something that I would expect from the increased overhead.

This particular system has all cores on a single die and so there won’t be any NUMA effects. That will make the consistency of the reported results better.

Whew! Well… THAT’s more like it.

I was “concerned”, to put it mildly, that I was only seeing 310K IOPS through a KMDF driver that does nothing. That just did not square with my experience or expectations. And it was definitely not good news for the world.

Well… now I know why: It seems I ran my initial round of tests with KMDF Verifier enabled. Duh!
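
(For anyone checking their own setup: KMDF Verifier is typically toggled per-driver in the registry, under the driver's service key. This is from memory, so verify against the docs:

HKLM\SYSTEM\CurrentControlSet\Services\<YourDriver>\Parameters\Wdf
    VerifierOn : REG_DWORD : 1 = enabled, 0 = disabled

…and the setting is read when the driver loads.)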

When I disable KMDF Verifier, on the same system I ran the tests on above… I now get a very stable 1.17 MILLION I/Os per second, from the very naive synchronous code I posted above (with 4K buffers, returning 0 bytes and success for the read). This is what I expected when I replied to Mr. Burn so long ago.

Changing the user-mode I/O model to a “more sophisticated” overlapped model, using thread pools (via CreateThreadpoolIo with all the default parameters… who knows what they are), and posting 100 async I/O operations at a time… I actually get LESS throughput: only about 864,000 IOPS. Which is very strange. What’s stranger, in fact, is that the number of I/Os completed per second starts out at about 970,000 per second… and slowly DECREASES to a point where it settles down to about 865,000 per second or so. Weird.
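
In case it helps anyone reproduce this, the overlapped variant was shaped roughly like the sketch below. This is a reconstruction, not my exact test code: it assumes the handle was opened with FILE_FLAG_OVERLAPPED, and the names (PENDING_IOS, PostRead, IoDone) are illustrative.

constexpr ULONG PENDING_IOS = 100;      // Requests kept in flight

OVERLAPPED _Overlapped[PENDING_IOS];    // One OVERLAPPED per in-flight Request
PTP_IO _TpIo;

void
PostRead(OVERLAPPED* Ov)
{
    ZeroMemory(Ov, sizeof(*Ov));

    //
    // Must be called before EACH I/O we want the pool to track
    //
    StartThreadpoolIo(_TpIo);

    //
    // All the reads share one buffer... which is only OK here because
    // the Nothing driver never actually transfers any data
    //
    if (!ReadFile(_DeviceHandle,
                  _ReadBuffer,
                  sizeof(_ReadBuffer),
                  nullptr,
                  Ov) && GetLastError() != ERROR_IO_PENDING) {

        //
        // The I/O never started, so tell the pool to forget about it
        //
        CancelThreadpoolIo(_TpIo);

        ExitProcess(GetLastError());
    }
}

VOID CALLBACK
IoDone(PTP_CALLBACK_INSTANCE, PVOID, PVOID Overlapped,
       ULONG IoResult, ULONG_PTR, PTP_IO)
{
    if (IoResult != NO_ERROR) {
        ExitProcess(IoResult);
    }

    InterlockedIncrement64((volatile LONG64*)&_CompletedRequests);

    //
    // Immediately re-post, to keep PENDING_IOS Requests outstanding
    //
    PostRead(static_cast<OVERLAPPED*>(Overlapped));
}

// Setup, in main, after CreateFile (error handling elided):

_TpIo = CreateThreadpoolIo(_DeviceHandle, IoDone, nullptr, nullptr);

for (auto& ov : _Overlapped) {
    PostRead(&ov);
}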

I have one more test I want to run, if I get the time. I’ll post the results here.

Peter

OK… I couldn’t resist… and I had the code just sitting here (in another driver).

Changing the I/O operation from READ to IOCTL… and implementing Fast I/O for Device Control in the Nothing driver (and, again, completing the Request with STATUS_SUCCESS and 0 bytes transferred), I get between 2.04 million and 2.2 million I/O operations per second.
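
For the curious, the hook-up looks approximately like the sketch below. This is the general pattern, not the code from my driver, and the names are illustrative. Note that it’s pure WDM plumbing, bolted onto the KMDF driver’s WDM DRIVER_OBJECT; none of it goes through the Framework (and see the caveats that follow):

BOOLEAN
NothingFastIoDeviceControl(PFILE_OBJECT FileObject,
                           BOOLEAN Wait,
                           PVOID InputBuffer,
                           ULONG InputBufferLength,
                           PVOID OutputBuffer,
                           ULONG OutputBufferLength,
                           ULONG IoControlCode,
                           PIO_STATUS_BLOCK IoStatus,
                           PDEVICE_OBJECT DeviceObject)
{
    //
    // (UNREFERENCED_PARAMETER noise omitted for brevity)
    //
    // The buffer pointers and lengths arrive raw from user mode, with
    // NO validation by the I/O Manager. A real driver must validate
    // everything before touching them.
    //

    IoStatus->Status = STATUS_SUCCESS;
    IoStatus->Information = 0;          // Zero bytes transferred

    return TRUE;                        // TRUE == handled; no IRP is built
}

FAST_IO_DISPATCH NothingFastIoDispatch;     // Global, so zero-initialized

// In DriverEntry, after WdfDriverCreate:

NothingFastIoDispatch.SizeOfFastIoDispatch = sizeof(FAST_IO_DISPATCH);
NothingFastIoDispatch.FastIoDeviceControl = NothingFastIoDeviceControl;

WdfDriverWdmGetDriverObject(WdfGetDriver())->FastIoDispatch =
                                                    &NothingFastIoDispatch;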

This is much more closely aligned with my expectations… though I swear I have measured 3 million IOPS on at least one previous occasion.

Please don’t take this as a wholesale endorsement of people using Fast I/O for IOCTL in device drivers. It’s almost never appropriate, it’s not a “KMDF thing” at all, and it comes with so many risks I can’t even list them.

So, to remind you… none of this is super scientific, but it WAS fun. And it shows a few things:

  1. If performance is important, don’t ship with Verifier enabled in your driver. Duh.
  2. Plain old, boring, synchronous I/O is remarkably efficient… You don’t need to resort to thread pools and such in most cases. Let your application requirements, not driver throughput, be your guide as to how you send and complete I/O Requests.
  3. If you REALLY need more than a million or so IOPS from your KMDF driver, there are alternative strategies.

There are a ton of other scenarios I’d like to test (like, completing requests via the InCallerContext Event Processing Callback)… but I’ve got to do real work at some point today.

Hope you enjoyed the results as much as I did,

Peter


On 2022-02-10 12:12 p.m., Peter_Viscarola_(OSR) wrote:

Hope you enjoyed the results as much as I did,

This was informative. I thoroughly appreciate your susceptibility to nerd sniping. :-)

Good stuff! Remember that overlapped IO is going to involve an APC transition someplace in the kernel, which will add some overhead … in my experience overlapped is rarely (i.e. never) useful except for things that take a really long time, on the order of 100’s of ms’s, to complete. Your request completion times are about an order of magnitude less than that … :-)

overlapped IO is going to involve an APC transition someplace

Hmmmm… but… if you’re talking about The Special Kernel APC for I/O Completion, that’s not triggered by the use of OVERLAPPED (what the user requests). It’s triggered by the return of STATUS_PENDING (what the driver provides), which is universal in WDF.

So, no… you don’t get this APC as a result of the user’s request for asynchronous I/O.

Peter


Peter, could you please elaborate on your statement about not using Fast I/O for better throughput; i.e., “it comes with so many risks”?
Thank you

Could you please elaborate on your statement for not using Fast I/O

It would take me more space than I have here to do a good job of that.

Fast I/O really only belongs in file systems (and filters). That’s what it was “invented” for. It also happens to be supported, but only for IOCTLs, in device drivers. I’ll try to briefly enumerate the primary reasons why it’s usually an enormously bad idea to use it in a device driver:

  • It requires you do all your processing in the context of the caller. So, the caller gets synchronous I/O handling, regardless of whether they want it or not.
  • It runs in the context of the caller, so a driver that uses it can’t be filtered (without being aware the underlying driver uses Fast I/O).
  • It’s entirely outside the bounds of WDF – so none of the state management that you ordinarily get from KMDF (like NOT having Requests arrive at your EvtIo Event Processing Callbacks before your driver is “ready” for them) applies. This means you need to maintain your own PnP and Power state within your driver. This can also create problems with stop/remove.
  • For Fast I/O for Device Control, the buffer parameters (InBuffer, InBufferLength, OutBuffer, and OutBufferLength) are passed from the user with absolutely no validation whatsoever. This means that you must be absolutely scrupulous in your validation of these parameters, or your users are in for some very big surprises.
  • Verifier (either KMDF Verifier or Windows Driver Verifier, I can never remember which one) does not expect it and does not support it. In fact, one of the verifiers routinely instantiates a filter over your Device Object… so you never see any Fast I/O calls (you get IRPs instead). Now you have two ways to get the same IOCTL… and two paths to debug.
  • If you need to do any buffer access in an arbitrary process context (like, in your DpcForIsr) Fast I/O is impossible to work with. You need to punt the Fast I/O request back to the OS and get an IRP created. Again, you now have two paths to debug.
  • SDV doesn’t understand it.
  • CA will tell you that the FastIoDispatch entry doesn’t belong to you.
  • It’s not well known, or well understood, by the vast majority of Windows device driver developers. When you write a device driver that uses Fast I/O, you’re basically creating a piece of software that few people will understand, and that will (therefore) be more difficult than average to maintain.

Fast I/O is rarely “the best” solution for a driver design. If you need in-caller-context processing, there’s a callback for that (see the sketch below), and it’s not Fast I/O.
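
That callback is EvtIoInCallerContext, and registering it is a one-liner. A minimal sketch (names illustrative):

VOID
NothingEvtIoInCallerContext(WDFDEVICE Device, WDFREQUEST Request)
{
    UNREFERENCED_PARAMETER(Device);

    //
    // Runs in the context of the requesting thread/process. Do any
    // caller-context work here, then either complete the Request...
    //
    WdfRequestCompleteWithInformation(Request, STATUS_SUCCESS, 0);

    //
    // ...or hand it back for normal Framework processing with
    // WdfDeviceEnqueueRequest(Device, Request)
    //
}

// In EvtDriverDeviceAdd, before WdfDeviceCreate:

WdfDeviceInitSetIoInCallerContextCallback(DeviceInit,
                                          NothingEvtIoInCallerContext);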

I wrote the above list from whatever occurred to me as I typed, without any serious forethought. Despite that, I hope it’s helpful.

Peter

Thank you very much Peter for the very detailed explanation.

While it is obvious that overlapped IO gets better the longer an IOP takes - especially the longer it takes on the hardware side, and the more concurrent operations the hardware can support - I fail to see why an APC is compulsory.