
Timer for Synchronization

mwendling Member Posts: 4

My driver needs to send an Ethernet packet every 250ms for synchronization with other hardware. I have a user-mode app for testing which gives very good performance using the chrono high-resolution clock in C++. But when I try to bring the timer into my driver, using KeQueryUnbiasedInterruptTime to query the time and KeDelayExecutionThread for the delay, the resolution is far from acceptable and varies by several milliseconds (intervals of 247ms to 258ms). I also tried KeQueryPerformanceCounter and had about the same results. Is there a better approach, or would you expect this to work as well as the user-mode app and assume that I'm not implementing it correctly in the driver?

Comments

  • Jeffrey_Tippet_[MSFT] Member - All Emails Posts: 577

    Windows NT is not a realtime OS. You already know this, but that fact is going to color every bit of this discussion. You can often get "pretty okay" results on NTOS, and in very constrained setups, you can get "surprisingly good" results -- at least, surprisingly good for a non-RTOS. But you cannot get perfect results all the time.

    If you have hard millisecond requirements on the timing of the packets, you cannot use NTOS to do this. You'll need an RTOS. It's possible that you already have one: expensive NICs typically have an internal clock, a general-purpose ARM or RISC CPU, and an SDK that lets you move your program onto the NIC's CPU. (Note that even this has an asterisk: Ethernet itself is a shared medium, which means that if anyone else on the network transmits a packet at the same time, your packet gets trashed. You can retransmit, but now you've lost a hundred nanoseconds.) Another option is to buy a little board (like a Raspberry Pi) that has an Ethernet jack and can run an RTOS.

    If you must run on NTOS (or any non-RTOS like macOS or vanilla Linux), then you have to accept unbounded imprecision.

    Okay, the big disclaimer is out of the way. Now I suggest you take a break from reading my reply to read through this excellent page: https://docs.microsoft.com/en-us/windows/win32/sysinfo/acquiring-high-resolution-time-stamps

    Regarding your specific question: KeQueryUnbiasedInterruptTime is not an exact equivalent for usermode's QueryPerformanceCounter. The actual equivalent is KeQueryPerformanceCounter. If you switch to that, you should be able to get your driver's numbers up to the same level of quality that you see in your usermode version.

    KeQueryInterruptTime (and similarly KeQueryTickCount) are only updated once every clock tick. That means that on typical PCs, the value can be as much as 16ms stale. The Unbiased versions are similar; they're just not artificially advanced when the system goes to sleep. For network protocols, you probably want the biased clock: for other devices on the network, time continues to advance, even if the local host has gone to low power.

    There's also a Precise variant of these clocks. These use a quick LERP to improve on the 16ms granularity. On x64, the CPU has a cycle counter (RDTSC) that is much more granular than the 16ms interrupt. So you can LERP the cycle counter with the clock interrupt to claw back a bit more precision.

    But you don't need to worry about any of that: if you really need high precision, use KeQueryPerformanceCounter. This method doesn't have one specific implementation. Instead, it automatically selects the implementation that has the best precision on the current CPU. So if KeQueryInterruptTimePrecise is the best you can do, then KeQPC will do it. If RDTSC is the best, then KeQPC will use that. If there's some future technology that gives an amazing clock source, KeQPC will use that. So by using KeQPC, you're saying "only the best is good enough for me: give me the best!" without having to worry about the details.

    Summary of time sources:

    • InterruptTime / TickCount: only 16ms resolution
    • Precise: somewhat better than 16ms
    • Unbiased: time stops when the CPU stops
    • KeQueryPerformanceCounter: probably the correct answer
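
    As a rough illustration of that last point, here is a minimal sketch of timing an interval with KeQueryPerformanceCounter in a driver (the variable names and structure are illustrative only, not from this thread). The counter runs at the frequency reported through the optional argument, so converting to milliseconds must use that frequency rather than assuming 100ns units:

    LARGE_INTEGER Frequency;
    LARGE_INTEGER Start, Now;
    LONGLONG ElapsedMs;

    Start = KeQueryPerformanceCounter(&Frequency);   // also retrieves the counter frequency

    // ... the work being timed ...

    Now = KeQueryPerformanceCounter(NULL);
    ElapsedMs = ((Now.QuadPart - Start.QuadPart) * 1000) / Frequency.QuadPart;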

    One more caveat: note that the NIC has its own internal queue of packets. If the NIC has a huge queue of packets to be transmitted, your packet will sit and wait for an unbounded time. You may be able to avoid this problem by using 802.1p priority tags: the better NIC drivers have a separate queue for high priority packets. Of course, putting the tag on the packet will change what goes out on the wire, so make sure the other side of this connection is okay with seeing the extra header on the packet.
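
    As a rough sketch of the priority-tag idea, a protocol or filter driver that builds its own NET_BUFFER_LISTs could set the 802.1p user priority through the per-NBL out-of-band data; whether the tag actually appears on the wire, and whether the miniport gives it a separate queue, depends on the NIC and its driver. The Nbl variable and the priority value 6 below are illustrative assumptions:

    NDIS_NET_BUFFER_LIST_8021Q_INFO QInfo;

    QInfo.Value = NET_BUFFER_LIST_INFO(Nbl, Ieee8021QNetBufferListInfo);
    QInfo.TagHeader.UserPriority = 6;    // 802.1p priority class for this packet
    NET_BUFFER_LIST_INFO(Nbl, Ieee8021QNetBufferListInfo) = QInfo.Value;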

    Alternatively, you can solve the problem by just legislating it away: demand that your solution only runs on a dedicated Ethernet port, with no other contention from other traffic.

  • mwendling Member Posts: 4

    Thank you for your reply. I have read through the link you provided. I understand that Windows is not a real-time OS, but I am puzzled by the significant difference in results that I am getting between the user-mode app approach and the kernel-mode driver approach. Here is some additional information. The first code snippet below is from the user-mode application; the interval that I observe in Wireshark between packet transmissions is 250.0ms +/- 0.02ms. The second code snippet is from the driver; with it, I observe 250ms to 262ms in Wireshark. I will study your comments further to see if there is another approach that I should try.

    App approach:

    std::chrono::duration<double> time_span;
    std::chrono::high_resolution_clock::time_point tstart;
    std::chrono::high_resolution_clock::time_point tnow;

    while (1)
    {
        tstart = std::chrono::high_resolution_clock::now();
        tnow = std::chrono::high_resolution_clock::now();
        time_span = std::chrono::duration_cast<std::chrono::duration<double>>(tnow - tstart);
        while (time_span.count() <= .25)
        {
            tnow = std::chrono::high_resolution_clock::now();
            time_span = std::chrono::duration_cast<std::chrono::duration<double>>(tnow - tstart);
        }
        // send Ethernet packet
    }

    Driver approach:
    while (syncing)
    {
        Interval = KeQueryPerformanceCounter(PerformanceFrequency).QuadPart - startTime.QuadPart;
        if (Interval < 2500000)
        {
            LARGE_INTEGER T250MS;
            T250MS.QuadPart = Interval - 2500000;   // negative => relative delay in 100ns units
            KeDelayExecutionThread(KernelMode, FALSE, &T250MS);
        }
        startTime = KeQueryPerformanceCounter(PerformanceFrequency);
        // send Ethernet packet and wait for completion
    }

  • Tim_Roberts Member - All Emails Posts: 13,763

    Your user-mode example is incomplete. What do you do "while (time_span.count() <= .25)"? Do you Sleep, or are you in a hard CPU loop?

    Tim Roberts, [email protected]
    Providenza & Boekelheide, Inc.

  • mwendling Member Posts: 4

    It's a hard CPU loop. All it does is send a packet every 250ms.

  • Peter_Viscarola_(OSR) Administrator Posts: 8,236

    Well, that explains the difference.

    In one case you’re running continually; in the other case, you’re waiting for the timer and then your thread needs to be scheduled.

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • Tim_Roberts Member - All Emails Posts: 13,763

    Yes, that explains it. You surely could not expect to deliver a production product that wastes 100% of a CPU, could you?

    Your solution, however, does seem pretty clear. You use KeDelayExecutionThread to get close, and then use KeStallExecutionProcessor for the final wait. KeDelayExecutionThread gives up the CPU, but KeStallExecutionProcessor does a tight CPU loop for short waits. That should get you the resolution you need, at a much lower CPU impact.
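
    A rough sketch of that hybrid wait, assuming a 250ms period measured against KeQueryPerformanceCounter (the size of the coarse delay and the variable names are illustrative assumptions, not something specified above):

    LARGE_INTEGER Frequency, Now, Deadline, Coarse;

    Now = KeQueryPerformanceCounter(&Frequency);
    Deadline.QuadPart = Now.QuadPart + (250 * Frequency.QuadPart) / 1000;   // 250ms from now

    // Give up the CPU for most of the interval; the wakeup may be more than
    // a clock tick (~15.6ms) late, so leave a generous guard band.
    Coarse.QuadPart = -2300000LL;   // relative time in 100ns units: ~230ms
    KeDelayExecutionThread(KernelMode, FALSE, &Coarse);

    // Burn off the remainder in short stalls.
    while (KeQueryPerformanceCounter(NULL).QuadPart < Deadline.QuadPart) {
        KeStallExecutionProcessor(10);   // stall ~10 microseconds per iteration
    }

    // send the Ethernet packet here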

    Tim Roberts, [email protected]
    Providenza & Boekelheide, Inc.

  • mwendling Member Posts: 4

    Thanks for your replies. As it turns out, the driver actually does give the same results as the app on my laptop and home computer, but it is 100 times worse, as described above, on my main computer at work. I compared network settings and that was not the issue. I suspected the Windows version, since that was the only computer running 2004, but I just tried it on another computer running 2004 and it was fine. So this may be an unexplained anomaly on one computer, but since it is a very capable computer, I expect I will see it elsewhere. I'll need to do more testing. Thanks again for your feedback.

  • MBond2 Member Posts: 233

    This is exactly the sort of problem you should expect here. Some of the time, on some machines, it will work well enough. But take the same code to different hardware, or run it enough times, and the problems pointed out by others surface. It is likely that the 'better' the machine is, the more likely you are to see failures.

  • Scott_Noone_(OSR) Administrator Posts: 3,375

    FYI you can potentially get more insight into the timing differences across systems by using Xperf. This article might give you a starting point: Happiness is Xperf.

    -scott
    OSR
