AI Tools for Assisting with Device Driver Development

For those of us currently doing device driver development work on Windows, what are people finding most helpful when it comes to using AI tools for daily development work for code that executes in kernel-mode?

Taking into account KMDF/UMDF (WDF) vs. File System Mini-Filter vs. “legacy” drivers, and usage of “flat C” vs. the minimal form of C++ commonly referred to as “C with classes”, what experiences are people having (both good and bad) with anything from code completion to wholesale boilerplate code generation?

How good is AI at maintaining context for things like IRP handling, being mindful of the current IRQL and what can & cannot be done at any specific IRQL, managing memory allocation & access for non-paged & paged memory, determining when exception handling can/should be used, etc.?

If any kind of commercial 3rd party driver frameworks or in-house developed frameworks are being used, are AI tools “digesting” those frameworks and working with them cooperatively or trying to replace them with generated boilerplate which doesn’t fit in with the patterns currently in use with existing driver projects?

Copilot etc. are Enhanced Code Completion, which is useful, but the ‘intelligence’ behind it can, in my opinion, best be described as a ‘confident idiot’. Everything complex uttered by these systems needs to be evaluated by a domain expert.

Copilot is pretty good at nit-picking code reviews. So, if you want all your nits picked out of your code, let loose the bots!

On the other hand, the techbro fascists behind this crap are noxious.

1 Like

The first thing that should be said is that what is called AI is a family of adaptive heuristic algorithms. It is a type of algorithm that has existed for a long time, but has only now become computationally feasible for general use, and has now been made widely available.

In my experience, the ‘AI’ that’s built into Edge and other search engines to provide better search results is probably the most useful. After that, the enhanced IntelliSense (or whatever they call it now) in Visual Studio. Automated code generation of wrappers like C# P/Invoke definitions works well too. But after that it falls off steeply.

I have not attempted to use it to detect incorrect IRQL usage or any other kind of problem where actually knowing is important. These algorithms are not good at policing absolute rules.

In such a field, AI is not yet good enough to write the code itself. Your best bet is "brainstorming" with AI about the logic.

Don't even dare to ask about function behavior. That includes IRQL and exceptions.

The one time I asked AI to help me with what I considered a fairly simple task in KMDF, it failed miserably. I don’t remember exactly what I asked it to do, but I recall that the code it generated used several API calls that do not exist in WDF. It just made them up; the calls even started with Wdf. That was a few months ago; maybe it’s better now. But Windows drivers are a niche that I doubt they train the AI models on regularly.

1 Like

Thanks for the replies so far, and in advance for those that may be posted in the future.

I wasn’t asking out of total ignorance or for trolling purposes, but instead to get a “finger on the pulse” at a point in time, to better understand what others are encountering, either willingly or by having it forced on them. A lot of companies producing software are going whole-hog on AI from the top down, and so AI is being pushed hard, everywhere, even when it’s completely inappropriate for current AI capabilities. I’m old enough to remember the Jonestown Massacre in the news as it happened, and I am most definitely of a cautious nature anytime management gets zealous about something and starts handing out the cups of Kool-Aid to everyone.

I’ve been getting the “AI Summary” experience with Google Search for most of the past year, and when searching for either reference material or code snippets for obscure API functions for both user mode and kernel mode, I’m getting mixed results. For the most part, the basic C# code it offers up is reasonably correct in small amounts, but I’m truly horrified at the C/C++ code it spews out, which must have come from something akin to a psychotic break. It has hallucinations involving what is obviously C#/.NET source material that gets transmogrified into egregiously wrong C++ code with total disregard for the fact that no such C++ language bindings or projections exist in reality.

What seems to have gotten worse with Google Search, though, is that many times the AI summary lacks proper attribution of the original source material, whether it be Microsoft’s own documentation or postings on Stack Overflow, Reddit, GitHub, or anywhere else, although the attribution issue seems to have been slowly getting better over the past 1 to 2 months. I’m also finding that after the AI summary there appear to be fewer useful traditional search results. Targeted searches involving exact values in double quotes seem to be ignored and broadened into something that returns irrelevant results, and much the same seems to be happening with exclusion search terms starting with “-” being ignored.

As for AI-generated native code for Win32 API or NT Native API function calls, COM class implementations for Explorer shell extensions, etc.: the results are perhaps structurally sound in terms of regurgitating Microsoft’s own documentation examples, but the code itself looks like MSDN examples from the mid-1990s as utilized by a freshman computer science student who is programming on Windows with C++ for the first time. There’s no RAII usage, no try/catch or __try/__except usage, no input parameter validation, no modern C/C++ language feature usage, endless repetition of boilerplate code, etc. It’s certainly nothing production ready or worth shipping as part of an enterprise product. I’m even seeing this with Claude. It’s one thing for the code to be shoddy in user mode, but it’s fatal in kernel mode.

What experiences have others had when it comes to providing context & rules via markdown for the user-mode & kernel-mode sides of a project? Has it been resulting in better AI behavior when it comes to interacting with the codebase and offering up suggestions for code changes?

Using Grok for driver development is like having one of the smartest people in the world sitting right next to me, always ready to answer the hardest, most technical questions 24 hours a day. I have about fallen off my chair at how incredibly technical Grok can get talking hardware protocols and device drivers; I could never have expected it. I love to let Grok review code and insist it implement state-of-the-art C++26 optimizations. Without being told, Grok understands the constraints of driver development and customizes around them. Some ideas I don’t care for, and occasionally they’re even wrong, but the vast majority result in a better end product, which I eagerly incorporate, banking the knowledge. This is extraordinary; very few people in the world could give such high-quality advice in extremely nuanced technical areas of C++, hardware, device drivers, or just whatever I am doing. And in the future Grok will continue to get better exponentially.

For me, Grok is one of the biggest advances in the history of driver development. It is a lot like chess engines: the best players use chess engines to learn and find better moves and ideas to become better. In the past, players were limited by what they and those around them were capable of. The same thing is going on here.

This is a very complicated topic.

First: Grok sucks, and is controlled by a fascist asshole. Full stop. I will not argue this.

Second: GitHub Copilot is, as Mr. Roddy noted, almost decent as a code completion tool. But Microsoft have so badly brain-damaged the models by forcing a 128K context window that it makes AI look really bad. And the tools are so bad, they constantly make mistakes.

Third: The OSR team have spent a great deal of time on this, and are finding AI in general, and Claude Code in particular, very helpful. Good prompt engineering is essential. Focused prompting, even for code reviews, can improve the depth of results. Writing code with one model and getting a code review from another model (even a typically “inferior” one) seems to be very effective.

We have had luck with code reviews of mini-filter code. Like most static tools, you wade through a pile of false positives to find one or two gems. But find them we have.

In terms of actually generating new KM code for Windows: Not that much success. It seems to me that the models don’t have a big enough base on which to train. As OSR’s Scott Noone recently noted (words to the effect of): “The more I know about a topic, the less happy I am with the answers I get from an AI.” In short, based on my personal experience and the experience of the whole OSR team over the past six months, I would absolutely not trust AI to author my kernel mode code that needed to work.

OTOH, we have had repeated and spectacular luck with having Claude Code generate user mode utilities, and test cases, in modern C++20. I mean, complete semi-complex utilities in domains where none of us know a single thing. As an example, we wrote a set of utilities to build software kits (think of a crazy powerful modern version of binplace), that includes moving files around, signing executables, building driver packages, uploading the package to the hardware dashboard for attestation signing, tracking the submission’s progress, downloading the results, and kitting the resulting signed driver binaries into a zip archive. One command at the command line, go get coffee, come back and release. 90% AI designed, built, tested (including test framework designed and built and test cases created). Yes, it took engineer oversight and guidance, but… just really impressive results, in a domain where folks here know sweet fuck all.

I could go on about this at great length, and already have. But bottom line: Lots of uses for AI in our world. But not yet for actually authoring KM code for Windows. No way, no how, not even close when you really look closely at the code that’s been written.

2 Likes

Just as a reply to this specific point: I’ve found I get what I ask for. It’s really, really, critically, super important to set up your claude.md with the right constraints. I’m getting really solid RAII-using, DRY-compliant results, with actually useful narrative code comments (WHY something is being done, not “now add 3 to foo”), from Claude Code.
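To make the idea concrete, here is a sketch of the kind of constraints one might put in such a file. Every rule below is illustrative, invented for this example; it is not anyone’s actual claude.md, and the specifics would need tuning to your own project:

```markdown
# Project rules (hypothetical kernel-mode minifilter)

- All code is C++ in the "C with classes" subset: no exceptions, no RTTI,
  no STL, no dynamic_cast.
- Every resource acquisition uses a scoped RAII guard; no naked
  allocate/free pairs spread across exit paths.
- Annotate each function with its maximum IRQL (e.g. `// IRQL <= APC_LEVEL`)
  and never touch paged memory on paths reachable at DISPATCH_LEVEL.
- Comments explain WHY a decision was made, never restate WHAT the code does.
- Never invent WDF/FltMgr APIs: if a function does not appear in the WDK
  headers vendored in this repo, stop and ask instead of guessing.
```

The point is less the individual rules than that the model is told the project’s dialect and forbidden from inventing APIs up front.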

But it took experience and a lot of work to get there, for sure.

1 Like

This. I’ve seen it do the same thing with calls to filter manager. Entirely hallucinated functions.

In fairness, ISTR that this was using GitHub Copilot, which is (sadly) brain damaged and thus dumb as a post. But… yup. “You’re not the only one” to get this result.

The last straw for me with GitHub Copilot was when it autocompleted a line in my minifilter with this gem:

securityContext = IoGetCurrentIrpStackLocation(fileObject)->FileObject->SecurityContext;

Ended up being a good thing though because for me once I pulled the coding agents out of the IDE I’ve found significantly more use cases. Great for reviews, generating tests, spinning up infrastructure, etc. Just lots of little quality of life things that you wouldn’t take the time to do yourself necessarily.

But as Peter said you definitely have to steer them in the right direction. And any code that is going to go into production needs to be scrutinized to death. They’ll swear up and down that something works and that it’s been tested and then you try it and it doesn’t work (“You’re absolutely right!”). Or they changed something else unrelated to make the test work. And that’s if you’re lucky, other times the change can be wrong in such a way that it seems right and passes a glancing review but really has a horrible security/reliability issue.

1 Like

I’ve had a similar experience to previous responders in this thread: overall, for driver coding, AI sucks. However, it looks like AI can help where driver development hurts the most: debugging, especially crash analysis.

  1. Check out mcp-windbg. I haven’t tried it myself yet, but it looks useful. I’d love to hear feedback if anyone has tried it, or will try it in the future.
  2. Copy/pasting the output of !analyze -v and asking for an opinion, or for the next debugger command to run, can give some insight and new ideas on how to proceed.
  3. Similarly, for live debugging, stack traces and other command output provided to an AI agent that has access to your code (Cursor, VSCode, etc.) can help analyze hangs, data races, use-after-free bugs, etc.

1 Like

Hello. Months ago I generated a legacy driver for Hyper-V checking (for testing purposes) using AI: Hyper-V-Tools/Hyperv_detector/hyperv_detector_v2/src/kernel_mode/hyperv_driver.c at 34b2c15fb57e170a893a15a8fd822d29d09ae999 · gerhart01/Hyper-V-Tools · GitHub. It was generated from zero, with solution and vcxproj files, could be compiled, and it seems it can be debugged. I suppose that all Windows kernel driver types can be generated using AI (because technically those tasks are equivalent to the task of creating a legacy driver), and the Windows driver samples are available on MSDN.

@Scott_Noone_OSR

Regarding:

Does that include automating the provisioning of VMs for all required Windows builds, deploying the HLK to the VMs, configuring the required HLK tests specific to the driver in each VM, running all of the HLK tests and putting them where they can be signed in an Azure YAML build pipeline? Or, is that just setting up testing in local VMs for doing dev testing before you’d consider things to be ready to perform HLK testing?

One of the things that tends to eat a lot of time is refreshing my VM environment where I use the HLK to actually test on each specific build of Windows where the driver is expected to execute and for which certification will be requested (check those boxes) during driver package submission.

Does that include automating the provisioning of VMs for all required Windows builds, deploying the HLK to the VMs, configuring the required HLK tests specific to the driver in each VM, running all of the HLK tests and putting them where they can be signed in an Azure YAML build pipeline?

YMMV with all of this, but I’ll say to start: if you just sit down with a coding agent and say “give me a complete HLK testing pipeline, start to signed driver,” it sure as hell will try to do SOMEthing. And you might even end up with something that works, who knows.

But, in my experience thus far, you have to treat the clankers like really prolific, really talented, but also really junior engineers: good but sometimes hasty/sloppy at implementation, and bad at the big picture. Therefore the best results come if you break the work down into individual tasks that have some type of verification step to clear before moving on. In the case of a unit test it might be “you need to prove that it compiles and runs”; in the case of an infrastructure change it might be “make this Groovy change, run it through the linter, and then run a successful pipeline with it.”

If you can get in the groove it really lowers the bar in terms of what’s “worth it” to go try and build. For example, we recently had a bug report filed that only happened if you saved a Word document in Word 97 format (don’t ask :joy:). It was clear we didn’t have a test for this, so that was a good agent task. Now generating transparently encrypted Word documents and saving them in every possible format is part of CI (in addition to PPT and XLS files). Could I have done it myself? Sure, but it would have taken longer and probably been less thorough.

In terms of HLK testing: I’ve never tried to automate the actual certification testing process (i.e. install controller, get the right playlist, provision client, etc.). But we do run the HLK tests themselves daily. We basically get the controller set up, extract all the tests, and then have a pipeline to generate VMs from common parent images, create the right volumes, install our drivers, run the tests, and gather the logs. This means we end up testing way more than the logo playlist does in the hopes of no surprises come actual certification time which is then manual. If there’s an API to do it then it’s possible, but I’m still holding on to that being manual.