Okay, below is another attempt to describe the bigger picture (why would we want to glue wings to our pig, @“Peter_Viscarola_(OSR)”? My previous description did a poor job):
- We have an interface card receiving tons of packets from the line, and we need to pass those packets to a suite of apps for processing. These packets have a maximum size of 1KB.
- The interface card has a bus-master DMA engine that reads descriptors from main memory; each descriptor supplies the destination (logical) address for a DMA write into main memory.
- Between the suite of apps and the interface card, there is an app (say, app1) one layer above the driver, which loads and initializes the drivers. Part of the initialization is to allocate memory, lock it in memory, create a list of descriptors (one per DMA transfer, i.e., one per packet), and let the card know where to fetch them.
- Once initialized, app1 shares the memory buffer with the other apps, so all of them (let's say one of them is app2) can access all packets.
- This memory buffer is basically a large ring of consecutive 1KB packet buffers; let's call it the circular buffer from now on. The circular buffer is allocated with CreateFileMappingA as shared memory (between apps) in 132MB segments in user mode, then mapped and locked in kernel mode in 32MB blocks. A table is then built in kernel mode that describes each segment in 4KB pages, recording both the logical and the virtual address of every page. That table is returned to user space, and from it another table (the packet buffer table) is built that describes the circular buffer in 1KB packet buffers.
- The packet buffer table is also allocated with CreateFileMappingA as shared memory and is shared between the apps. All an app needs in order to access a given packet buffer in the circular buffer is its index, which translates to an offset from the base address of the circular buffer.
- The mechanism above works well on Windows Server 2008, but in the test setup we built for Windows Server 2016 we found that we sometimes miss a few packets, once in up to 5 hours, for no apparent reason. Same hardware, same driver, same app1 and app2; only the OS differs, yet the result differs.
- Every time it fails, it fails on 4 packets in a row, and those 4 packets happen to reside on the same physical page (the question was titled as such largely because of this correlation, but @“Peter_Viscarola_(OSR)” was right that correlation does not imply causation).
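To pin down terminology, here is a rough C sketch, with hypothetical names and field layout (this is not our actual code), of the kernel-built per-page table described above: one entry per 4KB page, carrying both the logical address the DMA descriptors use and the user-mode virtual address, with the 132MB segment size quoted:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout of one entry in the kernel-built table that
 * describes a shared-memory segment in 4KB pages, recording both
 * the logical (bus) address programmed into the DMA descriptors
 * and the user-mode virtual address of the same page. */
typedef struct {
    uint64_t logical; /* bus/logical address used by the DMA engine */
    uint64_t virt;    /* user-mode virtual address of the 4KB page  */
} page_entry_t;

#define SEGMENT_SIZE      (132u * 1024u * 1024u) /* 132MB user-mode segment */
#define PAGE_SIZE         4096u                  /* x86/x64 page size       */
#define PAGES_PER_SEGMENT (SEGMENT_SIZE / PAGE_SIZE)

/* Index into the per-page table for a byte offset within a segment. */
static uint64_t page_index(uint64_t byte_offset)
{
    return byte_offset / PAGE_SIZE;
}
```

The packet buffer table would then be derived from entries like these, re-expressed in 1KB granularity.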
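And the index arithmetic from the last two bullets, as a minimal C sketch (names are mine, sizes are from the description): with 1KB packet buffers and 4KB pages, exactly 4 consecutive packet buffers share one physical page, which lines up with the 4-in-a-row failure pattern:

```c
#include <assert.h>
#include <stdint.h>

#define PKT_BUF_SIZE  1024u                       /* each packet buffer is 1KB */
#define PAGE_SIZE     4096u                       /* x86/x64 page size         */
#define PKTS_PER_PAGE (PAGE_SIZE / PKT_BUF_SIZE)  /* = 4                       */

/* Byte offset of packet buffer `idx` from the base of the circular
 * buffer; the index wraps around the ring of `total_pkts` buffers. */
static uint64_t pkt_offset(uint64_t idx, uint64_t total_pkts)
{
    return (idx % total_pkts) * PKT_BUF_SIZE;
}

/* Index of the 4KB page that packet buffer `idx` lives on. */
static uint64_t pkt_page(uint64_t idx, uint64_t total_pkts)
{
    return pkt_offset(idx, total_pkts) / PAGE_SIZE;
}
```

So packet buffers 0..3 sit on page 0, 4..7 on page 1, and so on; a failure that always spans 4 consecutive buffers is exactly one physical page.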
To be more specific, our testing is done this way:
1. Start app2, which starts app1 to initialize everything.
2. App2 then starts a process that allows packets to flow in; after about 10 minutes, every packet buffer in the circular buffer has been written at least once.
3. After that, new packets start overwriting old packets; app1 tracks the packets and increments the packet index to reflect the progress of the overwriting.
4. App2 uses the updated index to check whether the packets have been overwritten as expected (new packets carry different signatures from old ones).
5. If a packet is found to be outdated (the index has grown past it, yet it still carries the old signature), the whole process stops (though packets usually keep flowing in for a bit).
6. If no outdated packet is found after about 20 minutes (by which point every packet in the circular buffer has been overwritten at least once), we halt this test iteration and stop app2, which closes app1 gracefully before exiting.
7. Go back to step 1 and start a new iteration.
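For steps 4-5, a minimal C sketch of the staleness check app2 performs (the names and the signature scheme are hypothetical, not our actual code): walk from the last-checked index up to the producer index published by app1, and flag any buffer that still carries the previous iteration's signature:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PKTS 16u  /* tiny ring for illustration; the real one is far larger */

/* Scan packet signatures from `from` up to (but not including) `upto`,
 * wrapping around the ring. Returns the index of the first packet that
 * does not carry the expected (new) signature, or -1 if all are fresh. */
static int find_stale(const uint32_t sig[], uint32_t expect,
                      uint32_t from, uint32_t upto)
{
    for (uint32_t i = from; i != upto; i = (i + 1) % NUM_PKTS) {
        if (sig[i] != expect)
            return (int)i;  /* overwrite did not land: stale packet */
    }
    return -1;
}
```

In our failing runs, the first stale index found this way is always followed by three more stale buffers, i.e., one full 4KB page.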
It takes 2-5 hours of repeating the test to find an outdated-packet occurrence, and every time it happens, there are 4 stale packets in a row.
Note: if we skip step 6, which triggers the restart sequence of app1 and app2, the test may run forever without reproducing the problem.
Hopefully this at least provides enough context to allow more questions to be asked to fill the gaps. I am grateful that @Tim_Roberts & @“Peter_Viscarola_(OSR)” even bothered to reply before this.