Playing audio on a virtual microphone from user-mode app

Hello members of this incredible forum filled with great knowledge! I’m very much a beginner in this domain, so excuse me if I write anything erroneous or inaccurate, and feel absolutely free to ask for any details I may forget.
Alright, so here’s my situation: I’m receiving a stream of audio (an array of bytes of raw PCM in a fixed, known format, arriving every 50 ms on average) over a UDP connection. I’m using an MFC user-mode application to receive that audio, and my goal is to send it to my own virtual microphone and make the microphone output that particular audio.
The closest sample to my goal (that I could find) is the SimpleAudioSample project: it already has a virtual microphone, and it even plays a basic sine-wave tone on it! So I removed the parts concerning the virtual speaker that I don’t need, commented out the code that generates the sine waves, and managed to add a working IOCTL that lets me send an audio buffer from my user-mode app to the driver. But the step I’m really struggling with is actually filling the virtual microphone’s buffer with the buffer coming from my user-mode app.
My first idea was to basically set a global pointer to the last CMiniportWaveRTStream created in CMiniportWaveRT::NewStream, and each time I receive an audio buffer through IOCTL, I would set the m_pDmaBuffer member of that stream to my received buffer with something (probably wrong) like this:

BYTE* AudioBuffer = (BYTE*)_Irp->AssociatedIrp.SystemBuffer;
pGlobalWaveRTMiniportStream->m_pDmaBuffer = AudioBuffer;

However, the moment I try to set the global pointer of the last stream created inside CMiniportWaveRT::NewStream:

if (NT_SUCCESS(ntStatus))
{
        pGlobalWaveRTMiniportStream = stream;
        pGlobalWaveRTMiniportStream->AddRef();
}

My virtual mic seems to break and no longer shows its audio activity in the Sound settings window.

I don’t really understand when/how streams are created. I noticed, for example, that when I open the **Recording** tab in the Sound settings window, a new stream is created, and when I leave that tab, that stream is closed.
Or maybe I should proceed completely differently? I’ve read that some people use ring buffers, but I’m not quite sure what I should use in my situation. Any help would really be appreciated, thank you!

I can’t comment on why the microphone has stopped working, but I can comment a bit about your approach …

Yes, you will need to use a ring buffer … essentially you will need to be able to supply a constant data stream to the endpoint as well as provide position information to the OS. As usermode applications come and go they will need access to that ring buffer to pull the data from …

In a sound driver you’re not communicating directly with an end user application, you’re communicating with the OS audio engine which then communicates with your driver … the usermode app gets information from the audio engine about what your driver can supply (via calls from the audio engine into your driver), then starts pulling data (which it gets from a pathway the audio engine set up) … there’s also exclusive mode which is a more direct data pathway but for now let’s stick with shared mode …

A “stream” is simply the audio engine establishing a data link between your driver and a usermode program out there somewhere … when you click on the “record” tab you’re telling the OS that you want to start getting data from the microphone, so the audio engine establishes a stream to accomplish that …

It sounds like you have a way to put data into your driver, so let’s take some baby steps here … in your circumstances I would do these steps, checking everything as I went (and of course run under KMDF Verifier)

  • Create a ring buffer which will supply data to the audio engine capture “stream”, put some static data in there and get it to where you can “hear” things with a program like Audacity
  • Modify your IOCTL so that it will push data into that ring buffer (I would also suggest using METHOD_IN_DIRECT rather than METHOD_BUFFERED for this)
  • Rather than having a usermode app collect the UDP packets and send them to the driver, instead look into “Kernel Winsock”, for which there are several good libraries on GitHub. Create a system thread which will make the UDP socket data connection and gather the data to put directly into the ring buffer … this will make things much speedier. You can still do all of the socket and port setup in the MFC application, just send an IOCTL to the system thread when things are good and have the system thread start collecting data
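The wrap-around logic of the first step can be prototyped and unit-tested in user mode before any kernel code is touched. Here is a minimal sketch of that idea (the class and member names are mine, not from the SimpleAudioSample driver), with a `std::mutex` standing in for the kernel spin lock:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <mutex>
#include <vector>

// Minimal user-mode ring buffer sketch. The wrap-around copy logic is the
// part that ports directly to the driver; the locking primitive changes.
class RingBuffer {
public:
    explicit RingBuffer(size_t size) : m_buf(size) {}

    // Returns the number of bytes actually written. In this sketch old data
    // is never overwritten; input that does not fit is dropped.
    size_t Write(const unsigned char* src, size_t n) {
        std::lock_guard<std::mutex> lock(m_lock);
        n = std::min(n, m_buf.size() - m_count);  // free space only
        for (size_t done = 0; done < n;) {
            size_t chunk = std::min(n - done, m_buf.size() - m_write);
            std::memcpy(&m_buf[m_write], src + done, chunk);
            m_write = (m_write + chunk) % m_buf.size();
            done += chunk;
        }
        m_count += n;
        return n;
    }

    // Returns the number of bytes actually read; the caller zero-fills
    // (or better, fades) whatever it still needs.
    size_t Read(unsigned char* dst, size_t n) {
        std::lock_guard<std::mutex> lock(m_lock);
        n = std::min(n, m_count);
        for (size_t done = 0; done < n;) {
            size_t chunk = std::min(n - done, m_buf.size() - m_read);
            std::memcpy(dst + done, &m_buf[m_read], chunk);
            m_read = (m_read + chunk) % m_buf.size();
            done += chunk;
        }
        m_count -= n;
        return n;
    }

private:
    std::vector<unsigned char> m_buf;
    size_t m_write = 0, m_read = 0, m_count = 0;
    std::mutex m_lock;
};
```

Having the drop-vs-overwrite policy decided and tested in isolation makes the later “why am I hearing old audio” debugging much easier.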

There is a wdmaudio mailing list out there [https://www.freelists.org/archive/wdmaudiodev] which has folks who can help diagnose the “why did things stop working” question, and one of the regulars there has a website filled with good audio samples and utilities (with source) which can really help dial things in …


Thanks to you, Craig, I managed to make some huge progress!

  • I created my own ring buffer which is working (not perfectly?) and made it a member of the MiniportWaveRTStream class.
  • I’m now using METHOD_IN_DIRECT in my custom IOCTL to transfer the audio buffers. For now, I’m still copying the bytes from AssociatedIrp.SystemBuffer (I don’t know if it’s really worth going through all the headaches of getting the MDL working, since I’ll probably use Kernel Winsock later on anyway).
  • Instead of trying to set a global pointer to the last stream created, I set a global pointer to its ring buffer member, and now I can access it and write bytes into it.
  • When there aren’t enough bytes to read in the ring buffer, I set the remaining bytes asked for by the driver to zero (silence).

And after a lot of failures and retries (printing debug messages inside my ring buffer, for example, was NOT a good idea at all), I managed to get it to work: I can successfully hear the sound emitted from the UDP client, but the audio quality is pretty bad. There’s a permanent crackling noise and I can’t seem to find the cause! What I tried so far:

  • Increased the wave format quality to 48 kHz, 16-bit, 1 channel → the permanent crackling noise is still there.
  • Played the received PCM audio in several user-mode applications to see if the problem was in the UDP packets/format → the incoming audio buffers are fine, no crackling, no noise; the problem isn’t in the transmission (tested in C++ using SDL and in C# using NAudio).
  • Removed the spinlock in the write and/or read methods of the ring buffer → it only made things worse.
  • Since the type of the bytes is unsigned char, I thought maybe the noise was coming from negative values inside the buffer, so I tried manually setting every negative byte in my audio buffer to at least 0 before sending it to the ring buffer → it only made things worse.

Is it maybe due to my ring buffer lacking some sort of optimization? I tried to make it the simplest possible, using only three members to keep track (write position, read position, and the count of available bytes). Here are my two methods, Write and Read; if you can spot any imperfection, I would be glad to correct it!

NTSTATUS RingBuffer::Write(_In_ BYTE* pBytes, _In_ SIZE_T nbBytesToWrite)
{
	if (nbBytesToWrite > m_BufferSize) return STATUS_BUFFER_TOO_SMALL;
	if (nbBytesToWrite == 0) return STATUS_SUCCESS;

	NTSTATUS status = STATUS_SUCCESS;

	KeAcquireSpinLock(m_SpinLock, &m_SpinLockIrql);

	if (nbBytesToWrite > m_BufferSize - m_BytesCount)
	{
		// Not enough free space: old data will be overwritten and
		// ReadPosition will have to be moved after the write
		status = STATUS_DATA_OVERWRITTEN;
	}

	SIZE_T nbTotalBytesWritten = 0;

	do
	{
		SIZE_T nbBytesToWriteNow = min(nbBytesToWrite, m_BufferSize - m_WritePosition);
		RtlCopyMemory(m_Buffer + m_WritePosition, pBytes + nbTotalBytesWritten, nbBytesToWriteNow);
		m_WritePosition = (m_WritePosition + nbBytesToWriteNow) % m_BufferSize;
		nbBytesToWrite -= nbBytesToWriteNow;
		nbTotalBytesWritten += nbBytesToWriteNow;
	} while (nbBytesToWrite > 0);

	// Set the new count of bytes which cannot exceed the buffer Size
	m_BytesCount = min(m_BufferSize, m_BytesCount + nbTotalBytesWritten);

	if (status == STATUS_DATA_OVERWRITTEN)
	{
		m_ReadPosition = m_WritePosition;
	}

	KeReleaseSpinLock(m_SpinLock, m_SpinLockIrql);
	return status;
}

And inside CMiniportWaveRTStream::WriteBytes, instead of `m_ToneGenerator.GenerateSine(m_pDmaBuffer + bufferOffset, runWrite);` I now have:

SIZE_T bytesRead = 0;

m_RingBuffer->Read(m_pDmaBuffer + bufferOffset, runWrite, &bytesRead);

if (bytesRead < runWrite)
{
	RtlZeroMemory(m_pDmaBuffer + bufferOffset + bytesRead, runWrite - bytesRead);
}

NTSTATUS RingBuffer::Read(_Out_writes_bytes_(nbBytesToRead) BYTE* pTarget, _In_ SIZE_T nbBytesToRead, _Out_opt_ SIZE_T* readCount)
{
	if (nbBytesToRead == 0)
	{
		if (readCount) *readCount = 0;
		return STATUS_SUCCESS;
	}

	KeAcquireSpinLock(m_SpinLock, &m_SpinLockIrql);

	if (m_BytesCount == 0) // buffer is empty
	{
		if (readCount) *readCount = 0;
		KeReleaseSpinLock(m_SpinLock, m_SpinLockIrql);
		return STATUS_DEVICE_NOT_READY;
	}

	// Adjust the number of bytes to read in case we don't have that many bytes in our buffer
	nbBytesToRead = min(nbBytesToRead, m_BytesCount);

	SIZE_T nbTotalBytesRead = 0;

	do
	{
		SIZE_T nbBytesToReadNow = min(nbBytesToRead, m_BufferSize - m_ReadPosition);
		RtlCopyMemory(pTarget + nbTotalBytesRead, m_Buffer + m_ReadPosition, nbBytesToReadNow);
		m_ReadPosition = (m_ReadPosition + nbBytesToReadNow) % m_BufferSize;
		nbBytesToRead -= nbBytesToReadNow;
		nbTotalBytesRead += nbBytesToReadNow;
	} while (nbBytesToRead > 0);

	m_BytesCount -= nbTotalBytesRead;

	if (readCount)
	{
		// We return how many bytes we could read
		*readCount = nbTotalBytesRead;
	}

	KeReleaseSpinLock(m_SpinLock, m_SpinLockIrql);
	return STATUS_SUCCESS;
}

You may need to add some instrumentation to allow you to monitor the buffer levels. If you are having to do zero padding because your buffer ran dry, that will cause crackles. How large is your buffer? There is a painful balance there; too little, and you run dry and return zeros. Too much, and you get unacceptable latency. I’ve been running about 8k bytes.
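The buffer-level instrumentation can be as simple as converting the fill level into milliseconds of audio; this hypothetical helper (not from any sample) shows the arithmetic for the format discussed in this thread. At 48 kHz, 16-bit, mono, one millisecond is 96 bytes, so an 8192-byte buffer holds roughly 85 ms:

```cpp
#include <cstddef>

// Bytes of PCM per millisecond for a given format.
constexpr size_t BytesPerMs(unsigned sampleRate, unsigned bytesPerSample,
                            unsigned channels) {
    return static_cast<size_t>(sampleRate) * bytesPerSample * channels / 1000;
}

// Milliseconds of audio represented by a given fill level. Log this on
// every read and watch for it approaching 0 (underflow) or the buffer
// size (overflow).
constexpr size_t BufferedMs(size_t bytes, unsigned sampleRate,
                            unsigned bytesPerSample, unsigned channels) {
    return bytes / BytesPerMs(sampleRate, bytesPerSample, channels);
}
```

The same arithmetic says a 3840-byte UDP packet carries 40 ms of audio in this format, which is a useful sanity check on the sender’s pacing.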

Don’t let the “unsigned char” type fool you. Each sample is a 16-bit signed value.

You don’t need to worry about the MDL. The AssociatedIrp.SystemBuffer is just fine.


Thank you Tim for pointing that out! I was naively thinking that filling the DMA buffer with zeros would just result in true silence when the ring buffer is empty, but it’s clearly not how it works: cutting to silence without fading produces that annoying crackling noise, as shown in my Audacity recording.

I tried to play a PCM file (of the same format) through my ring buffer with a much bigger size (around the size of my file, which is 5 MB) and with no delay between each call to `RingBuffer::Write()`. I kept the number of bytes per write the same as my UDP packets (3840), and the sound is totally clear, no crackles or any noise like that (I didn’t test the latency, but I can imagine it being huge).

I also tried to modify the buffer size while playing the UDP audio packets:

  • 10 MB → the sound is pretty good with almost no crackles. Not sure what’s going on; I guess there is huge latency, so the read position never catches up with the write position?
  • 5 MB → I can hear some crackles, but they’re not very frequent.
  • between 200 KB and 8 KB → crackles are present in almost every sound.

Also, something very weird that happens rarely: I can hear old audio from a few seconds ago mixed with the new audio, which is strange since it means it’s reading values that should have been overwritten. Shouldn’t the **spinlock** I use in the write/read methods prevent this from happening? I’m pretty sure it’s not coming from UDP packets being dropped or arriving too late, since the repeated audio was already heard once, so it must have been successfully received in time.

I also checked the **latency** in a Skype call from my test machine (a VM) to my host machine, and even with an 8 KB buffer there’s still some terrible lag of almost 1 second. I experienced the same thing a while ago when making a user-mode app in C# to play the received stream using the NAudio .NET library and WaveOut, and the only solution I found was to use the library’s **WASAPI** integration, “WasapiOut”, to get almost real-time audio transmission. I don’t know if the latency comes from the same source and I have to use WASAPI or something like that?

For now I’ll try to make it fade to silence when my buffer runs dry, instead of simply filling it with zeros, to get rid of the crackles. If you have any idea about what could be causing the other two issues (the lag and the audio from the past), I’d be more than happy to read it!

I was naively thinking that filling the DMA buffer with zeros would just result in true silence when the ring buffer is empty but it’s clearly not how it works,

Well, it does. The problem is that sudden drop from non-zero to zero. Unless the previous samples fade to zero, that will be seen as a high-frequency note, which comes out as a click. If the previous sample was large, the click will be loud.
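One way to remove that discontinuity is to ramp the last valid samples linearly down to zero before the silence begins. This is only a sketch of the idea (the function name and the in-place approach are mine, not from the thread’s driver); note that it works on 16-bit signed samples, not on the raw transport bytes:

```cpp
#include <cstddef>
#include <cstdint>

// Apply a linear fade-out across the final `count` samples before an
// underflow gap, so the waveform reaches zero instead of stepping to it.
// Gain runs from just under 1.0 at the first sample to 0.0 at the last.
void FadeOutTail(int16_t* samples, size_t count) {
    if (count == 0) return;
    for (size_t i = 0; i < count; ++i) {
        samples[i] = static_cast<int16_t>(
            static_cast<int32_t>(samples[i]) *
            static_cast<int32_t>(count - 1 - i) /
            static_cast<int32_t>(count));
    }
}
```

A fade-in after the gap is the mirror image: the same loop with the gain running from 0.0 up toward 1.0.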


Okay, so after two weeks of headaches trying to find what is causing those noises in my audio output, I did all the tests I could possibly think of to eliminate all other possible sources (the UDP sender sending bad audio, my user-mode app messing with some bytes), and I can safely say it’s all coming from two very precise situations: when the ring buffer overflows, and when it underflows/runs dry.

Underflow:

I guess I kinda managed to set up a “working” fade in and out around silences. What I did for the **Fade Out** is basically, on each attempt to read my ring buffer, check whether there will be more data left after that read: if not, I divide the last samples in my ring buffer by a gradually increasing number before giving them to the framework. And almost the same thing for the **Fade In**, where I check if the last read audio was silence/faded out, and if so, I divide the first half of the samples I’m about to give by a gradually decreasing number.

It works in theory and often in practice, but sometimes I just have a very small window to fade in and out, since the framework can read a very small amount of audio at a time, as few as 48 samples! As a result the fading isn’t very effective, as the screenshot below shows (I delimited the areas of effect of each fade with maximum values to see more clearly).
And I can’t really start fading out from further away, since there might be a write in between that would add some data before the expected silence, forcing me to fix my fade out with another fade in, which seems kinda sketchy.

Overflow:

At first I thought I would manage the overflows by simply overwriting old data and moving the read position, but as expected, it skips a bunch of samples in the audio and it results in a click, just like here:

Then I thought that instead of simply overwriting a few samples, I could set up a small secondary ring buffer that would hold the overflow and, after each read of the main ring buffer, fill it back with the queued audio data from the secondary buffer. However, when my secondary ring buffer gets full, it drops the bytes that I can’t write anywhere else. I tried setting both the main ring buffer and the secondary one to 8k, but both of them still get full so often that it’s just not an acceptable solution. Plus it probably adds some latency here and there, since it’s almost the same as using a single 16k buffer.

I’m currently at a loss for ideas on this issue; I can’t seem to find the proper way of doing this while keeping the audio smooth, with no clicks and no additional latency.

I tried analyzing how fast it reads and writes data into my ring buffer, but honestly I couldn’t really understand the pattern or the frequency. Sometimes it seems to be just fine: it writes a bunch of bytes, then it reads them. But other times it just writes or reads too much without the other, and my buffer either overflows or runs dry.

It also turned out that the latency issue was coming from the VM I used to test my driver on. When using a real computer as the test machine, the audio is received almost in real time with an 8k ring buffer. I haven’t heard any old audio mixed with new audio since last time either, so it might have been coming from the VM or some external source that seems to have disappeared.

Thank you once again for your help as it’s clearly making progress towards something functional, I really appreciate it!

Yes, timing in VMs is known to be crap. You need to design your controlling application so that you simply don’t get overflows and underflows on real metal hardware. There’s just no practical way to work around that in real time.


@Saliom said:
Okay so after two weeks of headaches trying to find what is causing those noises in my audio output […]

Hi Saliom,

Can you provide a code snippet of how you are sending data from the custom IOCTL handler to the ring buffer? And also, where are you defining your ring buffer, I mean in which class?

Thanks in advance.

@Tim_Roberts said:
Yes, timing in VMs is known to be crap. You need to design your controlling application so that you simply don’t get overflows and underflows on real metal hardware. There’s just no practical way to work around that in real time.

I’ll try to implement my solution in the MSVAD micarray sample to see if it helps on a real machine, I’ll post an update if I manage to make any progress!

@chauhan_sumit001 said:

Can u provide some code snippet of how u are sending data from custom IOCTL handler to ringbuffer.and also where you are defining your ringbuffer, I mean in which class.

There’s a guy on YouTube who made a bunch of tutorials on Windows driver development; I learned how to use IOCTLs from his IOCTL videos. It’s not amazing, but it’s better than nothing. As for the rest, I just set a global pointer to the ring buffer (which I put in the same class as m_pDmaBuffer) and use it directly from the PnpHandler.

**Update:**
I managed to port my work to the MSVAD MicArray project, which seems to work better: it definitely reduced the amount of underflows and overflows, and it even seems to smooth samples by itself (making overflows hard to notice, which is nice). But the audio is still not perfect, as the buffer will still run dry occasionally, and I think I finally found the culprit: the latency variation (or unstable ping) between the recording machine and the receiving machine (running the virtual mic) when sending the audio UDP packets. Even a very small latency variation of something like 5 to 10 ms will cause an underflow, and even an overflow if the circular buffer’s size is small, as shown in this small representation I tried to make as accurate as possible to my scenario:

I have also noticed that each underflow adds some delay to the audio unless an overflow right after removes samples. I could possibly fix that by skipping the number of zeroed bytes from the next audio buffer, at the cost of more underflows, however.

To fix the underflows, I thought about delaying the read of every packet by a small amount of time to equalize their delay, making them all 10 ms, for example, in my previous scenario. In theory it should eliminate any underflow caused by a latency between 0.01 and 10 ms, but the major downside would obviously be the increased delay, going from the expected 40 ms (the smallest audio buffer size the recording machine can provide) to a slightly less real-time 50 ms. And to absorb ping spikes of up to 20 ms, I would have to raise it to a not-so-real-time 60 ms, and so on.
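The “equalize the delay” idea is essentially a jitter buffer. Rather than sleeping per packet, one common shape is to hold playback until a target amount of audio has accumulated; a packet that arrives 10 ms late then eats into that margin instead of causing an underflow. A hypothetical sketch (all names and numbers are illustrative, not from the driver):

```cpp
#include <cstddef>

// Gate that delays draining the ring buffer until a jitter margin has
// accumulated, and re-primes after an underflow instead of limping along.
struct JitterGate {
    size_t targetBytes;    // e.g. 10 ms worth: 960 bytes at 48 kHz/16-bit/mono
    bool   primed = false;

    // Call with the current buffer fill level before each read;
    // returns true once playback may proceed.
    bool Ready(size_t bufferedBytes) {
        if (!primed && bufferedBytes >= targetBytes)
            primed = true;       // enough margin accumulated, start draining
        if (primed && bufferedBytes == 0)
            primed = false;      // ran dry: rebuild the margin before resuming
        return primed;
    }
};
```

The target level is exactly the latency/robustness knob described above: a 10 ms target absorbs up to 10 ms of jitter at the cost of 10 ms of extra delay.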

I’ve also read that the Sleep() function does not guarantee delaying exactly the amount of time passed as its parameter, and may sleep longer depending on CPU load, so I’m wondering if it’s really a good idea to use it.

What do you think about it? Should I go for it or did I miss another (simpler?) way to control the flow of data to make it run smoothly?

All true. Audio Engine usually transfers 10ms at a time at a high priority. If you don’t have data ready, then you get dropouts.

Yes, the Sleep(N) function sleeps for a MINIMUM of N milliseconds. The scheduler only re-evaluates the thread list during a timer interrupt, which happens every 16ms. During each scheduler interval, it checks to see which threads are ready-to-run and chooses one. If your timer has expired, then you switch from “blocked” to “ready-to-run” and might be picked.
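A related pitfall when pacing a stream with Sleep() is drift: each call overshoots a little, and the overshoots accumulate. Sleeping until an absolute deadline instead lets one period’s overshoot shorten the next period. This is a user-mode sketch of that pattern (the kernel-side equivalent would be a timer with an absolute due time, not this API):

```cpp
#include <chrono>
#include <thread>

// Run `work` once per period, pacing against absolute deadlines so that
// scheduler-induced oversleep in one iteration does not accumulate.
void PacedLoop(int periodMs, int iterations, void (*work)()) {
    using clock = std::chrono::steady_clock;
    auto next = clock::now();
    for (int i = 0; i < iterations; ++i) {
        work();
        next += std::chrono::milliseconds(periodMs);
        std::this_thread::sleep_until(next);  // absolute deadline, no drift
    }
}
```

Each individual wake-up is still only “at least this late”, as Tim describes, but the long-run rate stays correct, which is what matters for keeping a ring buffer’s fill level steady.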

Doing real time audio over a network has always been hugely problematic. Companies like Skype and Zoom have invested lots of man-hours trying to optimize that experience.


You seem to have understood the basic problem completely. There are no silver-bullet solutions to these fundamental problems. As usual, engineering judgment about the tradeoffs of one solution or another has to be used. The basic tradeoff is between audio quality and audio latency. If you are content for the audio to be played on the remote system 1 or 2 hours later, you can ensure essentially 100% fidelity. The reason is that that latency is several orders of magnitude larger than the expected network latency, and all kinds of failures and retransmissions can take place during that time, so the result can be essentially perfect.

Presumably that latency is far too long, and you are trying for something finer.

If you need two-way audio, the problem is much harder because you need timing that works with natural conversation timing, but if it is one-way only, the problem is much simpler, since you only need to work out the latency characteristics of the particular network in question.
