Why is sending data via TDI so slow?

I’ve been working on a storage project in which TDI is used to send data to a remote host. I’ve implemented a kernel-mode socket using TDI. However, sending data this way is much slower than with a user-mode socket. I’ve been told that the KeWaitForSingleObject after IoCallDriver is the cause. I tried dropping that wait, only to get a blue screen reporting an access violation. I believe the cause was that, with the KeWaitForSingleObject gone, by the time TCPIP.sys accessed the memory buffer described by the MDL, the buffer had already been freed by another part of my program. Could someone give me some hints on how to improve the sending speed of my TDI client implementation? Thanks in advance.

Here’s the listing of the send function:

NTSTATUS tdi_send_stream(PFILE_OBJECT connectionFileObject, const char *buf, int len, ULONG flags)
{
    PDEVICE_OBJECT  devObj;
    KEVENT          event;
    PIRP            irp;
    PMDL            mdl;
    IO_STATUS_BLOCK iosb;
    NTSTATUS        status;

    devObj = IoGetRelatedDeviceObject(connectionFileObject);

    KeInitializeEvent(&event, NotificationEvent, FALSE);

    /* Builds a synchronous IRP; the event is signaled when it completes. */
    irp = TdiBuildInternalDeviceControlIrp(TDI_SEND, devObj, connectionFileObject, &event, &iosb);
    if (irp == NULL)
    {
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    if (len)
    {
        mdl = IoAllocateMdl((void *) buf, len, FALSE, FALSE, NULL);
        if (mdl == NULL)
        {
            IoFreeIrp(irp);
            return STATUS_INSUFFICIENT_RESOURCES;
        }

        __try
        {
            MmProbeAndLockPages(mdl, KernelMode, IoReadAccess);
            status = STATUS_SUCCESS;
        }
        __except (EXCEPTION_EXECUTE_HANDLER)
        {
            IoFreeMdl(mdl);
            IoFreeIrp(irp);
            status = STATUS_INVALID_USER_BUFFER;
        }

        if (!NT_SUCCESS(status))
        {
            return status;
        }
    }

    TdiBuildSend(irp, devObj, connectionFileObject, NULL, NULL, len ? mdl : NULL, flags, len);

    status = IoCallDriver(devObj, irp);

    if (status == STATUS_PENDING)
    {
        /* Blocks until the transport completes the send. */
        KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);
        status = iosb.Status;
    }

    /* Note: on success this returns the byte count (iosb.Information) as an NTSTATUS. */
    return NT_SUCCESS(status) ? iosb.Information : status;
}

I believe that you’ll find that the event is triggered only after the
TCP stack knows it does not need to retransmit the data and therefore
the MDL can be discarded along with the associated buffer. The stack
knows it does not need to retransmit the data when it gets the ACK
back from the other side.

In other words, you have introduced a delay of one round trip (RTT)
between each send. The delay will be even greater if the packets are
small and the Nagle algorithm comes into play.

Just like I/O to disk, network I/O benefits from having multiple
asynchronous IRPs outstanding in order to keep the pipe full.

Your current code is spending most of its time twiddling its thumbs
rather than pumping out the next block of data.
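
As a rough sketch of the idea (nothing TDI-specific here, and all the
names are made up): cap the number of in-flight sends with a semaphore,
release a slot from each send's completion callback, and block only
when the pipe is already full.

#include <ntddk.h>

#define IN_FLIGHT_MAX 8     /* tune to taste */

typedef struct _PIPE_STATE
{
    KSEMAPHORE Slots;       /* counts free send slots */
} PIPE_STATE;

VOID pipe_init(PIPE_STATE *ps)
{
    KeInitializeSemaphore(&ps->Slots, IN_FLIGHT_MAX, IN_FLIGHT_MAX);
}

/* Call before starting each asynchronous send; blocks only when
   IN_FLIGHT_MAX sends are already outstanding. */
VOID pipe_acquire_slot(PIPE_STATE *ps)
{
    KeWaitForSingleObject(&ps->Slots, Executive, KernelMode, FALSE, NULL);
}

/* Call from the send's completion callback to free a slot. */
VOID pipe_release_slot(PIPE_STATE *ps)
{
    KeReleaseSemaphore(&ps->Slots, IO_NO_INCREMENT, 1, FALSE);
}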

Mark.


You need to study kernel-mode asynchronous programming techniques.

The basic idea of asynchronous programming is that it is a two-step operation: the caller initiates an operation, and some time later it is called back and given the results of that operation.

You will have to modify not only your tdi_send_stream routine but also the routines that call it. Everything needs to follow this asynchronous model.

One part of your solution would be to replace your tdi_send_stream routine with something that I’ll call “tdi_start_send_buffer”. It would have parameters loosely defined like this:

NTSTATUS
tdi_start_send_buffer(
    PFILE_OBJECT connectionFileObject,
    const char *buf,
    int len,
    ULONG flags,
    PXYZ CompletionCallback,
    PVOID CompletionContext
    );

The CompletionCallback parameter is a pointer to a function that will be called when the buffer send completes. The CompletionContext parameter is memory that the caller of tdi_start_send_buffer allocates and initializes with enough information for the callback to know what is being completed (so it can re-use or free memory, for example…).

Within your tdi_start_send_buffer you must also pass a completion routine pointer and some (different) context to TdiBuildSend.
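
To make that concrete, here is a rough, untested sketch of what
tdi_start_send_buffer might look like. The PXYZ definition, the
SEND_CONTEXT structure, and the pool tag are all invented for this
sketch (they are not from any header); the key points are that the IRP
comes from IoAllocateIrp() and that all cleanup moves into the
completion routine:

#include <ntddk.h>
#include <tdikrnl.h>

/* Hypothetical definition of the PXYZ callback type the caller supplies. */
typedef VOID (*PXYZ)(PVOID Context, NTSTATUS Status, ULONG BytesSent);

typedef struct _SEND_CONTEXT
{
    PMDL  Mdl;
    PXYZ  Callback;
    PVOID CallbackContext;
} SEND_CONTEXT, *PSEND_CONTEXT;

static NTSTATUS tdi_send_complete(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Context)
{
    PSEND_CONTEXT ctx = (PSEND_CONTEXT) Context;

    UNREFERENCED_PARAMETER(DeviceObject);

    /* The transport is finished with the buffer, so it is safe to release it now. */
    if (ctx->Mdl != NULL)
    {
        MmUnlockPages(ctx->Mdl);
        IoFreeMdl(ctx->Mdl);
    }

    /* Hand the result back to whoever started the send (assumes a non-NULL callback). */
    ctx->Callback(ctx->CallbackContext, Irp->IoStatus.Status, (ULONG) Irp->IoStatus.Information);
    ExFreePool(ctx);

    /* We allocated this IRP ourselves, so free it and stop completion processing here. */
    IoFreeIrp(Irp);
    return STATUS_MORE_PROCESSING_REQUIRED;
}

NTSTATUS tdi_start_send_buffer(PFILE_OBJECT connectionFileObject, const char *buf, int len,
                               ULONG flags, PXYZ CompletionCallback, PVOID CompletionContext)
{
    PDEVICE_OBJECT devObj = IoGetRelatedDeviceObject(connectionFileObject);
    PSEND_CONTEXT  ctx;
    PIRP           irp;
    PMDL           mdl = NULL;

    ctx = (PSEND_CONTEXT) ExAllocatePoolWithTag(NonPagedPool, sizeof(SEND_CONTEXT), 'xtcS');
    if (ctx == NULL)
    {
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    /* IoAllocateIrp, not TdiBuildInternalDeviceControlIrp: we want a truly asynchronous IRP. */
    irp = IoAllocateIrp(devObj->StackSize, FALSE);
    if (irp == NULL)
    {
        ExFreePool(ctx);
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    if (len)
    {
        mdl = IoAllocateMdl((void *) buf, len, FALSE, FALSE, NULL);
        if (mdl == NULL)
        {
            IoFreeIrp(irp);
            ExFreePool(ctx);
            return STATUS_INSUFFICIENT_RESOURCES;
        }
        __try
        {
            MmProbeAndLockPages(mdl, KernelMode, IoReadAccess);
        }
        __except (EXCEPTION_EXECUTE_HANDLER)
        {
            IoFreeMdl(mdl);
            IoFreeIrp(irp);
            ExFreePool(ctx);
            return STATUS_INVALID_USER_BUFFER;
        }
    }

    ctx->Mdl = mdl;
    ctx->Callback = CompletionCallback;
    ctx->CallbackContext = CompletionContext;

    /* TdiBuildSend sets up the next stack location and wires in the completion routine. */
    TdiBuildSend(irp, devObj, connectionFileObject, tdi_send_complete, ctx,
                 mdl, flags, len);

    /* Fire and return immediately; all cleanup happens in tdi_send_complete. */
    IoCallDriver(devObj, irp);
    return STATUS_PENDING;
}

Note that I/O completion routines can run at DISPATCH_LEVEL, so
whatever your callback does must be safe at that IRQL.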

Hope you get the basic idea here.

Thomas F. Divine


Mark and Thomas,

I really appreciate your replies.

I’m not very familiar with asynchronous TDI techniques. The point is: is there any way to keep using synchronous TDI calls so that only minor changes to my code are needed, or is there anything I can do to disable the Nagle algorithm?

Thanks for your help!

Nope.

You must learn and use asynchronous techniques to obtain performance. There is no “quick fix”. If there were a quick fix, I am sure we would have suggested it.

Sorry.

Thomas F. Divine


Many thanks, Thomas.

Now the question becomes how to do it asynchronously, doesn’t it? Would you please point me to some more material on this? I really have no idea where to start. Thanks in advance!


Read about general asynchronous I/O issues in the WDK documentation, and pay special attention to I/O completion routines. If you look at the TdiBuildSend() documentation, you will see that it allows you to specify the address of a completion routine. If you specify one, that routine gets invoked when the I/O operation corresponding to this particular IRP completes.

Just keep in mind that TdiBuildInternalDeviceControlIrp() is just a macro around IoBuildDeviceIoControlRequest(), which works only for synchronous requests. Therefore, if you are interested in asynchronous I/O, you have to allocate your IRP with IoAllocateIrp() rather than with TdiBuildInternalDeviceControlIrp(). After allocating the IRP with IoAllocateIrp(), specify a completion routine in the subsequent call to TdiBuildSend(), and then pass the IRP down the stack; when it gets completed, your completion routine will be invoked.
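
A bare-bones, untested sketch of such a completion routine (SEND_CTX
and tdi_send_done are made-up names, and the usual ntddk.h/tdikrnl.h
includes are assumed); the detail worth calling out is that an IRP you
allocated with IoAllocateIrp() must be freed by you, with completion
stopped via STATUS_MORE_PROCESSING_REQUIRED:

/* Made-up per-send context; real code would also identify the buffer's owner. */
typedef struct _SEND_CTX
{
    PMDL Mdl;   /* describes the caller's buffer for the duration of the send */
} SEND_CTX, *PSEND_CTX;

static NTSTATUS tdi_send_done(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Context)
{
    PSEND_CTX ctx = (PSEND_CTX) Context;

    UNREFERENCED_PARAMETER(DeviceObject);

    /* Only now is the transport finished with the buffer the MDL describes,
       so only now may the buffer be reused or freed. */
    MmUnlockPages(ctx->Mdl);
    IoFreeMdl(ctx->Mdl);
    ExFreePool(ctx);

    /* This IRP came from IoAllocateIrp(), so free it here and tell the
       I/O manager to stop processing it. */
    IoFreeIrp(Irp);
    return STATUS_MORE_PROCESSING_REQUIRED;
}

The submission side is then just IoAllocateIrp(devObj->StackSize, FALSE)
followed by TdiBuildSend() with tdi_send_done and the context, and
IoCallDriver().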

One more thing worth mentioning is the TDI_SEND_NON_BLOCKING flag, which lets you achieve truly asynchronous behavior. If you specify it and the underlying transport has no internal buffer space available at the moment, your request fails immediately with STATUS_DEVICE_NOT_READY rather than waiting until the transport can buffer the given data internally. If you have registered a ClientEventSendPossible() callback with TDI_SET_EVENT_HANDLER, it gets invoked when buffer space becomes available, so your driver can try to re-submit the request (although there is no guarantee it will succeed; another client may submit its request before your send reaches the transport).
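
For instance, registering the handler might look roughly like this
(untested; register_send_possible_handler, my_event_send_possible, and
wakeEvent are made-up names, and note that event handlers are set on the
address object's file object, not the connection's):

/* Documented prototype shape for a send-possible event handler. */
static NTSTATUS my_event_send_possible(PVOID TdiEventContext,
                                       CONNECTION_CONTEXT ConnectionContext,
                                       ULONG BytesAvailable)
{
    UNREFERENCED_PARAMETER(ConnectionContext);
    UNREFERENCED_PARAMETER(BytesAvailable);

    /* Buffer space is available again: wake the sender so it can retry the
       TDI_SEND_NON_BLOCKING request that failed with STATUS_DEVICE_NOT_READY. */
    KeSetEvent((PKEVENT) TdiEventContext, IO_NO_INCREMENT, FALSE);
    return STATUS_SUCCESS;
}

NTSTATUS register_send_possible_handler(PFILE_OBJECT addressFileObject, PKEVENT wakeEvent)
{
    PDEVICE_OBJECT  devObj = IoGetRelatedDeviceObject(addressFileObject);
    KEVENT          event;
    IO_STATUS_BLOCK iosb;
    PIRP            irp;
    NTSTATUS        status;

    KeInitializeEvent(&event, NotificationEvent, FALSE);

    /* Synchronous is fine here: registration happens once, not per send. */
    irp = TdiBuildInternalDeviceControlIrp(TDI_SET_EVENT_HANDLER, devObj,
                                           addressFileObject, &event, &iosb);
    if (irp == NULL)
    {
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    TdiBuildSetEventHandler(irp, devObj, addressFileObject, NULL, NULL,
                            TDI_EVENT_SEND_POSSIBLE, my_event_send_possible, wakeEvent);

    status = IoCallDriver(devObj, irp);
    if (status == STATUS_PENDING)
    {
        KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);
        status = iosb.Status;
    }
    return status;
}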

To summarize, read the WDK documentation; the answers to your questions are there…

Anton Bassov

Anton,

Thanks for your comment; I’ll study the DDK some more…

:-)