REALLY slow TDI performance sending TDS packets after compression

Hello,
I started a previous thread concerning compressing the user buffer in an IRP from the higher driver before passing it on to the TCP driver. It was recommended that I instead create and use a substitute MDL. I did that and it worked! Thanks. However, I still have a secondary problem: performance.

On the same machine the performance is great. But if I connect from a remote client, the compressed packets are sent across the LAN extremely slowly; I'd say it takes 17 to 20 times longer to transmit the same number of packets. I'm guessing the TCP driver is timing out before sending each packet. In actuality, it may be the higher driver that is not sending any further asynchronous TDI_SENDs until the previous one completes. That is probably the real problem. If the higher driver would send further packets, then some TCP buffer would probably fill up and get sent. TDIMon shows the 4K buffer sends, but occurring at a very slow rate.

Again, on the local machine it is fast. The higher driver continues to send multiple TDI_SEND IOCTLs without waiting on previous ones to complete, which is the expected behavior. It is only with a remote client that it waits.

The strange part is that if I instead do the compression in the application and make the driver just a pass-through, then the performance issue goes away and the TCP traffic is fast. I can see in my driver that the packets arrive compressed and look just like they do when I instead let my driver compress and substitute the MDL.

Anyone have any idea as to what would cause this behavior?

For TCP to work at wire speed you need to build a pipeline: if the sender does not get an ACK from the receiver quickly enough, it tends to shrink the TCP window, resulting in performance degradation. Also make sure that with smaller packet sizes you don't hit Nagle (accumulated sends). And make sure you don't use the "send-send-receive" technique (make everything work using events, or have a worker thread blocked on a receive operation with a ready chained buffer to store incoming data).

If loopback works fine and the LAN is slow, I think Nagle is the first thing to turn OFF.
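
From user mode that is a single setsockopt() call; a minimal sketch, assuming "s" is your connected socket:

#include <winsock2.h>

/* Sketch: turn Nagle (send coalescing) off for one connected socket.
   Returns 0 on success, SOCKET_ERROR on failure. */
int DisableNagle(SOCKET s)
{
    int on = 1;
    return setsockopt(s, IPPROTO_TCP, TCP_NODELAY,
                      (const char *)&on, sizeof(on));
}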

Please dig through this list's archives using the keywords above. You'll find A LOT of interesting stuff.

Good luck!

-anton


Thanks. Good things to check. But I want the Nagle algorithm to be enabled. I want the TCP driver to accumulate multiple TDI_SENDs together to reduce the number of packets sent over the LAN. And the client is not sending back any sort of ACK; it is a one-way transmission. For some reason the higher driver is just not issuing the next TDI_SEND until the previous one has completed… unless the compression is actually done in the application instead of my driver. In that case the higher driver receives the sends already compressed and calls my driver as quickly as possible, not waiting for any previous IRP to complete.

Keeping Nagle enabled *in your case* is a bad idea. Read my explanation again. And the ACK is sent by the TCP stack, not by your "other side" application.
Are you really waiting on completion even when doing TDI_SEND calls?

-a


OK, I will try disabling the Nagle algorithm. I appreciate the insight. And I understand that the TCP protocol involves ACKs. I just meant that neither my driver nor the driver above mine does any TDI_RECEIVEs to receive data between the sending of each packet.
By the way, I just converted my driver to create its own IRP, send that one, and complete the original IRP in the completion routine. So I don't touch the original IRP at all, except to set the status and number of bytes sent on completion. That did not help at all. So indeed, it still sounds like Nagle is the culprit.
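
For reference, the pattern I'm describing looks roughly like this (a simplified sketch; the routine names, error paths, and MDL cleanup here are illustrative, not my exact code):

#include <tdikrnl.h>

typedef struct _SEND_CTX {
    PIRP  OriginalIrp;    /* the IRP from the higher driver */
    ULONG BytesToReport;  /* uncompressed byte count to report back */
} SEND_CTX, *PSEND_CTX;

static NTSTATUS
SendComplete(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Context)
{
    PSEND_CTX ctx = (PSEND_CTX)Context;

    UNREFERENCED_PARAMETER(DeviceObject);

    /* Only touch the original IRP to propagate status and byte count. */
    ctx->OriginalIrp->IoStatus.Status      = Irp->IoStatus.Status;
    ctx->OriginalIrp->IoStatus.Information = ctx->BytesToReport;
    IoCompleteRequest(ctx->OriginalIrp, IO_NO_INCREMENT);

    /* Substitute-MDL cleanup would also go here (omitted). */
    ExFreePoolWithTag(ctx, 'xtcS');
    IoFreeIrp(Irp);   /* we allocated this IRP, so we free it */
    return STATUS_MORE_PROCESSING_REQUIRED;
}

NTSTATUS
SendCompressed(PDEVICE_OBJECT LowerDevice, PFILE_OBJECT ConnFileObject,
               PMDL CompressedMdl, ULONG CompressedLen,
               PIRP OriginalIrp, ULONG OriginalLen)
{
    PSEND_CTX ctx;
    PIRP irp = IoAllocateIrp(LowerDevice->StackSize, FALSE);

    if (irp == NULL)
        return STATUS_INSUFFICIENT_RESOURCES;

    ctx = (PSEND_CTX)ExAllocatePoolWithTag(NonPagedPool, sizeof(*ctx), 'xtcS');
    if (ctx == NULL) {
        IoFreeIrp(irp);
        return STATUS_INSUFFICIENT_RESOURCES;
    }
    ctx->OriginalIrp   = OriginalIrp;
    ctx->BytesToReport = OriginalLen;

    IoMarkIrpPending(OriginalIrp);

    TdiBuildSend(irp, LowerDevice, ConnFileObject,
                 SendComplete, ctx, CompressedMdl, 0, CompressedLen);

    (VOID)IoCallDriver(LowerDevice, irp);  /* do not wait for it */
    return STATUS_PENDING;
}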
It still seems a mystery to me, however, that if the compression is done in the application instead of the driver, the exact same sized small packets are sent but everything is fast. So in short: compression done in the app, everything is fast; compression done in my driver, everything crawls. It's as if the driver above me is somehow involved in the Nagle algorithm, computing something based on the size of the packets it receives from the app and perhaps sending a fast IOCTL to the TCP driver. Sound likely?

> And the client is not sending back any sort of ACK; it is a one-way transmission.

How is it possible with TCP??? Sending ACKs by the recipient is an essential part of the TCP protocol. If no ACKs were sent, TCP would have no chance of ever being a protocol with guaranteed packet delivery and would instead simply be an unreliable datagram-based service. Are you speaking about TCP or UDP here???

If you are speaking about TCP then, judging from your description of the problem, it could be the so-called "silly window syndrome". It can happen either because the recipient gets bogged down for one reason or another and asks the sender to shrink the TCP window upon every transmission, or because the sender is desperate to send data straight away, no matter how small the segment is. In the latter case Nagle's algorithm may solve the problem. However, Nagle's algorithm is a "double-edged sword": there is a well-known conflict between Nagle's algorithm and delayed TCP acknowledgement that happens when two small writes are followed by a read (the first packet gets sent straight away; the second one is held up until the first write is acknowledged, but the acknowledgement is held up because of the delayed-ACK strategy). In such a case, you have either to disable Nagle's algorithm, or at least to make sure that you don't use the write-write-read sequence (both write-write-write and read-write-read-write sequences are OK, but write-write-read is a performance killer).
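
For instance, the two small writes can be coalesced into a single gather send, so there is nothing for Nagle to hold back. A minimal Winsock sketch (the function and buffer names here are made up for illustration):

#include <winsock2.h>

/* Hypothetical helper: send a small header and body as ONE logical
   write, so Nagle never holds the second piece back waiting for the
   delayed ACK of the first. Assumes "s" is a connected, blocking
   socket. */
int
SendHeaderAndBody(SOCKET s, char *hdr, u_long hdrLen,
                  char *body, u_long bodyLen)
{
    WSABUF bufs[2];
    DWORD  sent = 0;

    bufs[0].buf = hdr;  bufs[0].len = hdrLen;
    bufs[1].buf = body; bufs[1].len = bodyLen;

    /* One gather send instead of two send() calls. */
    return WSASend(s, bufs, 2, &sent, 0, NULL, NULL);
}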

Anton Bassov

Thanks Anton. I think I'm going to have to hibernate on this a bit and let it sink in, since I am new to network drivers. Thanks for pointing me in the right direction. I definitely see now how this is related to the Nagle algorithm and am starting to read a few white papers on it. Later.

Good! This means there is no "send-send-receive" issue.

AFD does its own re-buffering. Apply setsockopt(...) to set the send buffer size to zero and you should get exactly the same behavior in both user and kernel mode.
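
Something like this (a sketch; "s" is your connected socket):

#include <winsock2.h>

/* Sketch: zero SO_SNDBUF so AFD stops copying sends into its own
   buffers; each send then completes only when TCP has taken the
   data, which is what a kernel-mode TDI_SEND sees as well. */
int DisableAfdSendBuffering(SOCKET s)
{
    int zero = 0;
    return setsockopt(s, SOL_SOCKET, SO_SNDBUF,
                      (const char *)&zero, sizeof(zero));
}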

-anton



TCP has been dogged by "Nagle algorithm + delayed ACK" and "silly window syndrome" problems for at least 20 years, so your problem is not Windows-specific. What I would advise you to do is just Google "Nagle algorithm" and "silly window syndrome"; you will find quite a few docs that, unlike the RFCs, describe the whole thing in an easy-to-understand way.

After getting familiar with the theory behind your problem, try to analyze the differences between the IOCTL patterns that you see when compression is done by the app (i.e., the case that works fine) and when compression is done by your driver (i.e., the troublesome one). Apparently, AFD.SYS takes into account something that your driver does not…

Anton Bassov

Note: TDI_SEND is not completed until all ACKs have arrived for this data portion. This can be disastrous for performance if TDI is misused.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com


By which (I assume) Maxim means that you should not wait for your last send to complete before queueing your next send (if you have data to send). Pack as much send data into the pipe as you have and let TCP put it as efficiently as possible into the receive window.
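
In sketch form (SEND_ENTRY, the queue, and BuildSendIrp are assumed helpers, with BuildSendIrp wrapping IoAllocateIrp + TdiBuildSend as in the earlier sketch):

#include <ntddk.h>

typedef struct _SEND_ENTRY {
    LIST_ENTRY Link;
    PMDL       Mdl;
    ULONG      Length;
} SEND_ENTRY, *PSEND_ENTRY;

/* Assumed helper: allocates an IRP and fills it in with TdiBuildSend;
   its completion routine frees the entry once TCP has ACKed the data. */
PIRP BuildSendIrp(PDEVICE_OBJECT LowerDevice, PFILE_OBJECT ConnFileObject,
                  PMDL Mdl, ULONG Length, PVOID CompletionContext);

VOID
FlushSendQueue(PDEVICE_OBJECT LowerDevice, PFILE_OBJECT ConnFileObject,
               PLIST_ENTRY SendQueue)   /* assumed already synchronized */
{
    PSEND_ENTRY entry;
    PIRP irp;

    while (!IsListEmpty(SendQueue)) {
        entry = CONTAINING_RECORD(RemoveHeadList(SendQueue),
                                  SEND_ENTRY, Link);

        irp = BuildSendIrp(LowerDevice, ConnFileObject,
                           entry->Mdl, entry->Length, entry);
        if (irp == NULL)
            break;   /* out of resources; leave the rest queued */

        /* Fire and forget: do NOT wait for the previous send to
           complete before issuing the next one. */
        (VOID)IoCallDriver(LowerDevice, irp);
    }
}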

-dave


The best idea is to use a "pool" of pseudo-sockets (TDI connections) and effective re-buffering.
CPU cycles are cheap today in any case =)

-a
