
A very strange question about WSK socket!!!

kzzz_ggg Member Posts: 17
edited May 24 in NTDEV

I wrote a TCP server; the client is a WSK kernel socket. The client sends 138 sectors to the server each time.
A: When the client driver runs, the client PC's network write traffic is 25m/s, and the server receives 25m/s as well.
B: But I also wrote an R3 socket client that sends the same 138 sectors; on the same PC, the traffic is 90m/s, and the server receives 90m/s too.

This is very strange. How can I solve the problem of slow traffic with WSK?

Thanks!

Comments

  • kzzz_ggg Member Posts: 17
    edited May 24

    The server is an R3 socket.
    The client sends in an endless loop.

  • Tim_Roberts Member - All Emails Posts: 13,970

    By "sectors", do you mean 512-byte sectors, so blocks of about 70k bytes?

    By "m/s", do you mean "megabytes per second"? That would usually be MB/s, and megabits per second would be "Mbps".

    What is the network here? Gigabit Ethernet? Wifi?

    Tim Roberts, [email protected]
    Providenza & Boekelheide, Inc.

  • MBond2 Member Posts: 331

    If any of us can help you at all, we need far more information about your design and the tests that you are performing. Design choices when writing socket clients and servers can have an enormous effect on performance (more than 1000x between the worst and the best). We need details on the network hardware, how the traffic is generated (including how many threads make send calls, what send buffer sizes are involved, and whether the calls block or not), and how the server handles the received data (again, threads, buffer sizes, and the number of concurrent IOs are all important details).

    FYI, on modern Windows a properly designed UM process can saturate a 10 Gb/s NIC with TCP traffic.

  • kzzz_ggg Member Posts: 17
    edited May 25

    @Tim_Roberts said:
    By "sectors", do you mean 512-byte sectors, so blocks of about 70k bytes?

    By "m/s", do you mean "megabytes per second"? That would usually be MB/s, and megabits per second would be "Mbps".

    What is the network here? Gigabit Ethernet? Wifi?

    Yes, it's 70k, and the speed is in MB/s; I converted it.
    The two PCs are directly connected with a network cable (Gigabit Ethernet).

  • kzzz_ggg Member Posts: 17

    @MBond2 said:
    If any of us can help you at all, we need far more information about your design and the tests that you are performing. Design choices when writing socket clients and servers can have an enormous effect on performance (more than 1000x between the worst and the best). Details on the network hardware, how the traffic is generated (including how many threads make send call, what send buffer sizes are involved, and whether they block or not) and how the server handles the received data (again, threads, buffer sizes and number of concurrent IOPS are all important details).

    FYI on modern Windows, a properly designed UM process, can saturate a 10 Gb/s NIC with TCP traffic

    It's the simplest possible example of a server and client (R0 and R3), nothing unusual.
    I think it is a problem with the socket send buffer, but WSK (WskControlSocket) does not provide that setting;
    only the SO_RCVBUF option can be set.

  • kzzz_ggg Member Posts: 17
    edited May 25

    When the WSK driver sends 70 KB blocks, the speed is 20 MB/s;
    when it sends 2048 KB blocks, it's 80 MB/s.
    But when the R3 client sends 70 KB blocks, the speed is 100 MB/s.

  • MBond2 Member Posts: 331

    So you have two physical machines connected with 1 Gb/s Ethernet, using either a crossover cable or a switch, right?

    Assuming that the transfer speed is limited by the network and not some other factor, you would expect it to max out at around 800 Mb/s for TCP traffic, since the required acknowledgements prevent TCP from achieving more. If I understand your units, you are seeing 90 MB/s, which is not bad: around 720 Mb/s. I assume that this is the 'good' performance number and you are trying to figure out why you get the 'bad' one in your other test? It is unclear to me which setup gives good performance and which gives bad.

    First, what would lead you to think that a send buffer exists? When TCP stream data is received,
    the TCP driver can do one of two things:
    1) if there is a buffer available to fill, put the data into it and signal completion
    2) buffer the data and reduce the receive window size

    When there is TCP stream data to be sent, the TCP driver can do one of two things:
    1) if there is TCP window available, send the data immediately and reduce the send window size
    2) queue the data until window becomes available

    I am glossing over a lot of details here obviously.

    The depth of the send queue is essentially unlimited, as it is just a queue of MDLs. But performance will suffer if it can't be kept 'full': every time the send path reaches #1 with window available but no data queued, transmission time is lost that could have carried more data.

    The depth of the receive buffer is finite, however, and every time the receive path hits #2 and can't immediately complete, performance is lost: not just because that buffer is consumed, but because the receiver will then tell the other end to slow down sending data until it can catch up.

    So, obviously, much depends on exactly how you acquire the data that is to be sent and exactly how you receive it, and specifically on the threading model and IO pattern used. A lot depends on the protocol implemented on top of the TCP stream, but in general a good design will send data as soon as it can from as many threads as it can (few protocols allow effective multi-threaded send on a TCP socket), and will plan to have several pending reads outstanding at all times so that there is always a new buffer to fill when the next portion of the TCP stream becomes available. Keep track of the order in which these buffers were provided, so that the stream can be reassembled correctly.

    There are many details missing from this summary, but I hope it gets you thinking along the right track.

  • kzzz_ggg Member Posts: 17

    @MBond2 said:
    So you have two physical machines connected with 1 Gb/s Ethernet using either a cross over cable or a switch right? [...]

    I want to emphasize two things.

    1. It's the simplest possible example of a server and client (R0 and R3); the server is set up as follows:
      INT Nodelay = 1;
      setsockopt(hSocket, IPPROTO_TCP, TCP_NODELAY, (CHAR *)&Nodelay, sizeof(Nodelay));

      INT BufSize = 4096 * 512;  /* 2 MB receive buffer */
      setsockopt(hSocket, SOL_SOCKET, SO_RCVBUF, (CHAR *)&BufSize, sizeof(BufSize));

    2. The R0 and R3 clients run on the same PC, and the code is almost the same: connect and send to the server, no other settings, same send block size. The only difference is that one is an application-layer socket and the other is kernel WSK, yet the results differ: R0 is 20 MB/s, R3 is 100 MB/s.

    Only one server and one client run in each test.

  • MBond2 Member Posts: 331

    I don't know how to help you if you can't provide more details about what you are doing and the results that you see.

    But I can tell you that if you can only achieve 100 MB/s over the loopback adapter on a reasonable PC, you are doing something badly wrong. It is probably a threading or IO pattern issue, since you should be able to achieve about 100x that speed.

    If I had to guess, I would guess that you need to learn more about overlapped IO and the Windows IO model. That's a total guess, so please take no offense.

    Also, TCP_NODELAY is irrelevant on the loopback adapter and probably does not do what you think it does on a real network. The no-delay option disables the algorithm that coalesces small socket writes into fewer network packets, reducing the number of Ethernet + IP + TCP headers that need to be sent versus the number of bytes of TCP stream data.

  • kzzz_ggg Member Posts: 17

    @MBond2 said:
    I don't know how to help you if you can't provide more details about what you are doing and the results that you see. [...]

    Thank you for your help

  • MBond2 Member Posts: 331

    I'm not sure that I have helped, but you are welcome. If you can describe your problem better, I think we can help you more.
