Too many ioctls

Hi Loren,
Could you send some of that user-mode code you mentioned earlier, the code
with 250us response accuracy and jitter of less than 2-3 ms? Those numbers
are similar to what we are looking at. It would be very helpful to me if
you could send it as soon as possible.
Thanks
Mayank

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com]On Behalf Of Loren Wilton
Sent: Tuesday, May 20, 2003 6:13 AM
To: NT Developers Interest List
Subject: [ntdev] Re: Too many ioctls

> I didn’t get what you meant by ‘kind of questions’. The processing engine
> has certain real-time requirements which I feel cannot be met in user
> mode. The processing requirements of the engine keep increasing and
> decreasing dynamically based on the number of connections it has to
> handle. I don’t think these kinds of dynamic real-time requirements can
> be met in user mode, especially with the intensive processing algorithms
> that run on each connection handled in kernel mode. You can think of it
> as a real-time streaming kind of thing.

It is always good in these cases to actually implement it in user space and
SEE if the requirements can be met, before going to any work at all to move
it to kernel space. I have stuff in user mode that will normally send out
hardware events with around 250us response accuracy, and almost never has a
jitter of more than 2-3ms.

I find it highly questionable that you will inherently get much higher
accuracy by implementing in kernel space. Indeed, you will quite possibly
end up worse off, unless you go to A LOT of work. Work you wouldn’t have to
do in user space.

Loren


You are currently subscribed to ntdev as:
xxxxx@intersolutions.stpn.soft.net
To unsubscribe send a blank email to xxxxx@lists.osr.com

The additional overhead for user mode is the CSWITCH (context-switch)
overhead. My guess is that won’t be your bottleneck. Your approach of
prototyping in user mode is absolutely the right idea. I also don’t think
you should write the user-mode code so that it can run inside the kernel.
The kernel environment is too weird to make this work correctly, and you
will ultimately end up writing more code in user mode to achieve your
portability.

Also note that sustained throughput is easier to achieve, even if there is
additional CPU overhead, by using sufficient buffering. We have been able
to saturate a 100 Mbps NIC using Windows with code running in user mode.

Nar Ganapathy
Windows Core OS group
This posting is provided “AS IS” with no warranties, and confers no rights.

“Mayank Kumar” wrote in message
news:xxxxx@ntdev…
>
> hi nar
> We are more concerned with the performance advantage rather then the
> deployment advantage as far
> as i know. But we will have more data once the first version is ready and
we
> are able to see some
> performance figures.
> Initially the approach i am taking is that the sofwtare will be written
in
> a way to be portable
> between the user mode and kernel mode. it will be first tested in user
mode
> and then in kernel mode
> to see if there are any difference in the performance numbers. I think it
> should not be very difficult
> enough to implement this approach.
> Later on if we are very sure that no advantages are gained in terms of
> performance inside the
> kernel then we will have to have a method of efficiently tranferring
packets
> from the kernel to user mode
> at the rate of approximately 1024 kbytes per second. If that is possible
> then we will be able to
> work in user mode also. i think a shared memory approach will save a lot
of
> effort but what i afraid
> of is that a shared memory approach requires a lot of locks and other
> synchronization mgmt which
> will hit performance again.
>
> What do u think about it ?
>
> Mayank
>
>
> -----Original Message-----
> From: xxxxx@lists.osr.com
> [mailto:xxxxx@lists.osr.com]On Behalf Of Nar Ganapathy[MS]
> Sent: Wednesday, May 21, 2003 3:32 AM
> To: NT Developers Interest List
> Subject: [ntdev] Re: Too many ioctls
>
>
> A thread in kernel mode is no different than a thread in user mode from a
> scheduling point of view.
>
> I agree that sharing memory buffers is tempting. You could use a
> METHOD_DIRECT IOCTL or a DO_DIRECT_IO device object to get a locked read
> buffer from user mode and post it to hardware. This avoids a copy. In my
> opinion, the other benefits of user mode far outweigh this. If you also
> look at this from angles other than performance (for example, deployment)
> you will see that user mode is better. Think about how hard it is to do
> an online upgrade of your driver to fix a bug in your engine.
>
> –
> Nar Ganapathy
> Windows Core OS group
> This posting is provided “AS IS” with no warranties, and confers no
> rights.
>
> “Mayank Kumar” wrote in message
> news:xxxxx@ntdev…
> >
> > Hi Nar,
> > So you mean that a kernel-level thread running at PASSIVE_LEVEL IRQL is
> > in no way more frequently scheduled than a user-level thread running at
> > a higher priority? If I have certain audio algorithms (with high MIPS
> > requirements) which I need to apply to packets received from the
> > network, will I gain no advantage by implementing the driver in kernel
> > mode? At least the one advantage I can see is that I will not have to
> > pass/copy the buffers from kernel mode to user mode. That much
> > advantage could be a big one too. What do you say?
> > Does Windows 2000 provide something called zero-copy packet transfers,
> > like the 2.4 Linux kernel does?
> >
> > Thanks
> > Mayank
> >
> > -----Original Message-----
> > From: xxxxx@lists.osr.com
> > [mailto:xxxxx@lists.osr.com]On Behalf Of Nar Ganapathy [MS]
> > Sent: Tuesday, May 20, 2003 7:19 PM
> > To: NT Developers Interest List
> > Subject: [ntdev] Re: Too many ioctls
> >
> >
> > When I said “kind of questions” I meant that most of the questions
> > seem to be software-related rather than device-related. The Windows
> > scheduler does not treat kernel-mode threads any differently from
> > user-mode threads, so you really don’t gain anything by writing a
> > kernel-mode driver (apart from the increased complexity). A
> > higher-priority user-mode thread can preempt a kernel thread. (Running
> > at DISPATCH_LEVEL to avoid preemption is not the right answer either;
> > you cannot do a lot of things at that level.) Writing this kind of code
> > in the kernel is absolutely the wrong idea. I think you underestimate
> > the complexity of writing kernel-mode code.
> >
> >
> >
> > –
> >
> > Nar Ganapathy
> >
> > Windows Core OS group
> >
> > This posting is provided “AS IS” with no warranties, and confers no
> > rights.
> >
> >
> >
> > “Mayank Kumar” wrote in message
> > news:xxxxx@ntdev…
> > >
> > > Hi Nar,
> > > I didn’t get what you meant by ‘kind of questions’. The processing
> > > engine has certain real-time requirements which I feel cannot be met
> > > in user mode. The processing requirements of the engine keep
> > > increasing and decreasing dynamically based on the number of
> > > connections it has to handle. I don’t think these kinds of dynamic
> > > real-time requirements can be met in user mode, especially with the
> > > intensive processing algorithms that run on each connection handled
> > > in kernel mode. You can think of it as a real-time streaming kind of
> > > thing.
> > >
> > > I hope I am clear.
> > >
> > > Regards
> > > Mayank
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: xxxxx@lists.osr.com
> > > [mailto:xxxxx@lists.osr.com]On Behalf Of Nar Ganapathy[MS]
> > > Sent: Monday, May 19, 2003 9:50 PM
> > > To: NT Developers Interest List
> > > Subject: [ntdev] Re: Too many ioctls
> > >
> > >
> > > The kind of questions you are posing here makes me wonder why you are
> > > implementing this processing engine in kernel mode at all. Why can’t
> > > you do it in user mode? You have full access to RPCs and other good
> > > stuff.
> > >
> > > –
> > > Nar Ganapathy
> > > Windows Core OS group
> > > This posting is provided “AS IS” with no warranties, and confers no
> > > rights.
> > >
> > > “Mayank Kumar” wrote in message
> > > news:xxxxx@ntdev…
> > > >
> > > > Hi all,
> > > > I have another query.
> > > >
> > > > I am implementing a processing engine which executes in the kernel.
> > > > While executing, the processing engine requires access to some data
> > > > structures, which are therefore maintained in the kernel. Some
> > > > parts of these data structures are configured by requests received
> > > > from user mode via ioctls.
> > > >
> > > > The problem is that there are many configuration parameters
> > > > required for the processing engine inside the kernel to execute. I
> > > > want to know: if I issue very many ioctl requests to configure
> > > > small things, will there be any issues? These requests will just
> > > > get/set the data structures inside kernel mode.
> > > > The other option is to maintain shared memory between the kernel
> > > > and user mode which can be accessed by both. I don’t know if this
> > > > is possible, and if it is, how much overhead it will be for the
> > > > kernel thread to access this shared memory. Also, in this case
> > > > there will have to be some lock mechanism, because both the kernel
> > > > and the user may simultaneously access these data structures for
> > > > reading and writing.
> > > >
> > > > The issue is which approach to take:
> > > > ---- implement a function as a collection of a number of IOCTLs, versus
> > > > ---- implement a function as a single IOCTL
> > > >
> > > > Regards
> > > > Mayank
> > >
> > >
> > >
> >
> >
> >
>
>
>

> to user mode. So this much advantage could be a big one too. What do you
> say?
>
> Does Windows 2000 provide something called zero-copy packet transfers,
> like the 2.4 Linux kernel does?

Zero-copy IO has been in NT since version 3.1, unlike Linux, which took 10
years to implement this old idea.

Max

> Hi Loren,
> Could you send some of that user-mode code you mentioned earlier, the
> code with 250us response accuracy and jitter of less than 2-3 ms? Those
> numbers are similar to what we are looking at. It would be very helpful
> to me if you could send it as soon as possible.
> Thanks
> Mayank

Sorry, the code is both proprietary and large. However, if you play around
for a while with the multimedia timers, various ways of waiting on events,
multiple threads, thread priority, and process priority class it can be
done. It isn’t the easiest thing in the world to do. But it is probably
easier at user level than at kernel level, simply because it will be A LOT
easier to debug.

Things you want to think about:

  1. Separate the data path from the interface, if any. Make sure the data
    path won’t get hung up waiting on the interface. This means separate
    threads.
  2. Use queues and possibly separate sender/receiver threads as necessary.
    Make sure that you decouple receiving speed from sending speed and vice
    versa. A blocked receiver shouldn’t slow down a sender.
  3. If you have multiple receivers, make sure they all have separate
    queues and are able to process messages even if other receivers are blocked.
    Same for multiple senders.
  4. Use fast locking algorithms on your queues. Keep the overhead down.
    When possible use queues that don’t require locking for safe operation.
  5. Use priority judiciously, but use it as necessary. Priority and
    processing time should be the inverse of each other. DO NOT do things that
    can take a long time (fractional millisecond time or longer) at a high
    priority. Fob that off on a lower priority thread.
  6. Assign lower priorities in debug builds. It will screw up your
    timing, but it will also keep you from rebooting due to a system lockup
    every time your code gets in a loop for some unexpected reason.
  7. Experiment with both priority and priority class. Experiment with
    different thread priorities even for your important worker threads. You
    might discover that it works better if the senders have a higher priority
    than the receivers. Or the other way around. Or maybe best when they are
    both the same, but even better if you take them both DOWN a priority notch.
    You will discover things about scheduling you never imagined!
  8. Measure measure measure. Get the scope out, and figure a creative way
    to connect it to some real wires on your machine, so you can SEE the
    processing time, and see how it changes as you fiddle priorities and
    algorithms.
  9. Measure measure measure. After you watch things on the scope, figure
    out how to accumulate a half hour’s statistics without disturbing the
    process, and stuff them into Excel. Do min, max, average and SD, and
    anything else that seems appropriate. Do this with a priority setup that
    didn’t look as good on the scope as the ‘best’ one you picked. You might
    get a surprise.

That’s the ten cent tour on thruput and turnaround time optimization.

Loren

Look into the MSDN-Library and search for CreateIoCompletionPort.

‘Windows Sockets 2.0: Write Scalable Winsock Apps Using Completion
Ports’

There are some hints about how many threads to use.

IoCompletionPorts seem to give better system response when your system
is IO driven.

Norbert.

“A husband is living proof that a wife can take a joke”
---- snip ----

Hi Loren,

That was very useful. Thanks a lot for taking the effort to write such a
detailed account. I will try to take care of those points. Most of the
first few have already been taken care of; for example, the interfaces
were already defined in such a way as to completely separate the receive
path from the transmit path.

I just wanted to know one thing: if everything is implemented in user
mode, then what is the most efficient way of receiving/transmitting
packets from/to the NIC? Will the protocol-driver approach be good, or
does Windows expose a user-mode API for efficient send/receive using zero
copy? Somebody suggested that Windows has had zero copy implemented since
NT 3.1.

Thanks
Mayank

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com]On Behalf Of Loren Wilton
Sent: Thursday, May 22, 2003 1:45 PM
To: NT Developers Interest List
Subject: [ntdev] Re: Too many ioctls

---- snip ----

Hi Loren,
Also see inline for some more questions.

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com]On Behalf Of Loren Wilton
Sent: Thursday, May 22, 2003 1:45 PM
To: NT Developers Interest List
Subject: [ntdev] Re: Too many ioctls

---- snip ----
  4. Use fast locking algorithms on your queues. Keep the overhead down.
    When possible use queues that don’t require locking for safe operation.
    [MAYANK] What are these? How do you fast-lock a queue? Some more hints
    would help here; I am doing some research on the net, but if you have
    any relevant links please send them.
---- snip ----
  9. Measure measure measure. After you watch things on the scope, figure
    out how to accumulate a half hour’s statistics without disturbing the
    process, and stuff them into Excel. Do min, max, average and SD, and
    anything else that seems appropriate. Do this with a priority setup that
    didn’t look as good on the scope as the ‘best’ one you picked. You might
    get a surprise.
    [MAYANK] Are there any good measurement or profiling tools that you
    yourself have used for this kind of analysis? Can you suggest some? Do
    they give function-level profiling?


> I just wanted to know one thing: if everything is implemented in user
> mode, then what is the most efficient way of receiving/transmitting
> packets from/to the NIC? Will the protocol-driver approach be good, or
> does Windows expose a user-mode API for efficient send/receive using
> zero copy? Somebody suggested that Windows has had zero copy implemented
> since NT 3.1.

Someone else will have to help you with this one; I have very little
network experience. Most of my stuff does realtime control and/or
streaming-audio type work.

Loren

> 4. Use fast locking algorithms on your queues. Keep the overhead down.
>    When possible use queues that don’t require locking for safe operation.
> [MAYANK] What are these? How do you fast-lock a queue? Some more hints
> would help here; I am doing some research on the net, but if you have
> any relevant links please send them.

Things like critical sections are a little less overhead than mutexes, when
you can use them.

You can also build queues that don’t require locking, although they do
usually require an event to wake up the receive thread or sometimes the send
thread. It’s an old hardware queue trick. Basically you have a circular
queue of some size, usually a power of 2, with separate read and write
pointers. The writer only increments the write pointer, and the reader only
increments the read pointer, so you don’t have to worry about simultaneous
access. Ideally you want these two variables in separate cache lines on a
multiprocessor system. The reader knows there is data available if the
write pointer and read pointer aren’t the same. The writer can write new
data unless the write pointer+1 equals the read pointer. Sometimes with
care you can design things so that you don’t always have to cause the event
when you write a message, and save a little more overhead that way.

> [MAYANK] Are there any good measurement or profiling tools that you
> yourself have used for this kind of analysis? Can you suggest some? Do
> they give function-level profiling?

For the stuff I’ve done I’ve usually ended up building my own capture
programs and usually used Excel to do the data analysis. There are of
course various hardware monitors that will also do at least some data
reduction, but they might either cost more than you can afford, or might not
do anything that you really want.

Loren

> send/receive using zero copy, as somebody suggested that Windows has had
> zero copy implemented since NT 3.1

Zero-copy (DO_DIRECT_IO) has been there since 3.1 for most driver stacks.
As for TCP sockets - I expect it has been there since 1994, when MS
scrapped the old STREAMS-based TCP implementation and replaced it with the
code from OS/2 LAN Manager, which is current for now and supports
zero-copy.

Set SO_SNDBUF to 0, then send large portions - larger than the possible
window size - to avoid pipeline stalls due to lack of data from the
sender.

This will do zero-copy sends. For receives, the best way is to post a lot
of overlapped receives on the socket and then process them using an IO
completion port.

Max