Virtual Storport Miniport - HwStorStartIo

Nirranjan_K · February 19, 2010, 11:45am

The virtual Storport miniport documentation says “No locks are held prior to calling HwStorStartIo”, so multiple HwStorStartIo() calls can run on multiple CPUs in parallel.

How should a virtual Storport miniport driver queue and process the SRBs in the correct order, i.e. without running into out-of-order processing of SRBs?

Thanks,
nirranjan

Igor_Sharovar · February 19, 2010, 12:25pm

HwStorStartIo is called by the StorPort driver, and according to the MSDN documentation, “Storport always calls HwStorStartIo at DISPATCH IRQL by using an internal spin lock to ensure that requests are initiated sequentially.” I took this quote from the description of the HwStorStartIo routine for StorPort miniport drivers, but it should work for a virtual miniport driver too. Once you get a request in HwStorStartIo, you could put it on your own internal queue.
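A minimal sketch of such an internal queue, written as portable user-mode C so it can be compiled and tried outside the WDK. The `fake_srb` and `srb_queue` types and their helpers are hypothetical; `atomic_flag` stands in for whatever spin lock a real miniport would take (for example one obtained through StorPortAcquireSpinLock). The queue preserves only the order in which concurrent callers happen to win the lock, not any "original" submission order.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for an SRB; a real miniport receives the request
   block from Storport and would link it through its own extension. */
typedef struct fake_srb {
    unsigned long id;
    struct fake_srb *next;   /* link used only by our private queue */
} fake_srb;

/* Private FIFO guarded by a busy-wait lock; atomic_flag plays the role
   of a kernel spin lock in this user-mode sketch. */
typedef struct {
    atomic_flag lock;
    fake_srb *head;
    fake_srb *tail;
} srb_queue;

static void srb_queue_init(srb_queue *q) {
    atomic_flag_clear(&q->lock);
    q->head = q->tail = NULL;
}

static void lock_acquire(srb_queue *q) {
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire)) {
        /* spin until the current holder releases the lock */
    }
}

static void lock_release(srb_queue *q) {
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Called from the (hypothetical) StartIo path, possibly on several CPUs. */
static void srb_queue_push(srb_queue *q, fake_srb *srb) {
    assert(srb != NULL);
    srb->next = NULL;
    lock_acquire(q);
    if (q->tail != NULL)
        q->tail->next = srb;
    else
        q->head = srb;
    q->tail = srb;
    lock_release(q);
}

/* Called by whatever actually issues requests; returns NULL when empty. */
static fake_srb *srb_queue_pop(srb_queue *q) {
    lock_acquire(q);
    fake_srb *srb = q->head;
    if (srb != NULL) {
        q->head = srb->next;
        if (q->head == NULL)
            q->tail = NULL;
    }
    lock_release(q);
    return srb;
}
```

A worker (in a real driver perhaps a DPC or work item) would drain the queue with `srb_queue_pop` and issue each request.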

Igor Sharovar

OSR_Community_User · February 19, 2010, 12:27pm

Why would they need to be processed in any order? Since the disk could reorder operations, why would you need to…

–Mark Cariddi
OSR, Open Systems Resources, Inc.

NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

Peter_Wieland · February 19, 2010, 2:03pm

To back Mark’s statement: a SCSI or Storport miniport is free to issue the I/O operations it receives in any order it wishes. You do not need to maintain order because no component above you in the stack expects (or is allowed to expect) you to do so.

So queue them to the hardware however you see fit.
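To make "however you see fit" concrete, here is a sketch of one arbitrary policy: issue whatever is currently pending in ascending LBA order. `pending_io` and `order_batch_for_dispatch` are hypothetical names, and whether sorting helps at all depends entirely on the backing store; the point is only that arrival order does not have to be dispatch order.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical pending request: caller-assigned id plus starting LBA. */
typedef struct {
    unsigned id;
    unsigned long long lba;
} pending_io;

static int by_lba(const void *a, const void *b) {
    const pending_io *x = a;
    const pending_io *y = b;
    if (x->lba < y->lba) return -1;
    if (x->lba > y->lba) return 1;
    return 0;
}

/* One possible policy: a single elevator-style pass in ascending LBA
   order, ignoring the order in which the requests arrived. */
static void order_batch_for_dispatch(pending_io *batch, size_t count) {
    if (count > 1) {
        assert(batch != NULL);
        qsort(batch, count, sizeof batch[0], by_lba);
    }
}
```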

-p


James_Harper · February 19, 2010, 5:04pm

> To back Mark’s statement - a scsi or storport miniport is free to issue the
> I/O operations it receives in any order it wishes. You do not need to
> maintain order because no component above you in the stack is (or is
> allowed to) expect you to do so.
>
> So queue them to the hardware however you see fit.

Surely Windows has the concept of ‘write barriers’ in its disk I/O
queues? If I were you, I’d do a bit more research before taking the
advice to “queue them to the hardware however you see fit”. To take it
to a ridiculous extreme (and way beyond what the above text implied),
how is file system integrity going to be maintained if you reorder your
writes across a 5-minute timespan and you get a power failure or BSOD
in that time? Windows might think it has committed the journal to disk
when in fact it hasn’t.

I’m not sure how tagged SCSI requests would work either if you got the
requests in the order A, B, C but delayed queuing request A to the
hardware by an arbitrary amount of time…

James

Peter_Wieland · February 19, 2010, 6:20pm

If you need write A and write B to complete in a particular order, you send B after A completes. If you need a barrier, you can issue a flush.

As the storage miniport, you may not complete the request for the I/O operation until you’ve actually sent it to the disk. But you may send them in any order you like.

-p


James_Harper · February 19, 2010, 8:41pm

> If you need write A and write B to complete in a particular order you
> send B after A completes. If you need a barrier you can issue a flush.
>
> As the storage miniport you may not complete the request for the I/O
> operation until you’ve actually sent it to the disk. But you may send
> them in any order you like.

I thought TCQ solved all these problems years ago: you could queue up
write requests like A B C D E F G H, and the storage device would
reorder A-D and E-H however it liked, as long as A-D were completed
before E-H were started, thus keeping the I/O pipeline full and letting
the drive reorder the requests to maximise throughput. At the driver
level, waiting for requests to complete before starting others seems
pretty poor for performance, especially for a storage system with high
latencies (e.g. FC over a WAN link).
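A simplified model of SCSI simple versus ordered queue tags, assuming the usual SAM rule that an ORDERED command waits for every earlier command and holds back every later one (head-of-queue tags and error handling are ignored). In this model the A-D / E-H split above roughly corresponds to sending E with an ORDERED tag and everything else SIMPLE. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { TAG_SIMPLE, TAG_ORDERED } tag_type;

typedef struct {
    tag_type tag;
    bool completed;
} queued_cmd;

/* May cmds[i] start now, given which earlier commands have completed? */
static bool may_start(const queued_cmd *cmds, size_t i) {
    assert(cmds != NULL);
    for (size_t j = 0; j < i; j++) {
        if (cmds[j].completed)
            continue;
        /* An incomplete ORDERED command fences everything after it. */
        if (cmds[j].tag == TAG_ORDERED)
            return false;
        /* An ORDERED command itself waits for every earlier command. */
        if (cmds[i].tag == TAG_ORDERED)
            return false;
    }
    return true;
}
```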

But maybe you’re the head storage guy at Microsoft, so I’m a little
reluctant to argue.

James

OSR_Community_User · February 19, 2010, 10:36pm

I think waiting for a request to complete before issuing another one only
has a performance impact if there are no other requests you can issue that
don't depend on the ordered requests. Typically a disk has requests coming
from many sources, so it looks like:

Source 1 - A1,B1,C1,sync,D1,E1
Source 2 - A2,sync,B2,C2,D2
Source 3 - A3,B3,C3,D3,E3,sync

So the disk request pool looks like:

Start - A1,B1,C1,A2,A3,B3,C3,D3,E3
After A2 completes it might look like - C1,B2,C2,D2,A3,B3,C3,D3,E3
After C1 completes it might look like - D1,E1,C2,D2,D3,E3
After E3 completes it might look like - E1,D2,…
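The pool listing above can be reproduced with a small model. This sketch assumes one reading of the notation: a `sync` holds back everything after it until every earlier request from the same source has completed, and the pool is the set of issued but not yet completed requests. `source_item` and `collect_pool` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One entry in a source's stream: a request name such as "A1", or NULL
   to mark a sync point. */
typedef struct {
    const char *name;
    bool completed;
} source_item;

/* Appends this source's requests that are currently in the disk's pool to
   out[] starting at index count; returns the new count. out must be large
   enough to hold every request of every source passed in. */
static size_t collect_pool(const source_item *items, size_t n,
                           const char **out, size_t count) {
    assert(items != NULL && out != NULL);
    bool earlier_pending = false;
    for (size_t i = 0; i < n; i++) {
        if (items[i].name == NULL) {      /* sync point */
            if (earlier_pending)
                break;                    /* later requests held back */
            continue;
        }
        if (!items[i].completed) {
            out[count++] = items[i].name;
            earlier_pending = true;
        }
    }
    return count;
}
```

Calling it once per source and concatenating the results gives the pool; with nothing completed, sources 1-3 yield A1,B1,C1 / A2 / A3,B3,C3,D3,E3, which matches the "Start" line.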

My guess is you could come up with workloads where TCQ helps, but I can also
imagine lots of workloads where it doesn’t.

TCQ also seems to complicate error recovery terribly. If you have
A,sync,B I would hope B only executes if A was successful.

Without being the head storage designer of a major OS, looking at lots of
workload analysis data, and seeing how real-world hardware handles things
like errors (devices do not always follow the standards), it’s hard to say
whether TCQ is a win over just waiting for completions. I suspect file
systems can be designed so that global synchronization is rarely needed,
and as a result the benefits of TCQ are less significant. Global sync points
are when you need to wait for a completion and can’t issue any new requests.
It seems like one design goal of a modern file system would be to minimize
the synchronization points needed.

Another issue: since a logical disk might be multiple spindles, TCQ does not
help ordering across multiple spindles. Disk drives do not talk to each
other, although the logical disk may be an array controller, in which case
TCQ is really communicating with the array controller about ordering. This
implies that either the local storage pool management (e.g. software RAID)
would need to implement the TCQ synchronization, or it would report no TCQ
support and the file system would need to alter its synchronization behavior
based on disk characteristics.

If we look at medium/larger servers, the storage has a good chance of
actually being sophisticated array controllers, which cache writes in NVRAM
or something similar, so they will return success to many writes immediately
anyway.

What exactly does “ordering” mean if you have multiple initiators to a disk,
as is common on clusters and with NTFS cluster storage?

Jan


Mark_Roddy · February 19, 2010, 11:09pm

Is ordered TCQ used? As far as I know, NT only uses simple TCQ: it
tosses multiple requests at a device and leaves it to the device to
worry about how those requests get processed.


–
Mark Roddy

Alex_Grig · February 19, 2010, 11:35pm

There is no “in order” or “out of order” concept for requests coming in on
different CPUs. Likewise, you can’t tell whether an event happened earlier
on Earth or on Mars if they happen within a certain time window. A request
issued by an application on one CPU may take an unpredictably shorter or
longer time to reach the driver than a request issued on another CPU.
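One way to see the point: any "order" a driver observes is created at some serialization point. In this sketch each request takes a ticket from an atomic counter on entry, and the ticket records only which caller reached the increment first. `next_ticket`, `stamp_request` and `ticket_cmp` are hypothetical; a kernel driver might use InterlockedIncrement rather than C11 atomics.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical global ticket counter shared by all CPUs. */
static atomic_ulong next_ticket = 0;

/* Stamp a request on entry. Two requests racing in on different CPUs get
   tickets in whatever order the hardware serializes the increments, which
   says nothing about when the application issued them. */
static unsigned long stamp_request(void) {
    return atomic_fetch_add_explicit(&next_ticket, 1, memory_order_relaxed);
}

/* Orders two stamped requests; tickets are unique, so they never tie. */
static int ticket_cmp(unsigned long a, unsigned long b) {
    assert(a != b);
    return a < b ? -1 : 1;
}
```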


OSR_Community_User · February 20, 2010, 2:07am

It’s been a while since I decoded the SCSI command packets on Windows, but
I also believe it doesn’t use ordered TCQ at all, so the device is free to
reorder things as it desires. The OS does set the FUA (force unit access)
bit on requests if the storage IRP has SL_WRITE_THROUGH in its flags.
Also note that advanced storage arrays often lie about completion, even when
the FUA bit is set, as they have ways of caching a write such that it (in
theory) should not get lost (like storing the write request in NVRAM). You
would need to read the SCSI spec to decide whether FUA causes a write fence
relative to other pending requests.
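For reference, a sketch of a WRITE(10) CDB with the FUA bit (byte 1, bit 3) set when write-through is requested, roughly the mapping described above. `SKETCH_SL_WRITE_THROUGH` is a hypothetical stand-in assumed to mirror SL_WRITE_THROUGH from wdm.h (verify the value), and the helper name is ours. As we read SBC, FUA governs only when that one command may report completion; it does not by itself order it against other outstanding commands.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_WRITE10_OPCODE 0x2Au  /* WRITE(10) operation code */
#define SKETCH_WRITE10_FUA    0x08u  /* FUA: CDB byte 1, bit 3 */

/* Hypothetical: assumed to mirror wdm.h's SL_WRITE_THROUGH; verify. */
#define SKETCH_SL_WRITE_THROUGH 0x04u

/* Builds a WRITE(10) CDB, setting FUA when the IRP stack flags ask for
   write-through. FUA affects when this command may complete, not its
   position relative to other queued commands. */
static void build_write10(uint8_t cdb[10], uint32_t lba, uint16_t blocks,
                          uint8_t irp_stack_flags) {
    assert(cdb != NULL);
    memset(cdb, 0, 10);
    cdb[0] = SKETCH_WRITE10_OPCODE;
    if (irp_stack_flags & SKETCH_SL_WRITE_THROUGH)
        cdb[1] |= SKETCH_WRITE10_FUA;
    cdb[2] = (uint8_t)(lba >> 24);    /* logical block address, big-endian */
    cdb[3] = (uint8_t)(lba >> 16);
    cdb[4] = (uint8_t)(lba >> 8);
    cdb[5] = (uint8_t)lba;
    cdb[7] = (uint8_t)(blocks >> 8);  /* transfer length, big-endian */
    cdb[8] = (uint8_t)blocks;
}
```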

The link at http://msdn.microsoft.com/en-us/library/dd979523(VS.85).aspx has
some curious info about Transactional NTFS’s interaction with storage
devices.

Jan


Bob_Kroeter · February 20, 2010, 8:24pm

> It’s been a while since I decoded the scsi command packets on Windows,
> but also believe it doesn’t use ordered tcq at all, so the device is free to
> reorder things as it desires

Sorry, this design paradigm really bothers me, as I have seen too many driver writers doing such things. Just because you haven’t seen a particular action in the debugger eons ago is in no way a valid reason to skip such an important design consideration that impacts data integrity. Any new Windows version or update may, if it has not already, legally choose to use an ordered queue at any time. Could you really sleep well at night knowing such a time bomb is in your code that could trash every single customer disk drive? And by the way, FUA has nothing to do with ordering AFAICS.

Mark_Roddy · February 20, 2010, 10:18pm

It mostly doesn’t affect a Storport miniport at all. Miniports do not
perform queueing; they pass requests from Storport to the addressed
device. Any queueing issues and requirements are the domain of the
actual device, Storport itself, and the initiator. However, I worked
directly on Windows storage SCSI and Storport drivers and on storage
stack drivers for ten years up until last year, and I never once saw
anything other than simple tagged requests. I am convinced that the
only mechanisms in use in NT to enforce any ordering on a storage device
are flush mechanisms. That said, and as I started out, even if NTFS or
disk.sys suddenly started using ordered tag queues, that would not
affect miniports, SCSI or Storport, in the slightest. They would
simply be passing those requests on at the direction of the containing
port driver.

Mark Roddy


OSR_Community_User · February 21, 2010, 12:17am

It was more of a calibration/disclaimer on the validity of my comment (i.e.
do your own research for a better answer). I actually DID look through the
sources of the WDK 7600 disk class driver, but I have never run the WDK 7600
disk class driver and don’t know if it matches the current shipping driver
(it used to). I found no evidence of ordered tag queuing happening in those
sources.

There are no specifications other than the disk class driver sources that
might describe this area, so I suspect reverse engineering OS behavior would
be needed. Reverse engineering never tells you the intent/future direction
of a component’s owner, only its past and current implementation. Other
options would be to open a support ticket with Microsoft, which may or may
not get a definitive answer about Windows’ use of tagged queuing, or, if
you’re one of the fortunate people to have source code access, you could do
your own research. Neither of these options will likely tell you what’s in
the future product plans, only past/current implementation.

I would certainly love a design paradigm where EVERYTHING is deeply
documented, but that is simply not reality for the majority of developers on
this list.

I did say you would have to look at the SCSI spec to understand any queuing
impact setting the FUA bit would have.

I do like to think 15 years of experience writing Windows drivers gives my
comments rather more accuracy than some. I was also concurring with Mark’s
comment, and he is also a VERY experienced Windows driver developer. Unless
you have the Windows 7 source code in front of you, or a SCSI trace showing
ordered tag queuing happening (i.e. reverse engineering), it seems like the
best available evidence from two highly experienced developers says the
chances are better than 50% that Windows does not use ordered tagged
queuing. The reality is, YOU as an engineer would have to come to your own
conclusion about what reality is, and make your engineering decision based
on your views. Mark and I are just data points to be considered.

So how do you suggest we solve the problem of writing software in an
environment where we don’t absolutely know all decisions by other developers
of other components? The most accurate answer about Windows TCQ is, it’s not
specified what the behavior is. The question then becomes, what are the
boundaries on what software you can write given the unspecified behavior. It
only takes reading this list for a short time to see many developers have
little interest in putting boundaries on their product because certain
behavior is unspecified.

The sticky problem is that since some developers have no problem making
assumptions about unspecified behavior, giving their products new features,
ALL of us have to create products that compete with these products. I’ve had
this conversation a million times with management/sales/marketing people:
“There is no official way to make that feature work, we might find a way
that bends/breaks the rules, but as an engineer, I have to recommend against
doing things we can’t be sure are stable and work correctly.” Frequently,
the response from management/sales/marketing is “But brand B has that
feature, and to be competitive we need it too, so do anything you need to do
to make that feature work, and we’ll deal with any problems later”. I as a
developer then am faced with the ethical conflict of: do I refuse to do
things that I believe are inappropriate or risky, possibly risking my
employment or client relationship, or do I implement features that I have no
way of validating as safe (i.e. it seems to work in (often inadequate)
testing, but who knows what those other components really do). I actually
have somewhat of a reputation of sticking with my ethical values, and not
bending to pressure from peers/authority, it doesn’t always make me liked
(at least initially, but then when I deliver stable products and others
deliver unstable products, I get a lot of respect). I’m sure this is a
problem Microsoft grapples with all the time, because the instability caused
by some developers reflects back on them.

All engineering is about risk management, whether it’s designing a building or
writing software. Nothing is absolutely risk free, as there are always
factors you can’t control. Engineers make their best judgment call of what
reality will be like in the future, and design products with the assumption
that view of reality is accurate. Engineers frequently don’t get the last
word on these risk management tradeoffs.

Jan


Calvin_Guan-3 · February 21, 2010, 2:45pm

In the event of “do it or go home”, I would clearly put the
assumptions/risks in bold in the design specification, honestly spelling out
under what circumstances the design may have what kinds of unintended
effects and how bad they could be. Somebody must sign off before the ball
starts rolling, right? As long as my product performs to spec, I’m off the
hook.

Calvin Guan
