concurrent outstanding writes and drbd

When using my PV drivers for Windows running under Xen, I get occasional
errors reported from drbd (Linux based RAID1 system where one disk is
local and the other is on another machine) reporting “Concurrent local
write detected!”. This means that while one write to a given sector was
outstanding, another was submitted. This is an issue because DRBD needs
to ensure consistency of both disks (local and remote) and can’t because
it doesn’t know how the multiple outstanding writes might be re-ordered.
In response to this situation it drops the second write and writes a
warning in the logs that it has occurred.
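
To make the condition concrete, here is a minimal sketch of the kind of check described above: a write counts as "concurrent" if its sector range overlaps a write that has been submitted but not yet completed. The `Write` and `InFlightTracker` names are invented for illustration and are not taken from DRBD or from the PV driver.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Write:
    """A write request covering sectors [start, start + count)."""
    start: int
    count: int

    def overlaps(self, other: "Write") -> bool:
        # Two half-open ranges overlap when each starts before the other ends.
        return (self.start < other.start + other.count
                and other.start < self.start + self.count)


@dataclass
class InFlightTracker:
    """Tracks writes that have been submitted but not yet completed."""
    outstanding: list = field(default_factory=list)

    def submit(self, w: Write) -> bool:
        """Record w; return True if it overlaps a write still in flight."""
        conflict = any(w.overlaps(o) for o in self.outstanding)
        self.outstanding.append(w)
        return conflict

    def complete(self, w: Write) -> None:
        self.outstanding.remove(w)


tracker = InFlightTracker()
first = Write(start=100, count=8)
second = Write(start=104, count=8)   # shares sectors 104..107 with `first`
third = Write(start=200, count=8)    # unrelated range

results = [tracker.submit(first), tracker.submit(second), tracker.submit(third)]

# Once the earlier writes have completed, the same range is no longer a conflict.
tracker.complete(first)
tracker.complete(second)
after = tracker.submit(Write(start=100, count=8))
```

Only the second submission is flagged, because it is the only one issued while an overlapping write is still outstanding.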

My question is: is this expected behaviour and “it’s just what Windows
does”, or could my driver not be managing requests correctly? I don’t
quite understand why Windows would ever submit multiple outstanding
writes to the same sector in the first place though. It’s easy to
reproduce using benchmarking software but still occurs occasionally
during normal use.

In all my time using my drivers (months of uptime), I have never had any
filesystem corruption or crashes, but based on my understanding of the
situation there may be the possibility that it could happen.

Thanks

James

Further reading suggests that this also happens when the Microsoft iSCSI
Initiator (and several other initiators) is used with the target on top
of DRBD, so it may be that it’s a low-level bug in the way Microsoft
(NTFS?) orders its writes…

It could be that the second write is identical to the first in which
case this is never going to be a problem, but of course DRBD can never
know that without inspecting every write so the warning messages will
keep occurring.

James

“James Harper” wrote in message
news:xxxxx@ntdev…
> Further reading suggests that this also happens when the Microsoft iSCSI
> Initiator (and several other initiators) with the target on top of DRBD,
> so it may be that it’s a low level bug in the way Microsoft (ntfs?)
> orders its writes…

It seems unlikely that NTFS has been tested less than DRBD, Xen and your
driver all together.
And why can’t a filesystem post two successive writes to the same block?
– pa


> “James Harper” wrote in message
> news:xxxxx@ntdev…
> > Further reading suggests that this also happens when the Microsoft iSCSI
> > Initiator (and several other initiators) with the target on top of DRBD,
> > so it may be that it’s a low level bug in the way Microsoft (ntfs?)
> > orders its writes…
>
> Unlikely that ntfs has been tested less than this drbd thing, Xen and your
> driver all together ?

It’s not a bug that would ever show up in normal use, and if both writes
are the same then it would never be a problem, but drbd doesn’t know
this.

> And why can’t a filesystem post two successive writes to the same block?

If the writes are different, and there is no barrier, then how do you
know which one gets written first?

If windows gives me a bunch of concurrent writes, should I be free to
re-order them any way I wish? If not me, then the underlying hardware?

James

The writes could be different, there could be no barrier, and the filesystem
might not care at all what the contents of that block of sectors are. NTFS
imposes sector write order when it needs to, and otherwise does not. Your
(Linux) mirroring software ought to process both writes, and if it thinks it
needs to impose order, it should do so.

It gets even more peculiar, by the way: there are edge cases where the
contents of the write buffer can be modified while the buffer is in flight.
That is not a problem on the Linux side, as the copy across the Xen ring
buffer will give the contents of the write operation a coherency that is not
present in the guest.
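
A small illustration of that last point, assuming (as described above) that the data is copied when the request is handed across: a request that merely aliases the guest's buffer sees later modifications, while a copy taken at submission time does not. The buffer contents are made up for the example.

```python
# The guest's write buffer at the moment the request is submitted.
guest_buffer = bytearray(b"AAAA")

# Zero-copy style: the in-flight request aliases the guest's memory.
aliased = memoryview(guest_buffer)

# Copy style: the request takes a private snapshot at submission time.
snapshot = bytes(guest_buffer)

# The guest modifies the buffer while the write is still in flight.
guest_buffer[0:2] = b"BB"

aliased_sees = bytes(aliased)   # reflects the modification
snapshot_sees = snapshot        # still the data as submitted
```

The snapshot is internally consistent regardless of what the guest does afterwards, which is the coherency the copy provides.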

Mark Roddy

On Wed, May 12, 2010 at 5:47 AM, James Harper wrote:

> > “James Harper” wrote in message
> > news:xxxxx@ntdev…
> > > Further reading suggests that this also happens when the Microsoft iSCSI
> > > Initiator (and several other initiators) with the target on top of DRBD,
> > > so it may be that it’s a low level bug in the way Microsoft (ntfs?)
> > > orders its writes…
> >
> > Unlikely that ntfs has been tested less than this drbd thing, Xen and your
> > driver all together ?
>
> It’s not a bug that would ever show up in normal use, and if both writes
> are the same then it would never be a problem, but drbd doesn’t know
> this.
>
> > And, why a fiesystem cannot post two successive writes to same block?
>
> If the writes are different, and there is no barrier, then how do you
> know which one gets written first?
>
> If windows gives me a bunch of concurrent writes, should I be free to
> re-order them any way I wish? If not me, then the underlying hardware?
>
> James
>
> —
> NTDEV is sponsored by OSR
>
> For our schedule of WDF, WDM, debugging and other seminars visit:
> http://www.osr.com/seminars
>
> To unsubscribe, visit the List Server section of OSR Online at
> http://www.osronline.com/page.cfm?name=ListServer
>

“Further reading suggests that this also happens when the Microsoft iSCSI
Initiator (and several other initiators) with the target on top of DRBD”

Can you explain it? What’s the actual IO path from an application to the disk?

Also, do you pass queuing attributes through your PV driver?

> The writes could be different, there could be no barrier, and the filesystem
> might not care at all what the contents of that block of sectors is. NTFS
> imposes sector write order when it needs to, otherwise not. Your (linux)
> mirroring software ought to process both writes, and if it thinks it needs to
> impose order, it should do so.

As DRBD can already detect overlapped concurrent writes, it would be
nice if it just stalled the queue until the first write completed. It is
a rare event after all. Ordinarily, it shouldn’t matter to a driver
sitting somewhere in the IO path as to what order the writes ultimately
complete in. If the initiator sends writes that have an arbitrary
outcome if reordered then so be it. DRBD though has to ensure that both
sides of the mirror stay in sync, and if one side of the mirror reorders
such writes (eg different model disks) and the other doesn’t then they
are no longer in sync and so next time the automatic sync crawler gets
to that sector and reports an error, sysadmin panic will (should)
follow.
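
A toy illustration of the divergence, using made-up sector numbers and data: two writes to the same sector are applied in opposite orders on the two sides of the mirror. This is only a model of the argument above, not of how DRBD or any disk actually schedules I/O.

```python
def apply(writes, order):
    """Apply writes to an empty disk (sector -> data) in the given order."""
    disk = {}
    for i in order:
        sector, data = writes[i]
        disk[sector] = data
    return disk


# Two different concurrent writes to sector 7, with no barrier between them.
different = [(7, b"old"), (7, b"new")]
local = apply(different, [0, 1])    # local disk keeps submission order
remote = apply(different, [1, 0])   # remote disk happens to reorder them
in_sync = local == remote

# If both writes carry the same data, the order cannot matter.
identical = [(7, b"same"), (7, b"same")]
identical_in_sync = apply(identical, [0, 1]) == apply(identical, [1, 0])
```

With different payloads the two sides end up holding different data for sector 7. With identical payloads they cannot diverge, but telling the two cases apart would mean comparing the data of every overlapping write.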

I may have to detect and serialise such writes myself such that two
writes to the same sector are never outstanding at once. While NTFS does
it very occasionally, some benchmark apps do it a lot, so my driver will
test poorly in such cases. Better than corruption though.

> It gets even more peculiar, by the way, there are edge cases where the
> contents of the write buffer can be modified while the buffer is in flight.
> Not a problem with the linux side as the copy across the xen rb will
> introduce a coherency to the contents of the write operation that are not
> present in the guest.

I’m pretending that the data can never be modified while the write is in
flight :)

James

> “Further reading suggests that this also happens when the Microsoft iSCSI
> Initiator (and several other initiators) with the target on top of DRBD”
>
> Can you explain it? What’s the actual IO path from an application to the
> disk?

ASCII art sucks, but this is a crude approximation of how DRBD works:

    Logical Volume
          |
        DRBD <-- network --> DRBD -> Physical Volume
          |
    Physical Volume

Each DRBD node talks to the other over the network. On each node the
logical volume can be ‘primary’, or ‘secondary’. A ‘primary’ volume is
accessible, a ‘secondary’ is not, but can be changed to ‘primary’. The
config determines if the volume can be ‘primary’ on more than one node.
If the local physical volume fails then things can continue via the
remote physical volume.

There are 3 modes of operation wrt how writes are scheduled. They are
known as protocols A, B, and C I think but I don’t know which is which
off hand.

  1. The write on the logical volume is complete when it has been
    successfully written to both physical volumes
  2. The write on the logical volume is complete when it has been
    successfully written to the local physical volume, and has ‘arrived’ at
    the remote physical volume (but hasn’t necessarily been written to
    disk).
  3. The write on the logical volume is complete when it has been
    successfully written to the local physical volume, and is queued on the
    network to be sent to the remote physical volume.

Which protocol you use depends on your requirements. Obviously the
‘multiple primary’ scenario only makes sense for the first protocol.
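
The three completion rules can be written down as a small predicate. As far as I know, DRBD's documentation calls them protocols C, B and A respectively (in the order listed above), but the `Mode` names and the `write_complete` function here are made up for illustration and are not DRBD's API.

```python
from enum import Enum


class Mode(Enum):
    """The three completion rules listed above (hypothetical names)."""
    BOTH_DISKS = 1        # item 1: on local disk and on remote disk
    REMOTE_RECEIVED = 2   # item 2: on local disk, arrived at the remote node
    NETWORK_QUEUED = 3    # item 3: on local disk, queued on the network


def write_complete(mode, local_on_disk, queued_for_send=False,
                   remote_received=False, remote_on_disk=False):
    """Return True if the logical-volume write may be reported complete."""
    if not local_on_disk:
        return False
    if mode is Mode.BOTH_DISKS:
        return remote_on_disk
    if mode is Mode.REMOTE_RECEIVED:
        return remote_received or remote_on_disk
    # NETWORK_QUEUED: any progress towards the remote side is enough.
    return queued_for_send or remote_received or remote_on_disk


# Local write finished; the data has reached the peer but is not on its disk.
state = dict(local_on_disk=True, queued_for_send=True, remote_received=True)
status = {m.name: write_complete(m, **state) for m in Mode}
```

In that state only the first rule still holds the write back, which is why it is the only one under which both nodes can safely be primary at once.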

In the iSCSI scenario I referred to, the iSCSI target sits ‘above’ the
logical volume in the above diagram.

> Also, do you pass queuing attributes through your PV driver?

Now this may be the pointy part of the problem. Maybe Windows is rightly
telling me how I should queue the writes (eg don’t start this write
until the previous one is finished) and I’m not doing the right thing.
Is there a document that describes how such things should work? I
definitely allow up to 16 outstanding scsi requests (I’d allow more but
Windows never gives me more).

Thanks

James

I have just sent an update to the user that stalls the queue when a
write request comes that overlaps an existing write request. We’ll see
if it resolves the ‘problem’ and if it introduces performance problems.

I guess I’ll make it a registry option to disable the stall
functionality for better performance if a workload is encountered that
makes it happen a lot, assuming there is a noticeable performance impact
at all (and I think there will be).
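
Roughly, the stall logic looks like the sketch below. It is a simplified model rather than the driver code: `StallingQueue` and the `stall_on_overlap` flag (standing in for the registry option) are illustrative names, and writes are plain `(start_sector, sector_count)` tuples.

```python
from collections import deque


class StallingQueue:
    """Issues writes in order, stalling while the head overlaps one in flight."""

    def __init__(self, stall_on_overlap=True):
        self.stall_on_overlap = stall_on_overlap  # stand-in for the registry option
        self.in_flight = []     # writes issued to the backend, not yet complete
        self.pending = deque()  # writes held back, in submission order

    @staticmethod
    def _overlaps(a, b):
        return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]

    def _blocked(self, w):
        return self.stall_on_overlap and any(
            self._overlaps(w, o) for o in self.in_flight)

    def _drain(self):
        # Stop at the first blocked request so that ordering is preserved:
        # everything behind it waits too, i.e. the whole queue stalls.
        while self.pending and not self._blocked(self.pending[0]):
            self.in_flight.append(self.pending.popleft())

    def submit(self, w):
        self.pending.append(w)
        self._drain()

    def complete(self, w):
        self.in_flight.remove(w)
        self._drain()


q = StallingQueue()
q.submit((100, 8))
q.submit((104, 8))   # overlaps the first write, so the queue stalls here
q.submit((200, 8))   # unrelated, but stuck behind the stalled write
stalled = (list(q.in_flight), list(q.pending))

q.complete((100, 8))  # the first write finishes and the queue drains
drained = (list(q.in_flight), list(q.pending))

fast = StallingQueue(stall_on_overlap=False)
for w in [(100, 8), (104, 8), (200, 8)]:
    fast.submit(w)
```

The performance cost is visible in the model: the unrelated write to sector 200 also waits while the queue is stalled, which is why a workload that overlaps often would suffer with the option enabled.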

James