MS Cluster service

OSR_Community_User-277 · January 6, 2003, 6:38am

Hi!
I am writing a virtual disk driver. The virtual disks can be used by MS Cluster. Sometimes, under stress tests, the following message is printed in the debugger and the group moves to the second node:
[1168] Microsoft Clustering Service suffered an unexpected fatal error
[1168] at line 1710 of source module D:\nt\private\cluster\service\lm\lmutils.c. The error code was 21.

Where can I find any information about this error?

Regards,
Dany

Niraj_Jaiswal · January 6, 2003, 11:30pm

Isn’t error 21 “The device is not ready”? If the cluster disk driver (clusdisk.sys) encounters this error, it will definitely move the group to the second node. Look in “\WINNT\cluster\cluster.log” for more details. You should also revisit how you handle the SCSI_RESERVE and SCSI_RELEASE operations.

Niraj
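
For reference, 21 is the Win32 system error code ERROR_NOT_READY; on a Windows machine, `net helpmsg 21` prints its message text. Below is a minimal sketch of the same lookup as a plain table. The `WIN32_ERRORS` table and the `describe` helper are hypothetical names for illustration only, and the table holds just a handful of codes from winerror.h.

```python
# A few Win32 system error codes (values as defined in winerror.h).
# This table is deliberately tiny; it is an illustration, not a full list.
WIN32_ERRORS = {
    2: ("ERROR_FILE_NOT_FOUND", "The system cannot find the file specified."),
    5: ("ERROR_ACCESS_DENIED", "Access is denied."),
    21: ("ERROR_NOT_READY", "The device is not ready."),
    170: ("ERROR_BUSY", "The requested resource is in use."),
}


def describe(code: int) -> str:
    """Return a one-line description for a Win32 error code (hypothetical helper)."""
    name, text = WIN32_ERRORS.get(code, ("UNKNOWN", "not in this table"))
    return f"{code} = {name}: {text}"


print(describe(21))
```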

OSR_Community_User · January 7, 2003, 11:30am

Don’t forget SCSI_RESET too. It’s part of the mechanism MSCS uses to
decide whether a failover should occur. MSCS resets the disk device
containing the quorum volume (it may do this on other quorum disks too,
I forget) in a continuous loop (I forget the interval), and the active
node simply re-reserves the quorum volume’s disk using SCSI_RESERVE.
The non-active node (sorry, it’s been a while since I dealt with this,
so I forget the proper terminology) also tries to reserve the disk with
SCSI_RESERVE, but its loop runs at a longer interval. This means that as
long as the active node is working, it re-reserves the disk after each
reset before the other node can get it. Once the active node dies, the
other node successfully reserves the disk and becomes the active node.

With manual failover, all that happens is that the active node releases
(SCSI_RELEASE) the disk resource (or maybe it simply stops re-acquiring
it and lets the SCSI_RESET do the release for it), which allows the
non-active node to reserve the disk and become the active node.

I may have some of the details wrong here, but in general this is how it
works, at least on MSCS for W2K Advanced Server and Datacenter Server.
This mechanism is obviously designed for a shared SCSI array containing
the quorum volume. If you’re using a virtual volume for the quorum
volume, you need to handle the SCSI reserve/release/reset operations;
otherwise MSCS will have problems. I have a hunch there’s a little more
to it than that, but I don’t remember and will have to ask a friend.
Will get back to you.

Nate Bushman
PowerQuest Corp.
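
The arbitration described above can be sketched as a small user-mode model. This is only a simulation of the idea, not driver code and not the actual MSCS implementation: `VirtualDisk`, `arbitrate`, and the tick intervals are all hypothetical names and values chosen for illustration. The reservation rules follow SCSI-2 RESERVE/RELEASE semantics as I understand them (a reset clears the reservation, and a release by a non-holder succeeds without changing anything).

```python
from dataclasses import dataclass
from typing import Optional

GOOD = "GOOD"
RESERVATION_CONFLICT = "RESERVATION CONFLICT"


@dataclass
class VirtualDisk:
    """Hypothetical model of the reservation state a virtual disk must track."""
    holder: Optional[str] = None  # node currently holding the reservation

    def reserve(self, node: str) -> str:
        # Succeeds if the disk is free or already reserved by the same node.
        if self.holder in (None, node):
            self.holder = node
            return GOOD
        return RESERVATION_CONFLICT

    def release(self, node: str) -> str:
        # A release from a node that is not the holder changes nothing.
        if self.holder == node:
            self.holder = None
        return GOOD

    def bus_reset(self) -> None:
        # A reset drops the reservation regardless of who holds it.
        self.holder = None

    def write(self, node: str) -> str:
        # I/O from any node other than the holder is rejected.
        if self.holder in (None, node):
            return GOOD
        return RESERVATION_CONFLICT


def arbitrate(disk: VirtualDisk, owner: str, challenger: str,
              owner_alive: bool, defend_every: int = 3,
              challenge_wait: int = 10) -> bool:
    """Challenger resets the disk, waits, then tries to reserve it.

    The intervals are made-up tick counts; the only property that matters
    is that the owner's interval is shorter than the challenger's wait.
    Returns True if the challenger ends up owning the disk.
    """
    disk.bus_reset()
    for tick in range(1, challenge_wait + 1):
        if owner_alive and tick % defend_every == 0:
            disk.reserve(owner)  # a live owner re-reserves after the reset
    return disk.reserve(challenger) == GOOD


disk = VirtualDisk()
disk.reserve("node-A")
print(arbitrate(disk, "node-A", "node-B", owner_alive=True))   # owner defends
print(arbitrate(disk, "node-A", "node-B", owner_alive=False))  # owner is dead
```

In this model a live owner always wins because it re-reserves within the challenger's wait window. A virtual disk that loses the reservation state, or fails the reserve with some other error under load, would break that guarantee.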

OSR_Community_User · January 7, 2003, 11:39am

Oops. When I said that MSCS *may* reset other quorum disks as well as
the disk containing the quorum volume, I meant that it *may* reset other
CLUSTER-managed disks in addition to the disk containing the quorum
volume. Or it may not. I just seem to remember that it was resetting
more than it absolutely needed to for the sake of arbitration (it’s been
a few years).

OSR_Community_User-277 · January 7, 2003, 11:49am

Nate,

Thank you for the answer.
I know exactly how the cluster works, so please don’t waste your time.
I just had a specific problem. It seems to me that the cluster service failed because the cluster event log reset failed.

Dany
