This is something that only happens when a machine has been forcefully failed
over against its will. You would probably need to see the code to follow
exactly what I'm talking about, but as I described: if the program doesn't
get a reservation, it exits. The way the code is intended to be used is
something like this:
Given machine A as cluster master and machine B as a cluster slave. Machine A
starts the reservation program with something like this as the command line:
    reserve --reserve --hold /dev/sdc
This will result in the program grabbing a reservation on drive sdc (or
exiting with a non-0 status on failure) and then sitting in a loop where it
re-issues the reservation every 2 seconds.
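The grab-then-refresh behavior described above can be sketched roughly as
follows. This is only an illustration of the control flow, not the actual
program: `issue_reserve` is a hypothetical stand-in for issuing the real SCSI
RESERVE command, and the `sleep`/`max_iterations` parameters exist only to
make the sketch testable.

```python
import time

def hold_reservation(dev, issue_reserve, interval=2.0,
                     sleep=time.sleep, max_iterations=None):
    """Grab a reservation on `dev` and keep refreshing it.

    `issue_reserve(dev)` stands in for the real SCSI RESERVE command and
    should return True on success.  Mirrors `reserve --reserve --hold`:
    exit immediately if the initial reservation cannot be obtained,
    otherwise sit in a loop re-issuing it every `interval` seconds.
    """
    if not issue_reserve(dev):
        return False              # no reservation: exit with failure status
    i = 0
    while max_iterations is None or i < max_iterations:
        sleep(interval)           # re-issue the reservation every 2 seconds
        if not issue_reserve(dev):
            return False          # reservation conflict: we were preempted
        i += 1
    return True
```

In the real program the loop never terminates on its own; returning on a
failed re-issue corresponds to the reservation-conflict case discussed below.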
Under normal operation, the reserve program is not started at all on machine
B. However, machine B does use the normal heartbeat method (be that the
heartbeat package or something similar, but not reservations) to check that
machine A is still alive. Given a failure in the communications between
machine B and machine A, which would typically mean it is time to fail over
the cluster, machine B can test the status of machine A by throwing a reset to
the drive to break any existing reservations, waiting 4 seconds, then trying
to run its own reservation. This can be accomplished with the command:
    reserve --reset --reserve --hold /dev/sdc
If the program fails to get the reservation, then that means machine A was
able to re-issue its reservation. Obviously then, machine A isn't dead. Machine B
can then decide that the heartbeat link is dead but machine A is still fine
and not try any further failover actions, or it could decide that machine A
has a working reserve program but key services or network connectivity may be
dead, in which case a forced failover would be in order. To accomplish that,
machine B can issue this command:
    reserve --preempt --hold /dev/sdc
This will break machine A's reservation and take the drive over from machine
A. It's at this point, and this point only, that machine A will see a
reservation conflict. It has been forcefully failed over, so
resetting/rebooting the machine is a perfectly fine response. The reason it
is recommended is that, at this point in time, machine B may already be
engaged in recovering the filesystem on the shared drive, while machine A may
still have buffers it is trying to flush to the same drive. To make sure
machine A doesn't let some dirty buffer get through a break in machine B's
reservation (caused by something as inane as another machine on the bus
starting up and throwing an initial reset), we should reset machine A *as
soon as we know it has been forcefully failed over and is no longer allowed
to write to the drive*. Arguments with this can be directed to Stephen
Tweedie, who is largely responsible for beating me into doing it this way ;-)
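Machine B's probe-and-decide sequence above can be sketched like this. Again,
the callables (`issue_reset`, `try_reserve`, `sleep`) are hypothetical
stand-ins for the real reset/reserve commands, and the 4-second wait matches
the window described above (twice the 2-second re-issue interval):

```python
def probe_master(dev, issue_reset, try_reserve, sleep):
    """Machine B's probe after losing heartbeat contact with machine A.

    Throw a reset to the drive to break any existing reservation, wait 4
    seconds, then try to take our own reservation.  If that reservation
    fails, machine A's reserve loop re-issued its reservation first, so
    machine A is still alive and only the heartbeat link is suspect.
    """
    issue_reset(dev)          # break any existing reservation on the drive
    sleep(4)                  # give A's 2-second hold loop time to re-reserve
    if try_reserve(dev):
        return 'master-dead'  # A never re-reserved: safe to fail over
    return 'master-alive'     # A beat us to it: don't fail over blindly
```

In the 'master-alive' case, machine B still has the choice described above:
back off entirely, or force the issue with a preempt if it decides A's key
services are gone.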
> Another problem is that reservations do *not* guarantee ownership over
> the long haul. There are too many mechanisms that break reservations to
> build a complete strategy on them.
See above about the reason for needing to reset the machine ;-) The overall
package is cooperative in nature, so we don't rely on reservations except for
the actual failover. However, due to this very issue, we need to kill the
machine that was failed over as soon as possible after the failover to avoid
any possible races with open windows in the new drive owner's reservation.
[ snipped comments about fine tooth analysis of other clustering software ]
The hardware shortcomings are known to us. The actual preferred method of
doing this would be to have each machine in the cluster have access to a
serial operated power switch so that when machine B failed over the cluster,
it could actually power down machine A to avoid those race conditions, only
touching the drive itself after machine A was already powered down. Race
windows gone.
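The ordering that closes the race window can be made explicit in a sketch.
The three callables are hypothetical stand-ins for the power switch, the
`reserve --preempt --hold` step, and the filesystem recovery; the whole point
is simply that fencing happens first:

```python
def forced_failover(dev, power_off_peer, preempt_reservation, recover_filesystem):
    """Preferred forced-failover ordering when a power switch is available.

    Power down machine A *before* touching the shared drive, so there is
    no window in which A can flush dirty buffers through a gap in B's
    reservation.  Only after A is fenced do we preempt and recover.
    """
    power_off_peer()            # fence: machine A can no longer write
    preempt_reservation(dev)    # e.g. reserve --preempt --hold /dev/sdc
    recover_filesystem(dev)     # only now touch the shared filesystem
```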
--Doug Ledford <dledford@redhat.com>  http://people.redhat.com/dledford
Please check my web site for aic7xxx updates/answers before e-mailing
me about problems