* Ability to use external reasons for deciding which partition is the quorate partition in a partitioned cluster. For example, a user may have a service running on one node, and that node must always be the master in the event of a network partition. Or, a node might lose all network connectivity except the cluster communication path - in which case, a user may wish that node to be evicted from the cluster.
* Integration with CMAN. We must not require CMAN to run with us (or without us). Linux-Cluster does not require a quorum disk normally - introducing new requirements into the basic way Linux-Cluster operates is not allowed.
* Data integrity. In order to recover from a majority failure, fencing is required. The fencing subsystem is already provided by Linux-Cluster.
* Non-reliance on hardware or protocol specific methods (i.e. SCSI reservations). This ensures the quorum disk algorithm can be used on the widest range of hardware configurations possible.
* Little or no memory allocation after initialization. In critical paths during failover, we do not want to risk being killed under memory pressure because we triggered a page fault and the Linux OOM killer responded.
* Cluster node IDs must be statically configured in cluster.conf and must be numbered from 1..16 (there can be gaps, of course).
* Cluster node votes should be more or less equal.
* CMAN must be running before the qdisk program can operate in full capacity. If CMAN is not running, qdisk will wait for it.
* CMAN's eviction timeout should be at least 2x the quorum daemon's to give the quorum daemon adequate time to converge on a master during a failure + load spike situation.
* For 'all-but-one' failure operation, the total number of votes assigned to the quorum device should be equal to or greater than the total number of node-votes in the cluster. While it is possible to assign only one (or a few) votes to the quorum device, the effects of doing so have not been explored.
* For 'tiebreaker' operation in a two-node cluster, unset CMAN's two_node flag (or set it to 0), set CMAN's expected votes to '3', set each node's vote to '1', and set qdisk's vote count to '1' as well. This allows the cluster to operate either if both nodes are online, or if a single node is online and its heuristics pass.
* Currently, the quorum disk daemon is difficult to use with CLVM if the quorum disk resides on a CLVM logical volume. CLVM requires a quorate cluster to correctly operate, which introduces a chicken-and-egg problem for starting the cluster: CLVM needs quorum, but the quorum daemon needs CLVM (if and only if the quorum device lies on CLVM-managed storage). One way to work around this is to *not* set the cluster's expected votes to include the quorum daemon's votes. Bring all nodes online, and start the quorum daemon *after* the whole cluster is running. This will allow the expected votes to increase naturally.
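The vote arithmetic behind the 'all-but-one' and 'tiebreaker' items above can be checked with a short sketch. It assumes a simple-majority quorum rule, floor(expected_votes / 2) + 1; that rule and the function names are illustrative assumptions, not part of CMAN or qdiskd.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for quorum, assuming a simple-majority rule."""
    return expected_votes // 2 + 1


def is_quorate(node_votes_online: int, qdisk_votes: int,
               qdisk_available: bool, expected_votes: int) -> bool:
    """True if the online node votes, plus the quorum device's votes
    when it is available, reach the quorum threshold."""
    total = node_votes_online + (qdisk_votes if qdisk_available else 0)
    return total >= quorum_threshold(expected_votes)


# Two-node tiebreaker: 1 vote per node, 1 vote for qdisk, expected votes 3.
both_nodes = is_quorate(2, 1, False, 3)      # both nodes, qdisk unavailable
one_plus_qdisk = is_quorate(1, 1, True, 3)   # one node plus the quorum device
one_alone = is_quorate(1, 1, False, 3)       # one node, qdisk unavailable

# All-but-one: four 1-vote nodes, qdisk votes equal to the node total (4),
# expected votes 8. The last surviving node plus qdisk stays quorate.
last_node = is_quorate(1, 4, True, 8)
```

Under these assumptions a lone node without the quorum device is not quorate in the two-node case, which is the behaviour the tiebreaker setup is meant to produce.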
The status block contains additional information, such as a bitmask of the nodes that node believes are online. Some of this information is used by the master - while some is just for performance recording, and may be used at a later time. The most important pieces of information a node writes to its status block are:
- Internal state (available / not available)
- Known max score (may be used in the future to detect invalid configurations)
- Vote/bid messages
- Other nodes it thinks are online
The heuristics themselves can be any command executable by 'sh -c'. For example, in early testing the following was used:
<heuristic program="[ -f /quorum ]" score="10" interval="2"/>
This is a literal sh-ism which tests for the existence of a file called "/quorum". Without that file, the node would claim it was unavailable. This is an awful example, and should never, ever be used in production, but is provided as an example of what one could do.
Typically, the heuristics should be snippets of shell code or commands which help determine a node's usefulness to the cluster or clients. Ideally, you want to add traces for all of your network paths (e.g. check links, or ping routers), and methods to detect availability of shared storage.
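How heuristic results might combine into a score can be sketched as follows. This is not qdiskd's implementation: the helper names are hypothetical, the commands 'true' and 'false' stand in for real checks such as pinging a router, and the per-heuristic interval and tko settings are not modelled. It relies only on the documented rules that a heuristic is run with 'sh -c', that an exit status of zero means success, and that the default minimum score is floor((n+1)/2).

```python
import subprocess


def run_heuristic(program: str) -> bool:
    """Run a heuristic through 'sh -c'; exit status 0 means success."""
    result = subprocess.run(["sh", "-c", program],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


def current_score(heuristics) -> int:
    """Sum the scores of the heuristics that currently succeed."""
    return sum(score for program, score in heuristics
               if run_heuristic(program))


# Hypothetical (program, score) pairs; placeholders for real checks.
heuristics = [("true", 2), ("false", 1), ("[ -d / ]", 1)]

score = current_score(heuristics)
max_score = sum(s for _, s in heuristics)
min_score = (max_score + 1) // 2        # default: floor((n+1)/2)
node_is_available = score >= min_score
```

With these placeholder commands the failing heuristic costs the node one point, but the remaining score still meets the default minimum, so the node would consider itself available.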
If another node comes online with a lower node ID while a node is still bidding for master status, it will rescind its bid and vote for the lower node ID. If a master dies or a bidding node dies, the voting algorithm is started over. The voting algorithm typically takes two passes to complete.
Master deaths take marginally longer to recover from than non-master deaths, because a new master must be elected before the old master can be evicted & fenced.
(a) CMAN believes the node to be online, and
(b) that node has made enough consecutive, timely writes to the quorum disk, and
(c) the node has a high enough score to consider itself online.
<quorumd interval="1" This is the frequency of read/write cycles, in seconds.
tko="10" This is the number of cycles a node must miss in order to be declared dead.
tko_up="X" This is the number of cycles a node must be seen in order to be declared online. Default is floor(tko/3).
upgrade_wait="2" This is the number of cycles a node must wait before initiating a bid for master status after heuristic scoring becomes sufficient. The default is 2. This cannot be set to 0, and should not exceed tko.
master_wait="X" This is the number of cycles a node must wait for votes before declaring itself master after making a bid. Default is floor(tko/2). This cannot be less than 2, must be greater than tko_up, and should not exceed tko.
votes="3" This is the number of votes the quorum daemon advertises to CMAN when it has a high enough score.
log_level="4" This controls the verbosity of the quorum daemon in the system logs. 0 = emergencies; 7 = debug.
log_facility="daemon" This controls the syslog facility used by the quorum daemon when logging. For a complete list of available facilities, see syslog.conf(5). The default value for this is 'daemon'.
status_file="/foo" Write internal states out to this file periodically ("-" = use stdout). This is primarily used for debugging. The default value for this attribute is undefined.
min_score="3" Absolute minimum score required for a node to consider itself "alive". If omitted, or set to 0, the default function "floor((n+1)/2)" is used, where n is the sum of the score attributes of all defined heuristics. This must never exceed the sum of the heuristic scores, or else the quorum disk will never be available.
reboot="1" If set to 0 (off), qdiskd will *not* reboot after a negative transition as a result of a change in score (see section 2.2). The default for this value is 1 (on).
allow_kill="1" If set to 0 (off), qdiskd will *not* request that nodes it thinks are dead (as a result of not writing to the quorum disk) be killed. The default for this value is 1 (on).
paranoid="0" If set to 1 (on), qdiskd will watch internal timers and reboot the node if it takes more than (interval * tko) seconds to complete a quorum disk pass. The default for this value is 0 (off).
scheduler="rr" Valid values are 'rr', 'fifo', and 'other'. Selects the scheduling queue in the Linux kernel for operation of the main & score threads (does not affect the heuristics; they are always run in the 'other' queue). Default is 'rr'. See sched_setscheduler(2) for more details.
priority="1" Valid values for 'rr' and 'fifo' are 1..100 inclusive. Valid values for 'other' are -20..20 inclusive. Sets the priority of the main & score threads. The default value is 1 (in the RR and FIFO queues, higher numbers denote higher priority; in OTHER, lower values denote higher priority).
stop_cman="0" Ordinarily, cluster membership is left up to CMAN, not qdisk. If this parameter is set to 1 (on), qdiskd will tell CMAN to leave the cluster if it is unable to initialize the quorum disk during startup. This can be used to prevent cluster participation by a node which has been disconnected from the SAN. The default for this value is 0 (off).
use_uptime="1" If this parameter is set to 1 (on), qdiskd will use values from /proc/uptime for internal timings. This is a bit less precise than gettimeofday(2), but the benefit is that changing the system clock will not affect qdiskd's behavior - even if paranoid is enabled. If set to 0, qdiskd will use gettimeofday(2), which is more precise. The default for this value is 1 (on / use uptime).
device="/dev/sda1" This is the device the quorum daemon will use. This device must be the same on all nodes.
label="mylabel"/> This overrides the device field if present. If specified, the quorum daemon will read /proc/partitions and check for qdisk signatures on every block device found, comparing the label against the specified label. This is useful in configurations where the block device name differs on a per-node basis. ...>
<heuristic program="/test.sh" This is the program used to determine if this heuristic is alive. This can be anything which may be executed by /bin/sh -c. A return value of zero indicates success; anything else indicates failure. This is required.
score="1" This is the weight of this heuristic. Be careful when determining scores for heuristics. The default score for each heuristic is 1.
interval="2" This is the frequency (in seconds) at which we poll the heuristic. The default interval for every heuristic is 2 seconds.
tko="1"/> After this many failed attempts to run the heuristic, it is considered DOWN, and its score is removed. The default tko for each heuristic is 1, which may be inadequate for things such as 'ping'.
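The derived defaults and the constraints stated for the attributes above can be collected into a small checker. This is a sketch under stated assumptions, not a tool shipped with qdiskd: the function names are hypothetical, and the CMAN comparison assumes the quorum daemon's timeout is interval * tko seconds.

```python
def quorumd_defaults(tko, heuristic_scores):
    """Defaults derived from tko and the heuristic scores."""
    return {
        "tko_up": tko // 3,                             # floor(tko/3)
        "master_wait": tko // 2,                        # floor(tko/2)
        "min_score": (sum(heuristic_scores) + 1) // 2,  # floor((n+1)/2)
    }


def check_quorumd(interval, tko, tko_up, upgrade_wait, master_wait,
                  min_score, heuristic_scores, cman_timeout_s=None):
    """Return a list of violated constraints (empty if none)."""
    problems = []
    if upgrade_wait < 1:
        problems.append("upgrade_wait cannot be 0")
    if upgrade_wait > tko:
        problems.append("upgrade_wait should not exceed tko")
    if master_wait < 2:
        problems.append("master_wait cannot be less than 2")
    if master_wait <= tko_up:
        problems.append("master_wait must be greater than tko_up")
    if master_wait > tko:
        problems.append("master_wait should not exceed tko")
    if min_score > sum(heuristic_scores):
        problems.append("min_score exceeds the sum of heuristic scores")
    # Assumption: the quorum daemon's timeout is interval * tko seconds.
    if cman_timeout_s is not None and cman_timeout_s < 2 * interval * tko:
        problems.append("CMAN timeout is less than 2x the qdisk timeout")
    return problems


defaults = quorumd_defaults(tko=10, heuristic_scores=[1, 1, 1])
ok = check_quorumd(interval=1, tko=10, upgrade_wait=2,
                   heuristic_scores=[1, 1, 1], cman_timeout_s=20, **defaults)
bad = check_quorumd(interval=1, tko=10, tko_up=3, upgrade_wait=0,
                    master_wait=3, min_score=4, heuristic_scores=[1, 1, 1])
```

For tko="10" and three heuristics scoring 1 each, the derived defaults pass every check, while the second call trips the upgrade_wait, master_wait and min_score rules.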
* Heuristic scripts returning anything except 0 as their return code are considered failed.
* The worst-case for improperly configured quorum heuristics is a race to fence where two partitions simultaneously try to kill each other.