The DOWN Shutdown Agent is a mechanism that addresses the scenarios in which no takeover takes place with the standard IPMI Shutdown Agent alone. The feature is based on the Survival Authority, which is mandatory for geographically separated clusters. The Survival Authority is optionally available for co-located clusters, where it allows a node to take over in the case of an inter-node communications failure. It is designed to protect against some double-failure scenarios (but naturally not against a failure of both cluster nodes). If the Survival Authority itself fails, the cluster behaves as it would with the IPMI Shutdown Agent alone, including the scenarios without takeover.
A drawback of adding another shutdown agent is that the total time the secondary node (node 1) waits before activating the shutdown facility is the sum of the waiting times of all shutdown agents that could be running on the primary node. As a result, the time needed for node 1 to take over from node 2 increases.
The Survival Authority functionality is implemented on a third (off-board) SLES 10 SP2 machine for redundant co-located and geo-separated integrated duplex configurations (redundant Telephony Control Servers with integrated Lotus Sametime Applications software). In the case of redundant co-located and geo-separated standard duplex configurations (redundant Telephony Control Servers without integrated Lotus Sametime Applications software), the Survival Authority is implemented as a software module integrated with the Lotus Sametime Unified Telephony Assistant software of the off-board Lotus Sametime Applications server.
The DOWN shutdown agent, when activated, sends an SNMP trap through the admin network to a device called the Survival Authority. The SNMP trap contains the information that the partner has left the cluster and that the sending node is awaiting a decision by the Survival Authority on what to do next. The Survival Authority keeps a flag for the "Survival Mode" of each cluster for which it is responsible (more than one cluster is allowed). This flag is initialized to "Off". On receipt of the SNMP trap, the Survival Authority sends back an SNMP set command with one of two actions: Take Over or Die, hence the name: DOWN Shutdown Agent. The SNMP set command is sent to the IP address of the admin network (the same interface from which the trap was sent).
If the flag was set to "Off", the Survival Authority sends the Take Over command, sets the flag to "On", and records the name of the node that was allowed to survive.
If the flag was set to "On" and the name of the node requesting advice is not the name that was recorded, then the Survival Authority sends a Die command.
If the flag was set to "On" and the name of the node requesting advice is the name that was recorded, then the Survival Authority sends the Take Over command again.
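The three rules above can be sketched as a small state machine. This is a minimal illustration only, not the product's implementation: the class `SurvivalAuthority`, its method names, and the cluster and node identifiers are hypothetical, and the SNMP transport (trap in, set command out) is left out entirely. The reset method models the clearing of the Survival Mode flag once inter-node communications are restored.

```python
from dataclasses import dataclass
from typing import Dict, Optional

TAKE_OVER = "Take Over"
DIE = "Die"


@dataclass
class ClusterState:
    """Per-cluster bookkeeping kept by the (hypothetical) authority."""
    survival_mode: bool = False      # the flag is initialized to "Off"
    survivor: Optional[str] = None   # node that was allowed to survive


class SurvivalAuthority:
    """Illustrative decision logic only; no SNMP handling is modeled."""

    def __init__(self) -> None:
        # One state record per cluster (more than one cluster is allowed).
        self._clusters: Dict[str, ClusterState] = {}

    def on_partner_left(self, cluster: str, node: str) -> str:
        """Decide the answer for a node reporting that its partner left."""
        state = self._clusters.setdefault(cluster, ClusterState())
        if not state.survival_mode:
            # Flag "Off": first requester wins and is recorded.
            state.survival_mode = True
            state.survivor = node
            return TAKE_OVER
        if state.survivor == node:
            # Flag "On", same node asks again: repeat Take Over.
            return TAKE_OVER
        # Flag "On", a different node asks: it must die.
        return DIE

    def on_communication_restored(self, cluster: str) -> None:
        """Reset the Survival Mode flag for the given cluster."""
        self._clusters[cluster] = ClusterState()
```

In this sketch, whichever node's request reaches the authority first is allowed to survive, and a repeated request from that same node receives the same answer, so a retransmitted trap does not change the outcome.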
The Lotus Sametime Unified Telephony (OSC) server runs an SNMP subagent that receives the set request and writes the result (Take Over or Die) to a file that is read by the DOWN shutdown agent. Upon reading the result, the shutdown agent causes the node either to take over or to shut down.
When inter-node communications are restored, the Survival Mode flag is reset by an SNMP trap sent from the OSC server to the Survival Authority.
A new Stand Alone Service (SAS) feature is offered in the legacy system software release. Instead of shutting down one node, the node that takes over goes into a "Stand Alone Primary" state, while the node that is supposed to shut down and reboot goes into a "Stand Alone Secondary" state. This allows phones local to each node to continue making calls to each other, or even calls to the PSTN via the local gateway available on that network.
When the high-priority node (node 2) is powered off, the time node 1 waits before activating the shutdown agents (after the cluster interconnect failure is detected) is the sum of the waiting times of all shutdown agents that would be run on the high-priority node (node 2).
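As a rough illustration of this additive behavior, the following sketch totals per-agent waiting times. It assumes that each shutdown agent contributes a fixed waiting time; the function name and the values of 20 and 30 seconds are hypothetical placeholders, not product defaults.

```python
from typing import Dict


def total_shutdown_wait(agent_wait_seconds: Dict[str, int]) -> int:
    """Return the combined wait as the sum of all per-agent waiting times."""
    return sum(agent_wait_seconds.values())


# Hypothetical values for illustration only.
ipmi_only = {"IPMI": 20}
ipmi_and_down = {"IPMI": 20, "DOWN": 30}

# Adding the DOWN agent lengthens the wait by that agent's own waiting time.
extra_wait = total_shutdown_wait(ipmi_and_down) - total_shutdown_wait(ipmi_only)
```

With these placeholder numbers, `extra_wait` equals the DOWN agent's waiting time, which is the increase in takeover time that the additional agent introduces.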