Reliability is the primary goal of Lotus Sametime Unified Telephony, and clustering is necessary to provide this reliability. A reliable component structure provides an effective base for cluster administration.
The Lotus Sametime Unified Telephony hardware and software components work together to attain the following reliability goals:
To provide faster data replication and better performance for peak traffic in normal operation by using a two-node active-active configuration, with each node acting as hot/standby for its partner. This configuration also protects against silent faults through continuous hardware/software monitoring and testing.
To minimize node switchover, which reduces transient call loss and network connectivity outages. This is accomplished with redundant local disks, redundant network connections, and redundant power supplies for each node. Each node also contains duplicated Ethernet cards, which ensure that the physical path for external communication with a node is backed up by a second path: a second Ethernet port on a different Ethernet card, and a second LAN switch.
To provide static load sharing for fast and reliable busy/idle handling, because only one node writes the busy/idle and call status for the subscriber or feature server.
To provide effective component management through process configuration control using process and alias groups.
The Lotus Sametime Unified Telephony redundant configuration can be deployed with geographically co-located nodes or with geographically separated nodes. The separated configuration is further distinguished by whether the nodes are in the same subnet, in different subnets, or connected only via layer 3 (IP).
Cluster Redundancy with Geographic Node Separation
Geographic node separation reduces the risk of total loss of voice services when one of the nodes is out of service due to a fire, flood, hurricane, building damage, and so on. Lotus Sametime Unified Telephony allows geographic separation of the cluster nodes, either in the same subnet or in different subnets.
Cluster Redundancy with Co-Located Nodes
One option for a Lotus Sametime Unified Telephony redundant system is to have the two computing nodes geographically co-located.
Geographic Separation - Nodes in Same Subnet
For cluster redundancy with nodes in the same subnet, the nodes may physically reside in the same location or in different locations. The latter case requires VLAN bridging to logically interconnect them into the same subnet.
Geographic Separation with Layer-3 Cluster Interconnect
For cluster redundancy with a layer-3 cluster interconnect, only a layer-3 IP connection is required between the two nodes.
A Survival Authority (SA) decides whether a node shuts down or continues processing in the failover case in which the two nodes cannot communicate over the cluster cross connects and cannot shut down the partner node via the maintenance controller. Its main function is to determine which node should continue running and which node should be shut down during a massive power outage or building failure. For a geographically separated cluster, the Survival Authority is mandatory. For a co-located cluster, an SA is optional if the cross connect between the nodes is a physical LAN cable (for example, both nodes in the same rack) and mandatory if the cross-connect link between the nodes is set up over switches. In this case, the SA must be on a separate server (separate subnet).
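The rules above for when a Survival Authority is required can be summarized in a few lines of code. The following Python sketch is illustrative only: the names `CrossConnect` and `survival_authority_required` are hypothetical and are not part of the product.

```python
from enum import Enum


class CrossConnect(Enum):
    """How the two cluster nodes are cross-connected (hypothetical names)."""
    DIRECT_CABLE = "physical LAN cable"
    VIA_SWITCHES = "link set up over switches"


def survival_authority_required(geo_separated: bool,
                                cross_connect: CrossConnect) -> bool:
    """Return True if the SA is mandatory, False if it is optional."""
    if geo_separated:
        # Geographically separated clusters always need an SA.
        return True
    # Co-located clusters need an SA only when the cross connect
    # runs over switches rather than a direct cable.
    return cross_connect is CrossConnect.VIA_SWITCHES
```

For example, a co-located pair in one rack joined by a direct cable returns `False` (SA optional), while any geographically separated pair returns `True`.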
The standalone service option is available beginning with V4. If it is enabled, the node that does not get the permission to take over from the Survival Authority stays active. Standalone service is available for duplex server configurations and is intended for geographically separated installations. These installations can be layer-3 network separated or geo-separated with layer-2 connectivity (both nodes of the layer-2 configuration are on the same IP networks but at different locations). For a virtual Telephony Control Server cluster, standalone service is only allowed for a layer-3 network separated installation.
Standalone Service Subscriber Feature Impacts
Feature activation and deactivation, as well as subscriber-controlled input, are blocked in standalone secondary mode. This is necessary because the database of the standalone secondary node is overwritten with the database of the primary node when the cluster is re-established.
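The reason for the block can be illustrated with a small model. The following Python sketch is a simplification under stated assumptions: `NodeMode`, `set_feature`, `reestablish_cluster`, and the feature name `call_forwarding` are hypothetical, and the dictionaries only stand in for the real subscriber database.

```python
import copy
from enum import Enum


class NodeMode(Enum):
    CLUSTER = "cluster"
    STANDALONE_SECONDARY = "standalone secondary"


class FeatureChangeBlocked(Exception):
    """Raised when a change is refused in standalone secondary mode."""


def set_feature(mode, db, subscriber, feature, enabled):
    """Apply a feature change unless the node is a standalone secondary."""
    if mode is NodeMode.STANDALONE_SECONDARY:
        # The change would be lost at resynchronization, so refuse it.
        raise FeatureChangeBlocked(f"{feature} cannot be changed now")
    db.setdefault(subscriber, {})[feature] = enabled


def reestablish_cluster(primary_db):
    """Return the secondary's new database: a copy of the primary's."""
    return copy.deepcopy(primary_db)
```

Because `reestablish_cluster` discards whatever the secondary held, accepting a change on the secondary during standalone operation would silently lose it; refusing the change up front avoids that.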
Standalone Service Other Network Elements Impacts
During standalone mode, other network elements are unavailable to the standalone secondary node.
Standalone Service TLS Connection Impacts
Before Lotus Sametime Unified Telephony starts a new SIP call, it reads information about the existing TLS connection between Lotus Sametime Unified Telephony and applicable SIP phones from the database; it then uses this information for the duration of the call.
FSC PrimeCluster Protection Mechanisms
The Fujitsu-Lotus Sametime Unified Telephony PrimeCluster software allows the Lotus Sametime Unified Telephony application software on each node to know the state of the other node. This information is important because in clusters, resources are usually controlled by one of the cluster nodes. All other nodes keep a backup of the resources in case they need to take over control if the controlling node fails. The cluster software is responsible for preventing the so-called “split brain” situation, in which two nodes of a cluster think that they are controlling the same resource.
IPMI Shutdown Agent
The IPMI Shutdown Agent is a mandatory mechanism to avoid split brain situations in case of a loss of communications between nodes in a redundant server configuration. It is designed to protect against single point of failure scenarios.
DOWN Shutdown Agent
The DOWN Shutdown Agent is a mechanism that addresses the cases in which no takeover took place with the standard IPMI Shutdown Agent. The feature is based on the Survival Authority, which is mandatory for geographically separated clusters. The Survival Authority is optionally available for co-located clusters, where it allows a node to take over in the case of inter-node communications failure. The DOWN Shutdown Agent is designed to protect against some double failure scenarios (but naturally not a failure of both cluster nodes). A failure of the Survival Authority itself leads to the same no-takeover scenarios as with the IPMI Shutdown Agent alone.
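How the two shutdown agents complement each other can be sketched as a single decision function. This is a simplified reading of the descriptions above, not the actual PrimeCluster logic; the function and parameter names are hypothetical.

```python
def takeover_allowed(ipmi_shutdown_ok: bool,
                     sa_reachable: bool,
                     sa_grants_permission: bool) -> bool:
    """Decide whether a node may take over after losing its partner.

    Assumption: a node may take over if it shut the partner down via
    IPMI, or, failing that, if the Survival Authority is reachable and
    grants permission (the DOWN Shutdown Agent case).
    """
    if ipmi_shutdown_ok:
        return True
    # Without a reachable SA the outcome is the same as with the
    # IPMI Shutdown Agent alone: no takeover.
    return sa_reachable and sa_grants_permission
```

In this model, an unreachable Survival Authority makes the second branch always false, which matches the statement that an SA failure leaves only the protection of the IPMI Shutdown Agent.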
This section describes the Lotus Sametime Unified Telephony hardware components and how they contribute to cluster redundancy functionality. The three relevant hardware components are the Computing Node, the Ethernet Switch, and the Remote Access Card.
This section describes the Lotus Sametime Unified Telephony software components and how they contribute to cluster redundancy functionality. Two software components are relevant: the PRIMECLUSTER and the RTP software.
Lotus Sametime Unified Telephony supports redundant active/active applications for cluster softswitches. During normal operation, a redundant cluster operates in load-sharing mode (active/active). Both nodes participate in traffic processing, observe each other, and back up each other's configuration data.
The primary focus of the Lotus Sametime Unified Telephony failover strategy is to preserve stable calls and billing data, and to ensure that resources are not left in an unresponsive state—such that a given resource cannot be accessed without restarting a device, gateway, or the system itself.
Any single process instance failure does not affect service, with the possible exception of the call (context) being processed at the time of the failure. Each call context of a particular type is accessible by all process instances of that same type. If the last accessible process instance of a type fails on a node, the backup instances on the backup node take over. However, if the last accessible process instance of a type fails on the last active node, service is affected.
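The process-failure rules above amount to a small lookup. The following Python sketch assumes a two-node cluster and tracks only the number of accessible instances of one process type per node; the function name `serving_node` and the node labels are hypothetical.

```python
from typing import Dict, Optional


def serving_node(instances: Dict[str, int],
                 home: str,
                 backup: str) -> Optional[str]:
    """Return the node that serves a process type, or None.

    `instances` maps a node name to the number of accessible process
    instances of the type on that node (0 for a failed or absent node).
    """
    if instances.get(home, 0) > 0:
        # At least one instance survives locally: no failover needed.
        return home
    if instances.get(backup, 0) > 0:
        # Last local instance failed: backup instances take over.
        return backup
    # Last instance on the last active node failed: service is affected.
    return None
```

A result of `None` corresponds to the only case in which service is affected.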
An Ethernet failure can be caused by the failure of an Ethernet card, Ethernet port, or Ethernet cable. If a failure occurs, the Lotus Sametime Unified Telephony node's Linux bonding driver switches the IP address to the second Ethernet port on the same Lotus Sametime Unified Telephony node, then sends out a gratuitous ARP to update the routing tables in the LAN switch.
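The port switchover described above can be modeled in a few lines. This Python sketch is a toy model of active-backup behavior, not the Linux bonding driver itself; the classes `Port` and `Bond`, the port names, and the documentation address 192.0.2.10 are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Port:
    name: str
    link_up: bool = True


@dataclass
class Bond:
    ip: str
    ports: List[Port]
    active: Optional[Port] = None
    events: List[str] = field(default_factory=list)

    def __post_init__(self):
        self.active = self._first_up()

    def _first_up(self) -> Optional[Port]:
        """Return the first port whose link is up, if any."""
        return next((p for p in self.ports if p.link_up), None)

    def link_failed(self, name: str) -> None:
        """Mark a port as down and fail over if it was the active one."""
        for port in self.ports:
            if port.name == name:
                port.link_up = False
        if self.active is not None and not self.active.link_up:
            self.active = self._first_up()
            if self.active is not None:
                # Announce the IP address on its new physical path.
                self.events.append(
                    f"gratuitous ARP for {self.ip} via {self.active.name}")
```

Failing the standby port changes nothing in this model, while failing the active port moves the IP address to the surviving port and records one gratuitous ARP announcement.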
Ethernet Switch Failure
If a failure occurs in the Ethernet switch that is carrying active call data, each system component detects the failure and switches its links to the other Ethernet switch so no data is lost.
A redundant pair of routers operates in active-standby mode, with each router ready to take over the other router’s servicing function in case of a failure.
Double Ethernet Switch Failure
A double Ethernet switch failure interrupts the cluster interconnect communication between the partners, so the partner node takes over after consulting the Survival Authority. At the same time, the router loses its connection to the LAN, and the WAN reroutes the traffic to the partner location.
If a node fails, stable calls (those in conversation state) are preserved but unstable calls may be dropped.
Location Interconnect Failure
A failure of the interconnection between the two locations is detected by both nodes, and each node takes over the IP addresses of the partner. Signaling traffic is still routed to both nodes as before. Because both nodes must not take over each other’s function at the same time, one of the nodes is shut down.
Catastrophic Site Outage
If a catastrophic site outage occurs, failover actions vary depending on the type of geographic node separation.
Double Failures with Collocated Nodes
The cluster is also guarded against some double failures of the same kind without loss of service. Depending on which combination of failures occurs, there is no outage, some service reduction, or, in a few symmetrical double failures, a total outage.
Double Failures with Geographical Node Separation
Geographically separated nodes in the same subnet are protected against the same double failures as collocated ones, and are additionally guarded against some further double failures of the same kind at one data center without loss of service.
Lotus Sametime Unified Telephony Admin Network Failures
If one node cannot be reached for administration and maintenance—that is, if both admin Ethernet ports of the Linux bonding driver are unavailable—Lotus Sametime Unified Telephony provisioning and maintenance is still possible via the partner node. Only direct hardware maintenance is not possible.
Lotus Sametime Unified Telephony Signaling Network Failures
If one node cannot exchange signaling messages—that is, if both signaling Ethernet ports of the Linux bonding driver are unavailable—it tries to send messages via the partner node as long as the cross-channel is available.
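The fallback order for signaling messages can be expressed as a short function. This is a minimal sketch under the assumption that a node prefers its own signaling ports and uses the partner only when none of them is available; the function name and the returned labels are hypothetical.

```python
def signaling_path(own_signaling_ports_up: int,
                   cross_channel_up: bool) -> str:
    """Choose how a node sends its signaling messages."""
    if own_signaling_ports_up > 0:
        # At least one bonded signaling port still works.
        return "direct"
    if cross_channel_up:
        # Both signaling ports are down: relay via the partner node.
        return "via-partner"
    return "unavailable"
```

With both signaling ports down and the cross-channel up, the function returns `"via-partner"`, which is the case described above.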
Lotus Sametime Unified Telephony Billing Network Failures
If the communication of a Lotus Sametime Unified Telephony cluster node and its billing network fails - that is, if both billing Ethernet ports of the Linux bonding driver are unavailable - the consequences depend on the file reporting mode.