This article addresses improving Domino uptime. Domino uptime can be improved in many ways. One way is to make all components redundant, and the second is to make the infrastructure…simpler. Yes, simpler. A simple and clear infrastructure runs better. In this article, we discuss Domino components only. Power, cooling, network, disks, and the other components that lie below Domino need to be redundant as well.
Keep It Simple
The rule of thumb is not to mix configurations. If you have a mail cluster, let it provide mail to users. Do not put other things on this server, including all applications, web server, and products, such as Lotus Quickr, Traveler, and Sametime. For these types of products, you should use a separate server. On top of Domino, you can have only one such add-on product. So even if you have an additional server for such products, do not put them all on top of each other.
The IBM Domino licence states that if you need to install Traveler, you do not need to buy additional licences (for the additional Domino server). An advantage of putting Traveler on a separate server is that it improves security, because Traveler will be placed in the DMZ. Even if somebody does hack it, it will be an empty server. Nothing else will be compromised because all databases and applications will be located on other servers that are not available from the internet. If you use Sametime Entry to let users chat with each other, or you use Sametime for audio and video conferencing, then again, you should place it on a separate server. This will make your servers and services less dependent on each other.
Tip: Not every such product will work with every patch level of Domino. For example, a fix for your mail environment might not be compatible with the Lotus Quickr server running on the same machine. Before you upgrade the production environment, make a clone and test the upgrade there first.
Redundant Domino Parts
There are different approaches to improving the availability of the service for users and systems that access Domino. Which one fits depends on how users and applications access Domino.
Is it possible to measure the availability of a server? Yes. There are two factors in this calculation: PLANNED downtime and UNPLANNED downtime. Planned downtime is unavailability of a server that was forecast and planned beforehand and that does not impact users. For example, you plan to install a new version of Domino or patch the operating system during off hours. No one will notice that you brought the server down. After you complete your task, you bring the server back online. On the other hand, if there is a power failure during normal working hours, or the operating system or Domino crashes, this is considered UNPLANNED downtime. During this time, users cannot send or receive mail or access databases. They will call the help desk and register support calls. This unplanned downtime impacts the users' work.
The following table lists different approaches you can use to reduce unplanned Domino downtime depending on the size of your Domino environment (small, medium, or large).
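As a rough illustration of such a calculation, the sketch below expresses availability as a percentage of a reporting period. It assumes that, by default, only unplanned downtime counts against the server, since planned downtime is scheduled so that users are not affected. The function name, its parameters, and the sample figures are hypothetical and not taken from any Domino tool.

```python
def availability_percent(period_hours, unplanned_hours,
                         planned_hours=0.0, count_planned=False):
    """Return availability for a period as a percentage.

    Assumption: by default only unplanned downtime counts against
    availability, because planned downtime happens during off hours.
    """
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    downtime = unplanned_hours + (planned_hours if count_planned else 0.0)
    if downtime < 0 or downtime > period_hours:
        raise ValueError("downtime must be between 0 and period_hours")
    return 100.0 * (period_hours - downtime) / period_hours


# Hypothetical 30-day month: 4 hours of planned and 1.8 hours of unplanned downtime.
month_hours = 30 * 24
print(round(availability_percent(month_hours, 1.8, planned_hours=4.0), 2))
print(round(availability_percent(month_hours, 1.8, planned_hours=4.0,
                                 count_planned=True), 2))
```

Reporting both figures side by side shows how much of the lost time was scheduled and how much actually interrupted users.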
Seamless failover for Lotus Notes users.
ICM-Internet cluster manager
The following flowchart provides guidance on the approach best suited to your environment to reduce system downtime.
#1. Requires Enterprise or Utility license for each server.
#2. Requires database adaptation for failover. All data (as determined by the Domino Administrator) is duplicated on the servers in the cluster. If data is corrupted on one server and deleted during a consistency check (Fixup), it will be automatically restored (replicated) from the other servers. If data is deleted on one server, it is deleted from the other servers as well.
#3. You can use Domino Express licences if your organization has fewer than 1000 users.
#4. The servers use one shared disk to store data. If one server goes down, the other server comes up using the same data. If data is corrupted, both servers will operate on the corrupted data.
In the remaining sections of this article, we explain how you can reduce the downtime of your Domino server using these approaches.
Choose this approach if: Organization size is Medium or Large, or Domino downtime is unacceptable.
A Domino cluster is a group of two or more servers that provides failover for Domino applications. Failover is a process in which the Lotus Notes client is redirected from one server to another when the primary server is not responding or is overloaded. The advantage of failover is that users continue to have access to critical resources. In the most current versions of Domino, the user does not receive an error message and the failover happens transparently.
In case of a Domino cluster, database replicas of clustered databases are located on different servers. Clustering provides not only failover, but also load balancing. Overloaded servers can pass users to other servers that are not so busy.
In general, there are two types of operating system clusters: Active-Active and Active-Passive. In an Active-Active cluster, all servers are available and serve users' requests at the same time. In an Active-Passive cluster, one server (the Active server) serves users' requests while the other server (the Passive server) waits and does not serve users. When the Active server goes down, the Passive server notices this, becomes the Active server, and starts providing service to users.
Domino clustering requires a licence for every server included in the cluster. It needs to be purchased from IBM or an IBM Business Partner. Thus, there are additional licence costs for this solution. Some databases automatically support clustering and failover. Other databases, developed in house, do not support clustering by default and need to be programmed to support clustering.
Some databases provided by IBM already have clustering support, for example the mail database. Designer Help lists LotusScript methods that help make databases work in a cluster. For instance, NotesDatabase.Open is the regular database open method and NotesDatabase.OpenWithFailover is the cluster version of it. OpenWithFailover tries to open the database and, if the database is not available, automatically fails over to another server. Enabling a database to support cluster mode is not just a matter of substituting one method for another. There are other challenges that developers need to solve. For instance, two documents may be modified at the same time on different servers, so you need to think about document locking to prevent Save Conflicts. Agents are another issue to consider in a clustered environment. Domino supports failover for "After new mail has arrived" agents, but it does not support failover for scheduled agents.
The following technique can be used to make scheduled agents work in a clustered environment:
There is one master agent that triggers all other agents (slave agents). This master agent is scheduled on “Any server”. It is executed every X minutes on all nodes of the cluster. To keep the agent from actually doing its work on all servers at once, we try to run it on one server first. If that server is down, we run the agent on the other nodes. You can create a profile document in the database where the master agent lives and put two fields in it: PrimaryServer and TimeStamp. PrimaryServer is the name of the server on which the agent should run and trigger all other agents. When the master agent runs successfully on one server, it updates the profile document with a timestamp, and this profile document is replicated to the other servers via cluster replication. The master agents on the other servers check whether the timestamp is up to date. If it is, they quit, as they know that the master agent on the primary server is already at work. If there is a big gap between NOW() and the last timestamp, the master agent on the other server concludes that the primary server is down and takes over: it triggers all of the other agents itself.
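The difference between a plain open and an open with failover can be sketched in a few lines. This is an illustrative Python model only, not the LotusScript API: the function `open_with_failover`, the `open_on` callback, and the server names are all hypothetical, and the real cluster decides where to fail over using its own cluster information.

```python
def open_with_failover(db_path, servers, open_on):
    """Try each server in order and return (server, handle) for the first success.

    open_on(server, db_path) is a caller-supplied function (hypothetical)
    that returns a handle or raises ConnectionError if the server is down.
    """
    for server in servers:
        try:
            return server, open_on(server, db_path)
        except ConnectionError:
            continue  # this replica is unreachable, try the next cluster mate
    raise ConnectionError(f"{db_path} is not available on any of: {', '.join(servers)}")


# Simulated cluster in which the first server is down.
def fake_open(server, db_path):
    if server == "ServerA":
        raise ConnectionError("ServerA not responding")
    return f"{server}!!{db_path}"


print(open_with_failover("apps/orders.nsf", ["ServerA", "ServerB"], fake_open))
```

A plain open corresponds to calling `open_on` once and letting the error reach the user, which is why application code written that way does not survive a server outage.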
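The decision each copy of the master agent makes can be sketched as follows. This is a minimal Python model of the logic, not LotusScript: the `Profile` class stands in for the profile document with its PrimaryServer and TimeStamp fields, and the names `should_run`, `master_agent`, and `max_gap` are hypothetical. It assumes the primary server always runs when it is up, and that a missing timestamp is treated as "no run recorded yet".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Optional


@dataclass
class Profile:
    """Stand-in for the replicated profile document."""
    primary_server: str
    timestamp: Optional[datetime] = None  # last successful master run


def should_run(profile: Profile, this_server: str,
               now: datetime, max_gap: timedelta) -> bool:
    """Decide whether the master agent on this_server should trigger the slaves."""
    if this_server == profile.primary_server:
        return True  # the primary always does the work when it is up
    if profile.timestamp is None:
        return True  # assumption: no run recorded yet, so take over
    # A stale timestamp suggests the primary is down.
    return now - profile.timestamp > max_gap


def master_agent(profile: Profile, this_server: str, now: datetime,
                 max_gap: timedelta, run_slaves: Callable[[], None]) -> bool:
    """Run the slave agents if appropriate and record the time of the run."""
    if not should_run(profile, this_server, now, max_gap):
        return False
    run_slaves()
    profile.timestamp = now  # replicated to the other nodes by cluster replication
    return True
```

The gap should be comfortably larger than the agent schedule plus the usual cluster replication delay, otherwise a backup node may take over while the primary is merely slow.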
Nodes of a Domino cluster can be located in different buildings, cities or countries. Domino provides the Geo-Clustering option. In case of fire, or if it is impossible to work in the building, all information is available from the alternative location.
Domino clustering is the best way to provide high availability to Lotus Notes users. In the more recent versions of Domino, failover is seamless; in older versions, a prompt to redirect to a different server is displayed. You may build a Domino cluster on top of different operating systems. If you cannot make your applications cluster-aware, you can choose other solutions, such as an operating system cluster.
Additional Resources on this.
For more information, see
One type of cluster is the operating system (OS) cluster. There are two or more servers, and the Domino process runs on one of them. When there is a need to switch servers or there is a hardware failure, the Domino process is started on the other server. In both cases, the server runs under the same SERVER.ID and the same IP address. In a Domino cluster, by contrast, the servers have different names. If ServerA is running and we need to do some maintenance, we give the other node the command to take control. Then the same ServerA is started on the other node. Users may experience a small delay while the first server goes down and the second server starts up.
Almost every operating system provides an option to build an OS cluster. An additional licence may be required for this. On Linux, for example, the Heartbeat daemon provides a way to build a cluster. In the configuration of this daemon, you define the primary node and the processes that need to be monitored. If one server goes down, the second node takes control. It maps the shared disk where the Domino data is located and starts the Domino server on the other machine. The OS-clustered Domino server appears to end users under the same name and the same IP address. Systems that access Domino by host name, such as POP3/IMAP/LDAP/HTTP clients, will still reach the server.
If you have fewer than 1000 users in your company, you may use the Domino Express licence for the OS cluster, since the limitations of the Express licence are that you have fewer than 1000 users and that Domino clustering is not used. From that point of view, you can have Domino high availability at a relatively low price.
If you use an OS cluster, then you will NOT use the Lotus Notes client failover feature. From Release 8.x and 8.5.x, Domino supports failover for opened databases. In the case of an OS cluster, users may need to re-open databases.
You MAY use an OS cluster in conjunction with Domino cluster. This is a supported configuration.
Internet Cluster Manager (ICM)
In the earlier sections, we discussed high availability of the entire Domino server. Starting from Release 5.x, there is another option for high availability: Internet Cluster Manager, also referred to as ICM. ICM provides failover for web clients that access your iNotes server, intranet, or company homepage. ICM is an additional task that is loaded on a Domino server. It is quite easy to set up ICM if you already have a working Domino cluster. ICM is an addition to a Domino cluster, and it works only with a Domino cluster.
If you have many clusters in your Domino environment, you define in the Domino configuration which Domino cluster ICM should look at. ICM plays a redirector role, like a dispatcher who directs landing planes. When a new HTTP request comes in, ICM knows which servers are available and redirects the user to one of them. To keep this knowledge current, ICM periodically probes the servers in the cluster to check which of them are available. When one server goes down, ICM notices that, and subsequent new requests are directed to another available server.
A working scenario of ICM
We have two mail servers, mail1.company.com and mail2.company.com. ICM listens for requests on webmail.company.com. When users type the webmail.company.com address, the request goes to ICM. ICM already knows which servers are up and answers with a redirect. The URL in the user’s browser changes to mail1.company.com or mail2.company.com, and the user is asked to authenticate.
It is advised to run ICM on a server separate from the Domino servers that it redirects to. Otherwise, if ICM runs on an overloaded server, that server may become unavailable and ICM will not be reachable for clients. If that occurs, ICM will not redirect requests. If ICM runs on a separate machine, it only sends redirect information back to clients, so its load is light. The ICM system should have low unplanned downtime.
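The scenario can be modelled with a small redirector that tracks which servers are up and answers each new request with the URL of an available one. This is a simplified sketch and not how ICM is implemented: the `Redirector` class is hypothetical, round-robin selection is an assumption made for illustration, and the `http://` scheme is chosen arbitrarily.

```python
class Redirector:
    """Toy model of a redirector in front of clustered web servers."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.up = set(self.servers)  # servers currently believed available
        self._next = 0

    def mark(self, server, is_up):
        """Record the result of an availability probe."""
        if is_up:
            self.up.add(server)
        else:
            self.up.discard(server)

    def redirect_url(self, path):
        """Return the URL to redirect to, or None if no server is available."""
        candidates = [s for s in self.servers if s in self.up]
        if not candidates:
            return None
        server = candidates[self._next % len(candidates)]
        self._next += 1  # assumption: plain round-robin between available servers
        return f"http://{server}{path}"


icm = Redirector(["mail1.company.com", "mail2.company.com"])
icm.mark("mail1.company.com", False)       # a probe found mail1 down
print(icm.redirect_url("/mail/jdoe.nsf"))  # new requests now go to mail2
```

Because the redirector only hands out a URL and is not involved in the session afterwards, a user who is already on a server that fails has to return to the entry address to be redirected again.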
What will happen if ICM is down
The cluster will be up, but users will not be redirected. If you want to improve the availability of ICM, you can have several ICM servers and assign multiple IP addresses to one host name. Then the failover between ICMs happens at the DNS/OS level. With this approach you can achieve a high level of ICM availability.
Deploying Single Sign-On for the web servers gives your users additional benefits. If the mail servers share one common LTPA token, then when users change the URL from mail1.company.com to mail2.company.com, there are no additional login screens. If one mail server goes down while users are reading mail, they have to go back to the webmail.company.com host, and ICM will redirect them to another available server. With Single Sign-On set up, users get to their mail without additional login forms.
iNotes High Availability Configuration
For information on iNotes high availability configuration, see
IMAP failover (Domino 8.5 new feature)
Release 8.5.2 of Domino added new functionality for IMAP users. IMAP users can now fail over to different servers without the IMAP client being confused. If you have many IMAP users, you may benefit from this feature. Some additional steps are needed to configure IMAP support on Domino clustered servers, so refer to Technote #1429885 or the 8.5.2 Administrator help. In conjunction with multiple IP addresses assigned to one host name, or with a software proxy, IMAP users will be redirected between available servers.
For additional information, see
Failover and the IMAP server
Lotus Traveler Server High Availability
At the moment, Lotus Traveler does not provide an option for native clustering. An enhancement request for this support has been registered. You can still build several Traveler servers. If a user manually changes the IP address of the server, the Traveler client will do a full sync of data, which is quite a time- and traffic-consuming task. This is because of the design of the Traveler server.
One of the best ways to make Traveler highly available is to put it on an OS cluster or to use a proxy server that works in Active-Passive mode.
Traveler Clustering and failover
One more option for improving the availability of the servers is a load balancer. A load balancer is a hardware device or software that can check whether target servers are available. When a new request comes in, it redirects the request to one of the available servers. A hardware load balancer can be used to distribute POP3/IMAP/SMTP users between servers.
In addition to POP3/IMAP/LDAP/SMTP protocol failover, you can use a load balancer to switch Lotus Notes users between Domino servers. If you want to use an IP sprayer (load balancer) with Lotus Notes, you need to set the additional parameter described in Technote 1233210. You can deploy this parameter with the help of policy/desktop settings.
Notes client fail to connect to Domino servers behind a network sprayer
Software proxy (IBM HTTP, NGINX, etc.)
There are software programs that work as proxy servers and can fail over automatically between the servers from which they request data. Some solutions, like IBM HTTP, can provide failover (reverse proxy) for the HTTP/HTTPS protocols. Others, for example NGINX, can serve SMTP, POP3, IMAP, and HTTP. There are different vendors, and each proxy supports a different set of protocols. Depending on your needs, that is, whether only HTTP failover is needed or additional protocols such as SMTP and POP3 as well, you can select which solution to use.
Sametime and Quickr High Availability
Sametime and Quickr servers may also be clustered. If these resources are critical for you, deploy clustering, which will provide high availability for Lotus Sametime or Lotus Quickr users. Follow the links below for guidance on deploying a cluster for Lotus Sametime and Lotus Quickr.
QuickPlace clustering guide: configuration and managing places
Disaster Recovery Plan
This section deals with Domino recovery after an OS or hardware failure. A recovery plan is a document that defines the sequence of actions and the responsibilities during a server restore. A test recovery is a procedure that needs to be done after you deploy a new backup solution. In addition, a test recovery needs to be repeated every year to make sure that changes made in the environment are reflected in the backup policy. A test recovery shows whether everything is fine with the backup procedure. It should be requested without notifying the backup team in advance, so that they do not prepare additional (full) backup copies. The purpose is to understand what problems you have in the test case and to eliminate them for the future.
When you put a new server into production, be sure that this server is included in the daily backup procedure. Nobody knows when a recovery will be needed or what data will be needed. It can be one text or configuration file, a stand-alone mail file, a database that is part of an application, or the entire server. It is vital to have a recovery plan for your Domino environment. Write it as if you were going on vacation: you know that there will be problems, and you do not want to get calls on your cell phone. Your colleagues should be able to do the recovery according to your documentation. Describe what needs to be restored, how to restore it and in what sequence, installation locations, IP addresses, and phone numbers. This document should be kept up to date. It should be printed and stored in an accessible place. Do not keep it only in electronic format. If the system is down, you will not be able to access it.
It is advised to do a test recovery of the entire server. You can do this on a separate machine, a test server. Be sure to restore it on a machine isolated at the IP level, so that when you bring the server up it does not replicate with production servers. If you have done the full recovery once, you will be able to do it again more smoothly and faster in real life. Do spend time describing the steps you performed in the document. In real life, you will do this several times faster than the first time. Test the recovery. Find and highlight the things you may have documented wrongly in your current backup plan. For example, you back up .nsf files with a Domino-specific backup solution and back up everything else with an OS backup solution, except the Domino DATA folder. In that case some important files, such as cert.id, server.id, and notes.ini, may be excluded from the backup. A test recovery ensures that everything is fine with your backup solution and approach.
In your recovery plan, describe the sequence of the restore: how to restore the entire server, one mail file, or one document (restore to an alternative location, then copy and paste).
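A check for the gap described above is easy to automate. The sketch below assumes you can export the list of file paths contained in a backup set as plain strings; the function name and the sample paths are hypothetical, and the list of critical files contains only the three named in this section, so extend it for your own environment.

```python
from pathlib import PurePosixPath

# The three files named in the text; extend this tuple as needed.
CRITICAL_FILES = ("cert.id", "server.id", "notes.ini")


def missing_critical_files(backed_up_paths, required=CRITICAL_FILES):
    """Return the required file names that do not appear in the backup listing.

    Names are compared case-insensitively, and both Windows and UNIX
    path separators are accepted.
    """
    present = {
        PurePosixPath(p.replace("\\", "/")).name.lower()
        for p in backed_up_paths
    }
    return [name for name in required if name.lower() not in present]


# Hypothetical listing in which only the .nsf files and server.id were backed up.
listing = [
    "D:\\Domino\\data\\names.nsf",
    "D:\\Domino\\data\\mail\\jdoe.nsf",
    "D:\\Domino\\data\\server.id",
]
print(missing_critical_files(listing))
```

Running such a check against every backup set catches this kind of omission between the yearly test recoveries, but it does not replace an actual restore.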