The purpose of this document is to provide you with useful information for troubleshooting general Lotus Domino server performance issues.
So, you have identified a performance issue on your Domino server. What should you do now?
One of the main problems with performance issues is that the nature of the issue is elusive. The resolution to an issue in one area of a system may lie in a totally unrelated area. So, in such cases, has the issue really been resolved? Yes and no. Even though the issue has been resolved, the true nature of the issue was not identified. Thus, you may have just alleviated the symptoms temporarily. Due to the complexity of computer systems, performance can improve or worsen, either smoothly or in chunks (a step function versus a smooth curve).
An example of a smooth decrease in performance is the gradual degradation of overall server performance as a few users are added to a server. An example of a step change is an application that is modified to store and retrieve larger notes. This change may stress the NSF buffer cache past its optimal usage and result in a huge increase in disk IO, slowing server performance dramatically. If it is a smooth change, then modest changes in operation will produce only modest changes in performance. If it is a step change, modest changes in operation will often produce a drastic change in performance. When possible you should make one change at a time, and then closely monitor the system for changes in performance.
The scope of this document is not to identify how to maximize performance, but rather to focus on issues where server performance is affected by factors that are adversely affecting the system. Note that this is very different from "tweaking" a system to maximize performance. The issues that will be described have been affected by a change to the status quo in a system's operation. You will be presented with a process for identifying "what the problem is," documenting the nature of the issue, diagnosing the problem and taking corrective actions, and then determining if the corrective actions had the necessary effect.
Ask yourself these preliminary questions.
1) How is the problem manifesting itself? What does the problem look like? What are the indications that a problem exists? The key point here is that there is a state of normalcy that is expected. The existence of a performance issue is causing the server to operate outside the bounds of normal operation. Why is it important to state this? Many times a customer is certain that there is a problem but unsure of the normal state of operation. For example, let's say we get a network problem where we are expected to resolve an issue with disk performance. But how do we know what is normal for the system? Is 10MB/sec normal? Is 100MB/sec something we should be working towards?
When dealing with a performance problem we absolutely must restrict our investigation to obtaining a state of normalcy for the system. Why would we do this if we can get more performance through further investigation? Because what has affected your system is a single set of variables that must be addressed. Once we have addressed those variables and made the necessary changes, the balance and normal operation of the server has been restored. Once we go beyond addressing the issues that caused the deviation from the norm, we begin to enter a different arena. Now, changes to the system are being made not to restore the prior balance, but rather to alter the profile of the system to a new and possibly better state. At this point the changes become more experimental rather than a fix. While not wholly a bad thing, the changes could possibly make things worse, and the scope of the issue effectively becomes never-ending.
2) Another major question to ask is, "Where is the problem coming from?" To help answer this question, think of your system as divided into two logically separate areas: resources and resource management. Within those areas we can make certain divisions. For resources, divide the computing capability into CPU, IO and memory. IO is a special category since it can and should be subdivided for logical purposes into Disk IO and Network IO. For resource management, partition the issues as being based in the application (i.e., Domino), the Operating System (OS) or the hardware. Visually we might envision a table that looked something like the following:
You would be surprised at how many people fail to think along these lines. Because so many areas of computing overlap in how problems manifest, and because the resolution does not necessarily lie where the problem appears, most people rely on intuition and experience. While this can be effective, it takes time to build a large library of experiences to use as reference. And even then, it is nearly impossible to teach others how to use that knowledge intuitively. It is not advisable to follow that route, as it can lead to misconceptions and incorrect diagnoses for those relatively new to performance troubleshooting. By using a layered approach to resources and their management, we are able to use the logic of each layer to help in identifying where a problem may exist.
3) How reproducible is the problem? This is fairly important because without some measure of reproducibility, our ability to determine what the problem is and make changes that affect the issue goes down dramatically. How can we document or test whether what we thought the issue was really was the issue? How could we possibly distinguish a single instance of a problem from a random event? If it only happened once, we can't. If we can't collect data on the problem, then there isn't much we can do to make a determination. Because the nature of resolving performance issues is in no way concrete, it is vitally important to keep in mind that the process of resolution is tremendously iterative. While it might be nice to be able to point to the table above and state that is where the problem is, more often than not you find yourself either iterating through each cell in the table or using your experience and expertise to guide you to the most likely solution as a guess.
Documentation is a key element in any type of problem determination, and performance problems are no exception. It is the one step that changes this from being a random process to a scientific one. Of course we can make changes to the system in accordance with what we think the problem is. But if you have no evidence that there is a problem in that area, you are essentially just making a guess. Given that you may be dealing with a number of different parties as well as trying to present an argument for action to your management chain, the question you have to ask is how likely different people are to accept what you have to say based on your feelings alone. This actually happens quite a bit and sometimes it goes quite far. To cover your bases, however, you need to be able to identify not only the general area in which the problem exists, but also the changes that were caused by the problem. This way, even if you can't determine the root cause of the problem yourself, you have the basis for a discussion to which others can add value. Of course, in order to show that something has changed, you will want to keep statistics on how things were before the problem occurred. Keeping data on hand is relatively cheap and will ultimately decrease your time to resolution. While it may seem unimportant when there is not a problem, documentation can prove invaluable if an issue occurs.
The table below lists various tools useful for troubleshooting on a Windows system:
- perfmon - % Processor Time
- perfmon - Pages/sec, Available
- perfmon - Disk Queue Length, Avg Disk sec/Read
- task manager - VM size
- Hardware level tools
The NSD, semaphore debug, and Domino statistics (show stat) are of particular importance in troubleshooting performance issues.
A semaphore is a variable that restricts access to a resource. For example, you might have a semaphore that protects a file from concurrent access. It might be a bit value, where 1 means something is using the file and 0 means the file is not being used. So if another process wants to use this file, it checks the semaphore to make sure that nothing is using it (0) before it takes control of the file and flips the semaphore to 1. Because Domino uses so many shared resources and there are so many processes in contention for these resources, you can enable debug (debug_capture_timeout=1 in the notes.ini) within Domino that will provide information about any semaphore requests that take too long to process. This is invaluable because if Domino is slow, it is usually because it is waiting for something. The debug output will reveal what that is.
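The idea of a semaphore that reports slow requests can be illustrated with a minimal Python sketch. This is not how Domino implements its semaphores or its timeout debug; the class name `TimedLock`, the `slow_after` threshold, and the `demo` function are all hypothetical names chosen for illustration. Note also that Python's `threading.Semaphore` counts available slots, so its value of 1 means "free", the reverse of the bit convention described above.

```python
import threading
import time


class TimedLock:
    """Binary semaphore that records acquisitions which waited too long."""

    def __init__(self, name, slow_after=0.05):
        self.name = name
        self.slow_after = slow_after        # seconds before a wait counts as slow
        self._sem = threading.Semaphore(1)  # counter 1 = free, 0 = in use
        self.slow_waits = []                # (thread name, seconds waited)

    def __enter__(self):
        start = time.monotonic()
        self._sem.acquire()                 # blocks while another thread holds it
        waited = time.monotonic() - start
        if waited > self.slow_after:
            # Recorded while holding the semaphore, so no extra locking is needed.
            self.slow_waits.append((threading.current_thread().name, waited))
        return self

    def __exit__(self, *exc):
        self._sem.release()
        return False


def demo():
    """One thread holds the lock for 0.2 s while a second thread waits for it."""
    lock = TimedLock("file-lock", slow_after=0.05)
    held = threading.Event()

    def holder():
        with lock:
            held.set()          # signal that the lock is now taken
            time.sleep(0.2)     # simulate slow work on the protected resource

    def waiter():
        held.wait()             # only request the lock once it is held
        with lock:
            pass

    threads = [threading.Thread(target=holder, name="holder"),
               threading.Thread(target=waiter, name="waiter")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return lock.slow_waits
```

Calling `demo()` returns one entry naming the `waiter` thread, which mirrors what the debug output is meant to reveal: who was waiting, and for how long.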
The NSD can be considered a Swiss Army Knife for issues related to Domino. It takes a snapshot of the system at all levels to provide details on activity, statistics and configuration at the time it was run. Two key areas to mention are the stack dump and memcheck. The stack dump is relevant because, regardless of the platform, it shows the routines or functions that were called for each thread in a process, for all Domino processes. By looking at the top function of a stack we can see the most recent activity. In the examples below, the nserver thread 53 of 68 is sleeping. Essentially, it's not doing anything. And nsched thread 1 of 3 is attempting to lock memory. If we wanted to see if it was successful, we would take another NSD to see if the thread was able to move past that function.
# thread 53/68: [ nserver: 3900: 2884]
### FP=0353ff38, PC=7c90eb94, SP=0353fee0, stkbase=03440000, stksize=262144
# thread 1/3: [ nsched: 1400: 3696]
### FP=0012f988, PC=6000153d, SP=0012f988, stkbase=00030000, stksize=20480
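When comparing two NSDs to see whether a thread has moved on, it helps to identify the same thread in both files. The following is a minimal Python sketch that parses thread header lines of the form shown above. It assumes the bracketed fields are the task name, the process ID and the thread ID, in that order, and that the IDs are decimal as in these samples; the names `ThreadHeader` and `parse_thread_header` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ThreadHeader(NamedTuple):
    index: int   # position of this thread within the process (e.g. 53)
    total: int   # number of threads in the process (e.g. 68)
    task: str    # task name, e.g. "nserver" (assumed meaning)
    pid: int     # process ID (assumed meaning)
    tid: int     # thread ID (assumed meaning)


_HEADER = re.compile(
    r"#\s*thread\s+(\d+)/(\d+):\s*\[\s*(\w+):\s*(\d+):\s*(\d+)\s*\]"
)


def parse_thread_header(line: str) -> Optional[ThreadHeader]:
    """Return the parsed header, or None if the line is not a thread header."""
    m = _HEADER.search(line)
    if not m:
        return None
    index, total, task, pid, tid = m.groups()
    return ThreadHeader(int(index), int(total), task, int(pid), int(tid))
```

Matching on `(task, pid, tid)` across two snapshots gives a simple key for checking whether the top function of a given thread has changed.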
The memcheck portion of the NSD is a rundown of how the memory Domino has allocated is being used. It can give information on system memory usage, handle usage, network usage, in-use database structures, and file usage. Because of the scope of the topic, it won't be covered here. Suffice it to say, however, that memcheck comes in very handy for various performance issues.
Domino statistics (show stat) can provide great insight into what is going on from a statistics point of view. Although a historical view can be gathered from the statrep, often it's more effective to just type "show stat" at the Domino console to capture data at the time of the problem.
At this phase of performance troubleshooting you would typically start to engage the experts in each area. Here you are tasked to interpret the results of what is seen, and from that infer what needs to be done. Unfortunately, this is not always as easy as it sounds. Determining the source of a problem requires not only knowledge but an understanding of the results captured in the documentation. For example, it is quite common for one person to collect stats that may indicate that memory utilization is not very good (e.g., overcrowding rejections). One expert in the area might deem that this definitely identifies a problem with lack of available memory, while another expert may feel the watermark lacks significance and is not likely to be the cause of the problem. The major pitfall here is making changes to affect what was documented rather than the problem itself. This further cements the need to have a specific problem on which to focus, because we can only truly conclude that our changes had the desired effect based on the symptoms of the problem, not on what we think we are seeing.
One of the major hurdles in this phase is gaining a solid understanding of the architectural limits and operation of each type of resource manager with each resource. Of course, this is rather a broad topic, which is why teaming and engaging the right resource is so valuable.
- number of CPUs/parallelism
- OS limitation - user space, disk cache, process size
- type/content of document
- architecture (SAN, file system)
- IO management configurables
- Logical volumes management
- architecture (switches, cabling, disks)
- physical volume management
- packet management (size, route)
For each area we need to ask ourselves, "Is the problem primarily a throughput issue or a bandwidth issue?" In other words, are we constricted by the ability to use the resource, or is the resource itself, or the lack of it, causing the problem? Bandwidth issues tend to characterize hardware issues, and throughput issues tend to characterize either OS or application issues. For example, in some cases we have seen that use of the Nagle algorithm (data is bundled together to reduce the number of packets sent) can negatively affect performance because the system has to wait on what is an artificial delay. In those cases the issue was not a lack of bandwidth, but rather how the bandwidth was used. One thing to keep in mind is that inefficient resource usage may lead one to think the system is running out of resources when in reality it is a throughput issue. Let's say we have a system which has no available CPU left. The natural reaction may be to add CPU. Yet on closer inspection, it is found that the processor is generating an abnormal number of context switches. In this case, the issue was not that there wasn't enough CPU, but how the CPU was being used.
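To make the Nagle example concrete, the sketch below shows how a generic application can opt out of the Nagle algorithm on one of its own TCP sockets by setting the standard `TCP_NODELAY` option. This is only an illustration of the mechanism at the socket level; it is not a Domino setting, and the function name `make_low_latency_socket` is hypothetical.

```python
import socket


def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with the Nagle algorithm disabled.

    With TCP_NODELAY set, small writes are sent immediately instead of
    being held back and bundled with later data.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

The trade-off is the one described above: disabling bundling removes the artificial delay but sends more, smaller packets, so it addresses a throughput symptom without adding any bandwidth.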
Lastly, after we have made the changes we want to test to see if they had the desired effect. The test in our case is relatively easy. Because we know that there was a state of normalcy, we simply need to determine if that state of normalcy has returned. We will also want to monitor the statistics that helped us focus on the source of the problem as well as the manifestation of the problem itself. They should correlate with the implemented changes. If they do not, the problem was likely something other than we thought it was, and we must start the procedure again.
To get a better idea of how to apply the principles presented here, in the remainder of this document, we will review various types of problems that you may find. We will provide examples on what tools we chose to use, and why we chose to use them. Finally, we will give an analysis of why we came to a diagnosis and what the resolution to the problem is.
NOTE: Don't get tied to theory. In a performance issue, it isn't always a simple equation. To make even an initial problem area identification, you will need to at least iterate through each of the possibilities. And if it is a more in-depth problem, you may need to iterate through them in more depth. This is why all-in-one tools like the NSD are so valuable.
Typically, CPU issues fall into one of two categories: 1) A high CPU load (i.e., CPU running on or near 100%), or 2) The CPU load is very low even though the overall Domino performance is sluggish. You can manage the CPU at the hardware, OS, or application (Domino) levels.
The hardware level is often the most basic level of the three. The BIOS will support a given number of CPUs and will report to the OS the number of CPUs installed. You would encounter performance issues should you attempt to run three partitioned Domino servers on a single CPU running at 700 MHz. In this environment, the system would not meet the minimum CPU requirements for the three servers. It is possible, however, that when first implemented the servers would perform without consequence. But over time the server load may change, thus affecting system performance.
There is an added layer with larger systems such as AIX, iSeries (AS/400) or zSeries (OS/390). These systems allow administrators to configure the physical system into logical partitions (LPARs). Each LPAR can be assigned a full CPU, a portion of a CPU, or multiple CPUs. It is important that administrators be aware of the resource allotment per LPAR. For example, if half of a CPU is allocated to a single LPAR running three partitioned servers, the CPU could easily run at 100%, which would result in subpar performance of the servers.
CPU running very high (at or near 100%)
- Too many Domino server partitions
- Poor load balancing with tasks
- Other applications taking up load (hogging CPU)
- Not enough CPU
CPU running very low
- Insufficient number of threads
- Server configuration (Max Trans)
- High paging - configuration
- Thread priority changed
- High paging - not enough memory
The OS level will determine how hardware is implemented and also the amount of CPU which will be used in overhead. Besides the CPU load generated by the OS and related add-in tasks, you must also consider the load generated by other applications running on the server. For example, if Domino is running on a primary domain controller that is also the DNS and DHCP server, the OS would consume more memory than it would on a dedicated system. Domino performance is also affected when an application requests a page of memory and the OS must swap something from real memory to virtual memory to grant the request. This process, which is known as paging, is part of the normal process for most systems. As paging increases, however, there is an increase in disk IO which can slow down the response. Over time there have been significant advances made to physical disks to increase throughput and speed; however, physical memory continues to be faster than a disk read. The OS will begin to thrash as paging increases, and once the system reaches a state of thrashing, it spends more time swapping memory than it does processing CPU requests. Thus, the system can no longer respond in a reasonable amount of time. At the OS level you can set the priority for a given program or thread. While Domino is designed to install with the best possible thread priority, it is possible that the priority of a given thread may need to be adjusted to provide a better user experience.
The addition of server tasks, functions, or users on Domino increases the system load. As users start to experience the potential of Domino, they may start to expand their use of the server and the load they place on it. At times one thread may use a large amount of the CPU, and yet at other times, a single thread may be in a locked state, causing other threads to wait pending the release of the lock. Domino has the ability to throttle the load on the system by using a notes.ini setting which limits the number of concurrent threads that can run on a given partition. This can cause a thread to go into a "wait" state if the maximum number of concurrent threads is already running. Even the way that an agent is written can adversely affect the overall performance of the Domino server. If an agent is written to perform a full-text search on a non-indexed database, Domino may create a temporary full-text index. Once used, the index is deleted. The next time the agent runs, however, a new temporary full-text index will be created regardless of whether there were any changes to the database. This adds extra overhead and will decrease the performance of the agent and the server in general. Overall, the administrator has the ability to configure Domino to service the expected load. The way that the administrator configures the server will determine the overall impact of the changes on users.
This section will describe potential bottlenecks which relate to the CPU. The data that should be collected will vary depending on the suspected problem. Some of the data will be collected by discussing the environment with the hardware administrator, others from OS level tools, and some from Domino. Remember that some systems have the ability to limit the amount of CPU allocated per LPAR. The physical system may have 24 CPUs while the LPAR hosting Domino may have only one.
The data to collect from a hardware point of view may change depending on the OS that is being used. For example, on an AIX system the System Devices are listed in the output of an NSD. This includes a list of the CPUs configured on the system and their settings. The list below is from a large AIX system. The NSD shows that they have a total of 4 CPUs allocated to this LPAR. There are additional CPUs in the physical box that are not allocated to this LPAR.
name status location description
proc0 Available 00-00 Processor
proc2 Available 00-02 Processor
proc4 Available 00-04 Processor
proc6 Available 00-06 Processor
On a Windows NSD, this information is located near the top of the output and is listed in the OSVersion line (OSVersion : Windows XP 5.1 (Build 2600), PlatID=2, Service Pack 2 (1 Processor)). This information is passed from the OS. We can also find this information on a Windows system by looking at the System Properties window.
The speed and number of CPUs will determine the number of Domino partitions, users and tasks which can be run. For more information about sizing a Domino server, refer to the following:
"Domino for IBM xSeries and BladeCenter Sizing and Performance Tuning" and "Domino for iSeries (AS/400)"
There are many tools available to view the CPU usage on a system. For a Windows system, you can use Windows Task Manager.
The figures above show that the CPU is running at 100%. In one case, an administrator had begun a Disk Cleanup midday. The cleanmgr.exe thread was using 93% of the CPU. User response time declined until this thread released the CPU. Just because the CPU is running at 100% does not mean that there is a problem. A review of what thread or process is using the bulk of the CPU can reveal the root cause. During the Domino startup process there will be a spike in the amount of CPU that is used by Domino; this is normal. The load will drop once the server startup completes.
Domino has its own set of tools that can be used to collect information. This includes Domino statistics (show stat) or the Domino server information screen (show server).
[01A8:0006-08F8] Lotus Domino (r) Server (Release 6.5.3 for Windows/32) 04/10/2006 04:22:55 PM
[01A8:0006-08F8] Server name: SET_Test1/Support
[01A8:0006-08F8] Server directory: g:\notes\data
[01A8:0006-08F8] Partition: g.notes.data
[01A8:0006-08F8] Elapsed time: 01:10:17
[01A8:0006-08F8] Transactions/minute: Last minute: 6; Last hour: 4; Peak: 23
[01A8:0006-08F8] Peak # of sessions: 2 at 04/10/2006 04:20:55 PM
[01A8:0006-08F8] Transactions: 35 Max. concurrent: 20
[01A8:0006-08F8] ThreadPool Threads: 40
[01A8:0006-08F8] Availability Index: 100 (state: AVAILABLE)
[01A8:0006-08F8] Mail Tracking: Not Enabled
[01A8:0006-08F8] Mail Journaling: Enabled, Local Destination
[01A8:0006-08F8] Shared mail: Not Enabled
[01A8:0006-08F8] Number of Mailboxes: 1
[01A8:0006-08F8] Pending mail: 0 Dead mail: 0
[01A8:0006-08F8] Waiting Tasks: 0
[01A8:0006-08F8] Transactional Logging: Not Enabled
[01A8:0006-08F8] Fault Recovery: Enabled
[01A8:0006-08F8] Activity Logging: Not Enabled
[01A8:0006-08F8] Server Controller: Not Enabled
[01A8:0006-08F8] Diagnostic Directory: g:\notes\data\IBM_TECHNICAL_SUPPORT
[01A8:0006-08F8] Console Logging: Enabled
[01A8:0006-08F8] Console Log File: g:\notes\data\IBM_TECHNICAL_SUPPORT\console.log
The output above shows "Availability Index: 100 (state: AVAILABLE)," which represents the server availability index (SAI). The SAI is a number from 0 to 100 representing the relative availability of the Domino server. Each Domino server periodically determines its own workload based on the response time of the requests the server has recently processed. The workload is expressed as a number from 0 to 100, where 0 indicates a heavily loaded server and 100 indicates a lightly loaded server. It is important to understand that the SAI is not a percentage. In Domino 6.x and later releases, SAI is computed by examining the difference between the elapsed time to perform transactions on an idle system and the elapsed time to perform them on a loaded system. The SAI can be tuned depending on the hardware capability of the system. Remember, it is a relative availability for that server; each server may have a different SAI based on load and server capability. If the SAI changes, this is an indication of a change in performance. For more information, refer to the document titled "Domino 6.x Server Availability Index (SAI) -- Understanding How SAI Is Calculated" (#1164405).
You can also configure Domino to provide information regarding its performance by enabling "Server_Show_Performance=1" in the notes.ini. The output will show the number of transactions the Domino server performs per minute as well as the current number of active users on the system. This information, which will post to the server console every 60 seconds, can be helpful in tracking system performance over a period of time. It will show peak usage times both for users and transactions. Transactions include any transaction that the server is performing (e.g., Router, agent manager, replication, virus scanner, and user actions).
[01A8:0021-0254] 04/10/2006 04:30:47 PM 6 Transactions/Minute, 0 Notes Users
[01A8:0021-0254] 04/10/2006 04:31:47 PM 0 Transactions/Minute, 0 Notes Users
[01A8:0021-0254] 04/10/2006 04:32:47 PM 4 Transactions/Minute, 0 Notes Users
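Because these lines are posted every minute, they are easy to collect from the console log and summarize. The following minimal Python sketch parses lines in the format shown above and finds the busiest minute. It assumes the timestamps are month/day/year with a 12-hour clock, as in the sample; the function names `parse_performance_lines` and `peak_transactions` are hypothetical.

```python
import re
from datetime import datetime

# Matches e.g. "04/10/2006 04:30:47 PM   6 Transactions/Minute, 0 Notes Users"
_PERF = re.compile(
    r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} [AP]M)\s+"
    r"(\d+) Transactions/Minute, (\d+) Notes Users"
)


def parse_performance_lines(lines):
    """Return a list of (timestamp, transactions_per_minute, notes_users)."""
    samples = []
    for line in lines:
        m = _PERF.search(line)
        if m:
            # Assumed timestamp layout: MM/DD/YYYY hh:mm:ss AM/PM
            stamp = datetime.strptime(m.group(1), "%m/%d/%Y %I:%M:%S %p")
            samples.append((stamp, int(m.group(2)), int(m.group(3))))
    return samples


def peak_transactions(samples):
    """Return the sample with the highest transactions/minute, or None."""
    return max(samples, key=lambda s: s[1]) if samples else None
```

Keeping such samples from a period of normal operation provides the baseline that the earlier sections recommend comparing against when a problem is reported.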
Diagnosis and Corrective Actions
Over time the load on a Domino server can increase, which can affect the overall performance of Domino. Additional Domino tasks (such as POP3 or HTTP) and add-on applications (such as LEI or anti-virus) may alter the response time experienced by users. To improve the response time, it may be necessary to make changes at the hardware level. These changes can include adding CPUs, increasing the CPU allocation, or balancing the CPU load.
Performance troubleshooting is an iterative process; therefore, by removing one bottleneck you could potentially reveal another. For example, a system whose paging issue was solved by adding memory may now have a CPU problem.
On a Windows system, Perfmon can be used to monitor the number of pages per second. It is considered excessive when the pages per second multiplied by the AvgDiskSec per transfer multiplied by 100 is greater than 12% to 16% (100*(Memory:Pages/sec * PhysicalDisk:AvDiskSec/Transfer)). The figure below shows that the average pages/second is 129.
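The rule of thumb above is simple arithmetic and can be sketched in a few lines of Python. The function names and the default threshold of 12 (the lower end of the 12% to 16% range quoted above) are choices made for this illustration. The disk transfer time used in the usage note is a hypothetical value, since only the pages/sec figure is given here.

```python
def paging_disk_share(pages_per_sec: float, avg_disk_sec_per_transfer: float) -> float:
    """Return 100 * (Pages/sec * Avg Disk sec/Transfer), as a percentage."""
    return 100.0 * pages_per_sec * avg_disk_sec_per_transfer


def paging_is_excessive(share_percent: float, threshold: float = 12.0) -> bool:
    """True when the computed share exceeds the chosen threshold (12 to 16)."""
    return share_percent > threshold
```

For example, with the 129 pages/sec shown above and an assumed 0.002 seconds per transfer, `paging_disk_share(129, 0.002)` is about 25.8, which is above the threshold; at an assumed 0.0005 seconds per transfer the result is about 6.45, which is not.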
Domino gives users the ability to create and schedule agents in their mail file as well as other databases. User agents can potentially generate enough load on a server to cause performance problems. The default settings for Agent Manager may not be suitable for your environment. By adjusting the number of Max Concurrent agents and the Max % busy before delay (found on the Server Configuration document), you can control the impact on performance caused by the agents. In addition to throttling the agents, you can also restrict who can run agents on the server. (By default, everyone can run Simple and Formula agents.) You could also potentially alleviate performance problems due to agents by moving certain agents to run at non-peak times.
Function calls within agents can also affect Domino server performance. A user may create a simple agent that will perform a search on their mail file. If the mail file does not have a full-text index the agent may call a full-text search which will cause the server to generate a temporary full-text index. This full-text index will be used once and then destroyed. The following is written to the Console log when an agent results in the creation of a temporary full-text index:
[13970:00002-00001] 12/11/2003 08:09:01 Warning: Agent is performing full text operations on database 'mail/USER.NSF' which is not full text indexed. This is extremely inefficient.
If this agent was scheduled to run once every ten minutes, the extra load on the server generated by the creation of the temporary full-text index could have significant impact on the overall server performance. Administrators can easily avoid this issue by manually creating a full-text index on the mail file. This way, the agent can continue to run, but with a reduced impact on server performance.
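To find which databases would benefit from a permanent full-text index, the console log can be scanned for this warning. The following is a minimal Python sketch that counts warnings per database; it assumes the message text matches the sample line above, and the function name `unindexed_databases` is hypothetical.

```python
import re
from collections import Counter

# Matches the warning text shown in the sample console line.
_FT_WARNING = re.compile(
    r"Agent is performing full text operations on database '([^']+)' "
    r"which is not full text indexed"
)


def unindexed_databases(log_lines):
    """Count temporary full-text index warnings per database path."""
    counts = Counter()
    for line in log_lines:
        m = _FT_WARNING.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

Databases with the highest counts are the ones where a scheduled agent is repeatedly forcing a temporary index, so they are the first candidates for a manually created index.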
Many times a Domino server is configured to serve a particular function (e.g., a mail server). Over time, the function of the server will expand to include additional tasks such as HTTP. Typically, HTTP is considered to be a very light load that is simply serving up pages. However, when users access their mail file via HTTP, the pages are dynamic and the load can increase dramatically. You can balance the load by moving certain tasks to other servers. Sometimes, as the overall Domino server load increases, it may become necessary to add servers to the environment to handle the increased load.
When the overall server performance is poor yet the CPU load is low, this may indicate that a Domino throttling issue is occurring. Another common indication of a throttling issue is when OSWaitEvent calls are found in call stacks that cannot be related to a specific semaphore lock or third-party application. Within Domino you can control the number of worker threads and the number of threads that can process concurrently by setting the notes.ini parameters Server_Pool_Tasks and Server_Max_Concurrent_Trans:
- Server_Pool_Tasks - This setting controls the number of physical threads in the IOCP thread pool (per port) which can run on the server. Note: The total number of threads running will always be higher when checked in an NSD due to overhead threads (i.e., maintenance threads.)
- Server_Max_Concurrent_Trans - This setting controls the number of physical threads which are allowed to execute transactions concurrently. This control was of particular importance prior to the introduction of IOCP thread pools in Domino 5.x. In Domino 5.x and later releases, this parameter can be useful to control the number of concurrent transactions which affects the amount of virtual memory needed at a given time.
The default value for Server_Max_Concurrent_Trans is 20. The default value for Server_Pool_Tasks is the number of Server_Max_Concurrent_Trans multiplied by 2. The Server_Pool_Tasks is used to determine the number of server threads for the Domino server. The number of server threads is the Server_Pool_Tasks multiplied by the number of NRPC ports defined.
On a large scale server with a generous amount of real memory, a large virtual address space and plenty of CPU power, both of these parameters may need to be increased to obtain optimal performance.
Increasing Server_Max_Concurrent_Trans and the thread pool should be done in a controlled process. If the pool is increased too much, the server may become unstable. It is not recommended to set Server_Max_Concurrent_Trans to "-1" or "1000", as these values allow an unlimited number of concurrent transactions, which can lead to instability. A valid range for Server_Max_Concurrent_Trans is between 20 and 100, as long as there is available CPU. As server threads consume system resources (memory, CPU, network), it is important that the total number of server threads be managed. The number of server threads a server can support depends on the server configuration. Typically, 80 - 150 is an acceptable maximum range for server threads.
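The relationships and ranges described above can be checked with a small calculation before changing the notes.ini. The following Python sketch applies the stated defaults (Server_Pool_Tasks is twice Server_Max_Concurrent_Trans, and server threads are Server_Pool_Tasks multiplied by the number of NRPC ports) and flags values outside the ranges quoted above. The function name `server_thread_plan` and the wording of the warnings are hypothetical; this is a planning aid, not something Domino provides.

```python
def server_thread_plan(max_concurrent_trans=20, pool_tasks=None, nrpc_ports=1):
    """Return (pool_tasks, total_server_threads, warnings).

    max_concurrent_trans: intended Server_Max_Concurrent_Trans value
    pool_tasks: intended Server_Pool_Tasks value, or None for the default
    nrpc_ports: number of NRPC ports defined on the server
    """
    if max_concurrent_trans < 1 or nrpc_ports < 1:
        # The text above advises against -1 (unlimited), so it is rejected here.
        raise ValueError("values must be positive")
    if pool_tasks is None:
        pool_tasks = 2 * max_concurrent_trans    # documented default
    total_threads = pool_tasks * nrpc_ports      # server threads across ports

    warnings = []
    if not 20 <= max_concurrent_trans <= 100:
        warnings.append("Server_Max_Concurrent_Trans outside 20-100")
    if total_threads > 150:
        warnings.append("server threads above the 80-150 maximum range")
    return pool_tasks, total_threads, warnings
```

With the defaults and one port, `server_thread_plan()` gives 40 pool threads, which agrees with the "Max. concurrent: 20" and "ThreadPool Threads: 40" values in the show server output earlier in this document.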
Another cause of a slow system can be semaphore hangs. Domino uses semaphores to control access to key functions and data. There are times when semaphores are locked and released. During this processing a thread may have to wait for a semaphore that it needs. This "wait" is not considered a hang and is part of normal thread processing for Domino. When the semaphores are locked for extended periods of time, however, performance issues can occur. The issues may surface as client-to-server connectivity, mail flow delays, and/or agent delays. A prime example of a long-held lock is when a virus-scanning application locks the mail.box to scan a message. During this time the Domino router may be locked out of the mail.box, which will result in longer than expected message delivery times. Unlike a custom agent or user agent, the virus scanner cannot simply be disabled to help with performance. In such cases, using multiple mail.box databases and/or increasing virus-scanner threads could help reduce the delivery times.
To determine which threads are holding locks, you must enable semaphore debug. This will generate a SEMDEBUG.TXT file in the IBM_TECHNICAL_SUPPORT directory of your Domino server. Along with the SEMDEBUG.TXT file you will also need an NSD from the time of the slowdown. The SEMDEBUG.TXT will show the thread ID that is holding the lock, the thread ID that is waiting and which semaphore is being held. The NSD will also allow you to identify a thread/task name with a thread ID.
For more information regarding semaphores, refer to the document titled "Semaphores and Semaphore Timeouts" (#1094630).
Memory is a key area of focus for performance issues primarily because processes rely on it so heavily to maintain their speed. As such, problems with memory can easily have a distinct and significant impact on Domino performance. Typically, memory problems boil down to either: 1) there is not enough memory, or 2) there is adequate memory, but it is not managed well (i.e., poor configuration). Memory management, of one sort or another, takes place at each of these levels:
- Hardware (BIOS, PROM, firmware etc.)
- Operating System (Virtual memory manager)
- Application (Domino Memory Manager)
|Not Enough Memory
||i.e. Domino not seeing the memory
||i.e. Hitting architectural limits
||i.e. Not enough memory
|Not managing memory well
||i.e. Domino is overusing memory
||i.e. Not representing memory to the OS
The hardware level is usually the most basic. The BIOS represents what memory is available to the OS. You should verify that the memory reported is correct, that physical memory is available, and that the BIOS correctly reports the memory on the system.
At the OS level memory serves as a disk cache and a working set. The disk cache (file cache) is a part of RAM that is set aside to temporarily hold data read from disk. A disk cache doesn't have to hold an entire file, as a RAM disk does, but it can hold parts of running application software or parts of a data file. For more information, refer to the following URL: http://www.microsoft.com/technet/archive/wfw/7_agloss.mspx?mfr=true. The working set is executable pages, functions, and instructions that are actually brought into memory during a phase or operation of the executable. See this URL for more information: techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi .
|Quantity of memory
||Will adding more memory help? (architectural limits)
How much memory is the process using? (contention for memory)
|Will adding more memory help?
||Is the process using too much memory? (fragmentation)
Is the memory for the process being made readily available?
|Is memory readily available?
Is file cache being used effectively?
One key feature of the OS memory managers is to maximize performance by balancing the use of physical memory (RAM) and disk operations. An OS will try to maximize the use of RAM by setting aside some disk space (paging space) and combining it with the RAM to create a virtual memory space. This gives the application the perception of more memory, even though the whole space does not perform as RAM would. This distinction matters because problems labeled as "memory problems" may come from how the use of memory is balanced, rather than from the amount of memory itself. For example, an application might run fine if it thinks it has only 500MB of memory, thus using only physical resources. But if it thinks it has 1GB of memory (500MB physical and 500MB in paging space), it might perform significantly worse because it is trying to use more memory under the assumption that all the memory is the same, when in actuality it is not. The salient point here is that it is up to the administrator to understand these differences and what the memory manager is trying to do in order to diagnose a memory issue.
Lastly, let's talk about some issues that come up with Domino and its memory manager. Essentially, Domino relies on its own memory manager to fulfill requests for memory. In doing so, memory is taken from various pools of varying sizes, each for different reasons. It is important to know that each of these pools has a limitation in size and that each process is limited in the number of handles that can be allocated. The largest of these pools is called the UBM buffer, which is sometimes referred to as the NSF buffer pool. Because it serves as a central resource, overall memory usage takes its cue from how large this single pool is.
Our main concerns with memory at this level are: 1) Do we have enough memory to fulfill the needs of the Domino memory manager? and 2) Are there any problems with how the pools are being utilized?
At the most basic level, all that really matters is that the memory is available for use. But within each layer there are issues that can arise in situations where memory management runs into a problem. Typically, this is either because of inefficient usage of memory or a lack of available memory. In most cases, this type of issue can be resolved by just adding more resources. However, there are limitations to each layer. For example, the Windows OS is limited to 2GB of user virtual address space per process, and in total. That means no single Domino process can use more than 2GB of memory (private memory + mapped shared memory) and the sum of user memory can't go beyond 2GB. In addition, Windows has a limit for disk cache, page table entries, page pool, etc. Also, the Windows default configuration is to support 4GB of memory.
This section describes some potential bottlenecks related to memory and their symptoms.
|Memory data collection tools
||i.e., Domino statistics, NSD, memory dumps
||i.e., perfmon (W32), perfpmr (AIX), GUDS (Solaris), NSD
||i.e., memtest86 (x86), diag (AIX), PROM (Solaris)
1) Is there enough memory available (as reported via Task Manager or Perfmon)? Hardware issues mainly manifest as a lack of resources. The key difference between hardware and OS issues is that hardware issues relate to how much memory is available and to the actual physical storage of memory, whereas OS issues deal with how well that memory is made available. One limitation of the Windows memory manager is that its lack of configurability does not always make OS issues obvious. The litmus test for whether it is a hardware issue comes down to whether adding or changing RAM could make a difference.
2) Is there a problem with the memory? Sometimes the problem is the resource itself. Normally this takes on the form of corruption or unexpected behavior without any pattern. This typically results in a crash; however, adverse effects on performance can be caused by bad memory chips. For example, a server might run slow one minute, and then reboot the next minute. Running Memtest86 (http://www.memtest86.com) is a good way to get a handle on whether there is a problem with the memory chips by exhaustively testing every address under a number of scenarios.
When dealing with OS memory issues you can use the following table to determine the types of questions to ask.
|Quantity of memory
||What is the individual process virtual memory size?
What is the total memory allocation among all processes?
Can the OS handle more memory?
|Is enough memory available for cache?
||Are there signs of fragmentation?
Is paging occurring?
|Is the memory management configuration appropriate?
These statistics can be taken via perfmon on a Windows platform.
- Memory counter - Available memory.
- Available Bytes - A Memory Available Bytes value below 50 MB is a sign that memory is running short. Alternately, you can use the Platform.Memory.RAM.AvailMBytes statistic provided through Domino.
- Memory Object - Pages/sec counter. Pages/sec is the number of pages read from the disk or written to the disk to resolve memory references to pages that were not in memory at the time. This can occur when a process must retrieve data from disk. In and of itself, this statistic does not necessarily indicate a lack of memory as high paging periods can occur when accessing data from disk. But high pages/sec does indicate how hard the system is working to fulfill memory requests. This statistic should be used in conjunction with the Available Bytes statistic to determine if a memory condition is occurring.
- Memory Object - Pool Nonpaged Bytes counter. The nonpaged pool is an area of system memory that cannot be written to disk (in other words, paged out). Because of the limitations with this pool, it is necessary for the OS to limit its size. If this counter exceeds 115 MB, or if the sum of this and the paged pool exceeds 256 MB, you may have a problem. Things to look for are IO buffers in use, device drivers, and TCP sessions.
- Memory Object - Pool Paged Bytes counter. As opposed to the nonpaged pool, this pool represents the area of memory that can be paged out to disk. You may have a problem if this value exceeds 156MB, or 256MB when combined with the non-paged pool. Things to look for are memory mapped files, shared DLLs, registry size, and Windows user sessions.
- Process Object - Virtual Bytes counter / Task Manager - VM Size / Task Manager - Available memory. Using perfmon or the Task Manager you can get a quick look at how much virtual address space is being used per process. For Windows, the limit for virtual address space per process is 2GB. This won't necessarily be the bottleneck in itself, however, since the Windows OS by default can only use 2GB of user space with a 4GB maximum physical memory. Since a Domino system will be running multiple processes, it is unlikely that a process will run into the virtual space limitation before running into the system limitation. But from this statistic, you can get an idea of how much memory is being used by each process.
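The perfmon thresholds in the list above can be combined into a simple check, sketched below. The function name and the way the counter values are passed in are assumptions made for illustration; collecting the counters themselves is left to perfmon or another tool.

```python
MB = 1024 * 1024


def check_windows_memory(available_bytes, pool_nonpaged_bytes, pool_paged_bytes):
    """Compare sampled perfmon Memory counters with the thresholds quoted in the text."""
    findings = []
    if available_bytes < 50 * MB:
        findings.append("Available Bytes is below 50 MB")
    if pool_nonpaged_bytes > 115 * MB:
        findings.append("Pool Nonpaged Bytes exceeds 115 MB")
    if pool_paged_bytes > 156 * MB:
        findings.append("Pool Paged Bytes exceeds 156 MB")
    if pool_nonpaged_bytes + pool_paged_bytes > 256 * MB:
        findings.append("Paged plus nonpaged pool exceeds 256 MB")
    return findings


if __name__ == "__main__":
    # Hypothetical sample values, in bytes.
    for finding in check_windows_memory(40 * MB, 120 * MB, 150 * MB):
        print(finding)
```

As the text notes for Pages/sec, a single reading is not conclusive, so a check like this is best applied to samples collected over time.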
The screenshot below shows an example of the Virtual Memory Size (VM Size) per process. To display the VM Size column in Task Manager, select View -> Select Columns -> Virtual Memory Size.
Domino statistics (show stats)
- Database.BufferPool.PerCentReadsInBuffer. The buffer pool represents the largest single area of memory in use by a Domino system. Domino uses this as cache memory to act as a buffer for indexing functionality. If this value falls below 85% and the Database.BufferPool.Peak stat is equal to the Database.BufferPool.Maximum stat, steps should be taken to increase the buffer pool.
The size of the buffer pool can be increased or decreased indirectly by changing the amount of memory Domino sees (ConstrainedSHMSizeMB/PercentAvailSysResources/MEM_AddressableMemSizeMB) or directly through NSF_BUFFER_POOL_SIZE. Caution must be taken when using these settings. Setting the amount of memory which Domino can utilize too high can adversely affect system stability; whereas, setting it too low can adversely affect system performance. Refer to the document titled "How Much RAM will Domino Use?" (#1093891). For AIX, refer to the article titled "Lotus Domino on AIX memory usage explained" at this URL: http://www-128.ibm.com/developerworks/lotus/library/domino-aix-memory/
There are times when the reads in buffers can be too high (greater than 95%). If the pool is too large, you may actually be wasting memory. We have seen situations where running with 850MB in the NSF Buffer Pool with reads in buffer 96% was a waste of memory that could be used for other things. After lowering the buffer pool to 400MB, the reads in buffers was still running at 92% with good performance. The rest is given back to the system to allocate for other uses.
- Database.DbCache.OvercrowdingRejections - A counter for the number of rejections due to the overcrowding of the database cache. If this statistic is significant (i.e., greater than 50/hour), you should monitor it to see when it is increasing and whether the increase is related to abnormal usage. If it is not, and Database.DbCache.CurrentEntries is approximately equal to or greater than Database.DbCache.MaxEntries, you will need to increase the size of the dbcache. This can be done indirectly by increasing the buffer pool size or directly through the NSF_DbCache_Maxentries parameter.
- Mem.Allocated - Total amount of memory allocated by the server.
- Mem.Allocated.Process - Total amount of non-shared memory allocated by individual processes.
- Mem.Allocated.Shared - Total amount of server memory allocated as shared memory.
- Mem.Availability - Availability of server memory.
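The buffer pool and database cache guidance above can be expressed as two small checks. This is a sketch under stated assumptions: the function names are hypothetical, the statistic values are assumed to have been read from "show stats" already, and the 0.95 tolerance used for "approximately equal" is an arbitrary choice, not a documented value.

```python
def review_buffer_pool(pct_reads_in_buffer, peak, maximum):
    """Apply the 85% / 95% guidance for Database.BufferPool.PerCentReadsInBuffer."""
    if pct_reads_in_buffer < 85 and peak >= maximum:
        return "consider increasing the buffer pool"
    if pct_reads_in_buffer > 95:
        return "buffer pool may be larger than needed"
    return "ok"


def review_db_cache(rejections_per_hour, current_entries, max_entries, tolerance=0.95):
    """Apply the OvercrowdingRejections guidance; tolerance is an assumed value."""
    if rejections_per_hour <= 50:
        return "ok"
    if current_entries >= tolerance * max_entries:
        return "consider increasing the dbcache"
    return "monitor for abnormal usage"


if __name__ == "__main__":
    print(review_buffer_pool(80.0, 400, 400))
    print(review_db_cache(75, 990, 1000))
```

The second function does not decide whether usage is abnormal; that judgment still has to come from monitoring when the counter rises.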
Notes Memory Analyzer (memcheck) -> Shared Memory Stats
TYPE : Count SIZE ALLOC FREE FRAG OVERHEAD %used %free
Static-DPOOL: 111 801112064 662701728 137917400 0 941546 82% 17%
Overall : 111 801112064 662701728 137917400 0 941546 82% 17%
Notes Memory Analyzer (memcheck) -> Process Heap Memory Stats
TYPE : Count SIZE ALLOC FREE FRAG OVERHEAD %used %free
Static-DPOOL: 32 16777216 10731632 6016300 0 49156 63% 35%
VPOOL : 2 129916 39812 82432 0 7688 30% 63%
POOL : 209 3572332 2414200 1087900 0 77656 67% 30%
Overall : 32 16777216 9561300 7186632 0 134500 56% 42%
These sections show the allocation of memory from the OS to Domino. Look at these numbers to see how much memory is allocated from the OS to a particular process heap or shared memory. The "Overall" values give an idea of how much of that memory is actually in use, and how well what has been allocated from the OS is being used. Looking at the example above, we can see in the Shared Memory example that Domino has ~3MB (3097152/1024/1024) of memory. Of that amount it is actually using ~550KB. This NSD was taken just after the server started so there isn't much activity and the high level of free memory is acceptable. But if we noticed a large amount of memory allocated (let's say over 1.5GB) with the same level of free space, we would think there is a problem, since Domino would not be using memory well.
It should be clarified that the Process Heap section exists for each Domino process. By using the values found in these sections we can gain an account from a Domino perspective where memory is allocated on the process and shared levels. We can then use those numbers to look at total values, specific usage, and correlation with the overall system to understand where resources are being used.
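The byte-to-megabyte conversion and the used/free percentages can be reproduced from any memcheck statistics row. The sketch below parses the "Overall" row of the Shared Memory Stats sample shown earlier; the parser is a hypothetical helper that assumes the column order given in the TYPE header line, not an NSD tool.

```python
MB = 1024 * 1024


def parse_memcheck_row(line):
    """Parse a 'TYPE : Count SIZE ALLOC FREE FRAG OVERHEAD %used %free' row."""
    name, _, rest = line.partition(":")
    count, size, alloc, free, frag, overhead = (int(x) for x in rest.split()[:6])
    return {
        "type": name.strip(),
        "count": count,
        "size": size,
        "alloc": alloc,
        "free": free,
        "frag": frag,
        "overhead": overhead,
    }


def summarize(row):
    """Return size and allocation in MB plus whole-number used/free percentages."""
    return {
        "size_mb": row["size"] / MB,
        "alloc_mb": row["alloc"] / MB,
        "pct_used": 100 * row["alloc"] // row["size"],
        "pct_free": 100 * row["free"] // row["size"],
    }


if __name__ == "__main__":
    sample = "Overall : 111 801112064 662701728 137917400 0 941546 82% 17%"
    print(summarize(parse_memcheck_row(sample)))
```

For the sample row this gives a size of 764 MB with 82% used and 17% free, which agrees with the percentages printed in the memcheck output.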
For shared memory usage:
Notes Memory Analyzer (memcheck) -> Memory Usage Summary -> Top 10 Shared Memory Block Usage (Time xx:xx:xx)
For process memory:
Notes Memory Analyzer (memcheck) -> Memory Usage Summary -> Top 10 [ PROCESS_NAME: PID] Memory Block Usage (Time xx:xx:xx)
Type TotalSize Handles Typename
0x0f04 420240 51 BLK_FT_SEARCHTABLE
0x29c5 369334 1 ???
0x29c4 196608 1 BLK_SRV_PERFNAV_PERFDATA
0x0130 82998 228 BLK_TLA
0x0149 76076 14 BLK_PHTCHUNK
0x29c3 65406 1 BLK_SRV_PLPOOL
0x29c2 65406 1 BLK_SRV_PS_COUNTERS
0x29c1 49152 1 BLK_SRV_PS_NAMETABLE
0x0380 29358 63 BLK_RM_PTHREAD_TRANENTRY
0x03fe 24576 1 BLK_COMPILER_STRING_STORE_MEM
|BY HANDLE COUNT
Type Handles TotalSize Typename
0x0130 228 82998 BLK_TLA
0x0f02 160 17600 BLK_FT_STATIC
0x0275 70 8126 BLK_NSFT
0x0380 63 29358 BLK_RM_PTHREAD_TRANENTRY
0x0f04 51 420240 BLK_FT_SEARCHTABLE
0x030a 47 15794 BLK_LOOKUP_THREAD
0x0149 14 76076 BLK_PHTCHUNK
0x0910 11 5066 BLK_SRV_NAMES_LIST
0x0f5d 2 220 BLK_ISERV_SHARED_DATA
0x0128 1 10480 BLK_PCB
These sections contain the memory usage for shared memory and private process memory listed in top 10 format for the number of handles used and total size. It is difficult to say whether a specific handle is overused since different handles have different utilizations and limits. But when combined with the memory dump, we can make some interesting observations:
For Shared memory and each process, you will have a section that looks like the following:
|*** Dump of Shared Handle Table
HDL 1 0150 S locked, refcnt 6, size 65412 D
HDL 2 0150 S locked, refcnt 6, size 65412 D
HDL 3 0150 S locked, refcnt 6, size 65412 D
HDL 4 0150 S locked, refcnt 6, size 65412 D
HDL 5 0141 S locked, refcnt 6, size 65412 D
MEM 1 0127 S locked, refcnt 6, size 12536 D
MEM 2 0416 S locked, refcnt 6, size 1080 D
MEM 3 0465 S locked, refcnt 6, size 10 D
MEM 4 0462 S refcnt 0, size 1484 D
MEM 5 0438 S locked, refcnt 1, size 538 D
|*** Dump of Handle Table for ProcessID 00000A84 (nserver)
HDL 1 0166 locked, refcnt 1, size 336 D
HDL 3 0146 G locked, refcnt 1, size 9222 D
HDL 4 0F5D locked, refcnt 1, size 116 D
HDL 5 0F5D locked, refcnt 1, size 116 D
HDL 6 0938 refcnt 0, size 34 D
MEM 1 0130 locked, refcnt 1, size 370 D
MEM 2 0128 locked, refcnt 1, size 10486 D
MEM 3 0171 locked, refcnt 1, size 316 D
MEM 4 044A locked, refcnt 1, size 18 D
MEM 5 0129 G locked, refcnt 1, size 524294 M
To make conclusions about the memory dump we have to process the data. The memory dump is primarily divided into sections that display data for shared memory and each process that is running. Under each of those sections are dumps of the handles in use as seen above. One thing we can do is take the sum of the MEMs (memhandle) and HDLs (handle) for each process separately and shared memory. We can also take the sum of the values in the 9th column to get an idea of what is being allocated. If after doing so we see that the values exceed the following, the server might be experiencing an overall problem with handles.
1) Total process HDLs exceed 9,000 or MEMs exceed 125,000
2) Total shared HDLs exceed 240,000 or MEMs exceed 450,000
3) Total shared and process memory exceeds 1.5GB
For the above example, we have 5 HDLs for shared memory and 5 MEMs with a total 334K memory in use. Obviously memory usage here is quite low. Since our nserver data is only a sample fragment, those values would also be low. But if we suspected a leak of some kind, we could refer back to the NSD top 10 sections to see which handle might be the culprit.
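The counting described above (number of HDL entries, number of MEM entries, and the sum of the size values) can be sketched as follows. The regular expression is an assumption based only on the sample lines shown in this document; a real memory dump may contain line formats that it does not cover.

```python
import re
from collections import Counter

# Matches lines such as "HDL 1 0150 S locked, refcnt 6, size 65412 D".
ENTRY = re.compile(r"^(HDL|MEM)\s+\d+\s+[0-9A-Fa-f]{4}\b.*\bsize\s+(\d+)")


def summarize_handle_table(lines):
    """Return (HDL count, MEM count, total size in bytes) for one dump section."""
    counts = Counter()
    total_size = 0
    for line in lines:
        match = ENTRY.match(line.strip())
        if match:
            counts[match.group(1)] += 1
            total_size += int(match.group(2))
    return counts["HDL"], counts["MEM"], total_size


if __name__ == "__main__":
    shared = [
        "HDL 1 0150 S locked, refcnt 6, size 65412 D",
        "HDL 2 0150 S locked, refcnt 6, size 65412 D",
        "HDL 3 0150 S locked, refcnt 6, size 65412 D",
        "HDL 4 0150 S locked, refcnt 6, size 65412 D",
        "HDL 5 0141 S locked, refcnt 6, size 65412 D",
        "MEM 1 0127 S locked, refcnt 6, size 12536 D",
        "MEM 2 0416 S locked, refcnt 6, size 1080 D",
        "MEM 3 0465 S locked, refcnt 6, size 10 D",
        "MEM 4 0462 S refcnt 0, size 1484 D",
        "MEM 5 0438 S locked, refcnt 1, size 538 D",
    ]
    hdls, mems, total = summarize_handle_table(shared)
    print(hdls, mems, total, total // 1024)  # 5 5 342708 334
```

The size is matched by its "size" label rather than by column position, because entries that are not locked have fewer columns. The totals can then be compared with the thresholds listed above.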
Diagnosis and Corrective actions:
Hardware issues are relatively simple. If your memtest results come back with an error, the most likely candidate is that there is a problem with the memory chips.
Below are the questions you would want to ask for each type of problem and the corrective actions to take:
|Quantity of memory
What is the individual process virtual memory size?
Document the memory usage for each process through Task Manager, OS stats (i.e., perfmon) or NSD. Determine whether the total of mapped shared memory and private memory seems high or close to the architectural limit for your OS.
What is the total memory allocation among all processes?
Document the total usage through OS stats or NSD. Is the total amount of private allocations and shared memory high or close to the architectural limits?
Can the OS handle more memory?
Each OS has a different limit. For example, Windows has a default limitation of 2GB of user address space and process address space.
Is enough memory available for cache?
The total memory the OS can use is different from what is allocated to user space. For example, on Windows the user space is 2GB max, but the OS can use up to 4GB. This matters because additional memory for file cache can be useful on systems with less than 4GB.
|Are there signs of fragmentation?
Review the NSD. If so, the question is what is allocating the fragmented memory. Fragmentation is usually identifiable when the amount of memory Domino has in use is much smaller than the amount allocated by the OS.
- Is paging occurring? Review OS stats for out of character paging activity
Is the memory management configuration appropriate?
If there is enough memory and paging is still occurring, a review of how cache is being managed may be needed.
How Much RAM will Domino Use?
File Cache Performance and Tuning on Windows
Domino memory issues usually come down to making sure all the pools being managed have enough resources available to them. Some resources, like handles, have a fixed limit and if you are experiencing high handle usage, it's because the application is not using the resources effectively or it just needs too much.
Other resources (such as available memory) are taken from the OS but seen through the configuration of the Domino memory manager. The most resource-intensive pool, the UBM buffer, acts as a caching mechanism for Domino, so it can utilize a great deal of memory. In this respect, if we see the percentage of reads in buffer decrease, it may be necessary to ensure that Domino has access to more memory.
More on Memory Management
Domino Tuning Parameters in notes.ini
Let's understand some things about performance, disk configuration, and Domino. Domino is a database-type application, and as such Domino performance relies on the underlying disk subsystem performance. If the underlying disks are poorly configured and slow, or if the usage has outgrown the capability of the disk configuration, Domino access may suffer.
Common Disk Terms and Configurations:
- Spindle - Older term referring to an individual hard disk similar to the hard drive in your desktop computer. Hard disks are rated by spindle rotational speed and access time capabilities, and faster hard drives usually cost more.
- RAID (Redundant Array of Independent Disks) - Group of disks/spindles, configured and accessed in a specified way, defined by the RAID level. Different RAID levels have different performance and redundancy characteristics. Select the appropriate RAID level based on the type of disk access needed. We have several recommendations below. For details on the RAID levels, see: http://en.wikipedia.org/wiki/RAID
- Direct Attached Storage (DAS) - A directly connected disk subsystem with multiple disks configured in a high-performance RAID configuration.
- Network Attached Storage (NAS) - A disk subsystem connected and accessible via the local area network (LAN). For best performance these subsystems should be on a dedicated LAN, not the same LAN that normal user network traffic uses.
- Storage Area Network (SAN) - Usually a fiber optic connected disk subsystem that has a large memory buffer. A SAN can be connected to multiple servers, and is made up of many disks/spindles. A SAN has intelligent hardware/logic built in to buffer reads and writes to its memory cache, making disk access very fast, since memory access is much faster than hard disk access. For information on Domino and SAN support, see: Technote #7002613.
For Domino, we normally recommend utilizing either 1) a DAS device in a high-performing RAID configuration made up of small fast hard drives, or 2) a high-performing SAN system. It is very important to collect and analyze the disk related statistics when you believe you are having disk related performance issues.
Disk issues can surface in many flavors, some of which can be difficult to identify. The problems could be caused at the Domino application level, at the OS level, or even at the hardware level.
|server crash/server Panic
disk thrashing, system slow
|not enough physical disks,
not enough bandwidth
|Domino is overusing disk
Domino is sluggish
|system is sluggish
high disk access times
|system is sluggish
inappropriate RAID levels
SAN fabric configuration
Some common issues related to disks are:
- Lack of disk space
- Disks are busy continuously (or at peak times) and run at 100%
- Disk access is slow (i.e., long read, write, and transfer times)
- Problems with certain databases surface, such as long-held locks, semaphore time-outs, etc.
Typical symptoms of disk issues include periodic slowdowns, apparent hangs, and a slow and/or sluggish system. The system may become unusable, as in the event of a hang or crash, or it may just run sluggishly. For performance issues, the system will usually recover or continue functioning on its own, making it difficult to capture data during the issue. Many times, this type of issue gets worse over time as the system use increases, more users are migrated to the system, and the disk usage increases.
When it is identified that the disk subsystems are causing a performance problem, there are usually two ways to resolve the issues. Many times, adding more disks to spread the data across more drives/spindles will help. But it could also be the layout of the application, and/or how the disk load is spread.
There are three main areas of data collection to troubleshoot disk IO bottlenecks.
1. Application level (layout of Domino and components onto disks)
2. Operating System level (disk statistics)
3. Hardware level (physical layout/topology of the disk subsystem)
Note: If you have a SAN, additional diagnostics may be available at the SAN/controller level.
|Disk data collection tools
||Domino statistics, NSD, disk layout topology, Domino Administrator client
||perfmon (Windows), perfpmr (AIX), iostat, nmon, NSD
||proprietary disk tools, disk topology
When looking at a potential disk issue, it is very important to know how the system and disks are laid out physically as well as logically, from both a hardware standpoint and a Domino (application) standpoint.
Physically, there are many ways the system's disks can be configured, which can affect performance; but to Domino, disk is disk. Domino doesn't care if the disk is local to the CPU, located in a disk enclosure, or located in a SAN connected to the system via fiber channels. Domino just expects that the disk is both accessible and quick. Any factors that slow down the disk access will slow down the application. You should collect this disk layout topology (physical, OS and application level) information for review and comparison to the recommended best practices for Domino.
There are several tools which can be used to monitor disk statistics. You should collect data over a full day's time period to trend the performance statistics over time, and to see if there are correlations with times of bad performance and the disk statistics.
The NSD also contains performance data, but mostly you will need to monitor disk performance using OS level tools (such as Perfmon on Windows, iostat on Unix, perfpmr on AIX, etc.). If a SAN/NAS is involved, the SAN vendor usually has tools for monitoring performance. In the case of a local disk, OS level monitoring software should be good. On Windows, use perfmon to monitor disk statistics during times of performance problems.
One can use Domino statistics or an OS utility such as perfmon on Windows, perfpmr on AIX, to collect disk statistics. nmon on AIX and Linux is also a good tool for monitoring the performance of a system.
Here are a few important statistics to collect and monitor over time:
- Average Disk Queue Length is the average number of both read and write requests that were queued for the selected disk during the sample interval.
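- Average Disk seconds/Read is the average time, in seconds, of a read of data from the disk. Healthy values are a few milliseconds (thousandths of a second).
- Average Disk seconds/Write is the average time, in seconds, of a write of data to the disk. Healthy values are a few milliseconds (thousandths of a second).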
- Disk Reads/second is the rate of read operations on the disk.
- Disk Writes/second is the rate of write operations on the disk.
- Percent Disk Time and Percent Idle Time. Depending on your system, it may be better to check the % idle time, which is the percentage of time that the disks are idle. Percent disk time can actually exceed 100%, because the counter includes time for overlapping IO requests.
The Domino Administrator client can be used to see database sizes and Domino statistics. If platform statistics are enabled, that information will also be available via the Domino statistics.
Diagnosis and Corrective actions:
1) Don't have enough disk space? Performance usually degrades when a system runs out of disk space. In Domino, this would usually lead to a crash/panic situation. First, check NSDs for a local disk section, and look for the amount of free space on each filesystem/directory. Make sure you check each of the filesystems, locations of Domino data, transaction logs, system temporary files, etc.
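Free space can also be checked directly on the server with a short script. The sketch below uses Python's standard shutil.disk_usage call; the path list is hypothetical and should be replaced with the actual Domino data, transaction log, and temporary directories, and the 20% minimum is an assumed default rather than a Domino requirement.

```python
import shutil


def free_fraction(total_bytes, free_bytes):
    """Fraction of the filesystem that is free (0.0 if the size is unknown)."""
    return free_bytes / total_bytes if total_bytes else 0.0


def low_space_paths(paths, min_free=0.20):
    """Return (path, free fraction) for each path below the assumed minimum."""
    low = []
    for path in paths:
        usage = shutil.disk_usage(path)
        fraction = free_fraction(usage.total, usage.free)
        if fraction < min_free:
            low.append((path, fraction))
    return low


if __name__ == "__main__":
    # Hypothetical locations; substitute the directories used by your server.
    for path, fraction in low_space_paths(["."]):
        print(f"{path}: only {fraction:.0%} free")
```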
A low amount of available space can lead to disk fragmentation issues. A badly fragmented disk can dramatically lower overall system performance. Check for fragmentation via the OS level tools. We have seen disk fragmentation slow down disk-intensive operations by over 50%. Domino does a good job maintaining itself due to internal housekeeping processes. Compact and Updall should be used to maintain databases. When databases have unused space in them, this helps reduce disk fragmentation. If the database is not physically fragmented, having the free space provides a place in the database for new updates and data, instead of having to find free space somewhere else on disk. Compact helps to reduce fragmentation within a database.
It is also important to maintain some available disk space (e.g., 20 - 30%). Having adequate available disk space will allow utilities, such as Compact and other defragmenting applications, to be more effective. For information on Microsoft's recommendation on free disk space, refer to this URL:
2) Disks always busy, not enough throughput, or trying to do too much work on the disks? Or, disk access is slow (i.e., read/write/transfer times long.)
First, review the configuration of Domino and ensure it follows the recommended best practices.
The location of certain Domino components (transaction logs, view rebuild directory, temporary files, swap files, system page files) can affect overall performance. Regarding configuration, the OS, Domino binaries, Domino data, and transaction logs should all be located on different physical disks/controllers. This allows them to operate independently of each other, and not compete for the same disks simultaneously. Best practices suggest that items such as backups, recovery, handheld-device polling, etc. occur on secondary servers (i.e., a non-customer-facing clustered server). These programs can be very disk intensive in addition to the normal Domino processing. Also check the disk configuration, physical configuration, RAID levels, SAN, number of disks, and the speed of disks. This includes the number of disks that data is spread over, dedicated IO channels, analysis/configuration of memory buffers dedicated to the disk, etc. This usually means getting someone involved who is hardware savvy.
Disk drives have physical limitations on how many IOs a disk is capable of handling. Depending on the workload this limit can be reached. When the number of IO requests exceeds the disk's IO capacity, the IO requests will take longer and will be queued for its turn on the disk. Because of this, spreading out data files across multiple physical drives allows for parallel IO. Adding more physical disks and spreading the data files among them can resolve a lot of disk subsystem bottlenecks.
You should collect disk statistics for at least a full day's time period. Analyze the data for trends, looking for correlations between worsened disk statistics and the times when performance was bad. A high percent disk busy, a low percent disk idle, and long read/write/transfer times could indicate a disk configuration problem, or a problem at the application level. On Domino, semaphore and console data should be analyzed to look for correlations with semaphore time-outs. Also review the console log for long-held locks. These things may assist in determining whether an application-level bottleneck is occurring versus bad disk performance.
As far as the disk statistics mentioned earlier, collect data over time on the following variables:
- Average disk queue length
- Average seconds/read
- Average seconds/transfer
- Average seconds/write
- Percent disk busy time
- Percent disk idle time
The average disk queue length tracks the number of requests that are queued and waiting for a disk. If more than two requests are continuously waiting on a single-disk system, the disk might be a bottleneck. If the disks are always busy, or taking long times to read and write, this also indicates a disk issue.
One threshold calculation for disk bottlenecks uses the average disk queue length counter.
Rule of thumb: disk bottleneck threshold is (Disk Queue Length) / (number of spindles) > 1. Where the number of spindles equals the number of disks in the RAID set or, if it's a single disk, equals 1.
To generalize, the average seconds/read and average seconds/write should be a low number of milliseconds (less than 15ms) preferably in the single digit range. Anything above that is reason to investigate further.
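The two rules of thumb above (queue length per spindle greater than 1, and read/write times above roughly 15 ms) can be applied to collected samples with a short script. The following is a minimal sketch, not a supported tool: the `DiskSample` fields and the `disk_warnings` function are hypothetical names chosen for illustration, and the sample values would come from whatever collector you use (perfmon, perfpmr, nmon).

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    """One averaged measurement for a disk or RAID set (hypothetical layout)."""
    avg_queue_length: float   # average disk queue length
    avg_sec_per_read: float   # average seconds/read
    avg_sec_per_write: float  # average seconds/write


def disk_warnings(sample, spindles=1, latency_limit_ms=15.0):
    """Return rule-of-thumb warnings for one sample.

    spindles is the number of disks in the RAID set (1 for a single disk).
    latency_limit_ms defaults to the 15 ms guideline from the text.
    """
    warnings = []
    # Rule of thumb: (disk queue length) / (number of spindles) > 1
    if sample.avg_queue_length / spindles > 1:
        warnings.append("queue length per spindle exceeds 1")
    # The counters are in seconds; convert to milliseconds before comparing.
    if sample.avg_sec_per_read * 1000 > latency_limit_ms:
        warnings.append("read latency above limit")
    if sample.avg_sec_per_write * 1000 > latency_limit_ms:
        warnings.append("write latency above limit")
    return warnings


if __name__ == "__main__":
    # Illustrative values only, not measurements from a real server.
    peak = DiskSample(avg_queue_length=6.0, avg_sec_per_read=0.008,
                      avg_sec_per_write=0.004)
    print(disk_warnings(peak, spindles=4))
```

A warning from a single sample is only a hint; the counters should be compared over time and against each other before concluding that the disk is the bottleneck.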
Windows perfmon is a great tool to collect and analyze this data. On AIX, perfpmr can be used to collect this data. Also, nmon is available on Linux and AIX.
AIX example: filemon.sum from perfpmr taken during a period of good performance; read/write times are single digit msecs.
VOLUME: /dev/emc_vg08_lv01 description: /opt/lotus/notesdata
reads: 3381 (0 errs)
read sizes (blks): avg 16.9 min 8 max 232 sdev 15.4
read times (msec): avg 8.261 min 0.233 max 156.753 sdev 9.754
read sequences: 1834
read seq. lengths: avg 31.1 min 8 max 640 sdev 40.5
writes: 897 (0 errs)
write sizes (blks): avg 68.4 min 8 max 176 sdev 54.9
write times (msec): avg 4.156 min 0.343 max 20.851 sdev 3.555
write sequences: 597
write seq. lengths: avg 102.8 min 8 max 4480 sdev 356.4
seeks: 2431 (56.8%)
seek dist (blks): init 17615672,
avg 2808275.4 min 8 max 35833336 sdev 4481023.2
time to next req(msec): avg 13.989 min 0.000 max 1239.942 sdev 49.447
throughput: 985.9 KB/sec
You should not rely on one counter to determine a bottleneck. Look for multiple counters to confirm your analysis. Also monitor the counters over a period of time and compare how the system performs at peak and non-peak times. Your disk subsystem should be able to handle the peak workload. Many times, the slow performance occurs only during peak times.
SAN, NAS involvement: As far as SANs go, there are several recommendations. It is critical to ensure that there is enough dedicated bandwidth between the Domino server and the SAN device. This is important because Domino maintains high levels of sustained IO; any slowdown or throttling here will cause the entire server to slow down.
A very busy Domino server could have IO rates peaking as high as 60 to 100 megabytes/sec (MB/s). We recommend dedicated gigabit fiber/Ethernet connections to SAN/NAS. A 1 gigabit pipe theoretically can support 125 MB/s throughput (1000 mbits / 8 bits/byte), and realistically a somewhat lower sustained IO rate. A 100 megabit Ethernet pipe theoretically can support 12.5 MB/s (100 mbits / 8 bits/byte), which is not enough bandwidth for a busy server. If switches are involved in your SAN disk fabric, we also do not recommend fan-in. Fan-in is where you have more defined bandwidth coming into a switch from your servers, which is then funneled down to a lesser amount of bandwidth going to the actual SAN hardware. We recommend a 1:1 bandwidth configuration through the switches. For more information regarding SANs and Domino, refer to the document titled "Best Practices for implementing Domino in a SAN environment" (# 7002613).
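The bandwidth arithmetic above can be checked with a few lines of code. This is a minimal sketch using decimal units (1 gigabit = 1000 megabits); the function names are hypothetical and the figures are theoretical maximums, not sustained rates.

```python
def link_mb_per_sec(megabits_per_sec):
    """Theoretical link throughput in megabytes/sec (8 bits per byte)."""
    return megabits_per_sec / 8


def has_headroom(peak_io_mb_per_sec, link_megabits_per_sec):
    """True if the theoretical link rate exceeds the server's peak IO rate."""
    return link_mb_per_sec(link_megabits_per_sec) > peak_io_mb_per_sec


def fan_in_ratio(server_side_megabits, san_side_megabits):
    """Bandwidth entering a switch from servers divided by the bandwidth
    leaving it toward the SAN hardware. A value above 1 means fan-in."""
    return server_side_megabits / san_side_megabits


if __name__ == "__main__":
    print(link_mb_per_sec(100))     # 100 megabit Ethernet
    print(link_mb_per_sec(1000))    # 1 gigabit link
    print(has_headroom(100, 1000))  # busy server peaking at 100 MB/s
    print(fan_in_ratio(4000, 2000)) # four 1 gigabit server links, two to the SAN
```

Because real links sustain somewhat less than the theoretical rate, a result of True from `has_headroom` with only a small margin should still be treated with caution.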
Appropriate RAID (redundant array of independent disks) levels:
Using RAID has two main advantages: better performance and higher availability. RAID provides a way of storing the same data redundantly on multiple hard disks. By placing data on multiple disks, IO operations can overlap in a balanced way, thus improving performance. RAID at the OS/software level is not recommended with Domino.
Verify the appropriate RAID level to use for the various workloads. For write-heavy workloads, such as transaction logging, we recommend using RAID 1 Enhanced levels or RAID 10, not RAID 5. If transaction logging appears to be a contention point, we usually recommend turning off transaction logging on certain databases, such as log.nsf and mail.box.
For Domino data, we recommend using RAID 5 or, better yet, RAID 10 (1+0). Combining RAID 1 and RAID 0 is often referred to as RAID 10, which offers higher performance than RAID 1 but at much higher cost. Splitting up the disk IO gives the best performance. For example, put the OS on one mirror set, the Domino program files on another, and the OS swap/page file on a separate mirror set. Put transaction logs on a separate mirror set and separate controller, and put Domino data on a RAID 5 set. If you still see IO contention on your Domino data directory after doing so, then you can start looking at RAID 10. Once you have the proper configuration of disks, adding additional disks to the RAID set will increase the overall performance.
4) Certain databases appear to be causing problems, long-held locks, semaphore time-outs.
If you are getting long-held locks or semaphore time-outs that always point to certain databases, check these databases for large sizes or other things that they may have in common. Are certain views taking a long time to process? Is this database a bottleneck that everyone needs to use? Are the identified databases very large? Are certain processes tying the database up? You may be able to reschedule the processes to run at other times, when locking that resource will not cause a problem.
If you are getting long-held locks or time-outs on databases, check what is processing these databases. An NSD taken at the time of one of the long-held locks can help show what the thread operating on the database is doing. Transaction logging or some other task, such as a backup task, may be operating on the database and causing it to be locked for a long period of time. Remember, the larger the database, the longer most tasks will take.
Best practices for large Notes mail files
So, in summary, to resolve disk issues:
- Make sure you have enough disk and memory.
- Make sure disk is well maintained, defragmented.
- Make sure you have enough throughput capacity.
- Redistribute workloads, spread out the load across different filesystems.
- Reschedule high utilization tasks if possible.
- Add throughput capacity: buses, disks, appropriate RAID levels.
- Plan ahead. This is one bottleneck you want to avoid, because it is usually not trivial to make changes.
Even the best-configured system can encounter difficulties on a problematic network. More to the point, even a poor-quality network on the client's side can have a negative impact on server performance. Unfortunately, "network problems" can be among the most difficult to diagnose; they can appear intermittently, apply to only a subset of users, vary in their intensity and, in some cases, can find their root cause in other applications/servers. Because of this complexity, network conditions are often the last consideration in troubleshooting and debugging. In this section, we'll look at when to consider "network problems" in your analysis, and we'll introduce a few tools that can provide useful information on network conditions.
Before going any deeper...check your names!
Naming schemes/problems are often found at the root of "network problems." Before moving to the more detailed troubleshooting steps below, take the time to audit your naming and addressing schemes. This means that you should check:
- Server configuration documents (Server documents, Connection documents, etc.)
- Host files (if present - be sure to check for them, because a host file will override any name servers!)
- DNS accuracy (does the name resolve to the expected address?)
- DNS availability (can the system reach its DNS server reliably?)
- Log files (to see what name(s) were used to reach the system(s) in question)
Ensure that you use consistent names (preferably the Fully Qualified Domain Name (FQDN)) for all servers in your deployment, and that those names are consistently resolved across your enterprise. This is of particular importance if you deploy servers with multiple network interfaces; in such cases, it is essential that clients resolve the name(s) to their appropriate IP addresses.
NOTE: The use of "short names," such as "SERVER01", in configuration documents is NOT recommended. Use of short names often leads to unexpected name resolutions, because the client(s) will append an unpredictable number of domain suffixes to any short name before attempting resolution. This is also a concern in multi-protocol environments (e.g., Novell, Appletalk) because short names might be resolved to systems on those networks, rather than the intended server. Again, use FQDNs as often as possible.
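The DNS accuracy check in the list above can be scripted. The sketch below uses the operating system's resolver through Python's standard `socket.getaddrinfo`, so the result reflects what the local machine actually resolves, including any hosts-file entries. The function names and the example server name in the usage note are hypothetical.

```python
import socket


def resolve_addresses(hostname):
    """Return the sorted, de-duplicated IP addresses the local resolver
    returns for hostname. Raises socket.gaierror if resolution fails."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


def check_name(hostname, expected_ip):
    """Compare the resolved addresses with the address you expect."""
    try:
        addresses = resolve_addresses(hostname)
    except socket.gaierror as err:
        return f"{hostname}: resolution failed ({err})"
    if expected_ip in addresses:
        return f"{hostname}: OK ({', '.join(addresses)})"
    return (f"{hostname}: MISMATCH, got {', '.join(addresses)}, "
            f"expected {expected_ip}")
```

For example, `check_name("server01.example.com", "192.0.2.10")` (a made-up FQDN and address) could be run from both a client and the server to confirm that the name resolves consistently from each location.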
Types of network difficulties and their indicators
The network conditions which most often impact performance are an inability to connect, latency, packet loss/reordering, and outright disconnections. We'll discuss each in turn. The important point is that you need to know the "network layout" that connects your server/application to its clients; ask your networking team for a current map of your network(s), or involve them in the troubleshooting process.
1. Inability to connect
An inability to connect is usually indicated by error messages such as "Connection refused," "Connection timed out," or "No path to server." Once you verify that the recipient system is listening (i.e., in the TCP/IP "LISTENING" state for the TCP port in question, as reported by the netstat command), AND that the application is not rejecting connections, we should turn our attention to the network infrastructure. Some common 'connection blockers' are:
- software firewalls (e.g., BlackICE, ZoneAlarm, etc.) on the client machine. Ensure that outbound connections are allowed for the application in question.
- network firewalls (e.g., Cisco PIX, Raptor, Check-Point-1). Ensure that the appropriate connection permissions have been configured on the firewall.
- load balancer (e.g., WebSphere Edge Server, BigIP) failure/misconfiguration. If load balancers are in use, verify their proper configuration.
- outright network failure (i.e., "the pipe to Chicago is down"). Use tools such as pathping, ping, or traceroute to verify the network path from client to server.
It's important to remember that the error messages you receive may trace back to an "inability to connect" problem; for instance, an authentication failure may be caused by an inability to reach the authentication server. Be aware of all the connections being made/accepted by your servers/applications, and remember that network conditions may be affecting only part of the environment.
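A quick way to tell "refused" from "timed out" is to attempt a plain TCP connection to the port in question from the affected client. The following is a minimal sketch using Python's standard `socket.create_connection`; the function name is hypothetical, and the host and port you pass are whatever your deployment uses.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Attempt a TCP connection to host:port.

    Returns (True, "") on success, or (False, reason) on failure.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except ConnectionRefusedError:
        # The host was reached, but nothing accepted the connection
        # (or a firewall actively rejected it).
        return False, "connection refused"
    except socket.timeout:
        # No answer at all; often a firewall silently dropping packets
        # or a broken network path.
        return False, "connection timed out"
    except OSError as err:
        # Name resolution failures, unreachable networks, and so on.
        return False, str(err)
```

For a Domino server, the port to test is typically 1352, the port conventionally used for Notes RPC; substitute the port your servers are actually configured to use. Running the same check from several client locations helps show whether the problem affects only part of the environment.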
2. Latency
Latency (i.e., delay in transit) is a common problem in wide-area network (WAN) environments. WAN links are the points at which network congestion is most likely to occur; they are also the most likely source of an intermittent network performance issue. Latency can best be measured by packet captures (using a tool such as Sniffer, WireShark, or Surveyor) taken at either endpoint of the connection. Most network analysis software can calculate latencies, in terms of both 'average' latency and 'latency since previous packet.'
Remember that some latency is to be expected in wide-area networking, especially where home networks (e.g. cable modem, DSL) and satellite connectivity (e.g. VSAT) are in use; your task is to determine the "typical" latency for your environment and identify any incidents of excessive latency. Isolating the point at which the latency is introduced may require multiple packet capture sessions, in which captures are taken from different points on the client-to-server network path; you'll need to engage your networking team for assistance in this activity.
IMPORTANT NOTE: The timing statistics of ping, traceroute and similar commands are NOT a proper measure of network latency. Those commands use a protocol (ICMP/IP) which is not subject to the same handling and prioritization as TCP/IP packets. In fact, most network infrastructure equipment will, when under congestion, discard any/all ICMP packets before it will discard any TCP packet. Therefore, those commands will, in most circumstances, indicate conditions that are more negative than those experienced by your "real" TCP/IP connections. Use these commands to indicate overall connectivity (i.e. "How are my packets getting to the server?"), but don't use their timing results for latency analysis in a congested network.
3. Packet loss
Some packet loss is unavoidable in a modern enterprise network. The TCP protocol expects some packet loss, and it provides a retransmission model with which to account for lost packets. Our concerns require us to ask two questions:
- What is the "normal" level of packet loss on my network, and
- Do I see instances of packet loss which exceed that "normal" level?
The answer to the first question varies from network to network, depending upon the technology involved; conventional wisdom suggests that a packet loss of 2-3% is "normal" for the typical LAN/WAN network environment, and that a packet loss approaching/exceeding 5% is indicative of a specific problem.
Since network equipment (e.g., switches, routers) does not track individual connections, the only way to diagnose packet loss is through traffic captures. Most network analysis software can indicate both missing packets in the inbound traffic and retransmissions in the outbound traffic (which imply packet loss in the outbound direction). It's important to note, however, that "missing packets" may also indicate a packet reordering problem, which we'll discuss below. Packet loss is most often the result of network congestion, so it's important to identify those points on the network path at which congestion is likely to occur, namely:
- the entrance/egress point to the network core (i.e., "the big switch in the data center")
- WAN boundaries (i.e., "the pipe to Chicago")
- network firewalls (this one is especially important in extranet deployments)
Determining the root cause of network packet loss is usually beyond the scope of the server/application owner; engage your networking team for assistance.
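The thresholds mentioned above (2-3% as "normal," 5% or more as a sign of a specific problem) can be applied to counts taken from a traffic capture or from per-protocol statistics. This is a minimal sketch that treats retransmissions as a proxy for outbound packet loss, as described above; the function names and the default thresholds are assumptions that you should adjust to your own network's baseline.

```python
def loss_percent(packets_sent, packets_retransmitted):
    """Retransmitted packets as a percentage of packets sent."""
    if packets_sent <= 0:
        raise ValueError("packets_sent must be positive")
    return 100.0 * packets_retransmitted / packets_sent


def classify_loss(percent, normal_max=3.0, problem_min=5.0):
    """Classify a loss percentage against rule-of-thumb thresholds."""
    if percent <= normal_max:
        return "normal"
    if percent < problem_min:
        return "elevated"
    return "problem"


if __name__ == "__main__":
    # Illustrative counts only.
    pct = loss_percent(packets_sent=1000, packets_retransmitted=20)
    print(pct, classify_loss(pct))
```

Keep in mind that retransmissions can also result from packet reordering, so a high figure here is a reason to inspect the capture, not a diagnosis on its own.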
4. Packet reordering
As enterprise networks and network applications have grown, packet reordering has emerged as an increasingly common problem for the application layer. Every packet in a TCP/IP connection is stamped with a sequence number. The recipient's TCP/IP software knows what sequence number to expect next, and it cannot "package up" the data for delivery to the application until it holds all necessary packets in the proper order. However, network infrastructure devices, such as routers and switches, do not track packets by sequence number; under congestion, therefore, it is possible for packets to get "out of order", thus inducing a "reordering delay" on the receiving system. It's important to note that packet reordering is NOT the same thing as packet loss; however, significant out-of-order conditions (namely, packets arriving more than three packets out of order by sequence number) can trigger unnecessary "fast retransmissions" and further increase the "reordering delay". In extreme conditions, the recipient's TCP/IP window can be filled while waiting for out-of-order packets; this will have the net effect of 'gridlocking' the recipient until the problem is resolved.
The only way to identify a packet reordering problem is through traffic captures from both sides of the connection. Some network analysis products will identify a "TCP out-of-order" condition, but careful inspection of the packet data is necessary for confirmation. A reliable indicator of packet reordering problems is the presence of "Duplicate ACK" responses from the recipient; Duplicate ACKs are sent only when the recipient sees that its "next packet received" did not carry the sequence number it expected. If a recipient sends three Duplicate ACKs for the same 'missing' packet, the sender will perform a 'fast retransmission.' When you find a Duplicate ACK, you need to follow the traffic and identify the point at which the 'missing' packet appeared; you can then compare sequence numbers of the intermediate packets to verify that they were delivered out of order. Given the speed of today's networks, you may see a high number of Duplicate ACKs (the author has seen as many as 34) for a single packet, only to see that no 'fast retransmission' took place; this indicates that the packets were 34 places out of order (a major concern) but that the missing packet arrived in time (and was acknowledged in time) to avoid the fast retransmission. Even if no 'fast retransmission' took place, the "reordering delay" imposed upon the recipient adds more overall latency to the application.
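Counting Duplicate ACKs by hand in a large capture is tedious, so it can help to export the acknowledgment numbers sent by the recipient and count the repeats. The sketch below is a simplification: it assumes you have already extracted, in order, the ACK numbers for one direction of one connection, and it treats any back-to-back repeat as a duplicate, whereas a real analyzer also checks that the repeated segment carries no data. The function names are hypothetical.

```python
def duplicate_ack_runs(ack_numbers):
    """Return {ack_number: duplicate_count} for ACK numbers that are
    repeated back-to-back. The first occurrence is not a duplicate."""
    runs = {}
    previous = None
    for ack in ack_numbers:
        if ack == previous:
            runs[ack] = runs.get(ack, 0) + 1
        previous = ack
    return runs


def fast_retransmit_candidates(ack_numbers, threshold=3):
    """ACK numbers duplicated at least `threshold` times, which is the
    point at which the sender would perform a fast retransmission."""
    runs = duplicate_ack_runs(ack_numbers)
    return [ack for ack, dups in runs.items() if dups >= threshold]


if __name__ == "__main__":
    # Illustrative ACK numbers only.
    acks = [1000, 2000, 2000, 2000, 2000, 3000, 4000, 4000]
    print(duplicate_ack_runs(acks))
    print(fast_retransmit_candidates(acks))
```

Each ACK number reported here marks a place in the capture where you should look for the 'missing' packet and compare the sequence numbers of the packets that arrived in the meantime.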
It was mentioned, in passing, that network devices pay no attention to sequence numbers. This tells us that no device "on the network" can remedy an out-of-order condition. However, if a device receives packets out of order while it is in a congested state, it can make the resulting packet stream even more out of order. The author has seen networks in which the 'packets out of order' rate went from 4% (at the first congested device) to 48% (on arrival at the receiving system). This 'snowball effect' is indicative of a serious network congestion problem affecting multiple devices.
Needless to say, you'll need to engage your network team to properly diagnose and remediate packet reordering issues. The most common cause is buffer overflow/contention in switches and/or routers; most network management packages enable your network team to monitor buffer utilization on network devices. As with most network issues, concern should be focused on known "hot spots," such as firewalls, core switches, and border routers.
5. Outright disconnection
If you suffer from unexpected disconnections "while working" or "while logged in," the first step is to ensure that the application/server is not terminating the connections. This can be determined through various debugging configurations; for instance, connections may be terminated if the TCP/IP stack runs out of memory. If, however, the application/server is not terminating the connections, the root cause is, almost always, a device "out on the network." The most likely causes are:
- Idle time-outs. Most firewalls and proxies have configured time-outs for "idle" connections, and many applications have their own idle time-outs as well. If the firewall/proxy time-outs are smaller than those of the application, you will see the connections silently terminated by the firewall, which will trigger a lengthy time-out when the application next tries to use the connection in question. Determine what idle time-outs are in place for your firewalls/proxies, and configure any application-level time-outs to a smaller value; this gives your application the "authoritative" decision on termination of idle connections.
- Wireless LANs. As wireless users move around, their connectivity roams from one access point (AP) to another. If this drop-and-reassociate cycle (which may involve a delay for authentication) consumes an excessive period of time, an application-layer time-out may occur. (When you receive reports of "dropped connections," always ask if the user is wireless!)
- VPN limitations. Some VPN applications impose their own limits on connectivity, and these can occasionally force connections to drop. This is particularly true of SSL VPNs, which are limited (by HTTP standards compliance) in the number of concurrent connections made to a single server. Be aware of the VPN technology in use in your enterprise.
Of course, it is possible for extreme network congestion (i.e., high packet loss, excessive latency) to trigger outright disconnections, in that either side could choose to "give up" on the connection. However, this is likely to affect all connections on the network, rather than just your application. Nonetheless, check with your network team for any indication of overall congestion problems before pursuing application-level troubleshooting.
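The idle time-out guidance in the list above (keep application-level time-outs smaller than those of every firewall or proxy in the path) is easy to verify once you have collected the values. This is a minimal sketch; the function name and the device names in the example are hypothetical, and all time-outs are assumed to be expressed in seconds.

```python
def timeout_conflicts(app_timeout_s, device_timeouts_s):
    """Return the names of devices whose idle time-out is not longer
    than the application's, sorted alphabetically.

    device_timeouts_s maps a device name to its idle time-out in seconds.
    """
    return sorted(
        name
        for name, device_timeout in device_timeouts_s.items()
        if device_timeout <= app_timeout_s
    )


if __name__ == "__main__":
    # Illustrative values only.
    devices = {"firewall": 3600, "proxy": 900}
    print(timeout_conflicts(1800, devices))
```

Any device returned by this check could silently drop a connection that the application still considers open, so either raise that device's time-out or lower the application's.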
Tools that can help
1. Server diagnostics/statistics
- Domino statistics include various network statistics, including per-connection and by-session information. See your Domino documentation for more information.
- The netstat command supports per-protocol statistics (check your OS documentation for the proper options), which can include packets retransmitted, packets sent/received, and the like.
2. Client diagnostics/statistics
- For the Notes client, client_clock can give you a general idea of client-side latencies on a per-transaction basis.
- The netstat command is available on most client operating systems.
- TCPview is a freeware connection monitor that shows all current connections in real-time; it is available for download from http://www.sysinternals.com.
3. Network Analysis software
- WireShark is a freeware packet capture/analysis package; it is available for download from http://www.wireshark.org.
- As mentioned above, the netstat command can give some limited statistics on general network traffic issues.
Network conditions can have a significant impact upon your application's performance. It is essential that you partner with your networking team to understand the general layout of your enterprise network, identify potential "hot spots", and gain an understanding of your network's "normal" performance level.
Monitoring and managing Domino servers is key to keeping them running well. Performance indicators need to be monitored regularly. Even if nothing else changes in a system, a user's habits or the way they use the system can change and can affect the overall performance of the system.
Effective monitoring enables continual service improvement by highlighting where services are being delivered successfully, and where help and/or changes may be needed.
This type of monitoring will help minimize downtime and improve response time. It can also be very useful for capacity planning. This is necessary because each environment is different, and each installation may require other third-party applications that may affect the performance of the servers.
Notes/Domino Best Practices: Performance
Domino Server Performance Diagnostic Data Collection
How Can the Nagle Algorithm Be Disabled?
Memcheck memory analyzer