Introduction
When is it important to understand performance of the IBM® Lotus® Sametime® user experience? And how do you do it?
The simple answer to the first question is all the time. In my work with enterprise deployments, I have learned that it is not sufficient to know the status or performance of individual points along the way; a higher-level view is required. There are many moving parts that change and that affect performance on any given day.
What often is missing is the measurement and understanding of the entire path from client to server. Our teams are measured by the user experience; why not measure the performance of the client experiences?
Obviously, there are times when there's churn within the software environment supporting both client and server, in which case it is important to understand any delta that has occurred as a result of the initiated changes. Examples are fixes that are applied to deal with a performance issue, or fixes that may negatively impact performance but are necessary.
Both situations require an understanding of what is being gained or lost, so as to best evaluate whether the changes are complete or need additional focus before considering the issue closed.
Beyond server code-level changes, there can be environmental changes that affect how users interact with the Sametime solution, and that can be critical to how well the systems support the productivity tasks they are intended for. Network path changes, firewall updates, and OS-level patches are just a few of the updates outside the server/client software that can be crucial in determining whether the environment's performance is acceptable.
These kinds of changes are not always well communicated company-wide and can affect only certain geographic areas, depending on where the changes occur and the network topology of any given region.
In short, when dealing with an enterprise solution supporting thousands or hundreds of thousands of users, it is best to be proactive in understanding how those users are experiencing the solution. This means creating or purchasing tools that constantly monitor the user experience and are able to clearly display and compare this information, to make the most use of it. This article explains the Watchit tool, available for no charge, which does just that.
Why understanding user performance is so critical
Recently we worked with a customer who was in the middle of some changes to their Lotus Sametime environment and struggling with how to actually measure the effect of their changes. These configuration changes were relative to their existing production environment as well as to the creation of a totally new environment.
Thus it was necessary to fully understand the performance of their current production system, supporting over 100,000 users, and how their newly designed and upgraded environment would perform under a similar load. Much was changing in the new environment, including the directory, cluster configurations, and server/client versions.
There were many moving parts that would be hard to measure individually, but the basic need was to understand what performance the end user would expect when the switch-over to the new environment began.
To be able to guess at what performance would be in the new configuration, it was important to know exactly what users currently saw in each geographical area. The existing customer environment consisted of business units throughout the world, with the server clusters located in the United States.
Some of the reasons to take a client view of performance are:
1. Software is measured by the end user productivity, not by server performance.
2. There are many factors, both hardware and software, that impact end user satisfaction.
3. Users in different geographical areas may experience varying degrees of performance, but the server shows only a server-centric view of how well the services are supporting users' tasks.
This is not to say that server statistics are not an important part of the puzzle, but they are certainly not the whole picture. Server statistics can provide valuable insight into how each piece of the entire structure is functioning, but they cannot illustrate how users actually see the solution.
Sametime server stats such as those listed below are easily viewed from the server and provide a good view of the workload the current solution is providing.
- Total Community Log-ins. This is the total number of log-ins to Community Services on the Sametime server that you are monitoring. The Total Community Log-ins chart includes multiple log-ins from the same user. For example, if a user is logged in from both the Sametime Connect client and the Participant List component of the Meeting Room, the chart records two log-ins for that user.
- Total Unique Log-ins. If a user is simultaneously logged in from multiple Community Services clients, the Total Unique Log-ins chart records only one log-in for that user. A user logged in from multiple clients is considered a single "unique" log-in. Use this chart to determine the current number of Community Services users.
- Total 2-way Chats. The total number of 2-person chats taking place on the Sametime server. This chart includes only chats that were started from the Sametime server you are monitoring. For example, if you are monitoring Sametime server A, and a user who has specified Sametime server A as her home server starts a chat with another user, that chat will be counted in the Total 2-way Chats chart. You will not see chats that were started by users who have specified a server other than Sametime server A as their home server.
- Total n-way Chats. The total number of multi-person chats taking place on the Sametime server. This chart includes only chats that were started from the Sametime server you are monitoring. For example, if you are monitoring Sametime server A, and a user who has specified Sametime server A as her home server starts a chat with two other users, that chat will be counted in the Total n-way Chats chart. You will not see chats that were started by users who have specified a server other than Sametime server A as their home server.
- Total Number of Active Places. The Total Number of Active Places chart lists the combined number of n-way Chats and active meetings. Both n-way Chats and online meetings are counted as "Active Places." 2-way Chats are not counted in this chart.
Server statistics such as number of server transactions, CPU utilization, and cluster transactions are all useful when measuring the acceptability of the individual server or cluster as a whole. However, there is much more to determining whether the clients will be happy than just this view, which ignores the entire infrastructure on which the success or failure of the deployment depends.
The only way to take into account the total solution (that is, network, OS, client code, server code, server configuration, and end user geographic location) is to simulate critical tasks performed by users from each user population center.
Simulation alone is not enough; it must have the ability to easily collect performance statistics from each instance and allow for comparisons to relevant data collected previously or simultaneously to make it meaningful. It is only data, and it only becomes usable knowledge when you are able to use it constructively.
Using the Watchit tool, which we describe in detail below, administrators can understand log-in response-time metrics, Sametime user-resolve performance metrics, and instant messaging (IM) delivery response time. This information is critical to fully understanding what the user experience is and makes it easy to create useful charts to gain further understanding of how changes affect your solution.
Using Watchit to obtain critical performance information
We have only begun to outline what kind of information Watchit can collect. The idea is simple:
Simulate critical tasks and, using the information from the generated logging, create descriptive, comparative charts to provide a before-and-after or a time-frame-designated picture of how your users see the solution you manage.
To do this, we explain how to set up the Watchit tool, where to use it and, once it has run, how to generate reports that you can use to make decisions on your environment effectively. Although Watchit is useful as a functional and performance monitoring tool for your Sametime Community server solution, for this article, we focus on how to use Watchit to collect performance information and easily create useful reports or charts with the created data.
A complete picture of the Watchit tool is described in the developerWorks® article, “Monitoring availability, performance, infrastructure, and beyond using IBM Lotus Sametime.”
Using Watchit to collect important performance metrics is as easy as running the tool's awarecheck plug-in to simulate two users who perform the following tasks:
1. Log into Sametime and log out based on a defined interval.
2. Perform user resolves to test the performance of the backend directory.
3. Send IMs between the pair of users to measure performance and delivery.
Each task is repeated based on the configuration of the staware.properties file, which we now explore. Once the Watchit tool is installed within a directory, only two files need to be edited to begin this performance data gathering.
The first file, watchit.properties, tells Watchit what plug-in to run and, in this case, it should look like that shown in figure 1. This watchit.properties file tells Watchit we are only interested in running the Sametime user simulation plug-in.
Figure 1. Example watchit.properties file
Other plug-ins can be run at the same time with no interference, but in this case we are keeping it as simple as possible.
Once the watchit.properties file is in place, we now must configure the awarecheck plug-in, which uses the staware.properties file. Figure 2 shows an example of this file.
Figure 2. Example staware.properties file
Though the Sametime user 1 and user 2 information can be found in the “Monitoring availability, performance, infrastructure, and beyond using IBM Lotus Sametime” developerWorks article, we focus on the first section and how these variables affect the content of your run. Specifically, we look at Check_Interval, Logout_Timeout, and IM_Interval. Table 1 lists the descriptions of these variables.
Table 1. Performance thresholds for the awarecheck plug-in
Variable | Description
Check_Interval=x | Number of minutes that the users trade instant messages
Logout_Timeout=x | Number of minutes that a user waits before logging back in
IM_Interval=x | Number of seconds to wait to send the next instant message
Check_Interval is important because it controls the amount of time the simulated users perform the IM function, which in turn relates to how often log-ins and resolves occur. When users first log in, they also do a user resolve.
Once these are complete, the awarecheck plug-in performs the IM tasks until the interval has expired. This means that the lower the interval, the more log-ins and user resolves will be in your test sample for performance metrics. If the interval is higher, then there will be more IM tests but fewer log-ins and user resolves.
This can be used to test more of one function than another, depending on your needs; for example, administrators may want more log-ins and user resolves if they suspect that is the weak link in the environment.
Or, if there is IM latency, then a longer interval is preferred, so as to focus more on IM. The ability to tailor it to your own needs is what makes the tool so flexible. User disconnects would also warrant a longer Check_Interval, to see whether users are dropped while they are logged in.
Logout_Timeout is the amount of time after the Check_Interval is completed before the test starts up again. So if the goal is to get as many cycles through as possible in any given time, then the lower this value, the better. If you want to stage the user simulations, it may be best to set this to a few minutes rather than have the users log in again immediately. Again, the idea is flexibility.
IM_Interval is designed to allow for a normal response time. A normal user does not send an IM back to the sender immediately, since it may take a few seconds to respond. This value gives you that ability, but if the goal is to get as many IMs as possible in the sample set, then this value should be very low.
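To get a rough feel for how these three values shape the sample, the following sketch estimates how many cycles (and therefore log-ins and user resolves) fit into a run, and roughly how many IMs are sent per cycle. It is only a back-of-the-envelope model, not Watchit code: it assumes one cycle lasts Check_Interval plus Logout_Timeout minutes, ignores the time taken by the log-in and resolve themselves, and the function names are hypothetical.

```shell
#!/bin/sh
# Rough sample-size estimate for an awarecheck run (assumed model, not Watchit code).
# Assumes one cycle = Check_Interval + Logout_Timeout minutes, with one log-in and
# one user resolve per simulated user per cycle; log-in/resolve time is ignored.

# cycles_in_window CHECK_MIN LOGOUT_MIN WINDOW_MIN -> whole cycles completed
cycles_in_window() {
    echo $(( $3 / ($1 + $2) ))
}

# ims_per_cycle CHECK_MIN IM_INTERVAL_SEC -> approximate IM sends in one cycle
ims_per_cycle() {
    echo $(( ($1 * 60) / $2 ))
}

# Example: Check_Interval=10, Logout_Timeout=2, IM_Interval=5, 8-hour (480-minute) run
cycles=$(cycles_in_window 10 2 480)
ims=$(ims_per_cycle 10 5)
echo "cycles: $cycles, IMs per cycle: $ims, total IMs: $(( cycles * ims ))"
```

With these example values the estimate is 40 cycles and about 120 IMs per cycle. Raising Check_Interval to 30 drops the run to 13 cycles, which illustrates the trade-off between log-in/resolve samples and IM samples.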
Debug: This setting is needed to generate the proper logging for the report generator. Use "Debug=true" in the staware.properties file to ensure the proper logging. If it is not included, only basic information will be logged and no reports can be generated.
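Because a missing Debug setting only shows up later, when no report can be generated, it can be worth checking the file before starting a long run. The sketch below is one possible pre-flight check. It assumes staware.properties uses one key=value pair per line and that the key is spelled exactly as shown above; the function name is hypothetical.

```shell
#!/bin/sh
# has_debug_enabled FILE -> exit status 0 if FILE contains a Debug=true line.
# Assumes one key=value pair per line; lines starting with # or other text do not match.
has_debug_enabled() {
    grep -Eq '^[[:space:]]*Debug[[:space:]]*=[[:space:]]*true[[:space:]]*$' "$1"
}

# Example use before starting Watchit:
# has_debug_enabled staware.properties || echo "Debug=true is missing; reports cannot be generated"
```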
When and where to run performance measurements
Surprisingly, many administrators do not know what their average log-in times, average user resolve times, and IM times are. So, first things first: we need to get a baseline to determine what is the current expectation.
This is easy. Just run the Watchit tool on a daily basis during business hours. Each run generates a log in the format:
watchit_YYYYMMDD_HH_MM_SS.log.
A sample log file is shown in Figure 4.
Figure 4. Sample generated log file from Watchit run

This log file contains all the information an administrator needs to collect the necessary performance data for log-in, user resolve, and IM delivery times. After these logs are collected for a week or two, we have a good picture of the current state of the performance. Now what to do when changes are made to the environment?
Basically, we do the same run with the changes in place. Or, if this is a new environment, we should run the same tests with the existing intervals, to replicate the exact testing that was done on the previous environment. Even if the Sametime team is not aware of any changes made in the environment, if customers begin to call in with performance complaints, then knowing the baseline, the performance the customers are seeing, and the current Watchit data will be critical in isolating the root cause.
Creating useful response-time reports
Once the logs have been created from the runs, there are many options for generating reports. The logs can be processed by themselves or concatenated to provide a performance picture of any time period.
For example, if you run the Watchit tool on a daily basis, you can concatenate the logs for the five days of the week and run the report generator to yield a weekly view of the data. It's that easy. It may be useful to know how the solution performed on any given day, so as to understand capacity needs on a certain day of the week, for example, if trouble is only seen on Mondays. It can also be useful to look at the weekly view to get a larger sample size for the average, minimum, maximum, and median log-in, user resolve, or IM delivery times.
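The weekly concatenation can be scripted. The following is a minimal sketch that assumes the logs sit in the current directory and follow the watchit_YYYYMMDD_HH_MM_SS.log naming shown earlier; the concat_logs function, the dates, and the output file name are hypothetical.

```shell
#!/bin/sh
# concat_logs OUTFILE DATE... -> concatenate all logs for the given YYYYMMDD dates
# into OUTFILE, in the order the dates are listed. Assumes log names follow the
# watchit_YYYYMMDD_HH_MM_SS.log format and live in the current directory.
concat_logs() {
    out=$1
    shift
    : > "$out"                       # start with an empty output file
    for day in "$@"; do
        for f in watchit_"$day"_*.log; do
            [ -f "$f" ] || continue  # skip days with no logs (unmatched glob)
            cat "$f" >> "$out"
        done
    done
}

# Example: one business week, then the text report on the combined file
# concat_logs week.log 20090105 20090106 20090107 20090108 20090109
# ./process_bot_output.sh week.log
```

Choose an output name that does not itself match the watchit_*.log pattern, so that a later run does not pick up the combined file as an input.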
There are two kinds of reports produced. One is a general text report, and the second is a comma-delimited file that can be imported into a spreadsheet to generate valuable charts. The report generation scripts are UNIX® shell-based parsing tools that can be run in any UNIX shell or in a Cygwin environment on Microsoft® Windows®.
Text reports
First we look at the simple text report that can be generated from a Watchit tool run or from a combination of logs. Figure 5 shows a sample text report. Whether you generate the text report or the .csv report, the data is displayed the same:
- Maximum response time for Log-in, User Resolve, and IM delivery
- Minimum response time for Log-in, User Resolve, and IM delivery
- Total number of Log-ins, User Resolves, and IMs sent
- Average response time for Log-in, User Resolve, and IM delivery
- Median response time for Log-in, User Resolve, and IM delivery
Figure 5. Sample text report
To generate the report in figure 5, only one command is required:
./process_bot_output.sh <logfile.log>
These text reports are a quick and easy means to evaluate user response times in your environment. The reports can even be run on a running instance of the Watchit tool, so you can get real-time data, anytime, on the acceptability of the system's current state.
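To make the statistics in these reports concrete, the following sketch computes the same kind of figures (count, minimum, maximum, average, and median) from a list of response times. It is not the process_bot_output.sh script and it does not parse a Watchit log; it assumes the response times have already been extracted, one number per line, and the summarize function name is hypothetical.

```shell
#!/bin/sh
# summarize -> reads one response time per line on stdin and prints
# count, minimum, maximum, average, and median on a single line.
summarize() {
    sort -n | awk '
        { v[NR] = $1; sum += $1 }     # values arrive sorted, so v[1] is the minimum
        END {
            if (NR == 0) { print "count=0"; exit }
            if (NR % 2) median = v[(NR + 1) / 2]
            else        median = (v[NR / 2] + v[NR / 2 + 1]) / 2
            printf "count=%d min=%s max=%s avg=%.1f median=%.1f\n", NR, v[1], v[NR], sum / NR, median
        }'
}

# Example with four made-up response times
printf '%s\n' 95 120 150 315 | summarize
```

For the four sample values this prints count=4 min=95 max=315 avg=170.0 median=135.0. The gap between average and median shows why it helps to report both: a single slow response pulls the average up while the median stays close to the typical case.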
While these response-time reports of individual runs are useful for understanding any given user-to-user interaction, it is often relevant to produce reports that contrast multiple environments, so as to better illustrate changes in response times or study comparative environments.
In the beginning of this article, we outlined the many reasons for comparing environments: Understanding changes in your environment from week to week, or after configuration or software changes; or comparing new environments to existing production environments.
Understanding the performance delta and setting appropriate performance goals is often critical to implementing changes or deploying new environments successfully. From a debug perspective, it would be useful to know what our simulations are reporting when customers are reporting slow performance. This allows us to better isolate the problem and thus resolve it faster.
Graphical reports
The Watchit tool now provides a simple way to produce graphical reports that make it easy to compare results. The goal is to compare Watchit runs between new and old environments, between previous weeks and the current week, or before and after specific configuration changes have been made.
If users report performance issues, it may be useful to look at Watchit results from the previous weeks to see if there is any delta in the logs and reports. The more information that can be provided by Watchit, the less impact to end users when we troubleshoot performance issues.
Generating these graphical reports is simple. The same logs generated by the Watchit tool that are used to generate the text reports can also be used to produce graphical Microsoft Excel-based charts (or any spreadsheet product) with just a few extra steps.
There is no extra configuration option in Watchit to enable this feature. Instead, a script called “process_bot_output_xls.sh” is included with the Watchit package; it generates a comma-delimited file to be imported into the spreadsheet. In this example, we use Microsoft Excel to demonstrate and use troubleshooting as our goal, since that is the most common task presented to administration teams. Note, however, that the same methods apply to testing configuration changes or comparing different clusters or environments.
The goal is to have an instance of Watchit running in locations where your users reside. For this example, we use the scenario in which a single cluster is located in the US, with user populations in both the US and abroad. We want to compare the metrics from two Watchit instances, one for each user population.
Two instances of Watchit are created, one placed in the US data center and the other in Asia. Watchit is run for the week in both locations, and now the administrators want to compare the end user performance of various Sametime tasks. This will highlight how much performance is lost (on the client side) supporting remote clients in Asia compared to the US users.
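The "performance lost" between two locations can be expressed as a percentage difference between their averages. The sketch below shows that arithmetic only; the pct_delta function name and the numbers are made up for illustration and are not Watchit output.

```shell
#!/bin/sh
# pct_delta BASELINE CURRENT -> percentage change of CURRENT relative to BASELINE.
# A positive result means CURRENT is slower (a larger response time).
pct_delta() {
    awk -v base="$1" -v cur="$2" 'BEGIN { printf "%.1f\n", (cur - base) / base * 100 }'
}

# Example: hypothetical average log-in time of 200 in the US and 260 in Asia
pct_delta 200 260
```

With these hypothetical averages the remote site is 30.0 percent slower. The same calculation applies to before-and-after comparisons of a configuration change, using the baseline run as the first argument.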
To generate the reports:
1. Run the ./process_bot_output_xls.sh script on the bot log file (there is no need to stop the bot):
Syntax: ./process_bot_output_xls.sh watchit_010101:12:00:00.log series1 > series1.out
The first parameter is the Watchit log name; the second parameter is the series name for charting.
Do this for each Watchit instance you want to compare.
2. Combine all seriesX.out files into one file, using the command:
cat series*.out > total_series.out
3. Now generate the Excel reports from the concatenated output file. Attached to this article is the “Watchit_Report_Generator.nsf” database, which contains an agent that can be run from any Notes client to auto-generate the Log-in, Resolve, and IM performance reports.
4. Run the report generator by selecting Create -- Process Watchit Data, as shown in figure 6.
Figure 6. Running the report generator
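When several Watchit instances are compared regularly, steps 1 and 2 can be wrapped in a small helper. The following is a sketch under stated assumptions: it relies only on the command syntax shown in step 1, the build_series function and the XLS_SCRIPT variable are hypothetical names, and the log file names in the example are made up.

```shell
#!/bin/sh
# Path to the script shipped with Watchit; can be overridden, for example for testing.
XLS_SCRIPT=${XLS_SCRIPT:-./process_bot_output_xls.sh}

# build_series LOG... -> writes series1.out, series2.out, ... (one per log)
# and combines them, in argument order, into total_series.out.
build_series() {
    n=0
    : > total_series.out                  # start with an empty combined file
    for log in "$@"; do
        n=$(( n + 1 ))
        "$XLS_SCRIPT" "$log" "series$n" > "series$n.out" || return 1
        cat "series$n.out" >> total_series.out
    done
}

# Example (hypothetical log names, one per Watchit instance):
# build_series us_week.log asia_week.log
```

Appending each file as it is produced gives the same result as cat series*.out > total_series.out in a clean directory, but it avoids picking up stale seriesX.out files left over from an earlier comparison.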
The sample automatic reports are shown in different Excel instances. Figure 7 shows the Log-in comparison between the series, and figure 8 shows the Resolve comparison.
Figure 7. Sample Watchit Log-in report
Figure 8. Sample Resolve report
Figure 9 shows the IM delivery comparison for as many series (bot instances) as you combined from the process_bot_output_xls.sh script output. In this case, we used two outputs, as shown above.
Figure 9. Sample IM Delivery report
Conclusion
Understanding the user experience and being able to quickly measure it is a powerful tool in better understanding your deployment. Measuring the performance impact of changes, tracking trends, and troubleshooting are made easier with the Watchit tool and some simple scripts. The ability to get an immediate snapshot of the performance of your environment is valuable information that can help you isolate problems, identify trends, and reduce outages.
Watchit is flexible enough to be run from any workstation and is quickly configured. Used in a variety of ways (recreating and alerting of issues, 24X7 monitoring, and performance impacts to configuration changes), Watchit can be deployed in any situation to quickly gain a better understanding of a problem or the customer experience.
Performance is more than server statistics. The entire path from user workstation to the server is the complete picture of the performance or function of your deployment. Using Watchit, you can make the most informed decisions possible in maintaining your solution.
Resources
Read the developerWorks Lotus article, “Monitoring availability, performance, infrastructure, and beyond using IBM Lotus Sametime.”
Read the developerWorks Lotus article, “Creating and using a real-time port monitoring application powered by IBM Lotus Sametime instant messaging.”
Refer to the IBM Lotus Sametime product page.
Refer to the IBM Lotus Sametime product documentation.
Participate in the discussion forum.
About the author
Jim Dewan is an Accelerated Value Leader for IBM Lotus currently designing a series of tools and bots to help customers monitor and debug their Lotus deployment. With ten years of Lotus Domino server development experience, Jim's previous role was as a Project Lead in the Lotus Domino administration team. Jim was also the Technical Lead for the Lotus Domino Linux on System Z effort, specializing in application development, toolkits, and enterprise data accessibility.