In today's environment, predictive analytics is a critical tool for improving your business. Implementing technology that lets you analyze big data, make projections, and produce actionable items to reduce the Total Cost of Ownership (TCO) in your messaging infrastructure gives your business an important competitive advantage.
This paper explains a proof of concept (POC) we created to analyze information from various data sources, with the goal of implementing predictive analytics processing in our Domino server production environment. We wanted to know which data in which format could be used to make predictions about the unique performance characteristics of each Domino Partition Server (DPAR). Could we use a DPAR's unique history to predict its performance at any given time during the week? Such a prediction would allow us to validate the latest data against this prediction and use it to generate alerts when performance differed from the normal range.
About the Domino servers
This POC used several production Domino Partition Servers (DPARs) in IBM's North America and Europe Global Notes Architecture (GNA) environments. The DPARs from North America had been monitored daily since September 1999. The DPARs from Europe were new, so data collection began with this POC.
To make this POC realistic, we ran against live data rather than using a benchmark, which meant we could not know the results or the workloads ahead of time. Because we were running in a live environment, it was critical that the POC not have any measurable impact on performance.
About the data collection
Domino data, process-level data, and native Linux platform statistical data were collected in a DB2 database for analysis and data mining. This process had been refined and enhanced over the past 15 years and the current DB2 schema included dozens of tables. For this study, detailed data was stored for 180 days, rather than the default of 45 days, to provide a larger data set. Detailed data was purged after 180 days, leaving a summary set of older data.
Why infrastructure matters: IO processing in System z
Most Domino servers at IBM, including those in this POC, run on Linux under z/VM on IBM System z. The Total Cost of Ownership (TCO) of this platform is superior for these types of workloads in both public and private clouds.
In 1986, System z introduced dedicated IO processors to address the scalability and performance limitations of a software Virtualized IO (VIO) architecture. This architecture difference is the backbone of the legendary IO capabilities of System z. We had to take the unique IO processing capabilities of System z into account in our study to ensure accurate results.
On System z, IO is offloaded to dedicated IO processors, so a thread whose CPU usage is close to 100% of a core should be identified as a potential CPU loop. Such a situation points to a potential DPAR performance issue with a specific application rather than a capacity issue.
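As a minimal sketch of that rule (in Python, with purely illustrative thread names and numbers), a monitor might flag any thread whose CPU consumption over a sample interval approaches a full core:

```python
# Sketch: flag threads that may be in a CPU loop on System z, where IO is
# offloaded and near-100% core usage suggests an application issue rather
# than a capacity issue. Threshold and sample data are illustrative.

INTERVAL_SECONDS = 900          # one 15-minute sample interval
LOOP_THRESHOLD = 0.95           # fraction of one core treated as suspicious

# Hypothetical per-thread CPU seconds consumed during the interval
thread_cpu_seconds = {
    "server: worker-07": 893.1,   # ~99% of one core for 900s -> suspicious
    "router: delivery-02": 41.6,
    "update: indexer-01": 217.4,
}

for thread, cpu in thread_cpu_seconds.items():
    core_fraction = cpu / INTERVAL_SECONDS
    if core_fraction >= LOOP_THRESHOLD:
        print(f"potential CPU loop: {thread} used {core_fraction:.0%} of a core")
```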
System z also has the ability to scale workloads vertically first instead of horizontally, which we had to take into account. What would indicate a performance issue in a VIO architecture could be perfectly normal in this architecture. For example, a previous DeveloperWorks article describes a 2009 benchmark where we supported 100,000 active Notes users in one Linux kernel under z/VM. During this benchmark, we achieved 26,000 IOs per second with an average two to three millisecond response time, equating to about 3% of the IO bandwidth of the benchmark infrastructure.
A new approach to early troubleshooting
Most of the Domino server analysis done today tests for preset conditions and monitors them through a dashboard or alert-based system. It is as if a visit to your doctor started with a set of tests for known issues before the doctor ever asked, “How are you feeling?” To evaluate your health, the doctor would compare your results to averages from a base population's test results. If nothing of interest showed up, you would be sent home without ever being evaluated as an individual or asked, “Is anything different that we should know about?”
Customizing test conditions for each DPAR would require a significant investment of time and effort, first to understand each DPAR and then to build the correct limits for each one. And because a DPAR's usage can change over time, those customized preset conditions would also need updating. The cost of this effort, along with a lack of documentation and of people with the analytical skills to interpret the statistical data, is the primary reason most customers do not implement individual DPAR monitoring.
Some examples of preset conditions for Domino are “CPU busy over 70%” or “Number of cache hits above n rate.” The problem with these preset conditions is that they don't take into account each DPAR's unique environment. While a DPAR's cache hit rate may never exceed a threshold, the fact that it changes by 20% (when the normal fluctuation is only 5%) should be a concern.
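To make the cache-hit example concrete, here is a hedged sketch of that distinction; the threshold, baseline, and fluctuation values are illustrative, not taken from the POC:

```python
# Sketch: a fixed preset threshold misses a change that a
# fluctuation-relative check catches. All values are illustrative.

PRESET_THRESHOLD = 0.70      # e.g. "alert if cache hit rate drops below 70%"
NORMAL_FLUCTUATION = 0.05    # this DPAR normally varies by about 5%

baseline_hit_rate = 0.92     # this DPAR's historical cache hit rate
current_hit_rate = 0.74      # still above the preset threshold

change = abs(current_hit_rate - baseline_hit_rate) / baseline_hit_rate

if current_hit_rate < PRESET_THRESHOLD:
    print("preset condition fires")        # it does not: 0.74 >= 0.70
if change > NORMAL_FLUCTUATION:
    # the ~20% drop against a 5% normal fluctuation is what should alert
    print(f"behavior change: {change:.0%} vs normal {NORMAL_FLUCTUATION:.0%}")
```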
A server condition caught early, when it first shows subtle changes, is much easier to fix than one left to reach a threshold. It's like going to the doctor when you first start to feel sick rather than waiting until you have a high fever and need an ambulance. In the Domino world, this requires understanding the unique nature of each DPAR and spotting a behavior change before it reaches a threshold. Our POC's goal was to have each DPAR define its own “normal” and then monitor itself against that standard.
Isolating statistics unique to each day and time
Most Domino server analysis evaluates hourly usage, based on the prime work shift and off shift. The following chart tracks a production DPAR in North America over two months, showing the average CPU usage for Mondays and Fridays. The y axis represents the number of CPU seconds used by this particular DPAR, while the x axis starts at midnight and is plotted in 15-minute intervals over a 24-hour period. In this chart, the blue line represents Monday's average and the red line represents Friday's average.
Illustration 1: Monday and Friday CPU usage in a 24-hour time period
There was a clear difference in the DPAR's usage on Mondays and Fridays. Mondays tended to have higher spikes in the morning when people came in to work to read mail that had accumulated over the weekend. Fridays tended to show lower mail usage because more people took Fridays off or left earlier. Other days of the week also had their unique trends.
This statistical difference was even more pronounced when we looked at the distribution over the weekend days. The pattern for Saturday and Sunday was statistically abnormal when compared to a weekday pattern. The chart below shows the two-month average for Saturday and Sunday for the same period as our previous chart, with the blue line representing Sunday and the red line representing Saturday.
Illustration 2: Saturday and Sunday CPU usage in a 24-hour time period
The usage patterns are different from the weekdays and different from each other. However, each Saturday and Sunday produced a predictable pattern, with spikes and valleys tied to maintenance tasks starting or stopping. This allowed us to include predictive analysis of maintenance tasks such as backups, indexing, or agents. While each of these tasks was different and caused a different pattern, the change and length of the pattern while each task was active were predictable based on each DPAR's unique history.
Capturing time-sensitive statistics for each server
In this POC, we wanted to track statistics for each day of the week for each DPAR, so we could isolate usage trends unique to each day rather than averaging usage over all weekdays. This led to statistically tighter ranges of what was normal for each DPAR on each day. Otherwise, unrelated data points would have skewed the calculation of what was normal for a DPAR. For example, using all weekdays in one sample set would have allowed a wider range of statistically acceptable conditions: a Friday event would be evaluated not only against other Friday events but against all events on all weekdays. Such a whole-week collection would still capture large, significant events, but it would miss the smaller events that are precursors to serious conditions. On the other hand, scaling this range down to something smaller than a full day would have led to too many false positive alerts, causing administrators to stop trusting the results.
Analyzing usage by each day of the week also allowed us to predict normal usage for nightly maintenance tasks scheduled to run on different days of the week and for varying lengths of time. For example, a backup on one night might back up twice as much data as on another night and therefore run substantially longer. While backups would differ from night to night, each night would produce its own characteristic length. Collected over time, the results allowed us to define what was normal for a particular DPAR and task at a specific time and day of the week.
Also, we wanted to avoid filtering the data for different tasks and treating them as outliers, because the more we massaged the data to handle outliers, the more suspect the results would become. We now had a range covering more than 99% of the data considered normal for each day and 15-minute time period. With time-sensitive data that showed a normal pattern, we could spot real issues more quickly.
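A minimal sketch of this bucketing idea, assuming timestamped CPU samples have already been parsed from the collected data (the samples shown are hypothetical):

```python
# Sketch: keep a separate "normal" for each (day-of-week, 15-minute slot)
# pair instead of averaging all weekdays together. Data is illustrative.
from collections import defaultdict
from datetime import datetime
from statistics import mean, stdev

# (timestamp, cpu_seconds) samples, e.g. parsed from the collected files
samples = [
    (datetime(2014, 3, 3, 9, 15), 412.0),   # a Monday, 09:15 slot
    (datetime(2014, 3, 10, 9, 15), 398.5),  # next Monday, same slot
    (datetime(2014, 3, 17, 9, 15), 405.2),
    (datetime(2014, 3, 7, 9, 15), 251.3),   # a Friday: its own bucket
]

buckets = defaultdict(list)
for ts, cpu in samples:
    slot = (ts.hour * 60 + ts.minute) // 15      # 0..95 within the day
    buckets[(ts.weekday(), slot)].append(cpu)

for (weekday, slot), values in sorted(buckets.items()):
    if len(values) >= 2:
        print(f"day {weekday} slot {slot}: "
              f"mean {mean(values):.1f}, sigma {stdev(values):.1f}")
```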
Why we eliminated rate-based measurements
While isolating each day of the week allowed us to see unique patterns, the rate-based measurements typically used for Domino analysis did not work for performance analysis in our POC. Rate or capacity indicators such as CPU busy, IOs performed, or pages paged show how much capacity is left and whether the DPAR is deviating from its history, but they don't indicate how well a DPAR is performing.
CPU busy and other rate-based values (from the native platform or Domino) are not good measurements of a DPAR's performance. High CPU usage can be tied to performance problems, but the relationship is not predictable, because varying workloads behaving normally also affect the results. More workload (a capacity issue) handled efficiently would increase CPU with the DPAR performing normally, while a looping thread would increase CPU and degrade performance. Conversely, a drop in CPU would not mean the DPAR was performing better if a bottleneck caused the drop by preventing the workload from running. A CPU increase or decrease therefore isn't necessarily tied to a performance problem.
The following chart shows CPU usage on Mondays on one DPAR over a two-month period. The orange line was statistically out of range of the other samples and showed that CPU usage was abnormal on that day.
Illustration 3: Monday CPU usage over a two-month period
However, the orange line sampled a Monday that fell on a US national holiday (Labor Day). The red line was also statistically below the other lines, but it was a Monday when a number of people on the DPAR were not working. The CPU dropped because fewer users were working; it did not indicate an issue with the DPAR or the infrastructure itself. If we had allowed rate indicators in our POC, these samples would have widened the statistically acceptable range and distorted the distribution of the data samples. Therefore, to create a more accurate range of what was acceptable, we had to remove these values from the sampling dataset.
Using cost-based values for performance analysis
Recognizing that rate-based values would not work as performance indicators, we turned to cost-based values, which look at how many resources it takes to perform a unit of work. Historically, cost-based values have been a good indicator of DPAR performance.
To calculate a cost-per-unit metric for Domino, we merged the Domino statistical data with the native platform data. We chose a 15-minute sample interval, which balanced the cost of collection against the benefits.
Note: We used the native platform to collect platform statistical data so we would have complete data for the entire sample interval we wanted. We did not use the PLATFORM statistics in Domino to collect native platform data. That statistics package uses a snapshot approach, collecting data every 20 seconds. If we looked at the data every minute, we would see the last 20 seconds but miss the other 40 seconds. If we checked the data every two hours, which is the default statrep collection interval, we would see only 20 seconds, or less than 0.3%, of the two-hour sample interval.
Calculating cost per user
To calculate a cost-per-user value, we summed the CPU used by all of the Domino tasks and divided it by the number of active users in the 15 minutes being sampled. We used server.users.active15min because, unlike the server.users statistic, it is not impacted by notes.ini changes; server.users reports different values based on the DPAR's configuration even if the workload is the same.
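Here is a minimal sketch of that calculation, assuming the per-task CPU seconds and the active-user count have already been parsed from the collected interval data (task names and values are illustrative):

```python
# Sketch: cost per user for one 15-minute interval, from parsed inputs.
# We sum CPU across the Domino tasks and divide by the 15-minute active
# users (server.users.active15min), which notes.ini settings do not skew.

def cost_per_user(task_cpu_seconds, active_users_15min):
    """CPU seconds consumed per active user in the sampled interval."""
    if active_users_15min <= 0:
        return None   # avoid dividing by zero on idle intervals
    return sum(task_cpu_seconds.values()) / active_users_15min

# Hypothetical interval: CPU seconds used by each Domino task
interval = {"server": 412.0, "router": 61.8, "update": 140.2, "replica": 22.5}
print(f"cost per user: {cost_per_user(interval, 850):.3f} CPU s/user")
```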
Charting the previous Monday sample as a cost-per-user metric, we saw very different results. The flat area in the middle of the chart represents the prime work shift when end users were active. Notice how this trend was flat while the corresponding CPU usage shown in Illustration 3 spiked.
Illustration 4: Monday cost-per-user over a two-month period
Benefits of the cost-per-unit indicator
Evaluating the cost-per-user results, we decided to accept the cost-per-unit indicator for the following reasons:
The derived values were a good performance indicator. Fluctuations in the resource usage could be associated with the fluctuations in the workloads and there was a direct correlation between the two.
The workload was being delivered in a scalable fashion. The holiday Monday outlier in the previous chart was well within the range of the other Mondays. If the cost per unit had increased as the workload increased, then this value would have been outside of the other data ranges.
Values that seemed to be outliers (the holiday drop and two CPU spikes) were actually valid data and workload spikes. They weren't data points that could skew our data, but became part of the base that anchored the data.
While the prime work shift user data was consistent, the off shift fluctuated considerably. However, this fluctuation was consistent with the various workloads executing at that particular time and was consistent over time.
These values were now predictable because of the consistent trend that was established.
Depending on the workload running, this cost-per-unit metric within a DPAR could be extended to other units, such as:
Cost per transaction
Cost per HTTP hit
Cost per agent
Cost per message delivered
Cost per index
Creating a prediction range
When we go to the doctor, we expect to be asked questions that pare down the number of tests to be performed based on possible health conditions. The same is true of DPAR performance analysis. We wanted to set up conditions that told us how the DPAR was doing and triggered further analysis only as needed.
To use the cost-per-user samples to trigger further analysis, we generated a distribution range based on the samples' upper and lower values. We used 2.5 sigma to build a 99% range, with the red line showing the 2.5 sigma upper limit and the blue line showing the 2.5 sigma lower limit. The values below zero were an indicator that this model was still producing unacceptable values.
Illustration 5: Monday cost-per-user, upper and lower range
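A sketch of how such a band can be derived for one (day, 15-minute) bucket from its historical samples; the twelve values are illustrative, and a high-variance off-shift bucket is where the lower limit could fall below zero:

```python
# Sketch: 2.5-sigma prediction band for one (day, 15-minute) bucket,
# built from that bucket's historical cost-per-user samples.
from statistics import mean, stdev

SIGMA = 2.5   # roughly a 99% range for normally distributed samples

history = [0.42, 0.45, 0.40, 0.44, 0.43, 0.47,
           0.41, 0.46, 0.44, 0.42, 0.45, 0.43]   # 12 weekly samples

mu, sd = mean(history), stdev(history)
upper = mu + SIGMA * sd
lower = mu - SIGMA * sd   # a negative lower limit flags an inadequate model
print(f"predicted range: {lower:.3f} .. {upper:.3f}")
```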
To narrow the distribution range further, we determined that there were really two sets of data within our samples: one set represented the upper limit and the other the lower limit. The calculations for the upper and lower limits also differed, since the distribution pattern was different for each. The following chart shows the new model, with the green line representing the modified upper limit and the yellow line representing the modified lower limit.
Illustration 6: Monday cost-per-user, predictive range
Test driving the prediction
With all expected conditions in place, we were ready to validate the prediction against the DPARs. We first collected 90 days of data to allow the system to build a unique baseline for each DPAR. With a sample for each day of the week at 15-minute intervals, we had 12 data points per sample interval to build our baseline.
We already had a system that was collecting data daily from a sample set of DPARs in North America. We expanded this system to include the three DPARs from Europe and the six in North America. The criteria we used were:
They could not measurably impact the existing production environment.
They did not require the installation of additional products.
They would preserve the collected data on the source system if the prediction system was down or being worked on.
Based on these requirements and the system we already had in place, the following process was established on each Linux guest to collect the Domino and native platform data.
On each Linux guest, we used a cron job to issue the Domino show stat and show trans commands to each DPAR's standard input. We redirected standard output to a timestamped file containing the results from the DPAR. If multiple DPARs existed in the same Linux guest, each DPAR was targeted.
A second cron job, timed to match the first, collected the thread- and process-level data. It was critical that each data sample cover the same interval.
Note: We needed a consistent interval, so we did not use the Domino statrep collection, because its interval “slips” with each collection. The collection interval is designed to include a sleep interval, so the next collection time is the sleep interval plus the time spent on the previous collection.
We synchronized the SAR data collection to match the same interval and collect times.
We used a separate system and Linux guest to pull and process the data; a sketch of this loop follows the list.
The process looked for new data files on Linux guest images.
New files were retrieved to the local guest.
Each file was processed and loaded into our DB2 database, which is described in more detail below.
We validated the new data against the prediction and documented the results.
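Here is a hedged sketch of the shape of that loop; the staging path, file suffix, and the two helper functions are hypothetical placeholders for the existing tooling that loaded DB2 and checked the predictions:

```python
# Sketch of the pull-and-process loop on the separate analysis guest.
# The path, suffix, and helpers are hypothetical placeholders.
from pathlib import Path

LOCAL_DIR = Path("/tmp/incoming")     # hypothetical staging directory
LOCAL_DIR.mkdir(parents=True, exist_ok=True)

def load_into_db2(sample: Path) -> None:
    print(f"loading {sample.name} into interval tables")        # placeholder

def validate_against_prediction(sample: Path) -> None:
    print(f"validating {sample.name} against predicted range")  # placeholder

# Retrieval from each guest (e.g. via rsync or scp) is assumed done;
# files preserved on the source guest remain the system of record.
for sample in sorted(LOCAL_DIR.glob("*.stat")):
    load_into_db2(sample)
    validate_against_prediction(sample)
    sample.rename(sample.with_suffix(".done"))   # mark as processed
```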
Here is a sample section of output from the Domino show stat command.
Illustration 7: Domino show stat command results
Loading data into DB2
To load this data into DB2, we needed to convert it from cumulative data to interval data where appropriate. We were already using parm files in our environment to add in new statistic values or different SAR headers dynamically to avoid coding changes for new or outdated values. So we created a parm file to handle the creation of our DB2 constructs and map the raw input data to the appropriate DB2 constructs.
The following images show a sample that mapped Domino clustering statistical data to our DB2 cluster table.
Illustration 8: Domino statistical data mapped to DB2 tables
In this sample, the values were separated by commas and had the following meanings (a parsing sketch follows the list):
The DB2 table name to be used.
The DB2 column name to be used.
A value of xxx indicated that this value should be ignored. Items such as max and average are stored by Domino because its values are cumulative; since we had interval data in DB2, these values could be derived from the data itself rather than stored unnecessarily.
The Domino statistic name to be used.
A binary flag (0 or 1) indicating whether this value was cumulative.
A value of 0 indicated the load process should subtract the previous interval value to obtain this interval's value.
The corresponding DB2 data type to be used for this interval/performance table.
The corresponding DB2 data type to be used for this daily/capacity table.
Any data processing needed for the values.
Since we were scaling Domino vertically on System z, many of the values were unnecessarily large. By converting data such as KB to MB or GB, we reduced the chances of exceeding the DB2 construct. In the sample, we converted KB to MB with the UNITS%1024 statement.
How the data is processed from the interval tables to the daily tables.
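The sketch below shows how such a line might be parsed; the sample line, field names, and the daily-rollup field are approximations of the format the list describes, not the exact parm syntax:

```python
# Sketch: split one comma-separated parm-file line into named fields.
# The sample line is hypothetical and only shaped like the described
# format. A field of "xxx" would tell the loader to ignore that value
# (stored max/average are derivable from the interval data).
SAMPLE_LINE = ("CLUSTER,SECONDS_ON_QUEUE,xxx,Replica.Cluster.SecondsOnQueue,"
               "0,INTEGER,BIGINT,UNITS%1024,AVG")

FIELDS = ["db2_table", "db2_column", "skip_marker", "domino_stat",
          "cumulative_flag", "interval_type", "daily_type",
          "processing", "daily_rollup"]

record = dict(zip(FIELDS, SAMPLE_LINE.split(",")))

if record["cumulative_flag"] == "0":
    # cumulative counter: the loader subtracts the previous interval's
    # raw value to produce this interval's value
    print(f"{record['domino_stat']} -> "
          f"{record['db2_table']}.{record['db2_column']} "
          f"(delta, processing: {record['processing']})")
```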
The entire DB2 schema was built from this parm file, including the interval and summary tables. The following image shows a portion of the DB2 tables.
Illustration 9: DB2 tables
The following image shows what the data loaded into the server interval DB2 table looked like, using the SQL command select * from wojo.server where date = '4/1/2014':
Illustration 10: Sample DB2 server interval table
Using long-term data to see trends
Storing multiple years of data in the summary and daily tables allowed us to perform capacity planning analysis over multiple years and trend-line analysis of projected DPAR usage. It also allowed us to see yearly patterns. For example, we saw a gradual drop in usage from April through August each year, corresponding with employees taking vacation time. Around Labor Day weekend, most employees were back at work, and we typically saw a 10% to 20% increase in DPAR usage around the holiday weekend.
The following chart shows this Labor Day weekend pattern in 2004 and 2005. You can see why predicting a 10% to 20% usage increase at this time each year would be critical to efficiently managing resources in a messaging infrastructure. The cost-per-user values did not change, because the DPARs were performing each unit of work at the same cost. Remember that the models used to predict performance are different from those used for capacity planning.
Illustration 11: Labor Day usage increase in 2004 and 2005
Making predictions for the next week
Our process built predictions each day that were valid for almost one week out. For example, at 1:30 AM on Thursday we would build next Wednesday's prediction.
The following chart compares incoming values in blue to the predictions over a two-day period. As we refined the calculations, we were able to predict the high and low range most of the time.
Illustration 12: Incoming values compared to high and low predicted range
Analyzing the relationship of server tasks on a DPAR
In addition to looking at a cost-per-unit measurement, we also looked at the relationships of the tasks running. Although resource usage increased and decreased, the relationships between the running tasks showed consistent behavior. Here is a sample showing the relationships between server tasks for a DPAR during the prime work shift.
Illustration 13: Server task relationships during prime work shift
To compile this data, we first looked for the task with the highest CPU usage on this DPAR over the previous few weeks, which turned out to be the server task, our “high flyer.” For each 15-minute interval, we then compared the cycles used by this high flyer to the cycles used by the other tasks. For example, assume a server task using 100 CPU seconds and a router task using 15 CPU seconds: 100/100 would produce a value of 1 for the server task, and 15/100 would produce a value of 0.15 for the router task.
By dividing the other tasks by the high flyer, we saw what was typical for this DPAR and confirmed that the task relationships were fairly stable during the day. We could now use this typical profile to spot variations in the DPAR's behavior. When problems occurred, variations allowed us to trace back the performance problem and isolate the change in behavior to a specific time period. Identifying the start of the abnormal behavior gave us information that helped us identify the root cause of a problem.
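A minimal sketch of that normalization, with illustrative task names and CPU seconds; in the POC the high flyer was chosen from the previous few weeks of history rather than from a single interval as shown here:

```python
# Sketch: normalize each task's CPU seconds by the "high flyer" for one
# 15-minute interval. Task names and values are illustrative; the real
# high flyer was picked from several weeks of history, not one interval.
interval_cpu = {"server": 100.0, "router": 15.0,
                "update": 32.0, "replica": 8.0}

high_flyer = max(interval_cpu, key=interval_cpu.get)   # here: "server"
baseline = interval_cpu[high_flyer]

ratios = {task: cpu / baseline for task, cpu in interval_cpu.items()}
for task, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{task}: {ratio:.2f}")    # server 1.00, update 0.32, router 0.15
```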
In the following chart, we looked at the prediction of this relationship compared to the actual values over a two-day period.
Illustration 14: Relationship of server tasks during prime work shift and off shift
Unlike the cost-per-user measurements where prime work shift was the flat line at the bottom, prime work shift in this relationship is the flat line at the top. While off shift showed a different pattern again, we could still build a valid prediction to measure against it.
Remember that these patterns were unique for each DPAR (values will vary) and that we were not defining or entering the actual values. We were looking at the patterns in the data, letting each DPAR's history define what was normal.
Not all spikes are bad
Another useful statistic to analyze was the cost of delivering a message. The following chart shows the cost per message delivered over a two-day period. While the number of messages delivered per interval varied, the cost per message stayed about the same. The exceptions were the two spikes in the chart where the delivery cost increased by over 1000% for messages delivered during this interval. While the 1000% increase is extremely high, the spikes occurred at a time of very low usage, so they had very little CPU impact. In this case, the spikes were expected as they corresponded to the daily maintenance task of mail database compaction. If they had been missing, we would have investigated why the scheduled task didn't run.
Illustration 15: Spikes correspond to nightly server task
Why infrastructure matters, again: Tracking the relationship of CPU usage to IOs
As discussed earlier, System z handles virtualization of the IO differently (in hardware and firmware rather than software). This difference allowed us to look at the relationship of a process's CPU usage to its IOs. Because the cost of the IO is not included in CPU processing on System z, we could isolate CPU usage and compare it to the workload running.
The next two charts show the relationship of CPU to IOs. While the scale is different for the server task in the first chart and the cluster replicator in the second chart, the pattern is very similar. CPU usage diverging from the IOs was often the earliest indication of a performance problem. Rather than being faced with an emergency DPAR failure or performance degradation, we used this ratio to proactively identify when a DPAR started to deviate from its normal pattern so we could take restorative action quickly.
Illustration 16: CPU usage compared to IOs for server task
Illustration 17: CPU usage compared to IOs for cluster replicator
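As a hedged sketch of what such a divergence check could look like (window size, tolerance, and samples are illustrative choices, not the POC's parameters):

```python
# Sketch: track the CPU-to-IO ratio for a task and flag divergence from
# the task's own recent history. All values here are illustrative.
from statistics import mean, stdev

# (cpu_seconds, io_count) per 15-minute interval for one task
history = [(412.0, 9800), (405.3, 9650), (398.8, 9400),
           (420.1, 10050), (409.5, 9700), (415.2, 9900)]
ratios = [cpu / io for cpu, io in history]

mu, sd = mean(ratios), stdev(ratios)

latest_cpu, latest_io = 430.0, 4100      # CPU held steady, IOs collapsed
latest_ratio = latest_cpu / latest_io

if abs(latest_ratio - mu) > 2.5 * sd:    # same 2.5-sigma idea as before
    print(f"CPU/IO ratio {latest_ratio:.4f} diverges from "
          f"normal {mu:.4f} +/- {2.5 * sd:.4f}")
```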
Domino transaction statistics will be valuable in the future
The Domino transaction data that you receive with a show trans command lists all Domino transactions that have executed since the start of the DPAR. The types of transactions, their execution rates, and their execution times fluctuate during the day and on the weekends, but the pattern is consistent over a full-week interval.
There are some known bugs in the reporting of these transaction statistics, including missing transaction identifiers and duplicate transactions with different values. When these issues have been resolved, we recommend investigating the use of Domino transaction data in predictive analysis.
Automating alerts to prompt further investigation
If you went to the doctor and said, “I have an earache,” you would receive tests focused on the ear that was hurting. If the tests indicated a problem with your inner ear, more specific tests on your inner ear would be ordered. In the same way, the predictions generated in this study didn't identify a particular problem, but they provided a constant, low-cost way to check on the health of each individual DPAR. If one of the indicators pointed to a potential issue, administrators could receive alerts and decide what deeper investigation was needed. Alerts could also trigger additional analysis automatically, giving administrators more information when they received the alert. For example, knowing the typical distribution of server tasks, you could generate an alert at the first sign of a deviation in a particular DPAR. Alerts would need to take into account the severity of the deviation from the norm as well as its duration. For example, a DPAR with a minor deviation for one sample would be less of a concern than a DPAR showing a deviation for multiple intervals or a single deviation substantially out of the normal range.
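A minimal sketch of such an alerting rule, with illustrative thresholds; it treats a single extreme deviation and a sustained moderate deviation as alertable, per the reasoning above:

```python
# Sketch: alert on either a single extreme deviation or a sustained
# moderate one. Thresholds and sample inputs are illustrative.
def should_alert(deviations_sigma, single_limit=4.0,
                 sustained_limit=2.5, sustained_intervals=3):
    """deviations_sigma: recent per-interval deviations, in sigmas."""
    latest = deviations_sigma[-1]
    if abs(latest) >= single_limit:          # one sample far outside range
        return True
    recent = deviations_sigma[-sustained_intervals:]
    return (len(recent) == sustained_intervals and
            all(abs(d) >= sustained_limit for d in recent))

print(should_alert([0.4, 0.9, 4.6]))         # True: one extreme sample
print(should_alert([2.8, 3.0, 2.7]))         # True: sustained deviation
print(should_alert([0.2, 2.9, 0.5]))         # False: a single minor blip
```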
Our study found that several monitoring values have good potential for indicating how well a DPAR is performing compared to its previous history. By monitoring these values, you can be alerted early to potential performance issues on your DPARs. You can also use these values to predict a DPAR's future performance.
Here are our tips for setting up predictive analysis in your environment:
Make sure the data is complete
Neither Domino nor the native platform can deliver the statistical values by itself; the two must be merged. The relationships in this data provide better prediction values than the actual raw data from each component. However, the actual values are what allow you to troubleshoot and isolate a specific performance issue.
Know your data
Synchronizing the data sources so they are consistent is critical.
Cumulative data must be converted to interval data.
Avoid excessive data manipulation.
Beware of identifying normal-range data as an outlier.
Not all spikes and valleys are bad; some are normal.
Understand that each DPAR is unique
Each DPAR has a unique trend and values.
Each DPAR's history can be used to predict what “normal” should be for that DPAR.
Know the difference between capacity and performance indicators
Cost and ratios are performance indicators.
Rates are capacity indicators; they are not performance indicators.
Moving from dashboard monitoring to predictive monitoring frees up resources and allows you to identify and focus much more quickly on DPARs with performance issues. The ideas in this article give you one more tool for managing a high-performing, high-availability infrastructure.