The purpose of this article is to detail the high-level steps involved in testing IBM® WebSphere® Portal performance and resolving performance bottlenecks in a simple, straightforward way. Understanding the tools and how to use each one is imperative. Each tool has its own learning curve and can take time to master; however, mastering these tools is what leads to the ultimate goal: ensuring all Portal deployments are highly successful.
Performance test planning
Clearly defined performance test goals must be established prior to running load tests. These performance test goals must include a throughput component (for example, page views / hour) as well as a response time component (for example, the 90th percentile response time must be less than 4 sec).
Additionally, performance testing entry criteria must be established; for example:
Performance testing entry criteria:
- Clean logs (no exceptions during main code path testing)
- No application logging to SystemOut.log
- Perform Homepage single-user time line analysis
- Non-Secure Sockets Layer (SSL) traffic to/from Portal Server; this requirement will be lifted once the performance Service Level Agreements (SLAs) have been met.
Clearly define performance SLA metrics:
- Peak page views per hour
- “Top 5” pages (“Top 5” pages is just a way to represent the critical end-user actions; there may actually be more than five pages.)
- Peak page views per hour for each of the “Top 5” pages
- Response time acceptance criteria measurement for each of the “Top 5” pages (95th percentile; 3 sec or less)
Performance testing methodology
Concentric Circles testing
The Concentric Circles approach to testing should be followed, the basis of which is to start with the simplest environment, optimize that environment, and then expand it. The simplest environment is a single-node server running one Java™ Virtual Machine (JVM).
It is in this environment that the single-user testing is performed, which is an essential step in understanding how the end-to-end system behaves. This step includes verifying and validating that the end-to-end caching strategy and timeout strategy are working properly.
Once the single JVM environment is optimized and all performance SLAs are met, the test can be expanded to include an additional JVM in the cell. Close attention must be paid to the amount of JVM-to-JVM communication; it must be kept to a minimum to ensure the solution scales properly.
Here’s an example of the list of steps in Concentric Circles testing:
(1) Perform Quality Assurance (QA) Baseline Test: Single JVM, authenticated page with the About portlet (that is, the out-of-box portlet) with IBM default theme
(2) Determine Baseline transactions per second (TPS)
(3) Expand Baseline Test to include second JVM on different node (horizontal testing)
(4) Double Load for single JVM Test: Total TPS across both JVMs should be approximately 1.8 times the TPS for one JVM
(5) Apply Portal v7.0 Tuning Guide suggestions
(6) Re-run Baseline Tests with Portal v7.0 Tuning Guide suggestions applied
(7) Repeat Baseline Tests with customer’s authenticated Homepage
(8) Perform Bottleneck Analysis
(9) Perform JVM Analysis
(10) QA "Soak" Testing: Run 60% of peak load for 72 hours
(11) Perform production validation and verification tests
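Step 4 above expects the total TPS across two JVMs to be roughly 1.8 times the single-JVM baseline. A minimal sketch of that check, with illustrative TPS numbers substituted for your own measurements:

```shell
#!/bin/sh
# Hypothetical baseline numbers; substitute your own measured TPS values.
SINGLE_JVM_TPS=50          # TPS measured against one JVM
TWO_JVM_TPS=92             # total TPS measured across both JVMs

# Compute the horizontal scaling factor to one decimal place.
FACTOR=$(awk "BEGIN { printf \"%.1f\", $TWO_JVM_TPS / $SINGLE_JVM_TPS }")
echo "Horizontal scaling factor: $FACTOR (target: ~1.8)"

# Flag the result if it falls well short of the ~1.8x target, which
# suggests excessive JVM-to-JVM communication overhead.
awk "BEGIN { exit !($FACTOR < 1.6) }" && echo "WARNING: excessive JVM-to-JVM overhead suspected"
```

A factor well below 1.8 is the signal, per the methodology above, to look at the amount of JVM-to-JVM communication before expanding the topology further.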
Several tools are needed during the Performance Bottleneck Analysis phase, including:
- Fiddler. A free web debugging proxy that breaks down the response timeline from the end-user’s perspective.
- Extended Cache Monitor. An IBM WebSphere Application Server (WAS) utility that provides statistics (number of entries used, cache hit ratio, etc.) on the DynaCache instances.
- Cache Viewer Portlet. A WebSphere Portal Server portlet (provided by IBM Support) that provides additional information about the internal caches used by Portal.
- Page & Portlet Render Timers. Displays server-side page and portlet render times for a given page, making it easy to identify the “long poles” in terms of the slowest portlets on the page.
- Portlet Load Monitoring Filter. A feature that allows Portal administrators to protect their portal by limiting the number of concurrent requests and the average response time allowed for JSR 168 or JSR 286 portlets. If a portlet exceeds either the defined maximum number of concurrent requests, or the average response time, then Portlet Load Monitoring no longer allows further requests by the portlet.
Instead, the portal renders the portlet as unavailable, and the portlet code is no longer called for further requests. This way, the Portal installation is protected from non-responsive portlets consuming an increasing number of threads.
- Network Traces. Depending on the operating system, the appropriate network trace command would be run (tcpdump or snoop) to capture the packets to and from the various servers.
- Wireshark. A free network protocol analyzer tool. We use it to analyze the network trace output files obtained from tcpdump and snoop.
- J2EE Monitoring Tool. Typically a tool such as ITCAM, CA’s Introscope, or HP Diagnostics that provides insight into where time is spent from a JVM perspective.
- IBM Support Assistant (ISA) Workbench. We use various Java-based tools to analyze thread dumps and verbose garbage collection (GC) output, specifically, “IBM Thread and Monitor Dump Analyzer for Java” and “IBM Monitoring and Diagnostic Tools for Java – Garbage Collection and Memory Visualizer.”
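The network trace commands mentioned in the list above can be sketched as follows; the interface names, hostname, and output file names are illustrative, and the resulting capture files are what you would open in Wireshark:

```shell
# Linux: capture full packets to/from a backend database host
# (interface, host, and file names are examples only).
tcpdump -i eth0 -s 0 -w portal_to_db.pcap host dbserver.example.com

# Solaris equivalent using snoop:
snoop -d e1000g0 -o portal_to_db.snoop dbserver.example.com
```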
High-level performance monitoring
Many areas need to be monitored during load tests, including frontend / authentication systems (WebSeal, SiteMinder, Web Servers, etc.), backend systems (database servers, LDAP servers, Web Services, etc.) and, of course, the network. Subject Matter Experts (SMEs) are required to monitor their respective component during load tests.
Fiddler is used to determine the actual response time observed from an end-user perspective. The Page & Portlet Render Timers are used to determine the server-side render times. If there is a large disparity between Fiddler’s response time (high) and the Page & Portlet render times (low), then it implies the bottleneck is before the Portal Server (frontend / authentication systems, Web Servers, WebSphere Plug-ins, etc.).
If the Page & Portlet Render Times are high, then it implies the bottleneck is in Portal or any backend component, and further Bottleneck Analysis is required to determine the root cause of the slowdown.
From Portal’s perspective, here is a list of initial data that must be collected:
- Fiddler traces
- Page & Portlet Render Timers screenshots
- CPU monitoring
- JVM Heap analysis (verbose GC must be enabled)
- Web Container Threads monitoring
- Database Connection Pool monitoring
- J2EE Monitoring tool to capture long-running transactions
- All log files (SystemOut.log, SystemErr.log, native_stdout.log, native_stderr.log, etc.)
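For the JVM heap analysis item above, verbose GC must be enabled before the load test. On an IBM J9 JVM this is done through the Generic JVM arguments in the WAS admin console; the log file name and rollover values below are examples only:

```
-verbose:gc -Xverbosegclog:verbosegc.log,5,10000
```

This writes GC activity to rolling log files (here, 5 files of 10,000 GC cycles each), which can then be loaded into the Garbage Collection and Memory Visualizer mentioned earlier.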
This section and Section 7 are excerpted from the developerWorks® article, “IBM WebSphere Portal: Performance testing and analysis.”
Environment (mirror of production)
To meet the performance test objectives, the performance test environment needs to either be the production environment itself or a “mirror” of it that has, as much as practically possible, the exact same hardware, topology and backend systems. If any piece of this complex test topology is different than its production counterpart, one must extrapolate the results in the test environment to predict its effect in the production environment.
These extrapolations require detailed knowledge of Portal and the deployed applications that is generally not available to the testing organization. By making the test environment equivalent to the production environment, confidence that the test results reflect what will actually happen in production becomes acceptable.
An important goal of the test environment is repeatability of results. As slight changes are made in the system, repeatability ensures that one can accurately measure the effect of these changes. For that reason, it is optimal to have the system on an isolated network during the performance testing.
Running the performance test on the production network introduces variability (for example, user traffic) which can skew metrics such as page render response time. There is also a more pragmatic reason to isolate the performance test network. Putting Portal under stress will likely put the corporate network under stress. This is often problematic during normal business hours.
If placing the performance test on an isolated network is not feasible, one should try to ensure that the components of the test are all collocated on the same subnet of a network router. Normal Portal best practice recommends using a gigabit Ethernet connection between the Portal and its backend applications.
Optimally, this would extend to the LDAPs, the Web Servers and important backend services. It’s also crucial that the load generator be on a local segment to the web server and/or the Portal itself.
Portal infrastructure baseline
In contrast to the mirrored production environment, it is generally advisable to also run an incremental set of “baseline” tests that exercise the infrastructure. New tests should then gradually augment the Portal with customer-written code, so that any problems found can be attributed to the most recent augmentation.
The first test should be of the complete Portal infrastructure using an “out of the box” Portal. The database should be transferred and security should be enabled. Also, all frontend web servers, firewalls and load balancers should be in place.
Security managers (for example, IBM Tivoli® Access Manager or SiteMinder) should also be in place and correctly configured. Create a simple home page with a couple of portlets that do not access any backend systems (for example, the About portlet).
Create a simple load testing script which accesses the unauthenticated home page, logs in (authenticates), and then idles without logging out. From this point, simulated users (Vusers) should be ramped up until the system is saturated. Using the bottleneck analysis techniques described below, find and fix any bottlenecks in the infrastructure. Note the performance baseline of this system.
Now, add to the system any customized themes and skins and repeat the previous test. Find and fix any important bottlenecks and note the performance baseline in the revised system.
Finally, as described below in this article, add in the actual portlets to be used on the home page and perform bottleneck analysis. Once again, note the performance baseline of this system. This “baseline” environment can be very effective in finding bottlenecks in the infrastructure that are independent of the application. Further, it can provide a reference when analyzing the extent to which the applications place additional load above and beyond the basic Portal infrastructure.
A proper performance test also requires the use of a load generator that produces simulated user requests for Web pages. It is important that this tool produce metrics like response time and page views per second. These allow the tester to determine when the system fails its SLA contract or is saturated to the point that injecting more page requests per unit time does not result in higher page production.
It is important that the load generators have sufficient virtual users (Vusers) to drive the system to saturation. Note that Vusers do not map directly to actual users: a Vuser represents an “active” channel on the load generator and may simulate multiple actual users. However, only one of those actual users can be requesting a Web page per Vuser at any given time.
It is important, especially in a Portal context, that if the system requires authenticated access to the Portal application(s) under test, sufficient unique test user IDs exist in the LDAP and that scripts ensure only a reasonable number of duplicated log-ins occur during the test. “Reasonable” in this context accounts for the fact that some users may each have a couple of instances of a Web browser open with the same Portal log-in ID.
Portal has a large caching infrastructure for portal artifacts, which are generally cached on a per-user basis. So, if the load simulation uses the same user ID for all tests, performance would appear artificially high since the artifacts would not need to be actually loaded from the LDAP and Database.
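A simple way to satisfy the unique-user-ID requirement above is to generate a parameter file of distinct IDs for the load generator. A minimal sketch, where the "user" prefix and the count of 1000 are illustrative and the IDs are assumed to already exist in the LDAP:

```shell
#!/bin/sh
# Generate a parameter file of unique test user IDs (user0001..user1000)
# for the load generator. Prefix and count are examples only.
COUNT=1000
i=1
while [ "$i" -le "$COUNT" ]; do
  printf 'user%04d\n' "$i"
  i=$((i + 1))
done > test_users.txt

# Sanity check: every ID in the file should be unique.
wc -l < test_users.txt
sort -u test_users.txt | wc -l
```

Feeding each Vuser a different ID from this file keeps the per-user caches exercised realistically rather than artificially warm.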
Creating the simulation
To tune the Portal system to handle large numbers of users and to accurately predict its ability to handle a specific number of users, it is important to determine the most probable scenarios for users of the system. The test must then accurately simulate those user scenarios via the load generator.
One effective way to do this is to list the most likely use cases and write a script (or as many as are practical) for each of them. Then assign to each use case the probability that a given percentage of the whole user population will execute that scenario.
As the test is run, assign use cases to Vusers (virtual users) in the same proportion as the expected general population. As the number of Vusers is “ramped up” as discussed later, try to maintain this proportion.
Driving to saturation
It is important to understand what is meant by “saturation”. Saturation is defined as the number of active Vusers at which point adding more Vusers will not result in increased throughput (page views per second). Each simulation will likely have a different saturation point.
The saturation point varies depending on the usage pattern. More generically, that saturation point is the point at which requesting more pages per unit time does not actually result in more pages rendered per unit time. To effectively drive a system to saturation, add a small number of Vusers at a time and let the system stabilize, observe whether page views per second increases and then add more Vusers.
“Stabilize” in this context means that response times are steady within a window of several minutes. In LoadRunner, if one plots Vusers versus throughput (page views per second), page views per second will initially rise linearly with the number of Vusers, then reach a maximum and actually decrease slightly from that point.
The saturation point is the number of Vusers at which page views per second is at its maximum. In a poorly behaving system, throughput (page views per second) can continue to rise even after the response times for some transactions have exceeded the maximum allowed. So, while driving a system to saturation, it is also important to monitor response times and make sure they remain in the acceptable range.
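Locating the saturation point from the plot described above amounts to finding the Vuser count at which throughput peaks. A minimal sketch over an illustrative CSV of (Vusers, page views per second) samples exported from the load generator:

```shell
#!/bin/sh
# Illustrative throughput samples; in practice, export these from the
# load generator's Vusers-vs-throughput report.
cat > throughput.csv <<'EOF'
vusers,pages_per_sec
50,40
100,78
150,110
200,128
250,131
300,127
EOF

# Skip the header; track the maximum throughput and the Vuser count
# at which it occurred. That Vuser count is the saturation point.
awk -F, 'NR > 1 && $2 > max { max = $2; at = $1 } \
  END { print "Saturation at ~" at " Vusers (" max " pages/sec)" }' throughput.csv
```

With the sample data above, throughput peaks at 250 Vusers and declines slightly afterward, which matches the curve shape described in the text.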
A common question in performance testing is at which rate should the Vusers be ramped up in the system. It is highly discouraged to use a technique like ramping up several hundred users as fast as possible until the system collapses. This is not representative of reality and will not provide repeatable results.
As mentioned earlier, it is important to model reality and therefore one should predict and measure the actual highest ramp rate that would ever be expected. This typically occurs during the morning hours when users first log in to the portal. IBM recommends that a small fixed number of users (for example, 2 Vusers/5 seconds) be ramped up for a set period of time (say, 5 minutes).
Additional users should not be added until the system has stabilized (at least 5 minutes, preferably 15 to 20 minutes, which will allow one to look for degradation when the work load is constant), then loop back and add another batch of Vusers in the same fashion. This technique allows the portal time to fill the various caches in an orderly fashion and provides for the ability to more accurately detect saturation points.
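The recommended ramp rate above (2 Vusers every 5 seconds, ramped for 5 minutes, then held until stable) implies a fixed batch size that can be computed up front. A small sketch of that arithmetic, with the rate and window taken from the example in the text:

```shell
#!/bin/sh
# Ramp-schedule arithmetic from the example rate: 2 Vusers every
# 5 seconds, ramped over a 5-minute window per batch.
RATE_VUSERS=2
RATE_INTERVAL=5          # seconds between additions
RAMP_WINDOW=$((5 * 60))  # 5-minute ramp window, in seconds

BATCH=$((RAMP_WINDOW / RATE_INTERVAL * RATE_VUSERS))
echo "Vusers added per batch: $BATCH"

# Cumulative Vusers after each of the first 4 batches, with the
# stabilization hold between batches noted.
b=1
while [ "$b" -le 4 ]; do
  echo "After batch $b: $((b * BATCH)) Vusers (hold 15-20 min to stabilize)"
  b=$((b + 1))
done
```

Each batch therefore adds 120 Vusers, and the hold period between batches is what gives the caches time to fill in an orderly fashion.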
The goal of bottleneck analysis is to remove impediments to driving the system to a higher load. The metric defined for “higher load” is a higher number of Page Views per Second at saturation. Therefore, bottleneck analysis removes impediments to improving the saturation point.
Bottlenecks in a Portal environment under load are generally the result of contention for shared resources. This contention can be the result of synchronized java classes, methods or data structures, contention for serial resources (for example “SystemOut” log) or excessive response times in backend databases or web servers.
However, one must also be aware of bottlenecks such as the network itself, so it is important to ensure that testing is done over a network similar to the one the end-users of the site will have. Components such as routers and firewalls can also impose congestion control or may be poorly tuned. As load increases, contention for these resources increases, which makes them easier to detect and correct. This is why effective load testing is a requirement for bottleneck analysis.
The process of performing bottleneck analysis under load is quite straightforward. For a particular performance analysis (for example, LoadRunner) simulation:
(1) Ramp a single Portal JVM to saturation.
(2) Determine the bottleneck(s) that exist at saturation.
(3) Resolve the bottleneck(s).
(4) Unless satisfied with system capacity, return to Step 1 and find the next bottleneck.
Note that this process is iterative; fix one bottleneck and then re-test to find the next one. A single JVM is used in this case because detection of the bottleneck is much simpler. Finding and resolving cross-JVM contention can be more complex.
Once the portal is at saturation, a Java thread dump is typically taken against the portal Java process under test (via a “kill -3” command on Linux® or UNIX®) to determine the cause of the bottleneck. A “kill -3” can generate a thread dump, a heap dump, or both. It is recommended to disable heap dump generation unless it is needed.
Thread dumps occur quickly, and the system recovers fast; however, neither is true for heap dumps, especially with large heaps. A thread dump shows the state of all threads in the JVM. The general procedure is to look for threads that are all blocked by the same condition or all waiting in the same method of the same class.
In general, search for threads that are blocked or in a wait state. After ascertaining why certain classes statistically show up blocked, you can remove that reason, and with it the bottleneck. There is an art to Portal bottleneck analysis that takes time and experience to master.
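The thread-dump procedure above can be sketched as follows. The process ID is illustrative, and the javacore excerpt is a simplified, assumed rendering of the IBM javacore 3XMTHREADINFO line format (where "state:B" marks a blocked thread); real javacores contain considerably more detail:

```shell
#!/bin/sh
# Trigger a thread dump (javacore) on an IBM JVM; PID is illustrative.
# kill -3 "$PORTAL_PID"

# Simplified, illustrative excerpt of javacore thread-state lines.
cat > javacore.sample.txt <<'EOF'
3XMTHREADINFO "WebContainer : 0" state:B prio=5
3XMTHREADINFO "WebContainer : 1" state:B prio=5
3XMTHREADINFO "WebContainer : 2" state:R prio=5
3XMTHREADINFO "WebContainer : 3" state:CW prio=5
EOF

# Count blocked (state:B) threads; many blocked threads all waiting on
# the same monitor usually points straight at the bottleneck.
grep -c 'state:B ' javacore.sample.txt
```

In practice, you would run this scan across several javacores taken under load and look for the class or monitor that shows up blocked consistently.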
Production verification and validation
A critical step that is often overlooked or not executed is production verification and validation. Many times, performance tests are run only in a QA or performance test environment. Since every environment is different, it is imperative that some type of automated performance test be run in the production environment as well.
Typically, a subset of the full-blown performance tests is run to verify and validate that the production environment is performing properly. Many customers use their “regression test bucket” as the basis of the production verification and validation tests. This step essentially ensures the Web site “launch date” is as successful as possible.
Resolving Portal performance bottlenecks is a straightforward process. First, start with proper planning by establishing the performance test goals up front. Understanding the various troubleshooting tools is imperative in breaking down the response time observed by users. Lastly, verifying and validating the caching strategy is another key step in ensuring all Portal production deployments are highly successful.
IBM Support Technote #1316528, “Collecting Data: Performance, hang, or high CPU issues for WebSphere Portal 6.1 / 7.0 / 8.0:”
WebSphere Portal Family wiki article, “Performance management tools for IBM WebSphere Portal:”
developerWorks WebSphere Portal zone:
About the author
is an Executive IT Specialist currently working on the IBM Software Services for Collaboration (ISSC) team. He has more than 19 years of direct on-site customer support experience and specializes in troubleshooting end-to-end performance issues. He joined IBM in 1986 and has held numerous technical positions, working at the IBM Poughkeepsie and IBM Austin labs; he currently resides in Hopewell Junction, NY. You can reach him at firstname.lastname@example.org.