Troubleshooting IBM WebSphere Portal and IBM Lotus Web Content Management server hangs
Anuradha D Chitta
Advisory Software Engineer
IBM Software Group
Bangalore, KA India
October 2009
Summary: This article presents a troubleshooting guide for IBM® WebSphere® Portal and IBM Lotus® Web Content Management server hangs, explaining how to identify and isolate their root causes.
Contents
1 Introduction
2 Administrator’s checklist
2.1 Check for batch jobs being run on the server
2.2 Check verbosegc for heap usage
2.2.1 Determining the cause of the memory outage
2.3 WebContainer thread contention
2.4 Lack of response from external sources
2.5 Checking logs for hung threads
3 Conclusion
4 Resources
About the author
1 Introduction
When a production server becomes unresponsive, administrators are inclined to restart the server as quickly as possible, to reduce the downtime. However, restarting the server without collecting the diagnostics will leave you with little information to troubleshoot what has caused the hang.
In this article we discuss how to identify what caused the hang and explain the necessary information to collect before restarting the server.
Some of the common factors that lead to server unresponsiveness include:
- Lack of heap space
- WebContainer Thread contention
- Lack of response from external sources
2 Administrator’s checklist
Below is a checklist that administrators can use to troubleshoot WebSphere Portal and Web Content Management server hangs:
- Check for batch jobs being run on the server
- Check verbosegc for heap usage
- Generate and review threaddumps
- Check logs for hung threads
2.1 Check for batch jobs being run on the server
First, check for batch jobs or scheduled tasks running at the time of outage. Some of the tasks that can take up resources on the server include search crawls and Web Content Management tasks like memberfixer and Java™ Content Repository (JCR) indexing.
- You can check the WebSphere Portal server CPU and memory usage by using vmstat and top on UNIX® servers.
- Web Content Management search crawls are scheduled to run every 4 hours by default; if the content is not changing too frequently, make sure the crawls are spaced out to reduce load on the server.
- Make sure the batch jobs taking up resources are scheduled during off-peak hours.
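As a rough illustration of the vmstat check above, the following sketch scans vmstat-style output and reports the samples in which the CPU is mostly busy. It assumes the Linux column layout, in which a header row names the columns and id is the idle percentage; other UNIX flavors lay out the columns differently. The function name, the 90% threshold, and the sample numbers are made up for illustration.

```python
def busy_samples(vmstat_text, busy_threshold=90):
    """Return CPU-busy percentages (100 - idle) that meet the threshold.

    Assumes a vmstat header row that contains an 'id' (idle %) column.
    """
    idle_index = None
    busy = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "id" in fields:
            # Header row: remember which column holds the idle percentage.
            idle_index = fields.index("id")
            continue
        if idle_index is None or len(fields) <= idle_index:
            continue  # banner row or truncated line
        if not fields[idle_index].isdigit():
            continue
        cpu_busy = 100 - int(fields[idle_index])
        if cpu_busy >= busy_threshold:
            busy.append(cpu_busy)
    return busy


# Illustrative sample only; these numbers do not come from a real server.
sample = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 9  0      0 120000  20000 800000    0    0     5    10  900 1500 88  9  3  0  0
 1  0      0 125000  20000 800000    0    0     0     2  200  300 10  2 88  0  0
"""
print(busy_samples(sample))
```

If busy samples line up with the time a search crawl or other batch job was running, that job is a candidate for rescheduling.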
2.2 Check verbosegc for heap usage
Before moving the servers to production, you should have set WebSphere Portal / Web Content Management Java Virtual Machine (JVM) heap (memory) settings to optimal values, after tuning the server through load tests.
Even with this tuning exercise, however, unexpected load and large object requests coming from the application code can make the server run out of heap space and fail to satisfy Java object allocation requests. The JVM then runs garbage collection cycles excessively, and because the application threads are paused during each cycle, the server can appear hung.
2.2.1 Determining the cause of the memory outage
First, let’s discuss the memory limitations on 32-bit platforms. The total size of a process can reach up to 2 GB on 32-bit platforms. This total includes both the Java heap and the native memory that native code (JNI, JDBC prepared statements, OS native calls, and so on) needs for its own allocations.
On such platforms, do not let the maximum heap size grow beyond 1.5 GB, which leaves about 500 MB for native memory allocations.
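The arithmetic behind this guideline can be sketched as follows. The function name is hypothetical, and the figures are simply the 2 GB process limit and 1.5 GB maximum heap mentioned above, expressed in megabytes.

```python
def native_headroom_mb(process_limit_mb, max_heap_mb):
    """Memory left for native allocations once the Java heap is at its maximum."""
    return process_limit_mb - max_heap_mb


# 2 GB process limit with a 1.5 GB maximum heap.
print(native_headroom_mb(2048, 1536))  # 512
```

A larger maximum heap shrinks this headroom one for one, which is why raising the heap size on a 32-bit JVM can trade a Java heap shortage for a native memory shortage.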
There are two types of Out of Memory (OOM) conditions:
(1) Complete heap exhaustion. When the server is totally out of heap space, the garbage collection cycles take longer and, during that process, the application threads are paused until the garbage collection cycle ends.
Make sure verbosegc is enabled on the server, check the native_stderr.log for allocation failures just before the outage, and check the amount of free heap space.
Look for this output in the verbosegc log:
Typical Allocation Failure when the server is totally out of heap space:
15:48:55 2009
0% free (588040/1342175744), in 10088ms>
= 32), weak 0, final 2, phantom 0>
15:49:02 2009
= 32), weak 0, final 5, phantom 0>
JVMDG217: Dump Handler is Processing OutOfMemory - Please Wait.
JVMDG315: JVM Requesting Heap dump file
JVMDG318: Heap dump file written to D:\IBM\WEBSPH~1\APPSER~1\heapdump.20090822.154902.5580.phd
JVMDG303: JVM Requesting Java core file
JVMDG304: Java core file written to D:\IBM\WEBSPH~1\APPSER~1\javacore.20090822.154944.5580.txt
JVMDG274: Dump Handler has Processed OutOfMemory.
If the verbosegc logs show that the server has been running with less than 10% free heap for a long period of time, consider increasing the maximum heap size.
At the same time, investigate what is consuming the heap by generating a heapdump or by analyzing the heapdumps that were already generated. For more information, refer to the IBM Support Techdoc, “Webcast replay: Using IBM HeapAnalyzer to diagnose Java heap issues.”
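To spot long stretches of low free heap without reading the log by eye, a small script can pull the free-heap figures out of the verbosegc records. The sketch below assumes that each garbage collection record ends with text of the form "N% free (free/total), in Nms", as in the sample output above; the function name and the 10% default are illustrative.

```python
import re

# Matches the tail of a GC record such as "0% free (588040/1342175744), in 10088ms>".
FREE_RE = re.compile(r"(\d+)% free \((\d+)/(\d+)\), in (\d+)\s*ms")


def low_heap_records(lines, min_free_pct=10.0):
    """Yield (free_pct, pause_ms) for GC records below the free-heap threshold."""
    for line in lines:
        match = FREE_RE.search(line)
        if not match:
            continue
        free_bytes = int(match.group(2))
        total_bytes = int(match.group(3))
        # Recompute the percentage from the byte counts; the logged value is rounded.
        free_pct = 100.0 * free_bytes / total_bytes
        if free_pct < min_free_pct:
            yield round(free_pct, 2), int(match.group(4))


sample = ["0% free (588040/1342175744), in 10088ms>"]
print(list(low_heap_records(sample)))
```

In practice you would pass the lines of native_stderr.log instead of the one-line sample. Many consecutive records below the threshold, together with long pause times, point toward heap exhaustion rather than fragmentation.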
(2) Large object request failure. If the JVM is not able to satisfy an allocation request for an object of reasonable size, even when there is a lot of free heap, it indicates the heap is highly fragmented.
Make sure the KCluster is tuned, and avoid making too many large object requests (for example, by uploading large files).
The size of the object being requested by the applications can be identified from the allocation failure records in native_stderr.log as follows:
The above Allocation Failure is due to the code making a very large object request (36 MB) from the heap. Once you know this to be the cause of the failure, set the following environment variable:
ALLOCATION_THRESHOLD=5000000
This prints the stack trace of every allocation request larger than 5 MB to native_stderr.log.
Once you have the stack of the code making such large object requests, you can get the owner of that code to consider reducing the object sizes, to avoid such large object allocations.
If this code cannot be changed for some reason, then you can set aside a chunk of heap for large objects alone so that it remains fairly unfragmented, satisfying large requests by providing enough contiguous heapspace.
You can do this using the JVM option -Xloratio, which sets aside n% of your heap for large objects only. You can find more information on setting the KCluster and loratio properties in the IBM Support Technotes, “Avoiding Java heap fragmentation with Java SDK V1.4.2” and “How to allocate large objects into Large Object Area on IBM SDK 1.4.2 SR1 and later.”
2.3 WebContainer thread contention
Requests coming into WebSphere Portal / Web Content Management are served by WebContainer threads. The WebContainer thread-pool setting needs to be tuned according to the expected load during peak times. When the server stops responding, the first thing to do is generate threaddumps, using the following mechanisms:
Microsoft® Windows®:
wsadmin.bat [-host host_name] [-port soap_port_number] [-user userid] [-password password]
wsadmin> set jvm [$AdminControl completeObjectName type=JVM,process=WebSphere_Portal,*]
wsadmin> $AdminControl invoke $jvm dumpThreads
UNIX:
kill -3 PID
The path to the location of the Javacore file will be in the verbosegc output. On Solaris the threaddumps are printed into the native_stdout.log.
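Because a single threaddump shows only one moment in time, it is useful to take several dumps a short interval apart. The following sketch, for UNIX only, sends the same signal as kill -3 (SIGQUIT) to the JVM process several times. The function name, the default of three dumps, and the 60-second interval are illustrative choices, not IBM recommendations; the send and sleep parameters exist only so the loop can be exercised without a real JVM.

```python
import os
import signal
import time


def request_thread_dumps(pid, count=3, interval_seconds=60,
                         send=os.kill, sleep=time.sleep):
    """Send SIGQUIT (the signal behind `kill -3`) to the process `count` times."""
    for attempt in range(count):
        send(pid, signal.SIGQUIT)
        if attempt < count - 1:
            # Give the threads time to make progress before the next dump.
            sleep(interval_seconds)
    return count


# Example call, with a hypothetical process ID of the Portal JVM:
# request_thread_dumps(12345)
```

Run it as the user that owns the JVM process, and pass the process ID of the WebSphere Portal JVM rather than that of a wrapper script.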
You can examine Javacores (threaddumps) to see what the WebContainer threads are doing and check for any deadlocks reported. Look at the state of the WebContainer threads. If most of them are in state:R (Running), it indicates that the server is under excessive load.
Now look at the code executing on these threads and generate subsequent threaddumps, to see how the threads are progressing. If most of the threads are in state:CW (Condition Wait), check what condition these threads are waiting on.
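Counting the thread states by hand is tedious in a large Javacore. A minimal sketch of such a count is shown below; it assumes that each thread is introduced by a 3XMTHREADINFO line carrying the quoted thread name and a state: field, as in the thread stacks shown in this article. The function name is illustrative.

```python
import re
from collections import Counter

# Captures the state code of WebContainer threads from 3XMTHREADINFO lines.
THREAD_RE = re.compile(r'3XMTHREADINFO\s+"(WebContainer\s*:\s*\d+)".*?state:(\w+)')


def webcontainer_states(javacore_lines):
    """Count WebContainer threads per state code (for example R or CW)."""
    states = Counter()
    for line in javacore_lines:
        match = THREAD_RE.search(line)
        if match:
            states[match.group(2)] += 1
    return states


sample = [
    '3XMTHREADINFO "WebContainer : 1" (TID:0x56F8DD00, sys_thread_t:0x51DF0478, state:CW, native ID:0x00000B58) prio=5',
    '3XMTHREADINFO "WebContainer : 27" (TID:0x807C4D68,sys_thread_t:0x4533CE28, state:CW, native ID:0x83D2) prio=5',
    '3XMTHREADINFO "WebContainer : 1" (TID:0x3030FC00, sys_thread_t:0x806FE328, state:R, native ID:0x4ED9) prio=5',
]
print(webcontainer_states(sample))
```

The sample lines come from different Javacores and are combined here only to exercise the parser; in practice you would pass the lines of a single Javacore file.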
Sample threads waiting on each other resulting in a deadlock:
Deadlock detected !!!
NULL
2LKDEADLOCKTHR Thread "WebContainer: 15" (0x58BD5520)
3LKDEADLOCKWTR is waiting for:
4LKDEADLOCKMON sys_mon_t:0x588AB898 infl_mon_t: 0x588AAE38:
4LKDEADLOCKOBJ org.apache.log4j.Logger@37E9FCF8/37E9FD00:
3LKDEADLOCKOWN which is owned by:
2LKDEADLOCKTHR Thread "WebContainer: 8" (0x56C5C7A0)
3LKDEADLOCKWTR which is waiting for:
4LKDEADLOCKMON sys_mon_t:0x58523918 infl_mon_t: 0x00000000:
4LKDEADLOCKOBJ java.lang.StringBuffer@3B9F5148/3B9F5150:
3LKDEADLOCKOWN which is owned by:
2LKDEADLOCKTHR Thread "WebContainer: 15" (0x58BD5520)
Review the code executing on the above threads and engage the respective developer/owner of that code.
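When the deadlock involves more than two threads, it helps to reduce the section to a list of who waits for what. The sketch below walks the deadlock lines in order and emits (waiting thread, lock object class, owning thread) triples. It assumes the 2LKDEADLOCKTHR and 4LKDEADLOCKOBJ line layout of the sample above; the function and pattern names are illustrative.

```python
import re

# Thread names may appear with straight or typographic quotes in copied text.
DEADLOCK_THREAD_RE = re.compile(r'2LKDEADLOCKTHR\s+Thread\s+["“](.+?)["”]')
DEADLOCK_OBJECT_RE = re.compile(r'4LKDEADLOCKOBJ\s+(\S+?)@')


def deadlock_chain(javacore_lines):
    """Return (waiter, lock_object_class, owner) triples from a deadlock section."""
    chain = []
    waiter = None
    lock_object = None
    for line in javacore_lines:
        thread = DEADLOCK_THREAD_RE.search(line)
        if thread:
            name = thread.group(1)
            # A thread line that follows an object line names that object's owner.
            if waiter is not None and lock_object is not None:
                chain.append((waiter, lock_object, name))
            waiter, lock_object = name, None
            continue
        obj = DEADLOCK_OBJECT_RE.search(line)
        if obj:
            lock_object = obj.group(1)
    return chain


sample = [
    '2LKDEADLOCKTHR Thread "WebContainer: 15" (0x58BD5520)',
    '3LKDEADLOCKWTR is waiting for:',
    '4LKDEADLOCKOBJ org.apache.log4j.Logger@37E9FCF8/37E9FD00:',
    '3LKDEADLOCKOWN which is owned by:',
    '2LKDEADLOCKTHR Thread "WebContainer: 8" (0x56C5C7A0)',
    '3LKDEADLOCKWTR which is waiting for:',
    '4LKDEADLOCKOBJ java.lang.StringBuffer@3B9F5148/3B9F5150:',
    '3LKDEADLOCKOWN which is owned by:',
    '2LKDEADLOCKTHR Thread "WebContainer: 15" (0x58BD5520)',
]
for waiter, lock_object, owner in deadlock_chain(sample):
    print(waiter, "waits for", lock_object, "held by", owner)
```

The lock object classes (here a log4j Logger and a StringBuffer) are usually the quickest hint as to which code owner to engage.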
Sample thread stacks and the corresponding activity:
An idle WebContainer thread that is waiting for incoming requests looks like this:
3XMTHREADINFO "WebContainer : 1" (TID:0x56F8DD00, sys_thread_t:0x51DF0478, state:CW, native ID:0x00000B58) prio=5
4XESTACKTRACE at java/lang/Object.wait(Native Method)
4XESTACKTRACE at java/lang/Object.wait(Object.java:231(Compiled Code))
4XESTACKTRACE at com/ibm/ws/util/BoundedBuffer.waitGet_(BoundedBuffer.java:190(Compiled Code))
4XESTACKTRACE at com/ibm/ws/util/BoundedBuffer.take(BoundedBuffer.java:545(Compiled Code))
4XESTACKTRACE at com/ibm/ws/util/ThreadPool.getTask(ThreadPool.java:817(Compiled Code))
4XESTACKTRACE at com/ibm/ws/util/ThreadPool$Worker.run(ThreadPool.java:1480(Compiled Code))
If all the WebContainer threads are in idle state as shown above, and the server is still not responding to any requests, it indicates that the Web Server or something in front of WebSphere Portal is causing a bottleneck.
Examine the code stack executing on thread WebContainer : 1, and follow the progress of this thread in subsequent Javacores:
Multiple WebContainer threads waiting on a lock, held by “WebContainer : 1”:
3LKMONOBJECT java/lang/Object@070000003A981CD8/070000003A981CF0: owner "WebContainer : 1" (0x000000011C8CF700), entry count 1
3LKNOTIFYQ Waiting to be notified:
3LKWAITNOTIFY "WebContainer : 2" (0x000000011D5AC800)
3LKWAITNOTIFY "WebContainer : 3" (0x000000011DF18A00)
3LKWAITNOTIFY "WebContainer : 4" (0x000000011DGDA600)
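To find the thread that the others are queued behind, you can group the waiters by monitor owner. The following sketch assumes the 3LKMONOBJECT and 3LKWAITNOTIFY line layout shown above, and also accepts other 3LKWAIT... line types on the assumption that they carry the thread name in the same position. The function name is illustrative.

```python
import re

OWNER_RE = re.compile(r'3LKMONOBJECT\s+(\S+):\s+owner\s+["“](.+?)["”]')
WAITER_RE = re.compile(r'3LKWAIT\w+\s+["“](.+?)["”]')


def lock_waiters(javacore_lines):
    """Map each monitor owner to the threads queued on the monitors it holds."""
    waiters_by_owner = {}
    owner = None
    for line in javacore_lines:
        owner_match = OWNER_RE.search(line)
        if owner_match:
            owner = owner_match.group(2)
            waiters_by_owner.setdefault(owner, [])
            continue
        if line.lstrip().startswith("3LKMONOBJECT"):
            # A monitor entry without a named owner: ignore its waiters.
            owner = None
            continue
        waiter_match = WAITER_RE.search(line)
        if waiter_match and owner is not None:
            waiters_by_owner[owner].append(waiter_match.group(1))
    return waiters_by_owner


sample = [
    '3LKMONOBJECT java/lang/Object@070000003A981CD8/070000003A981CF0: owner "WebContainer : 1" (0x000000011C8CF700), entry count 1',
    '3LKNOTIFYQ Waiting to be notified:',
    '3LKWAITNOTIFY "WebContainer : 2" (0x000000011D5AC800)',
    '3LKWAITNOTIFY "WebContainer : 3" (0x000000011DF18A00)',
]
print(lock_waiters(sample))
```

An owner with a long list of waiters is the thread whose stack deserves the closest look in subsequent Javacores.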
2.4 Lack of response from external sources
Applications often rely on external sources such as a database or an LDAP server to process requests. Application servers access the database through Datasource connection pools. When the connections in the pool run out, the WebContainer threads hang while waiting for a connection from the pool.
To determine whether this is the reason for server hang, you can look in the threaddumps to see how many threads are waiting on the connections from the pool.
A typical stack of a thread waiting for a pooled Datasource connection:
3XMTHREADINFO "WebContainer : 27" (TID:0x807C4D68,sys_thread_t:0x4533CE28, state:CW, native ID:0x83D2) prio=5
4XESTACKTRACE at java.lang.Object.wait(Native Method)
4XESTACKTRACE at com.ibm.ejs.j2c.poolmanager.FreePool.queueRequest(FreePool.java(Compiled Code))
4XESTACKTRACE at com.ibm.ejs.j2c.poolmanager.FreePool.createOrWaitForConnection(FreePool.java(Compiled Code))
4XESTACKTRACE at com.ibm.ejs.j2c.poolmanager.PoolManager.reserve(PoolManager.java(Compiled Code))
4XESTACKTRACE at com.ibm.ejs.j2c.ConnectionManager.allocateMCWrapper(ConnectionManager.java(Compiled Code))
When a large number of requests are waiting on pooled connections, make sure the Connection pool size is set greater than the Threadpool size. Also, make sure that slow Database responses are not delaying the release of established connections back to the pool.
A typical stack of a thread waiting on a Database response:
3XMTHREADINFO "WebContainer : 1" (TID:0x3030FC00, sys_thread_t:0x806FE328, state:R, native ID:0x4ED9) prio=5
4XESTACKTRACE at java.net.SocketInputStream.socketRead0(Native Method)
4XESTACKTRACE at java.net.SocketInputStream.read(SocketInputStream.java(Compiled Code))
4XESTACKTRACE at com.ibm.db2.jcc.b.gb.b(gb.java(Compiled Code))
4XESTACKTRACE at com.ibm.db2.jcc.b.gb.c(gb.java(Compiled Code))
4XESTACKTRACE at com.ibm.db2.jcc.b.gb.c(gb.java(Compiled Code))
…….
4XESTACKTRACE at com.ibm.db2.jcc.c.lf.c(lf.java(Compiled Code))
4XESTACKTRACE at com.ibm.db2.jcc.c.lf.next(lf.java(Compiled Code))
4XESTACKTRACE at com.ibm.ws.rsadapter.jdbc.WSJdbcResultSet.next(WSJdbcResultSet.java(Compiled Code))
4XESTACKTRACE at com.ibm.wps.datastore.impl.ResourcePersister.loadDependants(ResourcePersister.java(Compiled Code))
4XESTACKTRACE at com.ibm.wps.datastore.impl.ResourcePersister.findInternal(ResourcePersister.java(Compiled Code))
When threads remain in the above state for a long time, which you can determine from the subsequent Javacores, it indicates that either there is a very large query being run, or the Database is not responding.
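Comparing Javacores by eye becomes error-prone when there are many threads. The sketch below parses two Javacores and lists the threads whose stack is identical in both, which is the "remains in the same state" condition described above. It assumes the 3XMTHREADINFO and 4XESTACKTRACE line layout used in this article; the function names are illustrative and the sample stacks are abbreviated.

```python
def parse_stacks(javacore_lines):
    """Map thread name -> tuple of stack frames."""
    stacks = {}
    current = None
    for line in javacore_lines:
        line = line.strip()
        if line.startswith("3XMTHREADINFO"):
            parts = line.split('"')
            current = parts[1] if len(parts) >= 3 else None
            if current is not None:
                stacks[current] = []
        elif line.startswith("4XESTACKTRACE") and current is not None:
            stacks[current].append(line[len("4XESTACKTRACE"):].strip())
    return {name: tuple(frames) for name, frames in stacks.items()}


def unchanged_threads(earlier_lines, later_lines):
    """Names of threads whose non-empty stack is identical in both Javacores."""
    first = parse_stacks(earlier_lines)
    second = parse_stacks(later_lines)
    return sorted(name for name, frames in first.items()
                  if frames and second.get(name) == frames)


earlier = [
    '3XMTHREADINFO "WebContainer : 1" (TID:0x3030FC00, sys_thread_t:0x806FE328, state:R, native ID:0x4ED9) prio=5',
    '4XESTACKTRACE at java.net.SocketInputStream.socketRead0(Native Method)',
    '4XESTACKTRACE at com.ibm.ws.rsadapter.jdbc.WSJdbcResultSet.next(WSJdbcResultSet.java(Compiled Code))',
    '3XMTHREADINFO "WebContainer : 8" (TID:0x776154F8, sys_thread_t:0x4C89C9A8, state:CW, native ID:0xBDFC) prio=5',
    '4XESTACKTRACE at com.sun.jndi.ldap.Connection.readReply(Connection.java(Compiled Code))',
]
later = [
    '3XMTHREADINFO "WebContainer : 1" (TID:0x3030FC00, sys_thread_t:0x806FE328, state:R, native ID:0x4ED9) prio=5',
    '4XESTACKTRACE at java.net.SocketInputStream.socketRead0(Native Method)',
    '4XESTACKTRACE at com.ibm.ws.rsadapter.jdbc.WSJdbcResultSet.next(WSJdbcResultSet.java(Compiled Code))',
    '3XMTHREADINFO "WebContainer : 8" (TID:0x776154F8, sys_thread_t:0x4C89C9A8, state:CW, native ID:0xBDFC) prio=5',
    '4XESTACKTRACE at com.ibm.ws.wmm.ldap.LdapConnectionImpl.search(LdapConnectionImpl.java:2120)',
]
print(unchanged_threads(earlier, later))
```

Note that idle threads also keep the same stack between Javacores, so filter the result by what the stack is doing before drawing conclusions.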
When using Web Content Management, the JCR queries are auto-generated based on the selected criteria, so make sure the Menu and Navigation components are optimally designed and the Database is well maintained by running reorg and dbstats on a regular basis.
Typical Thread stack showing lack of response from LDAP server:
3XMTHREADINFO "WebContainer : 8" (TID:0x776154F8, sys_thread_t:0x4C89C9A8, state:CW, native ID:0xBDFC) prio=5
4XESTACKTRACE at java.lang.Object.wait(Native Method)
4XESTACKTRACE at com.sun.jndi.ldap.Connection.readReply(Connection.java(Compiled Code))
4XESTACKTRACE at com.sun.jndi.ldap.LdapClient.getSearchReply(LdapClient.java(Compiled Code))
4XESTACKTRACE at com.sun.jndi.ldap.LdapClient.search(LdapClient.java(Compiled Code))
4XESTACKTRACE at com.sun.jndi.ldap.LdapCtx.doSearch(LdapCtx.java(Compiled Code))
4XESTACKTRACE at com.sun.jndi.ldap.LdapCtx.searchAux(LdapCtx.java(Compiled Code))
4XESTACKTRACE at com.sun.jndi.ldap.LdapCtx.c_search(LdapCtx.java:1751)
4XESTACKTRACE at com.sun.jndi.toolkit.ctx.ComponentDirContext.p_search(ComponentDirContext.java:386)
4XESTACKTRACE at com.sun.jndi.toolkit.ctx.PartialCompositeDirContext.search(PartialCompositeDirContext.java:347)
4XESTACKTRACE at javax.naming.directory.InitialDirContext.search(InitialDirContext.java:259)
4XESTACKTRACE at com.ibm.ws.wmm.ldap.LdapConnectionImpl.searchAll(LdapConnectionImpl.java:3528)
4XESTACKTRACE at com.ibm.ws.wmm.ldap.LdapConnectionImpl.search(LdapConnectionImpl.java:3678)
4XESTACKTRACE at com.ibm.ws.wmm.ldap.LdapConnectionImpl.search(LdapConnectionImpl.java:2120)
When there is no response from the external sources, engage the appropriate administrators to check the problem on the Database or LDAP side. Make sure all the Web Content Management & WebSphere Portal best practices and tuning guides are followed when tuning the backend resources.
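The stack patterns in this section can be turned into a rough classifier that labels what each WebContainer thread is waiting on. The signatures below are taken from the sample stacks in this article; they are assumptions rather than an exhaustive list, and the database signature only recognizes the DB2 JCC driver. The names WAIT_SIGNATURES and classify_stack are illustrative.

```python
# (label, substring to look for in any frame); checked in this order.
WAIT_SIGNATURES = [
    ("connection pool", "com.ibm.ejs.j2c.poolmanager.FreePool.createOrWaitForConnection"),
    ("LDAP", "com.sun.jndi.ldap.Connection.readReply"),
    ("database", "com.ibm.db2.jcc."),
    ("idle", "com.ibm.ws.util.ThreadPool.getTask"),
]


def classify_stack(frames):
    """Return a label describing what a thread stack is waiting on."""
    # Javacores may separate packages with '/' or '.', so normalize first.
    normalized = [frame.replace("/", ".") for frame in frames]
    for label, signature in WAIT_SIGNATURES:
        if any(signature in frame for frame in normalized):
            return label
    return "other"


idle_stack = [
    "at java/lang/Object.wait(Native Method)",
    "at com/ibm/ws/util/BoundedBuffer.take(BoundedBuffer.java:545(Compiled Code))",
    "at com/ibm/ws/util/ThreadPool.getTask(ThreadPool.java:817(Compiled Code))",
]
ldap_stack = [
    "at java.lang.Object.wait(Native Method)",
    "at com.sun.jndi.ldap.Connection.readReply(Connection.java(Compiled Code))",
]
print(classify_stack(idle_stack), classify_stack(ldap_stack))
```

Counting the labels across all WebContainer threads gives a quick indication of which administrator to engage: many "connection pool" or "database" threads point to the Database side, and many "LDAP" threads point to the directory server.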
2.5 Checking logs for hung threads
IBM WebSphere Application Server provides a hung-thread detection function, whereby the thread monitor checks all managed threads in the system. When ThreadMonitor detects that a thread has been active longer than the time defined by the thread monitor threshold, the application server logs a warning in the WebSphere Application Server log.
This warning indicates the name of the thread that is hung and how long it has already been active.
The following message is written to the log:
[8/25/09 17:15:30:335 EST] 00000020 ThreadMonitor W WSVR0605W: Thread "WebContainer : 0" (0000004a) has been active for 722918 milliseconds and may be hung. There is/are 2 thread(s) in total in the server that may be hung.
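If you need to collect these warnings from a large log, a short script can extract the thread name, thread ID, and active time. The sketch below assumes the WSVR0605W message wording shown above; the function name is illustrative.

```python
import re

HUNG_RE = re.compile(
    r'WSVR0605W: Thread "(?P<name>[^"]+)" \((?P<id>[0-9a-fA-F]+)\) '
    r'has been active for (?P<ms>\d+) milliseconds')


def parse_hung_thread(log_line):
    """Return (thread_name, thread_id, active_ms), or None if the line does not match."""
    match = HUNG_RE.search(log_line)
    if match is None:
        return None
    return match.group("name"), match.group("id"), int(match.group("ms"))


line = ('[8/25/09 17:15:30:335 EST] 00000020 ThreadMonitor W WSVR0605W: Thread '
        '"WebContainer : 0" (0000004a) has been active for 722918 milliseconds '
        'and may be hung. There is/are 2 thread(s) in total in the server that may be hung.')
name, thread_id, active_ms = parse_hung_thread(line)
print(name, thread_id, active_ms // 60000, "minutes")
```

The thread name extracted here is the one to look up in the Javacore taken at about the same time.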
Starting with WebSphere Application Server 6.0.2.29, a new property is available, which can be used to automatically generate threaddumps whenever a hung thread is detected:
com.ibm.websphere.threadmonitor.dump.java
NOTE:
- Value: Set to true to cause a Javacore to be created when a hung thread is detected and a WSVR0605W message is printed.
- The thread reported in the SystemOut.log can be cross-checked with the Javacore.
- The stack executing on that thread has remained unchanged for the number of milliseconds reported by the ThreadMonitor.
3 Conclusion
Hopefully you now understand how to identify the areas that can cause a WebSphere Portal server to become unresponsive, including how to determine whether the hang is caused by a lack of memory or by activity on the threads, how to identify which backend resources are causing the bottleneck, and which actions to take next.
4 Resources
MustGather: No response (hang) or performance degradation for IBM WebSphere Portal 5.1:
http://www-01.ibm.com/support/docview.wss?rs=688&uid=swg21209459
IBM WebSphere Portal version 6.1.x Tuning Guide:
http://www-01.ibm.com/support/docview.wss?uid=swg27013972
IBM WebSphere Portal Performance Troubleshooting Guide:
http://www-01.ibm.com/support/docview.wss?uid=swg27007059&aid=1
WebSphere Portal and Lotus Web Content Management performance tuning guides and supplemental content:
http://www-01.ibm.com/support/docview.wss?uid=swg21314715&loc=en_US&cs=utf-8&lang=en
Best practices for using IBM Workplace Web Content Management V6:
http://www.ibm.com/developerworks/websphere/library/techarticles/0701_devos/0701_devos.html
About the author
Anuradha Chitta is an Advisory Software Engineer working with the Web Content Management team at IBM's Pune, India, facility. She was a team lead for Portal Performance and search components in IBM US, and worked extensively with JVM issues related to hangs, crashes, high CPU usage, etc., before relocating to IBM India. Anu holds a Master's degree in Computer Science from LSU, and is an IBM Certified WebSphere ND 6.1 and Portal V6.0 System Administrator.