
4.06.2011

Java Heap leak using Weblogic 10.0 and Apache Lucene 2.3.2

This case study describes the complete root cause analysis of a Java Heap leak problem experienced with Oracle Weblogic 10 and using the open source text search engine Apache Lucene 2.3.2.

Environment specifications

·         Java EE server: Oracle Weblogic Portal 10.0
·         OS: Solaris 10 64-bit
·         JDK: Sun Java HotSpot 1.5.0_11-b03 32-bit
·         Java VM arguments: -XX:PermSize=512m -XX:MaxPermSize=512m -XX:+UseParallelGC  -Xms1536m -Xmx1536m -XX:+HeapDumpOnOutOfMemoryError
·         Search Engine API: Apache Lucene 2.3.2
·         Platform type: Portal application

Monitoring and troubleshooting tools
·         VisualGC 3.0 (Java Heap monitoring)
·         IBM Support Assistant 4.1 - Memory Analyzer 0.6.0.2 (Heap Dump analysis)

Problem overview

·         Problem type: Java Heap memory leak observed via JConsole monitoring


A Java Heap leak was detected in our production environment following a capacity planning initiative for the Java EE environment and infrastructure, which involved close data monitoring and analysis.

This finding also explained why the support team had to restart the Weblogic environment on a weekly basis in order to avoid severe performance degradation. Gradual performance degradation of a Java EE server over time is often a symptom of a memory or resource leak.

Gathering and validation of facts

As usual, a Java EE problem investigation requires gathering technical and non-technical facts so we can either derive other facts or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:

·        What is the client impact? HIGH (if Weblogic is not restarted every week)
·        Recent change of the affected platform? No
·        Any recent traffic increase to the affected platform? No
·        How long has this problem been observed? The problem had been identified several months earlier, but no corrective action was taken until now
·        Is the Java Heap depletion happening suddenly or over time? It was observed via VisualGC that the Java Heap (old generation space) is increasing over time, with a full depletion rate of ~7 days
·        Did a restart of the Weblogic server resolve the problem? No; a Weblogic restart is currently used only as a mitigation strategy and workaround

·         Conclusion #1: The problem is related to a memory leak of the Java Heap space, with a full Java Heap depletion / failure rate of ~7 days

Java Heap monitoring

The Java Heap old generation and Eden space were both monitored using the Java VisualGC 3.0 monitoring tool. The review of the VisualGC data was quite conclusive: our application is leaking the Java Heap old generation space on a regular basis. The next logical step was the Heap Dump analysis.
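For reference, the same old generation growth can also be tracked programmatically via the standard java.lang.management API. The sketch below is illustrative only and was not part of the original monitoring setup; note that pool names such as "PS Old Gen" depend on the garbage collector in use (here -XX:+UseParallelGC).

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

public class OldGenMonitor {

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                // With -XX:+UseParallelGC the tenured pool is named "PS Old Gen";
                // other collectors use different pool names
                if (pool.getType() == MemoryType.HEAP && pool.getName().contains("Old Gen")) {
                    MemoryUsage usage = pool.getUsage();
                    System.out.println(pool.getName() + ": "
                            + (usage.getUsed() / (1024 * 1024)) + " MB used / "
                            + (usage.getMax() / (1024 * 1024)) + " MB max");
                }
            }
            Thread.sleep(60000L); // sample every minute
        }
    }
}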

JVM running for 6 days and 7 hours



JVM running for 7 days and 8 hours



JVM running for 8 days and 7 hours



Heap Dump analysis

Find below a step-by-step Heap Dump analysis conducted using the ISA 4.1 tool (Memory Analyzer). A sketch showing how such an hprof snapshot can be captured on demand is included after the steps below.

1) Open ISA and load the Heap Dump (hprof format)

2) Select the Leak Suspects Report in order to have a look at the list of memory leak suspects

3) As per below, the Apache Lucene org.apache.lucene.store.RAMInputStream object was identified as our primary leak suspect

4) The final step was to deep dive within one of the Lucene objects in order to identify the source of the leak
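In this environment the hprof snapshot is produced automatically on OutOfMemoryError thanks to the -XX:+HeapDumpOnOutOfMemoryError argument already in place. As a side note, on JDK 6 or later (not the JDK 1.5 used here) a dump can also be triggered on demand via the HotSpotDiagnostic MXBean; the sketch below is for reference only and uses a hypothetical output path.

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;

import com.sun.management.HotSpotDiagnosticMXBean;

public class HeapDumper {

    public static void dumpHeap(String filePath, boolean liveObjectsOnly) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                server, "com.sun.management:type=HotSpotDiagnostic", HotSpotDiagnosticMXBean.class);
        // Writes a binary hprof snapshot that can be loaded in ISA / Memory Analyzer
        bean.dumpHeap(filePath, liveObjectsOnly);
    }

    public static void main(String[] args) throws Exception {
        dumpHeap("/tmp/weblogic_heap.hprof", true); // hypothetical output path
    }
}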



Conclusion

The primary Java Heap memory leak seems to originate from the Apache Lucene framework and is due to java.lang.ThreadLocal variables still maintaining references to org.apache.lucene.store.RAMInputStream instances, with a memory footprint of up to 30 MB for each instance.
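The retention pattern can be illustrated with the minimal sketch below. This is illustrative code only, not Lucene source; BigBuffer is a hypothetical stand-in for the ~30 MB RAMInputStream instances observed in the heap dump.

public class ThreadLocalRetention {

    // Hypothetical stand-in for a large per-thread object such as the
    // org.apache.lucene.store.RAMInputStream instances seen in the heap dump
    static class BigBuffer {
        final byte[] data = new byte[30 * 1024 * 1024]; // ~30 MB per instance
    }

    private static final ThreadLocal<BigBuffer> perThreadBuffer = new ThreadLocal<BigBuffer>() {
        protected BigBuffer initialValue() {
            return new BigBuffer();
        }
    };

    static void handleRequest() {
        BigBuffer buffer = perThreadBuffer.get();
        // ... use the buffer for the current request ...
        // No remove() is called here: the value stays strongly referenced from the
        // worker thread's internal ThreadLocalMap. With long-lived Weblogic execute
        // threads, one ~30 MB value per thread accumulates in the old generation.
    }

    static void cleanup() {
        // Explicit cleanup releases the per-thread reference immediately
        perThreadBuffer.remove();
    }
}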

Root cause and solution: Apache Lucene bug report #1383!

We did some research on the Apache issue tracking system and found bug LUCENE-1383, reported back in 2008, which correlated with our Heap Dump analysis findings. Find below a description of the problem:

"Java's ThreadLocal is dangerous to use because it is able to take a surprisingly very long time to release references to the values you store in it. Even when a ThreadLocal instance itself is GC'd, hard references to the values you had stored in it are easily kept for quite some time later.

While this is not technically a "memory leak", because eventually (when the underlying Map that stores the values cleans up its "stale" references) the hard reference will be cleared, and GC can proceed, its end behaviour is not different from a memory leak in that under the right situation you can easily tie up far more memory than you'd expect, and then hit unexpected OOM error despite allocating an extremely large heap to your JVM.”

“The patch adds CloseableThreadLocal. It's a wrapper around ThreadLocal that wraps the values inside a WeakReference, but then also holds a strong reference to the value (to ensure GC doesn't reclaim it) until you call the close method. On calling close, GC is then free to reclaim all values you had stored; regardless of how long it takes ThreadLocal's implementation to actually release its references.“
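The approach described in the patch can be summarized with the following simplified sketch. This is illustrative only and is not the actual Lucene source of org.apache.lucene.util.CloseableThreadLocal.

import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

public class SimpleCloseableThreadLocal<T> {

    // The ThreadLocal only holds a WeakReference, so it never keeps the value alive by itself
    private ThreadLocal<WeakReference<T>> local = new ThreadLocal<WeakReference<T>>();

    // Strong references held until close(); keyed weakly by thread so dead threads drop out
    private Map<Thread, T> hardRefs = new WeakHashMap<Thread, T>();

    public synchronized void set(T value) {
        local.set(new WeakReference<T>(value));
        hardRefs.put(Thread.currentThread(), value);
    }

    public T get() {
        WeakReference<T> ref = local.get();
        return ref == null ? null : ref.get();
    }

    public synchronized void close() {
        // Drop the strong references; the WeakReferences left behind in each
        // thread's ThreadLocalMap no longer keep the values reachable, so GC
        // can reclaim them immediately. The instance must not be used after close().
        hardRefs = null;
        local = null;
    }
}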


This problem was fixed starting in Apache Lucene version 2.4.

Solution and next steps

The solution will require an upgrade of Apache Lucene from version 2.3.2 to 3.1. The project is still in the early analysis phase and I will provide an update and results as soon as they are available.
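As a preview of the target usage pattern, the sketch below assumes the Lucene 3.1 API and a hypothetical index location. The important point is that the IndexSearcher and IndexReader are closed deterministically so Lucene can release its per-thread state (via the CloseableThreadLocal introduced by the fix).

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class LuceneSearchSketch {

    public static void main(String[] args) throws Exception {
        // Hypothetical index location for illustration purposes
        FSDirectory directory = FSDirectory.open(new File("/apps/portal/lucene-index"));
        IndexReader reader = IndexReader.open(directory);
        IndexSearcher searcher = new IndexSearcher(reader);
        try {
            TopDocs hits = searcher.search(new TermQuery(new Term("title", "weblogic")), 10);
            System.out.println("Total hits: " + hits.totalHits);
        } finally {
            // Deterministic close so Lucene can release its per-thread resources
            searcher.close();
            reader.close();
            directory.close();
        }
    }
}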
