Solving Java GC Pause Outages in Production

Just thinking about how to configure HAProxy with two backend Java servers to be HA, despite GC pauses.

Java programs pause periodically to reclaim memory held by objects that are no longer in use, a process known as garbage collection (GC). Each such interruption is called a “GC pause.”

The description “Stop the World” (STW) illustrates their true severity – GC pauses are a slow-motion train wreck for incoming requests. They can last from hundreds of milliseconds to minutes, and require intense CPU activity.

Executive Summary:

  • If you have a latency-sensitive requirement, don’t use Java – use C or Go 1.8+ [GC benchmarks]
  • If you want to use Java, follow the best programming practices listed below to reduce garbage collection pause time, or consider paying Azul $3,500/server
  • HAProxy can be used with option redispatch to load balance across multiple Java servers to maintain availability during GC pauses. You can either use the HAProxy drain feature for rolling deployments, or in more complex setups, iptables.
  • Bonus tip: Java GC pauses don’t only impact your application, they also affect the entire environment like a grenade – performance tools written in Java pause, Tomcat pauses, even reflection APIs are paused.

If you’re new to this topic, please read:

Willy: “I work with people who use a lot of Java applications, and I’ve seen them spend as much time on tuning the JVM as they spend writing the code, and the result is really worth it.” Anybody have some extra time? 😐

My operational requirements for Java in production are:

  1. understand GC pause activity for my application servers
  2. control GC pause activity to a reasonable and bounded extent
  3. configure HAProxy load balancer to not send requests to servers undergoing GC pauses (i.e., don’t lose requests)
  4. use an affordable amount of RAM to accomplish the above, preferably 8 or 16 GB in a shared VM environment.

1. Understand GC pause activity for my application servers

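Detailed GC logging and heap dump on OOM can be enabled with the following flags (these are the Java 8 and earlier spellings; JDK 9 and later use unified logging via -Xlog:gc* instead of the Print* flags):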

-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError

and you can specify a separate GC log with:

-verbose:gc -Xloggc:/tmp/gc.log

See “Understanding Garbage Collection Logs.”
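
Besides the log files, the same counters can be read from inside the JVM. The following is a minimal sketch, not a full monitoring setup: it uses the standard GarbageCollectorMXBean API, and the class name GcStats is just a placeholder. Keep in mind that getCollectionTime() is an approximate accumulated collection time, which for concurrent collectors is not the same thing as pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Prints cumulative GC counts and times for the running JVM. */
class GcStats {

    /** One line per collector: name, collections so far, accumulated collection time. */
    static String summary() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not define the value.
            sb.append(gc.getName())
              .append(": count=").append(gc.getCollectionCount())
              .append(", timeMs=").append(gc.getCollectionTime())
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(summary());
    }
}
```

Sampling summary() on a timer and diffing consecutive values gives a rough picture of how often and how long collections run between samples.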

2. Control GC pause activity to a reasonable and bounded extent

One of the biggest challenges is to control the frequency, duration and intensity of GC pauses …

Some Java configuration approaches:

  • set heap size and compaction percent only somewhat above need. That will cause GCs to be more frequent, but also faster or the opposite …
  • set heap size to a large amount and compaction to 100%, then trigger GC after hours
  • investigate alternate JVMs.

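An example of some of the tuning options (-XX:MaxPermSize applies only to Java 7 and earlier; Java 8 replaced the permanent generation with Metaspace, which is sized with -XX:MaxMetaspaceSize):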

java -Xms512m -Xmx1152m -XX:MaxPermSize=256m -XX:MaxNewSize=256m

JRockit JVM: Tuning For a Small Memory Footprint
Tuning Java Virtual Machines (JVMs)
Weblogic Tuning JVM Garbage Collection for Production Deployments

Programming best practices to reduce GC pauses:

  • use streaming file IO with Files.lines() instead of reading a whole file into a String or HashMap, or use memory-mapped files
  • rewrite portions of your application to use StringBuilder (or StringBuffer where the buffer is shared between threads) instead of repeated String concatenation
  • Reduce object copies – if you do not have a problem with thread safety, then you don’t need immutable objects.
  • call the dispose() method when available, such as on the SWT Image class
  • for HashMaps, call clear() to re-use the already allocated table later, or set the reference to null so the whole map can be collected
  • split the Java server into real-time and batch servers where possible, with appropriately minimal heap sizes for each
  • preallocate arrays and array-backed collections with the expected length (for example, the initial capacity argument of ArrayList) to potentially avoid re-copying the entire array for new elements
  • Note that Java debuggers change the lifetime of variables so that they can be viewed longer scope-wise. Caveat emptor.
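
A few of the practices above can be sketched in code. This is only an illustration under simple assumptions: the class and method names (LowGarbageExamples, countNonBlankLines, join, squares) are made up for the example, and how much garbage each pattern actually saves depends on your workload.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

class LowGarbageExamples {

    /** Streams a file line by line instead of loading it all into one String. */
    static long countNonBlankLines(Path file) {
        // try-with-resources closes the underlying file handle.
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.filter(l -> !l.trim().isEmpty()).count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Appends into one growing buffer rather than creating a new String per "+". */
    static String join(List<String> parts, char sep) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.size(); i++) {
            if (i > 0) sb.append(sep);
            sb.append(parts.get(i));
        }
        return sb.toString();
    }

    /** Sizes the list up front so the backing array is not re-copied while it grows. */
    static List<Integer> squares(int n) {
        List<Integer> out = new ArrayList<>(n);
        for (int i = 0; i < n; i++) out.add(i * i);
        return out;
    }
}
```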

3. Configure HAProxy load balancer requests to not be sent to servers undergoing GC pause events

The first thing to do is to read up on HAProxy’s option redispatch feature. Continue reading for more in-depth considerations.

This is tricky for several reasons:

  • health checks can be passive or active. Both have check gaps that won’t notice a GC starting before a request is sent
  • even if GC notifications are enabled and the server health check is red, HAProxy will not know (see above)
  • even if GC notifications are enabled and the server health check is now green, HAProxy will not know (see above) 🙂
  • the HAProxy options log-health-checks and redispatch may be helpful
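
For the active-check side of the list above, here is a minimal sketch of a server-side health endpoint that HAProxy could poll. It uses the JDK's built-in com.sun.net.httpserver package; the /health path, the draining flag and the class name are hypothetical. It does nothing about the check gap itself: during a stop-the-world pause this endpoint is frozen too, so HAProxy would see a timeout rather than a 503.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.util.concurrent.atomic.AtomicBoolean;

/** Minimal health endpoint: 200 while accepting traffic, 503 while draining. */
class HealthEndpoint {

    // Flipped by whatever logic decides this server should stop taking requests.
    static final AtomicBoolean draining = new AtomicBoolean(false);

    /** Starts the endpoint on the given port (0 picks a free port) and returns the server. */
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/health", exchange -> {
            int status = draining.get() ? 503 : 200;
            // A response length of -1 means "no response body".
            exchange.sendResponseHeaders(status, -1);
            exchange.close();
        });
        server.start();
        return server;
    }
}
```

The application sets HealthEndpoint.draining to true before planned maintenance and back to false afterwards; the health check then goes red or green on the next poll.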

a) Some things to think about:

  1. understand your GC pattern
  2. use HAProxy socket interface to drain, then disable one backend
  3. wait for zero connections
  4. force a GC (easier said than done in Oracle Java since System.gc() is only a request for GC), or restart the Java server
  5. use HAProxy socket interface to enable the Java server.
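
Steps 2 and 5 above could be driven from Java along these lines. This is a sketch under assumptions, not a tested deployment script: it assumes the HAProxy admin socket has been bound to a local TCP port (rather than a Unix socket) with admin level, and the backend name app, server name java1 and port 9999 in the usage note are hypothetical. The "set server <backend>/<server> state ..." command is part of HAProxy's Runtime API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class HaproxyAdmin {

    /** Builds a Runtime API line such as "set server app/java1 state drain". */
    static String setState(String backend, String server, String state) {
        if (!state.equals("ready") && !state.equals("drain") && !state.equals("maint")) {
            throw new IllegalArgumentException("state must be ready, drain or maint");
        }
        return "set server " + backend + "/" + server + " state " + state;
    }

    /** Sends one command to a TCP-bound admin socket and returns HAProxy's reply. */
    static String send(String host, int port, String command) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write((command + "\n").getBytes(StandardCharsets.US_ASCII));
            out.flush();
            // In non-interactive mode HAProxy answers and then closes the connection,
            // so reading until end-of-stream collects the whole reply.
            InputStream in = socket.getInputStream();
            ByteArrayOutputStream reply = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            for (int n; (n = in.read(buf)) != -1; ) {
                reply.write(buf, 0, n);
            }
            return reply.toString("US-ASCII");
        }
    }
}
```

A maintenance run would then look like send("127.0.0.1", 9999, setState("app", "java1", "drain")), wait for connections to reach zero, do the GC or restart, and finish with setState("app", "java1", "ready").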

This method would be risky with two Java servers, since during maintenance on one server, the other could GC pause. (facepalm)

b) Another possible approach would be to handle MemoryPoolMXBean MEMORY_THRESHOLD_EXCEEDED events. If they reliably give advance notice, maybe they can be used to update the health check on the server side, send a drain request to the HAProxy socket, and then force a GC right away, perhaps by trying the JVM Tool Interface (JVM TI) function ForceGarbageCollection()?
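
A minimal sketch of idea (b), using the standard java.lang.management API. The class name, the fraction parameter and the onExceeded callback are hypothetical; what the callback should do (flip a health flag, talk to HAProxy) is left open, and whether the notification arrives early enough to be useful is exactly the open question above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryNotificationInfo;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import javax.management.NotificationEmitter;

class HeapThresholdWatcher {

    /**
     * Sets a usage threshold on every heap pool that supports one and runs
     * onExceeded when the JVM reports MEMORY_THRESHOLD_EXCEEDED.
     * Returns the number of pools that were armed.
     */
    static int arm(double fraction, Runnable onExceeded) {
        if (fraction <= 0 || fraction > 1) {
            throw new IllegalArgumentException("fraction must be in (0, 1]");
        }
        // The platform MemoryMXBean is documented to be a NotificationEmitter.
        NotificationEmitter emitter = (NotificationEmitter) ManagementFactory.getMemoryMXBean();
        emitter.addNotificationListener((notification, handback) -> {
            if (MemoryNotificationInfo.MEMORY_THRESHOLD_EXCEEDED.equals(notification.getType())) {
                onExceeded.run();
            }
        }, null, null);

        int armed = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // null if the pool is no longer valid
            if (usage == null) continue;
            long max = usage.getMax();           // -1 when the pool has no defined maximum
            if (pool.getType() == MemoryType.HEAP && pool.isUsageThresholdSupported() && max > 0) {
                pool.setUsageThreshold((long) (max * fraction));
                armed++;
            }
        }
        return armed;
    }
}
```

For example, HeapThresholdWatcher.arm(0.8, () -> System.out.println("heap above 80%")) would log when a supporting heap pool crosses 80% of its maximum.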

c) And another idea is to write a sentinel file every 250 ms, and if its age reaches 750 ms, assume a GC is happening and drain the server in HAProxy. Unfortunately the JVM TI events GarbageCollectionStart and GarbageCollectionFinish are sent while the VM is stopped, so you’re limited in what you can do when you need the most flexibility.
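
The sentinel file idea in (c) might look roughly like this. It is a sketch with made-up names (SentinelFile, touch, isStale): the Java server would call touch() on a timer, and a separate watcher process outside the paused JVM would call isStale() and then drain the server. One caveat that is an assumption about your environment rather than something from the text: some filesystems store modification times with coarse granularity, which could make a 250 ms / 750 ms scheme unreliable.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.time.Duration;
import java.time.Instant;

class SentinelFile {

    /** Called by the Java server on a timer (every 250 ms in the text) to refresh the file. */
    static void touch(Path sentinel) throws IOException {
        if (Files.notExists(sentinel)) {
            Files.createFile(sentinel);
        }
        Files.setLastModifiedTime(sentinel, FileTime.from(Instant.now()));
    }

    /** Called by an external watcher: true if the file was not refreshed within maxAge. */
    static boolean isStale(Path sentinel, Duration maxAge) throws IOException {
        Instant last = Files.getLastModifiedTime(sentinel).toInstant();
        return Duration.between(last, Instant.now()).compareTo(maxAge) > 0;
    }
}
```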

Some Java 8 classes related to GC notifications:

  1. MemoryPoolMXBean – “The memory usage monitoring mechanism is intended for load-balancing or workload distribution use. For example, an application would stop receiving any new workload when its memory usage exceeds a certain threshold. It is not intended for an application to detect and recover from a low memory condition.”
  2. GarbageCollectionNotificationInfo
  3. GarbageCollectorMXBean
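
To show how the last two classes fit together, here is a minimal sketch that subscribes to GC notifications. GarbageCollectionNotificationInfo lives in the com.sun.management package, so this assumes a HotSpot-style JDK; the class name GcNotifications and the sink callback are hypothetical. Note that each notification is delivered after the collection has finished, so it reports pauses rather than predicting them.

```java
import com.sun.management.GarbageCollectionNotificationInfo;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.function.Consumer;
import javax.management.NotificationEmitter;
import javax.management.openmbean.CompositeData;

class GcNotifications {

    /** Subscribes to every collector that emits notifications; returns how many were subscribed. */
    static int subscribe(Consumer<String> sink) {
        int subscribed = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            if (!(gc instanceof NotificationEmitter)) continue;
            ((NotificationEmitter) gc).addNotificationListener((notification, handback) -> {
                if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                        .equals(notification.getType())) {
                    return;
                }
                GarbageCollectionNotificationInfo info = GarbageCollectionNotificationInfo
                        .from((CompositeData) notification.getUserData());
                // Reported after the fact: collector name, action and duration in ms.
                sink.accept(info.getGcName() + " " + info.getGcAction()
                        + " durationMs=" + info.getGcInfo().getDuration());
            }, null, null);
            subscribed++;
        }
        return subscribed;
    }
}
```

Calling GcNotifications.subscribe(System.out::println) at startup prints one line per collection, which can be compared against the GC log.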

Also, investigate mod_jk and AJP. Tomcat uses the same heap as your application, so tuning is very important here too.

4. Use an affordable amount of RAM to accomplish the above, preferably 8 or 16 GB in a shared VM environment

If you work in a VM consolidation environment, it’s important to minimize the footprint of your base image and also applications. See above for rewriting applications to minimize heap and GCs.

SO: Is there any correlation between an out of memory scenario and blocked threads?
Garbage Collection JMX Notifications Example Code
Blade: A Data Center Garbage Collector
How to Tame Java GC Pauses? Surviving 16GiB Heap and Greater
SO: Garbage Collection Notifications
Letting the Garbage Collector Do Callbacks
How to force garbage collection in Java?
SSL Termination, Load Balancers & Java
Github: Measuring Java Memory Consumption – sample code
Java is not “angry” with you.
Set State to DRAIN vs set weight 0
Scalable web applications [with Java]
Examples of forcing freeing of native memory direct ByteBuffer has allocated, using sun.misc.Unsafe?
Lucene ByteBuffer sample code
Improve availability in Java enterprise applications
The Four Month Bug: JVM statistics cause garbage collection pauses
Memory management when failure is not an option

Making Garbage Collection faster
The Complete Guide to Instrumentation: How to Measure Everything You Need Inside Your Application
Java heap terminology: young, old and permanent generations?
5 Coding Hacks to Reduce GC Overhead

Java Debugger Changes Lifetime of Variables
Objects Should Be Immutable
Thread Safety and Immutability
Azul Blog: So, G1 may become the default collector for Java 9?
Java and Scala Type Systems are Unsound


Golang: sub-millisecond GC pause on production 18gb heap HN
Getting Past C
Go GC: Prioritizing low latency and simplicity
Sub-millisecond GC pauses in Go 1.8 Graphs


CASSANDRA-5345: Potential problem with GarbageCollectorMXBean
Java GC pauses, reality check

