One of the key strengths of the JVM is automatic memory management (garbage collection). Understanding it can help in writing better applications. This becomes all the more important as enterprise server applications have large amounts of live heap data and many parallel threads. Until recently, the main collectors were the parallel collector and the concurrent-mark-sweep (CMS) collector. This blog introduces the various garbage collectors and compares the CMS collector against its replacement, G1, a new implementation in Java 7. G1 is characterized by a single contiguous heap which is split into same-sized regions. In fact, if your application is still running on the 1.5 or 1.6 JVM, the ability to leverage G1 is a compelling argument to upgrade to Java 7.
The Java programming language is widely used in large server applications, which are characterized by large amounts of live heap data and considerable thread-level parallelism. These applications are often run on high-end multiprocessors. Although throughput is important for such applications, they are often also sensitive to latency. This matters particularly for telecommunication and call-processing applications, where delays of even milliseconds in setting up calls can adversely affect the user experience. The Java virtual machine (JVM) specification mandates that any JVM implementation must include a garbage collector (GC) to reclaim unused memory (i.e., unreachable objects). However, the behavior and efficiency of a garbage collector can heavily influence the performance and responsiveness of any application that relies on it.
HotSpot JVM Architecture
The HotSpot JVM architecture supports a strong foundation of features and capabilities that help in realizing high performance and massive scalability. The main components of the JVM include the class loader, the runtime data areas, and the execution engine.
The three main components of the JVM responsible for application performance are the heap, the JIT compiler and the garbage collector. All object data is stored in the heap. This area is then managed by the garbage collector selected at startup. Most tuning options help in sizing the heap and choosing the most appropriate garbage collector. The JIT compiler also has a big impact on performance but rarely requires tuning with the newer versions of the JVM.
While tuning a Java application, the key factors to consider are Responsiveness, Throughput and Footprint.
Responsiveness – Indicates how quickly an application or system responds with a requested piece of data. For applications that intend to be responsive, large pause times are not desirable. The aim is to respond in short periods of time. Examples include:
- How quickly a desktop UI renders pages.
- How prompt a website is.
- How fast a database can be accessed.
Throughput – Maximizing the amount of work by an application in a specific period of time. High pause times may be acceptable for applications that focus on throughput. Throughput may be measured by the following criteria:
- Number of transactions completed in a given time.
- Number of jobs executed by a batch program in an hour.
- The number of database queries completed in a given time.
Footprint – The amount of heap memory occupied by an application.
Typically, when tuning an application, two of these three factors are chosen and worked on. For example, if high throughput with minimal footprint is required, then the application has to compromise on responsiveness: keeping the footprint small requires frequent garbage collection, which may pause the application while each collection runs.
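One rough way to observe the throughput side of this trade-off from inside a running application is to compare cumulative GC time with JVM uptime. The sketch below uses the standard `java.lang.management` MXBeans; the class name `GcOverhead` and the helper `overheadPercent` are illustrative names, and what the reported collection time covers varies by collector, so the result should be treated as approximate.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcOverhead {
    // Percentage of elapsed time attributed to GC; 0 when no time has elapsed.
    static double overheadPercent(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        return 100.0 * gcMillis / uptimeMillis;
    }

    public static void main(String[] args) {
        long totalGcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 if the collector does not report it
            long time = gc.getCollectionTime();   // -1 if the collector does not report it
            System.out.println(gc.getName() + ": collections=" + count + ", time=" + time + " ms");
            if (time > 0) {
                totalGcMillis += time;
            }
        }
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("Approximate GC overhead: %.2f%% of %d ms uptime%n",
                overheadPercent(totalGcMillis, uptime), uptime);
    }
}
```

A low percentage suggests a throughput-friendly configuration, but it says nothing about the length of individual pauses, which is what responsiveness depends on.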
The Java HotSpot VM garbage collectors are based on the generational hypothesis, which rests on the following principles:
- Most objects die young
- Only a few live very long
- The longer they live, the more likely they are to keep living
- Old objects rarely reference young objects
The two observations that matter most here are that most allocated objects die young and that few references from older to younger objects exist. Together they are known as the weak generational hypothesis, which generally holds true for Java applications. To take advantage of this hypothesis, the Java HotSpot VM splits the heap into three physical areas, as depicted in the figure below:
Young Generation – This is the place where most new objects are allocated. It is typically small and collected frequently. Since most objects in young generation are expected to die quickly, the number of objects that survive a young generation collection (also referred to as a minor collection) is expected to be low. Minor collections tend to be very efficient as they concentrate on a space that is usually small and is likely to contain a lot of garbage objects. Young generation is further compartmentalized into an area called Eden plus two smaller survivor spaces. Most objects are initially allocated in Eden. The survivor spaces hold objects that have survived at least one young generation collection and have thus been given additional chances to die before being considered “old enough” to be promoted to the old generation. At any given time, one of the survivor spaces (labeled From in the figure) holds such objects, while the other is empty and remains unused until the next collection.
Old Generation – Objects that are too big to fit in the young generation are allocated directly in the old generation. Similarly, longer-lived objects are promoted (or tenured) to the old generation. The old generation is typically larger than the young generation and fills up more slowly. As a result, old generation collections (also referred to as major collections) are infrequent but lengthy.
Permanent Generation – The Permanent generation contains JVM metadata which describes the classes and methods used in the application. It is populated by the JVM at runtime based on classes in use by the application. In addition, Java SE library classes and methods may be stored here.
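To see these areas on a live JVM, the standard `MemoryPoolMXBean` API can list each memory pool with its current usage. This is a minimal sketch; the class name `HeapPools` and the `describe` helper are illustrative, and the pool names printed depend on the collector and JVM version in use (for example, names containing "Eden Space", "Survivor Space", "Old Gen" or "Perm Gen").

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

class HeapPools {
    // Formats one pool as a single readable line.
    static String describe(String name, String type, long usedBytes) {
        return name + " [" + type + "] used=" + (usedBytes / 1024) + " KB";
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // may be null if the pool is no longer valid
            long used = (usage == null) ? 0 : usage.getUsed();
            System.out.println(describe(pool.getName(), pool.getType().toString(), used));
        }
    }
}
```

Running it once with each collector selected is a quick way to check which generations the JVM has actually set up.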
Different garbage collection strategies are followed for different regions.
Serial Collector – The Serial Collector does collection for both young and old generation, serially. There is a stop-the-world pause during both minor and major collections. The application processing resumes once the collection is finished. The Serial Collector works fine for client side applications that do not have low pause time requirements.
Parallel Collector – These days, machines with a lot of physical memory and multiple CPUs are quite common. The parallel collector takes advantage of multiple CPUs rather than keeping most of them idle while only one does garbage collection work. It uses a parallel version of the young generation collection algorithm utilized by the serial collector. It is still a stop-the-world, copying collector, but performing the young generation collection in parallel decreases garbage collection overhead and increases application throughput. Server-side applications that run on machines with multiple CPUs and don’t have pause time constraints benefit from the parallel collector.
Concurrent Mark-Sweep (CMS) Collector – In many cases, end-to-end throughput is not as important as fast response time. Young generation collections do not cause long pauses most of the time. However, old generation collections, though infrequent, can cause long pauses, especially when large heaps are involved. To address this issue, the HotSpot JVM includes a collector called the concurrent mark-sweep (CMS) collector, also known as the low-latency collector. It collects the young generation the same way the Parallel and Serial Collectors do. Its old generation, however, is collected concurrently with the application threads, which results in shorter pauses.
Let us now see how CMS does garbage collection for both the young and old generations.
Young Generation Collection by CMS Collector – CMS collects the young generation using multiple threads, just like the Parallel Collector. The figure below illustrates a typical heap ready for young generation collection:
The young generation comprises one Eden and two Survivor spaces. The live objects in Eden are copied to the initially empty survivor space, labeled S1 in the figure, except for ones that are too large to fit comfortably in the S1 space. Such objects are directly copied to the old generation. The live objects in the occupied survivor space (labeled S0) that are still relatively young are also copied to the other survivor space, while objects that are relatively old are copied to the old generation. If the S1 space becomes full, the live objects from Eden or S0 that have not been copied to it are tenured, regardless of their age. Any objects remaining in Eden or the S0 space after live objects have been copied are not live and need not be examined. The figure below illustrates the heap after young generation collection:
A young generation collection causes a stop-the-world pause. After collection, Eden and one survivor space are empty. Now let’s see how CMS handles old generation collection. It essentially consists of two major steps: marking all live objects and sweeping away the garbage.
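The copying rules for a young generation collection described above can be illustrated with a toy model. This is a simplified sketch, not HotSpot's implementation: the class `MinorGcSketch`, the `Obj` and `Result` types, the explicit `live` flag and the integer sizes are all assumptions made for illustration (a real collector discovers liveness by tracing from roots).

```java
import java.util.ArrayList;
import java.util.List;

class MinorGcSketch {
    static final class Obj {
        final int size;
        final int age;      // number of minor collections survived so far
        final boolean live; // assumed known here; a real collector finds this by tracing
        Obj(int size, int age, boolean live) {
            this.size = size;
            this.age = age;
            this.live = live;
        }
    }

    static final class Result {
        final List<Obj> toSpace = new ArrayList<>();  // survivors copied to the empty survivor space (S1)
        final List<Obj> promoted = new ArrayList<>(); // objects tenured into the old generation
    }

    static Result collect(List<Obj> eden, List<Obj> fromSpace, int toCapacity, int tenuringThreshold) {
        Result result = new Result();
        int used = 0;
        List<Obj> candidates = new ArrayList<>(eden);
        candidates.addAll(fromSpace);
        for (Obj o : candidates) {
            if (!o.live) {
                continue; // dead objects are simply left behind
            }
            int newAge = o.age + 1;
            boolean oldEnough = newAge >= tenuringThreshold;
            boolean fits = used + o.size <= toCapacity;
            if (oldEnough || !fits) {
                // Old objects, and objects that do not fit in S1, go to the old generation.
                result.promoted.add(new Obj(o.size, newAge, true));
            } else {
                result.toSpace.add(new Obj(o.size, newAge, true));
                used += o.size;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Obj> eden = new ArrayList<>();
        eden.add(new Obj(10, 0, true));
        eden.add(new Obj(10, 0, false));
        eden.add(new Obj(50, 0, true));
        List<Obj> s0 = new ArrayList<>();
        s0.add(new Obj(10, 3, true));
        s0.add(new Obj(10, 1, true));
        Result r = collect(eden, s0, 30, 4);
        System.out.println("copied to S1: " + r.toSpace.size() + ", promoted: " + r.promoted.size());
    }
}
```

After the call, everything left in Eden and S0 is garbage, so both spaces can be treated as empty without examining their contents.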
Old Generation Collection in CMS
The marking is done in stages. It begins with a short stop-the-world pause, called initial mark, which identifies the set of objects that are immediately reachable from outside the heap. Thereafter, during the concurrent marking phase, the collector marks all live objects that are transitively reachable from this set. Since the application is running and updating reference fields (hence, modifying the object graph) while the marking phase is taking place, not all live objects are guaranteed to be marked at the end of the concurrent marking phase. To take care of this, the application stops again for a second pause, called remark, which finalizes marking by revisiting any objects that were modified during the concurrent marking phase. As the remark pause can be substantial, multiple threads are used to increase its efficiency. At the end of the remark phase, all live objects in the heap are guaranteed to have been marked. Since revisiting objects during the remark phase increases the amount of work the collector has to do, its overhead increases as well. This is a typical trade-off for most collectors that attempt to reduce pause times.
The heap structure after mark phase(s) can be seen below where the live objects are in light blue colored blocks.
Once all live objects in the old generation have been marked, the concurrent sweeping phase sweeps over the heap, de-allocating garbage objects in place without relocating the live ones. As the figure illustrates, objects marked with a dark color are assumed to be garbage. After the sweeping phase, the dark colored objects are removed and only the blue colored (live) objects remain. The free space is not contiguous, so the collector needs a data structure (free lists, in this case) that records which parts of the heap contain free space. As a result, allocation into the old generation is more expensive. This imposes extra overhead on minor collections, as most allocations in the old generation take place when objects are promoted during minor collections.
Another disadvantage of CMS is that it typically has larger heap requirements. There are a few reasons for this. First, a concurrent marking cycle lasts longer than that of a stop-the-world collector, and space is actually reclaimed only during the sweeping phase. Because the application keeps running during the marking phase, it can also allocate memory, so the occupancy of the old generation may grow during marking and drop only during sweeping. Additionally, although the collector guarantees that it identifies all live objects during the marking phase, it does not guarantee that it identifies all garbage. Objects that become garbage during the marking phase may or may not be reclaimed during that cycle; if they are not, they are reclaimed during the next cycle. Garbage objects that are wrongly identified as live are usually referred to as floating garbage.
The heap becomes fragmented due to the lack of compaction, which might also prevent the collector from using the available space as efficiently as possible.
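The effect of sweeping without compaction can be shown with a small model. The sketch below is an illustration under simplifying assumptions, not the CMS implementation: the heap is an array of block sizes, `live[i]` records the result of marking, and the class name `SweepSketch` and its methods are hypothetical. Adjacent garbage blocks are coalesced into free-list entries, but live blocks never move.

```java
import java.util.ArrayList;
import java.util.List;

class SweepSketch {
    // Returns the free list (sizes of contiguous free chunks, in address order)
    // left behind after sweeping unmarked blocks in place.
    static List<Integer> sweep(int[] sizes, boolean[] live) {
        List<Integer> freeList = new ArrayList<>();
        int run = 0; // size of the free chunk currently being accumulated
        for (int i = 0; i < sizes.length; i++) {
            if (live[i]) {
                if (run > 0) {
                    freeList.add(run);
                    run = 0;
                }
            } else {
                run += sizes[i];
            }
        }
        if (run > 0) {
            freeList.add(run);
        }
        return freeList;
    }

    static int totalFree(List<Integer> freeList) {
        int total = 0;
        for (int chunk : freeList) {
            total += chunk;
        }
        return total;
    }

    // Without compaction, a request must fit in a single free chunk.
    static boolean canAllocate(List<Integer> freeList, int request) {
        for (int chunk : freeList) {
            if (chunk >= request) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] sizes = {4, 2, 4, 2, 4};
        boolean[] live = {false, true, false, true, false};
        List<Integer> freeList = sweep(sizes, live);
        System.out.println("free chunks: " + freeList + ", total free: " + totalFree(freeList));
        System.out.println("can allocate 8? " + canAllocate(freeList, 8));
    }
}
```

In the sample heap there is more total free space than the request needs, yet no single chunk is large enough. This is the fragmentation problem that a compacting collector avoids.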
Finally, it is very tedious to tune the CMS collector. There are lots of options, and it takes a lot of experimentation to arrive at the best configuration for a particular application.
Introducing Garbage-First (G1) Collector
In order to overcome the shortcomings of CMS without compromising throughput, the Garbage-First (G1) Collector has been introduced. The G1 collector is a server-style garbage collector, targeted for multi-processors with large memories, that meets a soft real-time goal with high probability while achieving high throughput. The G1 garbage collector is fully supported in Oracle JDK 7 update 4 and later releases. The G1 collector is suitable for applications that:
- Need the collector to work concurrently with application threads, as the CMS collector does.
- Need free space compacted without lengthy GC-induced pauses.
- Require more predictable GC pause durations.
- Want reasonable throughput performance.
- Do not want a much larger Java heap.
G1 is planned as the long-term replacement for the Concurrent Mark-Sweep Collector (CMS) and offers several benefits in comparison. First and foremost, G1 is a compacting collector: it compacts enough to largely eliminate potential fragmentation issues. G1 also offers more predictable garbage collection pauses than the CMS collector and allows users to specify desired pause targets. The Refinement, Marking and Cleanup phases are concurrent, while the young generation collection is done using multiple threads in parallel. Full garbage collection continues to be single-threaded, but a properly tuned application should avoid full GCs. Another major benefit is that G1 is easy to use and tune. When performing garbage collections, G1 operates in a manner similar to the CMS collector. G1 performs a concurrent global marking phase to determine the liveness of objects throughout the heap. After the mark phase completes, G1 knows which regions are mostly empty, and it collects these regions first, which usually yields a large amount of free space. This is why this method of garbage collection is called Garbage-First.
G1 Heap Overview
The heap in G1 is organized differently from that of the earlier generational collectors. It is one large contiguous space partitioned into a set of equal-sized heap regions (approximately 2000 in number). The region size is chosen at startup and varies from 1 MB to 32 MB. There is no physical separation between the young and old generations: a region may act as either Eden, survivor or old generation. This provides greater flexibility in memory usage. Objects are moved between regions during collections. For large objects (> 50% of region size), humongous regions are used. Currently, G1 is not optimized for collecting large objects in humongous regions.
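The sizing rules above can be turned into a small calculation. The sketch below is a simplified approximation rather than HotSpot's exact heuristic: it assumes a target of 2048 regions, rounds the region size down to a power of two, clamps it to the 1 MB to 32 MB range, and applies the "more than 50% of a region" rule for humongous objects. The class and method names are hypothetical.

```java
class G1RegionSketch {
    static final long MB = 1024L * 1024L;
    static final long MIN_REGION = 1 * MB;
    static final long MAX_REGION = 32 * MB;
    static final long TARGET_REGIONS = 2048; // assumed target, roughly the "2000 regions" in the text

    // Approximate region size for a given heap size, in bytes.
    static long regionSize(long heapBytes) {
        long raw = heapBytes / TARGET_REGIONS;
        if (raw <= MIN_REGION) {
            return MIN_REGION;
        }
        if (raw >= MAX_REGION) {
            return MAX_REGION;
        }
        return Long.highestOneBit(raw); // round down to a power of two
    }

    // An object larger than half a region is treated as humongous.
    static boolean isHumongous(long objectBytes, long regionBytes) {
        return objectBytes > regionBytes / 2;
    }

    public static void main(String[] args) {
        long heap = 8L * 1024 * MB; // an 8 GB heap, chosen only as an example
        long region = regionSize(heap);
        System.out.println("region size: " + (region / MB) + " MB, regions: " + (heap / region));
        System.out.println("3 MB object humongous? " + isHumongous(3 * MB, region));
    }
}
```

Under these assumptions, a heap of about 1 GB gets 1 MB regions, so a single 4 MB array would be allocated as a humongous object spanning several regions.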
Young Generation Collection in G1
Live objects are evacuated (copied/moved) to one or more survivor regions. If the aging threshold is met, some of the objects are promoted to old generation regions. This involves a stop-the-world (STW) pause, and it is done in parallel with multiple threads to shorten the pause time. Eden size and survivor size are calculated for the next young GC, taking the pause time goal into consideration. This approach makes it very easy to resize the Eden and survivor areas, making them bigger or smaller as needed.
Old Generation Collection in G1
The old generation collection starts with the Initial Marking phase, which is piggybacked on a young generation collection. It is a stop-the-world event in which survivor regions (root regions) that may have references to objects in the old generation are marked.
In the Concurrent Marking phase, liveness information per region is determined while the application is running. Live objects are found over the entire heap. This activity may be interrupted by young generation collections. Any empty regions found (denoted as X) are removed immediately in the Remark phase.
The Remark phase completes the marking of live objects in the heap. G1 collector uses an algorithm called snapshot-at-the-beginning (SATB) which is much faster than what is used in the CMS collector. Empty regions are removed and reclaimed. Region liveness is now calculated for all regions and this is a stop-the-world event.
In the Copying/Cleanup phase, G1 selects the regions with the lowest “liveness”. These regions can be collected fastest, and they are collected at the same time as a young GC, so both young and old generations are collected together.
After the cleanup phase, the selected regions are collected and compacted. This is represented by the dark blue and dark green regions shown in the figure. Some garbage objects may be left in old generation regions; they may be collected later based on future liveness, the pause time target and the number of unused regions.
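The "garbage first" selection of old regions can be sketched as a greedy choice over per-region liveness. This is a toy model under stated assumptions, not G1's actual policy: the cost of collecting a region is taken to be proportional to its live bytes, the pause target is expressed as a copy budget in bytes, and `CSetSketch`, `Region` and `choose` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class CSetSketch {
    static final class Region {
        final int id;
        final long liveBytes; // liveness computed by the marking phases
        Region(int id, long liveBytes) {
            this.id = id;
            this.liveBytes = liveBytes;
        }
    }

    // Picks old regions with the least live data first, until the liveness
    // threshold or the copy budget (a stand-in for the pause target) is reached.
    static List<Integer> choose(List<Region> oldRegions, long regionBytes,
                                int liveThresholdPercent, long copyBudgetBytes) {
        List<Region> sorted = new ArrayList<>(oldRegions);
        Collections.sort(sorted, new Comparator<Region>() {
            @Override
            public int compare(Region a, Region b) {
                return Long.compare(a.liveBytes, b.liveBytes);
            }
        });
        List<Integer> chosen = new ArrayList<>();
        long budgetLeft = copyBudgetBytes;
        for (Region r : sorted) {
            long livePercent = 100 * r.liveBytes / regionBytes;
            if (livePercent > liveThresholdPercent) {
                break; // remaining regions are even fuller, not worth evacuating
            }
            if (r.liveBytes > budgetLeft) {
                break; // copying this region would exceed the budget
            }
            chosen.add(r.id);
            budgetLeft -= r.liveBytes;
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<Region> regions = new ArrayList<>();
        regions.add(new Region(0, 90));
        regions.add(new Region(1, 10));
        regions.add(new Region(2, 5));
        regions.add(new Region(3, 40));
        regions.add(new Region(4, 70));
        System.out.println("collection set: " + choose(regions, 100, 65, 60));
    }
}
```

Regions that are skipped keep their garbage for now, which matches the observation that some garbage may remain in old regions until a later cycle.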
G1 Old Generation GC Summary
- Concurrent Marking Phase
  - Calculates liveness information per region, concurrently while the application is running
  - Identifies best regions for subsequent evacuation phases
  - No corresponding sweeping phase
- Remark Phase
  - Different marking algorithm than CMS
  - Uses Snapshot-at-the-beginning (SATB) which is much faster than what was being used in CMS
- Copying/Cleanup Phase
  - Completely empty regions are reclaimed
  - Young generation and Old generation reclaimed at the same time
  - Old generation regions selected based on their liveness
G1 and CMS Comparison
| Features | G1 GC | CMS GC |
|---|---|---|
| Concurrent and Generational | Yes | Yes |
| Releases Maximum Heap memory after usage | Yes | No |
| Physical Separation between Young and Old | No | Yes |
For the same application size, the heap is likely to be larger with G1 than with CMS because of additional accounting data structures:
Remembered Sets (RSets / RSet) – The RSets track object references into a given region, and there is one RSet per region. This enables parallel and independent collection of a region, as there is no need to scan the whole heap to find references into it. Footprint overhead due to RSets is less than 5%. More inter-region references lead to bigger Remembered Sets, which in turn lead to slower GC.
Collection Sets (CSets / CSet) – The CSet is the set of regions that will be collected in a GC cycle. The regions can be Eden and survivor regions and, optionally after (concurrent) marking, some old generation regions. All live data in a CSet is evacuated (copied/moved) during the GC. It has a footprint overhead of less than 1%.
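The idea behind a remembered set can be sketched with a map from each region to the regions that hold references into it. This is a deliberately coarse model: the description above only says that an RSet tracks references into its region, so the region-level granularity, the class `RSetSketch` and its method names are assumptions for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class RSetSketch {
    // One remembered set per target region: the regions that point into it.
    private final Map<Integer, Set<Integer>> rsets = new HashMap<>();

    // Called when a reference stored in fromRegion points at an object in toRegion.
    void recordReference(int fromRegion, int toRegion) {
        if (fromRegion == toRegion) {
            return; // references within a region need no tracking
        }
        Set<Integer> rset = rsets.get(toRegion);
        if (rset == null) {
            rset = new TreeSet<>();
            rsets.put(toRegion, rset);
        }
        rset.add(fromRegion);
    }

    // Regions that must be scanned for roots when collecting the given region alone.
    Set<Integer> regionsToScanFor(int region) {
        Set<Integer> rset = rsets.get(region);
        return rset == null ? Collections.<Integer>emptySet() : rset;
    }

    public static void main(String[] args) {
        RSetSketch heap = new RSetSketch();
        heap.recordReference(1, 3);
        heap.recordReference(2, 3);
        heap.recordReference(3, 3);
        System.out.println("to collect region 3, scan regions " + heap.regionsToScanFor(3));
    }
}
```

The more regions reference a given region, the larger its set grows and the more scanning a collection of that region requires, which is the cost mentioned above.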
Command Line Options
To start using the G1 Garbage Collector, use the following option:
-XX:+UseG1GC
To set a target for the maximum GC pause time, use the following option:
-XX:MaxGCPauseMillis=200
The main goal of G1 GC is to reduce latency. If latency is not a problem, then Parallel GC can be used. A related goal is simplified tuning. The most important tuning option is -XX:MaxGCPauseMillis=200 (default value = 200 ms). It influences the maximum amount of work per collection. However, the target is met on a best-effort basis only.
A trigger to start a GC cycle can be set with the option -XX:InitiatingHeapOccupancyPercent=n. It specifies a percentage of the entire heap, not just the old generation. Automatic resizing of the young generation has a lower and upper bound of 20% and 80% of the Java heap, respectively. Be cautious with this option: too low a value can lead to unnecessary GC overhead, and too high a value can lead to space overflow.
The liveness threshold for a region to be included in a Collection Set can be specified with -XX:G1OldCSetRegionLiveThresholdPercent=n. Be cautious with this option: too high a value can lead to more aggressive collecting, and too low a value can lead to heap wastage.
The target number of mixed GCs after a concurrent marking cycle can be specified using -XX:G1MixedGCCountTarget=n. Too high a value can lead to unnecessary overhead, and too low a value can lead to longer pauses.
Care must be taken if the young generation size is fixed using the option -Xmn. G1 then no longer respects the pause time target, and even if the heap expands, the young generation size stays fixed.
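To confirm which of these options a running JVM was actually started with, the standard `RuntimeMXBean` exposes the JVM input arguments. The sketch below is illustrative: the class `GcFlagsSketch` and its helpers are hypothetical names, and the default passed to `pauseTargetMillis` is simply the 200 ms figure quoted above.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

class GcFlagsSketch {
    static final String PAUSE_FLAG = "-XX:MaxGCPauseMillis=";

    // True if G1 was explicitly requested on the command line.
    static boolean g1Requested(List<String> jvmArgs) {
        return jvmArgs.contains("-XX:+UseG1GC");
    }

    // Pause target from the command line, or the supplied default if the flag is absent.
    static long pauseTargetMillis(List<String> jvmArgs, long defaultMillis) {
        long result = defaultMillis;
        for (String arg : jvmArgs) {
            if (arg.startsWith(PAUSE_FLAG)) {
                result = Long.parseLong(arg.substring(PAUSE_FLAG.length()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("G1 requested: " + g1Requested(jvmArgs));
        System.out.println("Pause target: " + pauseTargetMillis(jvmArgs, 200) + " ms");
        // The collector bean names also reveal which collectors are active.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println("Active collector: " + gc.getName());
        }
    }
}
```

Checking the flags this way is useful when options are injected by wrapper scripts and it is unclear what the JVM finally received.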
Sample Application Test
Let’s create a sample application to measure the performance and behavior of the CMS and G1 collectors. The basic algorithm is described below:
- Create and add 190 Float Arrays into an Array List
- Each Float Array reserves 4 MB of memory, i.e. 1 x 1024 x 1024 floats x 4 bytes = 4 MB
- 4 MB x 190 = 760 MB
- After each iteration the arrays are released and application sleeps for some time
- The same steps are repeated a certain number of times
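The steps above could be implemented along the following lines. This is a reconstruction from the description, not the original test program: the class name `GCTest` and the array-count argument come from the command lines used in the test runs, while the iteration count and sleep time are placeholder assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class GCTest {
    static final int FLOATS_PER_ARRAY = 1024 * 1024; // 1 x 1024 x 1024 floats x 4 bytes = 4 MB

    // Allocates the requested number of float arrays and keeps them reachable via the list.
    static List<float[]> allocate(int arrayCount, int floatsPerArray) {
        List<float[]> arrays = new ArrayList<>(arrayCount);
        for (int i = 0; i < arrayCount; i++) {
            arrays.add(new float[floatsPerArray]);
        }
        return arrays;
    }

    public static void main(String[] args) throws InterruptedException {
        int arrayCount = args.length > 0 ? Integer.parseInt(args[0]) : 190;
        int iterations = 10;     // assumed value; the original count is not stated
        long sleepMillis = 5000; // assumed value; the original sleep time is not stated
        for (int i = 0; i < iterations; i++) {
            List<float[]> arrays = allocate(arrayCount, FLOATS_PER_ARRAY);
            System.out.println("Iteration " + i + ": holding " + arrays.size() + " arrays");
            arrays = null; // release the arrays so they become garbage
            Thread.sleep(sleepMillis);
        }
    }
}
```

With 190 arrays the live set peaks at roughly 760 MB per iteration, so the maximum heap must be sized accordingly for the run to complete.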
I ran this application on a Windows 7 machine and used VisualVM to analyze the GC logs.
CMS Collector Results
Command Line Arguments to test CMS Collector:
java -server -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:CMS.log -Dcom.sun.management.jmxremote.port=3333 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -classpath C:\ GCTest 190
Observations with VisualVM
G1 Collector Results
Command Line Arguments to test G1 Collector:
java -server -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:G1GC.log -Dcom.sun.management.jmxremote.port=3333 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -classpath C:\ GCTest 190
Observations with VisualVM
| Parameters | G1 GC | CMS GC |
|---|---|---|
| Time Taken for Execution | 7 min 5 sec | 7 min 56 sec |
| Max CPU Usage | 27.3% | 70.2% |
| Max GC Activity | 2% | 24% |
| Max Heap Size | 974 MB | 974 MB |
| Max Used Heap Size | 763 MB | 779 MB |
- G1 GC is able to reclaim max heap size
- CMS is not able to do so
- Lower CPU utilization for G1 collection
- G1 Heap goes to max size in three distinct jumps
- CMS seems to gain max heap size in initial jump
Should You Move to G1 GC
Now that we have covered all the good things about G1 GC, the most pertinent question is when it should be used. I would suggest exercising cautious optimism. Don’t rush blindly to embrace it, as it also has some costs (a larger heap) and may not be a good fit for certain scenarios. Evaluate all other options before moving to G1 GC.
If you don’t need low latency then you are better off using parallel GC. If you don’t need a big heap, then use a small heap and parallel GC. If you need a big heap, then first try CMS collector. If CMS is not performing well, then try to tune it. If all attempts at tuning CMS are not paying dividends then you can consider using G1 GC.