Tuesday, August 12, 2014

Complexity chart for all Java Collections Datastructures

List

A list is an ordered collection of elements.


Add
Remove
Get
Contains
Data  Structure
ArrayList
 O(1)
O(n)
O(1)
O(n)
Array
LinkedList
 O(1)
O(1)
O(n)
O(n)
Linked List
CopyonWriteArrayList
 O(n)
O(n)
O(1)
O(n)
Array

Set

A collection that contains no duplicate elements.


Add
Contains
Next
Data Structure
HashSet
O(1)
O(1)
O(h/n)
Hash Table
LinkedHashSet
O(1)
O(1)
O(1)
Hash Table + Linked List
EnumSet
O(1)
O(1)
O(1)
Bit Vector
TreeSet
O(log n)
O(log n)
O(log n)
Red-black tree
CopyonWriteArraySet
O(n)
O(n)
O(1)
Array
ConcurrentSkipList
O(log n)
O(log n)
O(1)
Skip List

Queue

A collection designed for holding elements prior to processing.


Offer
Peak
Poll
Size
Data Structure
PriorityQueue
O(log n )
O(1)
O(log n)
O(1)
Priority Heap
LinkedList
 O(1)
O(1)
O(1)
O(1)
Array
ArrayDequeue
 O(1)
O(1)
O(1)
O(1)
Linked List
ConcurrentLinkedQueue
 O(1)
O(1)
O(1)
O(n)
Linked List
ArrayBlockingQueue
 O(1)
O(1)
O(1)
O(1)
Array
PriorirityBlockingQueue
O(log n)
O(1)
O(log n)
O(1)
Priority Heap
SynchronousQueue
O(1)
O(1)
O(1)
O(1)
None!
DelayQueue
O(log n)
O(1)
O(log n)
O(1)
Priority Heap
LinkedBlockingQueue
O(1)
O(1)
O(1)
O(1)
Linked List

Map

An object that maps keys to values. A map cannot duplicate keys; each key can map to at most one value.


Get
ContainsKey
Next
Data Structure
HashMap
O(1)
O(1)
O(h / n)
Hash Table
LinkedHashMap
O(1)
O(1)
O(1)
Hash Table + Linked List
IdentityHashMap
O(1)
O(1)
O(h / n)
Array
WeakHashMap
O(1)
O(1)
O(h / n)
Hash Table
EnumMap
O(1)
O(1)
O(1)
Array
TreeMap
O(log n)
O(log n)
O(log n)
Red-black tree
ConcurrentHashMap
O(1)
O(1)
O(h / n)
Hash Tables
ConcurrentSkipListMap
O(log n)
O(log n)
O(1)
Skip List

Monday, August 11, 2014

Anatomy of the G1 (Garbage first) Collector

This tutorial will cover the basics of the G1 (Garbage First) collector.

Garbage-First is a server-style garbage collector, targeted for mufti-processors with large memories, that meets a soft real-time goal with high probability [Detlefs04]. It does this while also achieving high throughput, which is an important point when comparing it to other real-time collectors. 

The G1 garbage collector is fully supported in Oracle JDK 7 update 4 and later releases. The G1 collector is designed for applications that:

  1. Can operate concurrently with applications threads like the CMS collector.
  2. Compact free space without lengthy GC induced pause times.
  3. Need more predictable GC pause durations.
  4. Do not want to sacrifice a lot of throughput performance.
  5. Do not require a much larger Java heap.
G1 is the replacement for the Concurrent Mark-Sweep Collector (CMS). Comparing G1 with CMS, there are differences that make G1 a better solution. One difference is that G1 is a compacting collector. G1 compacts sufficiently to completely avoid the use of fine-grained free lists for allocation, and instead relies on regions. This considerably simplifies parts of the collector, and mostly eliminates potential fragmentation issues. Also, G1 offers more predictable garbage collection pauses than the CMS collector, and allows users to specify desired pause targets.

How does the G1 work:

The older garbage collectors (serial, parallel, CMS) all structure the heap into three sections: young generation, old generation, and permanent generation of a fixed memory size.


The G1 collector takes a different approach The heap is partitioned into a set of equal-sized heap regions, each a contiguous range of virtual memory (see Figure 1). In the strictest sense, the heap doesn't contain generational areas, although a subset of the regions can be treated as such. This provides flexibility in how garbage collection is performed, which is adjusted on-the-fly according to the amount of processor time available to the collector.


 
 Figure 1


As shown above the regions are again broken down into 512 byte sections called cards (see Figure 2). Each card has a corresponding one-byte entry in a global card table, which is used to track which cards are modified by mutator threads. Subsets of these cards are tracked, and referred to as Remembered Sets (RS).
 
 
 Figure 2
 
 
When performing garbage collections, G1 operates in a manner similar to the CMS collector. G1 performs a concurrent global marking phase to determine the liveness of objects throughout the heap. After the mark phase completes, G1 knows which regions are mostly empty. It collects in these regions first, which usually yields a large amount of free space. As the name suggests, G1 concentrates its collection and compaction activity on the areas of the heap that are likely to be full of reclaimable objects, that is, garbage. G1 uses a pause prediction model to meet a user-defined pause time target and selects the number of regions to collect based on the specified pause time target.
 
The regions identified by G1 as ripe for reclamation are garbage collected using evacuation. G1 copies objects from one or more regions of the heap to a single region on the heap, and in the process both compacts and frees up memory. This evacuation is performed in parallel on multi-processors, to decrease pause times and increase throughput. Thus, with each garbage collection, G1 continuously works to reduce fragmentation, working within the user defined pause times. This is beyond the capability of both the previous methods. CMS (Concurrent Mark Sweep ) garbage collector does not do compaction. ParallelOld garbage collection performs only whole-heap compaction, which results in considerable pause times.

It is important to note that G1 is not a real-time collector. It meets the set pause time target with high probability but not absolute certainty. Based on data from previous collections, G1 does an estimate of how many regions can be collected within the user specified target time. Thus, the collector has a reasonably accurate model of the cost of collecting the regions, and it uses this model to determine which and how many regions to collect while staying within the pause time target.

Note: G1 has both concurrent (runs along with application threads, e.g., refinement, marking, cleanup) and parallel (multi-threaded, e.g., stop the world) phases. Full garbage collections are still single threaded, but if tuned properly your applications should avoid full GCs.


Step by step process of G1

G1 Heap Structure
The heap is one memory area split into many fixed sized regions.

Region size is chosen by the JVM at startup. The JVM generally targets around 2000 regions varying in size from 1 to 32Mb.
G1 Heap Allocation
In reality, these regions are mapped into logical representations of Eden, Survivor, and old generation spaces.

The colors in the picture shows which region is associated with which role. Live objects are evacuated (i.e., copied or moved) from one region to another. Regions are designed to be collected in parallel with or without stopping all other application threads.
As shown regions can be allocated into Eden, survivor, and old generation regions. In addition, there is a fourth type of object known as Humongous regions. These regions are designed to hold objects that are 50% the size of a standard region or larger. They are stored as a set of contiguous regions. Finally the last type of regions would be the unused areas of the heap.
Note: At the time of this writing, collecting humongous objects has not been optimized. Therefore, you should avoid creating objects of this size.
Young Generation in G1
The heap is split into approximately 2000 regions. Minimum size is 1Mb and maximum size is 32Mb. Blue regions hold old generation objects and green regions hold young generation objects.

Note that the regions are not required to be contiguous like the older garbage collectors.
A Young GC in G1
Live objects are evacuated (i.e., copied or moved) to one or more survivor regions. If the aging threshold is met, some of the objects are promoted to old generation regions.

This is a stop the world (STW) pause. Eden size and survivor size is calculated for the next young GC. Accounting information is kept to help calculate the size. Things like the pause time goal are taken into consideration.
This approach makes it very easy to resize regions, making them bigger or smaller as needed.
End of a Young GC with G1
Live objects have been evacuated to survivor regions or to old generation regions.

Recently promoted objects are shown in dark blue. Survivor regions in green.
In summary, the following can be said about the young generation in G1:
  • The heap is a single memory space split into regions.
  • Young generation memory is composed of a set of non-contiguous regions. This makes it easy to resize when needed.
  • Young generation garbage collections, or young GCs, are stop the world events. All application threads are stopped for the operation.
  • The young GC is done in parallel using multiple threads.
  • Live objects are copied to new survivor or old generation regions.

Old Generation Collection with G1

Like the CMS collector, the G1 collector is designed to be a low pause collector for old generation objects. The following table describes the G1 collection phases on old generation.

G1 Collection Phases - Concurrent Marking Cycle Phases

The G1 collector performs the following phases on the old generation of the heap. Note that some phases are part of a young generation collection.

Phase Description
(1) Initial Mark
(Stop the World Event)
This is a stop the world event. With G1, it is piggybacked on a normal young GC. Mark survivor regions (root regions) which may have references to objects in old generation.
(2) Root Region Scanning Scan survivor regions for references into the old generation. This happens while the application continues to run. The phase must be completed before a young GC can occur.
(3) Concurrent Marking Find live objects over the entire heap. This happens while the application is running. This phase can be interrupted by young generation garbage collections.
(4) Remark
(Stop the World Event)
Completes the marking of live object in the heap. Uses an algorithm called snapshot-at-the-beginning (SATB) which is much faster than what was used in the CMS collector.
(5) Cleanup
(Stop the World Event and Concurrent)
  • Performs accounting on live objects and completely free regions. (Stop the world)
  • Scrubs the Remembered Sets. (Stop the world)
  • Reset the empty regions and return them to the free list. (Concurrent)
(*) Copying
(Stop the World Event)
These are the stop the world pauses to evacuate or copy live objects to new unused regions. This can be done with young generation regions which are logged as [GC pause (young)]. Or both young and old generation regions which are logged as [GC Pause (mixed)].

G1 Old Generation Collection Step by Step

With the phases defined, let's look at how they interact with the old generation in the G1 collector.
Initial Marking Phase
Initial marking of live object is piggybacked on a young generation garbage collection. In the logs this is noted as GC pause (young)(inital-mark).

Concurrent Marking Phase
If empty regions are found (as denoted by the "X"), they are removed immediately in the Remark phase. Also, "accounting" information that determines liveness is calculated.

Remark Phase
Empty regions are removed and reclaimed. Region liveness is now calculated for all regions.

Copying/Cleanup Phase
G1 selects the regions with the lowest "liveness", those regions which can be collected the fastest. Then those regions are collected at the same time as a young GC. This is denoted in the logs as [GC pause (mixed)]. So both young and old generations are collected at the same time.

After Copying/Cleanup Phase
The regions selected have been collected and compacted into the dark blue region and the dark green region shown in the diagram.

Summary of Old Generation GC

In summary, there are a few key points we can make about the G1 garbage collection on the old generation.
  • Concurrent Marking Phase
    • Liveness information is calculated concurrently while the application is running.
    • This liveness information identifies which regions will be best to reclaim during an evacuation pause.
    • First, since this is the beginning of a new collection, the current marking bitmap is copied to the previous marking bitmap, and then the current marking bitmap is cleared.
    • Next, all mutator threads are paused while the current TAMS pointer is moved to point to the same byte in the region as the top (next free byte) pointer.
    • Next, all objects are traced from their roots, and live objects are marked in the marking bitmap. We now have a snapshot of the heap.
    • Next, all mutator threads are resumed.
    • Next, a write buffer is inserted for all mutator threads. This barrier records all new object allocations that take place after the snapshot into change buffers.
    •  
  • Remark Phase
    • Uses the Snapshot-at-the-Beginning (SATB) algorithm which is much faster then what was used with CMS.
    • When all filled buffers have been processed, the mutator threads are paused.
    • Next, the remaining (partially filled) buffers are processed, and those objects are marked also.
    • Completely empty regions are reclaimed.
  • Copying/Cleanup Phase
    • Next, all mutator threads are paused.
    • Next, all live-object counts are finalized per region.
    • The TAMS pointer for the current collection is copied to the previous TAMS pointer (since the current collection is basically complete).
    • The heap regions are sorted for collection priority according to a cost algorithm. As a result, the regions that will yield the highest numbers of reclaimed objects, at the smallest cost in terms of time, will be collected first. This forms what is called a collection set of regions.
    • All mutator threads are resumed.

Command Line options

In this section let's take a look at the various command line options for G1.

Basic Command Line

To enable the G1 Collector use: -XX:+UseG1GC
Here is a sample command line for starting the Java2Demo included in the JDK demos and samples download:
java -Xmx50m -Xms50m -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -jar c:\javademos\demo\jfc\Java2D\Java2demo.jar

Key Command Line Switches

-XX:+UseG1GC - Tells the JVM to use the G1 Garbage collector.
-XX:MaxGCPauseMillis=200 - Sets a target for the maximum GC pause time. This is a soft goal, and the JVM will make its best effort to achieve it. Therefore, the pause time goal will sometimes not be met. The default value is 200 milliseconds.
-XX:InitiatingHeapOccupancyPercent=45 - Percentage of the (entire) heap occupancy to start a concurrent GC cycle. It is used by G1 to trigger a concurrent GC cycle based on the occupancy of the entire heap, not just one of the generations. A value of 0 denotes 'do constant GC cycles'. The default value is 45 (i.e., 45% full or occupied).

Best Practices

There are a few best practices you should follow when using G1.
Do not Set Young Generation Size
Explicitly setting young generation size via -Xmn meddles with the default behavior of the G1 collector.
  • G1 will no longer respect the pause time target for collections. So in essence, setting the young generation size disables the pause time goal.
  • G1 is no longer able to expand and contract the young generation space as needed. Since the size is fixed, no changes can be made to the size.
Response Time Metrics
Instead of using average response time (ART) as a metric to set the XX:MaxGCPauseMillis=<N>, consider setting value that will meet the goal 90% of the time or more. This means 90% of users making a request will not experience a response time higher than the goal. Remember, the pause time is a goal and is not guaranteed to always be met.
What is an Evacuation Failure?
A promotion failure that happens when a JVM runs out of heap regions during the GC for either survivors and promoted objects. The heap can't expand because it is already at max. This is indicated in the GC logs when using -XX:+PrintGCDetails by to-space overflow. This is expensive!
  • GC still has to continue so space has to be freed up.
  • Unsuccessfully copied objects have to be tenured in place.
  • Any updates to RSets of regions in the CSet have to be regenerated.
  • All of these steps are expensive.
How to avoid Evacuation Failure
To avoid evacuation failure, consider the following options.
  • Increase heap size
    • Increase the -XX:G1ReservePercent=n, the default is 10.
    • G1 creates a false ceiling by trying to leave the reserve memory free in case more 'to-space' is desired.
  • Start the marking cycle earlier
  • Increase the number of marking threads using the -XX:ConcGCThreads=n option.

Complete List of G1 GC Switches

This is the complete list of G1 GC switches. Remember to use the best practices outlined above.
Option and Default Value Description
-XX:+UseG1GC Use the Garbage First (G1) Collector
-XX:MaxGCPauseMillis=n Sets a target for the maximum GC pause time. This is a soft goal, and the JVM will make its best effort to achieve it.
-XX:InitiatingHeapOccupancyPercent=n Percentage of the (entire) heap occupancy to start a concurrent GC cycle. It is used by GCs that trigger a concurrent GC cycle based on the occupancy of the entire heap, not just one of the generations (e.g., G1). A value of 0 denotes 'do constant GC cycles'. The default value is 45.
-XX:NewRatio=n Ratio of new/old generation sizes. The default value is 2.
-XX:SurvivorRatio=n Ratio of eden/survivor space size. The default value is 8.
-XX:MaxTenuringThreshold=n Maximum value for tenuring threshold. The default value is 15.
-XX:ParallelGCThreads=n Sets the number of threads used during parallel phases of the garbage collectors. The default value varies with the platform on which the JVM is running.
-XX:ConcGCThreads=n Number of threads concurrent garbage collectors will use. The default value varies with the platform on which the JVM is running.
-XX:G1ReservePercent=n Sets the amount of heap that is reserved as a false ceiling to reduce the possibility of promotion failure. The default value is 10.
-XX:G1HeapRegionSize=n With G1 the Java heap is subdivided into uniformly sized regions. This sets the size of the individual sub-divisions. The default value of this parameter is determined ergonomically based upon heap size. The minimum value is 1Mb and the maximum value is 32Mb.

Logging GC with G1

Logging GC with G1

The final topic we need to cover is using logging information to analyze performance with the G1 collector. This section provides a quick overview of the switches you can use to collect data and the information that is printed in the logs.

Setting the Log Detail

You can set the detail to three different levels of detail.
(1) -verbosegc (which is equivalent to -XX:+PrintGC) sets the detail level of the log to fine.
Sample Output
[GC pause (G1 Humongous Allocation) (young) (initial-mark) 24M- >21M(64M), 0.2349730 secs]
[GC pause (G1 Evacuation Pause) (mixed) 66M->21M(236M), 0.1625268 secs]    
(2) -XX:+PrintGCDetails sets the detail level to finer. The options shows the following information:
  • Average, Min, and Max time are displayed for each phase.
  • Root Scan, RSet Updating (with processed buffers information), RSet Scan, Object Copy, Termination (with number of attempts).
  • Also shows “other” time such as time spent choosing CSet, reference processing, reference enqueuing and freeing CSet.
  • Shows the Eden, Survivors and Total Heap occupancies.
Sample Output
[Ext Root Scanning (ms): Avg: 1.7 Min: 0.0 Max: 3.7 Diff: 3.7]
[Eden: 818M(818M)->0B(714M) Survivors: 0B->104M Heap: 836M(4096M)->409M(4096M)]
(3) -XX:+UnlockExperimentalVMOptions -XX:G1LogLevel=finest sets the detail level to its finest. Like finer but includes individual worker thread information.
[Ext Root Scanning (ms): 2.1 2.4 2.0 0.0
           Avg: 1.6 Min: 0.0 Max: 2.4 Diff: 2.3]
       [Update RS (ms):  0.4  0.2  0.4  0.0
           Avg: 0.2 Min: 0.0 Max: 0.4 Diff: 0.4]
           [Processed Buffers : 5 1 10 0
           Sum: 16, Avg: 4, Min: 0, Max: 10, Diff: 10]

Determining Time

A couple of switches determine how time is displayed in the GC log.
(1) -XX:+PrintGCTimeStamps - Shows the elapsed time since the JVM started.
Sample Output
    1.729: [GC pause (young) 46M->35M(1332M), 0.0310029 secs]
(2) -XX:+PrintGCDateStamps - Adds a time of day prefix to each entry.
    2012-05-02T11:16:32.057+0200: [GC pause (young) 46M->35M(1332M), 0.0317225 secs]

Understanding G1 Log

To understand the log, this section defines a number of terms using actual GC log output. The following examples show output from the log with explanations of the terms and values you will find in it.

G1 Logging Terms Index to debug

Clear CT
CSet
External Root Scanning
Free CSet
GC Worker End
GC Worker Other
Object Copy
Other
Parallel Time
Ref Eng
Ref Proc
Scanning Remembered Sets
Termination Time
Update Remembered Set
Worker Start



Parallel Time
414.557: [GC pause (young), 0.03039600 secs] [Parallel Time: 22.9 ms]
[GC Worker Start (ms): 7096.0 7096.0 7096.1 7096.1 706.1 7096.1 7096.1 7096.1 7096.2 7096.2 7096.2 7096.2
       Avg: 7096.1, Min: 7096.0, Max: 7096.2, Diff: 0.2]
Parallel Time – Overall elapsed time of the main parallel part of the pause
Worker Start – Timestamp at which the workers start
Note: The logs are ordered on thread id and are consistent on each entry
External Root Scanning
[Ext Root Scanning (ms): 3.1 3.4 3.4 3.0 4.2 2.0 3.6 3.2 3.4 7.7 3.7 4.4
     Avg: 3.8, Min: 2.0, Max: 7.7, Diff: 5.7]
External root scanning - The time taken to scan the external root (e.g., things like system dictionary that point into the heap.)
Update Remembered Set
[Update RS (ms): 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 Avg: 0.0, Min: 0.0, Max: 0.1, Diff: 0.1]
   [Processed Buffers : 26 0 0 0 0 0 0 0 0 0 0 0
    Sum: 26, Avg: 2, Min: 0, Max: 26, Diff: 26]
Update Remembered Set - Any buffers that are completed but have not yet been processed by the concurrent refinement thread before the start of the pause have to be updated. Time depends on density of the cards. The more cards, the longer it will take.
Scanning Remembered Sets
[Scan RS (ms): 0.4 0.2 0.1 0.3 0.0 0.0 0.1 0.2 0.0 0.1 0.0 0.0 Avg: 0.1, Min: 0.0, Max: 0.4, Diff: 0.3]F
Scanning Remembered Sets - Look for pointers that point into the Collection Set.
Object Copy
[Object Copy (ms): 16.7 16.7 16.7 16.9 16.0 18.1 16.5 16.8 16.7 12.3 16.4 15.7 Avg: 16.3, Min: 12.3, Max:  18.1, Diff: 5.8]
Object copy – The time that each individual thread spent copying and evacuating objects.
Termination Time
[Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 Avg: 0.0, Min: 0.0, Max: 0.0, Diff: 0.0] [Termination Attempts : 1 1 1 1 1 1 1 1 1 1 1 1 Sum: 12, Avg: 1, Min: 1, Max: 1, Diff: 0]
Termination time - When a worker thread is finished with its particular set of objects to copy and scan, it enters the termination protocol. It looks for work to steal and once it's done with that work it again enters the termination protocol. Termination attempt counts all the attempts to steal work.
GC Worker End
[GC Worker End (ms): 7116.4 7116.3 7116.4 7116.3 7116.4 7116.3 7116.4 7116.4 7116.4 7116.4 7116.3 7116.3
    Avg: 7116.4, Min: 7116.3, Max: 7116.4, Diff:   0.1]
[GC Worker (ms): 20.4 20.3 20.3 20.2 20.3 20.2 20.2 20.2 20.3 20.2 20.1 20.1
     Avg: 20.2, Min: 20.1, Max: 20.4, Diff: 0.3]
GC worker end time – Timestamp when the individual GC worker stops.
GC worker time – Time taken by individual GC worker thread.
GC Worker Other
[GC Worker Other (ms): 2.6 2.6 2.7 2.7 2.7 2.7 2.7 2.8 2.8 2.8 2.8 2.8
    Avg: 2.7, Min: 2.6, Max: 2.8, Diff: 0.2]
GC worker other – The time (for each GC thread) that can't be attributed to the worker phases listed previously. Should be quite low. In the past, we have seen excessively high values and they have been attributed to bottlenecks in other parts of the JVM (e.g., increases in the Code Cache occupancy with Tiered).
Clear CT
[Clear CT: 0.6 ms]
Time taken to clear the card table of RSet scanning meta-data
Other
[Other: 6.8 ms]
Time taken for various other sequential phases of the GC pause.
CSet
[Choose CSet: 0.1 ms]
Time taken finalizing the set of regions to collect. Usually very small; slightly longer when having to select old.
Ref Proc
[Ref Proc: 4.4 ms]
Time spent processing soft, weak, etc. references deferred from the prior phases of the GC.
Ref Enq
[Ref Enq: 0.1 ms]
Time spent placing soft, weak, etc. references on to the pending list.
Free CSet
[Free CSet: 2.0 ms]
Time spent freeing the set of regions that have just been collected, including their remembered sets.

Note: For more references check out

http://www.infoq.com/articles/G1-One-Garbage-Collector-To-Rule-Them-All
https://blogs.oracle.com/poonam/entry/understanding_g1_gc_logs
http://www.drdobbs.com/jvm/g1-javas-garbage-first-garbage-collector/219401061?pgno=1
http://www.oracle.com/technetwork/tutorials/tutorials-1876574.html
 


Tuesday, June 17, 2014

Compare and Swap In Java


Compare and swap (CAS)

The first processors that supported concurrency provided atomic test-and-set operations, which generally operated on a single bit. The most common approach taken by current processors, including Intel and Sparc processors, is to implement a primitive called compare-and-swap, or CAS. (On Intel processors, compare-and-swap is implemented by the cmpxchg family of instructions. PowerPC processors have a pair of instructions called "load and reserve" and "store conditional" that accomplish the same goal; similar for MIPS, except the first is called "load linked.")

A CAS operation includes three operands - a memory location (V), the expected old value (A), and a new value (B). The processor will atomically update the location to the new value if the value that is there matches the expected old value, otherwise it will do nothing. In either case, it returns the value that was at that location prior to the CAS instruction. (Some flavors of CAS will instead simply return whether or not the CAS succeeded, rather than fetching the current value.) CAS effectively says "I think location V should have the value A; if it does, put B in it, otherwise, don't change it but tell me what value is there now."

The natural way to use CAS for synchronization is to read a value A from an address V, perform a multistep computation to derive a new value B, and then use CAS to change the value of V from A to B. The CAS succeeds if the value at V has not been changed in the meantime.

Instructions like CAS allow an algorithm to execute a read-modify-write sequence without fear of another thread modifying the variable in the meantime, because if another thread did modify the variable, the CAS would detect it (and fail) and the algorithm could retry the operation. Listing below illustrates the behavior (but not performance characteristics) of the CAS operation, but the value of CAS is that it is implemented in hardware and is extremely lightweight (on most processors):

=================================================
public class SimulatedCAS {

    private int value;

    public synchronized int getValue() {
        return value;
    }

    public synchronized int compareAndSwap(int expectedValue, int newValue) {
        int oldValue = value;
        if (value == expectedValue) {
            value = newValue;
        }
        return oldValue;
    }
}
=================================================
Concurrent algorithms based on CAS are called lock-free, because threads do not ever have to wait for a lock (sometimes called a mutex or critical section, depending on the terminology of your threading platform). Either the CAS operation succeeds or it doesn't, but in either case, it completes in a predictable amount of time. If the CAS fails, the caller can retry the CAS operation or take other action as it sees fit. Listing below shows the counter class written to use CAS:

=================================================
public class CASCounter {

    private SimulatedCAS value = new SimulatedCAS(); // starts with 0

    public int getValue() {
        return value.getValue();
    }

    public int increment() {
        int oldValue = value.getValue();
        while (value.compareAndSwap(oldValue, oldValue + 1) != oldValue) {
            oldValue = value.getValue();
        }
        return oldValue + 1;
    }
}
=================================================
Lock-free and wait-free algorithms

An algorithm is said to be wait-free if every thread will continue to make progress in the face of arbitrary delay (or even failure) of other threads. By contrast, a lock-free algorithm requires only that some thread always make progress.

Nonblocking algorithms are used extensively at the operating system and JVM level for tasks such as thread and process scheduling. While they are more complicated to implement, they have a number of advantages over lock-based alternatives - hazards like priority inversion and deadlock are avoided, contention is less expensive, and coordination occurs at a finer level of granularity, enabling a higher degree of parallelism.

Until Java SE 5, it was not possible to write wait-free, lock-free algorithms in the Java language without using native code. With the addition of the atomic variables classes in the java.util.concurrent.atomic package, that has changed. The atomic variable classes all expose a compare-and-set primitive (similar to compare-and-swap), which is implemented using the fastest native construct available on the platform (compare-and-swap, load linked/store conditional, or, in the worst case, spin locks). Nine flavors of atomic variables are provided in the java.util.concurrent.atomic package (AtomicInteger; AtomicLong; AtomicReference; AtomicBoolean; array forms of atomic integer; long; reference; and atomic marked reference and stamped reference classes, which atomically update a pair of values).

The atomic variable classes can be thought of as a generalization of volatile variables, extending the concept of volatile variables to support atomic conditional compare-and-set updates. Reads and writes of atomic variables have the same memory semantics as read and write access to volatile variables.

Operations on atomic variables get turned into the hardware primitives that the platform provides for concurrent access, such as compare-and-swap.

Java Atomic Operations

In Java the integer primitive increment operation is not an atomic operation. When an integer is incremented, the following logical steps are performed by the JVM:

Retrieve the value of the integer from memory

Increment the value

Assign the newly incremented value back to the appropriate memory location

Return the value to the caller

So while we write the increment operator in a single line of Java code, such as:

int n = i++;
Each one of the aforementioned steps occurs in the JVM. The danger is that if you have multiple threads that all try to increment the same value, there is a chance that two or more of the threads will get the same value (step #1), then increment it (step #2), and then assign the new value to it (step #3). If two threads increment the number 5 then you would expect to see 7 but instead both increment the number 5, yielding a result of 6, and then assign 6 back to the integer's memory location.

With the release of Java SE 5, Sun included a java.util.concurrent.atomic package that addresses this limitation. And specifically they added classes including the following:

AtomicBoolean - A boolean value that may be updated atomically. An AtomicBoolean is used in applications such as atomically updated flags, and cannot be used as a replacement for a Boolean.

AtomicInteger - An int value that may be updated atomically. An AtomicInteger is used in applications such as atomically incremented counters, and cannot be used as a replacement for an Integer. However, this class does extend Number to allow uniform access by tools and utilities that deal with numerically-based classes.

AtomicIntegerArray - An int array in which elements may be updated atomically.

AtomicLong - A long value that may be updated atomically. An AtomicLong is used in applications such as atomically incremented sequence numbers, and cannot be used as a replacement for a Long. However, this class does extend Number to allow uniform access by tools and utilities that deal with numerically-based classes.

AtomicLongArray - A long array in which elements may be updated atomically.

Each of these atomic classes provides methods to perform common operations, but each one is ensured to be performed as a single atomic operation. For example, rather than incrementing an integer using the standard increment operator, like the following:

int n = ++i;
You can ensure that the (1) get value, (2) increment value, (3) update memory, and (4) assign the new value to n is all accomplished without fear of another thread interrupting your operation by writing you code as follows:

AtomicInteger ai = new AtomicInteger(0);
int n = ai.incrementAndGet();
In addition, the AtomicInteger class provides the following operations:

int addAndGet(int delta) - Add delta to the integer and then return the new value.

int decrementAndGet() - Decrement the integer and return its value.

int getAndAdd(int delta) - Return the value and then add delta to it.

int getAndDecrement() - Return the value and then decrement.

int getAndIncrement() - Return the value and then increment it.

int getAndSet(int newValue) - Return the value and then set it to the newValue


Ref:
>> http://howtodoinjava.com/2014/06/13/compare-and-swap-cas-algorithm/
>> http://java.boot.by/ocpjp7-upgrade/ch04s03.html