Thursday, May 29, 2014

Given an inorder and a pre or post order tree traversal, reconstruct the tree (assume a binary tree)



Consider the tree shown below (reconstructed from the sample traversals used in main: 40 is the root, 25 is its left child with children 10 and 32, and 78 is its right child):

          40
         /  \
       25    78
      /  \
    10    32




/**
 * @author Sachidananda
 * A binary tree is made up of nodes; the Node definition is provided below.
 */
public class Node {

    int nodeValue;
    public Node left;
    public Node right;

    public Node(int val) {
        nodeValue = val;
        this.left = null;
        this.right = null;
    }

    public boolean hasLeft() {
        return left != null;
    }

    public boolean hasRight() {
        return right != null;
    }

    @Override
    public String toString() {
        return toStringHelper(" ");
    }

    // Prints the tree sideways: the right subtree on top, the root in the
    // middle and the left subtree below, indenting deeper levels further right.
    private String toStringHelper(String indent) {
        String ret = "";
        if (hasRight()) {
            ret += right.toStringHelper(indent + "   ");
        }
        ret += indent + nodeValue + "\n";
        if (hasLeft()) {
            ret += left.toStringHelper(indent + " ");
        }
        return ret;
    }

}




/**
 * @author sachidananda
 * Builds a binary tree either from its preorder and inorder traversals
 * or from its inorder and postorder traversals.
 */
public class BinTreeConstructor {

    /* We assume the traversals are supplied as int arrays and contain no duplicate values. */

    /**
     * @param preorder the preorder traversal
     * @param inorder the inorder traversal
     * @return the root Node of the reconstructed tree
     */
    public static Node buildTreePreIn(int[] preorder, int[] inorder) {
        return buildTreePI(preorder, 0, preorder.length - 1, inorder, 0, inorder.length - 1);
    }

    /**
     * @param inorder the inorder traversal
     * @param postorder the postorder traversal
     * @return the root Node of the reconstructed tree
     */
    public static Node buildTreeInPost(int[] inorder, int[] postorder) {
        return buildTreeIP(inorder, 0, inorder.length - 1, postorder, 0, postorder.length - 1);
    }

    /**
     * The first preorder element is the root; locating it in the inorder
     * range splits that range into the left and right subtrees.
     */
    public static Node buildTreePI(int[] pre, int preStart, int preEnd, int[] in, int inStart, int inEnd) {
        if (inStart > inEnd) {
            return null;
        }
        int rootVal = pre[preStart];
        int rootIndex = inStart;
        for (int i = inStart; i <= inEnd; i++) {
            if (in[i] == rootVal) {
                rootIndex = i;
                break;
            }
        }
        int len = rootIndex - inStart;   // size of the left subtree
        Node root = new Node(rootVal);
        root.left = buildTreePI(pre, preStart + 1, preStart + len, in, inStart, rootIndex - 1);
        root.right = buildTreePI(pre, preStart + len + 1, preEnd, in, rootIndex + 1, inEnd);
        return root;
    }

    /**
     * The last postorder element is the root; locating it in the inorder
     * range splits that range into the left and right subtrees.
     */
    public static Node buildTreeIP(int[] in, int inStart, int inEnd, int[] post, int postStart, int postEnd) {
        if (inStart > inEnd) {
            return null;
        }
        int rootVal = post[postEnd];
        int rootIndex = inStart;
        for (int i = inStart; i <= inEnd; i++) {
            if (in[i] == rootVal) {
                rootIndex = i;
                break;
            }
        }
        int len = rootIndex - inStart;   // size of the left subtree
        Node root = new Node(rootVal);
        root.left = buildTreeIP(in, inStart, rootIndex - 1, post, postStart, postStart + len - 1);
        root.right = buildTreeIP(in, rootIndex + 1, inEnd, post, postStart + len, postEnd - 1);
        return root;
    }

    public static void main(String[] args) {
        int[] postOrder = {10, 32, 25, 78, 40};
        int[] preOrder = {40, 25, 10, 32, 78};
        int[] inOrder = {10, 25, 32, 40, 78};

        Node tree = buildTreePreIn(preOrder, inOrder);
        System.out.println("Building Tree from preOrder and  inOrder");
        System.out.println("----------------------------------------");
        System.out.println(tree);

        Node tree1 = buildTreeInPost(inOrder, postOrder);
        System.out.println("Building Tree from inOrder and  postOrder");
        System.out.println("----------------------------------------");
        System.out.println(tree1);
    }

}

output
=============


Building Tree from preOrder and  inOrder
----------------------------------------
    78
 40
     32
  25
   10

Building Tree from inOrder and  postOrder
----------------------------------------
    78
 40
     32
  25
   10

For a given String, generate all possible permutations (All Possible Anagrams) of that String In Java



import java.util.ArrayList;
import java.util.Scanner;

/**
 * @author Sachidananda
 */
public class StringPermutator {

    public static void main(String[] args) {
        System.out.println("Enter the strings to get all the combinations : ");
        Scanner scanner = new Scanner(System.in);
        String inputString = scanner.nextLine();
        System.out.println("The Input you have provided is : " + inputString);
        ArrayList<String> finalResult = permutate(inputString);
        System.out.println("The Results are as follows : ");
        for (String result : finalResult) {
            System.out.println(result);
        }
    }

    /**
     * Lists all permutations of a string.
     * @param ins the input string
     * @return an ArrayList holding every permutation of the input
     */
    public static ArrayList<String> permutate(String ins) {
        ArrayList<String> permutationResult = new ArrayList<String>();
        if (ins.length() == 1) {
            permutationResult.add(ins);
        } else if (ins.length() > 1) {
            int lastIdx = ins.length() - 1;
            String last = ins.substring(lastIdx);           // the last character
            String allButLast = ins.substring(0, lastIdx);  // everything before it
            // Recursively permute the shorter string, then weave the last
            // character into every position of every shorter permutation.
            permutationResult = getPermutationCombinations(permutate(allButLast), last);
        }
        return permutationResult;
    }

    /**
     * @param permutations the permutations of the input minus its last character
     * @param last the character at the last index of the input string
     * @return an ArrayList with last inserted at every position of every permutation
     */
    private static ArrayList<String> getPermutationCombinations(ArrayList<String> permutations, String last) {
        ArrayList<String> result = new ArrayList<String>();
        for (String data : permutations) {
            for (int i = 0; i <= data.length(); i++) {
                result.add(new StringBuilder(data).insert(i, last).toString());
            }
        }
        return result;
    }

}


OUTPUT
===============

Enter the strings to get all the combinations : 
ABC
The Input you have provided is : ABC
The Results are as follows : 
CBA
BCA
BAC
CAB
ACB
ABC


Sunday, May 4, 2014

Tips to tune your Java Code for Performance Optimization

Using double/long vs BigDecimal for monetary calculations: double, long, java.math.BigDecimal, java.lang.String:


  • If you want to implement fast and correct monetary arithmetic operations in Java, stick to the following rules:
    1. Store monetary values in the smallest currency units (for example, cents) in long variables.
    2. Avoid working with non-integral values while using double (calculate in the smallest currency units).
    3. Add/subtract using long.
    4. Round any multiplication/division results using Math.round/rint/ceil/floor (per your system requirements).
    5. Your calculations should fit into 52 bits (double precision).
  • Always use MathContext for BigDecimal multiplication and division in order to avoid ArithmeticException for infinitely long decimal results. Don't use MathContext.UNLIMITED for that reason - it is equivalent to no context at all.
  • Do not convert double to BigDecimal, instead convert String to BigDecimal when possible.
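
As an illustration of these rules, here is a minimal sketch (the class, prices and tax rate are mine, purely for illustration) of integer-cent arithmetic with explicit rounding, plus the String-vs-double BigDecimal construction from the last rule:

    import java.math.BigDecimal;

    public class MoneyExample {
        public static void main(String[] args) {
            long priceCents = 19_99;                        // store money in the smallest unit
            long quantity = 3;
            long subtotal = priceCents * quantity;          // add/multiply on long

            double taxRate = 0.0825;                        // non-integral factor
            long tax = Math.round(subtotal * taxRate);      // round the result immediately

            System.out.println("Total cents: " + (subtotal + tax));

            // When BigDecimal is unavoidable, build it from a String, not a double:
            BigDecimal exact = new BigDecimal("19.99");     // exactly 19.99
            BigDecimal drifted = new BigDecimal(19.99);     // 19.989999999999...
            System.out.println(exact + " vs " + drifted);
        }
    }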

Changes to String internal representation made in Java 1.7.0_06: java.lang.String, java.util.HashMap, java.util.Hashtable, java.util.HashSet, java.util.LinkedHashMap, java.util.LinkedHashSet, java.util.WeakHashMap and java.util.concurrent.ConcurrentHashMap:


  • From Java 1.7.0_06 String.substring always creates a new underlying char[] value for every String it creates. This means that this method now has a linear complexity compared to previous constant complexity. The advantage of this change is a slightly smaller memory footprint of a String (8 bytes less than before) and a guarantee to avoid memory leaks caused by String.substring (see String packing part 1: converting characters to bytes for more details on Java object memory layout).
  • Java 7u6+ functionality. Removed in Java 8. Starting from the same Java update, the String class got a second hashing method called hash32. This method is not public and, without reflection, can be accessed only via the sun.misc.Hashing.stringHash32(String) call. It is used by the 7 JDK hash-based collections listed above once their size exceeds the jdk.map.althashing.threshold system property. This is an experimental feature and currently I don't recommend using it in your code.
  • Java 7u6 (inclusive) to Java 7u40 (exclusive) functionality. Not applicable to Java 8. All standard JDK non-concurrent maps and sets in all Java versions between Java 7u6 (inclusive) and Java 7u40 (exclusive) are affected by a performance bug caused by new hashing implementation. This bug affects only multithreaded applications creating heaps of maps per second. See this article for more details. This problem was fixed in Java 7u40.


Performance of various methods of binary serialization in Java: java.nio.ByteBuffer, sun.misc.Unsafe, java.io.DataInputStream, java.io.DataOutputStream, java.io.ByteArrayInputStream, java.io.ByteArrayOutputStream: comparison of binary serialization performance using various classes:

  1. It is extremely slow to write single bytes to direct byte buffers. You should avoid using direct byte buffers for writing records with mostly single byte fields.
  2. If you have primitive array fields - always use bulk methods to process them. ByteBuffer bulk methods performance is close to that of Unsafe (though ByteBuffer methods are always a little slower). If you need to store/load any primitive array other than byte[] - use a ByteBuffer.as[YourType]Buffer().put(array) call followed by a manual byte buffer position update, as sketched after this list. Do not call a ByteBuffer.put[YourType] method in a loop!
  3. The higher your average field length, the slower a heap buffer becomes and the faster a direct byte buffer becomes. Unsafe access, even to separate fields, is still faster.
  4. In Java 7 many types of ByteBuffer accesses were seriously optimized compared to Java 6.
  5. Always try to serialize primitive arrays using direct byte buffer with your platform native byte order - its performance is very close to Unsafe performance and it is portable, unlike Unsafe code.
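
A hedged sketch of rules 2 and 5 (the array name and sizes are illustrative): bulk-storing a long[] through the asLongBuffer view of a direct, native-order buffer instead of looping over putLong:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class BulkPutExample {
        public static void main(String[] args) {
            long[] data = new long[1024];                      // sample payload
            ByteBuffer buf = ByteBuffer.allocateDirect(data.length * 8)
                                       .order(ByteOrder.nativeOrder());
            buf.asLongBuffer().put(data);                      // one bulk copy
            // The view buffer keeps its own position, so the backing buffer
            // must be advanced by hand - this replaces a slow putLong loop:
            buf.position(buf.position() + data.length * 8);
        }
    }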

Java collections overview: all JDK 1.6/1.7 standard collections are described and categorized in this overview :


  • Try to follow these rules while using ArrayList:
    1. Add elements to the end of the list
    2. Remove elements from the end too
    3. Avoid contains, indexOf and remove(Object) methods
    4. Even more avoid removeAll and retainAll methods
    5. Use subList(int, int).clear() idiom to quickly clean a part of the list

  • java.util.LinkedList performance: java.util.LinkedList, java.util.ArrayDeque:
    1. Consider using ArrayDeque for queue-based algorithms
    2. Use ListIterator with LinkedList
    3. Avoid any LinkedList methods which accept or return the index of an element in the list - index-based access is linear-time in a linked list, so these methods have nothing to do with performance
    4. Check if you have a reason to use LinkedList.remove/removeFirst/removeLast methods, use pollFirst/pollLast instead
    5. Try batch processing LinkedList

  • Bit sets: java.util.BitSet, java.util.Set<Integer>: representing set of integers in the most compact form, using bit sets to store set of Long/long values:
    1. Do not forget about bit sets when you need to map a large number of integer keys to boolean flags.
    2. Sets of integer values should be replaced with bit sets in a lot of cases in order to save a lot of memory (see the sketch after this list).

  • java.util.IdentityHashMap: a discussion of why IdentityHashMap is so special and what alternatives it has.
    1. java.util.IdentityHashMap uses System.identityHashCode to get an object identity hash code. Avoid IdentityHashMap if your objects have a primary key field (use it as the key of an ordinary HashMap), or use a Trove map with a custom hashing strategy if you need your own equals and hashCode methods but can't modify the objects you are working with.
    2. Do not try to iterate IdentityHashMap contents, because iteration order will be different on every run of your program, thus making your program results inconsistent.
    3. Accessing the object identity hash code is a very cheap Java intrinsic operation.
    4. Beware that an object with the calculated identity hash code can not be used for biased locking. While very rare in normal circumstances, you may end up in this situation if your lock will be accessed by any Java object graph traversal algorithm (serialization, for example).
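
A minimal sketch of the bit set advice above (the key range and values are invented for illustration): mapping integer keys to boolean flags with java.util.BitSet instead of a Set<Integer>:

    import java.util.BitSet;

    public class BitSetExample {
        public static void main(String[] args) {
            // One bit per possible key instead of a boxed Integer per entry.
            BitSet seen = new BitSet(1_000_000);
            seen.set(42);
            seen.set(777_777);
            System.out.println(seen.get(42));        // true
            System.out.println(seen.get(43));        // false
            System.out.println(seen.cardinality());  // 2 keys set
        }
    }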


Regexp-related methods of String: java.util.regex.Pattern, java.util.regex.Matcher, java.lang.String: pattern/matcher logic:

  1. Always (or nearly always) replace String.matches, split, replaceAll, replaceFirst methods with Matcher and Pattern methods - it will save you from unnecessary pattern compilation.
  2. In Java 7, splitting by a single character that has no special meaning in regular expressions is optimized inside String.split. Always use String.split to split such strings in Java 7.
  3. In all other simple cases consider handwriting parsing methods for simple situations in the time-critical code. You can easily gain 10 times speedup by replacing Pattern methods with handcrafted methods.
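
A short sketch of rule 1 (the pattern itself is just an example): compiling the pattern once and reusing it, instead of calling String.matches, which recompiles the pattern on every call:

    import java.util.regex.Pattern;

    public class PatternReuseExample {
        // Compiled once; s.matches("\\d+") would recompile this every time.
        private static final Pattern DIGITS = Pattern.compile("\\d+");

        public static void main(String[] args) {
            String[] inputs = {"123", "abc", "42"};
            for (String s : inputs) {
                System.out.println(s + " -> " + DIGITS.matcher(s).matches());
            }
        }
    }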


java.util.Date, java.util.Calendar and java.text.SimpleDateFormat performance: java.util.Date, java.util.Calendar, java.text.SimpleDateFormat: date storage, parsing and converting back to string:

  • Do not use java.util.Date unless you have to use it. Use an ordinary long instead.
  • java.util.Calendar is useful for all sorts of date calculations and i18n, but avoid storing a lot of such objects or creating them extensively - they consume a lot of memory and are expensive to create.
  • java.text.SimpleDateFormat is useful for general-case datetime parsing, but it is better avoided if you have to parse a lot of dates in the same format (especially dates without time). Implement a parser manually instead.
  • Joda Time library performance: org.joda.time.DateTime, org.joda.time.format.DateTimeFormat, org.joda.time.format.DateTimeFormatter. 
  • This is a comparison of Joda Time library classes performance with standard JDK classes performance (java.util.Date, java.util.Calendar, java.text.SimpleDateFormat). 
    1. All Joda Time date/time objects are built on top of a long timestamp, so it is cheap to construct them from a long.
    2. Joda Time 2.1-2.3 is affected by a performance issue in the timezone offset calculation logic - all years after the last daylight savings rule change in a given timezone use a slow calculation path (European timezones are affected particularly badly). In essence this means that every zone performs badly in all years after the release of the Joda Time version you are using.
    3. Date/time object construction and date/time arithmetic in Joda work 1.5-3 times faster than GregorianCalendar for years not affected by the above-mentioned performance issue. For affected years, date operation performance in Joda plummets and can be 4 times slower than in GregorianCalendar.
    4. Joda does not keep the human time - year/month/day/hour/min/second - inside its objects (unlike GregorianCalendar). This means that accessing human time on Joda objects is more expensive if you need more than one field.
    5. Date/time parsing in Joda works a little faster than in JDK SimpleDateFormat. The advantage of Joda parsing is that constructing a parser - a DateTimeFormatter object - is extremely cheap, unlike an expensive SimpleDateFormat, so you don't have to cache parsers anymore.

JSR 310 - Java 8 Date/Time library performance (as well as Joda Time 2.3 and j.u.Calendar): an overview of a new Java 8 date/time implementation also known as JSR-310 and its performance comparison with Joda Time 2.3 and j.u.GregorianCalendar.


  1. Java 8 date/time classes are built on top of human time - year/month/day/hour/minute/second/nanos. That makes them fast for human datetime arithmetic/conversion. Nevertheless, if you are processing computer time (a.k.a. millis since epoch), especially computer time in a short date range (a few days), a manual implementation based on int/long values would be much faster.
  2. Date/time component getters like getDayOfMonth have O(1) complexity in the Java 8 implementation. Joda getters require a computer-to-human time calculation on every getter call, which makes Joda a bottleneck in such scenarios.
  3. Parsing of OffsetDateTime/OffsetTime/ZonedDateTime is very slow in Java 8 ea b121 due to exceptions thrown and caught internally in the JDK.

java.io.ByteArrayOutputStream: java.io.ByteArrayOutputStream, java.nio.ByteBuffer: why you should not use ByteArrayOutputStream in the performance critical code.

  1. For performance critical code try to use ByteBuffer instead of ByteArrayOutputStream. If you still want to use ByteArrayOutputStream - get rid of its synchronization.
  2. If you are working on a method which writes some sort of message to unknown OutputStream, always write your message to the ByteArrayOutputStream first and use its writeTo(OutputStream) method after that. In some rare cases when you are building a String from its byte representation, do not forget about ByteArrayOutputStream.toString methods.
  3. In most cases avoid ByteArrayOutputStream.toByteArray method - it creates a copy of internal byte array. Garbage collecting these copies may take a noticeable time if your application is using a few gigabytes of memory (see Inefficient byte[] to String constructor article for another example).

java.io.BufferedInputStream and java.util.zip.GZIPInputStream: java.io.BufferedInputStream, java.util.zip.GZIPInputStream, java.nio.channels.FileChannel: some minor performance pitfalls in these two streams.

  1. Both BufferedInputStream and GZIPInputStream have internal buffers. The default size for the former is 8192 bytes and for the latter 512 bytes. It is generally worth increasing either of these to at least 65536.
  2. Do not use a BufferedInputStream as the input for a GZIPInputStream; instead explicitly set the GZIPInputStream buffer size in the constructor (see the sketch after this list). Keeping a BufferedInputStream there is still safe, though.
  3. If you have a new BufferedInputStream( new FileInputStream( file ) ) object and you call its available method rather often (for example, once or twice per each input message), consider overriding BufferedInputStream.available method. It will greatly speed up file reading.
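
For instance, a sketch of points 1 and 2 (the file name and the 64K size are illustrative): setting the GZIPInputStream buffer directly rather than wrapping it in a BufferedInputStream:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;

    public class GzipBufferExample {
        public static void main(String[] args) throws IOException {
            // 64K internal buffer instead of the 512-byte default:
            try (InputStream in = new GZIPInputStream(
                    new FileInputStream("data.gz"), 65536)) {
                byte[] chunk = new byte[8192];
                while (in.read(chunk) != -1) {
                    // process chunk...
                }
            }
        }
    }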

java.lang.Byte, Short, Integer, Long, Character (boxing and unboxing): java.lang.Byte, java.lang.Short, java.lang.Integer, java.lang.Long, java.lang.Character:

  1. Never call java.lang.Number subclasses valueOf(String) methods. If you need a primitive value - call parse[Type]. If you want an instance of a wrapper class, still call parse[Type] method and rely on the JVM-implemented boxing. It will support caching of most frequently used values. Never call wrapper classes constructors - they always return a new Object, thus bypassing the caching support. Here is the summary of caching support for primitive replacement classes:
  2. Caching ranges: Byte, Short, Long - from -128 to 127; Character - from 0 to 127; Integer - from -128 to java.lang.Integer.IntegerCache.high or 127, whichever is bigger; Float, Double - no caching.
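
To make the valueOf advice concrete, a small sketch (variable names are mine; the == comparisons are there only to demonstrate caching):

    public class ParseExample {
        public static void main(String[] args) {
            String s = "100";
            int primitive = Integer.parseInt(s);   // no boxing at all
            Integer boxed = Integer.parseInt(s);   // auto-boxing -> uses the cache
            Integer fresh = new Integer(s);        // always a new object - avoid
            System.out.println(boxed == Integer.valueOf(100)); // true: cached
            System.out.println(fresh == Integer.valueOf(100)); // false: new object
        }
    }
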
Map.containsKey/Set.contains: java.util.Map, java.util.Set and most of their implementations:

  1. For sets, contains+add/remove call pairs should be replaced with single add/remove calls even if some extra logic was guarded by contains call.
  2. For maps, contains+get pair shall always be replaced with get followed by null-check of get result. contains+remove pair should be replaced with a single remove call and check of its result.
  3. Same ideas are applicable to Trove maps and sets too.
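
A minimal sketch of rules 1 and 2 (names are mine; the map variant assumes the map never stores null values, otherwise the null-check is ambiguous):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class SingleLookupExample {
        public static void main(String[] args) {
            Set<String> visited = new HashSet<String>();
            // One lookup instead of contains + add; add reports whether it was new:
            if (visited.add("node-1")) {
                System.out.println("first visit");
            }

            Map<String, Integer> counts = new HashMap<String, Integer>();
            counts.put("a", 1);
            // One lookup instead of containsKey + get:
            Integer v = counts.get("a");
            if (v != null) {
                System.out.println("count = " + v);
            }
        }
    }
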
java.util.zip.CRC32 and java.util.zip.Adler32 performance: java.util.zip.CRC32, java.util.zip.Adler32 and java.util.zip.Checksum:

  1. If you can choose which checksum implementation to use - try Adler32 first. If its quality is sufficient for you, use it instead of CRC32. In any case, use the Checksum interface to access the Adler32/CRC32 logic.
  2. Try to update checksum by at least 500 byte blocks. Shorter blocks will require a noticeable time to be spent in JNI calls.
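
Accessing both algorithms through the Checksum interface, as suggested above, might look like this sketch (the payload is arbitrary):

    import java.util.zip.Adler32;
    import java.util.zip.Checksum;

    public class ChecksumExample {
        public static void main(String[] args) {
            byte[] block = new byte[4096];      // >= 500 bytes per update call
            Checksum sum = new Adler32();       // swap in new CRC32() freely
            sum.update(block, 0, block.length);
            System.out.println(Long.toHexString(sum.getValue()));
        }
    }
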
hashCode method performance tuning: java.lang.String, java.util.HashMap, java.util.HashSet, java.util.Arrays:

  1. Try to improve distribution of results of your hashCode method. This is far more important than to optimize that method speed. Never write a hashCode method which returns a constant.
  2. String.hashCode results distribution is nearly perfect, so you can sometimes substitute Strings with their hash codes. If you are working with sets of strings, try to end up with BitSets, as described in this article. Performance of your code will greatly improve.
Throwing an exception in Java is very slow: why it is too expensive to throw exceptions in Java: java.lang.Throwable, java.lang.Exception, java.lang.RuntimeException, sun.misc.BASE64Decoder, java.lang.NumberFormatException:

  1. Never use exceptions as return code replacement or for any likely to happen events. Throwing an exception is too expensive - you may experience 100 times slowdown for simple methods.
  2. Avoid using any Number subclass parse*/valueOf methods if you call them for each piece of your data and you expect a lot of non-numerical data. Parse such values manually for top performance.
Java logging performance pitfalls: how to lose as little as possible performance while writing log messages: java.util.logging.Logger, java.util.logging.Handler, java.util.logging.Formatter, java.text.MessageFormat:

  1. If you make expensive calculations while preparing data for log messages, either use Logger.isLoggable and do all data preparation inside or write an object which does all calculations in its toString method.
  2. Never call Object.toString method in order to obtain a log message argument - just pass an original object. Logging framework will call toString method on your object.
  3. Do not mix format-string concatenation with log arguments - a malicious string concatenated into the format could let an application user break your logging or access data that was not meant for them.
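
A sketch of rules 1 and 2 using java.util.logging (the logger name and the expensive call are hypothetical):

    import java.util.logging.Level;
    import java.util.logging.Logger;

    public class LoggingExample {
        private static final Logger LOG = Logger.getLogger("app");

        public static void main(String[] args) {
            Object order = new Object();   // stands in for a domain object
            // Guard the expensive preparation with isLoggable:
            if (LOG.isLoggable(Level.FINE)) {
                LOG.fine("state=" + expensiveSummary());
            }
            // Pass the object itself - the framework calls toString lazily:
            LOG.log(Level.INFO, "processing {0}", order);
        }

        private static String expensiveSummary() {
            return "...";   // imagine a costly computation here
        }
    }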
Base64 encoding and decoding performance: an overview of several well-known Base64 Java implementations from the performance perspective: sun.misc.BASE64Encoder, sun.misc.BASE64Decoder, java.util.Base64 (Java 8 specific), javax.xml.bind.DatatypeConverter (Java 6+), org.apache.commons.codec.binary.Base64, com.google.common.io.BaseEncoding (Google Guava), http://iharder.net/base64, MiGBase64:

  1. If you are looking for a fast and reliable Base64 codec - do not look outside the JDK. There is a new codec in Java 8: java.util.Base64, and there is also one hidden from many eyes (since Java 6): javax.xml.bind.DatatypeConverter. Both are fast, reliable and do not suffer from integer overflows.
  2. 2 out of the 4 third-party codecs described here are very fast: MiGBase64 and iHarder. Unfortunately, if you need to process hundreds of megabytes at a time, only Google Guava will let you decode 2G of data at a time (versus 360MB for MiGBase64 and 720M for iHarder and Apache Commons). Unfortunately, Guava does not support byte[] -> byte[] encoding.
  3. Do not try to call String.getBytes(Charset) on huge strings if your charset is a multibyte one - you may get the whole gamut of integer-overflow-related exceptions.
A possible memory leak in the manual MultiMap implementation: an overview of multimap implementations in Java 8, Google Guava and Scala 2.10, as well as a description of a possible memory leak you can have while manually implementing a multimap using Java 6 or 7:

  1. As you have seen, it is quite easy to miss a memory leak while implementing a multilevel map. You need to be careful and split read and write accesses to the outer map.
  2. Newer frameworks and languages, like Google Guava, Java 8 and Scala, already provide more convenient syntax and a wider choice of collections, allowing you to avoid possible memory leaks in multilevel maps.

java.util.Random and java.util.concurrent.ThreadLocalRandom in multithreaded environments: an overview of java.util.Random and java.util.concurrent.ThreadLocalRandom in single and multithreaded environments as well as some low level analysis of their performance.

  1. Do not share an instance of java.util.Random between several threads in any circumstances, wrap it in ThreadLocal instead.
  2. From Java 7 prefer java.util.concurrent.ThreadLocalRandom to java.util.Random in all circumstances - it is backwards compatible with existing code, but uses cheaper operations internally.
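
A minimal sketch of rule 2 (the dice-roll task is invented): each thread obtains its own generator via ThreadLocalRandom.current(), so nothing is shared:

    import java.util.concurrent.ThreadLocalRandom;

    public class DiceExample {
        public static void main(String[] args) {
            Runnable task = new Runnable() {
                public void run() {
                    // current() returns the calling thread's own generator:
                    int roll = ThreadLocalRandom.current().nextInt(1, 7);
                    System.out.println(Thread.currentThread().getName() + " rolled " + roll);
                }
            };
            new Thread(task).start();
            new Thread(task).start();
        }
    }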

Charset encoding and decoding in Java 7/8: we will check how fast are Charset encoders/decoders in Java 7 and what are the performance improvements in Java 8.

  1. Always prefer national charsets like windows-1252 or Shift_JIS to UTF-8: they produce a more compact binary representation (as a rule) and they are faster to encode/decode (there are some exceptions in Java 7, but it becomes the rule in Java 8).
  2. ISO-8859-1 always works faster than US-ASCII in Java 7 and 8. Choose ISO-8859-1 if you don't have any solid reasons to use US-ASCII.
  3. You can write a very fast String->byte[] conversion for US-ASCII/ISO-8859-1, but you can not beat Java decoders - they have direct access to the output String they create.


Saturday, May 3, 2014

Java Performance Optimizations tips


Garbage Collection Optimization for Low-Latency and High-Throughput  Java Applications 
  1. Tune the GC on a codebase that is near completion and includes performance optimizations.
  2. If you cannot tune on a real workload, you need to tune on synthetic workloads representative of production environments.
  3. GC characteristics you should optimize include: Stop-the-world pause time duration and frequency; CPU contention with the application; Heap fragmentation; GC memory overhead compared to application memory requirements.
  4. You need a clear understanding of GC logs and commonly used JVM parameters to tune GC behavior.
  5. A large heap is required if you need to maintain an object cache of long-lived objects.
  6. GC logging should use -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime.
  7. Use GC logging to analyze GC performance.
  8. GC collection frequency can be decreased by reducing the object allocation/promotion rate and/or increasing the size of the generation.
  9. The duration of young generation GC pause depends on the number of objects that survive a collection.
  10. Increasing the young generation size may produce longer young GC pauses if more data survives and gets copied in survivor spaces, or if more data gets promoted to the old generation; but could have the same pause time and decrease in frequency if the count of surviving objects doesn't increase much.
  11. Applications that mostly create short-lived objects only need to tune the young generation, after making the old generation big enough to hold the initially generated long-lived objects.
  12. Applications that produce long-lived objects need to tune the application so that the promoted objects fill the old generation at a rate that produces an acceptable frequency of old-generation GCs.
  13. If the threshold at which old generation GC is triggered is too low, the application can get stuck in incessant GC cycles. For example the flags -XX:CMSInitiatingOccupancyFraction=92 -XX:+UseCMSInitiatingOccupancyOnly would start the old gen GC only when the old gen heap is 92% full.
  14. Try to minimize the heap fragmentation and the associated full GC pauses for CMS GC with -XX:CMSInitiatingOccupancyFraction.
  15. Tune -XX:MaxTenuringThreshold to reduce the amount of time spent in data copying in the young generation collection while avoiding promoting too many objects, by noting tenuring ages in the GC logs.
  16. Young collection pause duration can increase as the old generation fills up due to object promotion taking more time from backpressure from the old generation (the old gen needing to free space before allowing promotion).
  17. Setting -XX:ParGCCardsPerStrideChunk controls the granularity of tasks given to GC worker threads and helps get the best performance (in this tuning exercise the value 32768 reduced pause times).
  18. The -XX:+BindGCTaskThreadsToCPUs option binds GC threads to individual CPU cores (if implemented in the JVM and if the OS permits).
  19. -XX:+UseGCTaskAffinity allocates tasks to GC worker threads using an affinity parameter (if implemented).
  20. GC pauses with low user time, high system time and high real (wallclock) time imply that the JVM pages are being stolen by Linux. You can use -XX:+AlwaysPreTouch and set vm.swappiness to minimize this. Or mlock, but this would crash the process if RAM is exhausted.
  21. GC pauses with low user time, low system time and high real time imply that the GC threads were recruited by Linux for disk flushes and were stuck in the kernel waiting for I/O. You can use -XX:+AlwaysPreTouch and set vm.swappiness to minimize this. Or mlock, but this would crash the process if RAM is exhausted.
  22. The following combination of options enabled an I/O intensive application with a large cache and mostly short-lived objects after initialization to achieve 60ms pause times for the 99.9th percentile latency: -server -Xms40g -Xmx40g -XX:MaxDirectMemorySize=4096m -XX:PermSize=256m -XX:MaxPermSize=256m -XX:NewSize=6g -XX:MaxNewSize=6g -XX:+UseParNewGC -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=8 -XX:+UnlockDiagnosticVMOptions -XX:ParGCCardsPerStrideChunk=32768 -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -XX:+CMSClassUnloadingEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -XX:-OmitStackTraceInFastThrow.
Ref: http://engineering.linkedin.com/garbage-collection/garbage-collection-optimization-high-throughput-and-low-latency-java-applications

Understanding Throughput and Latency Using Little's Law
  1. Little's Law: occupancy = latency x throughput. Occupancy is the number of requestors in the system, also often referred to as capacity used.
  2. For a given system, Little's Law says the capacity of the system is fixed, inversely relating the maximum latency and maximum throughput of the system. E.g. if you want to increase the maximum throughput of the system by some factor, you need to decrease the latency by that same factor.
  3. When system occupancy (capacity used) has reached its peak, additional requests can't add to the throughput; instead they increase the average latency through queuing delay.
  4. Target latency in preference to throughput. Improving latency improves throughput at the same time - if you do things faster you get more things done. However it's often easier to increase throughput.
  5. Throughput at one level determines latency at the next level up.
  6. When using parallelism to increase throughput, ask: What are the data dependencies and how much data will need to be shared?
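
A worked example of Little's Law (the numbers are invented for illustration): a service with an average latency of 10 ms sustaining 500 requests/second has an occupancy of 0.010 s x 500/s = 5 requests in flight on average. To double the maximum throughput at the same occupancy, the latency would have to drop to 5 ms.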


High Scalability 
  1. Architect for redundancy in everything - including redundancy that survives one element going offline, i.e. at least triple redundancy. A requirement of five 9's availability corresponds to about 5 minutes of downtime per year.
  2. Do not depend on any shared infrastructure - if any infrastructure system goes down, the rest of the system should stay up.
  3. Use caching to optimize performance but not as a requirement for the system to operate at scale. Caches should be a bonus but not a necessity.
  4. Use front-end load balancing to direct traffic to the nearest unloaded system capable of serving the request.
  5. Monitor the system and trigger alarms when thresholds for availability and response times are not met. Monitor hosts, CPUs, interfaces, network devices, file systems, web traffic, response codes, response times, and application metrics.
  6. Test every feature with a subset of users to see how it performs before rolling out to a wider audience.
  7. Use replicated data closer to the servers if data is distant (e.g. replicate across datacentres so servers in one datacentre don't need to access data from another datacentre).
  8. Use a frontend request balancer to the data servers to easily enable scaling by adding new dataservers.

Thread Confinement:
  1. A useful trick for ensuring thread safety is "stack confinement" - any object that is created and doesn't escape a method (i.e. doesn't get referenced from any other object fields) is guaranteed threadsafe, because it only lives in the one thread (the thread owning the stack the method is executing on).
  2. A useful trick for ensuring thread safety is "thread confinement" - any object which is only ever used in the same thread (typically by being held by a ThreadLocal and never being referenced from any other object fields) is guaranteed threadsafe, because it only lives in the one thread. But beware that ThreadLocal held objects can be the cause of a memory leak if they are not managed carefully.
  3. Don't leak a ThreadLocalRandom instance, either by storing it in a field or passing it to a method or having it leak accidentally to an anonymous class. Access it using ThreadLocalRandom.current() (you can call that every time or store that to a method local variable and use it that way).
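
A sketch of thread confinement via ThreadLocal (SimpleDateFormat is a classic example because it is not thread-safe; the date pattern is arbitrary):

    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class ConfinementExample {
        // Each thread lazily gets its own formatter; none is ever shared.
        private static final ThreadLocal<SimpleDateFormat> FMT =
                new ThreadLocal<SimpleDateFormat>() {
                    @Override protected SimpleDateFormat initialValue() {
                        return new SimpleDateFormat("yyyy-MM-dd");
                    }
                };

        public static String today() {
            return FMT.get().format(new Date());   // safe: confined to this thread
        }

        public static void main(String[] args) {
            System.out.println(today());
        }
    }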

Memory Computing 
  1. In-memory computing speeds up data processing by roughly 5,000 times compared to using disks.
  2. In-memory computing requires less hardware meaning decreased capital, operational and infrastructure overhead; existing hardware lifetimes can also be extended.
  3. In-memory computing uses a tiered approach to data storage: RAM, local disk and remotely accessible disk.
  4. In-memory computing only puts the operational data into memory for processing - offline, backup and historical data should all remain out of memory.

Parallel Array Operations in Java 8

  1. Arrays.parallelSort() implements a parallel sort-merge algorithm that recursively breaks an array into pieces, sorts them, then recombines them concurrently and transparently, resulting in gains in performance and efficiency when sorting large arrays (compared to using the serial Arrays.sort). For large arrays, this can improve sorting time by a factor corresponding to the number of cores available.
  2. Arrays.parallelPrefix() allows a mathematical operation to be performed automatically in parallel on an array. For large arrays, this can improve calculation time by a factor corresponding to the number of cores available.
  3. Arrays.parallelSetAll() sets each element of an array in parallel using any specified function.
  4. Spliterators split and traverse arrays, collections and channels in parallel. For example:
         Spliterator spliterator = ... // e.g. Arrays.spliterator(anArray);
         spliterator.forEachRemaining( n -> action(n) );
  5. Stream processing in Java 8 allows for parallel processing on a dataset derived from an array or collection or channel. A Stream allows you to perform aggregate functions such as pulling out only distinct values (ignoring duplicates) from a set, data conversions, finding min and max values, map-reduce functions, and other mathematical operations.
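
A compact sketch of items 1-3 (the array size and fill function are arbitrary):

    import java.util.Arrays;

    public class ParallelArraysExample {
        public static void main(String[] args) {
            long[] a = new long[1_000_000];
            Arrays.parallelSetAll(a, i -> i % 97);   // fill each element in parallel
            Arrays.parallelSort(a);                  // parallel sort-merge
            Arrays.parallelPrefix(a, Long::sum);     // running totals in parallel
            System.out.println(a[a.length - 1]);
        }
    }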

Using SharedHashMap http://www.fasterj.com/articles/sharedhashmap1a.shtml


  1. SharedHashMap is a high performance persisted off-heap hash map, shareable across processes.
  2. ProcessInstanceLimiter allows you to limit the number of processes running for a class of processes.
  3. Off-heap memory storage is useful for low latency applications, to avoid GC overheads on that data.
  4. Low latency applications benefit from "no-copy" implementations, where you avoid copying objects as much as possible, thus minimizing or even avoiding GC.
  5. Techniques to support "no-copy" implementations include: methods which atomically assign references to existing objects or create the object if absent; using primitive data types wherever possible; writing directly to shared objects; using off-heap memory.
  6. Externalizable is usually more efficient than Serializable, and can be massively more efficient.
  7. SharedHashMap is thread-safe across threads spanning multiple processes.
  8. SharedHashMap supports concurrency control mechanisms for data objects referenced by threads spanning multiple processes.


11 Best Practices for Low Latency Systems 
  1. Scripting languages are not appropriate for low latency. You need something that gets compiled (JIT compilation is fine) and has a strong memory model to enable lock free programming. Obviously Java satisfies these requirements.
  2. For low latency ensure that all data is in memory - I/O will kill latency. Typically use in-memory data structures with persistent logging to allow rebuilding the state after a crash.
  3. For low latency keep data and processing colocated - network latency is an overhead you want to avoid if at all possible.
  4. Low latency requires always having free resources to process the request, so the system should normally be underutilized.
  5. Context switches impact latency. Limit the thread count to the number of cores available to the application and pin each thread to its own core.
  6. Keep your reads sequential: All forms of storage perform significantly better when used sequentially (prefetching is a feature of most OSs and hardware).
  7. For low latency, following pointers through linked lists or arrays of objects should be avoided at all costs, as this requires random access to memory, which is much less efficient than sequential access. Arrays of primitive data types or structs are hugely better.
  8. Batch writes by having a continuous writing thread which writes all data that is passed to its buffer. Do not pause to pass data from other threads, pass data to the buffer immediately.
  9. Try to work with the caches - fitting data into caches is ideal. Cache-oblivious algorithms (ones that work with the cache regardless of its size) work by recursively breaking down the data until it fits in cache and then doing any necessary processing.
  10. Non blocking and wait-free data structures and algorithms are friends to low latency processing. Lock-free is probably good enough if wait-free is too difficult.
  11. Any processing (especially I/O) that is not absolutely necessary for the response should be done outside the critical path.
  12. Parallelize as much as possible.
  13. If there is a garbage collector, work with it to ensure pauses are absent or within acceptable parameters (this may require highly constrained object management, off heap processing, etc).



Java Marshalling Performance 
  1. When marshalling: aim to scan through primitive data type arrays; consider boundary alignments; control garbage, recycle if necessary; compute the data layout before runtime; work with the compiler and OS and hardware; be very careful about the code you generate, make it minimal and efficient.
  2. Having data together in cache lines can make the data access an order of magnitude faster.
  3. Order fields in their structure (probably naturally ordered by declaration order) according to the order in which they will be accessed.
  4. Use a (direct) buffer correctly sized and sequentially read/write the elements in the buffer using data primitives and no object creation for highly efficient marshalling.
  5. FIX/SBE (Simple Binary Encoding) marshalling can be 50 times faster than Google Protocol Buffers.




Top 10 - Performance Folklore 
  1. Sequential disk and memory access is much faster than random access.
  2. Working off heap allows you to avoid putting pressure on the garbage collector.
  3. Decoupled classes and components are more efficient. Keep your code cohesive.
  4. Choose your data structures for the use required.
  5. Making the procedure parallel is not necessarily faster - you need to use more threads, locks, pools, etc. Make sure you measure the difference. Single threaded solutions can be faster, or at least fast enough.
  6. If going parallel, use message passing and pipelining to keep the implementation simple and efficient.
  7. Logging is usually slow. Efficient logging would be asynchronous, and log in binary.
  8. Beware that parsing libraries can be very inefficient.
  9. Performance test your application.




Performance tuning legacy applications 
  1. Applications can be optimized forever - you must set performance goals appropriate to the business or you'll waste time and resources tuning beyond the performance needed.
  2. User response times can generally be categorized as: 0.1 seconds feels like an instantaneous response (normal browsing should be in this range); 1 second is a noticeable delay but allows users to continue their flow and still feel in control (searches can fall in this range); 10 seconds makes the user feel at the mercy of the computer, but can be handled (e.g. generating a PDF); over 10 seconds and the user's flow is completely disrupted (so this should be targeted only for things the user would reasonably expect to wait for, like end-of-day report generation).
  3. Without sensible categorization of performance goals, everything tends to get dumped into the "we want instant response" bucket - but this would be very expensive to achieve, so it's important to categorize correctly.
  4. A site should be optimized against the "Google Page Speed" and "YSlow" recommendations.
  5. Measure ideally from the user perspective, but at least for full service times of requests. If profiling tools are unavailable or impractical, a detailed logging trail of requests can help you identify underperforming components.


Combining Agile with Load and Performance Testing
  1. Load and performance testing can determine how much load an application can handle before it crashes, when to add another server, when to reconfigure the network, where code needs to be optimized, and more.
  2. Performance testing during continuous integration lets you catch issues early, when they are cheaper to fix and won't impact release dates. Use SLAs (Service Level Agreements) to provide targets for the continuous integration performance tests.
  3. Sometimes a small change can lead to disproportionate effects without it being realised beforehand. Integrating load testing into the continuous integration process stops small apparently innocuous changes getting pushed out to production without seeing their performance effect on the system, ensuring that performance targets remain satisfied.
  4. Performance testing during continuous integration gives developers quick feedback on how their change has affected performance, allowing for changes to be handled in a more "normal" development way, rather than having to wait for results on code that they haven't worked on for a while.