Saturday, May 3, 2014

Java Performance Optimization Tips


Garbage Collection Optimization for Low-Latency and High-Throughput Java Applications 
  1. Tune the GC on a codebase that is near completion and includes performance optimizations.
  2. If you cannot tune on a real workload, you need to tune on synthetic workloads representative of production environments.
  3. GC characteristics you should optimize include: Stop-the-world pause time duration and frequency; CPU contention with the application; Heap fragmentation; GC memory overhead compared to application memory requirements.
  4. You need a clear understanding of GC logs and commonly used JVM parameters to tune GC behavior.
  5. A large heap is required if you need to maintain an object cache of long-lived objects.
  6. GC logging should use -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime.
  7. Use GC logging to analyze GC performance.
  8. GC collection frequency can be decreased by reducing the object allocation/promotion rate and/or increasing the size of the generation.
  9. The duration of young generation GC pause depends on the number of objects that survive a collection.
  10. Increasing the young generation size may produce longer young GC pauses if more data survives and gets copied in the survivor spaces, or if more data gets promoted to the old generation; but it could keep the same pause time with decreased frequency if the number of surviving objects doesn't increase much.
  11. Applications that mostly create short-lived objects only need to tune the young generation, after making the old generation big enough to hold the initially generated long-lived objects.
  12. Applications that produce long-lived objects need to tune the application so that the promoted objects fill the old generation at a rate that produces an acceptable frequency of old-generation GCs.
  13. If the threshold at which old generation GC is triggered is too low, the application can get stuck in incessant GC cycles. For example the flags -XX:CMSInitiatingOccupancyFraction=92 -XX:+UseCMSInitiatingOccupancyOnly would start the old gen GC only when the old gen heap is 92% full.
  14. Try to minimize the heap fragmentation and the associated full GC pauses for CMS GC with -XX:CMSInitiatingOccupancyFraction.
  15. Tune -XX:MaxTenuringThreshold to reduce the amount of time spent in data copying in the young generation collection while avoiding promoting too many objects, by noting tenuring ages in the GC logs.
  16. Young collection pause duration can increase as the old generation fills up, because object promotion takes more time due to backpressure from the old generation (the old gen needing to find free space before allowing promotion).
  17. Setting -XX:ParGCCardsPerStrideChunk controls the granularity of tasks given to GC worker threads and helps get the best performance (in this tuning exercise the value 32768 reduced pause times).
  18. The -XX:+BindGCTaskThreadsToCPUs option binds GC threads to individual CPU cores (if implemented in the JVM and if the OS permits).
  19. -XX:+UseGCTaskAffinity allocates tasks to GC worker threads using an affinity parameter (if implemented).
  20. GC pauses with low user time, high system time and high real (wallclock) time imply that the JVM pages are being stolen by Linux. You can use -XX:+AlwaysPreTouch and set vm.swappiness to minimize this. Or mlock, but this would crash the process if RAM is exhausted.
  21. GC pauses with low user time, low system time and high real time imply that the GC threads were recruited by Linux for disk flushes and were stuck in the kernel waiting for I/O. You can use -XX:+AlwaysPreTouch and set vm.swappiness to minimize this. Or mlock, but this would crash the process if RAM is exhausted.
  22. The following combination of options enabled an I/O intensive application with a large cache and mostly short-lived objects after initialization to achieve 60ms pause times for the 99.9th percentile latency: -server -Xms40g -Xmx40g -XX:MaxDirectMemorySize=4096m -XX:PermSize=256m -XX:MaxPermSize=256m -XX:NewSize=6g -XX:MaxNewSize=6g -XX:+UseParNewGC -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=8 -XX:+UnlockDiagnosticVMOptions -XX:ParGCCardsPerStrideChunk=32768 -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -XX:+CMSClassUnloadingEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -XX:-OmitStackTraceInFastThrow.
Ref: http://engineering.linkedin.com/garbage-collection/garbage-collection-optimization-high-throughput-and-low-latency-java-applications

Understanding Throughput and Latency Using Little's Law
  1. Little's Law: occupancy = latency x throughput. Occupancy is the number of requestors in the system, also often referred to as capacity used.
  2. For a given system the capacity is fixed, so Little's Law inversely relates the maximum latency and maximum throughput of the system. E.g. if you want to increase the maximum throughput of the system by some factor, you need to decrease the latency by that same factor (see the worked example after this list).
  3. When system occupancy (capacity used) has reached its peak, additional requests can't add to the throughput; instead they increase the average latency caused by queuing delay.
  4. Target latency in preference to throughput. Improving latency improves throughput at the same time - if you do things faster, you get more things done. However, it is often easier to increase throughput than to decrease latency.
  5. Throughput at one level determines latency at the next level up.
  6. When using parallelism to increase throughput, ask: What are the data dependencies and how much data will need to be shared?
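
As a rough worked example of points 1-3 (the request counts and latencies below are illustrative assumptions, not figures from the source):

    // Little's Law: occupancy = latency x throughput
    public class LittlesLaw {
        public static void main(String[] args) {
            double occupancy = 100;          // requests in the system (capacity used)
            double latencySeconds = 0.050;   // 50 ms average latency

            // Rearranged: throughput = occupancy / latency
            double maxThroughput = occupancy / latencySeconds;
            System.out.println("Max throughput ~ " + maxThroughput + " requests/second"); // ~2000

            // Halving latency at the same occupancy doubles the maximum throughput.
            System.out.println("At 25 ms latency ~ " + (occupancy / 0.025) + " requests/second"); // ~4000
        }
    }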


High Scalability 
  1. Architect for redundancy in everything - including redundancy that remains when one element is offline, i.e. at least triple redundancy. A requirement for five 9's availability corresponds to about 5 minutes of downtime per year.
  2. Do not depend on any shared infrastructure - if any infrastructure system goes down, the rest of the system should stay up.
  3. Use caching to optimize performance but not as a requirement for the system to operate at scale. Caches should be a bonus but not a necessity.
  4. Use front-end load balancing to direct traffic to the nearest unloaded system capable of serving the request.
  5. Monitor the system and trigger alarms when thresholds for availability and response times are not met. Monitor hosts, CPUs, interfaces, network devices, file systems, web traffic, response codes, response times, and application metrics.
  6. Test every feature with a subset of users to see how it performs before rolling out to a wider audience.
  7. Use replicated data closer to the servers if data is distant (e.g. replicate data across datacentres so that servers in one datacentre don't need to access data from another datacentre).
  8. Use a frontend request balancer to the data servers to easily enable scaling by adding new dataservers.

Thread Confinement:
  1. A useful trick for ensuring thread safety is "stack confinement" - any object that is created and doesn't escape a method (i.e. doesn't get referenced from any other object fields) is guaranteed threadsafe, because it only lives in the one thread (the thread owning the stack the method is executing on).
  2. A useful trick for ensuring thread safety is "thread confinement" - any object which is only ever used in the same thread (typically by being held by a ThreadLocal and never being referenced from any other object fields) is guaranteed threadsafe, because it only lives in the one thread. But beware that ThreadLocal held objects can be the cause of a memory leak if they are not managed carefully.
  3. Don't leak a ThreadLocalRandom instance, whether by storing it in a field, passing it to a method, or having it leak accidentally into an anonymous class. Access it using ThreadLocalRandom.current() (you can call that every time, or store the result in a method-local variable and use it that way), as in the sketch below.
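
A minimal sketch of stack confinement (point 1) and ThreadLocalRandom usage (point 3); the class and method names are illustrative:

    import java.util.concurrent.ThreadLocalRandom;

    public class ConfinementExamples {

        // Stack confinement (point 1): 'sb' never escapes this method,
        // so it is safe without any synchronization.
        static String buildGreeting(String name) {
            StringBuilder sb = new StringBuilder();   // lives only on this thread's stack
            sb.append("Hello, ").append(name);
            return sb.toString();                     // only the immutable result escapes
        }

        // ThreadLocalRandom (point 3): always obtain it via current(); never store
        // the instance in a field or hand it to another thread.
        static int rollDice() {
            ThreadLocalRandom random = ThreadLocalRandom.current(); // method-local use is fine
            return random.nextInt(1, 7);
        }

        public static void main(String[] args) {
            System.out.println(buildGreeting("world"));
            System.out.println("Dice: " + rollDice());
        }
    }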

Memory Computing 
  1. In-memory computing speeds up data processing by roughly 5,000 times compared to using disks.
  2. In-memory computing requires less hardware meaning decreased capital, operational and infrastructure overhead; existing hardware lifetimes can also be extended.
  3. In-memory computing uses a tiered approach to data storage: RAM, local disk and remotely accessible disk.
  4. In-memory computing only puts the operational data into memory for processing - offline, backup and historical data should all remain out of memory.

Parallel Array Operations in Java 8

  1. Arrays.parallelSort() implements a parallel sort-merge algorithm that recursively breaks an array into pieces, sorts them, then recombines them concurrently and transparently, resulting in gains in performance and efficiency when sorting large arrays (compared to using the serial Arrays.sort). For large arrays, this can improve sorting time by a factor corresponding to the number of cores available.
  2. Arrays.parallelPrefix() allows a mathematical operation to be performed automatically in parallel on an array. For large arrays, this can improve calculation time by a factor corresponding to the number of cores available.
  3. Arrays.parallelSetAll() sets each element of an array in parallel using any specified function.
  4. Spliterators split and traverse arrays, collections and channels in parallel. Spliterator spliterator = ... //e.g. Arrays.spliterator(anArray); spliterator.forEachRemaining( n -> action(n) );
  5. Stream processing in Java 8 allows for parallel processing on a dataset derived from an array or collection or channel. A Stream allows you to perform aggregate functions such as pulling out only distinct values (ignoring duplicates) from a set, data conversions, finding min and max values, map-reduce functions, and other mathematical operations.
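
A short sketch pulling the points above together; the array contents and the operations applied are illustrative assumptions:

    import java.util.Arrays;
    import java.util.Spliterator;

    public class ParallelArrayOps {
        public static void main(String[] args) {
            // 3. parallelSetAll: fill each element in parallel from a function of its index
            int[] values = new int[1_000_000];
            Arrays.parallelSetAll(values, i -> i % 1000);

            // 1. parallelSort: parallel sort-merge across the available cores
            Arrays.parallelSort(values);

            // 2. parallelPrefix: cumulative operation (here a running sum) applied in parallel
            long[] runningTotals = {1, 2, 3, 4, 5};
            Arrays.parallelPrefix(runningTotals, Long::sum);   // becomes {1, 3, 6, 10, 15}
            System.out.println(Arrays.toString(runningTotals));

            // 4. Spliterator: traverse (and potentially split) part of the array
            Spliterator.OfInt spliterator = Arrays.spliterator(values, 0, 10);
            spliterator.forEachRemaining((int n) -> System.out.print(n + " "));
            System.out.println();

            // 5. Streams: parallel aggregate operations such as distinct and max
            long distinctCount = Arrays.stream(values).parallel().distinct().count();
            int max = Arrays.stream(values).parallel().max().getAsInt();
            System.out.println("distinct=" + distinctCount + " max=" + max);
        }
    }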

Using SharedHashMap http://www.fasterj.com/articles/sharedhashmap1a.shtml


  1. SharedHashMap is a high performance persisted off-heap hash map, shareable across processes.
  2. ProcessInstanceLimiter allows you to limit the number of processes running for a class of processes.
  3. Off-heap memory storage is useful for low latency applications, to avoid GC overheads on that data (a generic off-heap sketch follows this list).
  4. Low latency applications benefit from "no-copy" implementations, where you avoid copying objects as much as possible, thus minimizing or even avoiding GC.
  5. Techniques to support "no-copy" implementations include: methods which atomically assign references to existing objects or create the object if absent; using primitive data types wherever possible; writing directly to shared objects; using off-heap memory.
  6. Externalizable is usually more efficient than Serializable, and can be massively more efficient.
  7. SharedHashMap is thread-safe across threads spanning multiple processes.
  8. SharedHashMap supports concurrency control mechanisms for data objects referenced by threads spanning multiple processes.
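
Point 3 notes that off-heap storage avoids GC overhead on that data. The sketch below uses a plain direct ByteBuffer rather than the SharedHashMap API itself, just to illustrate keeping a fixed-layout record outside the Java heap; the record fields are assumptions for illustration:

    import java.nio.ByteBuffer;

    // Illustrative only: fixed-layout records (long id + double price) stored off-heap.
    // The data lives in native memory, so it adds nothing to the GC workload.
    public class OffHeapRecords {
        private static final int RECORD_SIZE = Long.BYTES + Double.BYTES; // 16 bytes

        private final ByteBuffer buffer;

        OffHeapRecords(int maxRecords) {
            this.buffer = ByteBuffer.allocateDirect(maxRecords * RECORD_SIZE); // off-heap
        }

        void put(int index, long id, double price) {
            int offset = index * RECORD_SIZE;
            buffer.putLong(offset, id);                   // absolute writes: no object creation
            buffer.putDouble(offset + Long.BYTES, price);
        }

        double priceAt(int index) {
            return buffer.getDouble(index * RECORD_SIZE + Long.BYTES);
        }

        public static void main(String[] args) {
            OffHeapRecords records = new OffHeapRecords(1_000);
            records.put(0, 42L, 99.5);
            System.out.println("price of record 0 = " + records.priceAt(0));
        }
    }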


11 Best Practices for Low Latency Systems 
  1. Scripting languages are not appropriate for low latency. You need something that gets compiled (JIT compilation is fine) and has a strong memory model to enable lock free programming. Obviously Java satisfies these requirements.
  2. For low latency ensure that all data is in memory - I/O will kill latency. Typically use in-memory data structures with persistent logging to allow rebuilding the state after a crash.
  3. For low latency keep data and processing colocated - network latency is an overhead you want to avoid if at all possible.
  4. Low latency requires always having free resources to process the request, so the system should normally be underutilized.
  5. Context switches impact latency. Limit the thread count to the number of cores available to the application and pin each thread to its own core.
  6. Keep your reads sequential: All forms of storage perform significantly better when used sequentially (prefetching is a feature of most OSs and hardware).
  7. For low latency, following pointers through linked lists or arrays of objects should be avoided at all costs, as this requires random access to memory, which is much less efficient than sequential access. Arrays of primitive data types or structs are hugely better.
  8. Batch writes by having a dedicated, continuously writing thread which writes all the data passed to its buffer. Other threads should not pause to hand off their data; they should pass it to the buffer immediately (see the sketch after this list).
  9. Try to work with the caches - fitting data into caches is ideal. Cache-oblivious algorithms (ones that work with the cache regardless of its size) work by recursively breaking down the data until it fits in cache and then doing any necessary processing.
  10. Non blocking and wait-free data structures and algorithms are friends to low latency processing. Lock-free is probably good enough if wait-free is too difficult.
  11. Any processing (especially I/O) that is not absolutely necessary for the response should be done outside the critical path.
  12. Parallelize as much as possible.
  13. If there is a garbage collector, work with it to ensure pauses are absent or within acceptable parameters (this may require highly constrained object management, off heap processing, etc).
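
A minimal sketch of the batching writer from point 8; the queue type, queue size and the write target are illustrative assumptions:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // A single dedicated writer thread drains everything handed to its buffer and
    // writes it in batches; producer threads only enqueue and never block on I/O.
    public class BatchingWriter {
        private final BlockingQueue<String> buffer = new ArrayBlockingQueue<>(65_536);

        void submit(String record) {
            buffer.offer(record);                    // producers hand off immediately (drops if full)
        }

        void startWriterThread() {
            Thread writer = new Thread(() -> {
                List<String> batch = new ArrayList<>();
                try {
                    while (true) {
                        batch.add(buffer.take());    // wait for at least one record
                        buffer.drainTo(batch);       // then grab everything else available
                        write(batch);                // one batched write instead of many small ones
                        batch.clear();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();  // shut down
                }
            }, "batching-writer");
            writer.setDaemon(true);
            writer.start();
        }

        private void write(List<String> batch) {
            // Illustrative stand-in for the real I/O (file, socket, journal)
            System.out.println("writing batch of " + batch.size());
        }

        public static void main(String[] args) throws InterruptedException {
            BatchingWriter w = new BatchingWriter();
            w.startWriterThread();
            for (int i = 0; i < 10; i++) w.submit("record-" + i);
            Thread.sleep(100);                       // give the writer a moment before the JVM exits
        }
    }

In a genuinely latency-critical path you would likely replace the blocking queue with a lock-free or wait-free structure (point 10), but the single dedicated writer pattern stays the same.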



Java Marshalling Performance 
  1. When marshalling: aim to scan through primitive data type arrays; consider boundary alignments; control garbage, recycle if necessary; compute the data layout before runtime; work with the compiler and OS and hardware; be very careful about the code you generate, make it minimal and efficient.
  2. Having data together in cache lines can make the data access an order of magnitude faster.
  3. Order fields in their structure (probably naturally ordered by declaration order) according to the order in which they will be accessed.
  4. For highly efficient marshalling, use a correctly sized (direct) buffer and sequentially read/write the elements using primitive data types, with no object creation (a sketch follows this list).
  5. FIX/SBE (Simple Binary Encoding) marshalling can be 50 times faster than Google Protocol Buffers.
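
A hedged sketch of point 4; the message fields and their order are assumptions for illustration:

    import java.nio.ByteBuffer;

    // Marshal a simple trade message into a pre-sized direct buffer using only
    // primitive reads/writes, in the order the fields will be accessed - no
    // intermediate objects are created.
    public class PrimitiveMarshalling {
        public static void main(String[] args) {
            ByteBuffer buffer = ByteBuffer.allocateDirect(64);

            // Sequential writes of primitives
            buffer.putLong(1234567L);   // order id
            buffer.putInt(500);         // quantity
            buffer.putDouble(101.25);   // price

            buffer.flip();              // switch from writing to reading

            // Sequential reads in the same field order
            long orderId = buffer.getLong();
            int quantity = buffer.getInt();
            double price = buffer.getDouble();
            System.out.println(orderId + " x" + quantity + " @ " + price);
        }
    }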




Top 10 - Performance Folklore 
  1. Sequential disk and memory access is much faster than random access (see the sketch after this list).
  2. Working off heap allows you to avoid putting pressure on the garbage collector.
  3. Decoupled classes and components are more efficient. Keep your code cohesive.
  4. Choose your data structures for the use required.
  5. Making the procedure parallel is not necessarily faster - you need to use more threads, locks, pools, etc. Make sure you measure the difference. Single threaded solutions can be faster, or at least fast enough.
  6. If going parallel, use message passing and pipelining to keep the implementation simple and efficient.
  7. Logging is usually slow. Efficient logging would be asynchronous, and log in binary.
  8. Beware that parsing libraries can be very inefficient.
  9. Performance test your application.
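
A naive sketch illustrating point 1 for memory access; the array size and timing approach are assumptions, and a real measurement would use a proper benchmark harness such as JMH:

    import java.util.Random;

    // Walk the same array sequentially and in a random order and compare rough timings.
    // Sequential access benefits from hardware prefetching and cache-line locality.
    public class SequentialVsRandom {
        public static void main(String[] args) {
            int size = 8_000_000;
            long[] data = new long[size];
            int[] randomOrder = new int[size];
            Random random = new Random(42);
            for (int i = 0; i < size; i++) {
                data[i] = i;
                randomOrder[i] = random.nextInt(size);
            }

            long start = System.nanoTime();
            long sum = 0;
            for (int i = 0; i < size; i++) sum += data[i];              // sequential walk
            long sequentialNanos = System.nanoTime() - start;

            start = System.nanoTime();
            for (int i = 0; i < size; i++) sum += data[randomOrder[i]]; // random walk
            long randomNanos = System.nanoTime() - start;

            System.out.println("sum=" + sum + " sequential=" + sequentialNanos / 1_000_000
                    + "ms random=" + randomNanos / 1_000_000 + "ms");
        }
    }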




Performance tuning legacy applications 
  1. Applications can be optimized forever - you must set performance goals appropriate to the business or you'll waste time and resources tuning beyond the performance needed.
  2. User response times can be generally categorized as: 0.1 seconds feels like an instantaneous response (normal browsing should be in this range); 1 second is a noticeable delay but allows the user to continue their flow and they still feel in control (searches can fall in this range); 10 seconds makes the user feel at the mercy of the computer, but can be handled (e.g. generating a PDF); over 10 seconds and the user's flow is completely disrupted (so should only be targeted for things the user would reasonably expect to wait for, like end-of-day report generation).
  3. Without sensible categorization of performance goals, everything tends to get dumped into the "we want instant response" bucket - but this would be very expensive to achieve, so it's important to categorize correctly.
  4. A site should be optimized against the "Google Page Speed" and "YSlow" recommendations.
  5. Measure ideally from the user perspective, but at least measure the full service times of requests. If profiling tools are unavailable or impractical, a detailed logging trail of requests can help you identify underperforming components (as sketched below).
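
A minimal sketch of point 5, timing an operation end-to-end and logging it; the logger setup and operation name are illustrative assumptions:

    import java.util.function.Supplier;
    import java.util.logging.Logger;

    // Wrap each request with a start/stop timestamp so a log trail shows where
    // the time goes, even without a profiler attached.
    public class RequestTiming {
        private static final Logger LOG = Logger.getLogger("request-timing");

        static <T> T timed(String operation, Supplier<T> work) {
            long start = System.nanoTime();
            try {
                return work.get();
            } finally {
                long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
                LOG.info(operation + " took " + elapsedMillis + " ms");
            }
        }

        public static void main(String[] args) {
            String page = timed("renderHomePage", () -> "rendered page"); // illustrative work
            System.out.println(page.length());
        }
    }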


Combining Agile with Load and Performance Testing
  1. Load and performance testing can determine how much load an application can handle before it crashes, when to add another server, when to reconfigure the network, where code needs to be optimized, and more.
  2. Performance testing during continuous integration lets you catch issues early, when they are cheaper to fix and won't impact release dates. Use SLAs (Service Level Agreements) to provide targets for the continuous integration performance tests (see the sketch after this list).
  3. Sometimes a small change can lead to disproportionate effects without it being realised beforehand. Integrating load testing into the continuous integration process stops small apparently innocuous changes getting pushed out to production without seeing their performance effect on the system, ensuring that performance targets remain satisfied.
  4. Performance testing during continuous integration gives developers quick feedback on how their change has affected performance, allowing for changes to be handled in a more "normal" development way, rather than having to wait for results on code that they haven't worked on for a while.
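
A hedged sketch of point 2, assuming JUnit 4: a continuous-integration test that fails the build when a response-time SLA is exceeded. The SLA value, sample count and serviceCall() stand-in are assumptions; a real suite would drive the load with a dedicated load-testing tool:

    import static org.junit.Assert.assertTrue;

    import java.util.Arrays;
    import org.junit.Test;

    public class ResponseTimeSlaTest {
        private static final long SLA_MILLIS = 200;   // illustrative SLA target
        private static final int SAMPLES = 100;

        @Test
        public void ninetyFifthPercentileStaysWithinSla() {
            long[] timings = new long[SAMPLES];
            for (int i = 0; i < SAMPLES; i++) {
                long start = System.nanoTime();
                serviceCall();                                    // the operation under test
                timings[i] = (System.nanoTime() - start) / 1_000_000;
            }
            Arrays.sort(timings);
            long p95 = timings[(int) (SAMPLES * 0.95) - 1];
            assertTrue("95th percentile " + p95 + "ms exceeds SLA of " + SLA_MILLIS + "ms",
                    p95 <= SLA_MILLIS);
        }

        private void serviceCall() {
            // Illustrative stand-in for a call to the system under test.
        }
    }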



