Why It Is Fast
Introduction
The framework's internal message transport layer uses the Aeron + SBE combination: true zero-copy, no loopback, no reflection, zero GC, no runtime parsing, and near-zero encoding/decoding overhead. Through a lock-free design and a zero-GC strategy, it achieves sub-microsecond, even nanosecond-level, end-to-end latency; over IPC in particular, performance can approach the hardware limit.
Shared Memory (Aeron IPC)
Aeron's unique IPC mechanism is based on shared memory and lock-free design, fundamentally eliminating copy and context-switch overhead during inter-process transfer.
Difference between Aeron IPC and Netty Zero-Copy
| Feature | Netty system-level zero-copy | Aeron application-level zero-copy (IPC) |
|---|---|---|
| Transport medium | Kernel buffers, network protocol stack | Shared-memory files (mapped files) |
| Data flow | Reduces copies between user space and kernel space, but still involves the kernel network stack | Completely bypasses kernel, network stack, and copies between user spaces |
| Zero-copy definition | Reduces CPU copies during system calls | Cross-process, fully eliminates data copying |
Aeron allocates a memory region (Log Buffer) at the underlying layer so different processes can point to the same memory. By mapping the same memory into multiple processes, data transfer between processes does not require copying through kernel, network stack, or user space. This is a more complete, higher-level zero-copy model.
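The mechanism can be illustrated with nothing but the JDK: Aeron's Log Buffers are memory-mapped files, and any process that maps the same file sees the same physical pages. A minimal stdlib-only sketch of that idea (this is an illustration of the principle, not Aeron's actual implementation) uses two independent `MappedByteBuffer` mappings standing in for two processes:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SharedMappingSketch {
    /** Writes through one mapping and reads through another; returns what the reader saw. */
    public static long roundTrip() {
        try {
            Path file = Files.createTempFile("log-buffer", ".dat");
            try (FileChannel ch = FileChannel.open(file,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                // Two independent mappings of the same region, as two processes would hold.
                MappedByteBuffer writer = ch.map(FileChannel.MapMode.READ_WRITE, 0, 64);
                MappedByteBuffer reader = ch.map(FileChannel.MapMode.READ_WRITE, 0, 64);

                writer.putLong(0, 42L);   // "publisher" writes in place
                return reader.getLong(0); // "subscriber" reads the same memory: no copy,
                                          // no socket, no kernel network stack involved
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip()); // 42
    }
}
```

The write becomes visible to the reader through the shared page cache; no bytes cross a socket or are copied between user spaces, which is the essence of the application-level zero-copy model described above.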
The diagram below shows two processes pointing to the same memory. This memory is a reusable ring buffer (composed of Term Buffers) to avoid GC. Memory reuse is a key reason for Aeron's high performance.
The ring buffer completely avoids frequent memory allocation/release during data transfer, greatly reducing JVM GC pressure and pauses, and ensuring very low latency.
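The reuse principle can be sketched with a tiny single-producer/single-consumer ring buffer over one preallocated array. This is a deliberately simplified illustration (far simpler than Aeron's rotating Term Buffers): slots are written and read in place, so steady-state operation allocates nothing and generates no garbage.

```java
public class RingSketch {
    private final long[] slots;  // preallocated once, reused forever
    private final int mask;
    private long head, tail;     // single-producer / single-consumer indices

    /** Capacity must be a power of two so index wrapping is a cheap mask. */
    public RingSketch(int capacityPow2) {
        slots = new long[capacityPow2];
        mask = capacityPow2 - 1;
    }

    public boolean offer(long value) {
        if (tail - head == slots.length) return false;  // full: back-pressure
        slots[(int) (tail & mask)] = value;             // write in place, no allocation
        tail++;
        return true;
    }

    /** Returns the next value, or Long.MIN_VALUE as a sentinel if the ring is empty. */
    public long poll() {
        if (head == tail) return Long.MIN_VALUE;
        long v = slots[(int) (head & mask)];
        head++;
        return v;
    }
}
```

Note the long-based sentinel instead of a boxed `Long` return: avoiding autoboxing on the hot path is the same zero-GC discipline the text describes.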
Advantages of this design:
- Latency: extremely low. It avoids network stack and context-switch overhead, usually at the nanosecond/microsecond level.
- Transfer model: because multiple processes map the same memory, data transfer between processes achieves zero-copy transmission.
- Overhead: fundamentally eliminates data-copy and context-switch overhead in inter-process transfer.
SBE (Simple Binary Encoding)
Aeron and SBE are often used together as a "best combination" to provide an end-to-end optimal solution for high-throughput, low-latency applications.
SBE is a message encoding/decoding standard for high-performance financial and trading applications. Its core goal is to achieve extreme CPU efficiency while keeping very low latency. SBE is designed entirely for machine efficiency, not human readability. It discards the unnecessary overhead common in XML, JSON, Google Protobuf, Thrift, and similar formats.
Key characteristics:
- Extreme CPU efficiency: SBE encoding/decoding incurs near-zero CPU overhead, using a direct "on-the-wire" encoding/decoding mechanism.
- Fixed memory layout: field sizes are fixed, and the message structure and field offsets are known at compile time, so no runtime parsing is needed.
- Zero GC: minimal memory overhead; encoding/decoding creates no temporary objects and allocates no new Java objects.
SBE is extremely fast, with encode/decode latency at the nanosecond level. Thanks to fixed memory layout, performance is often several times faster than runtime-parsing frameworks such as Google Protobuf. On a single core, SBE can easily process tens of millions of messages per second, demonstrating extreme CPU efficiency.
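The fixed-layout idea can be shown with a hand-rolled flyweight over a `ByteBuffer`. The message schema and field names here are hypothetical (real SBE generates such flyweights from an XML schema), but the principle is the same: every offset is a compile-time constant, so encode/decode is plain indexed reads and writes with no parsing step and no object allocation.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical fixed-layout trade message: price(8 bytes) | quantity(4) | side(1). */
public final class TradeFlyweight {
    private static final int PRICE_OFFSET = 0;  // int64
    private static final int QTY_OFFSET   = 8;  // int32
    private static final int SIDE_OFFSET  = 12; // int8
    public static final int ENCODED_LENGTH = 13;

    private ByteBuffer buffer;

    /** Points the flyweight at a buffer; no data is copied. */
    public TradeFlyweight wrap(ByteBuffer buffer) {
        this.buffer = buffer.order(ByteOrder.LITTLE_ENDIAN);
        return this;
    }

    // Setters write straight into the wrapped buffer: no intermediate objects.
    public void price(long v)   { buffer.putLong(PRICE_OFFSET, v); }
    public void quantity(int v) { buffer.putInt(QTY_OFFSET, v); }
    public void side(byte v)    { buffer.put(SIDE_OFFSET, v); }

    // Getters read directly at known offsets: no runtime parsing, no allocation.
    public long price()   { return buffer.getLong(PRICE_OFFSET); }
    public int quantity() { return buffer.getInt(QTY_OFFSET); }
    public byte side()    { return buffer.get(SIDE_OFFSET); }
}
```

Contrast with Protobuf-style formats, which must walk tag/length fields at runtime and materialize message objects; here "decoding" is just a `getLong` at a constant offset.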
Time Conversion and Latency Metrics
To measure performance clearly, first review time units.
| Unit | Relation | Description | Conversion |
|---|---|---|---|
| Nanosecond (ns) | 10⁻⁹ s | One billionth of a second | 1 second (s) = 1,000,000,000 nanoseconds (ns) |
| Microsecond (us) | 10⁻⁶ s | One millionth of a second | 1 microsecond (us) = 1,000 nanoseconds (ns) |
| Millisecond (ms) | 10⁻³ s | One thousandth of a second | 1 millisecond (ms) = 1,000 microseconds (us) |
| Second (s) | 10⁰ s | Base unit | 1 second (s) = 1,000 milliseconds (ms) |
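These conversions map directly onto the JDK's `java.util.concurrent.TimeUnit`, which is also what latency-measuring code typically uses:

```java
import java.util.concurrent.TimeUnit;

public class TimeUnits {
    public static void main(String[] args) {
        System.out.println(TimeUnit.SECONDS.toNanos(1));       // 1000000000
        System.out.println(TimeUnit.MICROSECONDS.toNanos(1));  // 1000
        System.out.println(TimeUnit.MILLISECONDS.toMicros(1)); // 1000
        System.out.println(TimeUnit.SECONDS.toMillis(1));      // 1000
    }
}
```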
Aeron latency metrics
- Intra-process / inter-process communication (IPC):
  - Latency: nanosecond level (~100 ns).
  - Mechanism: lock-free, shared-memory ring buffers; fully avoids kernel-mode switches and lock contention, which is key to ultra-low latency.
- LAN communication (UDP):
  - Latency: microsecond level (~10-30 us).
  - Mechanism: an application-layer optimized UDP protocol with reliability and ordering guarantees, significantly better than the traditional kernel TCP/IP stack.
Comparison: round-trip latency over the traditional kernel TCP/IP stack typically sits in the hundreds-of-microseconds to millisecond range, so Aeron's LAN latency advantage amounts to roughly an order of magnitude.
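Numbers like these are obtained by timestamping each message with `System.nanoTime()` and reporting percentiles rather than averages. A minimal sketch of that measurement pattern follows (a real benchmark would use a library such as HdrHistogram and guard against coordinated omission; the percentile method here is a naive nearest-rank implementation):

```java
import java.util.Arrays;

public class LatencySketch {
    /** Nearest-rank p-th percentile (0 < p <= 100) of nanosecond samples. */
    public static long percentile(long[] samplesNs, double p) {
        long[] sorted = samplesNs.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(idx, 0)];
    }

    public static void main(String[] args) {
        long[] samples = new long[10_000];
        for (int i = 0; i < samples.length; i++) {
            long start = System.nanoTime();
            // ... a send + receive round trip would go here ...
            samples[i] = System.nanoTime() - start;
        }
        // p99 is the figure that exposes head-of-line blocking and GC pauses.
        System.out.println("p50 ns: " + percentile(samples, 50));
        System.out.println("p99 ns: " + percentile(samples, 99));
    }
}
```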
Key Differences Between Aeron UDP and Traditional TCP/IP
The design philosophies of Aeron UDP and traditional kernel TCP are fundamentally different.
- Traditional TCP/IP aims to provide a general-purpose, reliable, fair WAN transport layer, but sacrifices latency stability and predictability.
- Aeron is focused on low latency, high throughput, and predictable latency message-stream transport. It moves complex transport logic to the application layer, and bypasses kernel TCP/IP bottlenecks through shared memory (IPC) and optimized UDP (application-layer reliability).
- With Aeron, developers gain fine-grained control over latency and transport behavior, enabling extremely low and highly predictable communication latency.
| Feature | Aeron (UDP/Application Layer) | Traditional TCP/IP (Kernel/Transport Layer) |
|---|---|---|
| Reliability mechanism | NACK mechanism: retransmit only when receiver detects missing data. High efficiency, low traffic. | ACK mechanism: sender waits for acknowledgments, increasing latency and network traffic. |
| Ordering guarantee | Handled in application layer: out-of-order receive, ordered reassembly at receiver log. | Guaranteed in kernel layer: all packets must be delivered in order. |
| HoL blocking | Eliminated (No Head-of-Line Blocking). Even with packet loss, subsequent data can still be received and buffered, without breaking pipeline processing. | Exists (TCP). Packet loss stalls subsequent packets in kernel buffer until retransmission completes, significantly increasing P99 latency. |
| Backpressure/flow control | Application-layer flow control (e.g., Term Offset/Position), fast, fine-grained, responsive. | Kernel sliding-window/flow control limited by OS scheduling, relatively slower response. |
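The NACK and head-of-line-blocking rows can be sketched together: a receiver tracks the next expected sequence number, buffers out-of-order arrivals instead of stalling on them, and requests retransmission only for the missing range. This is an illustration of the principle, not Aeron's actual wire protocol:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class NackReceiverSketch {
    private long nextExpected = 0;
    private final TreeMap<Long, String> outOfOrder = new TreeMap<>();
    public final List<Long> nackedSeqs = new ArrayList<>();   // retransmit requests sent
    public final List<String> delivered = new ArrayList<>();  // in-order output

    public void onPacket(long seq, String payload) {
        if (seq > nextExpected) {
            // Gap detected: NACK only the missing range, buffer this packet.
            // Later data is neither discarded nor blocked -> no HoL blocking.
            for (long missing = nextExpected; missing < seq; missing++) {
                if (!nackedSeqs.contains(missing)) nackedSeqs.add(missing);
            }
            outOfOrder.put(seq, payload);
            return;
        }
        if (seq < nextExpected) return;  // duplicate retransmission, ignore
        deliver(payload);
        // Drain any buffered packets that are now contiguous.
        while (outOfOrder.containsKey(nextExpected)) {
            deliver(outOfOrder.remove(nextExpected));
        }
    }

    private void deliver(String payload) {
        delivered.add(payload);
        nextExpected++;
    }
}
```

Under TCP, the arrival of sequence 2 before sequence 1 would stall delivery in the kernel buffer until the retransmission of 1 completes; here sequence 2 is already buffered at the application layer, so delivery resumes the instant the gap is filled.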
Summary
Two key points in network programming performance are data transfer and encoding/decoding. With Aeron + SBE, both bottlenecks are addressed:
- Aeron handles efficient message transport: ultra-low transfer latency, zero-copy in IPC, and network transport superior to traditional TCP/UDP.
- SBE handles efficient encoding/decoding: ultra-low CPU usage, with a zero-copy, zero-parse encode/decode path.
Final benefits
- SBE = extreme CPU efficiency, pushing message processing speed to hardware limits.
- Aeron + SBE = end-to-end performance guarantee.
- Zero GC / Zero Copy: SBE operates directly on buffers, Aeron transmits in shared memory, jointly achieving the high-performance goal of zero memory copy and zero garbage collection.