certurk23 Quant Lab — Technical Documentation
HFT Infrastructure Architecture
A detailed breakdown of the hardware, software, and network stack powering sub-microsecond order flow analysis at Equinix NY4/NY5 co-location facilities.

Co-location & Data Center Topology

All production systems operate within Equinix NY4 (Secaucus, NJ) and NY5 (Parsippany, NJ), the primary co-location facilities providing low-latency connectivity to the NASDAQ (Carteret) and NYSE (Mahwah) matching engines respectively. Physical proximity to the exchange matching engines is the foundational latency advantage.

[Topology diagram: NASDAQ Carteret ↔ NY4 / Equinix (Secaucus, NJ) · NYSE Mahwah ↔ NY5 / Equinix (Parsippany, NJ)]

Cross-venue wire latency between Carteret and Mahwah is approximately 120 µs. This inter-venue delta is the primary exploitable window in Cross-Venue Latency Arbitrage strategies. Our infrastructure maintains dedicated cross-connects within Equinix to minimize intra-facility hops.

| Parameter | Description | Value |
|---|---|---|
| Primary Venue | NASDAQ Matching Engine Distance | < 1 rack unit |
| Secondary Venue | NYSE Arca Cross-Connect | NY4 → NY5 |
| Cross-venue RTT | Carteret ↔ Mahwah Wire | ~120 µs |
| Internal Hop | Intra-facility cross-connect | < 300 ns |
| Power Redundancy | UPS + Generator | 2N |

NIC & Kernel Bypass (OpenOnload)

Standard Linux kernel network stacks introduce 5–50 µs of latency due to context switches, interrupt handling, and system call overhead. We bypass this entirely using Solarflare XtremeScale SFN8522 NICs running the OpenOnload user-space network stack.

Kernel Bypass Architecture

OpenOnload intercepts socket calls at the application layer via LD_PRELOAD injection, redirecting traffic directly to NIC hardware queues via RDMA-style zero-copy DMA. The kernel is bypassed entirely for data-path operations — interrupts are replaced by polling loops pinned to isolated CPU cores.
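In deployment, the LD_PRELOAD interception is typically driven through the onload launcher. The invocation below is an illustrative config fragment, not our production launch script: --profile=latency and EF_POLL_USEC are standard OpenOnload knobs, while the binary name is a placeholder.

```shell
# Launch a feed handler under OpenOnload's user-space stack.
# --profile=latency selects busy-wait-oriented defaults;
# EF_POLL_USEC keeps the stack spin-polling instead of sleeping.
EF_POLL_USEC=100000 onload --profile=latency ./feed_handler
```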

nic_init.cpp C++17
// Solarflare OpenOnload / ef_vi — zero-copy RX path initialization
#include <etherfabric/base.h>
#include <etherfabric/pd.h>
#include <etherfabric/memreg.h>

int init_onload_stack(OnloadStack* stack) {
    // Open a driver handle for the ef_vi layer
    int rc = ef_driver_open(&stack->dh);
    if (rc < 0) return rc;

    // Allocate a protection domain on the target interface
    rc = ef_pd_alloc(&stack->pd, stack->dh,
                     stack->ifindex, EF_PD_DEFAULT);
    if (rc < 0) return rc;

    // Register hugepage-backed RX/TX ring buffers for DMA
    rc = ef_memreg_alloc(&stack->mr, stack->dh,
                         &stack->pd, stack->dh,
                         stack->buf, BUF_SIZE);
    if (rc < 0) return rc;

    // Pin polling thread to isolated core (no OS scheduling)
    pin_thread_to_core(POLL_CORE_ID);
    return 0;
}
| Component | Specification | Latency Impact |
|---|---|---|
| NIC Model | Solarflare XtremeScale SFN8522 | Baseline |
| Network Stack | OpenOnload 7.x (user-space) | −5 to −50 µs |
| DMA Mode | Zero-copy RDMA | −800 ns |
| Interrupt Model | Busy-poll (no IRQ) | −2 µs |
| CPU Affinity | Isolated core pinning | −1.2 µs |
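The core-pinning helper referenced in nic_init.cpp (pin_thread_to_core) is not shown in the source; a minimal Linux sketch using pthread_setaffinity_np might look like this (the function name matches the call above, the body is an assumption):

```cpp
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to a single CPU core (Linux-specific).
// Combined with isolcpus/nohz_full kernel parameters, this keeps the
// busy-poll loop free of OS scheduling interference.
bool pin_thread_to_core(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    // Returns 0 on success; fails if core_id is outside the
    // process's allowed affinity mask.
    return pthread_setaffinity_np(pthread_self(),
                                  sizeof(set), &set) == 0;
}
```

Note that pinning alone does not isolate the core; the core must also be removed from the general scheduler via boot parameters for true busy-poll isolation.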

Memory Architecture — Hugepages & TLB

Standard 4KB memory pages cause frequent Translation Lookaside Buffer (TLB) misses during high-throughput tick processing. Each TLB miss triggers a costly page-table walk (~100 ns penalty). We eliminate this by allocating all critical data structures on 2MB Hugepages via MAP_HUGETLB | MAP_LOCKED.

TLB Entry Reduction Factor
$$\frac{N_{\text{pages}}^{\text{2MB}}}{N_{\text{pages}}^{\text{4KB}}} = \frac{4\text{ KB}}{2\text{ MB}} = \frac{4096}{2 \cdot 1024 \cdot 1024} = \frac{1}{512}$$

By switching from 4KB to 2MB pages, the number of TLB entries required for a given working set shrinks by a factor of 512×, effectively eliminating TLB pressure for tick data buffers up to 4GB.
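The arithmetic behind the 512× factor and the 4GB figure can be checked directly at compile time:

```cpp
#include <cstdint>

// Pages required to map a 4 GB working set at each page size.
constexpr std::uint64_t kWorkingSet = 4ULL << 30;               // 4 GB
constexpr std::uint64_t kPages4K = kWorkingSet / (4ULL << 10);  // 1,048,576
constexpr std::uint64_t kPages2M = kWorkingSet / (2ULL << 20);  // 2,048

static_assert(kPages4K / kPages2M == 512, "entry count shrinks 512x");
static_assert(kPages2M == 2048, "4 GB maps in 2048 two-megabyte TLB entries");
```

A 4GB buffer thus needs only 2,048 2MB entries, which is within the second-level TLB capacity of modern server CPUs, versus over a million 4KB entries.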

memory_alloc.cpp C++17
// Hugepage-backed ring buffer for tick data
#include <sys/mman.h>

// size must be a multiple of the 2MB hugepage size
void* alloc_hugepage_ring(size_t size) {
    void* mem = mmap(nullptr, size,
                     PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS |
                     MAP_HUGETLB | MAP_LOCKED,   // 2MB pages, no swap
                     -1, 0);

    if (mem == MAP_FAILED)
        return fallback_standard_alloc(size);    // hugepage pool exhausted

    // MAP_LOCKED already faults in and pins the pages; the explicit
    // mlock() is a belt-and-braces guard against partial locking
    mlock(mem, size);
    return mem;
}

NUMA Topology

All NIC queues, tick buffers, and order-state data structures are allocated on NUMA node 0 — the same node as the Solarflare NIC PCIe slot. Cross-NUMA memory access introduces ~80 ns additional latency per cache miss, which is unacceptable at the nanosecond scale.
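The NUMA node of a PCIe device is exposed by Linux in sysfs (e.g. /sys/class/net/eth0/device/numa_node; the interface name here is illustrative). A small helper to read it, as a sketch of how allocation placement can be validated at startup:

```cpp
#include <fstream>
#include <string>

// Read a sysfs numa_node file; returns the node id, or -1 if the
// file is missing or unparsable (also what the kernel reports for
// non-NUMA systems).
int read_numa_node(const std::string& sysfs_path) {
    std::ifstream f(sysfs_path);
    int node = -1;
    if (!(f >> node))
        return -1;
    return node;
}
```

The returned node can then be passed to a node-bound allocator (for example libnuma's numa_alloc_onnode, assuming libnuma is available) so tick buffers land on the same node as the NIC.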

End-to-End Latency Budget

The total wire-to-order latency target is < 800 ns. This budget is decomposed across each pipeline stage, with hard ceilings enforced via hardware timestamps on every ingress packet.

Wire Ingress: ~0 ns (NIC RX timestamp)
DMA Transfer: ~120 ns (PCIe Gen4 DMA)
ITCH Parse: ~180 ns (zero-alloc parser)
VPIN Update: ~240 ns (cache-local kernel)
Signal / Order: < 800 ns (FIX TX egress)
Total Latency Budget Decomposition
$$L_{\text{total}} = L_{\text{DMA}} + L_{\text{parse}} + L_{\text{state}} + L_{\text{VPIN}} + L_{\text{signal}} + L_{\text{TX}} < 800\text{ ns}$$
| Stage | Component | Budget |
|---|---|---|
| NIC → CPU | PCIe Gen4 DMA, zero-copy | 120 ns |
| Deserialize | NASDAQ ITCH 5.0 parser | 60 ns |
| State Update | Order book L2 update | 80 ns |
| VPIN Compute | Inline kernel, L1-cached | 160 ns |
| Decision | Signal evaluation | 180 ns |
| TX Egress | FIX/OUCH order submission | 120 ns |
| Total P50 | | 720 ns |
| Total P99 | | 740 ns |
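The stage budgets can be cross-checked against the P50 total and the 800 ns ceiling at compile time:

```cpp
// Per-stage budgets from the table above, in nanoseconds.
constexpr int kDMA = 120, kParse = 60, kState = 80,
              kVPIN = 160, kSignal = 180, kTx = 120;
constexpr int kTotal = kDMA + kParse + kState + kVPIN + kSignal + kTx;

static_assert(kTotal == 720, "stages sum to the Total P50 figure");
static_assert(kTotal < 800,  "P50 sits inside the wire-to-order budget");
```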

Security & Telemetry Encryption

All telemetry, metrics, and strategy parameters transmitted between co-location and monitoring endpoints are encrypted using AES-256-GCM with a rotating ECDH (Curve25519) key exchange. GCM mode provides authenticated encryption — both confidentiality and integrity are guaranteed for every telemetry frame.

telemetry_crypto.cpp C++17 / OpenSSL
// AES-256-GCM authenticated encryption for telemetry frames (OpenSSL EVP)
#include <openssl/evp.h>

bool encrypt_telemetry_frame(
    const uint8_t* plaintext, size_t pt_len,
    uint8_t* ciphertext, uint8_t* tag)
{
    EVP_CIPHER_CTX* ctx = EVP_CIPHER_CTX_new();
    int len = 0, ct_len = 0;

    // session_key (32 bytes, derived via ECDH Curve25519), nonce and
    // aad are session-scoped state maintained by the telemetry layer
    bool ok =
        EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), nullptr, nullptr, nullptr) == 1
     && EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_SET_IVLEN, NONCE_LEN, nullptr) == 1
     && EVP_EncryptInit_ex(ctx, nullptr, nullptr, session_key, nonce) == 1
     && EVP_EncryptUpdate(ctx, nullptr, &len, aad, aad_len) == 1       // AAD pass
     && EVP_EncryptUpdate(ctx, ciphertext, &ct_len, plaintext, (int)pt_len) == 1
     && EVP_EncryptFinal_ex(ctx, ciphertext + ct_len, &len) == 1
     && EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, TAG_LEN, tag) == 1;

    EVP_CIPHER_CTX_free(ctx);
    return ok;
}

Nonce values are derived from hardware timestamp counters (RDTSC) combined with a per-session counter, ensuring nonce uniqueness across all telemetry streams. Key rotation occurs every 3600 seconds via a dedicated ECDH re-handshake triggered by the monitoring daemon.
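The timestamp-plus-counter nonce layout described above can be sketched as follows. This uses std::chrono as a portable stand-in for the RDTSC read; the monotonically increasing per-session counter alone already guarantees uniqueness within a session, with the timestamp half guarding against counter reuse across restarts:

```cpp
#include <array>
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstring>

// 96-bit GCM nonce = 64-bit timestamp || 32-bit per-session counter.
std::array<uint8_t, 12> next_nonce() {
    static std::atomic<uint32_t> counter{0};
    uint64_t ts = static_cast<uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
    uint32_t seq = counter.fetch_add(1, std::memory_order_relaxed);

    std::array<uint8_t, 12> nonce{};
    std::memcpy(nonce.data(), &ts, 8);       // timestamp half
    std::memcpy(nonce.data() + 8, &seq, 4);  // counter half
    return nonce;
}
```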