Qishi C++ Low-Latency Systems Study Group

  • Completed an 8-week C++ systems curriculum through the Qishi C++ Study Group
  • Curriculum adopted from Building Low-Latency Applications with C++
  • Week 8 sharing topic: Instrumentation & Optimization in Low-Latency C++ Applications
  • Prepared and presented materials on RDTSC, hop-by-hop timestamping, latency distributions, and hot-path tuning
  • Wrote a standalone rdtsc_latency_demo.cpp to demonstrate measurement, p95/p99 analysis, and spiky vs stable latency paths
  • Curriculum GitHub

Table of contents


Curriculum Outline

The study group was structured as an 8-week path from C++ performance foundations to a full electronic-trading-style system.

Week Topic Focus
1 C++ low-latency systems overview Performance goals, why C++ matters for latency-sensitive systems, compiler flags, and common designs across streaming, gaming, IoT, electronic trading, AI infrastructure, and crypto
2 Core building blocks Threads, concurrency models, memory pools, lock-free queues, low-latency logging, TCP/UDP utilities
3 Trading ecosystem design Trading system requirements, exchange/client/strategy architecture, and system constraints
4 Matching engine Limit order book design and order matching implementation
5 Exchange connectivity Market data publisher, order server, and assembling the exchange binary
6 Client connectivity Market data consumer, client-side order book, and order gateway
7 Strategy framework and trading strategies Position/PnL tracking, signal generation, order management, risk controls, market making, liquidity taking, and end-to-end client integration
8 Instrumentation and optimization High-granularity timestamping, performance analysis, visualization, tuning, benchmark gains, and extension ideas

The weekly format mixed theory, code demos, GitHub code analysis, STL internals, and participant project discussion. Code was the real homework: every concept had to connect back to something measurable or implementable.


Week 8 Sharing Materials

My Week 8 sharing focused on turning “low latency” from a vague goal into a measurable engineering loop:

instrument -> collect logs -> compute metrics -> visualize -> diagnose -> optimize -> re-measure

The materials had two parts:

  • slides: instrumentation and optimization concepts from Chapters 11 and 12 of Building Low-Latency Applications with C++
  • demo code: a standalone RDTSC latency measurement program simulating matching-engine and sequencing paths

The most important point was that optimization without measurement is just guessing. In a trading-style system, the question is rarely “is it fast once?” It is usually “is it predictably fast at p99 under burst load?”


Measurement Philosophy

1. “You cannot optimize what you cannot measure”

Low-latency work begins with a baseline. Without instrumentation, it is easy to optimize the wrong function, improve average latency while hurting tails, or mistake I/O noise for CPU cost.

Practical rule: before changing a hot path, record at least one baseline distribution.


2. Observer effect

Measurement itself has cost. If the instrumentation is too heavy, the measurement changes the system being measured.

This is why the slides separate:

  • low-overhead cycle measurements for individual functions
  • absolute timestamps for end-to-end flow analysis

Practical rule: use the lightest measurement that answers the question.


3. Average latency is not enough

Average latency can hide bursts. A function can have a reasonable mean but still produce dangerous p99 spikes.

For latency-sensitive systems, the page-worthy metrics are:

  • min
  • max
  • mean
  • median
  • p50
  • p95
  • p99
  • distribution shape

Practical rule: if p99 is bad, the system is not predictably low-latency even if the mean looks good.


RDTSC Microbenchmarking

1. What RDTSC measures

rdtsc reads the CPU Time Stamp Counter, a 64-bit register that counts CPU cycles.

It is useful for:

  • hot function profiling
  • microbenchmarking small code regions
  • comparing before/after changes
  • detecting tight vs spiky latency distributions

It is not the same as wall-clock time. To convert cycles into time, you need an estimate of CPU frequency.


2. Why use RDTSC instead of std::chrono everywhere

The slides compare RDTSC with higher-level clocks:

  • RDTSC overhead is small enough for hot-path measurement
  • std::chrono::high_resolution_clock is easier to use but usually heavier
  • std::chrono is still useful for wall-clock timestamps and coarse timing

Practical rule: RDTSC is for very fine-grained local timing; chrono is better for readable timestamps and broader elapsed-time logic.


3. x86-specific limitation

The demo code uses inline assembly:

__asm__ __volatile__("lfence\n\t"
                     "rdtsc\n\t"
                     : "=a"(lo), "=d"(hi)
                     :
                     : "memory");

This is x86/x86_64 specific. It is appropriate for explaining low-latency measurement on common trading-system hardware, but it is not portable C++.

Practical rule: write architecture-specific timing utilities behind a small abstraction.


4. lfence before rdtsc

The demo uses lfence before rdtsc.

Why it matters:

  • modern CPUs execute instructions out of order
  • without serialization, measured regions can be polluted by reordering
  • lfence helps ensure earlier instructions retire before the timestamp is read

Practical rule: measuring a small code block needs ordering control, not just a clock call.


5. __volatile__ in inline assembly

The inline assembly is marked __volatile__.

Why it matters:

  • tells the compiler the assembly has observable behavior
  • reduces the chance the compiler removes or reorders the timing instruction
  • helps preserve measurement intent

Practical rule: compiler optimization is your friend in release builds, but it must not optimize away the measurement itself.


6. Combining high and low registers

rdtsc returns the counter through two 32-bit pieces:

return (static_cast<uint64_t>(hi) << 32) | lo;

This combines the high and low halves into one 64-bit cycle count.

Practical rule: tiny bit-level details matter in performance tooling because the measurement path must be correct before any result is meaningful.


7. noexcept on measurement utilities

The demo marks timing helpers as noexcept.

Why it matters:

  • timing helpers should not throw
  • exception metadata and pessimistic assumptions are undesirable in hot utilities
  • it documents that the function is safe to call in latency-sensitive code

Practical rule: hot-path helper functions should express their guarantees in the type system where possible.


TTT Hop Timestamping

1. What TTT means in the slides

TTT-style measurement records absolute nanosecond timestamps at system boundaries.

RDTSC answers:

How many cycles did this function take?

TTT answers:

How long did the message take to move from this component to the next?

Both are needed because a fast function can still sit behind a slow queue, network hop, or scheduler delay.


2. Exchange-side hops

The exchange topology tracks message flow from ingress to egress:

Hop Meaning
T1 TCP read at order server ingress
T2 lock-free queue write toward matching engine
T3 lock-free queue read by matching engine
T4 market data queue write
T5 market data queue read by publisher
T6 UDP/multicast market data send
T4t response queue write
T5t response queue read
T6t TCP response write

Practical rule: end-to-end latency is a chain. Measuring only the matching function misses the queue and network boundaries.


3. Client-side hops

The client topology tracks market data, strategy, order gateway, and response paths:

Hop Meaning
T7 UDP market data read
T8 queue write toward trade engine
T9 trade engine queue read
T10 order request queue write
T11 order gateway queue read
T12 TCP order send
T7t TCP response read
T8t response queue write
T9t response queue read

Practical rule: client latency is not just strategy logic; market data ingestion, queueing, and order gateway egress all count.


4. Lock-free queues still need measurement

The architecture uses LFQueues between components. “Lock-free” does not mean “free.”

Queue cost can come from:

  • cache-line transfer between cores
  • producer/consumer imbalance
  • bursts that create backlog
  • false sharing
  • memory-ordering costs

Practical rule: measure queue write/read hops directly instead of assuming the queue is negligible.


Timing Utilities

1. getCurrentNanos()

The demo uses:

std::chrono::steady_clock::now()

This is a good choice for monotonic elapsed-time measurement because steady_clock does not move backward due to wall-clock adjustments.

Practical rule: use steady_clock for elapsed time; use wall-clock time only when human-readable timestamps matter.


2. getCurrentTimeStr()

The demo formats wall-clock time as:

HH:MM:SS.nnnnnnnnn

This is useful for logs because humans need readable timestamps when correlating events.

Practical rule: keep two time concepts separate:

  • monotonic time for measurement
  • wall-clock time for log readability

3. localtime_r portability

The demo uses localtime_r, which is POSIX-style and works on Linux/macOS. On Windows, equivalent code usually uses localtime_s.

Practical rule: demo code can target Linux-style low-latency environments, but production utility code should isolate platform-specific time formatting.


Measurement Macros

1. START_MEASURE

The demo macro is:

#define START_MEASURE(TAG) const uint64_t TAG = Common::rdtsc()

It captures a start cycle count into a variable named by the tag.

Why this is useful: the tag becomes both the measurement variable and the identity of the measured region.


2. END_MEASURE

The demo macro is:

#define END_MEASURE(TAG, LOGGER)                                               \
  do {                                                                         \
    const uint64_t end_##TAG = Common::rdtsc();                                \
    (LOGGER).logRDTSC(#TAG, end_##TAG - TAG);                                  \
  } while (false)

Key C/C++ macro tricks:

  • #TAG stringifies the tag, so it can be logged as text
  • end_##TAG uses token pasting to create a unique variable name
  • do { ... } while (false) makes the macro behave like one statement

Practical rule: if you use macros for instrumentation, make them syntactically safe and easy to remove or disable.


3. TTT_MEASURE

The demo also defines:

#define TTT_MEASURE(TAG, LOGGER)                                               \
  do {                                                                         \
    const uint64_t ts_##TAG = Common::getCurrentNanos();                       \
    (LOGGER).logTTT(#TAG, ts_##TAG);                                           \
  } while (false)

It records an absolute timestamp rather than a cycle delta.

Practical rule: use delta measurements for local function cost; use absolute timestamps to reconstruct a message path.


4. Macro scope and early returns

In MEOrderBook::removeOrder, the code calls END_MEASURE before each early return:

if (it == id_map_.end()) {
  END_MEASURE(Exchange_MEOrderBook_removeOrder, g_logger);
  return false;
}

This matters because failure paths are still paths. In a real exchange, cancels for missing or invalid orders may be frequent enough to measure.

Practical rule: every return path inside an instrumented function should close the measurement.


Where to Instrument

1. Wrap the critical section

The slides show START_MEASURE at the beginning of the core operation and END_MEASURE after the core work.

Do not wrap too much:

  • includes unrelated logging
  • includes setup/teardown noise
  • hides the true function cost

Do not wrap too little:

  • misses important data structure operations
  • undercounts validation or early-return behavior
  • produces misleading numbers

Practical rule: the measurement boundary should match the question you are asking.


2. Tag naming convention

The slides use names like:

Exchange_MEOrderBook_removeOrder
Exchange_FIFOSequencer_seqPub

The pattern is:

Component_Class_Method

Why it matters: logs become much easier to parse and group when tag names are consistent.


3. Exchange-side functions worth measuring

The slides list these hot functions:

  • processClientRequest
  • addOrder
  • removeOrder
  • match
  • checkForMatch
  • sequenceAndPublish
  • send

These map naturally to matching engine latency, order server latency, and market data publication latency.


4. Client-side functions worth measuring

The slides list these client-side functions:

  • PositionKeeper::addFill
  • PositionKeeper::updateBBO
  • FeatureEngine::onOrderBookUpdate
  • RiskManager::checkPreTradeRisk
  • OrderManager::moveOrder

These represent a typical trading-client path: update state, compute signal, check risk, and send orders.


Performance Analysis Workflow

1. Parse logs

The raw logs include:

timestamp RDTSC tag cycles
timestamp TTT tag timestamp_ns

The first analysis step is extracting:

  • timestamp
  • measurement type
  • tag
  • value

Practical rule: log format should be boring and parseable.


2. Convert cycles to time

RDTSC gives cycles, not seconds.

The demo estimates CPU frequency:

const uint64_t elapsed_cycles = end_tsc - start_tsc;
const double cpu_freq_ghz =
    static_cast<double>(elapsed_cycles) / static_cast<double>(elapsed_ns);

Then it converts:

cycles_to_ns = 1.0 / cpu_freq_ghz;
cycles_to_us = cycles_to_ns / 1000.0;

Practical rule: always state whether a number is cycles, nanoseconds, or microseconds.


3. Compute distribution metrics

The demo sorts a copy of the latency vector and computes:

  • min
  • max
  • mean
  • median
  • p50
  • p95
  • p99
  • sample count

Sorting a copy preserves the original raw sequence while making percentile lookup simple.

Practical rule: keep raw samples; derived stats are summaries, not replacements.


4. Visualize

The slides recommend:

  • scatter plots for time-vs-latency
  • histograms for distribution shape
  • percentile bands for tail behavior

The demo prints a console histogram with 20 bins. It is not as polished as a Jupyter plot, but it demonstrates the same idea: shape matters.

Practical rule: a latency distribution is often more informative than a single “average latency” number.


5. Diagnose

The slides classify patterns:

  • tight distribution: stable code path
  • spiky distribution: contention, batching, burst behavior, or slow-path activity

Practical rule: a spike is a clue. Ask what event, branch, allocation, flush, queue backlog, or OS action caused it.


Latency Distribution Reading

1. Stable path: MEOrderBook::removeOrder

The slide example treats order cancellation as a relatively tight path:

min:  about 0.4 us
mean: about 1.2 us
max:  about 3.5 us

Interpretation:

  • the operation is narrow and predictable
  • it mostly touches known data structures
  • tail behavior still depends on cache and tree state

Practical rule: tight distributions are good, but still validate the data structure and cache behavior under larger books.


2. Spiky path: FIFOSequencer::sequenceAndPublish

The slide example shows a much wider distribution:

average: about 90 us
spikes:  500-1200 us
pattern: burst-driven

Interpretation:

  • sequencing cost depends on how many pending requests arrive
  • publishing/batching creates periodic heavy work
  • queue flush behavior can dominate tail latency

Practical rule: if the path processes variable-size batches, measure batch size alongside latency.


3. Network and queue latency

The slides call out:

  • one network hop can add roughly 15-20 us
  • consumer queue latency can average around 100 us and spike much higher

These numbers are context-dependent, but the lesson is stable:

function latency != system latency

Practical rule: combine function-level RDTSC with hop-level TTT to see both local and system-level cost.


Optimization Case Studies

1. Logger: avoid character-by-character hot-path work

The original logger path pushed one log element per character:

for (auto c : str) {
  pushValue(c);
}

The optimized path copies the string into a fixed-size buffer and pushes once.

Slide result:

before: 25,757 cycles/op
after:     466 cycles/op
speedup: about 55x

Why it matters:

  • logging is often thought of as “outside the core logic”
  • but if log construction happens on the hot path, it can dominate latency
  • one push per character multiplies queue traffic and memory movement

Practical rule: hot-path logging should batch, defer, or use fixed-size preallocated buffers.


2. Fixed-size string buffer in a union

The slide optimization stores a string payload inside a fixed-size union member:

struct LogElement {
  LogType type_;
  union { char s[64]; /* ... */ } u_;
};

Why this helps:

  • avoids dynamic allocation
  • keeps log element size predictable
  • turns variable-length string handling into bounded copy work

Trade-off:

  • long strings must be truncated or handled separately

Practical rule: bounded buffers are common in low-latency systems because predictability often beats flexibility.


3. Memory pool: remove debug checks from release hot path

The slide memory-pool example shows ASSERT checks inside allocate and deallocate.

Slide result:

before: 343 cycles/op
after:   44 cycles/op
speedup: about 8x

The trick is not to remove safety forever. The trick is:

#ifndef NDEBUG
#define ASSERT(cond, msg) assert((cond) && (msg))
#else
#define ASSERT(cond, msg) ((void)0)
#endif

Debug builds keep checks. Release builds remove them.

Practical rule: debug safety and release latency can coexist if checks compile out cleanly.


4. Thread affinity and CPU isolation

Thread affinity pins critical threads to dedicated cores.

The slide maps components like:

  • market data publisher
  • matching engine
  • order server
  • market data consumer
  • trade engine
  • order gateway

The hottest components are the matching engine and trade engine.

Linux kernel parameters mentioned:

isolcpus=0-5
nohz_full=0-5
rcu_nocbs=0-5

Why this helps:

  • fewer context switches
  • less scheduler interference
  • less cache pollution
  • tighter p99 latency

Practical rule: OS-level tuning often improves tail stability more than mean latency.


5. Eliminating std::function

The slide replaces runtime callback dispatch with compile-time CRTP.

Problem with std::function in hot paths:

  • type erasure
  • possible heap allocation for captures
  • indirect call
  • harder branch prediction
  • less inlining

CRTP shape:

template <typename Handler>
class OrderServer {
  void onRecv(ClientRequest& req) {
    static_cast<Handler*>(this)->onClientRequest(req);
  }
};

Why this helps:

  • callback target resolved at compile time
  • direct call can be inlined
  • no type-erasure object in the hot path

Practical rule: std::function is elegant, but in a hot callback loop it must justify its cost.


6. Continuous profiling

The final slide emphasizes that performance is iterative:

measure -> optimize -> re-measure

A one-time benchmark can go stale as code changes, data volume changes, or workloads become burstier.

Practical rule: performance work is a feedback loop, not a one-off cleanup pass.


Data Structure Tuning

1. Big-O is not the whole story

std::unordered_map is average O(1), but that does not guarantee it is fastest in practice.

CPU cost also depends on:

  • cache locality
  • pointer chasing
  • allocator behavior
  • load factor
  • rehashing
  • key type
  • collision behavior

Practical rule: for hot paths, “O(1)” is the beginning of the discussion, not the end.


2. Fixed array as a fast map

If the key range is known and small, std::array can act like a simple map:

key -> array index

Benefits:

  • contiguous memory
  • no dynamic allocation
  • predictable access
  • cache-friendly layout

Cost:

  • wastes memory for sparse key ranges
  • fixed maximum size
  • not suitable for arbitrary keys

Practical rule: fixed-range ids are an opportunity for array-backed lookup.


3. std::unordered_map trade-off

The slide benchmark:

array hash map:     142,650 cycles/op
unordered_map:      152,457 cycles/op
overhead:           about 6-7%

Why unordered_map can be slower:

  • buckets may point to nodes elsewhere in memory
  • node allocation hurts locality
  • collisions add traversal
  • rehash can be expensive if not controlled

Why it is still useful:

  • dynamic growth
  • standard library availability
  • supports large or unknown key ranges
  • works for strings and custom key types

Practical rule: choose the data structure for the workload, not just for the API convenience.


4. Alternatives to explore

The slides mention several alternatives:

  • absl::flat_hash_map
  • Robin Hood hashing
  • hopscotch hashing
  • folly::AtomicHashMap
  • emhash7::HashMap

The common theme is improved memory layout or concurrency behavior compared with a node-heavy general-purpose map.

Practical rule: if a hash map sits in the hottest path, benchmark more than one implementation.


RDTSC Demo Code Notes

1. Compile with optimization enabled

The demo compile command is:

g++ -O2 -std=c++17 -o rdtsc_demo rdtsc_latency_demo.cpp

-O2 matters because measuring unoptimized code answers a different question. Debug builds are useful for correctness, but performance conclusions should come from optimized builds.

Practical rule: benchmark the kind of binary you actually care about.


2. Header choices reveal the demo scope

The demo includes:

  • <chrono> for timestamps and frequency estimation
  • <cstdint> for fixed-width integer types
  • <map>, <queue>, <tuple>, and <vector> for simulated exchange components
  • <numeric> and <algorithm> for stats
  • <iomanip> for readable output

This makes the demo self-contained and easy to compile without external dependencies.

Practical rule: teaching demos should minimize setup friction so the concept is the focus.


3. enum class Side

The demo uses:

enum class Side { BUY, SELL };

Why it is good:

  • scoped enum avoids leaking BUY and SELL into the global namespace
  • stronger type safety than a raw char or unscoped enum

Trade-off:

  • real protocols may encode side as a byte for compactness

Practical rule: use type-safe representation in internal code, then encode/decode at boundaries.


4. Demo Order uses double price

The demo Order stores price as double.

For production matching engines, I would still use integer ticks. Here the demo is about instrumentation, not price correctness.

Practical rule: a demo can simplify unrelated concerns, but the simplification should be explicit.


5. MEOrderBook duplicates lookup state

The demo stores orders in:

  • bids_
  • asks_
  • id_map_

This makes cancellation lookup simple:

auto it = id_map_.find(order_id);

Then it erases from the side-specific map and from the id map.

Practical rule: matching engines often trade extra memory for faster cancel/modify paths.


6. Measure both success and failure paths

removeOrder logs measurement before returning false for:

  • unknown order id
  • wrong client id
  • wrong side

This is important because rejected cancels still consume engine time.

Practical rule: failure paths can be part of the hot path.


7. insert_or_assign

The demo uses:

bids_.insert_or_assign(order.order_id, order);
id_map_.insert_or_assign(order.order_id, order);

This keeps the code compact and makes repeated ids overwrite prior state in the demo.

In production, duplicate order ids would likely be rejected or handled explicitly.

Practical rule: demo convenience should not be confused with exchange semantics.


8. FIFOSequencer creates a bursty path

The demo’s sequencer does two stages:

pending queue -> sequenced queue -> published vector

Every tenth iteration adds 50 requests instead of 5. This intentionally creates latency spikes.

Practical rule: burst simulation is useful because real systems often fail at the tail, not at the average.


9. Periodic flush simulation

The sequencer clears published_ when it grows past 100:

if (published_.size() > 100) {
  published_.clear();
}

This simulates periodic batching or network flush behavior.

Practical rule: periodic cleanup work often explains latency spikes.


10. Turn off verbose logging during measurement

The demo prints the first few samples, then disables verbose logging:

if (i == 5) {
  g_logger.setVerbose(false);
}

Why it matters:

  • console I/O is extremely slow compared with the code being measured
  • printing every iteration would dominate the measurement
  • first few samples are useful for proving instrumentation works

Practical rule: never benchmark a hot path with uncontrolled printing inside the loop.


11. Pre-populate the order book

The demo inserts 1000 orders before measuring cancels.

Why it matters:

  • measuring an empty book is not representative
  • data structure state affects lookup and erase cost
  • prepopulation creates a more realistic baseline

Practical rule: benchmark stateful systems in a realistic state.


12. CPU frequency estimation

The demo estimates CPU frequency by reading RDTSC, sleeping for 100ms, reading RDTSC again, and dividing cycles by elapsed nanoseconds.

This is useful for a demo, but production measurement needs care because:

  • CPU frequency may vary
  • turbo boost can affect assumptions
  • TSC behavior depends on hardware
  • cross-core TSC consistency matters

Practical rule: cycles are great for comparing code paths; converting to time requires hardware awareness.


13. Percentile index calculation

The demo computes percentile indexes with:

size_t idx = static_cast<size_t>(p * static_cast<double>(n - 1));

Then clamps to n - 1.

This is simple and fine for a demo. More rigorous percentile definitions exist, but the lesson is to look beyond average latency.

Practical rule: do not let perfect statistics block useful tail-latency visibility.


14. Histogram binning

The demo prints 20 bins and scales each bar to a width of 50 characters.

Why this helps:

  • you can see whether samples cluster tightly
  • you can spot long tails
  • you can compare distributions without a plotting library

Practical rule: a simple text histogram is better than no distribution view.


15. Compare components side by side

The demo ends with a summary table:

Component                    Mean cycles    P99 cycles    Pattern
MEOrderBook::removeOrder     ...            ...           Tight
FIFOSequencer::seqPub        ...            ...           Spiky

This is a good presentation pattern because the audience can connect implementation differences to measured behavior.

Practical rule: performance reporting should name both the metric and the interpretation.


16. What the demo intentionally does not solve

The demo is not a production exchange engine. It intentionally skips:

  • price-time priority matching
  • integer tick correctness
  • protocol parsing
  • memory pool allocation
  • real lock-free queues
  • network I/O
  • thread pinning

That is fine because the purpose is to teach instrumentation mechanics.

Practical rule: a focused demo should isolate one concept instead of pretending to be the whole system.


Connection to MatchEngine

This Week 8 material directly shaped how I think about my MatchEngine project:

  • add instrumentation before optimizing data structures
  • keep printing outside benchmark loops
  • report p95/p99, not only average runtime
  • separate correctness decisions from performance demos
  • keep order cancel paths measurable
  • treat allocations, logging, and callbacks as possible hot-path costs
  • consider memory pools for orders/events
  • consider lock-free queue handoff if the engine becomes multi-threaded
  • use integer ticks for price correctness even when demos simplify price representation
  • compare alternative containers empirically instead of assuming unordered_map is always fastest

The next natural step for my own engine is to add a small instrumentation layer, run repeatable benchmarks for add/cancel/match paths, and report latency distributions the same way this Week 8 demo does.