This is where I take notes from watching talks from various C++ conferences. I may divide it into different subpages in the future.

CppCon 2017: Carl Cook “When a Microsecond Is an Eternity: High Performance Trading Systems in C++”

Link to the talk

  • Trading systems aim to minimize the latency of the "hotpath" (i.e. from market data ingestion to computation and finally to sending orders to the exchange), even though it is probably executed only 0.01% of the time.
  • Multi-threaded synchronization introduces overhead - most trading systems use just a single thread/process to avoid cross-thread communication latency.
    • To further optimize the single thread/process, hyperthreading can be turned off to ensure the thread/process has the whole physical core to itself.
    • If multi-threading is necessary, use lock-free data structures such as lock-free queues.
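Dedicating a core to the critical thread usually also means pinning the thread to that core. A minimal Linux-specific sketch (the function name is mine, not from the talk; in production this pairs with kernel options like `isolcpus=` to keep other work off the core):

```cpp
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to `core` so the scheduler never migrates it
// (migration would cold-start its caches). Returns true on success.
bool pinToCore(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```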

Slow path removal

Reduce the number of branches on the hotpath to make branch prediction more effective. Example:

if (checkForErrorA())
  handleErrorA();
else if (checkForErrorB())
  handleErrorB();
else
  processStuff();

// Improved
int64_t errorFlags;
...
if (!errorFlags)
  processStuff();
else
  handleError(errorFlags); // handleError can branch further here; it's not the hotpath, so it does not need to be fast.

Personal take: it's something like the early-return pattern, except inverted - the hotpath call is hoisted to the top so the common case pays for only a single check.

Template-based configuration

Replace virtual functions with templates if you know the set of possible instantiations of a class. Example:

class OrderSender {
public:
    virtual void sendOrder(const Order& order) = 0;
};

class FastOrderSender : public OrderSender {
public:
    void sendOrder(const Order& order) override {
        // Fast implementation
    }
};

class ReliableOrderSender : public OrderSender {
public:
    void sendOrder(const Order& order) override {
        // Reliable implementation with retry logic
    }
};

Issue with above:

  • When the program calls orderSenderObject.sendOrder() where orderSenderObject is referenced through the base type OrderSender, overhead is incurred looking up the correct function in a vtable at runtime.
  • The vtable is also one more thing fighting for cache space, putting more stress on the cache.
  • Faster implementation below:
// No base class needed - just concrete types
class FastOrderSender {
public:
    void sendOrder(const Order& order) { /* Fast implementation */ }
};

class ReliableOrderSender {
public:
    void sendOrder(const Order& order) { /* Reliable implementation */ }
};

// Template class instead of inheritance
template<typename SenderType>
class OrderManager {
    SenderType sender;
public:
    void processOrder(const Order& order) {
        sender.sendOrder(order);  // Direct function call - no virtual dispatch
    }
};

// Factory: OrderManager<FastOrderSender> and OrderManager<ReliableOrderSender>
// are unrelated types, so a single std::unique_ptr<OrderManager> cannot hold
// either one. One option is std::variant (requires <variant>), which is only
// touched at startup, not on the hotpath.
using AnyOrderManager = std::variant<OrderManager<FastOrderSender>,
                                     OrderManager<ReliableOrderSender>>;

AnyOrderManager createOrderManager(const Config& config) {
    if (config.senderType == "fast")
        return OrderManager<FastOrderSender>{};
    return OrderManager<ReliableOrderSender>{};
}

Benefits to this approach

  • Within OrderManager<SenderType>, sendOrder is a direct (non-virtual) call, so the compiler can inline it.
  • No vtable lookup overhead, better cache performance.

Personal take: this is going to anger software-engineering purists as we toss OOP out of the window, but hey, anything to squeeze out more performance.

Use lambda functions

Lambda functions passed as template parameters are easily inlined by compilers; taking a std::function instead adds type erasure (and possibly a heap allocation), which can defeat inlining. Example below:

template <typename PopulateFn>
void sendMessage(PopulateFn&& populate) {
    Message msg = prepareMessage();
    populate(msg);
    // then do something with msg
}

sendMessage([](Message& msg) {
    msg.price = 100.5;
    msg.quantity = 1000;
});

Whole example above can be inlined to something like below:

Message msg = prepareMessage(); // alternatively, the lines in prepareMessage gets inlined here
msg.price = 100.5;
msg.quantity = 1000;
// do something with msg

Memory allocation

Memory allocation is slow. Possible solutions:

  • Write your own custom allocator
  • Pre-allocate objects in a pool, reuse the objects instead of deallocating.
  • delete calls free, which involves book-keeping and possibly syscalls - both expensive.
  • If large objects must be deleted (to maybe free up memory), do it from a non-critical thread that is not the main execution thread.
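The pre-allocation point above can be sketched as a fixed-size object pool (the class and member names here are illustrative, not Optiver's code): objects are carved out of a buffer allocated up front, and acquire/release just move free-list pointers, so the hotpath never touches malloc/free.

```cpp
#include <array>
#include <cstddef>
#include <new>

// Minimal fixed-capacity pool: all storage is reserved at construction time.
template <typename T, std::size_t N>
class ObjectPool {
    union Slot { Slot* next; alignas(T) unsigned char storage[sizeof(T)]; };
    std::array<Slot, N> slots_;
    Slot* freeList_ = nullptr;
public:
    ObjectPool() {
        for (std::size_t i = 0; i < N; ++i) {
            slots_[i].next = freeList_;   // thread every slot onto the free list
            freeList_ = &slots_[i];
        }
    }
    T* acquire() {
        if (!freeList_) return nullptr;   // pool exhausted
        Slot* s = freeList_;
        freeList_ = s->next;
        return new (s->storage) T{};      // placement-new: no heap allocation
    }
    void release(T* p) {
        p->~T();                          // destroy, but keep the memory
        Slot* s = reinterpret_cast<Slot*>(p);
        s->next = freeList_;
        freeList_ = s;
    }
};
```

Released slots go back on the free list and are handed out again, which is exactly the "reuse instead of deallocating" idea above.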

Exceptions in C++

  • Exceptions are mostly zero-cost as long as none is thrown.
  • Don't use exceptions for control flow: a thrown exception is expensive, and it's bad practice that results in ugly code anyway.

Multi-threading

  • Avoided for latency-sensitive code.
  • Locks are expensive, and even lock-free code may require lock-like behaviour at the hardware level (e.g. locked cache-line operations for atomics).
  • If multi-threading is needed, consider not even sharing memory space, instead just throw copies of data around from producer to consumer (likely via a lock-free queue).
    • If data races are not a concern, can consider removing data synchronization entirely.
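The producer/consumer idea above can be sketched as a single-producer/single-consumer ring buffer (a simplified stand-in for the lock-free queues the talk mentions, not production code): the producer only ever writes head_, the consumer only ever writes tail_, so acquire/release atomics are enough and no locks are needed.

```cpp
#include <atomic>
#include <cstddef>
#include <optional>

// Minimal SPSC ring buffer. N must be a power of two so `& (N - 1)`
// can replace a modulo. Indices grow monotonically; unsigned wraparound
// makes the full/empty checks correct.
template <typename T, std::size_t N>
class SpscQueue {
    T buffer_[N];
    std::atomic<std::size_t> head_{0}; // next slot to write (producer only)
    std::atomic<std::size_t> tail_{0}; // next slot to read (consumer only)
public:
    bool push(const T& v) {
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h - tail_.load(std::memory_order_acquire) == N) return false; // full
        buffer_[h & (N - 1)] = v;
        head_.store(h + 1, std::memory_order_release); // publish the write
        return true;
    }
    std::optional<T> pop() {
        std::size_t t = tail_.load(std::memory_order_relaxed);
        if (t == head_.load(std::memory_order_acquire)) return std::nullopt; // empty
        T v = buffer_[t & (N - 1)];
        tail_.store(t + 1, std::memory_order_release); // free the slot
        return v;
    }
};
```

Note that pop() hands back a copy of the data, matching the "throw copies around" advice rather than sharing mutable state.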

De-normalizing data

  • Optimize data structures around access patterns. Example:
class Instrument {
    float price;
    int marketId; // foreign key
    // other data
};

class Market {
    float quantityMultiplier;
    std::string_view marketName;
    // other market data
};

// Traditional "OOP-compliant" lookup pattern
void process(Instrument& instr) {
    Market* market = lookupMarket(instr.marketId);
    orderMsg.price = instr.price;
    float adjustedQuantity = qty * market->quantityMultiplier;
    // ...
}

If quantityMultiplier is always going to be accessed alongside Instrument.price like in the example above, consider de-normalizing it and storing a copy in the Instrument struct to reduce cache stress.

class Instrument {
    float price;
    float quantityMultiplier; // de-normalized copy from Market
    // other data
};

class Market {
    float quantityMultiplier;
    std::string_view marketName;
    // other market data
};

void process(Instrument& instr) {
    orderMsg.price = instr.price;
    float adjustedQuantity = qty * instr.quantityMultiplier;
    // ...
}

Only one cache load is needed for the two lines of code in process, as a cache line is 64 bytes on an x86-64 system and de-normalizing the data puts price and quantityMultiplier on the same cache line.

Hashmap alternate implementations

  • C++'s std::unordered_map is typically implemented with buckets, each pointing to a linked list of entries in case of collisions.
    • Problem: may suffer performance degradation because linked lists have poor cache locality.
  • Consider using alternate implementations of hash maps such as one with open addressing
  • Example used by Optiver, provided by Carl Cook:
struct HashEntry {
    uint64_t hash; // 8 bytes - precomputed hash
    void* pointer; // 8 bytes - pointer to actual key-value
};

HashEntry table[TABLE_SIZE]; // array of 16-byte elements
  • Because an entry is 16 bytes, one cache line fetch can fetch 4 entries (total 64 bytes).
  • Linear-probing collision resolution becomes more efficient, as the number of cache line fetches per probe sequence is reduced.
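A lookup over such a table might look like the sketch below (function names and the empty-slot convention are mine, not Optiver's). Because the precomputed hash lives inside the 16-byte entry, mismatching slots are rejected without chasing the pointer, and each cache line fetched during probing covers four entries.

```cpp
#include <cstdint>
#include <cstddef>

struct HashEntry {
    uint64_t hash;    // precomputed hash; this sketch reserves 0 for "empty"
    void*    pointer; // pointer to the actual key-value record
};

constexpr std::size_t TABLE_SIZE = 1024; // power of two, so & works as modulo

void insert(HashEntry* table, uint64_t hash, void* value) {
    std::size_t i = hash & (TABLE_SIZE - 1);
    while (table[i].hash != 0)                 // linear probing on collision
        i = (i + 1) & (TABLE_SIZE - 1);
    table[i] = {hash, value};
}

void* find(const HashEntry* table, uint64_t hash) {
    for (std::size_t i = hash & (TABLE_SIZE - 1); ;
         i = (i + 1) & (TABLE_SIZE - 1)) {
        if (table[i].hash == hash) return table[i].pointer; // hit
        if (table[i].hash == 0)    return nullptr;          // empty slot: absent
    }
}
```

A real implementation would also verify the full key behind the pointer on a hash match and handle deletion; the sketch only shows the probing pattern.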

((always_inline)) and ((noinline))

Not to be confused with the inline keyword, which is about linkage (allowing multiple definitions across translation units), not about forcing inlining.

Inlining can make code faster or slower.

  • Prefer specifying __attribute__((always_inline)) for functions that are always called along the hotpath for better instruction cache usage.
  • Prefer specifying __attribute__((noinline)) for non-critical tasks like error-handling which the program will not always execute along the hotpath, to reduce instruction cache pollution.
  • Always measure after specifying always_inline or noinline to verify whether it helps with performance.
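A tiny sketch of the two attributes (GCC/Clang syntax; the function names are illustrative): the hotpath helper is forced inline into every caller, while the error logger stays a real out-of-line call so its code never sits in the instruction cache next to hotpath code.

```cpp
#include <cstdio>

// Hotpath helper: forced inline at every call site.
__attribute__((always_inline)) inline int adjustPrice(int price) {
    return price + 1; // trivial stand-in for hotpath work
}

// Cold path: never inlined, kept out of hotpath code layout.
__attribute__((noinline)) void logError(const char* msg) {
    std::fprintf(stderr, "error: %s\n", msg);
}
```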

Keeping cache hot

  • Because hotpath is executed very infrequently in trading systems, instruction cache can be trampled by non-hotpath data and instructions.
  • The branch predictor also loses its training for hotpath branches.
  • Solution: Simulate hotpath executions (with dummy data) many times (1,000-10,000 times) before actual execution, to warm up the instruction cache and train the branch predictor.
    • Some NICs allow for pushing of data to the NIC without transmission just for this use case.
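The warm-up loop can be sketched as below (the dryRun flag and all names are made up for illustration): the full hotpath runs on dummy data so its instructions stay in the I-cache and the branch predictor stays trained, and only the final transmission is suppressed.

```cpp
struct MarketUpdate { double price; };

int ordersSent = 0; // stand-in for real order transmission

void hotpath(const MarketUpdate& u, bool dryRun) {
    double decision = u.price * 1.01; // same computation as the real path
    if (!dryRun)
        ++ordersSent;                 // real run: "send" the order
    (void)decision;
}

void warmCache() {
    MarketUpdate dummy{100.0};
    for (int i = 0; i < 10'000; ++i)
        hotpath(dummy, /*dryRun=*/true); // exercise the path, send nothing
}
```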

Disabling cores for single-threaded programs

Disable all but 1 core, so that the single core gets full L3 cache.

(Small nitpick) Placement new's slight inefficiency

Older versions of gcc (<7.1) and clang (<3.4) compile placement new with a null pointer check on the passed buffer. Newer compilers elide the check, since passing a null pointer to placement new is undefined behaviour.

Small string optimization

  • instruments.find({"IBM"}) - the temporary string avoids allocation if it is <=15 chars for gcc >=5.1 (libstdc++ C++11 ABI) and <=22 chars for clang (libc++)
  • However, it will still allocate even with gcc >=5.1 on distros that keep the old pre-C++11 libstdc++ ABI for compatibility (e.g. RHEL/CentOS), since the old copy-on-write std::string always allocates
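One way to sidestep the temporary-string question entirely, regardless of SSO or ABI, is heterogeneous lookup with a transparent comparator (a general C++14 technique, not something prescribed in the talk): with std::less<>, find() can take the literal directly and no std::string is constructed for the lookup at all.

```cpp
#include <map>
#include <string>

// std::less<> (transparent comparator) enables find() overloads that
// accept any type comparable with the key - here, a C string literal.
std::map<std::string, int, std::less<>> instruments;

int* lookup(const char* symbol) {
    auto it = instruments.find(symbol); // no temporary std::string created
    return it == instruments.end() ? nullptr : &it->second;
}
```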

(Small nitpick) Static local variable initialization

Static local variables have a hidden guard variable tracking whether they have been initialized. Every access to the static local checks the guard first (and since C++11 the first access must be thread-safe).
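A small sketch of the cost and one way around it (names are illustrative): the lazy version pays a guard check on every call, while hoisting the object to namespace scope moves initialization to program startup, outside the hotpath, so accesses are plain loads.

```cpp
struct Config { int threshold = 42; };

int lazy() {
    static Config c;           // compiler emits a hidden guard, checked each call
    return c.threshold;
}

Config g_config;               // initialized before main, no guard

int eager() {
    return g_config.threshold; // plain load, no guard check
}
```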

(Small nitpick) std::function may allocate

gcc may not optimize away a std::function that does nothing; check the generated assembly to verify.

std::pow can be slow.

std::pow is a transcendental function; implementations typically try a fast phase first and fall back to a slower, higher-precision phase if the accuracy of the result is not acceptable.
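When the exponent is a small non-negative integer, exponentiation by squaring sidesteps std::pow (and its slow-path risk) entirely. A minimal sketch:

```cpp
// Computes base^exp with O(log exp) multiplications, no libm call.
double ipow(double base, unsigned exp) {
    double result = 1.0;
    while (exp != 0) {
        if (exp & 1) result *= base; // fold in the current bit's contribution
        base *= base;                // square for the next bit
        exp >>= 1;
    }
    return result;
}
```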

Measurement of low latency systems (wooo CS5239)

  • Profiling: examine what the code does and find bottlenecks.
    • Sampling profilers (e.g. gprof): miss key events, because most of the time trading software does nothing
    • Instrumentation profilers (e.g. valgrind): basically a virtual machine simulating a CPU; does not give accurate I/O results, and is too intrusive.
  • Benchmarking: timing the speed of a system with specified inputs.
    • Microbenchmarks (e.g. Google Benchmark): not representative of a realistic environment

Best measurement: setup a benchmarking system that emulates production as much as possible.

  • e.g. have a server replay market data like an exchange would, feed that data to the server running the software being benchmarked, and have a switch tap the network and record all packets passing through.
  • Very hard to get a setup that is as close as possible to production.