From my experience building NSE/BSE exchange simulators at TCS, achieving sub-millisecond latency requires careful architecture, optimized code, and deep system understanding.

Architecture Overview

The exchange simulator processes millions of order events daily. Here's the high-level architecture:

// Worker pool for concurrent processing
type WorkerPool struct {
    workers int
    tasks   chan Task
}

func NewWorkerPool(workers int) *WorkerPool {
    return &WorkerPool{
        workers: workers,
        tasks:   make(chan Task, 10000),
    }
}

Key Optimization Techniques

1. Goroutine Pooling

Avoid goroutine creation overhead by reusing workers:

// Fixed pool of 1000 workers
pool := NewWorkerPool(1000)
for i := 0; i < 1000; i++ {
    go pool.worker()
}

2. Memory Pooling

Reduce GC pressure with sync.Pool:

var orderPool = sync.Pool{
    New: func() interface{} {
        return &Order{}
    },
}

func getOrder() *Order {
    return orderPool.Get().(*Order)
}

func putOrder(o *Order) {
    orderPool.Put(o)
}

3. Linux Kernel Tuning

4. Zero-Copy Serialization

Using Protocol Buffers with zero-copy optimization:

message Order {
    bytes id = 1;
    int64 timestamp = 2;
    double price = 3;
    int32 quantity = 4;
    OrderType type = 5;
}

Performance Results

Metric Before Optimization After Optimization
Average Latency 5.2ms 0.8ms
Throughput 50K orders/sec 250K orders/sec
CPU Usage 85% 65%

Lessons Learned

  1. Profile before optimizing - identify real bottlenecks
  2. Batch operations when possible
  3. Use appropriate data structures (arrays vs maps)
  4. Monitor GC pauses and memory usage
  5. Test with production-like load patterns

Note: These optimizations were implemented for TCS's order routing system handling real-time NSE/BSE market data.