From my experience building NSE/BSE exchange simulators at TCS, achieving sub-millisecond latency requires careful architecture, optimized code, and deep system understanding.
Architecture Overview
The exchange simulator processes millions of order events daily. Here's the high-level architecture:
```go
// Worker pool for concurrent processing
type WorkerPool struct {
    workers int
    tasks   chan Task
}

func NewWorkerPool(workers int) *WorkerPool {
    return &WorkerPool{
        workers: workers,
        tasks:   make(chan Task, 10000),
    }
}
```
Key Optimization Techniques
1. Goroutine Pooling
Avoid goroutine creation overhead by reusing workers:
```go
// Fixed pool of 1000 workers, started once at boot and reused
pool := NewWorkerPool(1000)
for i := 0; i < pool.workers; i++ {
    go pool.worker()
}
```
2. Memory Pooling
Reduce GC pressure with sync.Pool:
```go
var orderPool = sync.Pool{
    New: func() interface{} {
        return &Order{}
    },
}

func getOrder() *Order {
    return orderPool.Get().(*Order)
}

func putOrder(o *Order) {
    *o = Order{} // reset before returning so stale fields never leak into the next use
    orderPool.Put(o)
}
```
3. Linux Kernel Tuning
- TCP_NODELAY: Disable Nagle's algorithm
- SO_REUSEPORT: Allow multiple sockets on same port
- CPU affinity: goroutines cannot be pinned directly; pin the underlying OS thread (runtime.LockOSThread plus sched_setaffinity or taskset) to a specific core
- Hugepages: 2MB pages for memory-intensive operations
4. Zero-Copy Serialization
Using Protocol Buffers with zero-copy optimization:
```proto
message Order {
  bytes id = 1;
  int64 timestamp = 2;
  double price = 3;
  int32 quantity = 4;
  OrderType type = 5;
}
```
Performance Results
| Metric | Before Optimization | After Optimization |
|---|---|---|
| Average Latency | 5.2ms | 0.8ms |
| Throughput | 50K orders/sec | 250K orders/sec |
| CPU Usage | 85% | 65% |
Lessons Learned
- Profile before optimizing - identify real bottlenecks
- Batch operations when possible
- Use appropriate data structures (arrays vs maps)
- Monitor GC pauses and memory usage
- Test with production-like load patterns
Note: These optimizations were implemented for TCS's order routing system handling real-time NSE/BSE market data.