Go Runtime: How GC Impacts P99 Latency in High-Load APIs
You have an API with a great median latency of 5ms, but P99 suddenly spikes to 500ms? Clients complain about periodic freezes? Welcome to the world of the Go garbage collector and its impact on tail latency.
In this article, we’ll tackle a real problem: how a 10ms GC pause turns into 500ms latency for users, and what to do to keep P99 latency under control.
The Problem: Great P50, Terrible P99
Real Production Case
API metrics before optimization:
P50: 4ms ✅
P90: 12ms ✅
P95: 45ms ⚠️
P99: 520ms ❌
P99.9: 2.1s 💥
What’s happening:
- 50% of requests processed in 4ms - excellent
- 90% in 12ms - good
- But 1% of users wait half a second
- And 0.1% wait 2+ seconds!
Why is this critical:
At 100,000 RPS:
- 1,000 requests/sec get 500ms latency
- 100 requests/sec wait 2+ seconds
- 60,000 bad requests per minute
The problem? Garbage Collector pauses.
How Go GC Works
Concurrent Mark-Sweep
Go uses a concurrent garbage collector with phases:
1. Mark Setup (STW) - ~50-200μs
↓ (stops entire application)
2. Concurrent Mark - main time
↓ (runs parallel with application)
3. Mark Termination (STW) - ~50-500μs
↓ (stops again)
4. Sweep (concurrent) - background cleanup
Stop-The-World (STW) phases are the source of latency spikes.
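You can watch these pauses from inside a process, without gctrace. Here is a minimal sketch using runtime/debug.ReadGCStats (this helper program is mine, not part of the article's examples):
package main

import (
	"fmt"
	"runtime/debug"
)

// Package-level sink so the allocations below are not optimized away.
var sink []byte

func main() {
	// Churn through short-lived allocations so a few GC cycles run.
	for i := 0; i < 1_000_000; i++ {
		sink = make([]byte, 512)
	}

	var stats debug.GCStats
	debug.ReadGCStats(&stats)

	// Pause holds per-cycle STW pause durations, most recent first.
	fmt.Printf("GC cycles: %d, total STW pause: %v\n", stats.NumGC, stats.PauseTotal)
	for i, p := range stats.Pause {
		if i == 3 {
			break
		}
		fmt.Printf("recent pause: %v\n", p)
	}
}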
When GC Triggers
// GC triggers when the heap grows to twice the size
// of the live heap left after the last GC
// Example:
// Live heap after GC: 1GB
// GC triggers when: heap reaches 2GB
// This growth target is controlled by GOGC (default 100)
The problem in high-load APIs:
// At 100k RPS and 1KB per request:
// 100,000 req/s * 1KB = ~100MB/s allocations
// With GOGC=100 and 1GB after last GC:
// GC triggers after ~10 seconds
// In those 10 seconds ~1GB of garbage accumulates,
// so each cycle has a lot of work to do and its pauses land on live traffic
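You can see this trigger point directly: runtime.MemStats exposes the heap size at which the runtime plans to start the next cycle. A small sketch (not from the original article):
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	// HeapAlloc is the currently allocated heap; NextGC is the heap size
	// at which the next cycle is targeted to start. With GOGC=100,
	// NextGC is roughly twice the live heap after the previous cycle.
	fmt.Printf("heap alloc: %d KB\n", m.HeapAlloc/1024)
	fmt.Printf("next GC at: %d KB\n", m.NextGC/1024)
	fmt.Printf("completed cycles: %d\n", m.NumGC)
}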
Diagnosing the Problem
Step 1: Enable GC Logging
# Export environment variable
export GODEBUG=gctrace=1
# Run application
./your-app
GC trace output:
gc 1 @0.004s 2%: 0.018+1.3+0.076 ms clock, 0.14+0.35/1.2/3.0+0.61 ms cpu, 4->4->3 MB, 5 MB goal, 8 P
gc 2 @0.015s 3%: 0.021+2.1+0.095 ms clock, 0.17+0.42/2.0/5.2+0.76 ms cpu, 5->6->4 MB, 6 MB goal, 8 P
gc 3 @0.045s 4%: 0.025+15.2+0.12 ms clock, 0.20+0.68/14.8/42.1+0.99 ms cpu, 7->9->6 MB, 8 MB goal, 8 P
^^^^
Mark phase - affects latency!
Decoding important parts:
gc 3 @0.045s 4%: 0.025+15.2+0.12 ms clock
^^^^^ ^^^^ ^^^^^
STW Mark STW
setup phase term
Step 2: Measure Real Impact
package main
import (
	"fmt"
	"runtime"
	"sort"
	"time"
)
type LatencyTracker struct {
samples []time.Duration
}
func (lt *LatencyTracker) Track(d time.Duration) {
lt.samples = append(lt.samples, d)
}
func (lt *LatencyTracker) Percentile(p float64) time.Duration {
if len(lt.samples) == 0 {
return 0
}
sort.Slice(lt.samples, func(i, j int) bool {
return lt.samples[i] < lt.samples[j]
})
idx := int(float64(len(lt.samples)) * p / 100.0)
if idx >= len(lt.samples) {
idx = len(lt.samples) - 1
}
return lt.samples[idx]
}
func benchmarkWithGC() {
tracker := &LatencyTracker{}
// Simulate load
for i := 0; i < 100000; i++ {
start := time.Now()
// Simulate request processing with allocations
data := make([]byte, 1024)
_ = processRequest(data)
elapsed := time.Since(start)
tracker.Track(elapsed)
// Every 1000 requests - output statistics
if i > 0 && i%1000 == 0 {
var m runtime.MemStats
runtime.ReadMemStats(&m)
fmt.Printf("Request %d - Heap: %d MB, P99: %v\n",
i,
m.HeapAlloc/1024/1024,
tracker.Percentile(99),
)
}
}
fmt.Printf("\nFinal stats:\n")
fmt.Printf("P50: %v\n", tracker.Percentile(50))
fmt.Printf("P90: %v\n", tracker.Percentile(90))
fmt.Printf("P99: %v\n", tracker.Percentile(99))
fmt.Printf("P99.9: %v\n", tracker.Percentile(99.9))
}
func processRequest(data []byte) []byte {
	// Create temporary objects (pressure on GC)
	temp := make([]byte, len(data)*2)
	copy(temp, data)
	return temp[:len(data)]
}

func main() {
	benchmarkWithGC()
}
Step 3: Profiling
package main

import (
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers on http.DefaultServeMux
)
func main() {
// Enable pprof endpoints
go func() {
http.ListenAndServe("localhost:6060", nil)
}()
// Your application
runApp()
}
Analyze allocations:
# Collect heap profile
curl http://localhost:6060/debug/pprof/heap > heap.prof
# Analyze
go tool pprof heap.prof
# In pprof console
(pprof) top10
(pprof) list functionName
(pprof) web # visualization
Solution 1: GOGC Tuning
Understanding GOGC
// GOGC controls GC aggressiveness
// Default GOGC=100
// GOGC=100: GC triggers when heap grew by 100%
// Live heap: 1GB -> GC at 2GB
// GOGC=200: GC triggers when heap grew by 200%
// Live heap: 1GB -> GC at 3GB
// GOGC=50: GC triggers when heap grew by 50%
// Live heap: 1GB -> GC at 1.5GB
Strategy: Increase GOGC
package main
import "runtime/debug"
func init() {
// Option 1: via environment variable
// export GOGC=200
// Option 2: programmatically
debug.SetGCPercent(200)
}
func main() {
// Your application
}
Effect:
Before (GOGC=100):
- GC every 10 seconds
- Pause 15ms
- P99: 520ms
After (GOGC=200):
- GC every 20 seconds
- Pause 25ms (longer, but less frequent!)
- P99: 180ms ✅
Trade-off:
- ✅ Less frequent GC → fewer spikes
- ❌ More memory used
- ❌ When GC happens, pause is longer
Sweet Spot
// For high-load APIs, good starting point:
debug.SetGCPercent(200) // or even 300
// Monitor:
// 1. Memory usage (shouldn't lead to OOM)
// 2. P99 latency (should improve)
// 3. GC pause duration (will be longer, but less frequent)
Solution 2: GOMEMLIMIT (Go 1.19+)
Soft Memory Limit
package main
import (
"runtime/debug"
)
func init() {
// Set soft memory limit
// Application can use up to 8GB
debug.SetMemoryLimit(8 * 1024 * 1024 * 1024) // 8GB
}
How it works:
Without GOMEMLIMIT:
- GC works by GOGC formula
- Can use unlimited memory
- In container can lead to OOM
With GOMEMLIMIT=8GB:
- GC becomes more aggressive approaching limit
- Protects from OOM in Kubernetes
- Better predictability
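The two knobs can also be combined. Since Go 1.19 you can switch the GOGC heuristic off entirely and let the memory limit alone drive collection, a configuration the runtime supports for memory-constrained containers. A minimal sketch (the 4 GiB value is just an example):
package main

import "runtime/debug"

func init() {
	// Disable the GOGC-based trigger; the soft limit below still forces
	// collection as the heap approaches 4 GiB.
	debug.SetGCPercent(-1)
	debug.SetMemoryLimit(4 << 30) // 4 GiB
}

func main() {
	// Your application
}
Be careful with this combination if the live heap can get close to the limit: instead of an OOM you get a GC that runs almost continuously.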
Kubernetes Integration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  template:
    spec:
      containers:
        - name: app
          image: my-api:latest
          env:
            # Set GOMEMLIMIT to ~90% of the container memory limit
            - name: GOMEMLIMIT
              value: "7200MiB" # ~90% of 8Gi
            - name: GOGC
              value: "200"
          resources:
            requests:
              memory: "8Gi"
              cpu: "2000m"
            limits:
              memory: "8Gi"
              cpu: "4000m"
Why 90% of limit:
- 10% buffer for non-heap memory (stacks, mmap, etc.)
- Protection for edge cases
- Safety during spikes
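If you'd rather not hard-code the env var, the limit can also be derived at startup from the container's own cgroup memory limit. A sketch assuming cgroup v2 (the /sys/fs/cgroup/memory.max path) and the same 90% rule; the helper name is made up for this example:
package main

import (
	"fmt"
	"os"
	"runtime/debug"
	"strconv"
	"strings"
)

// setMemLimitFromCgroup reads the cgroup v2 memory limit and sets the Go
// soft memory limit to 90% of it, mirroring the rule above.
func setMemLimitFromCgroup() error {
	raw, err := os.ReadFile("/sys/fs/cgroup/memory.max")
	if err != nil {
		return err
	}
	s := strings.TrimSpace(string(raw))
	if s == "max" {
		return nil // no container limit set, keep the default
	}
	limit, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return err
	}
	debug.SetMemoryLimit(limit * 9 / 10)
	return nil
}

func main() {
	if err := setMemLimitFromCgroup(); err != nil {
		fmt.Println("could not derive memory limit:", err)
	}
	// Your application
}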
Solution 3: Ballast Memory
Technique for Stable GC
package main

import "fmt"

// A package-level reference keeps the ballast live for the whole process
// lifetime. A local slice with runtime.KeepAlive inside init would become
// garbage as soon as init returns.
var ballast []byte

func init() {
	// Create a large ballast slice.
	// This "tricks" the GC pacer: cycles become less frequent and more predictable.
	ballast = make([]byte, 2*1024*1024*1024) // 2GB
	fmt.Printf("Ballast allocated: %d GB\n", len(ballast)/1024/1024/1024)
}

func main() {
	// Your application
}
How it works:
Without ballast:
- Live heap: 100MB
- GC target: 200MB (GOGC=100)
- Many frequent GC cycles
With 2GB ballast:
- Live heap: 2.1GB (2GB ballast + 100MB data)
- GC target: 4.2GB
- Fewer GC cycles
- More stable pauses
Important:
// Ballast should be a HUGE slice, but it doesn't use
// real memory thanks to virtual memory
// It's just address space reservation
// Check actual usage:
var m runtime.MemStats
runtime.ReadMemStats(&m)
fmt.Printf("Actual heap: %d MB\n", m.HeapAlloc/1024/1024)
Solution 4: Reduce Allocations
Find Hot Spots
# Run with allocation profiling
go test -bench=. -benchmem -memprofile=mem.prof

# Analyze
go tool pprof mem.prof
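For those commands to have something to measure, you need a benchmark. A minimal sketch that exercises the processRequest helper from the measurement example earlier (file and function names are my own):
// process_bench_test.go
package main

import "testing"

func BenchmarkProcessRequest(b *testing.B) {
	data := make([]byte, 1024)
	b.ReportAllocs() // reports B/op and allocs/op, like -benchmem
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = processRequest(data)
	}
}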
Technique 1: sync.Pool for Reuse
var bufferPool = sync.Pool{
New: func() interface{} {
return new(bytes.Buffer)
},
}
// Before: create new buffer every time
func processDataBad(data []byte) []byte {
buf := new(bytes.Buffer)
buf.Write(data)
// ... processing
return buf.Bytes()
}
// After: reuse buffers
func processDataGood(data []byte) []byte {
buf := bufferPool.Get().(*bytes.Buffer)
defer func() {
buf.Reset()
bufferPool.Put(buf)
}()
buf.Write(data)
// ... processing
// Copy the result out: once the buffer goes back to the pool,
// its memory will be reused, so returning buf.Bytes() directly is unsafe
out := make([]byte, buf.Len())
copy(out, buf.Bytes())
return out
}
Technique 2: Preallocate Slices
// Before: many reallocations
func collectDataBad() []Item {
var items []Item // capacity = 0
for i := 0; i < 1000; i++ {
items = append(items, getItem(i))
// append will cause reallocation many times
}
return items
}
// After: one allocation
func collectDataGood() []Item {
items := make([]Item, 0, 1000) // preallocate
for i := 0; i < 1000; i++ {
items = append(items, getItem(i))
// no reallocation
}
return items
}
Technique 3: Avoid String Concatenation
// Before: many allocations
func buildStringBad(parts []string) string {
result := ""
for _, part := range parts {
result += part // each concatenation = new string!
}
return result
}
// After: one allocation
func buildStringGood(parts []string) string {
var builder strings.Builder
builder.Grow(estimateSize(parts)) // preallocate
for _, part := range parts {
builder.WriteString(part)
}
return builder.String()
}
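estimateSize isn't defined in the snippet above; one possible implementation simply sums the part lengths:
// estimateSize returns the total length of all parts so Builder.Grow
// can reserve the final size in a single allocation.
func estimateSize(parts []string) int {
	n := 0
	for _, p := range parts {
		n += len(p)
	}
	return n
}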
Solution 5: Optimize Data Structures
Example: JSON API Response
// Before: many small allocations
type UserResponseBad struct {
ID int `json:"id"`
Name *string `json:"name"` // pointer!
Email *string `json:"email"` // pointer!
Tags []string `json:"tags"`
Metadata map[string]string `json:"metadata"` // map allocates
}
// After: fewer allocations
type UserResponseGood struct {
ID int `json:"id"`
Name string `json:"name"` // value
Email string `json:"email"` // value
Tags [8]string `json:"tags"` // array instead of slice
Metadata [16]KeyValue `json:"metadata"` // array instead of map
}
type KeyValue struct {
Key string
Value string
}
Result:
UserResponseBad:
- 1 allocation for struct
- 2 allocations for string pointers
- 1 allocation for slice
- 1 allocation for map
= 5+ allocations per object
UserResponseGood:
- 1 allocation for entire struct
= 1 allocation per object
At 100k RPS: 500k vs 100k allocations/sec
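The per-object allocation counts are easy to sanity-check with testing.AllocsPerRun; a rough sketch (the sink variable only exists so the compiler can't optimize the constructors away):
package main

import (
	"fmt"
	"testing"
)

var sink interface{}

func main() {
	name, email := "alice", "alice@example.com"

	bad := testing.AllocsPerRun(1000, func() {
		sink = &UserResponseBad{
			ID:       1,
			Name:     &name,
			Email:    &email,
			Tags:     []string{"go", "gc"},
			Metadata: map[string]string{"plan": "pro"},
		}
	})
	good := testing.AllocsPerRun(1000, func() {
		sink = &UserResponseGood{ID: 1, Name: name, Email: email}
	})

	fmt.Printf("bad: %.0f allocs/op, good: %.0f allocs/op\n", bad, good)
}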
Production Case: Reducing P99 from 500ms to 50ms
Initial Situation
Service: REST API for recommendations
RPS: 80,000
Memory: 4GB
Pods: 20
Metrics BEFORE optimization:
P50: 5ms
P95: 38ms
P99: 520ms ❌
P99.9: 1.8s 💥
GC pauses: 10-50ms every 15 seconds
Step 1: Diagnosis
# Enabled GC trace
export GODEBUG=gctrace=1
# Result:
gc 145 @45.123s 4%: 0.12+42.3+0.18 ms clock
^^^^
42ms mark phase!
Step 2: Applied Optimizations
// 1. Increased GOGC
debug.SetGCPercent(300)
// 2. Set GOMEMLIMIT
debug.SetMemoryLimit(3686 * 1024 * 1024) // ~3.6GB (90% of 4GB)

// 3. Added ballast; the deferred KeepAlive keeps the slice referenced
//    until main returns
ballast := make([]byte, 1*1024*1024*1024) // 1GB
defer runtime.KeepAlive(ballast)
// 4. Optimized hot path with sync.Pool
var responsePool = sync.Pool{
New: func() interface{} {
return &RecommendationResponse{
Items: make([]Item, 0, 100),
}
},
}
Step 3: Results
Metrics AFTER optimization:
P50: 4ms (was 5ms)
P95: 22ms (was 38ms) ✅
P99: 48ms (was 520ms) ✅✅✅
P99.9: 120ms (was 1.8s) ✅✅✅
GC pauses: 15-30ms every 45 seconds
Memory: 5GB (was 4GB)
Cost: +25% memory
Benefit: P99 improved 10x!
ROI Calculation
Before:
- 1% of requests (800/sec) with 500ms+ latency
- Conversion loss on slow requests: ~30%
- Lost revenue: ~$50k/month
After:
- All requests < 100ms
- Additional memory costs: +$500/month
- ROI: 100x
Monitoring and Alerting
Prometheus Metrics
package main
import (
"runtime"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
)
var (
	// The default Go collector already exports a go_gc_duration_seconds summary,
	// so the custom histogram gets a different name to avoid a duplicate
	// registration panic in promauto.
	gcDuration = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "go_gc_pause_seconds",
		Help:    "GC pause duration",
		Buckets: []float64{0.0001, 0.001, 0.01, 0.1, 1},
	})
	heapAlloc = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "go_heap_alloc_bytes",
		Help: "Heap memory allocated",
	})
	numGC = promauto.NewCounter(prometheus.CounterOpts{
		Name: "go_gc_total",
		Help: "Total number of GC runs",
	})
)
var lastNumGC uint32

func collectGCMetrics() {
	var stats runtime.MemStats
	runtime.ReadMemStats(&stats)
	heapAlloc.Set(float64(stats.HeapAlloc))
	// NumGC is cumulative and PauseNs is a ring buffer of the last 256 pauses,
	// so record only the cycles that finished since the previous call.
	for gc := lastNumGC + 1; gc <= stats.NumGC; gc++ {
		gcDuration.Observe(float64(stats.PauseNs[(gc+255)%256]) / 1e9)
		numGC.Inc()
	}
	lastNumGC = stats.NumGC
}
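To actually expose these metrics, run the collector on a ticker and serve the default registry over /metrics; a minimal sketch using promhttp (the 15-second interval and port are arbitrary):
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Refresh the runtime stats in the background.
	go func() {
		ticker := time.NewTicker(15 * time.Second)
		defer ticker.Stop()
		for range ticker.C {
			collectGCMetrics()
		}
	}()

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}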
Grafana Dashboard
# P99 latency
histogram_quantile(0.99,
rate(http_request_duration_seconds_bucket[5m])
)
# GC pause P99
histogram_quantile(0.99,
  rate(go_gc_pause_seconds_bucket[5m])
)
# Heap usage
go_heap_alloc_bytes / 1024 / 1024
# GC frequency
rate(go_gc_total[5m])
Alerts
groups:
  - name: go_gc_alerts
    rules:
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 0.1
        for: 5m
        annotations:
          summary: "P99 latency > 100ms"
      - alert: FrequentGC
        expr: rate(go_gc_total[5m]) > 2
        for: 10m
        annotations:
          summary: "GC running more than 2 times per second"
      - alert: LongGCPause
        expr: |
          histogram_quantile(0.99,
            rate(go_gc_pause_seconds_bucket[5m])
          ) > 0.05
        for: 5m
        annotations:
          summary: "GC pause P99 > 50ms"
Optimization Checklist
Quick Wins (< 1 day)
- Enable GODEBUG=gctrace=1 and measure a baseline
- Set GOGC=200 and measure the effect
- Add GOMEMLIMIT to the Kubernetes deployment
- Set up Prometheus metrics for GC
Medium Effort (1 week)
- Profile allocations with pprof
- Add sync.Pool for hot paths
- Preallocate slices of known sizes
High Effort (2-4 weeks)
- Implement ballast memory
- Rewrite critical parts for zero-allocation
- Optimize JSON serialization
- Consider object pooling for all types
Conclusion
Go Garbage Collector is a powerful tool, but it can kill P99 latency in high-load systems. Key takeaways:
Main Problems:
- Stop-The-World pauses create latency spikes
- By default GC is optimized for throughput, not latency
- In high-throughput systems GC runs frequently
Solutions:
- GOGC=200-300 - less frequent GC
- GOMEMLIMIT - OOM protection and predictability
- Ballast memory - stable GC pauses
- Fewer allocations - less work for GC
- sync.Pool - object reuse
Results:
- P99 latency: 500ms → 50ms (10x improvement)
- Cost: +20-30% memory
- ROI: huge for business-critical APIs
Golden Rule:
Measure, optimize, monitor. GC tuning is a balance between memory, latency, and throughput. Start with GOGC and GOMEMLIMIT, then dive into allocation optimization.
Additional Resources
- Go Blog: Understanding Go Garbage Collection
- Go Blog: Profiling Go Programs
- Go Documentation: runtime/debug Package
- Prometheus Go Client
- Go Memory Management
Successfully conquered GC in production? Share your cases and metrics!