Operations: The 2AM Debugging Toolkit
This guide is part of a collection of findings gathered from community meetups. For more real-world solutions and insights, you can browse by specific problem. Suffering from high operational costs? Check out the Cost Optimization community insights guide.
When Everything is Broken: Emergency Diagnostics
Community philosophy: "If something looks odd, even just slightly, something is wrong - investigate before it gets worse"
The nightmare scenario: "One of the worst days of my life over the past three years... it locked the database - you couldn't insert anything, couldn't read anything, CPU went crazy, memory usage went crazy"
EMERGENCY: Production Incident Queries (Copy-Paste Ready)
When your ClickHouse is down at 2AM, run these in order:
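A minimal sketch of that ordered check, using standard ClickHouse system tables; the thresholds (300 seconds of replication delay, the column selections) are illustrative assumptions rather than values from the meetups:

```sql
-- 1. Active errors: any non-zero count is worth investigating
SELECT name, value
FROM system.errors
WHERE value > 0
ORDER BY value DESC;

-- 2. Disk space: low disk is a classic 2AM culprit
SELECT name, path,
       formatReadableSize(free_space)  AS free,
       formatReadableSize(total_space) AS total
FROM system.disks;

-- 3. Replication lag: anything over ~300 seconds deserves attention
SELECT database, table, replica_name, absolute_delay
FROM system.replicas
WHERE absolute_delay > 300
ORDER BY absolute_delay DESC;

-- 4. What is running right now, heaviest memory consumers first
SELECT query_id, user, elapsed,
       formatReadableSize(memory_usage) AS memory,
       substring(query, 1, 100)         AS query_head
FROM system.processes
ORDER BY memory_usage DESC;
```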
Learning: Incident Pattern Recognition
Understand the failure modes with working examples:
Memory Exhaustion Detection
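A sketch of how memory exhaustion usually shows up, assuming the default `system.processes` and `system.query_log` tables are available; the one-hour window and the 'memory limit' pattern are illustrative:

```sql
-- Live view: queries currently holding the most memory
SELECT query_id, user,
       formatReadableSize(memory_usage)      AS memory,
       formatReadableSize(peak_memory_usage) AS peak_memory,
       substring(query, 1, 120)              AS query_head
FROM system.processes
ORDER BY memory_usage DESC
LIMIT 10;

-- History: queries that failed with a memory limit error in the last hour
SELECT event_time, user, substring(query, 1, 120) AS query_head, exception
FROM system.query_log
WHERE event_time > now() - INTERVAL 1 HOUR
  AND type = 'ExceptionWhileProcessing'
  AND exception ILIKE '%memory limit%'
ORDER BY event_time DESC;
```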
Bad Data Detection
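A sketch for spotting out-of-range data, assuming date-based partition names; the `events` table and `event_time` column in the second query are hypothetical placeholders for your own schema:

```sql
-- Oldest and newest partition per table: 1998 or 2050 values stand out immediately
SELECT database, table,
       min(partition) AS oldest_partition,
       max(partition) AS newest_partition,
       count()        AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY database, table;

-- Hypothetical check on a raw table: rows with implausible timestamps
-- (`events` and `event_time` are placeholders - adjust to your schema)
SELECT count()
FROM events
WHERE event_time < toDateTime('2000-01-01 00:00:00')
   OR event_time > now() + INTERVAL 7 DAY;
```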
The 2AM Methodology
Follow this exact sequence when everything is broken:
Phase 1: Immediate Triage (30 seconds)
- Run the `system.errors` check - any non-zero count = active incident
- Check disk space - "It took us from 12 to 4 AM... it was as simple as low disk"
- Look for replication lag > 5 minutes
Phase 2: Resource Investigation (2 minutes)
- Find memory-hungry queries in `system.processes` (see the sketch after this list)
- Check for stuck merges running >1 hour
- Kill obviously problematic queries
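Roughly, the Phase 2 checks could look like this; the one-hour and 300-second thresholds are illustrative, and `<query_id>` is a placeholder you fill in from the previous result:

```sql
-- Merges that have been running for more than an hour
SELECT database, table, elapsed, progress,
       formatReadableSize(memory_usage) AS memory
FROM system.merges
WHERE elapsed > 3600
ORDER BY elapsed DESC;

-- Long-running queries (> 300 s) that may need to be killed
SELECT query_id, user, elapsed, substring(query, 1, 100) AS query_head
FROM system.processes
WHERE elapsed > 300
ORDER BY elapsed DESC;

-- Kill a specific offender once you have its query_id
KILL QUERY WHERE query_id = '<query_id>';
```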
Phase 3: Data Quality Check (5 minutes)
- Look for bad partitions (1998, 2050 dates)
- Check for parts explosion (>1000 parts per table)
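One way to run the parts-explosion check (the bad-partition check is sketched under Bad Data Detection above); the 1000-part cutoff simply mirrors the rule of thumb in the list:

```sql
-- Tables with a suspicious number of active parts
SELECT database, table, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table
HAVING active_parts > 1000
ORDER BY active_parts DESC;
```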
Emergency Actions Reference
Production-tested solutions:
| Problem | Detection Query | Solution |
|---|---|---|
| Memory OOM | `SELECT * FROM system.processes WHERE memory_usage > 8000000000` | Enable external aggregation (e.g. `max_bytes_before_external_group_by`) |
| Disk Full | `SELECT sum(bytes_on_disk) FROM system.parts` | Delete old partitions, expand disk |
| Replication Lag | `SELECT * FROM system.replicas WHERE absolute_delay > 300` | Check network, restart lagging replica |
| Stuck Query | `SELECT * FROM system.processes WHERE elapsed > 300` | `KILL QUERY WHERE query_id = '...'` |
| Parts Explosion | `SELECT count() FROM system.parts WHERE active = 1` | Enable `async_insert`, increase batch sizes |
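The Solution column maps onto statements along these lines; the setting value, table name, and partition ID are illustrative placeholders, not recommendations from the meetups:

```sql
-- Memory OOM: spill large aggregations to disk instead of crashing
-- (8 GB threshold is illustrative - tune relative to max_memory_usage)
SET max_bytes_before_external_group_by = 8000000000;

-- Disk full: drop a partition you no longer need ('202301' is a placeholder)
ALTER TABLE my_db.my_table DROP PARTITION '202301';

-- Replication lag: restart replication for a lagging table
SYSTEM RESTART REPLICA my_db.my_table;

-- Stuck query: kill it by query_id
KILL QUERY WHERE query_id = '<query_id>';

-- Parts explosion: let the server batch small inserts
SET async_insert = 1;
```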
The golden rule: "Problems very rarely just pop out of nowhere - there are signs... investigate before it goes from 15 milliseconds to 30 seconds"
Community War Stories & Lessons
Disk Space Issues:
- "Default AWS EBS limit of 16TB hits you when you least expect it"
- "Other nodes keep sending data to the full node creating cascading failure"
Memory Exhaustion:
- "Out of memory typically appears when you have a big aggregation with a lot of keys"
- "Enable external aggregation - query will be slower but won't crash"
Bad Data:
- "Developers send data from 1998 or 2050 causing partition chaos"
- "Always validate timestamps before they hit production"
The key insight: Most 2AM incidents are preventable if you recognize the warning signs and have ready-to-use diagnostic queries.