The Too Many Parts Problem
This guide is part of a collection of findings gained from community meetups. For more real-world solutions and insights, you can browse by specific problem. Need more performance optimization tips? Check out the Performance Optimization community insights guide.
Universal pain point: small, frequent inserts degrade performance by creating an explosion of parts.
Clear warnings from ClickHouse engineers and users:
- "If you are doing 100,000 inserts it is going to create 100,000 parts and behind the scenes over time click house is going to merge those parts... more parts you create the more meta work that you create for click house to merge"
- "If the inserts are too small we have another question that the background merging stress will not merge these small parts together in time then when the parts accumulating too many it will easy to get too many parts error which will slow down our writing throughput"
- "We encourage our users to insert the data in batches for example 20,000 rows at once"
Recognize the Problem Early
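The earliest signal is usually a rising count of active parts per table, visible in the system.parts system table; once a table crosses the parts_to_throw_insert MergeTree setting, inserts start failing with the "Too many parts" error. A small monitoring sketch follows (the warning threshold here is an illustrative assumption, not a ClickHouse default):

```python
from clickhouse_driver import Client

client = Client("localhost")

PART_COUNT_WARNING = 300  # illustrative threshold; tune to your workload

# Count active parts per table via the system.parts system table.
rows = client.execute(
    """
    SELECT database, table, count() AS active_parts
    FROM system.parts
    WHERE active
    GROUP BY database, table
    ORDER BY active_parts DESC
    """
)

for database, table, active_parts in rows:
    if active_parts > PART_COUNT_WARNING:
        print(f"WARNING: {database}.{table} has {active_parts} active parts")
```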
Proper Insert Batching
Community-proven batching strategy from production deployments:
"All these batching services they buffer the traffic... either for 30 seconds or till the data size hits 200 MB... there is a fine balance that needs to be struck here for how much to buffer... we don't want to insert again too frequently because then it unnecessarily consumes click house cycles" - Production observability platform processing TB/day
Alternative: Async Inserts (ClickHouse 21.11+)
"We developed a function called async insert... this mechanism is straightforward similar to buffer table we insert to the server side and use some buffer to collect these inserts by default we have 16 threads to collect this buffer and if the buffer is large enough or reach timeout we will flush the buffer to the storage so apart will contain multiple inserts" - ClickHouse team explaining built-in solution