I’m working on setting up a monitoring solution and have some questions about database performance. We’re looking at implementing a system with one main server and multiple proxy instances to monitor about 2000 network devices like switches and routers.
We expect to gather around 500,000 metrics every hour. My main worry is that having all this data flow into a single database on the central server might create performance issues.
Does anyone know what kind of throughput I can realistically expect from this setup? Are there any hard limits I should know about, or is it mostly about adding more RAM and CPU power to handle the load?
Any advice on optimizing database performance for this type of high-volume data collection would be really helpful!
I’ve run a similar monitoring setup, and database choice beats raw hardware specs every time. We started with PostgreSQL and hit write bottlenecks around 400k metrics/hour, even with solid hardware. TimescaleDB fixed most of our problems since it’s actually designed for time-series data; a plain relational setup struggles to keep up with constant high-frequency writes from monitoring. Set up data retention policies from day one - storage explodes fast at that volume. We learned this the hard way and nearly ran out of disk space within months. Also partition your data by time ranges; it makes pulling historical reports way faster.
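If it helps, here’s roughly what that looks like in practice - a minimal sketch assuming TimescaleDB 2.x and psycopg2. The table layout, 1-day chunk interval, and 90-day retention window are made-up examples, not anything from the original setup.

```python
# Sketch: turn a plain metrics table into a TimescaleDB hypertable
# (automatic time partitioning) and attach a retention policy.
# Schema, DSN, and intervals are illustrative assumptions.
import psycopg2

conn = psycopg2.connect("dbname=monitoring user=monitor")  # hypothetical DSN
conn.autocommit = True

with conn.cursor() as cur:
    # Plain table first; TimescaleDB splits it into time-based chunks.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS metrics (
            time        TIMESTAMPTZ NOT NULL,
            device_id   INTEGER     NOT NULL,
            metric_name TEXT        NOT NULL,
            value       DOUBLE PRECISION
        );
    """)
    # Hypertable = the "partition by time ranges" part, done automatically.
    cur.execute("""
        SELECT create_hypertable('metrics', 'time',
                                 chunk_time_interval => INTERVAL '1 day',
                                 if_not_exists => TRUE);
    """)
    # Retention policy so old chunks get dropped without manual cleanup.
    cur.execute("""
        SELECT add_retention_policy('metrics', INTERVAL '90 days',
                                    if_not_exists => TRUE);
    """)

conn.close()
```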
500k/hour is totally doable with proper indexing - that works out to roughly 140 inserts per second. We’re pushing 800k+ metrics/hour through MySQL with basic InnoDB tweaks. Start simple and scale when you actually hit limits. Most people overengineer before finding the real bottlenecks.
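For what it’s worth, here’s a quick sketch of what I mean by "proper indexing" and "basic InnoDB tweaks", using mysql-connector-python. The schema, index, and setting values are assumptions for illustration, not our exact config.

```python
# Sketch: one composite index matched to the read pattern, plus the InnoDB
# settings worth looking at first for a write-heavy workload.
# Table/column names, credentials, and values are illustrative assumptions.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="monitor",
                               password="change-me", database="monitoring")
cur = conn.cursor()

# One composite index that serves "metric X for device Y over time range T".
# Avoid indexing every column - each extra index slows inserts down.
cur.execute("""
    CREATE INDEX idx_metrics_device_time
        ON metrics (device_id, metric_name, collected_at)
""")

# Settings commonly tuned for write-heavy loads (normally set in my.cnf):
#   innodb_buffer_pool_size        -- most of the RAM on a dedicated DB box
#   innodb_flush_log_at_trx_commit -- 2 = flush log roughly once per second,
#                                     small crash-loss window, big write win
#   innodb_log_file_size           -- larger redo log smooths bursty writes
cur.execute("SET GLOBAL innodb_flush_log_at_trx_commit = 2")

conn.commit()
cur.close()
conn.close()
```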
500k metrics/hour isn’t too crazy tbh, we handle similar loads with decent hardware. Main thing is to batch your inserts instead of doing them one by one - it makes a huge difference. Also consider a time-series DB like InfluxDB instead of a regular SQL database - way better for this kind of data.
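To make the batching point concrete, here’s a minimal sketch using psycopg2’s execute_values. The metrics table, column names, and batch size are assumed for illustration.

```python
# Sketch: batched inserts via execute_values instead of one INSERT per metric.
# One round trip covers ~1000 rows, which is where most of the win comes from.
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=monitoring user=monitor")  # hypothetical DSN

def flush_batch(rows):
    """rows: list of (timestamp, device_id, metric_name, value) tuples."""
    with conn, conn.cursor() as cur:
        execute_values(
            cur,
            "INSERT INTO metrics (time, device_id, metric_name, value) VALUES %s",
            rows,
            page_size=1000,  # rows sent per statement
        )

# Typical use: accumulate rows in memory on the proxy and call flush_batch
# every N rows or every few seconds, whichever comes first.
```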
Your setup should handle that volume fine, but watch for lock contention during peak writes. We had similar throughput and found connection pooling config matters way more than people think. Set pool size based on actual concurrent writers, not total devices. We screwed up early by sticking with the default transaction isolation level - dropping to READ COMMITTED cut blocking significantly. Also stagger collection intervals slightly across proxies so all 2000 devices don’t slam the database at once. Split the write workload from reporting queries with read replicas if you can. Hardware requirements are pretty light until you start doing complex aggregations on live data.
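Rough sketch of the pool sizing, READ COMMITTED, and staggering ideas with psycopg2. The writer count, 60-second interval, and names are assumptions, not our actual config.

```python
# Sketch: pool sized by concurrent writer threads (not by device count),
# connections pinned to READ COMMITTED, and a deterministic per-proxy offset
# so collection cycles don't all line up on the same second.
import zlib
import psycopg2.extensions
from psycopg2.pool import ThreadedConnectionPool

CONCURRENT_WRITERS = 8   # assumed writer threads, not 2000 devices
POLL_INTERVAL = 60       # assumed seconds between collections

pool = ThreadedConnectionPool(
    1, CONCURRENT_WRITERS, "dbname=monitoring user=monitor")  # hypothetical DSN

def get_writer_conn():
    """Hand out pooled connections set to READ COMMITTED."""
    conn = pool.getconn()
    conn.set_session(
        isolation_level=psycopg2.extensions.ISOLATION_LEVEL_READ_COMMITTED)
    return conn

def poll_offset(proxy_name: str) -> int:
    """Stable offset (0..POLL_INTERVAL-1 seconds) derived from the proxy name,
    so each proxy starts its collection cycle at a slightly different time."""
    return zlib.crc32(proxy_name.encode()) % POLL_INTERVAL
```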
I’ve run similar setups, and your bottleneck won’t be the database itself - it’ll be write patterns and disk I/O. We pushed the same volumes through one PostgreSQL instance just fine, but we had to nail the write buffers and connection pooling first. Make sure your proxies batch metrics before sending them out. Cuts down on connection overhead big time. Go with async writes and tune your WAL settings for throughput over durability (assuming you’re okay losing some data if things crash). More memory helps, but don’t ignore network latency between your proxies and main server - that’s what actually killed our performance. Load test with real network conditions before you go live.
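In case it’s useful, a minimal sketch of the "throughput over durability" trade-off on PostgreSQL: synchronous_commit turned off for the writer session, with the server-side WAL knobs listed as comments. The values are assumed starting points, not our original tuning.

```python
# Sketch: async commit for the metric-writer session only. COMMIT returns
# before the WAL hits disk, so a crash can lose the last few transactions,
# but it never corrupts the database.
import psycopg2

conn = psycopg2.connect("dbname=monitoring user=monitor")  # hypothetical DSN
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("SET synchronous_commit TO OFF;")

# postgresql.conf settings usually paired with this (assumed starting points):
#   wal_buffers        = 16MB     -- more buffering for bursty WAL writes
#   checkpoint_timeout = 15min    -- fewer, larger checkpoints
#   max_wal_size       = 4GB      -- avoid size-triggered checkpoints mid-burst
```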