I need some guidance on building a k8s cluster for hosting around 200 WordPress websites. We want to keep costs low at first so we’re starting with 3-4 nodes with good hardware.
I’m worried about scaling issues when we get more customers later. Here’s what we’re thinking so far:
Network setup:
10G network ports at our datacenter
One or two public gateway IPs that all the site DNS records would point to
MetalLB with BGP for load balancing between nodes
Ingress options:
Currently testing Traefik but not sure if it can handle TLS for 200 domains
Looked at Ingress-NGINX, but its maintainers are moving on to other projects
Storage questions:
Want RWX storage so we can run multiple pods per site
Testing Longhorn now but heard it has issues with lots of volumes
Should we try Rook/Ceph instead?
Architecture decision:
Should each node run its own Nginx and MariaDB instances?
Or use cluster-wide database with MariaDB Galera?
WordPress setup:
Using PHP-FPM images to save resources, but they're tricky to get right
Any tips for reducing memory and CPU usage?
Management:
Should we build a custom operator or stick with Helm charts?
What do you think about this approach? Any suggestions for improvements?
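For reference, here's roughly the MetalLB setup we have in mind. The ASNs and address ranges below are placeholders, not our real values:

```yaml
# Placeholder sketch of our planned MetalLB BGP config
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: dc-router
  namespace: metallb-system
spec:
  myASN: 64512        # placeholder private ASN for the cluster
  peerASN: 64513      # placeholder ASN of the datacenter router
  peerAddress: 10.0.0.1
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: public-pool
  namespace: metallb-system
spec:
  addresses:
    - 203.0.113.10-203.0.113.11   # placeholder public IPs
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: public-adv
  namespace: metallb-system
spec:
  ipAddressPools:
    - public-pool
```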
traefik should handle 200 domains fine tbh, we’re doing ~180 sites with it and tls works great. just make sure you configure cert limits properly or letsencrypt will rate limit you. also consider database per site vs shared - shared galera sounds good but backup/restore gets messy when one client needs rollback
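e.g. something like this in traefik's static config while you're testing - point it at the staging CA first so you don't burn production rate limits (Let's Encrypt allows 50 certs per registered domain per week). email and storage path are placeholders:

```yaml
# traefik.yml (static config) - sketch, values are placeholders
certificatesResolvers:
  letsencrypt:
    acme:
      email: ops@example.com      # placeholder contact address
      storage: /data/acme.json    # must be on persistent storage
      # staging CA for testing; remove this line to switch to production
      caServer: https://acme-staging-v02.api.letsencrypt.org/directory
      tlsChallenge: {}
```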
Been running a similar setup for about 18 months now, though with fewer sites initially. Your architecture sounds solid but I'd suggest a few tweaks based on what we learned the hard way.

Storage: we started with Longhorn and ran into exactly the issues you mentioned around 150+ volumes. Switched to Rook/Ceph about 8 months ago and it's been much more stable under load. The initial setup is more complex but the performance gains are worth it.

Databases: we went with a hybrid approach - MariaDB Galera cluster for the database layer but kept it separate from the worker nodes. This gives you the redundancy without eating into resources needed for WordPress pods. We found that mixing database and application workloads on the same nodes created resource contention issues during traffic spikes.

One thing that saved us significant resources was implementing Redis for object caching across all sites. Cut our average pod memory usage by about 30% and dramatically improved response times. Also consider using a CDN early - it takes load off your cluster and makes scaling more predictable.

For the operator vs Helm question, we built a simple custom operator that handles WordPress deployments and database provisioning. It was worth the initial development time because managing 200+ Helm releases manually becomes a nightmare.
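If it helps, a rough sketch of the shared Redis cache we run - names, namespace, and sizing below are placeholders, and each site still needs the Redis Object Cache plugin with WP_REDIS_HOST pointed at the service:

```yaml
# Sketch: one shared Redis for object caching across sites.
# Namespace, sizing, and eviction policy are placeholders - tune for your load.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
  namespace: shared-services
spec:
  replicas: 1
  selector:
    matchLabels: { app: redis-cache }
  template:
    metadata:
      labels: { app: redis-cache }
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          # evict least-recently-used keys instead of failing writes at the cap
          args: ["--maxmemory", "2gb", "--maxmemory-policy", "allkeys-lru"]
          resources:
            limits: { memory: "2Gi" }
---
apiVersion: v1
kind: Service
metadata:
  name: redis-cache
  namespace: shared-services
spec:
  selector: { app: redis-cache }
  ports:
    - port: 6379
```

Sites then set WP_REDIS_HOST to redis-cache.shared-services.svc.cluster.local.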
Honestly the biggest mistake I see here is trying to cram everything onto 3-4 nodes from the start. We tried a similar approach two years back and hit serious bottlenecks around the 120 site mark. The issue isn't just resource limits but also blast radius when nodes go down.

Consider separating your database tier completely - run MariaDB on dedicated instances outside the cluster initially. This gives you better performance isolation and makes troubleshooting much easier when things break.

For WordPress specifically, we found that using a shared NFS backend performed better than RWX solutions like Longhorn for static assets, especially media files.

The real game changer was implementing proper resource quotas per namespace early on. Without this you'll have sites randomly consuming all available memory during traffic spikes. Also watch out for WordPress cron jobs - 200 sites running wp-cron simultaneously will crush your nodes. Disable wp-cron and run it externally via Kubernetes CronJobs instead.

One last thing - don't underestimate the operational overhead of managing this many WordPress instances. Build your monitoring and alerting before you scale, not after.
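Rough sketch of the quota + external cron setup, placeholders throughout - assumes you've set DISABLE_WP_CRON to true in each site's wp-config.php, and that each site lives in its own namespace:

```yaml
# Sketch: per-namespace quota so one site can't starve the others.
# Namespace name and limits are placeholders.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: site-quota
  namespace: site1
spec:
  hard:
    requests.cpu: "500m"
    requests.memory: 512Mi
    limits.cpu: "1"
    limits.memory: 1Gi
---
# Sketch: external cron for one site, replacing in-process wp-cron.
# Schedule, domain, and image choice are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: site1-wp-cron
  namespace: site1
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid   # never stack overlapping cron runs
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: wp-cron
              image: curlimages/curl:latest
              args: ["-fsS", "https://site1.example.com/wp-cron.php?doing_wp_cron"]
```

With 200 sites, stagger the schedules (e.g. spread the minute offsets) so they don't all fire at once - that defeats the point.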