How can I design an API to support 100k requests per second?

I’m currently operating a system that uses AWS API Gateway along with Lambda, but it only scales up to around 3k requests per second. I’ve noticed that there seem to be inherent limitations with this configuration. Are there alternative solutions to Lambda that could better handle higher loads, or might the API Gateway itself be the bottleneck? I value the seamless AWS integrations, yet I’m open to restructuring my setup if it means achieving the desired performance.

Based on my experience, one promising approach is to shift away from a pure serverless model toward container-based solutions that give you more control over scalability. I moved from AWS Lambda to a container service such as ECS behind an Application Load Balancer to support higher request volumes. This method not only bypasses some of Lambda's inherent throttling constraints (concurrent-execution quotas in particular) but also allows for more granular resource tuning. You can also integrate in-memory caching and custom rate limiting to manage peak loads more efficiently without compromising performance.
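To make the "custom rate limiting" part concrete, here is a minimal token-bucket sketch. The class name, rate, and capacity are illustrative, not tied to any AWS service; in production you'd enforce this at the proxy or in shared state (e.g., Redis), not per-process.

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: allows bursts up to `capacity`
    while enforcing a steady average of `rate` requests per second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1   # spend one token for this request
            return True
        return False           # bucket empty: reject (return HTTP 429)

bucket = TokenBucket(rate=5, capacity=10)
burst = [bucket.allow() for _ in range(15)]
# The initial burst (up to `capacity` requests) passes; the excess is rejected.
```

The advantage over a fixed per-second counter is that short bursts up to `capacity` get through immediately while the long-run rate stays bounded.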

In my experience, one way to scale beyond what API Gateway and Lambda offer is to redesign your architecture to decouple the API layer from compute processing. For instance, building a lightweight proxy layer using a self-managed service such as Nginx or HAProxy on EC2 instances can help funnel requests into a more scalable backend. Using a container orchestration system like Kubernetes with autoscaling capabilities has worked well for me. Additionally, employing a message queue for request buffering can further smooth out traffic bursts and enhance overall system resilience.
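The request-buffering idea above can be sketched in-process with a bounded queue and a worker pool. This is only an illustration of the pattern (in a real deployment the queue would be SQS, Kafka, or similar, and the workers would be separate pods or instances); all names here are made up for the example.

```python
import queue
import threading

# Bounded queue absorbs traffic bursts; a fixed-size worker pool drains it
# at a steady pace, decoupling the API layer from compute processing.
request_queue: "queue.Queue" = queue.Queue(maxsize=1000)
results = []
results_lock = threading.Lock()

def worker():
    while True:
        item = request_queue.get()
        if item is None:                # sentinel: shut this worker down
            request_queue.task_done()
            break
        with results_lock:
            results.append(item * 2)    # stand-in for real request handling
        request_queue.task_done()

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()

for i in range(100):                    # simulated burst of incoming requests
    request_queue.put(i)                # blocks if the queue is full (backpressure)
for _ in workers:
    request_queue.put(None)             # one sentinel per worker
for w in workers:
    w.join()
```

The key property is backpressure: when the queue is full, producers block (or shed load) instead of overwhelming the backend, which is what smooths out the bursts.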

hey, my experience with similar scale issues showed a hybrid approach works. try using an edge cache and dedicated reverse proxies rather than pure Lambda and API Gateway. it freed up resources and reduced throttling issues for us.

Based on my experience, using a microservices architecture in conjunction with robust caching mechanisms has been effective in scaling API endpoints beyond typical Lambda constraints. I transitioned some high-traffic endpoints to a persistent container service such as AWS Fargate, enabling more predictable resource allocation and performance tuning. In parallel, I implemented a custom caching layer to intercept common queries, reducing the load on the compute layer. Although this setup requires additional orchestration and monitoring, the gains in throughput and latency have proven well worth the effort in environments handling high requests per second.
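As a rough sketch of the "caching layer that intercepts common queries" idea: a read-through cache with a TTL sits in front of the backend and only calls it on a miss or expiry. Everything here (class name, the 30-second TTL, the endpoint key) is an assumption for illustration; a shared cache like Redis or ElastiCache would replace the in-process dict at real scale.

```python
import time

class TTLCache:
    """Read-through cache: serve repeated queries from memory until the
    entry expires, and only hit the backend on a miss."""

    def __init__(self, ttl_seconds: float = 30.0):
        self.ttl = ttl_seconds
        self.store = {}   # key -> (value, expiry timestamp)
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        now = time.monotonic()
        entry = self.store.get(key)
        if entry is not None and entry[1] > now:
            self.hits += 1
            return entry[0]
        # Miss or expired: call the expensive backend and cache the result.
        self.misses += 1
        value = compute(key)
        self.store[key] = (value, now + self.ttl)
        return value

cache = TTLCache(ttl_seconds=30)
backend_calls = 0

def expensive_backend(key):
    global backend_calls
    backend_calls += 1          # counts how often compute actually runs
    return f"result-for-{key}"

for _ in range(1000):
    cache.get_or_compute("/v1/popular-endpoint", expensive_backend)
```

For a hot endpoint, nearly all of the simulated 1,000 requests are absorbed by the cache, which is exactly the effect that lets the compute layer stay far below the raw request rate.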