I set up a new Azure AI project through the portal and created all the required resources from scratch. Then I deployed a gpt-4o-mini model using the standard settings.
My deployment details are:
Type: Global Standard
Token limit: 8,000 per minute
Request limit: 80 per minute
Status: Successfully deployed
When I check the usage metrics for this model, I can see that I’m nowhere near hitting these limits. The actual usage is very low compared to what’s allowed.
Next, I created an Agent using the Assistants API in the Azure AI Agent Service preview feature. When I try to test this agent in the playground by starting a new conversation thread and sending any message, I get a rate limit error.
This is confusing because the same model works fine when I test it directly in its own playground without any rate limiting problems. The error only happens when using the Agent Service.
Why would the Agent Service show rate limit errors when the underlying model has plenty of capacity available? Has anyone else run into this issue with the Agent Service preview?
This issue typically stems from the Agent Service applying its own throttling, separate from direct model access. The preview runs on shared infrastructure that can bottleneck even when your individual deployment shows plenty of available capacity. I ran into this several times during my own testing.

The Agent Service makes multiple backend calls for each user interaction: thread creation, message processing, and response generation each count separately against internal quotas that aren't visible in your deployment metrics.

One thing that helped me was reading the specific error message details. Sometimes it isn't your model deployment hitting limits but the Agent Service's own API endpoints. The service is still in preview, so Microsoft is likely being conservative with resource allocation.

Try testing during off-peak hours to see whether the issue persists. If it does, consider upgrading your deployment tier or requesting a quota increase through Azure support. The Standard tier can be restrictive for agent workloads, since they're more resource-intensive than simple chat completions.
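If you're calling the agent from code rather than the playground, wrapping the call in a retry helper makes it easier to surface the actual throttling error details instead of a generic banner. A minimal, SDK-agnostic sketch (the `with_backoff` helper and `is_throttled` predicate are illustrative, not part of any Azure SDK; plug in your own call and your SDK's rate-limit exception type):

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0, is_throttled=None):
    """Call fn(), retrying with exponential backoff when it raises a
    throttling error. is_throttled decides which exceptions count as
    throttling; by default every exception is retried (sketch only)."""
    if is_throttled is None:
        is_throttled = lambda exc: True
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:
            if not is_throttled(exc) or attempt == max_retries - 1:
                # Out of retries (or not a throttle): re-raise so the
                # full error body, including any quota details, is visible.
                raise
            # Exponential backoff with a little jitter: 1s, 2s, 4s, ...
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

Logging the exception before retrying will tell you whether the 429 names your model deployment or an Agent Service endpoint.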
Microsoft probably has hidden quotas for the Agent Service that aren't shown anywhere. I've seen this with other Azure preview features where they throttle behind the scenes. Try deleting and recreating your agent - sometimes it gets stuck on a bad server node or something.
I ran into something similar when testing the Agent Service last month. The issue appears to be that Microsoft applies separate throttling controls at the Agent Service level that don't reflect your actual model deployment limits. During preview, they're using conservative rate limiting to manage overall system load.

In my case, the problem was intermittent and seemed tied to peak usage times. The Agent Service performs multiple internal operations for each conversation: it's not just calling your model once, but also managing thread state, processing instructions, and handling tool calls if configured.

One workaround that helped was adding a small delay between test messages in the playground; the service seems particularly sensitive to rapid consecutive requests. I also noticed that longer conversations hit limits more often, possibly due to context-processing overhead.

Microsoft acknowledged this limitation in a documentation update a few weeks ago, noting that the production release will have more predictable rate limiting aligned with your actual deployment quotas. For now, patience with the preview restrictions seems to be the main option, unfortunately.
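The "small delay between messages" workaround can be made systematic when you drive the agent from a script. A minimal client-side pacer sketch (the `Pacer` class name and the interval value are illustrative, not from any Azure SDK; tune `min_interval_s` to whatever spacing avoids the throttle for you):

```python
import time

class Pacer:
    """Enforce a minimum interval between calls, e.g. between
    consecutive messages sent to an agent thread."""

    def __init__(self, min_interval_s=2.0):
        self.min_interval_s = min_interval_s
        self._last = 0.0

    def wait(self):
        # Sleep only for whatever remains of the minimum interval
        # since the previous call, then record the new timestamp.
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval_s:
            time.sleep(self.min_interval_s - elapsed)
        self._last = time.monotonic()
```

Calling `pacer.wait()` before each message send spaces requests evenly, which matches the observation that rapid consecutive requests are what trip the preview limits.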
I’ve been dealing with Azure AI services for a while now and this is actually a pretty common issue with preview features.
The Agent Service has its own separate rate limiting layer that sits on top of your model deployment. Even though your GPT-4o-mini shows plenty of capacity, the Agent Service itself has much stricter limits during preview.
I ran into this exact problem about a month ago. What fixed it for me was switching my deployment from Global Standard to a specific region like East US or West Europe. The global deployments seem to have shared rate limiting pools that get exhausted faster.
Also check if you have other agents or applications hitting the same service. The Agent Service aggregates usage across all your agents, not just the one you’re testing.
If switching regions doesn’t work, try creating a new deployment specifically for the agent with higher limits. I usually bump up to at least 150 requests per minute for agent deployments since they tend to make multiple API calls per user interaction.
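If you go the new-deployment route, capacity can be set at creation time with the Azure CLI. A hedged sketch using `az cognitiveservices account deployment create` (the resource names, model version, and capacity value below are placeholders - verify the model version available in your region; for Azure OpenAI deployments, `--sku-capacity` is generally expressed in units of 1,000 tokens per minute, with the request-per-minute limit scaling alongside it):

```shell
# Create a dedicated, higher-capacity deployment for agent workloads.
# All names and values are placeholders - adjust to your subscription.
az cognitiveservices account deployment create \
  --resource-group my-rg \
  --name my-aoai-resource \
  --deployment-name gpt-4o-mini-agents \
  --model-name gpt-4o-mini \
  --model-format OpenAI \
  --model-version "2024-07-18" \
  --sku-name GlobalStandard \
  --sku-capacity 15
```

Keeping the agent on its own deployment also isolates its usage from any other applications sharing the original one.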
Sounds like a known bug with the Agent Service preview, honestly. I had similar issues last week where the agent would throw rate limit errors even though my deployment was barely used. Try recreating the agent or switching to a different region if possible - that worked for me.