Azure OpenAI throws rate limit error despite being well under TPM quota

I keep running into a rate_limit_exceeded error when using Azure OpenAI with my GPT model, and it has been happening for several days now. The strange thing is that my deployment has plenty of quota available: around 100 TPM (tokens per minute) allowed, while my current usage is under 10 TPM.

The error message says to try again in 86400 seconds (24 hours), which seems excessive given my low usage. Has anyone else seen the rate limiter behave this far out of line with actual quota consumption?

Here’s my implementation:

from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import CodeInterpreterTool
import os

# Initialize connection (connection string kept out of source code)
connection_string = os.environ["PROJECT_CONNECTION_STRING"]
client = AIProjectClient.from_connection_string(
    conn_str=connection_string,
    credential=DefaultAzureCredential()
)

print("Starting AI assistant test")

with client:
    # Setup code interpreter tool
    interpreter_tool = CodeInterpreterTool()
    
    # Create assistant
    assistant = client.agents.create_agent(
        model="gpt-4o-mini-deployment",
        name="data-assistant",
        instructions="You are a helpful data analysis assistant",
        tools=interpreter_tool.definitions,
        tool_resources=interpreter_tool.resources,
    )
    print(f"Assistant created with ID: {assistant.id}")
    
    # Start conversation thread
    conversation = client.agents.create_thread()
    print(f"Thread created with ID: {conversation.id}")
    
    # Send user message
    user_message = client.agents.create_message(
        thread_id=conversation.id,
        role="user",
        content="Please generate a pie chart showing sales data: Product X: $800k, Product Y: $1.2M, Product Z: $600k, Product W: $1.5M",
    )
    print(f"Message sent with ID: {user_message.id}")
    
    # Execute the request
    execution = client.agents.create_and_process_run(
        thread_id=conversation.id, 
        assistant_id=assistant.id
    )
    print(f"Execution completed with status: {execution.status}")
    
    if execution.status == "failed":
        print(f"Execution failed: {execution.last_error}")
    
    # Retrieve responses
    responses = client.agents.list_messages(thread_id=conversation.id)
    print(f"Responses: {responses}")
    
    # Clean up
    client.agents.delete_agent(assistant.id)
    print("Assistant deleted")

The error output shows:

Execution completed with status: RunStatus.FAILED
Execution failed: {'code': 'rate_limit_exceeded', 'message': 'Rate limit is exceeded. Try again in 86400 seconds.'}

My quota shows 100 TPM available with current usage under 10 TPM. Why would this rate limiting occur when I’m nowhere near my limits?

This is a known Azure OpenAI gotcha: rate limiting isn’t only about TPM. It also covers requests per minute, concurrent requests, and regional capacity. I’ve hit the same problem with the agents API before. An 86400-second retry window means you’ve exhausted a daily quota, not the per-minute limit you’re watching.

The agents API with the code interpreter tool consumes far more resources than plain chat completions because it runs code execution behind the scenes. Your visible token usage looks low, but the code interpreter makes multiple internal calls that don’t show up in your quota dashboard.

A few things to try: switch regions if you can; call the regular chat completions API instead of the agents API to see whether the problem follows the deployment; and check whether other apps or team members share the same deployment, since they may be burning through your quota too.
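To test the chat-completions fallback in isolation, here’s a minimal sketch. The backoff helper is generic and runnable as-is; the Azure call beneath it is only illustrative, and the environment variable names, API version, and deployment name are assumptions about your setup, not values from the original post.

```python
import os
import random
import time


def with_backoff(fn, *, retries=5, base_delay=1.0, retry_on=(Exception,)):
    """Call fn(), retrying with exponential backoff plus jitter on the given exceptions."""
    for attempt in range(retries):
        try:
            return fn()
        except retry_on:
            if attempt == retries - 1:
                raise  # out of retries; surface the original error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)


# Hypothetical usage against the plain chat completions API
# (endpoint/key env vars and deployment name are assumptions):
#
# from openai import AzureOpenAI, RateLimitError
#
# aoai = AzureOpenAI(
#     azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
#     api_key=os.environ["AZURE_OPENAI_API_KEY"],
#     api_version="2024-06-01",
# )
# reply = with_backoff(
#     lambda: aoai.chat.completions.create(
#         model="gpt-4o-mini-deployment",
#         messages=[{"role": "user", "content": "ping"}],
#     ),
#     retry_on=(RateLimitError,),
# )
```

If the bare chat completions call succeeds under the same deployment while the agent run still fails, that points at the extra internal calls the code interpreter makes rather than at your own token usage.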