I keep running into a rate_limit_exceeded error when calling my Azure OpenAI GPT deployment, and it has been happening for several days now. The strange part is that the deployment has plenty of quota available: roughly 100 TPM (tokens per minute) allocated, while my current usage is under 10 TPM.
The error message says to try again in 86400 seconds (24 hours), which seems excessive given my low usage. Has anyone else seen the rate limiting not match the actual quota consumption like this?
Here’s my implementation:
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import CodeInterpreterTool

# Initialize the project client
connection_string = "my_connection_string"
client = AIProjectClient.from_connection_string(
    conn_str=connection_string,
    credential=DefaultAzureCredential(),
)

print("Starting AI assistant test")

with client:
    # Set up the code interpreter tool
    interpreter_tool = CodeInterpreterTool()

    # Create the assistant against my gpt-4o-mini deployment
    assistant = client.agents.create_agent(
        model="gpt-4o-mini-deployment",
        name="data-assistant",
        instructions="You are a helpful data analysis assistant",
        tools=interpreter_tool.definitions,
        tool_resources=interpreter_tool.resources,
    )
    print(f"Assistant created with ID: {assistant.id}")

    # Start a conversation thread
    conversation = client.agents.create_thread()
    print(f"Thread created with ID: {conversation.id}")

    # Send the user message
    user_message = client.agents.create_message(
        thread_id=conversation.id,
        role="user",
        content="Please generate a pie chart showing sales data: Product X: $800k, Product Y: $1.2M, Product Z: $600k, Product W: $1.5M",
    )
    print(f"Message sent with ID: {user_message.id}")

    # Execute the request
    execution = client.agents.create_and_process_run(
        thread_id=conversation.id,
        assistant_id=assistant.id,
    )
    print(f"Execution completed with status: {execution.status}")

    if execution.status == "failed":
        print(f"Execution failed: {execution.last_error}")

    # Retrieve all messages on the thread
    responses = client.agents.list_messages(thread_id=conversation.id)
    print(f"Responses: {responses}")

    # Clean up
    client.agents.delete_agent(assistant.id)
    print("Assistant deleted")
The error output shows:
Execution completed with status: RunStatus.FAILED
Execution failed: {'code': 'rate_limit_exceeded', 'message': 'Rate limit is exceeded. Try again in 86400 seconds.'}
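As a stopgap I considered simply retrying the run with a backoff, since the agents client reports this failure through run.last_error rather than raising an exception. This is only a rough sketch of that idea (run_with_retry and the wait times are placeholders I made up, not anything from the SDK):

import time

def run_with_retry(client, thread_id, assistant_id, max_attempts=3, base_wait=60):
    # Hypothetical workaround, not an SDK feature: retry when the run reports
    # rate_limit_exceeded via last_error (create_and_process_run does not raise here).
    for attempt in range(1, max_attempts + 1):
        run = client.agents.create_and_process_run(
            thread_id=thread_id,
            assistant_id=assistant_id,
        )
        rate_limited = (
            run.status == "failed"
            and run.last_error is not None
            and run.last_error.code == "rate_limit_exceeded"
        )
        if not rate_limited or attempt == max_attempts:
            return run  # success, a different failure, or out of attempts
        wait = base_wait * attempt  # simple linear backoff; the values are guesses
        print(f"Rate limited (attempt {attempt}), retrying in {wait}s")
        time.sleep(wait)

In practice, though, a 24-hour retry window makes any client-side backoff pointless, so I'd rather understand the root cause.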
My quota shows 100 TPM available with current usage under 10 TPM. Why would this rate limiting occur when I’m nowhere near my limits?
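For what it's worth, this is roughly how I planned to double-check the deployment's configured capacity programmatically. It's only a sketch using the azure-mgmt-cognitiveservices management SDK; the subscription, resource group, and account names are placeholders for my real values, and my understanding (which may be wrong) is that SKU capacity for Azure OpenAI deployments is expressed in units of 1K TPM:

from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient

# Placeholders for my actual resource identifiers
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
account_name = "<azure-openai-account>"
deployment_name = "gpt-4o-mini-deployment"

mgmt_client = CognitiveServicesManagementClient(
    DefaultAzureCredential(), subscription_id
)

# Read the deployment's SKU, which is what the portal quota view reflects
deployment = mgmt_client.deployments.get(resource_group, account_name, deployment_name)
print(f"SKU: {deployment.sku.name}, capacity: {deployment.sku.capacity}")
# Some API versions also expose per-deployment rate limits; guard in case this one doesn't
print(getattr(deployment.properties, "rate_limits", "rate_limits not exposed"))

Any pointers on how agent runs are actually metered against the deployment quota would be appreciated.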