Best practices for handling rate limits when making multiple API calls to AI services

I’m building a small project that needs to make several requests to an AI API service. My script processes a bunch of items from a JSON file and sends each one to the API for processing. Here’s what I have so far:

<?php
require_once 'vendor/autoload.php';
use GuzzleHttp\Client;

$token = 'YOUR_API_TOKEN';
$endpoint = 'https://api.example-ai.com/v1/completions';

// Paths to the items JSON file and the prompt/instructions file (placeholders).
$dataFilePath = 'items.json';
$instructionsPath = 'instructions.txt';

$dataFile = file_get_contents($dataFilePath);
$instructions = file_get_contents($instructionsPath);
$items = json_decode($dataFile, true);

function makeApiCall($message, $prompt, $token, $endpoint) {
    // Creating a new Client per call works, though reusing one instance would be cheaper.
    $http = new Client([
        'base_uri' => $endpoint,
        'headers' => [
            'Content-Type' => 'application/json',
            'Authorization' => 'Bearer ' . $token,
        ],
    ]);

    $payload = [
        'messages' => [
            // System prompt first, then the item to process as the user message.
            ['role' => 'system', 'content' => $prompt],
            ['role' => 'user', 'content' => $message],
        ],
        'model' => 'llama-7b-chat',
        'temperature' => 0.8,
        'max_tokens' => 512,
        'top_p' => 0.9,
        'stream' => false,
    ];

    $result = $http->post('', ['json' => $payload]);
    return json_decode($result->getBody(), true);
}

foreach ($items as $entry) {
    if (isset($entry['uuid'])) {
        $content = json_encode($entry);
        $response = makeApiCall($content, $instructions, $token, $endpoint);
        $output = $response['choices'][0]['message']['content'];
        echo $output . "\n";
        sleep(3);
    }
}
?>

The problem is I keep getting rate limit errors even with the sleep delay. The API throws a 429 status code saying too many requests. What’s the right way to handle this kind of batch processing without hitting the limits? Should I implement some kind of backoff strategy or queue system?

honestly just bumping the sleep to 5-7 seconds might fix it without getting too fancy. some AI APIs are stricter than others, and 3-second gaps don't always cut it, especially if you're processing a lot of items back to back.

Three seconds might not be enough depending on your API provider’s specific limits. I ran into similar issues and found that implementing exponential backoff made a huge difference. Instead of fixed sleep intervals, catch the 429 response and gradually increase your wait time - start with 5 seconds, then 10, 20, etc. Also check if your API has burst limits versus sustained rate limits, as some providers allow short bursts but enforce stricter long-term quotas. Adding retry logic with proper exception handling will make your script much more robust. You might also want to consider processing items in smaller batches rather than one continuous loop, which gives the API breathing room between chunks.
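
Here's a rough sketch of that backoff idea in PHP, reusing the makeApiCall() function and Guzzle setup from the question. The makeApiCallWithBackoff() wrapper name, the starting delay, and the retry cap are illustrative choices, not anything your provider mandates:

<?php
use GuzzleHttp\Exception\ClientException;

// Sketch: retry makeApiCall() with exponential backoff when the API returns 429.
// The delays (5s -> 10s -> 20s -> 40s) and the retry cap are illustrative values.
function makeApiCallWithBackoff($message, $prompt, $token, $endpoint, $maxRetries = 4) {
    $delay = 5;
    for ($attempt = 0; $attempt <= $maxRetries; $attempt++) {
        try {
            return makeApiCall($message, $prompt, $token, $endpoint);
        } catch (ClientException $e) {
            $response = $e->getResponse();
            // Only retry on 429; rethrow anything else (400, 401, ...) or when retries run out.
            if ($response->getStatusCode() !== 429 || $attempt === $maxRetries) {
                throw $e;
            }
            // Prefer the Retry-After header when the provider sends one.
            $retryAfter = $response->getHeaderLine('Retry-After');
            sleep($retryAfter !== '' ? (int) $retryAfter : $delay);
            $delay *= 2;
        }
    }
}

In the main loop you'd call makeApiCallWithBackoff() instead of makeApiCall(), and could then shrink or drop the fixed sleep(3).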

Queue-based processing solved this exact problem for me when working with OpenAI’s API. Instead of processing everything sequentially, I implemented a simple queue system using Redis that processes requests at a controlled rate. Set up a worker that pulls from the queue every 2-3 seconds based on your tier limits, and have your main script just push items into the queue. This way you can handle hundreds of items without worrying about hitting rate limits, and if something fails you don’t lose your place in the batch. The added benefit is you can run multiple queue workers if you upgrade your API plan later. Also worth mentioning that some providers offer batch endpoints specifically for this use case - check if your AI service has bulk processing options before building complex retry logic.
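
For what it's worth, here's a stripped-down version of that producer/worker split using the phpredis extension. The 'ai_jobs' list name, the localhost connection, and the 3-second pacing are assumptions for the sketch, and makeApiCall(), $items, $instructions, $token, and $endpoint come from the question's script:

<?php
// Producer side: push items onto a Redis list instead of calling the API directly.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

foreach ($items as $entry) {
    if (isset($entry['uuid'])) {
        $redis->rPush('ai_jobs', json_encode($entry));
    }
}

// Worker side (normally a separate long-running script): pull jobs at a controlled rate.
while (true) {
    $job = $redis->blPop(['ai_jobs'], 10); // blocks up to 10s; returns [key, value] or an empty result
    if (empty($job)) {
        continue; // nothing queued right now
    }
    $response = makeApiCall($job[1], $instructions, $token, $endpoint);
    echo $response['choices'][0]['message']['content'] . "\n";
    sleep(3); // pace to whatever your tier allows; run more workers if your plan permits
}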

Your sleep duration is probably too conservative for sustained processing. Most AI APIs have different rate limits for different subscription tiers, so first verify what your actual limits are through the documentation or support. I’ve found that checking the response headers for X-RateLimit-Remaining and X-RateLimit-Reset gives you precise timing information rather than guessing with fixed delays. Consider implementing a token bucket approach where you track your remaining quota and adjust timing dynamically. Another effective strategy is to process your JSON items in parallel using multiple workers with staggered start times, which can actually improve throughput while staying within limits. The key is monitoring your actual usage patterns rather than applying blanket delays that might be unnecessarily slow.
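
As a sketch of the header-driven approach, you could pause only when the quota is actually exhausted. The X-RateLimit-Remaining / X-RateLimit-Reset names and the meaning of the reset value vary by provider, so treat them as assumptions and check your docs; makeApiCall() from the question would also need to return or expose the raw Guzzle response for this to work:

<?php
use Psr\Http\Message\ResponseInterface;

// Sketch: derive the pause from the provider's rate-limit headers instead of a fixed sleep.
// Header names and the reset format (assumed here to be a Unix timestamp) differ by provider.
function pauseFromHeaders(ResponseInterface $result): void {
    $remaining = $result->getHeaderLine('X-RateLimit-Remaining');
    $reset     = $result->getHeaderLine('X-RateLimit-Reset');

    if ($remaining !== '' && (int) $remaining === 0 && $reset !== '') {
        // Quota for the current window is used up: wait until it resets.
        sleep(max(0, (int) $reset - time()));
    }
    // Otherwise there is quota left, so no extra delay is needed.
}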

yeah i'd definitely add some error handling around that api call first. wrap it in try/catch and actually check for the 429 status before deciding what to do next. maybe also log the response headers - most apis tell you exactly when you can retry in the rate limit headers.
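
something like this inside the foreach loop, just to illustrate (error_log() is a stand-in for whatever logging you actually use):

<?php
use GuzzleHttp\Exception\ClientException;

try {
    $response = makeApiCall($content, $instructions, $token, $endpoint);
    echo $response['choices'][0]['message']['content'] . "\n";
} catch (ClientException $e) {
    $status = $e->getResponse()->getStatusCode();
    if ($status === 429) {
        // Most providers say when it is safe to retry in a header like Retry-After.
        error_log('Rate limited; Retry-After: ' . $e->getResponse()->getHeaderLine('Retry-After'));
    } else {
        error_log('API call failed with status ' . $status);
    }
}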