I’m working on integrating Mailgun event tracking into our system. The official documentation suggests an event polling method, but I’m not happy with their recommended approach.
The main issues I see are:
- Throwing away already fetched data when retries happen seems wasteful
- No clear guidance on when to stop retrying
- The suggested 30-minute threshold seems too conservative
I think there’s a better way to handle this. My idea is to use a different strategy: set time boundaries so each poll only fetches events that are old enough to be reliably available.
My proposed approach:
import time

def fetch_events(last_timestamp, reliability_delay):
    # Only ask for events old enough to be considered stable by Mailgun.
    start_time = last_timestamp
    end_time = time.time() - reliability_delay

    # mailgun_client is our wrapper around Mailgun's Events API.
    response = mailgun_client.get_events(
        begin=start_time,
        end=end_time,
        ascending=True,
    )

    # Walk every page, including the last one; extending only while
    # has_next_page() is true would silently drop the final page.
    all_events = []
    while True:
        all_events.extend(response.events)
        if not response.has_next_page():
            break
        response = response.next()

    return all_events, end_time
This way I can:
- Set time boundaries that only fetch reliable events
- Keep all fetched data instead of discarding it
- Use the end timestamp as the starting point for the next iteration
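For context, the surrounding loop would look roughly like this (load_checkpoint, save_checkpoint, and process_event are placeholders for our persistence and downstream handling, and the interval values are just guesses):

import time

POLL_INTERVAL = 60            # seconds between polls (placeholder)
RELIABILITY_DELAY = 30 * 60   # Mailgun's suggested threshold, for now

def poll_forever():
    # load_checkpoint() stands in for however we persist the last end_time.
    last_timestamp = load_checkpoint()
    while True:
        events, end_time = fetch_events(last_timestamp, RELIABILITY_DELAY)
        for event in events:
            process_event(event)      # downstream handling, not shown
        save_checkpoint(end_time)     # next iteration starts from here
        last_timestamp = end_time
        time.sleep(POLL_INTERVAL)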
Questions:
- Does this approach have any obvious flaws?
- What’s the shortest reliable delay I can use instead of 30 minutes? I want faster event processing if possible.
Any insights would be helpful!
Honestly, your approach beats Mailgun’s default polling setup by miles. I’ve run something similar for 2 years - works great. One heads up though: don’t trust their timestamps completely. We’ve seen events show up with timestamps that don’t match when they actually happened. Build in a small buffer when you set those time boundaries or you’ll miss events during heavy traffic. The 5-6 minute delay works for most situations, but test it with your real traffic first.
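Roughly what I mean by the buffer, layered on top of your fetch_events (the overlap value is an assumption - measure your own traffic before settling on it):

OVERLAP_SECONDS = 60  # assumed buffer; tune against your real traffic

def fetch_events_with_buffer(last_timestamp, reliability_delay):
    # Re-fetch a small window before the last checkpoint so events whose
    # timestamps lag their actual occurrence aren't skipped.
    return fetch_events(last_timestamp - OVERLAP_SECONDS, reliability_delay)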
I’ve been running a similar Mailgun setup for 18 months - here’s what I’ve learned. Your approach looks solid, but a few things to consider:

For the reliability delay, 5-10 minutes usually works fine instead of 30. It depends on your volume and how critical timing is. High-volume senders see events stabilize in 2-3 minutes, but I’ve seen stragglers show up after 15 minutes during Mailgun maintenance.

Watch out for clock drift between your system and Mailgun’s servers with those time boundaries. I got burned by this when we started missing events due to timestamp mismatches. Add a 30-second overlap buffer to your start_time to catch these cases.

Use exponential backoff for API failures instead of fixed intervals (rough sketch at the end of this reply). Mailgun’s rate limiting gets weird during peak hours, and hammering them during outages will just get you blocked.

Pagination handling looks good, but store those pagination URLs as checkpoints too. You’ll want them if you need to resume mid-fetch after a restart.
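The backoff looks roughly like this - attempt counts and base delay are assumptions, tune them for your volume:

import random
import time

def fetch_with_backoff(fetch_fn, max_attempts=5, base_delay=2.0):
    # Retry transient API failures with exponentially growing waits plus
    # jitter instead of hammering Mailgun at a fixed interval.
    for attempt in range(max_attempts):
        try:
            return fetch_fn()
        except Exception:  # narrow this to your client's transient errors
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))

Usage would be something like events, end_time = fetch_with_backoff(lambda: fetch_events(last_ts, delay)).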
Been handling Mailgun events for 3 years and your approach beats their standard polling method. The time boundary strategy makes sense and will save headaches later.
For the reliability delay - 7-8 minutes works well. We started at 15 minutes but cut it down after watching event patterns. Don’t guess - measure your actual event latency. Most events show up in 2-3 minutes, but network issues or Mailgun’s processing delays can push some to 6-7 minutes.
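Measuring it is cheap - something like this run against each fetched batch tells you how far behind the event timestamps you actually are (assuming the epoch timestamp field Mailgun's Events API returns):

import time

def batch_latencies(events, now=None):
    # Seconds between each event's own timestamp and when we fetched it;
    # the max over a few days of traffic is a sane floor for your delay.
    now = now or time.time()
    return [now - event["timestamp"] for event in events]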
Add duplicate detection on your end. Even with proper time boundaries, you’ll occasionally see the same event twice because of Mailgun’s eventual consistency model. We hash the event ID and timestamp to catch duplicates before processing.
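Sketch of the dedup check - it assumes the id and timestamp fields from Mailgun's event JSON, with an in-memory set standing in for whatever store you actually use:

import hashlib

seen_keys = set()  # replace with a persistent store in production

def is_duplicate(event):
    # Hash the event id plus its timestamp into a compact dedup key.
    key = hashlib.sha256(
        f"{event['id']}:{event['timestamp']}".encode()
    ).hexdigest()
    if key in seen_keys:
        return True
    seen_keys.add(key)
    return False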
Also implement a dead letter queue for events that fail processing. Your approach will reliably fetch events, but you still need to handle downstream processing failures. Better to acknowledge the fetch but retry processing separately rather than re-fetching the entire batch.
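The processing split looks something like this, with a local JSONL file standing in for a real dead letter queue:

import json

def process_batch(events, process_event, dlq_path="mailgun_dlq.jsonl"):
    # Handle each event independently; failures go to the dead letter
    # file for a separate retry pass instead of re-fetching the batch.
    for event in events:
        try:
            process_event(event)
        except Exception as exc:
            with open(dlq_path, "a") as dlq:
                dlq.write(json.dumps({"event": event, "error": str(exc)}) + "\n")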