I’m trying to sync data from MySQL to BigQuery without using CDC tools like Debezium or Datastream, as they are quite pricey for my needs.
In my MySQL database, there’s a table that holds session-level data, including various status columns such as processing_stage, file_location, event_end_timestamp, time_spent, and others. The challenge is that different backend services update different subsets of these columns at different times.
For example:
Service A might change file_type and file_location
Service B may update event_end_timestamp and duration_of_activity
Service C could update post_process_status
Has anyone dealt with something like this? How do you reliably capture these scattered updates and get them into BigQuery without full CDC?
I would greatly appreciate any thoughts or examples you could share!
We solved this with row checksums instead of timestamps. We added a hash column that gets recomputed from all the relevant data fields whenever any service updates the row, then run periodic jobs comparing checksums between MySQL and our last BigQuery snapshot to catch changed records. This works even when services forget to update timestamps properly. Main advantage? You can batch these comparisons efficiently, and it’s bulletproof against services that skip timestamp updates. We’re processing about 50k records hourly this way with minimal overhead, since you’re mostly just comparing hashes.
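In case it helps, here’s a minimal sketch of that setup. It assumes a MySQL 5.7+ generated column so the hash is recomputed automatically on every write, plus a BigQuery snapshot table holding the hashes from the previous sync; the `sessions` table, `dataset.sessions_snapshot`, and the column list are placeholders, not our actual schema:

```python
# Sketch of the checksum comparison job. Assumes the mysql-connector-python
# and google-cloud-bigquery packages; all table/column names are placeholders.
import mysql.connector
from google.cloud import bigquery

# One-time DDL: MySQL itself recomputes the hash on every write, so it
# stays correct even when a service forgets to touch a timestamp.
#
#   ALTER TABLE sessions
#     ADD COLUMN row_hash CHAR(64)
#     GENERATED ALWAYS AS (SHA2(CONCAT_WS('|',
#       processing_stage, file_location,
#       event_end_timestamp, time_spent), 256)) STORED;

def find_changed_ids():
    """Return ids whose current MySQL hash differs from the last BigQuery snapshot."""
    bq = bigquery.Client()
    snapshot = {
        row["id"]: row["row_hash"]
        for row in bq.query(
            "SELECT id, row_hash FROM dataset.sessions_snapshot").result()
    }

    conn = mysql.connector.connect(
        host="localhost", user="sync", password="...", database="app")
    cur = conn.cursor()
    cur.execute("SELECT id, row_hash FROM sessions")
    # A row counts as "changed" if it is new or its hash no longer matches.
    changed = [row_id for row_id, row_hash in cur
               if snapshot.get(row_id) != row_hash]
    conn.close()
    return changed
```

The changed ids then drive a normal batch load; since both sides only ship ids and 64-char hashes, the comparison stays cheap even on large tables.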
You could also parse MySQL’s binlog directly without setting up full CDC. I wrote a Python script that reads binlog files and pulls row changes for specific tables. Way cheaper than Datastream, and it still gives you real-time updates. Just make sure you handle binlog rotation and set up failover in case things break.
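For anyone going this route, here’s a minimal sketch using the python-mysql-replication package, which is one common way to read the binlog from Python. The connection settings and table name are placeholders, and the server needs binlog_format=ROW:

```python
# Sketch of a binlog tailer with python-mysql-replication (pymysqlreplication).
# Connection details and the table name are placeholders.
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import UpdateRowsEvent, WriteRowsEvent

stream = BinLogStreamReader(
    connection_settings={"host": "localhost", "port": 3306,
                         "user": "repl", "passwd": "..."},
    server_id=100,                       # must be unique among replicas
    only_events=[WriteRowsEvent, UpdateRowsEvent],
    only_tables=["sessions"],            # limit to the table you sync
    resume_stream=True,                  # continue from the saved position
    blocking=True,                       # wait for new events
)

for event in stream:
    for row in event.rows:
        # UpdateRowsEvent rows carry "before_values"/"after_values";
        # WriteRowsEvent rows carry "values".
        values = row.get("after_values", row.get("values"))
        print(values)                    # buffer and batch-load into BigQuery here
```

You’d still need to persist the log position between restarts and handle rotation, which is exactly the failover bit the comment above warns about.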
I’ve handled this exact problem before. Here’s what worked for me: add an updated_at timestamp column to your MySQL table with ON UPDATE CURRENT_TIMESTAMP, so it auto-updates whenever any column changes. Run a scheduled job that grabs all rows where updated_at is newer than your last sync timestamp, then batch-loads them into BigQuery. Use BigQuery’s MERGE statement for proper upserts. I keep a simple sync state table in MySQL to track the last successful sync timestamp. This setup’s been rock solid for months and costs way less than CDC solutions. Just make sure you get the initial schema right, and that no service sets updated_at explicitly in its UPDATE statements, since that bypasses the automatic update.
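A rough sketch of that job, assuming mysql-connector-python and google-cloud-bigquery; the sync_state table, dataset names, and MERGE column list are placeholders rather than a drop-in implementation:

```python
# Sketch of the watermark-based sync job. Assumes the DDL below has been
# applied and that dataset.sessions_staging exists in BigQuery.
import mysql.connector
from google.cloud import bigquery

# One-time DDL (MySQL maintains the timestamp on every update):
#
#   ALTER TABLE sessions
#     ADD COLUMN updated_at TIMESTAMP
#     DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;

MERGE_SQL = """
MERGE dataset.sessions AS t
USING dataset.sessions_staging AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET
  t.processing_stage = s.processing_stage,
  t.file_location   = s.file_location,
  t.updated_at      = s.updated_at
WHEN NOT MATCHED THEN INSERT ROW
"""

def sync_once():
    conn = mysql.connector.connect(
        host="localhost", user="sync", password="...", database="app")
    cur = conn.cursor(dictionary=True)
    cur.execute("SELECT last_sync FROM sync_state WHERE job = 'sessions'")
    last_sync = cur.fetchone()["last_sync"]

    cur.execute("SELECT * FROM sessions WHERE updated_at > %s", (last_sync,))
    rows = cur.fetchall()
    if rows:
        new_watermark = max(r["updated_at"] for r in rows)
        for r in rows:  # datetimes must be JSON-serializable for the load
            for k, v in r.items():
                if hasattr(v, "isoformat"):
                    r[k] = v.isoformat()
        bq = bigquery.Client()
        # Replace the staging table with this batch, then upsert into the main table.
        job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE")
        bq.load_table_from_json(rows, "dataset.sessions_staging",
                                job_config=job_config).result()
        bq.query(MERGE_SQL).result()
        cur.execute("UPDATE sync_state SET last_sync = %s WHERE job = 'sessions'",
                    (new_watermark,))
        conn.commit()
    conn.close()
```

Advancing the watermark to the max updated_at of the loaded batch (rather than the job’s start time) avoids losing rows that were updated while the job ran.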
I encountered a similar challenge syncing MySQL to BigQuery without CDC tools. What proved effective was an updated_at timestamp on each record, modified whenever any service updates a row. A scheduled job runs every 15 minutes and pulls records whose updated_at is newer than the last synchronization point. Fetching entire rows instead of tracking individual column changes keeps the job simple. We also use MERGE operations in BigQuery for efficient upserts on primary keys, which handles both new records and updates the same way. One crucial thing to watch out for is clock skew among services: adding a minor buffer to the timestamp checks, as in the sketch below, prevents missed updates.
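To make the buffer concrete, here’s roughly what the fetch looks like; the 2-minute value is illustrative, and the MERGE upsert makes reprocessing the overlap harmless:

```python
# Sketch of the clock-skew buffer: re-read a small window before the saved
# watermark so slightly-late timestamps aren't missed. The MERGE upsert
# makes reprocessing the overlapping rows idempotent.
from datetime import timedelta

SKEW_BUFFER = timedelta(minutes=2)  # illustrative; size it to your worst observed skew

def fetch_changed_rows(cur, last_sync):
    cur.execute(
        "SELECT * FROM sessions WHERE updated_at > %s",
        (last_sync - SKEW_BUFFER,),
    )
    return cur.fetchall()
```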