Hi everyone,
I need help with synchronizing data between MySQL and BigQuery. I can’t use CDC solutions like Debezium or Datastream because they’re too expensive for my project.
My MySQL database has a table that stores session metadata. The table has several status-related columns - processing state, file location, completion timestamp, session duration, and a few others. The problem is that different backend applications update different groups of these columns at different times.
For instance:
- Application X updates location_type and file_location
- Application Y later modifies completion_timestamp and session_duration
- Application Z updates the processing_state field
Has anyone worked with a similar scenario? Any suggestions or code samples would be great!
We went with a hybrid approach - MySQL checksums plus incremental sync windows. Instead of tracking individual column changes, we calculate checksums for different column groups and store them with each record. During sync, we compare current checksums against what’s stored in BigQuery to see which records actually changed. This cuts out unnecessary transfers while catching modifications from any app. The trick is partitioning your columns into logical groups that match how your apps modify data. We run this every 15 minutes and it handles our multi-app setup perfectly - no triggers or binlog parsing needed.
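To make the comparison step concrete, here is a minimal sketch, with a couple of assumptions: the `sessions` table and `session_id` key are invented, the column names are the ones from the question, I'm computing the group checksums on the fly with MD5 over CONCAT_WS rather than storing them on the row, and the BigQuery-side checksum lookup is assumed to already exist.

```python
import pymysql

# Per-row checksums for each logical column group (column names from the question;
# COALESCE keeps NULLs from silently colliding with adjacent values).
GROUP_CHECKSUM_SQL = """
SELECT
    session_id,
    MD5(CONCAT_WS('|', COALESCE(location_type, ''), COALESCE(file_location, '')))           AS file_group_ck,
    MD5(CONCAT_WS('|', COALESCE(completion_timestamp, ''), COALESCE(session_duration, ''))) AS timing_group_ck,
    MD5(COALESCE(processing_state, ''))                                                     AS state_group_ck
FROM sessions
"""

def changed_session_ids(mysql_conn, bq_checksums):
    """Yield session_ids whose group checksums differ from the copies in BigQuery.

    `bq_checksums` maps session_id -> (file_ck, timing_ck, state_ck) and is
    loaded from a small checksum table kept alongside the BigQuery copy.
    """
    with mysql_conn.cursor() as cur:
        cur.execute(GROUP_CHECKSUM_SQL)
        for sid, file_ck, timing_ck, state_ck in cur:
            if bq_checksums.get(sid) != (file_ck, timing_ck, state_ck):
                yield sid
```

Only the ids yielded here need to be re-exported in the 15-minute window, which is where the transfer savings come from.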
Had the same issue at my company. We ended up using MySQL’s binlog position tracking instead of full CDC - way cheaper. You just read binlog events for specific tables and filter by the columns you need. Then batch those changes every few minutes to BigQuery. Takes some custom scripting but saves a ton of money.
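In case it helps, a rough sketch of that pattern with the python-mysql-replication package - the `sessions` table name, column list, credentials, and server_id are placeholders, and persisting the binlog position plus the actual BigQuery flush are left out:

```python
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import UpdateRowsEvent, WriteRowsEvent

MYSQL = {"host": "127.0.0.1", "port": 3306, "user": "repl", "passwd": "secret"}
WANTED = {"session_id", "processing_state", "file_location", "completion_timestamp"}

stream = BinLogStreamReader(
    connection_settings=MYSQL,
    server_id=4242,                        # must be unique among replication clients
    only_tables=["sessions"],              # hypothetical table name
    only_events=[WriteRowsEvent, UpdateRowsEvent],
    resume_stream=True,                    # pick up from a saved binlog position
    blocking=True,
)

batch = []
for event in stream:
    for row in event.rows:
        # Update events carry before/after images; insert events only have "values".
        values = row.get("after_values", row.get("values"))
        batch.append({k: v for k, v in values.items() if k in WANTED})
    # Persist stream.log_file / stream.log_pos here so a restart resumes cleanly,
    # and flush `batch` to BigQuery every few minutes (omitted).
```

The MySQL user needs REPLICATION SLAVE/CLIENT privileges and the server needs row-based binlogging enabled for this to work.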
In my experience this kind of synchronization can indeed be tricky. One approach that has worked well for us is adding a last_updated timestamp column that gets refreshed whenever a relevant field is modified, then running a scheduled job that pulls only the rows changed since the previous sync. Make sure you handle conflicts during synchronization, for example by letting the most recent update win. On the BigQuery side, a MERGE statement keeps records current without creating duplicates, so you get incremental sync without paying for a full CDC solution.
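A compressed sketch of that flow, assuming a `last_updated` column on the MySQL table, made-up BigQuery project/dataset/table names, and a staging table with the same schema as the target:

```python
import datetime
import pymysql
from google.cloud import bigquery

bq = bigquery.Client()
STAGING = "my-project.analytics.sessions_staging"   # hypothetical names
TARGET  = "my-project.analytics.sessions"

def sync(mysql_conn, last_sync_ts: datetime.datetime):
    # 1. Pull only the rows touched since the previous run.
    with mysql_conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute("SELECT * FROM sessions WHERE last_updated > %s", (last_sync_ts,))
        rows = cur.fetchall()
    if not rows:
        return

    # JSON loading needs serializable values, so stringify datetimes.
    rows = [{k: v.isoformat() if isinstance(v, datetime.datetime) else v
             for k, v in r.items()} for r in rows]

    # 2. Land the changed rows in a staging table.
    bq.load_table_from_json(
        rows, STAGING,
        job_config=bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE"),
    ).result()

    # 3. Upsert into the main table; the most recent update wins.
    bq.query(f"""
        MERGE `{TARGET}` t
        USING `{STAGING}` s
        ON t.session_id = s.session_id
        WHEN MATCHED AND s.last_updated >= t.last_updated THEN UPDATE SET
            processing_state     = s.processing_state,
            file_location        = s.file_location,
            completion_timestamp = s.completion_timestamp,
            last_updated         = s.last_updated
        WHEN NOT MATCHED THEN INSERT ROW
    """).result()
```

Note that `INSERT ROW` only works when the staging and target tables share an identical schema; otherwise list the columns explicitly.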
We solved this with column-level change tracking using MySQL triggers. Set up triggers on your session table that log changes to a separate table whenever specific columns get modified. Each log entry captures the primary key, which columns changed, and when. Your sync job then processes these logs and upserts the changes to BigQuery. Overhead’s minimal since you’re only tracking actual changes, and it handles multiple apps perfectly - each trigger captures its own column group changes independently.
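For anyone curious, a trimmed-down sketch of that setup. Names are illustrative (`sessions` source table, `session_change_log` log table, `processing_state` as one tracked column group), and in practice you would create one trigger per column group:

```python
import pymysql

# Change-log table that the triggers write into.
LOG_TABLE_DDL = """
CREATE TABLE IF NOT EXISTS session_change_log (
    id           BIGINT AUTO_INCREMENT PRIMARY KEY,
    session_id   BIGINT NOT NULL,
    column_group VARCHAR(32) NOT NULL,
    changed_at   TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    processed    TINYINT DEFAULT 0
)"""

# Trigger for one column group; repeat per group your apps actually touch.
STATE_TRIGGER_DDL = """
CREATE TRIGGER trg_sessions_state AFTER UPDATE ON sessions
FOR EACH ROW
BEGIN
    -- NULL-safe comparison so NULL <-> value transitions are logged too
    IF NOT (NEW.processing_state <=> OLD.processing_state) THEN
        INSERT INTO session_change_log (session_id, column_group)
        VALUES (NEW.session_id, 'processing_state');
    END IF;
END"""

def install(conn):
    with conn.cursor() as cur:
        cur.execute(LOG_TABLE_DDL)
        cur.execute(STATE_TRIGGER_DDL)
    conn.commit()

def pending_changes(conn):
    """Log rows the sync job still has to fold into BigQuery."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, session_id, column_group FROM session_change_log "
            "WHERE processed = 0 ORDER BY id"
        )
        return cur.fetchall()
```

The sync job fetches the current values for those session_ids, upserts them into BigQuery, then marks the log rows as processed.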