Maintaining historical data in BigQuery while using GCP Datastream for MySQL replication

We’re using Google Cloud Datastream to replicate data from our MySQL database to BigQuery. It handles inserts and updates well, but deletes are a problem: when a row is deleted in MySQL, it is removed from BigQuery as well, and we want to keep that historical data in BigQuery.

We tried Datastream’s append-only write mode, but it made our tables too big and awkward to query. We also considered suppressing deletes at the binary-log level, but that won’t work for us because our deletes are triggered automatically from several different processes.

What we’re looking for is a simple way to stop DELETE and TRUNCATE operations from affecting our BigQuery data: keep the newest version of each row for updates, but ignore deletes during replication.

Can we set up Datastream or BigQuery to do this? Is there a way to use Dataflow to filter out deletes before they hit BigQuery? Are there other Google Cloud tools that could help?

We’re hoping to find an easy-to-manage solution that doesn’t need a lot of work from us. Any ideas would be really helpful!

Having encountered a similar challenge, I can share our solution. We implemented a soft delete approach in our MySQL database. Instead of actually deleting records, we added a ‘deleted_at’ timestamp column. When a record is ‘deleted’, we update this column with the current timestamp.
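
In case it helps, here’s a minimal sketch of what the soft delete looks like on the MySQL side. The `orders` table, column names, and connection details are placeholders, not our actual schema:

```python
# Minimal soft-delete sketch: mark the row instead of removing it, so Datastream
# replicates an UPDATE (which BigQuery keeps) rather than a DELETE.
import pymysql

conn = pymysql.connect(host="mysql-host", user="app", password="secret", database="shop")

def soft_delete_order(order_id: int) -> None:
    # Placeholder table/column names; the only real requirement is a nullable
    # `deleted_at` TIMESTAMP column on each table you want to preserve.
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE orders SET deleted_at = NOW() WHERE id = %s AND deleted_at IS NULL",
            (order_id,),
        )
    conn.commit()
```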

We then modified our Datastream configuration to treat this ‘deleted_at’ column as a regular field. This way, all records, including ‘deleted’ ones, are replicated to BigQuery. In BigQuery, we created views that filter out the ‘deleted’ records for normal querying, while maintaining the full historical data in the underlying tables.
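
For the query side, the views are just a filter on that column. A rough sketch, assuming the replicated table lands in a dataset called `replica` (all names here are illustrative):

```python
from google.cloud import bigquery

client = bigquery.Client()

# A view that hides soft-deleted rows for day-to-day querying, while the
# underlying replicated table keeps the full history.
client.query(
    """
    CREATE OR REPLACE VIEW `my_project.replica.orders_current` AS
    SELECT * EXCEPT (deleted_at)
    FROM `my_project.replica.orders`
    WHERE deleted_at IS NULL
    """
).result()
```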

This method requires minimal changes to your existing setup and provides flexibility in data retention and querying. The main trade-off is slightly increased storage usage in both MySQL and BigQuery, but it’s generally more efficient than storing all historical versions of records.

Having dealt with a similar situation, I can share our approach. We implemented a custom Dataflow job between Datastream and BigQuery. This job intercepted the Datastream output, filtered out delete operations, and only propagated inserts and updates to BigQuery.
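
To give a feel for it, here is a simplified Apache Beam sketch of that filtering step. It assumes Datastream is writing JSON change events to Cloud Storage and that each event exposes its operation type under `source_metadata.change_type` and its row data under `payload`; the exact field names depend on your Datastream output configuration, so treat them as placeholders.

```python
import json
import apache_beam as beam

def is_not_delete(event: dict) -> bool:
    # Drop delete events so the corresponding rows survive in BigQuery.
    change_type = event.get("source_metadata", {}).get("change_type", "")
    return change_type.upper() not in ("DELETE", "UPDATE-DELETE")

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadEvents" >> beam.io.ReadFromText("gs://my-datastream-bucket/mysql/*.jsonl")
        | "Parse" >> beam.Map(json.loads)
        | "DropDeletes" >> beam.Filter(is_not_delete)
        | "ExtractRow" >> beam.Map(lambda e: e.get("payload", e))
        | "WriteStaging" >> beam.io.WriteToBigQuery(
            "my_project:replica.orders_staging",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```

In practice you would run this on the Dataflow runner (in streaming mode if you read from a notification source instead of files); the snippet above just shows the shape of the transform chain.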

For updates, we used a merge operation in BigQuery to ensure we always had the latest version of each record. This method allowed us to maintain historical data while keeping our BigQuery tables manageable.
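
The merge itself was straightforward: pick the latest staged version of each key and upsert it. Something along these lines, with placeholder table and column names:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Upsert the newest staged version of each row into the main table. Because the
# Dataflow job already removed deletes, nothing here ever removes a row.
client.query(
    """
    MERGE `my_project.replica.orders` AS main
    USING (
      SELECT AS VALUE ARRAY_AGG(t ORDER BY source_timestamp DESC LIMIT 1)[OFFSET(0)]
      FROM `my_project.replica.orders_staging` AS t
      GROUP BY id
    ) AS stage
    ON main.id = stage.id
    WHEN MATCHED THEN
      UPDATE SET status = stage.status, amount = stage.amount,
                 source_timestamp = stage.source_timestamp
    WHEN NOT MATCHED THEN
      INSERT (id, status, amount, source_timestamp)
      VALUES (stage.id, stage.status, stage.amount, stage.source_timestamp)
    """
).result()
```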

The Dataflow job wasn’t overly complex, but it did require some development and ongoing maintenance. If you’re comfortable with a bit of coding, this solution offers good flexibility and control over your data replication process.

Another option to consider is BigQuery’s time travel feature for short-term historical data retention, though it only covers a window of up to seven days, so it isn’t a substitute for long-term storage.
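
For completeness, querying an earlier state with time travel looks like this (the table name is a placeholder):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Read the table as it looked 24 hours ago. This only works inside the
# time-travel window, so it is not a long-term retention mechanism.
rows = client.query(
    """
    SELECT *
    FROM `my_project.replica.orders`
    FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
    """
).result()
```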

hey, i’ve been dealing with this too. what worked for us was putting Pub/Sub between Datastream and BigQuery: we set up a Cloud Function to filter the messages and drop the deletes. it’s pretty simple to manage once it’s set up.

the downside is you might get some lag in updates, but it keeps all your historical data intact. just make sure to tune your Pub/Sub and Cloud Function settings to handle your data volume.
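
roughly what the function looks like (a simplified sketch; the message layout with a `change_type` field and a `payload` is just how we happened to publish the events, so adjust it to whatever you put on the topic):

```python
import base64
import json

from google.cloud import bigquery

client = bigquery.Client()
TABLE = "my_project.replica.orders"  # placeholder destination table

def handle_change_event(event, context):
    """Pub/Sub-triggered Cloud Function (1st gen background-function signature)."""
    record = json.loads(base64.b64decode(event["data"]).decode("utf-8"))

    # Drop deletes so the historical rows stay in BigQuery.
    if record.get("change_type", "").upper() == "DELETE":
        return

    errors = client.insert_rows_json(TABLE, [record.get("payload", record)])
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```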

I’ve been through this exact scenario, and we found a neat workaround using Change Data Capture (CDC) logs. We set up a separate process to read the CDC logs from MySQL and push them into a staging area in BigQuery. This way, we captured all changes, including deletes.
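
Our reader was essentially a small binlog consumer. Here is an approximate sketch using the python-mysql-replication package; that specific library and the staging-table layout (a `change_log` table with `table`, `operation`, and a JSON `payload` column) are illustrative choices, not the only way to do it:

```python
import json

from google.cloud import bigquery
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent,
    UpdateRowsEvent,
    WriteRowsEvent,
)

client = bigquery.Client()
STAGING = "my_project.replica.change_log"  # placeholder staging table

stream = BinLogStreamReader(
    connection_settings={"host": "mysql-host", "port": 3306, "user": "repl", "passwd": "secret"},
    server_id=4242,
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    blocking=True,
    resume_stream=True,
)

OP_BY_EVENT = {
    WriteRowsEvent: "INSERT",
    UpdateRowsEvent: "UPDATE",
    DeleteRowsEvent: "DELETE",
}

for event in stream:
    rows = [
        {
            "table": event.table,
            "operation": OP_BY_EVENT[type(event)],
            "payload": json.dumps(row, default=str),  # raw before/after values
        }
        for row in event.rows
    ]
    client.insert_rows_json(STAGING, rows)
```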

From there, we used scheduled queries in BigQuery to process these logs. For inserts and updates, we merged the data into our main tables. For deletes, we simply marked the records as ‘deleted’ by updating a status column, rather than removing them.
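
The scheduled merge looked roughly like this. It assumes the change log has already been flattened into typed columns (`id`, `operation`, `change_timestamp`, plus the business columns); all of the names below are illustrative:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Apply the newest change per key: deletes only flip a status flag,
# while inserts and updates upsert the row.
client.query(
    """
    MERGE `my_project.replica.orders` AS main
    USING (
      SELECT * EXCEPT (rn) FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY change_timestamp DESC) AS rn
        FROM `my_project.replica.orders_changes`
      )
      WHERE rn = 1
    ) AS change
    ON main.id = change.id
    WHEN MATCHED AND change.operation = 'DELETE' THEN
      UPDATE SET status = 'deleted'
    WHEN MATCHED THEN
      UPDATE SET status = change.status, amount = change.amount
    WHEN NOT MATCHED AND change.operation != 'DELETE' THEN
      INSERT (id, status, amount) VALUES (change.id, change.status, change.amount)
    """
).result()
```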

This approach gave us full control over how we handled different operations. It preserved our historical data while keeping the current state easily queryable. The main challenge was setting up the CDC log reader, but once that was done, the rest was mostly BigQuery SQL work.

One thing to watch out for is the potential increase in storage costs over time. We implemented a partitioning strategy in BigQuery to manage this, which made it easier to archive or delete truly old data when necessary.
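
In case it’s useful, the partitioning setup is a one-time DDL along these lines; the 400-day expiration and the column names are just examples, not a recommendation:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Day-partitioned historical table with automatic expiry of old partitions.
client.query(
    """
    CREATE TABLE IF NOT EXISTS `my_project.replica.orders_history`
    (
      id INT64,
      status STRING,
      amount NUMERIC,
      change_timestamp TIMESTAMP
    )
    PARTITION BY DATE(change_timestamp)
    OPTIONS (partition_expiration_days = 400)
    """
).result()
```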

I’ve been in your shoes, and I found a workable solution that might help. We set up a staging table in BigQuery that receives all the Datastream updates. Then, we created a scheduled query that runs periodically to merge the staging data into our main tables.

The key is in the merge logic. We only insert new records or update existing ones, effectively ignoring deletes. This way, our main tables retain all historical data while staying up-to-date with the latest changes.
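
To make the merge logic concrete, here is a simplified version. It assumes the staging table carries the Datastream rows plus a change-type column and a source timestamp; the table and column names are placeholders that will depend on how your staging table is loaded:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Upsert the latest non-delete version of each row; delete events are simply
# never merged, so the main table keeps the last known state of deleted rows.
client.query(
    """
    MERGE `my_project.replica.orders` AS main
    USING (
      SELECT * EXCEPT (rn) FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY source_timestamp DESC) AS rn
        FROM `my_project.replica.orders_staging`
        WHERE change_type != 'DELETE'
      )
      WHERE rn = 1
    ) AS stage
    ON main.id = stage.id
    WHEN MATCHED THEN
      UPDATE SET status = stage.status, amount = stage.amount,
                 source_timestamp = stage.source_timestamp
    WHEN NOT MATCHED THEN
      INSERT (id, status, amount, source_timestamp)
      VALUES (stage.id, stage.status, stage.amount, stage.source_timestamp)
    """
).result()
```

A statement like this works well as a BigQuery scheduled query; whether you clear the staging table after each run or keep it as an audit log is up to you.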

It does require some SQL know-how to set up the merge query correctly, but once it’s running, it’s pretty low-maintenance. The biggest challenge we faced was optimizing the query performance as our data grew.

One caveat: this approach can lead to data growth over time, so you might need to implement a data retention policy eventually. But for maintaining historical data without the headaches of managing a custom Dataflow job, this method has served us well.