How to prevent duplicate records when loading CSV data from S3 to MySQL RDS through AWS Glue

I’m working with an AWS Glue ETL job that processes a CSV file stored in an S3 bucket. This file gets updated weekly with fresh data from my machine learning pipeline. My current setup reads the CSV from S3 and loads it into a MySQL database in RDS.

The problem I’m facing is data duplication. When I first ran the job, it successfully loaded 50000 records. However, when I run the job again, it appends the entire dataset instead of just adding the new records.

Here’s my current implementation:

output_sink = glueContext.write_dynamic_frame.from_catalog(
    frame = processed_frame, 
    database = "my_database", 
    table_name = "target_table", 
    transformation_ctx = "output_sink"
)

What I need is for the job to only insert the 1000 new rows that get added each week, so my total becomes 51000 instead of 100000.

I’m thinking about truncating the target table before each load to avoid duplicates. Is this the right approach? Here’s my basic job setup:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

options = getResolvedOptions(sys.argv, ['JOB_NAME'])
spark_context = SparkContext()
glue_context = GlueContext(spark_context)
spark_session = glue_context.spark_session
etl_job = Job(glue_context)
etl_job.init(options['JOB_NAME'], options)

What’s the best way to handle this incremental data loading scenario?

Don’t reinvent the wheel - just use AWS Glue’s built-in job bookmarks. They’re made for exactly this situation. Enable job bookmarks in your Glue job configuration and read your source with glueContext.create_dynamic_frame.from_catalog(), passing a transformation_ctx. Glue will automatically track what has already been processed and only pick up new data on future runs.

The bookmark system keeps tabs on your S3 files and their timestamps, so when your ML pipeline drops fresh CSV data, Glue only reads what has changed. For best performance, partition your S3 data by date or use predictable file naming. This cuts out messy joins and manual timestamp tracking, and it’s already part of your AWS setup: add transformation_ctx parameters to your transforms, turn on bookmarks in the job properties, and call job.commit() at the end so the bookmark state is saved. It handles weekly updates without touching your MySQL table schema.
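A minimal sketch of the bookmark-enabled version (this only runs inside a Glue job with bookmarks enabled in the job properties; the catalog table name source_table is an assumption, the rest comes from the question):

```python
# Sketch: the same job with job bookmarks. Set "Job bookmark" to "Enable"
# in the Glue job properties for this to take effect.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

# transformation_ctx is the handle the bookmark uses to remember progress
source = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="source_table",      # assumed: catalog table over the S3 CSVs
    transformation_ctx="source",    # required for bookmarking to work
)

glue_context.write_dynamic_frame.from_catalog(
    frame=source,
    database="my_database",
    table_name="target_table",
    transformation_ctx="output_sink",
)

job.commit()  # persists the bookmark state; without this, nothing is remembered
```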

I’ve hit this exact problem several times. The key is proper change detection - don’t just truncate or blindly upsert everything. Add a timestamp or version column to track when records were last modified in your source system.

Then in your Glue job, load both the existing MySQL table and the new S3 data into DataFrames, and do a left anti-join to find genuinely new records: new_records = s3_df.join(existing_df, on=['primary_key'], how='left_anti'). You’ll only process actual additions this way. If your CSV has a reliable timestamp field, just filter the S3 data before loading - compare it against the max timestamp in your target table. This works great for ML pipeline outputs since they usually include processing timestamps.

Truncating works but creates downtime and adds complexity you don’t need. The join approach handles edge cases better, especially with large datasets where network issues might cause partial loads.
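The anti-join logic above, illustrated in plain Python so it can run anywhere (the key and column names here are made up; in the actual job you’d use the PySpark one-liner with your real primary key):

```python
def new_records(s3_rows, existing_rows, key='primary_key'):
    """Return only the rows from s3_rows whose key is not already present."""
    existing_keys = {row[key] for row in existing_rows}
    return [row for row in s3_rows if row[key] not in existing_keys]

# Hypothetical data: record 2 already exists, record 3 is genuinely new
existing = [{'primary_key': 1, 'score': 0.9}, {'primary_key': 2, 'score': 0.4}]
incoming = [{'primary_key': 2, 'score': 0.4}, {'primary_key': 3, 'score': 0.7}]

print(new_records(incoming, existing))  # [{'primary_key': 3, 'score': 0.7}]
```

In PySpark, `how='left_anti'` does the same set difference at scale, distributed across the cluster.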

Use an upsert with ON DUPLICATE KEY UPDATE in MySQL instead of regular inserts. Glue can load the full dataset and MySQL handles deduplication automatically - just set up a unique key constraint on your target table first.
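Roughly what that looks like in MySQL (table and column names are assumptions, not from the question):

```sql
-- Assumed schema: the unique key on record_id is what makes the dedup work
CREATE TABLE target_table (
    record_id BIGINT NOT NULL,
    score DOUBLE,
    updated_at TIMESTAMP,
    UNIQUE KEY uq_record (record_id)
);

-- Re-running the load updates existing rows instead of duplicating them
INSERT INTO target_table (record_id, score, updated_at)
VALUES (42, 0.87, NOW())
ON DUPLICATE KEY UPDATE
    score = VALUES(score),
    updated_at = VALUES(updated_at);
```

One caveat: Glue’s catalog sink issues plain inserts, so you’d typically apply this by writing through your own JDBC connection or a staging table rather than through write_dynamic_frame directly.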

Glue bookmarks sound good but they’re unreliable with CSV files that get replaced weekly. I’ve seen them break when file metadata changes or your ML pipeline writes to the same path.

The real problem? You’re juggling multiple AWS services that don’t play well together. Glue jobs, RDS connections, S3 permissions - it gets messy quick.

I had the same issue and switched to Latenode. It’s got dedicated nodes for CSV processing and MySQL that handle incremental loading automatically.

Set up duplicate detection on any column combo you want. New data from S3 gets compared against your MySQL records - only fresh stuff gets inserted. No DataFrame joins or bookmark headaches.

The visual builder makes error handling and monitoring dead simple. Schedule it weekly, get alerts when things break. Way more reliable than wrestling with AWS services.

Bonus: no vendor lock-in and no surprise Glue bills when your data grows.

Check it out: https://latenode.com

AWS Glue is overkill for incremental loading. I’ve hit similar pipeline issues and found dedicated automation platforms work way better.

Don’t truncate your table before each load - you’ll lose data if something breaks mid-process. Plus Glue job configs are unnecessarily complex.

You need a workflow that:

  • Tracks existing records in MySQL
  • Compares against S3 CSV data
  • Inserts only new records
  • Handles errors gracefully
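For reference, those four steps are simple enough to sketch in plain Python regardless of which platform runs them (sqlite3 stands in for MySQL here so the sketch is self-contained; table and column names are made up):

```python
import csv
import io
import sqlite3

# Stand-in for the MySQL target table, pre-loaded with last week's rows
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE target_table (record_id INTEGER PRIMARY KEY, score REAL)')
conn.execute('INSERT INTO target_table VALUES (1, 0.9), (2, 0.4)')

# Stand-in for this week's CSV from S3: record 2 is a repeat, record 3 is new
weekly_csv = io.StringIO('record_id,score\n2,0.4\n3,0.7\n')

# Steps 1-2: track existing records and compare against the CSV
existing = {row[0] for row in conn.execute('SELECT record_id FROM target_table')}
rows = list(csv.DictReader(weekly_csv))

# Steps 3-4: insert only new records, failing gracefully per row
for row in rows:
    if int(row['record_id']) in existing:
        continue
    try:
        conn.execute('INSERT INTO target_table VALUES (?, ?)',
                     (int(row['record_id']), float(row['score'])))
    except sqlite3.Error as exc:
        print(f"skipping row {row['record_id']}: {exc}")

conn.commit()
print(conn.execute('SELECT COUNT(*) FROM target_table').fetchone()[0])  # 3
```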

I ditched AWS Glue for Latenode on similar data processing workflows. Latenode has built-in nodes for S3 reading, MySQL ops, and data deduplication. You build the entire pipeline visually without Python code.

It automatically handles incremental loads by comparing primary keys or timestamps between your CSV and table. Only processes delta records, saving time and preventing duplicates.

For weekly ML pipeline updates, schedule the workflow to run automatically with failure notifications. Way cleaner than managing Glue configs and Spark contexts.

Check it out: https://latenode.com