Syncing DynamoDB Changes to Aurora PostgreSQL pgVector for Real-time RAG Embedding Updates

I’m working on a system where my main data lives in DynamoDB and gets updated regularly by my backend applications. I want to build a vector database using Aurora PostgreSQL with pgVector extension to keep my embeddings in sync.

The challenge is making sure the vector database stays current whenever something changes in DynamoDB. I’ve been looking at AWS EventBridge as a possible solution but I’m not sure if that’s the best approach.

What I need is an automatic process that creates fresh embeddings each time there’s an insert or update in my DynamoDB table, then saves those vectors to the PostgreSQL database. This way my RAG system always works with the most recent data and my language model responses stay accurate.

I’m still in the planning phase and would love to hear what architecture patterns work best for this kind of setup.

DynamoDB Streams are definitely the better choice! Just remember to implement solid error handling. I've seen setups lose sync completely when the embedding service crashed. A dead-letter queue and checkpointing are lifesavers for those moments.
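To make the checkpointing idea concrete, here's a minimal sketch of a Streams-triggered Lambda handler using Lambda's partial batch response feature (enabled by setting `ReportBatchItemFailures` on the event source mapping). Returning the sequence numbers of failed records makes Lambda retry from the first failure instead of reprocessing or dropping the whole batch; records that keep failing can then flow to the DLQ. The `content` attribute and the `process_record` body are placeholders for your actual embed-and-upsert logic.

```python
def process_record(record):
    """Placeholder for generating an embedding and upserting it into pgvector.

    Assumes each item has a `content` string attribute; raising here simulates
    a crashed embedding service so the record gets reported as failed.
    """
    new_image = record["dynamodb"].get("NewImage", {})
    text = new_image.get("content", {}).get("S", "")
    if not text:
        raise ValueError("nothing to embed")
    return text


def handler(event, context=None):
    """DynamoDB Streams handler with checkpoint-style partial batch failures."""
    failures = []
    for record in event.get("Records", []):
        try:
            process_record(record)
        except Exception:
            # Report only the failed record's position; Lambda retries from here.
            failures.append(
                {"itemIdentifier": record["dynamodb"]["SequenceNumber"]}
            )
    return {"batchItemFailures": failures}
```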

I built something similar recently. Managing the embedding pipeline is just as important as the sync itself. DynamoDB Streams with Lambda works great, but put an SQS queue between your Lambda trigger and embedding generation. This prevents bottlenecks when your DynamoDB table gets hit with burst updates. Embedding generation eats resources, so use a dedicated Lambda or ECS task for it. Version your embeddings in PostgreSQL; you'll thank me when you update your embedding model and need to know which vectors to regenerate. Watch out for DynamoDB schema changes that affect which fields get embedded. I used a config-driven approach for field selection and it saved tons of refactoring later.
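The config-driven field selection mentioned above might look something like this sketch. The field names and the assumption of top-level DynamoDB-JSON attributes are hypothetical; the point is that changing what gets embedded becomes a config edit rather than a code change, and unknown or missing fields are skipped so schema drift doesn't break the pipeline.

```python
# Hypothetical config: which item attributes feed the embedding, in order.
EMBED_FIELDS = ["title", "description", "tags"]


def extract_embed_text(item: dict, fields=EMBED_FIELDS) -> str:
    """Flatten the configured fields of a DynamoDB-JSON item into one string.

    Handles common attribute types: S (string), N (number), SS (string set),
    and L (list). Missing or unrecognized fields are silently skipped.
    """
    parts = []
    for name in fields:
        attr = item.get(name)
        if not attr:
            continue
        if "S" in attr:
            parts.append(attr["S"])
        elif "N" in attr:
            parts.append(attr["N"])
        elif "SS" in attr:
            parts.extend(attr["SS"])
        elif "L" in attr:
            parts.extend(v.get("S", "") for v in attr["L"])
    return " ".join(p for p in parts if p)
```

Pairing this with a `model_version` column on the pgvector table gives you both halves of the advice: you know which fields produced each vector and which model produced it.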

In my experience, DynamoDB Streams with AWS Lambda has proven more efficient than EventBridge for this kind of synchronization. It gives you near real-time updates and captures changes in the correct order, which is critical when managing embeddings. When a change occurs in DynamoDB, the Lambda function can generate new embeddings through a dedicated service before storing them in Aurora PostgreSQL. Also consider batching embedding requests to reduce latency and costs. Be mindful of potential timeouts during embedding generation, and implement strategies for handling partial failures. Lastly, if you are using larger models, keeping some Lambda instances warm with scheduled rules can help mitigate cold starts; in my production setup that brought processing down to around 2-3 seconds.
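The batching and partial-failure advice above can be sketched as a small helper. `embed_fn` is a placeholder for whatever embedding service you call (a Bedrock or SageMaker endpoint, for example); batches that keep failing are returned separately so the caller can dead-letter them instead of losing the whole invocation.

```python
import time


def embed_batch_with_retry(texts, embed_fn, batch_size=16, max_retries=3):
    """Batch texts to cut per-call overhead; retry failed batches with backoff.

    embed_fn takes a list of strings and returns a list of vectors.
    Returns (vectors, failed_batches); failed_batches is what the caller
    should route to a dead-letter queue.
    """
    vectors, failed = [], []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        for attempt in range(max_retries):
            try:
                vectors.extend(embed_fn(batch))
                break
            except Exception:
                if attempt < max_retries - 1:
                    # Exponential backoff between retries: 1s, 2s, 4s, ...
                    time.sleep(2 ** attempt)
        else:
            failed.append(batch)
    return vectors, failed
```

Keeping the batch size well below the embedding service's limit also leaves headroom so one slow batch doesn't push the Lambda past its timeout.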