Best Practices for Persisting ChromaDB Vectors to S3 in RAG Applications

Hi everyone! I’m working on a RAG chatbot project using ChromaDB for vector storage and running into some issues with persistence. My setup consists of two parts: an admin section for uploading PDFs and converting them into vectors, and a chat interface for user interaction.

The main challenge I’m facing is saving ChromaDB vectors to AWS S3 and loading them back. Currently, I’m serializing the vectors and storing them in SQLite, but I run into problems when I try to deserialize them.

Does anyone have experience with this kind of setup? Should I continue using SQLite for storing the serialized data, or might something like FAISS be more effective for cloud storage? As I’m relatively new to vector databases, any advice would be greatly appreciated.

Here’s my approach for serialization:

import pickle

def setup_vector_database(connection, doc_list):
    # Initialize table for vector storage
    cursor = connection.cursor()
    cursor.execute('''CREATE TABLE IF NOT EXISTS vector_data (
                        idx INTEGER PRIMARY KEY AUTOINCREMENT,
                        doc_ref TEXT,
                        vector_blob BLOB
                    )''')

    # Store vectors in database
    for index, doc_vector in enumerate(doc_list):
        cursor.execute("INSERT INTO vector_data (doc_ref, vector_blob) VALUES (?, ?)",
                       (f"doc_{index}", pickle.dumps(doc_vector)))

    connection.commit()

And here’s how I’m using it:

import os
import sqlite3
import tempfile

import boto3
from langchain_community.document_loaders import AmazonTextractPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

def process_pdf_to_vectors(s3_bucket, file_path, aws_region='us-east-1'):
    try:
        print(f"Processing from bucket: {s3_bucket}")
        print(f"File path: {file_path}")
        
        # Extract text using Textract
        pdf_loader = AmazonTextractPDFLoader(f's3://{s3_bucket}/{file_path}', region_name=aws_region)
        extracted_docs = pdf_loader.load()
        
        # Break into smaller chunks
        splitter = RecursiveCharacterTextSplitter(chunk_size=50000, chunk_overlap=5000)
        chunked_docs = splitter.split_documents(extracted_docs)
        
        # Generate embeddings
        embedding_model = OpenAIEmbeddings()
        vector_store = Chroma.from_documents(chunked_docs, embedding_model, persist_directory="./vector_db")
        
        # Setup temporary database file (close the handle so SQLite and the upload can reuse the path)
        temp_file = tempfile.NamedTemporaryFile(delete=False, suffix='.db')
        temp_file.close()
        sqlite_conn = sqlite3.connect(temp_file.name)

        # Save to database, then close the connection so the file is fully flushed
        setup_vector_database(sqlite_conn, chunked_docs)
        sqlite_conn.close()

        # Upload to S3
        output_key = f"{os.path.splitext(file_path)[0]}.vectors.db"
        s3_uploader = boto3.client('s3')
        s3_uploader.upload_file(temp_file.name, s3_bucket, output_key)
        
        return vector_store
        
    except Exception as error:
        print(f"Error processing file: {error}")

Any suggestions on better approaches or what I might be doing wrong?

Your code has a fundamental mismatch that’s causing problems. You’re creating a ChromaDB vector store but then manually serializing document chunks (not vectors) into SQLite.
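
To see the mismatch concretely: the entries in chunked_docs are LangChain Document objects (text plus metadata), while the embeddings only live inside the Chroma collection. A quick sketch, reusing the names from your snippet:

print(type(chunked_docs[0]))  # a Document, not a vector

# If you ever did need the raw vectors, you'd pull them out of the store itself:
raw = vector_store.get(include=["embeddings"])
print(len(raw["embeddings"][0]))  # e.g. 1536 floats per chunk for OpenAI embeddings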

I hit this exact issue 8 months ago on a similar project. Here’s what worked:

Drop the manual SQLite approach. ChromaDB’s built-in persistence works way better:

# Create the vector store with a persistent directory
vector_store = Chroma.from_documents(
    chunked_docs, 
    embedding_model, 
    persist_directory="./chroma_db"
)

# ChromaDB automatically saves everything
vector_store.persist()

# Upload the entire directory to S3, keeping the sub-directory layout
# (Chroma writes its index files into nested folders, so flat keys won't reload cleanly)
s3_client = boto3.client('s3')
doc_id = os.path.splitext(file_path)[0]
for root, dirs, files in os.walk('./chroma_db'):
    for file in files:
        local_path = os.path.join(root, file)
        relative_path = os.path.relpath(local_path, './chroma_db')
        s3_client.upload_file(local_path, s3_bucket, f"vector_stores/{doc_id}/{relative_path}")

For loading back:

# Download everything under the prefix first, rebuilding the directory structure
prefix = f"vector_stores/{doc_id}/"
os.makedirs("./temp_chroma", exist_ok=True)
s3_objects = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in s3_objects.get('Contents', []):
    local_file = os.path.join("./temp_chroma", obj['Key'][len(prefix):])
    os.makedirs(os.path.dirname(local_file), exist_ok=True)
    s3_client.download_file(bucket, obj['Key'], local_file)

# Load the vector store with the same embedding model used at index time
vector_store = Chroma(persist_directory="./temp_chroma", embedding_function=embedding_model)

This lets ChromaDB handle all vector serialization internally. Much more reliable than trying to pickle everything yourself.
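
Once it’s reloaded, you query it like any other Chroma store. A minimal usage sketch (the question text and k value are just placeholders):

# Query the reloaded store
results = vector_store.similarity_search("what does the report say about revenue?", k=4)
for doc in results:
    print(doc.page_content[:200])

# Or hand it to a retriever for the chat side of the app
retriever = vector_store.as_retriever(search_kwargs={"k": 4})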

On FAISS vs ChromaDB - I’ve used both extensively. FAISS is faster for large datasets but ChromaDB is easier to work with for most RAG applications. Unless you’re dealing with millions of vectors, stick with Chroma.
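
If you do want to try FAISS later, the S3 story is similar: it saves a couple of files you upload and pull back down. Rough sketch only, assuming faiss-cpu is installed and a recent LangChain (the import path and the load flag vary by version):

from langchain_community.vectorstores import FAISS

faiss_store = FAISS.from_documents(chunked_docs, embedding_model)
faiss_store.save_local("./faiss_index")  # writes index.faiss and index.pkl

# ...upload both files to S3, download them later, then reload:
restored = FAISS.load_local(
    "./faiss_index",
    embedding_model,
    allow_dangerous_deserialization=True,  # newer versions require this since index.pkl is unpickled
)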

you’re overcomplicating this. chromadb already persists itself - just call persist() on the store and sync the persist directory with s3, then sync it back down before loading. way simpler than messing with pickle and sqlite. also, you’re pickling the document chunks instead of the actual vectors in that blob - that’s probably why deserialization breaks.
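
something like this, if you have the aws cli set up (bucket and prefix are just placeholders):

import subprocess

# push the persisted directory up after indexing
subprocess.run(["aws", "s3", "sync", "./chroma_db", "s3://my-bucket/vector_stores/my_doc/"], check=True)

# pull it back down before loading on the chat side
subprocess.run(["aws", "s3", "sync", "s3://my-bucket/vector_stores/my_doc/", "./chroma_db"], check=True)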