Hi everyone! I’m working on a RAG chatbot project using ChromaDB for vector storage and running into some issues with persistence. My setup consists of two parts: an admin section for uploading PDFs and converting them into vectors, and a chat interface for user interaction.
The main challenge I’m facing is saving ChromaDB vectors to AWS S3 and loading them back. Currently, I’m pickling the data into a SQLite database and uploading that file to S3, but I run into trouble when deserializing it again on the chat side.
Does anyone have experience with this kind of setup? Should I continue using SQLite for storing the serialized data, or might something like FAISS be more effective for cloud storage? As I’m relatively new to vector databases, any advice would be greatly appreciated.
Here’s my approach for serialization:
import pickle

def setup_vector_database(connection, doc_list):
    # Initialize table for vector storage
    cursor = connection.cursor()
    cursor.execute('''CREATE TABLE IF NOT EXISTS vector_data (
        idx INTEGER PRIMARY KEY AUTOINCREMENT,
        doc_ref TEXT,
        vector_blob BLOB
    )''')
    # Pickle each item in doc_list and store it as a blob
    # (note: as called below, doc_list holds the chunked Documents, not the embeddings)
    for index, doc_vector in enumerate(doc_list):
        cursor.execute(
            "INSERT INTO vector_data (doc_ref, vector_blob) VALUES (?, ?)",
            (f"doc_{index}", pickle.dumps(doc_vector)),
        )
    connection.commit()
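For reference, the loading side I’ve sketched so far looks roughly like this (the function name and the key argument are placeholders I made up for this post; it just mirrors the vector_data schema above):

import pickle
import sqlite3
import tempfile

import boto3

def load_documents_from_s3(s3_bucket, vectors_key):
    # Download the SQLite file that was uploaded by the admin side
    s3_client = boto3.client('s3')
    temp_file = tempfile.NamedTemporaryFile(delete=False, suffix='.db')
    temp_file.close()
    s3_client.download_file(s3_bucket, vectors_key, temp_file.name)
    # Read each pickled blob back out of the vector_data table
    connection = sqlite3.connect(temp_file.name)
    cursor = connection.cursor()
    cursor.execute("SELECT doc_ref, vector_blob FROM vector_data ORDER BY idx")
    documents = [pickle.loads(blob) for _, blob in cursor.fetchall()]
    connection.close()
    return documents

The part I’m stuck on is what to do with these documents afterwards: as far as I can tell I’d have to call Chroma.from_documents again on the chat side, which would re-embed everything on every load.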
And here’s how I’m calling setup_vector_database from the admin upload flow:
import os
import sqlite3
import tempfile

import boto3
# import paths may differ depending on your LangChain version
from langchain_community.document_loaders import AmazonTextractPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

def process_pdf_to_vectors(s3_bucket, file_path, aws_region='us-east-1'):
    try:
        print(f"Processing from bucket: {s3_bucket}")
        print(f"File path: {file_path}")
        # Extract text using Textract
        pdf_loader = AmazonTextractPDFLoader(f's3://{s3_bucket}/{file_path}', region_name=aws_region)
        extracted_docs = pdf_loader.load()
        # Break into smaller chunks (chunks this large may exceed the embedding model's token limit)
        splitter = RecursiveCharacterTextSplitter(chunk_size=50000, chunk_overlap=5000)
        chunked_docs = splitter.split_documents(extracted_docs)
        # Generate embeddings and build the Chroma store
        embedding_model = OpenAIEmbeddings()
        vector_store = Chroma.from_documents(chunked_docs, embedding_model, persist_directory="./vector_db")
        # Set up a temporary SQLite database file
        temp_file = tempfile.NamedTemporaryFile(delete=False, suffix='.db')
        temp_file.close()  # close the handle so sqlite3 can open the file itself
        sqlite_conn = sqlite3.connect(temp_file.name)
        # Save the chunked documents to the database
        setup_vector_database(sqlite_conn, chunked_docs)
        sqlite_conn.close()  # make sure everything is flushed before uploading
        # Upload to S3
        output_key = f"{os.path.splitext(file_path)[0]}.vectors.db"
        s3_uploader = boto3.client('s3')
        s3_uploader.upload_file(temp_file.name, s3_bucket, output_key)
        os.unlink(temp_file.name)  # clean up the temp file
        return vector_store
    except Exception as error:
        print(f"Error processing file: {error}")
        return None
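For the FAISS idea mentioned at the top, this is the rough shape I had in mind (untested sketch: the local folder and index_prefix are placeholders, the import path depends on the LangChain version, and it needs faiss-cpu installed):

import os
import boto3
from langchain_community.vectorstores import FAISS

def save_faiss_index_to_s3(chunked_docs, embedding_model, s3_bucket, index_prefix):
    # Build a FAISS index from the chunked documents and write it to a local folder
    faiss_store = FAISS.from_documents(chunked_docs, embedding_model)
    faiss_store.save_local("faiss_index")
    # save_local writes index.faiss and index.pkl; upload both to S3
    s3_client = boto3.client('s3')
    for filename in ("index.faiss", "index.pkl"):
        s3_client.upload_file(os.path.join("faiss_index", filename), s3_bucket, f"{index_prefix}/{filename}")

def load_faiss_index_from_s3(embedding_model, s3_bucket, index_prefix):
    # Download both index files and load the store back with the same embedding model
    s3_client = boto3.client('s3')
    os.makedirs("faiss_index", exist_ok=True)
    for filename in ("index.faiss", "index.pkl"):
        s3_client.download_file(s3_bucket, f"{index_prefix}/{filename}", os.path.join("faiss_index", filename))
    # newer LangChain versions require allow_dangerous_deserialization for pickle-backed loads
    return FAISS.load_local("faiss_index", embedding_model, allow_dangerous_deserialization=True)

If that’s a more sensible pattern than the SQLite approach above, I’m happy to switch.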
Any suggestions on better approaches or what I might be doing wrong?