I’m building a chatbot using a RAG architecture with vector embeddings. My app has two parts: one where admins upload documents and create vectors, and another where users chat with the bot.
Right now I’m having trouble saving the vector data to AWS S3 and loading it back when needed. The main issue is converting the vectors to a format that can be stored and then retrieved properly.
I tried using SQLite to store the serialized vectors, but I’m running into problems. Should I stick with this approach or look at other options like FAISS? What’s the recommended way to handle vector storage in cloud environments?
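To make the conversion step concrete, here is a minimal sketch that packs one embedding into raw float32 bytes and unpacks it again using only the standard library. The helper names `vector_to_blob` and `blob_to_vector` are hypothetical, and storing float32 is an assumption: it keeps the blobs small but rounds values slightly.

```python
from array import array

def vector_to_blob(vector):
    """Pack a sequence of floats into raw float32 bytes (hypothetical helper)."""
    return array('f', vector).tobytes()

def blob_to_vector(blob):
    """Unpack bytes produced by vector_to_blob back into a list of floats."""
    values = array('f')
    values.frombytes(blob)
    return values.tolist()

# Round trip with values that float32 represents exactly
blob = vector_to_blob([0.5, -1.25, 3.0])
restored = blob_to_vector(blob)
```

Unlike pickle, this format cannot run code when it is loaded. The bytes use the machine's native byte order, so the sketch assumes the writer and the reader run on the same architecture.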
Here’s my current approach for database creation:
import pickle

def setup_vector_database(connection, vector_data):
    cursor = connection.cursor()
    cursor.execute('''CREATE TABLE IF NOT EXISTS vectors (
        idx INTEGER PRIMARY KEY AUTOINCREMENT,
        doc_reference TEXT,
        vector_data BLOB
    )''')
    # Serialize each vector to bytes so it fits in the BLOB column
    for index, vector in enumerate(vector_data):
        cursor.execute("INSERT INTO vectors (doc_reference, vector_data) VALUES (?, ?)",
                       (f"doc_{index}", pickle.dumps(vector)))
    connection.commit()
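The read side of this table could look like the following. This is a sketch that assumes the rows were written with `pickle.dumps` as in `setup_vector_database`; `load_vectors` is a hypothetical name. `pickle.loads` should only be used on files you wrote yourself, since unpickling untrusted data can execute arbitrary code.

```python
import pickle
import sqlite3

def load_vectors(connection):
    """Return (doc_reference, vector) pairs in insertion order."""
    cursor = connection.cursor()
    cursor.execute("SELECT doc_reference, vector_data FROM vectors ORDER BY idx")
    return [(ref, pickle.loads(blob)) for ref, blob in cursor.fetchall()]

# Round trip against an in-memory database with the same schema
connection = sqlite3.connect(":memory:")
connection.execute('''CREATE TABLE vectors (
    idx INTEGER PRIMARY KEY AUTOINCREMENT,
    doc_reference TEXT,
    vector_data BLOB
)''')
connection.execute(
    "INSERT INTO vectors (doc_reference, vector_data) VALUES (?, ?)",
    ("doc_0", pickle.dumps([0.1, 0.2, 0.3])),
)
connection.commit()
rows = load_vectors(connection)
```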
And here’s how I process the documents:
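import os
import sqlite3
import tempfile

import boto3
# Import paths vary across LangChain versions; adjust to match yours.
from langchain_community.document_loaders import AmazonTextractPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

def process_document_embeddings(s3_bucket, document_path, aws_region='us-east-1'):
    try:
        pdf_loader = AmazonTextractPDFLoader(f's3://{s3_bucket}/{document_path}', region_name=aws_region)
        raw_documents = pdf_loader.load()
        splitter = RecursiveCharacterTextSplitter(chunk_size=50000, chunk_overlap=5000)
        text_chunks = splitter.split_documents(raw_documents)
        embedding_model = OpenAIEmbeddings()
        vector_store = Chroma.from_documents(text_chunks, embedding_model, persist_directory="./vectors")
        # Embed the chunk text explicitly so SQLite stores vectors, not Document objects
        embeddings = embedding_model.embed_documents([chunk.page_content for chunk in text_chunks])
        temp_file = tempfile.NamedTemporaryFile(delete=False, suffix='.db')
        temp_file.close()  # release the handle so sqlite3 can open the path
        connection = sqlite3.connect(temp_file.name)
        setup_vector_database(connection, embeddings)
        connection.close()  # make sure everything is on disk before uploading
        output_key = f"{os.path.splitext(document_path)[0]}.vectors.db"
        s3 = boto3.client('s3')
        s3.upload_file(temp_file.name, s3_bucket, output_key)
        return vector_store
    except Exception as error:
        print(f"Error processing document: {error}")
        return None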
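For the chat side, loading would mirror the upload. Below is a minimal sketch, assuming the object key follows the `.vectors.db` naming used above and that `s3_client` is a boto3 S3 client (for example `boto3.client('s3')`). The helper name `open_vector_db_from_s3` is hypothetical.

```python
import os
import sqlite3
import tempfile

def open_vector_db_from_s3(s3_client, s3_bucket, object_key):
    """Download a SQLite file from S3 and return (connection, local_path)."""
    fd, local_path = tempfile.mkstemp(suffix='.db')
    os.close(fd)  # only the path is needed; the client writes the file itself
    # download_file(Bucket, Key, Filename) is the boto3 S3 client call
    s3_client.download_file(s3_bucket, object_key, local_path)
    return sqlite3.connect(local_path), local_path
```

The caller is responsible for closing the connection and deleting `local_path` when finished. Passing the client in as a parameter, rather than creating it inside the function, is a design choice that makes the function easy to exercise with a stub in tests.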
Any suggestions for better approaches would be helpful. I’m relatively new to vector databases and cloud storage patterns.