I’m building a retrieval system using LlamaIndex and want to use the SubDocSummaryPack for better document chunking compared to basic text splitting. The problem is I can’t figure out how to save the generated embeddings to my local ChromaDB instance so I don’t have to recreate them every time I run my code.
import os

from llama_index.core import SimpleDirectoryReader
from llama_index.packs.subdoc_summary import SubDocSummaryPack
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

os.environ["OPENAI_API_KEY"] = "your_api_key_here"

# Load a large PDF document (120 pages)
files = SimpleDirectoryReader("documents").load_data()
# Processing documents with SubDocSummaryPack
summary_pack = SubDocSummaryPack(
    files,
    parent_chunk_size=4096,
    child_chunk_size=256,
    llm=OpenAI(model="gpt-3.5-turbo"),
    embed_model=OpenAIEmbedding(),
)
# Setting up ChromaDB for persistence
import chromadb
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
# Initialize database
client = chromadb.PersistentClient(path="./db")
# Get or create collection
my_collection = client.get_or_create_collection("my_docs")
# Setup vector store
vec_store = ChromaVectorStore(chroma_collection=my_collection)
store_context = StorageContext.from_defaults(vector_store=vec_store)
# STUCK HERE - need to persist the embeddings from summary_pack
How can I save the embeddings created by SubDocSummaryPack to my ChromaDB collection? I want to avoid re-embedding the same documents when nothing has changed.
SubDocSummaryPack builds its own internal index, which you have to extract and persist separately. I hit this same issue last year working with hierarchical document processing.
You need to grab the index from the pack and rebuild it with your ChromaDB vector store:
# After creating your summary_pack
summary_pack = SubDocSummaryPack(
    files,
    parent_chunk_size=4096,
    child_chunk_size=256,
    llm=OpenAI(model="gpt-3.5-turbo"),
    embed_model=OpenAIEmbedding(),
)
# Extract the nodes from the pack's internal docstore
nodes = list(summary_pack.index.docstore.docs.values())

# Create a new index backed by your ChromaDB storage context
persistent_index = VectorStoreIndex(
    nodes=nodes,
    storage_context=store_context,
    embed_model=OpenAIEmbedding(),
)
# Now your embeddings are saved to ChromaDB
For later runs, just check if your collection has documents:
if my_collection.count() > 0:
    # Load the existing index directly from the vector store
    persistent_index = VectorStoreIndex.from_vector_store(
        vec_store,
        embed_model=OpenAIEmbedding(),
    )
else:
    # Process with SubDocSummaryPack and persist as shown above
    ...
Saved me hours of reprocessing time with large document collections. The hierarchical chunks from SubDocSummaryPack work great in ChromaDB.
I’ve been wrestling with chunked-document workflows too, and manually pulling nodes out of SubDocSummaryPack is a nightmare: you still end up writing all that persistence code yourself.
Instead, I build embedding pipelines that automate everything up front. Set up monitoring on your document folder - it processes new or changed files with whatever chunking strategy you want and writes everything straight into ChromaDB.
The best part? You can add conditional logic: hash checks, timestamp comparisons, only re-embedding what’s actually changed. Different processing paths too - SubDocSummaryPack for PDFs, simple chunking for text files.
I built one last month for our doc system. It monitors files, generates embeddings, handles the database writes, and even cleans up stale embeddings when docs update. All automatic.
No more boilerplate for node extraction or ChromaDB connections. Define your logic once and you’re done.
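For what it's worth, the skeleton of such a pipeline doesn't need much: a polling loop over file mtimes plus a dispatch table keyed by extension. Everything here is a hypothetical sketch - the handler names, the poll interval, and the stub bodies are mine; swap in your real SubDocSummaryPack processing and ChromaDB writes where the stubs are:

```python
import time
from pathlib import Path
from typing import Callable

# Hypothetical per-extension handlers; replace the bodies with real
# SubDocSummaryPack / simple-chunking logic plus ChromaDB upserts.
def embed_pdf(path: Path) -> str:
    return f"subdoc-summary:{path.name}"

def embed_text(path: Path) -> str:
    return f"simple-chunks:{path.name}"

HANDLERS: dict[str, Callable[[Path], str]] = {
    ".pdf": embed_pdf,
    ".txt": embed_text,
    ".md": embed_text,
}

def scan_once(doc_dir: str, seen: dict[str, float]) -> list[str]:
    """One polling pass: process files that are new or have a newer mtime."""
    results = []
    for path in sorted(Path(doc_dir).iterdir()):
        handler = HANDLERS.get(path.suffix.lower())
        if handler is None:
            continue  # unknown file type, skip
        mtime = path.stat().st_mtime
        if seen.get(str(path)) == mtime:
            continue  # unchanged since the last pass
        results.append(handler(path))
        seen[str(path)] = mtime
    return results

def watch(doc_dir: str, interval: float = 5.0) -> None:
    """Poll forever; in production you might use the watchdog library instead."""
    seen: dict[str, float] = {}
    while True:
        scan_once(doc_dir, seen)
        time.sleep(interval)
```

Mtime comparison is the cheapest change check; the hash-based approach in the other answer is more robust if files can be rewritten with identical timestamps.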
SubDocSummaryPack doesn’t handle persistence during setup, but there’s a workaround: grab the generated nodes after processing. I hit this same issue with technical docs that took forever to re-embed.
Extract the nodes from the pack’s internal structure, then create a new VectorStoreIndex with your ChromaDB storage context. The pack stores all processed chunks (parent and child) in its docstore - access them through summary_pack.index.docstore.docs.values().
I set up a simple file hash check to see if documents changed since the last processing run. Only re-run SubDocSummaryPack when source files actually change; otherwise load from ChromaDB.
The hierarchical chunking from SubDocSummaryPack works great in ChromaDB and keeps the parent-child relationships intact for better retrieval. One gotcha - use the same embedding model when loading the existing index as you did during creation, or you’ll get dimension mismatches.
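A minimal version of that hash check needs only the standard library. The manifest filename is a placeholder of mine, and the returned list is what you'd feed into the "re-run SubDocSummaryPack or load from ChromaDB" decision:

```python
import hashlib
import json
from pathlib import Path

MANIFEST = Path("embed_manifest.json")  # hypothetical manifest location

def file_hash(path: Path) -> str:
    """SHA-256 of a file's contents, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def changed_files(doc_dir: str) -> list[Path]:
    """Return documents whose hash differs from the stored manifest,
    then rewrite the manifest with the current hashes."""
    old = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    current = {}
    changed = []
    for path in sorted(Path(doc_dir).glob("*")):
        digest = file_hash(path)
        current[str(path)] = digest
        if old.get(str(path)) != digest:
            changed.append(path)
    MANIFEST.write_text(json.dumps(current, indent=2))
    return changed

# Usage: only re-run SubDocSummaryPack when changed_files("documents")
# is non-empty; otherwise load the index from ChromaDB as shown above.
```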