I’m just getting started with Microsoft’s Semantic Kernel framework and want to create a retrieval-augmented generation system that works with documents stored locally on my machine.
I’ve seen examples that demonstrate how to set this up using OpenAI’s services via the KernelMemoryBuilder class and its WithOpenAIDefaults configuration method, but I specifically need to work with documents stored locally rather than relying on cloud-based services.
I’ve looked at several documentation examples and code repositories that show RAG implementations, but most of them seem to rely on OpenAI’s API services. What I’m looking for is a way to process and query my local documents without making external API calls.
Can someone point me in the right direction for implementing this kind of offline RAG system? Any code examples or guidance would be really helpful since I’m still learning the framework.
Everyone’s overcomplicating this. I wasted weeks on custom ITextEmbeddingGeneration implementations and Python interop headaches before realizing there’s a simpler way.
Just automate the whole pipeline. Set up a workflow that watches your document folders, chunks files automatically, generates embeddings with your local model, and dumps everything into your vector database. For queries, another flow handles retrieval and feeds results to your LLM.
I built this for our knowledge base. The automation handles document updates, re-indexing when files change, and switching embedding models when I want to test new approaches. No manual interface coding or dimension mismatch debugging.
You can wire together Ollama, SQLite, ONNX models - whatever you’re using - without writing custom connectors. Define your flows and you’re done.
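For readers who would rather script the same flow than use a hosted automation tool, here is a minimal sketch of the watch-chunk-embed-store loop described above, in plain Python. The embed() stub stands in for whatever local model you run (Ollama, ONNX, etc.), and all names here are illustrative, not part of any framework.

```python
import os
import hashlib

def embed(text):
    # Placeholder: replace with a call to your local embedding model.
    # (Here we just derive 4 deterministic floats from a hash.)
    return [float(b) for b in hashlib.md5(text.encode()).digest()[:4]]

def chunk(text, size=200):
    # Naive fixed-size chunking; swap in sentence-aware splitting as needed.
    return [text[i:i + size] for i in range(0, len(text), size)]

class Index:
    """Tracks file mtimes so only changed documents get re-embedded."""

    def __init__(self):
        self.mtimes = {}   # path -> mtime at last indexing pass
        self.store = {}    # (path, chunk_no) -> (chunk_text, vector)

    def sync(self, folder):
        """Re-index any file whose mtime changed since the last pass."""
        for name in os.listdir(folder):
            path = os.path.join(folder, name)
            mtime = os.path.getmtime(path)
            if self.mtimes.get(path) == mtime:
                continue  # unchanged since last pass, skip
            with open(path, encoding="utf-8") as f:
                text = f.read()
            # Drop stale chunks for this file, then re-chunk and re-embed.
            self.store = {k: v for k, v in self.store.items() if k[0] != path}
            for n, c in enumerate(chunk(text)):
                self.store[(path, n)] = (c, embed(c))
            self.mtimes[path] = mtime
```

Run `sync()` on a timer or from a filesystem watcher; queries then just search `store`.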
Saved me 40+ hours of debugging and maintenance. Check it out: https://latenode.com
I’ve been running offline RAG with Semantic Kernel for six months. The game changer was Microsoft’s ONNX runtime with quantized models - crazy fast on CPU, no GPU needed. For docs, I built a custom memory store using SQLite with the FTS5 extension; it works great for smaller collections.

The biggest pain was matching embedding dimensions when I ditched OpenAI’s text-embedding-ada-002 for local models like BGE or E5. You’ll need your own IMemoryStore implementation - skip the WithOpenAIDefaults method completely. The Semantic Kernel docs on custom memory connectors saved me tons of debugging time.

Sure, it’s slower than cloud solutions, but when you can’t let sensitive docs leave your network, it’s solid.
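To make the SQLite/FTS5 idea concrete, here is a tiny standalone sketch of the keyword-search side (shown in Python for brevity; a C# store via Microsoft.Data.Sqlite would issue the same SQL). It assumes your SQLite build includes FTS5, which CPython’s bundled one normally does; table and column names are made up for the example.

```python
import sqlite3

# In-memory DB for the sketch; point this at a file for persistence.
con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE chunks USING fts5(doc_id, body)")
con.executemany(
    "INSERT INTO chunks VALUES (?, ?)",
    [("doc1", "semantic kernel supports custom memory connectors"),
     ("doc2", "onnx runtime runs quantized models on cpu")],
)
# FTS5 MATCH does ranked keyword retrieval; bm25() orders by relevance.
rows = con.execute(
    "SELECT doc_id FROM chunks WHERE chunks MATCH ? ORDER BY bm25(chunks)",
    ("quantized",),
).fetchall()
print(rows)  # → [('doc2',)]
```

A real IMemoryStore would add a BLOB column for the embedding vector and combine this keyword score with cosine similarity.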
I just dealt with this same issue building an offline RAG system with Semantic Kernel. Here’s what worked for me: use Hugging Face Sentence Transformers through Python interop for embeddings, then pair that with Chroma as your vector database (or just SQLite if you want something simpler).

You’ll need to replace OpenAI’s embedding service with your local setup - implement the ITextEmbeddingGeneration interface to wrap your local model. For the LLM, I went with Ollama running Mistral or Llama locally. The document processing stays pretty much the same, but you’ll handle chunking and embedding storage yourself. Microsoft’s custom connector docs were super helpful for wiring everything together without cloud dependencies.
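The chunking-and-local-embedding part of this answer can be sketched in a few lines of Python (the C# version wraps the same logic behind ITextEmbeddingGeneration). The Ollama endpoint shown is its standard local embeddings API, but the model name and URL are assumptions for your install.

```python
import json
import urllib.request

def chunk_text(text, size=500, overlap=50):
    """Fixed-size chunks with overlap, so sentences split across a
    boundary still appear whole in at least one chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed_local(text, model="nomic-embed-text",
                url="http://localhost:11434/api/embeddings"):
    """Ask a locally running Ollama server for an embedding vector.
    Model name and URL are assumptions; adjust for your setup."""
    req = urllib.request.Request(
        url,
        data=json.dumps({"model": model, "prompt": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]
```

Store each `(chunk, embed_local(chunk))` pair in whatever backend you chose (Chroma, SQLite, in-memory); note that the vector length depends on the model, which is exactly the dimension-mismatch issue mentioned elsewhere in this thread.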
You might want to check out local embedding models like all-MiniLM-L6-v2 via ONNX, or run Llama locally with Ollama. There are also offline examples in the Semantic Kernel samples repo that don’t rely on cloud APIs. Hope this helps!
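Whichever local model you pick, query-time retrieval is just nearest-neighbour search over the stored vectors. A dependency-free sketch (the `store` shape and function names are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, store, k=3):
    """store: list of (chunk_text, vector) pairs from your index.
    Returns the k chunk texts most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]
```

Feed the returned chunks into your local LLM’s prompt as context; that is the whole “retrieval-augmented” step.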