Troubleshooting RAG issues with complex philosophical texts

Hello everyone! I’m facing some challenges with my RAG configuration while trying to work through a difficult philosophy book, Hegel’s Science of Logic, which is extremely dense and abstract in nature.

I’m asking questions about key concepts in the book but am receiving incorrect answers. For instance, when I inquire about whether mass is classified as extensive or intensive magnitude, my system without RAG provides entirely wrong information. While RAG improves the accuracy, it still struggles with some questions.

Here’s a glimpse of my setup:

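import os

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.documents import Document
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
import gradio as gr

# Setup API keys
os.environ["OPENAI_API_KEY"] = "..."

# Initialize components
model = ChatOpenAI(model="gpt-4o-mini-2024-07-18", temperature=0)
embedding_model = OpenAIEmbeddings(model="text-embedding-3-large")
vector_db = InMemoryVectorStore(embedding_model)

# Prompt used in create_answer below. The original post did not show its
# definition, so this wording is an assumed placeholder.
system_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question using only the context below. "
               "If the context is insufficient, say so.\n\nContext:\n{context}"),
    ("human", "{query}"),
])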

# Load and process document
doc_loader = PyPDFLoader("philosophy_book.pdf")
raw_docs = doc_loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_docs = splitter.split_documents(raw_docs)
vector_db.add_documents(documents=split_docs)

# Define workflow state
class WorkflowState(TypedDict):
    query: str
    retrieved_context: List[Document]
    final_response: str

# Retrieval function
def fetch_relevant_docs(state: WorkflowState):
    matching_docs = vector_db.similarity_search(state["query"])
    return {"retrieved_context": matching_docs}

# Generation function  
def create_answer(state: WorkflowState):
    combined_context = "\n\n".join(doc.page_content for doc in state["retrieved_context"])
    prompt_messages = system_prompt.invoke({"query": state["query"], "context": combined_context})
    llm_response = model.invoke(prompt_messages)
    return {"final_response": llm_response.content}

# Build workflow
workflow_builder = StateGraph(WorkflowState).add_sequence([fetch_relevant_docs, create_answer])
workflow_builder.add_edge(START, "fetch_relevant_docs")
final_workflow = workflow_builder.compile()

# Chat interface
def chat_handler(user_message, chat_history):
    result = final_workflow.invoke({"query": user_message})
    return result["final_response"]

gr.ChatInterface(
    chat_handler,
    title="Philosophy Text Assistant"
).launch()

I find that while the system answers some questions correctly, it fails on others. I’ve observed that the PDF has strange formatting where italicized text appears with spaced-out characters, which may confuse the model. I’ve also tried increasing the chunk size to 2000 and sorting retrieved documents by page number, but issues persist.

Has anyone else experienced similar difficulties with dense academic texts? I’d appreciate any suggestions for improving text processing or retrieval methods.

The preprocessing nightmare you described is exactly why I ditched manual chunking. Philosophy texts are brutal - they reference concepts from 200 pages back, and traditional RAG can’t handle those relationships.

I hit the same wall with dense economics papers where context was everything. What worked? An automated pipeline that handles the whole flow differently.

Instead of wrestling with PDF formatting and chunk sizes, I built a system that auto-cleans text, creates multiple embedding strategies, and handles cross-references between concepts. The key was automating preprocessing, retrieval logic, and context assembly.

For Hegel, you need something that tracks conceptual relationships across the entire work, not just similar text chunks. Mine automatically identifies when concepts get defined in one section and referenced elsewhere, then pulls both contexts together.

The automation handles your edge cases - italic spacing, logical argument flows, cross-referencing between different parts of dialectical arguments.

You can build this kind of intelligent document processing pipeline pretty easily without coding from scratch: https://latenode.com

Philosophical texts wreck embeddings. Same thing happened with Kant’s Critique - it’s not just formatting, it’s how these authors keep redefining terms throughout the work. I’d try hybrid search (keyword + semantic) since it handles concept drift better. Also swap to Claude with a bigger context window for generation. GPT-4o-mini can’t handle complex philosophical reasoning even when you feed it good context.
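
If you want to try hybrid search without committing to a particular library, one common way to merge a keyword ranking with a semantic ranking is reciprocal rank fusion. This is only a rough sketch: the chunk ids and both result lists are made up, and `k=60` is just a commonly used constant, not something tuned for Hegel.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk ids into one ranked list.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    so chunks ranked well by both retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from a keyword search and a vector search.
keyword_hits = ["c12", "c40", "c07"]
semantic_hits = ["c40", "c33", "c12"]
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Chunks that show up in both lists (`c40`, `c12`) end up ahead of chunks that only one retriever found, which is the behaviour you want when an exact term like "intensive magnitude" matters as much as semantic closeness.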

Complex philosophical texts break standard RAG because they work completely differently than technical docs. I hit this same wall working through phenomenology papers where concepts kept evolving throughout the text.

Your problem isn’t just technical - it’s conceptual. Hegel doesn’t define intensive magnitude once and call it done. He builds the concept across sections, refines it, then connects it to other ideas later. When your retrieval grabs chunks about mass as intensive magnitude, it’s missing the developmental logic that makes his whole argument work.

What saved me: retrieval that follows philosophical structure. After I get initial matches, I automatically pull the sections right before and after from the same chapters. Philosophy unfolds step by step - you can’t just grab random passages.

For the formatting mess, run your text through basic cleanup before chunking. Replace multiple spaces with singles and fix broken italics. Those PDF artifacts mess up your embeddings and tank retrieval.

Also heads up - gpt-4o-mini struggles with dialectical reasoning even with perfect context. These texts need models that can handle complex logical relationships between abstract concepts.

Had the same headache with mathematical philosophy papers. Your chunk size tests were smart, but you’re overcomplicating it.

Philosophical arguments connect across entire chapters. Ask about extensive vs intensive magnitude? You’ll need the definition from chapter 2, an example from chapter 7, and some distinction Hegel drops later on. Your retrieval grabs the closest matches and completely misses these connections.

I fixed this by expanding search after finding initial matches. Get your top results, then pull extra chunks from those same sections plus adjacent pages. Philosophy books aren’t dictionaries - they’re layered arguments.
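
Roughly what that expansion step looks like, as a sketch rather than my exact code. It assumes your chunks sit in a list in reading order (like `split_docs` in the question) and that you can map each retrieved chunk back to its position in that list, for example by storing an index in the chunk metadata before adding it to the store. The function name and the `window` parameter are made up for illustration.

```python
def expand_with_neighbors(hit_indices, total_chunks, window=1):
    """Return sorted chunk indices for the hits plus `window` chunks on each side."""
    selected = set()
    for idx in hit_indices:
        start = max(0, idx - window)               # don't run off the front
        end = min(total_chunks - 1, idx + window)  # or off the back
        selected.update(range(start, end + 1))
    # Sorting restores reading order, so the argument reaches the model in sequence.
    return sorted(selected)


# Hypothetical example: 10 chunks in the book, retrieval matched chunks 0, 4 and 5.
expanded = expand_with_neighbors([0, 4, 5], total_chunks=10, window=1)
print(expanded)
```

Overlapping neighbourhoods get merged by the set, so adjacent hits don't produce duplicate context.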

For the formatting nightmare, clean your text before embedding. Strip extra spaces, fix punctuation, kill those italic artifacts. Five minutes upfront beats hours of debugging garbage results.
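
Something along these lines is a reasonable starting point for the cleanup, assuming the italic artifacts look like single letters separated by single spaces. The pattern and function name are my own guess at your PDF's quirks, so check the output on a few real pages first.

```python
import re

# A run of three or more single letters separated by single spaces,
# e.g. "q u a n t u m". Assumed to be letter-spaced italics from the PDF.
SPACED_RUN = re.compile(r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b")


def clean_extracted_text(text):
    # Join letter-spaced runs back into words.
    text = SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)
    # Collapse repeated spaces and tabs into a single space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()


sample = "The  i n t e n s i v e  magnitude   is a degree."
cleaned = clean_extracted_text(sample)
print(cleaned)
```

One caveat: if two letter-spaced words are separated by only a single space, this pattern glues them into one word, and it will also join genuine runs of single letters such as "a b c". You would apply it to each page's `page_content` before splitting.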

Also - grab more chunks. Don’t use the default 4. Pull 8-10 for complex questions. Hegel builds concepts brick by brick and your system needs way more context to connect his definitions.

Dense philosophical texts are brutal for RAG systems. Your Hegel example reminds me of the nightmare I had processing Wittgenstein’s later works. It’s not just a tech problem - these authors don’t deal in discrete facts, they build arguments slowly over pages.

What worked for me was ditching pure similarity search. I switched to multi-stage retrieval: find potentially relevant sections first, then grab surrounding context from those same chapters. You need that argumentative flow for philosophical reasoning to make sense.

For your formatting issues - preprocess those chunks before they hit the LLM. Basic regex to fix spacing and normalize text formatting stops the model from choking on OCR garbage. Worth the extra work when you’re dealing with scanned academic texts.

Also, philosophical concepts need definitional context that’s often buried way earlier in the text. Your 200-character chunk overlap is way too small for this stuff.

I’ve dealt with the same headaches on dense academic texts, especially Spinoza’s Ethics. That spacing issue with italicized text happens all the time with PDF extractions from academic books. I fixed it by running a preprocessing step to clean up the text before chunking - just strip out excessive whitespace between characters and standardize the formatting.

For philosophical stuff like Hegel, semantic chunking beats fixed character splitting every time. These texts have logical argument structures that get destroyed when you split randomly. Try sentence-based chunking or paragraph-level splits since philosophical arguments usually span multiple sentences but stay coherent within paragraphs.

I also changed up my retrieval strategy. Instead of straight similarity search, try MMR (Maximum Marginal Relevance) to grab more diverse chunks. Hegel’s concepts build on each other across different sections, so you need varied context instead of just the most similar passages.
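
To show what MMR actually does, here is a small pure-Python sketch with made-up 2-D vectors. Real embeddings have far more dimensions, and the function name and `lambda_mult` default are illustrative choices rather than any library's API. Many LangChain vector stores expose a `max_marginal_relevance_search` method, so check whether yours implements it before rolling your own.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, chunk_vecs, k=3, lambda_mult=0.5):
    """Pick up to k chunk indices, trading query relevance against redundancy."""
    remaining = list(range(len(chunk_vecs)))
    selected = []
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, chunk_vecs[i])
            # Similarity to the closest chunk already picked (0 if none yet).
            redundancy = max(
                (cosine(chunk_vecs[i], chunk_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy vectors: chunks 0 and 1 are near-duplicates, chunk 2 covers different ground.
query = [1.0, 0.0]
chunks = [[1.0, 0.1], [1.0, 0.12], [1.0, -1.0]]
picked = mmr_select(query, chunks, k=2)
print(picked)
```

With `lambda_mult=0.5` the second pick skips the near-duplicate and takes the more distinct chunk, while `lambda_mult=1.0` falls back to plain similarity ranking.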