Help building a custom RAG solution with the Mistral 7B model

I’m relatively new to LangChain and could use some help.

Overview of My Project

I’m building an AI assistant for a movie theater application called CinemaHub. The theater has two screening rooms (Room X and Room Y) where films are shown twice a day, once in the afternoon and once in the evening, across specific date ranges.

The assistant’s role is to give users details about movies, including pricing, showtimes, seat availability, and age restrictions. Eventually I want it to handle bookings and cancellations, but right now the basic retrieval functionality isn’t working correctly.

Current Implementation

Right now I’m using two separate pipelines chained together:

from transformers import pipeline
from langchain_community.llms import HuggingFacePipeline

# deterministic pipeline used for query rephrasing
query_pipeline = pipeline(
    model=mistral_llm,
    tokenizer=tok,
    task="text-generation",
    temperature=0.0,
    do_sample=False,
    repetition_penalty=1.1,
    return_full_text=False,
    max_new_tokens=800,
)
query_llm = HuggingFacePipeline(pipeline=query_pipeline)

# sampling pipeline used for generating the final answer
answer_pipeline = pipeline(
    model=mistral_llm,
    tokenizer=tok,
    task="text-generation",
    do_sample=True,
    repetition_penalty=1.1,
    return_full_text=False,
    max_new_tokens=4000,
)
answer_llm = HuggingFacePipeline(pipeline=answer_pipeline)

Following this, I have established my chain:

from operator import itemgetter

from langchain.memory import ConversationBufferMemory
from langchain_core.messages import get_buffer_string
from langchain_core.prompts import PromptTemplate, format_document
from langchain_core.runnables import RunnableLambda, RunnablePassthrough

DOC_TEMPLATE = PromptTemplate.from_template(template="{page_content}")

def merge_docs(documents, doc_template=DOC_TEMPLATE, separator="\n\n"):
    # join the retrieved documents into one context string for the prompt
    formatted_docs = [format_document(doc, doc_template) for doc in documents]
    return separator.join(formatted_docs)

chat_memory = ConversationBufferMemory(
    return_messages=True, output_key="response", input_key="query"
)

memory_loaded = RunnablePassthrough.assign(
    history=RunnableLambda(chat_memory.load_memory_variables) | itemgetter("history"),
)

rephrased_query = {
    "rephrased_query": {
        "query": lambda x: x["query"],
        "history": lambda x: get_buffer_string(x["history"]),
    }
    | REPHRASE_PROMPT
    | query_llm,
}

fetched_docs = {
    "documents": itemgetter("rephrased_query") | doc_retriever,
    "query": lambda x: x["rephrased_query"],
}

final_prompt_inputs = {
    "context": lambda x: merge_docs(x["documents"]),
    "query": itemgetter("query"),
}

final_response = {
    "response": final_prompt_inputs | RESPONSE_PROMPT | answer_llm,
    "query": itemgetter("query"),
    "context": final_prompt_inputs["context"]
}

full_chain = memory_loaded | rephrased_query | fetched_docs | final_response
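
I then run the chain one turn at a time, roughly like this (and write the exchange back into memory myself, since nothing in the chain does it):

user_input = {"query": "What movies are showing in Room X this week?"}
result = full_chain.invoke(user_input)
print(result["response"])

# persist the turn so follow-up questions can reference it
chat_memory.save_context(user_input, {"response": result["response"]})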

My Data Format

I have generated a JSON file that includes 21 movies structured as follows:

[
  {
    "title": "Avatar: The Way of Water",
    "tagline": "Return to the world of Pandora",
    "synopsis": "Set more than a decade after the events of the first film, Avatar: The Way of Water begins to tell the story of the Sully family...",
    "actors": ["Sam Worthington", "Zoe Saldana", "Sigourney Weaver"],
    "category": "Sci-Fi",
    "duration": "3h 12min",
    "showDates": {
      "from": "2024-07-01",
      "to": "2024-07-15"
    },
    "matinee": {
      "startTime": "15:30",
      "cost": 12,
      "discounted_cost": 8
    },
    "evening": {
      "startTime": "20:30",
      "cost": 18,
      "discounted_cost": 12
    },
    "standardSeats": {
      "open": 180,
      "booked": 95
    },
    "accessibleSeats": {
      "open": 12,
      "booked": 3
    },
    "allSeats": {
      "open": 192,
      "booked": 98
    },
    "theater": "Room X",
    "rating": "PG-13"
  }
]

Key Challenges

  1. The retriever doesn’t seem to return the most relevant documents.
  2. Mistral sometimes gets confused by long prompts.
  3. I’m not sure whether to use similarity search, MMR, or another method.

I convert each film into a text summary and split it with RecursiveCharacterTextSplitter (chunk size 1200). The embeddings are BAAI/bge-large-en-v1.5, stored in ChromaDB.
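
For reference, my indexing step looks roughly like this (the summary string, file name, and chunk_overlap are simplified placeholders):

import json

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

with open("movies.json") as f:
    movies = json.load(f)

# one text summary per movie (simplified version of what I actually build)
docs = [
    Document(
        page_content=(
            f"{m['title']} ({m['rating']}, {m['category']}): {m['synopsis']} "
            f"Showing in {m['theater']} from {m['showDates']['from']} to {m['showDates']['to']}."
        )
    )
    for m in movies
]

splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=100)
chunks = splitter.split_documents(docs)

embeddings = HuggingFaceBgeEmbeddings(model_name="BAAI/bge-large-en-v1.5")
vectorstore = Chroma.from_documents(chunks, embeddings)
doc_retriever = vectorstore.as_retriever()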

Can anyone provide insights into what might be going wrong? Any advice would be greatly appreciated!

Had the same issue with a booking system last year. Your problem’s in the document structure, not chunking size.

JSON objects suck when you convert them to text summaries. Don’t create one text block per movie - break each movie into separate docs for different query types.

Split into individual chunks:

  • Movie details (title, synopsis, actors, rating)
  • Scheduling (dates, times, theater room)
  • Pricing (costs, discounts)
  • Availability (seat counts)

When someone asks “What’s playing tonight in Room X?”, the retriever finds scheduling chunks instead of full movie descriptions.
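
Rough sketch of what I mean, using the field names from your JSON (movies is just your parsed list):

from langchain_core.documents import Document

def movie_to_docs(m):
    # one small document per aspect, sharing metadata so you can trace them back
    meta = {"title": m["title"], "theater": m["theater"]}
    return [
        Document(
            page_content=(
                f"{m['title']} ({m['rating']}, {m['category']}): {m['synopsis']} "
                f"Starring {', '.join(m['actors'])}."
            ),
            metadata={**meta, "aspect": "details"},
        ),
        Document(
            page_content=(
                f"{m['title']} plays in {m['theater']} from {m['showDates']['from']} to "
                f"{m['showDates']['to']}, matinee at {m['matinee']['startTime']} and "
                f"evening at {m['evening']['startTime']}."
            ),
            metadata={**meta, "aspect": "schedule"},
        ),
        Document(
            page_content=(
                f"{m['title']} pricing: matinee {m['matinee']['cost']} "
                f"(discounted {m['matinee']['discounted_cost']}), evening {m['evening']['cost']} "
                f"(discounted {m['evening']['discounted_cost']})."
            ),
            metadata={**meta, "aspect": "pricing"},
        ),
        Document(
            page_content=(
                f"{m['title']} seats: {m['standardSeats']['open']} standard and "
                f"{m['accessibleSeats']['open']} accessible seats open, "
                f"{m['allSeats']['booked']} booked in total."
            ),
            metadata={**meta, "aspect": "availability"},
        ),
    ]

docs = [d for m in movies for d in movie_to_docs(m)]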

For the Mistral confusion - your REPHRASE_PROMPT is too aggressive. Most queries about showtimes and pricing are already clear. Skip rephrasing for simple questions, only use it for complex follow-ups.
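
One way to wire that up is a RunnableBranch. Rough sketch, with a cheap "is there any history" check standing in for whatever heuristic you prefer:

from langchain_core.runnables import RunnableBranch, RunnableLambda

# only rephrase when there is prior conversation to resolve references against
maybe_rephrase = RunnableBranch(
    (
        lambda x: len(x["history"]) > 0,  # placeholder heuristic for "follow-up question"
        {
            "query": lambda x: x["query"],
            "history": lambda x: get_buffer_string(x["history"]),
        }
        | REPHRASE_PROMPT
        | query_llm,
    ),
    RunnableLambda(lambda x: x["query"]),  # default: keep the original query untouched
)

rephrased_query = {"rephrased_query": maybe_rephrase}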

Try MMR with k=6 and fetch_k=20. Similarity search gives duplicate movie info, but MMR grabs different aspects (pricing vs showtimes vs availability) for the same query.
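
With a Chroma vector store that’s just the search_type switch on the retriever:

doc_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 6, "fetch_k": 20},
)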

One more thing - you’ve got query pipeline at temperature 0.0 but answer pipeline is sampling. Pick one: make both deterministic or both creative, not mixed.
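
If you want both deterministic, for example, drop the sampling from the answer pipeline:

answer_pipeline = pipeline(
    model=mistral_llm,
    tokenizer=tok,
    task="text-generation",
    do_sample=False,  # greedy decoding, same as the query pipeline
    repetition_penalty=1.1,
    return_full_text=False,
    max_new_tokens=4000,
)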

Your setup’s solid but way too complex. Running two Mistral pipelines locally is killing your resources and causing inconsistency problems.

I hit the same wall with a restaurant booking system. The chunk size and retrieval method aren’t your real problems - it’s all that manual pipeline management.

I moved everything to Latenode and it fixed this mess. Connect ChromaDB directly, automate document processing, and handle query rephrasing without juggling two LLM instances.

Latenode handles prompt optimization automatically. No more Mistral choking on long prompts - it manages context length smartly. You can A/B test similarity vs MMR retrieval without rewriting anything.

Just push your movie theater JSON straight into Latenode’s processor. It’ll chunk everything and create separate retrieval paths for movie info and booking details automatically.

Your current chain becomes a simple workflow with built-in memory management. Done with manual pipeline headaches.

Check it out: https://latenode.com

I’ve dealt with similar entertainment RAG systems and hit the same retrieval problems.

Your 1200 chunk size is probably too big for JSON movie data. I dropped mine to 300-500 tokens and made sure each chunk had complete info for one movie, and relevance got way better.

For embeddings, try creating multiple versions of each movie doc: one for plot/actor queries, another for booking stuff like showtimes and prices. This worked great when users asked specific availability questions vs general movie info.

On the Mistral issue with long prompts: bump your temperature to 0.1-0.3 instead of 0.0 on query rephrasing. Zero makes responses too rigid.

Also, your rephrasing might be making simple queries worse. Log what the rephrased queries actually look like and see if they’re helping or hurting retrieval.
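
For the logging, something simple tacked onto your existing rephrase step is enough (print here, swap in real logging if you prefer):

from langchain_core.runnables import RunnableLambda

def log_rephrased(q):
    # show every rephrased query so you can compare it with the user's original
    print(f"[rephrased query] {q}")
    return q

rephrased_query = {
    "rephrased_query": {
        "query": lambda x: x["query"],
        "history": lambda x: get_buffer_string(x["history"]),
    }
    | REPHRASE_PROMPT
    | query_llm
    | RunnableLambda(log_rephrased),
}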