Endless execution when using LangChain DocumentQA on MacBook M3 processor

I have a MacBook Pro with an M3 Pro chip and 18GB of memory, and I'm building a retrieval-augmented generation setup using LangChain with the Llama-3.2-3B-Instruct model. My vector database is Milvus and I'm working in a Jupyter notebook.

The Problem: When I call DocumentQA.from_chain_type, my notebook cell runs forever. I waited about 15 minutes and it never completed.

from langchain.chains import DocumentQA

qa_system = DocumentQA.from_chain_type(
    llm=language_model,
    retriever=document_retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": custom_prompt}
)
result = qa_system.invoke({"query": user_question})

Here’s my custom LLM wrapper:

from langchain.llms.base import LLM
from typing import Any, List, Optional
from pydantic import PrivateAttr

class CustomHFLLM(LLM):
    _model_pipeline: Any = PrivateAttr()

    def __init__(self, pipeline):
        super().__init__()
        self._model_pipeline = pipeline

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        output = self._model_pipeline(prompt, num_return_sequences=1)
        return output[0]["generated_text"]

    @property
    def _identifying_params(self):
        return {"name": "CustomHFLLM"}

    @property
    def _llm_type(self):
        return "custom"

language_model = CustomHFLLM(pipeline=text_pipeline)

My pipeline setup:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=auth_token)
model = AutoModelForCausalLM.from_pretrained(model_id, use_auth_token=auth_token)

text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=0,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    truncation=True
)

Prompt configuration:

template_string = """
You are an AI assistant. Answer the question based on the provided context.
If the context doesn't contain enough information, say you don't know.

Context:
{context}

Question:
{question}

Response:
"""

from langchain.prompts import PromptTemplate

custom_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=template_string
)

Custom retriever class:

from typing import Any, Callable, List

import numpy as np
from langchain.schema import BaseRetriever, Document
from pydantic import BaseModel

class CustomMilvusRetriever(BaseRetriever, BaseModel):
    milvus_collection: Any
    embed_func: Callable[[str], np.ndarray]
    content_field: str
    vector_field: str
    result_count: int = 5

    def get_relevant_documents(self, query: str) -> List[Document]:
        query_vector = self.embed_func(query)
        
        search_config = {"metric_type": "IP", "params": {"nprobe": 10}}
        search_results = self.milvus_collection.search(
            data=[query_vector],
            anns_field=self.vector_field,
            param=search_config,
            limit=self.result_count,
            output_fields=[self.content_field]
        )
        
        docs = []
        for match in search_results[0]:
            docs.append(
                Document(
                    page_content=match.entity.get(self.content_field),
                    metadata={"score": match.distance}
                )
            )
        return docs
    
    async def aget_relevant_documents(self, query: str) -> List[Document]:
        return self.get_relevant_documents(query)

document_retriever = CustomMilvusRetriever(
    milvus_collection=my_collection,
    embed_func=embedding_model.embed_query,
    content_field="text",
    vector_field="embedding",
    result_count=5
)

I confirmed MPS acceleration works:

import torch
if torch.backends.mps.is_available():
    print("MPS acceleration available")

Update: Adding verbose mode shows the chain enters successfully and formats the prompt correctly with retrieved context, but then hangs at the LLM generation step.

Any ideas what could cause this infinite loop?

This looks like a memory allocation deadlock on M3 chips when running large model inference. I've hit the same thing with Llama models - the pipeline just hangs forever during generation. The way the transformers pipeline handles memory on MPS is the likely culprit.

Add torch_dtype=torch.float16 when loading your model and set low_cpu_mem_usage=True in the AutoModelForCausalLM call. This cuts memory pressure way down.

Also check your tokenizer padding setup. If the pad token is missing, add tokenizer.pad_token = tokenizer.eos_token and include padding=True in your pipeline. Without a proper padding token, generation can loop forever.

One more thing - your custom prompt might be creating circular generation. Strip that "Response:" suffix from your template and let the model generate naturally. Instruction-tuned models sometimes get confused by explicit response indicators and get stuck in infinite loops trying to complete the format.
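Taken together, the loading changes might look like this - a sketch assuming the same model_id, auth_token, and text_pipeline names as in the question, untested on your exact setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=auth_token)

# Llama tokenizers ship without a pad token; reuse EOS so padded
# generation has something to fall back on.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # half-precision weights, roughly half the memory
    low_cpu_mem_usage=True,      # avoid materializing a second full copy while loading
    use_auth_token=auth_token,
)

text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    truncation=True,
    padding=True,
)
```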

Memory pressure is part of it, but there’s more going on. Your M3 is running out of unified memory during generation because you’re loading a 3B parameter model without quantization.

I hit this same issue when prototyping local inference. Fixed it by switching to 4-bit quantization with BitsAndBytesConfig:

import torch
from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    quantization_config=quant_config,
    use_auth_token=auth_token
)

This cuts memory usage by 75% and stops those hanging generation calls. Your 18GB should handle it fine.

One more tip - add a time limit to generation so you catch hangs early instead of waiting 15 minutes: pass max_time=30 (seconds) in your pipeline call.
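If you want a guard that works regardless of generation kwargs, you can also bound the call from the outside with a plain-Python timeout - a sketch using only the standard library; call_with_timeout is a hypothetical helper name, not a LangChain or transformers API:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def call_with_timeout(fn, *args, timeout=30.0, **kwargs):
    """Run fn in a worker thread; raise TimeoutError if it exceeds timeout.

    Note: the worker thread itself keeps running after a timeout - this
    only bounds how long the caller waits, which is enough to surface a
    hang in a notebook instead of blocking the cell forever.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout)
    finally:
        # wait=False so a hung worker doesn't block the caller on cleanup
        pool.shutdown(wait=False)
```

Usage would be something like `call_with_timeout(qa_system.invoke, {"query": user_question}, timeout=60)`.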

Check your tokenizer - a missing pad_token crashes Llama models. Add tokenizer.pad_token_id = tokenizer.eos_token_id before creating the pipeline. Also drop the top_p=0.9 parameter; it can hang generation on M3 chips with MPS.

Been there. M3 chips have memory allocation quirks that make LangChain hang during heavy inference loads.

The real problem? You’re running everything locally, which eats your RAM and processing power. Your custom LLM wrapper works fine, but the pipeline setup is choking your system.

Ditch the device mappings and memory management headaches. Offload this workflow to the cloud. I built a similar RAG system last month and moved it to automated cloud execution.

Set up your retrieval logic to trigger cloud-based LLM calls instead of local inference. Keep your Milvus search local but let the heavy lifting happen remotely. No more memory bottlenecks or hanging processes.

You can automate the entire flow - document retrieval, prompt formatting, LLM calls, and response handling. Setup takes about ten minutes and it runs reliably.

This scales way better than local execution and you won’t hit those M3 compatibility walls.

Set device=-1 instead of device=0 in your pipeline - MPS gets weird with device mapping. Also add do_sample=False to make generation deterministic; that often fixes hanging on M-series chips.

This is probably an MPS backend issue with the transformers pipeline on M3 chips. I've seen the same hanging with MPS devices that aren't properly set up for text generation.

First, isolate the problem by switching to CPU: set your pipeline to device='cpu' and see if it runs. If CPU works, you know it's MPS-related.

Your custom LLM wrapper might also be causing issues. Llama models usually include the original prompt in their output, but your _call method just returns the first generated sequence without stripping the input prompt. This produces malformed responses that confuse LangChain's downstream processing. Fix your _call method to return only the new content after the original prompt.

Finally, add some error handling and logging so you can see exactly where it's hanging.
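A minimal sketch of that prompt-stripping fix - strip_prompt is a hypothetical helper name, and it assumes the pipeline echoes the prompt at the start of generated_text, which is the HF text-generation pipeline's default behavior:

```python
def strip_prompt(prompt: str, generated_text: str) -> str:
    """Return only the newly generated continuation.

    HF text-generation pipelines prepend the input prompt to
    generated_text by default; LangChain expects just the answer.
    """
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text
```

Inside the wrapper, `return output[0]["generated_text"]` would become `return strip_prompt(prompt, output[0]["generated_text"])` - or you can pass `return_full_text=False` to the pipeline and skip the helper entirely.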