HuggingFace model returns input prompt along with response

DancingBird · August 23, 2025, 7:49am

I’m trying to build a ReAct agent using LangChain but running into problems. When I test the language model directly, the output includes both my original question and the actual response together.

Here’s what I’m seeing:

from langchain_community.llms import HuggingFaceHub

llm_model = HuggingFaceHub(repo_id="mistralai/Mixtral-8x7B-Instruct-v0.1")
result = llm_model.invoke("hi there, what's up?")
print(result)

The output I get is: “hi there, what’s up?\nHello! I’m doing well, thanks for asking.”

This behavior is causing issues with my agent setup since it expects clean responses. Has anyone encountered this before? Is there a way to configure the model to only return the generated text without echoing the input prompt?

RunningTiger · August 29, 2025, 6:00pm

Had this exact problem with my first HuggingFace agent. The Inference API does plain text generation: it continues whatever text you send and doesn’t apply the model’s instruction format for you. Most instruction-tuned models need the prompt wrapped in their instruction markers.

Try wrapping your prompt in Mixtral’s instruction format:

llm_model = HuggingFaceHub(repo_id="mistralai/Mixtral-8x7B-Instruct-v0.1")
formatted_prompt = "[INST] hi there, what's up? [/INST]"
result = llm_model.invoke(formatted_prompt)
print(result)

This shows the model where instructions end and response generation begins. Way cleaner than post-processing output and avoids weird edge cases where your prompt text shows up in responses. I’ve used this in production for months without issues.
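If you don't want to hand-write the markers every time, here's a minimal sketch of a helper. `format_mixtral_prompt` is a made-up name, and it only covers a single user turn; I'm assuming the beginning-of-sequence token gets added by the tokenizer on the server side.

```python
def format_mixtral_prompt(user_message: str) -> str:
    """Wrap a single user message in Mixtral's [INST] ... [/INST] markers.

    Hypothetical helper: single-turn only, and it assumes the BOS token
    is added by the tokenizer rather than by this function.
    """
    return f"[INST] {user_message.strip()} [/INST]"


formatted_prompt = format_mixtral_prompt("hi there, what's up?")
print(formatted_prompt)
```

You would then pass `formatted_prompt` to `llm_model.invoke` exactly as in the snippet above.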

evelynh · August 29, 2025, 1:46am

Been there, super annoying issue. This happens because the Mixtral model on HuggingFace Hub doesn’t automatically strip the input prompt from the output.

Quick fix - just remove the prompt yourself:

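llm_model = HuggingFaceHub(repo_id="mistralai/Mixtral-8x7B-Instruct-v0.1")
prompt = "hi there, what's up?"
result = llm_model.invoke(prompt)
# Only strip the prompt where it is echoed at the start of the output (Python 3.9+);
# replace() would also delete any later occurrence inside the real answer
clean_response = result.removeprefix(prompt).strip()
print(clean_response)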

For your ReAct agent, wrap this in a custom function or create a simple wrapper class that handles cleanup automatically. I’ve done this for similar setups and it works fine.
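Roughly what I mean by a wrapper, as a sketch rather than a drop-in LangChain class. `strip_prompt_echo` and `CleanLLM` are names I made up, and `fake_llm` just stands in for `llm_model.invoke` so the example runs without an API token.

```python
from typing import Callable


def strip_prompt_echo(prompt: str, output: str) -> str:
    """Remove the prompt only when it is echoed at the start of the output."""
    if output.startswith(prompt):
        output = output[len(prompt):]
    return output.strip()


class CleanLLM:
    """Hypothetical wrapper around any callable mapping a prompt to output text."""

    def __init__(self, generate: Callable[[str], str]):
        self._generate = generate

    def invoke(self, prompt: str) -> str:
        # Call the underlying model, then strip the echoed prompt.
        return strip_prompt_echo(prompt, self._generate(prompt))


def fake_llm(prompt: str) -> str:
    # Stand-in that mimics the echo behaviour described in the question.
    return prompt + "\nHello! I'm doing well, thanks for asking."


llm = CleanLLM(fake_llm)
print(llm.invoke("hi there, what's up?"))
```

With the real model you would build it as `CleanLLM(llm_model.invoke)`. To use it inside a LangChain agent you would still need to adapt it to whatever LLM interface your agent expects.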

Alternatively, try adding return_full_text=False in the model kwargs, though not all HuggingFace models respect this parameter.
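For reference, this is a sketch of where that flag sits in a text-generation request body for the Inference API. `build_generation_payload` is a hypothetical helper and nothing is sent over the network here. Whether LangChain forwards your `model_kwargs` into `parameters` like this depends on the version, so treat that part as an assumption.

```python
import json


def build_generation_payload(prompt: str, max_new_tokens: int = 100) -> dict:
    """Build a text-generation request body that asks for new text only."""
    return {
        "inputs": prompt,
        "parameters": {
            # Ask the backend not to prepend the prompt to the generated text.
            "return_full_text": False,
            "max_new_tokens": max_new_tokens,
        },
    }


payload = build_generation_payload("[INST] hi there, what's up? [/INST]")
print(json.dumps(payload, indent=2))
```

In LangChain the rough equivalent would be passing `model_kwargs={"return_full_text": False}` when constructing the model.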

mythicMuse · August 27, 2025, 5:44pm

Skip the manual prompt stripping and custom wrappers - automation’s the way to go.

I’ve hit this same wall on multiple projects. You get stuck with whatever weird format HuggingFace throws at you and end up writing hacky string replacements.

Now I route everything through Latenode. Set up a workflow that handles prompt cleaning, response formatting, and retry logic when models get cranky. You get real logging and can swap models without touching your code.

Best part? Define your cleaning rules once in the workflow and every call gets processed the same way. No more guessing if your string replacement caught everything or worrying about edge cases breaking your parser.

For ReAct agents this is perfect - you can add validation steps to verify response format before it hits your app.
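A validation step like that doesn't need much code wherever you run it. Here's a minimal sketch, assuming the usual ReAct text layout with `Action:` / `Action Input:` lines or a `Final Answer:` line; `validate_react_output` is a made-up name, and your agent's actual format may differ.

```python
import re

# Matches an "Action:" line followed by an "Action Input:" line.
ACTION_RE = re.compile(r"Action:\s*.+?\s*\nAction Input:\s*.+", re.DOTALL)


def validate_react_output(text: str) -> bool:
    """Return True if the text has exactly one of: a tool call or a final answer."""
    has_final = "Final Answer:" in text
    has_action = ACTION_RE.search(text) is not None
    # A well-formed step is either a tool call or a final answer, not both or neither.
    return has_final != has_action


step = "Thought: I should look this up\nAction: search\nAction Input: weather in Paris"
print(validate_react_output(step))
print(validate_react_output("Hello! I'm doing well, thanks for asking."))
```

The first call passes because it is a well-formed tool call; the second fails because a plain chat reply has neither an action nor a final answer.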

sofiag · August 27, 2025, 12:40pm

This happens because the HuggingFace Hub inference API treats your input as text to complete, not as a chat message. The model just continues from where your prompt ended.

I fixed this by ditching HuggingFaceHub and using the transformers library directly. Load the model locally with proper chat templates:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

messages = [{"role": "user", "content": "hi there, what's up?"}]
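inputs = tokenizer.apply_chat_template(messages, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=100)
# generate() returns the prompt tokens followed by the new tokens, so decode only the new ones
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)

Slicing off the prompt tokens before decoding is what gives you a clean response without the prompt echo. Loading Mixtral 8x7B locally needs a lot of memory, but this works reliably for agent apps where you need consistent output formatting.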