I’m currently using the meta-llama/Llama-2-7b-chat-hf model, but I keep getting responses that read like long multi-turn conversations instead of concise answers. This is my approach:
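Roughly this, calling the hosted Inference API from JavaScript (simplified sketch - placeholder token, no error handling):

```javascript
const HF_TOKEN = "hf_xxx"; // placeholder API token

// Minimal call to the hosted Inference API for the chat model.
async function query(payload) {
  const response = await fetch(
    "https://api-inference.huggingface.co/models/meta-llama/Llama-2-7b-chat-hf",
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${HF_TOKEN}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(payload),
    }
  );
  return response.json();
}

query({
  inputs: "User: What is your favorite color?",
  parameters: { max_length: 100 },
}).then((out) => console.log(out));
```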
However, rather than a brief reply, I get a long dialogue that keeps going with multiple invented voices. I’ve tried adding parameters like max_length, but it didn’t help. I’m building a chat-like interface, so I need the answers to be shorter and more manageable. Any suggestions on how to fix this?
Temperature’s your main issue here. 0.5 still lets the model get creative - drop it to 0.1 or 0.2 for tighter responses. You’re also missing system prompting entirely, so the model has no clue you want brief answers. Add a system message to your inputs, e.g. “System: You are a helpful assistant that gives brief, direct answers.\n\nUser: What is your favorite color?”. I also throw “Answer in one sentence:” at the start of prompts - works like a charm for forcing concise responses. The Inference API’s wonky with chat models anyway, since they expect conversation context that isn’t there.
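Concretely, something like this, reusing the query() helper from the question (the temperature and token cap are just starting points, not official values):

```javascript
// Same query() helper as in the question, now with a system-style preamble,
// a one-sentence instruction, and a much lower temperature.
query({
  inputs:
    "System: You are a helpful assistant that gives brief, direct answers.\n\n" +
    "User: Answer in one sentence: What is your favorite color?",
  parameters: {
    temperature: 0.1,        // 0.5 leaves too much room for rambling
    max_new_tokens: 60,      // hard cap on reply length (assumed value, tune it)
    return_full_text: false, // return only the generated continuation, not the echoed prompt
  },
}).then((out) => console.log(out[0].generated_text));
```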
Set repetition_penalty to 1.2-1.3 - it’ll cut down on those endless rambling responses. Also, heads up: HuggingFace’s Inference API sometimes ignores stop tokens, so you might need to manually trim the output in your JS code after you get the response.
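A sketch of both pieces, assuming the query() helper from the question and a “System:/User:” style prompt (the marker strings are an assumption - match them to whatever format you actually send):

```javascript
// Cut the reply off as soon as the model starts inventing a new turn.
function trimReply(text) {
  let trimmed = text;
  for (const marker of ["\nUser:", "\nSystem:"]) {
    const i = trimmed.indexOf(marker);
    if (i !== -1) trimmed = trimmed.slice(0, i);
  }
  return trimmed.trim();
}

query({
  inputs: "User: What is your favorite color?",
  parameters: {
    repetition_penalty: 1.2, // 1.2-1.3 per the suggestion above
    max_new_tokens: 80,      // assumed cap, adjust to taste
    return_full_text: false, // trim only the generated part, not the prompt
  },
}).then((out) => console.log(trimReply(out[0].generated_text)));
```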
Been wrestling with this for months too. Llama-2-chat doesn’t know when to shut up - it just keeps going way past what you actually want. Add eos_token_id and pad_token_id to your parameters so it actually respects its own stop tokens. You might also want to structure the prompt with clearer conversation boundaries (explicit User/Assistant turns). What fixed it for me was preprocessing the input to tell the model up front how long responses should be - the thing treats everything like one giant conversation unless you force it to stop with proper token management.
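For what it’s worth, my preprocessing looks roughly like this, again assuming the query() helper from the question (the wording, the sentence limit, and whether the hosted API actually forwards the token ids to generation are my own assumptions):

```javascript
// Preprocess the prompt so the model is told up front how long to answer,
// and pass explicit end/pad token ids as a best-effort hint.
function buildPrompt(userMessage, maxSentences = 2) {
  return (
    `System: You are a helpful assistant. Answer in at most ${maxSentences} ` +
    "sentences, then stop.\n\n" +
    `User: ${userMessage}\nAssistant:`
  );
}

query({
  inputs: buildPrompt("What is your favorite color?"),
  parameters: {
    max_new_tokens: 60,      // assumed cap on reply length
    eos_token_id: 2,         // Llama-2's end-of-sequence token id
    pad_token_id: 2,         // commonly set to the eos id, since Llama-2 has no pad token
    return_full_text: false, // only return the generated reply
  },
}).then((out) => console.log(out[0].generated_text.trim()));
```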