AI model generating excessively long dialogue responses

I’m using the google/flan-t5-large model through the Hugging Face Inference API and running into problems with the length of its outputs. It consistently returns long, conversational replies instead of brief, direct answers.

Here’s how I’ve implemented it:

// main.js
const AUTH_TOKEN = "your_token_here"

async function callModel() {
    const result = await fetch("https://api-inference.huggingface.co/models/google/flan-t5-large", {
        method: "POST",
        headers: {
            Authorization: `Bearer ${AUTH_TOKEN}`,
            "Content-Type": "application/json",
        },
        body: JSON.stringify({
            inputs: "How are you feeling today?",
        })
    });

    const output = await result.json();
    // text2text-generation responses come back as [{ generated_text: "..." }]
    console.log(output[0].generated_text.trim());
}

callModel();

And here’s my HTML setup:

<!DOCTYPE html>
<html>
    <head>
        <meta charset="UTF-8">
        <title>Test Page</title>
    </head>
    <body>
        <script src="main.js"></script>
    </body>
</html>

I’ve also tried adjusting various parameters:

parameters: {
    temperature: 0.5,
    top_p: 0.8,
    return_full_text: false,
    max_length: 50
}

Regardless of these adjustments, and even after trying different models, I’m still getting long, conversation-like outputs rather than succinct responses. I’m aiming to build a chat interface similar to existing AI assistants, so I really need the responses to be focused and relevant. Has anyone else run into this when trying to control output length?

Use max_new_tokens instead of max_length - it’s way more reliable for controlling output size. Also, flan-t5 works better when you give it specific instructions like ‘respond in one sentence:’ or ‘answer briefly:’ right at the beginning. Fixed the same problem for me.
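For reference, here’s roughly how that looks in your fetch body (the brevity prefix and the 30-token budget are just starting points to tune):

// Sketch of the request body with a brevity prefix in the prompt and
// max_new_tokens instead of max_length (values are guesses, adjust to taste)
body: JSON.stringify({
    inputs: "Answer briefly: How are you feeling today?",
    parameters: {
        max_new_tokens: 30,
        temperature: 0.5,
        top_p: 0.8,
    },
})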

Yeah, this is super common with FLAN-T5 models - they’re built for detailed instruction following, not quick chat responses. I’ve had better luck mixing temperature tweaks with prompt tricks instead of just adjusting parameters. Drop your temperature to around 0.2 and throw constraining phrases right into your prompt like ‘Answer in 5 words or less:’ before your actual question. What also worked for me was adding a post-processing step that cuts responses off at the first complete sentence or after X characters. The thing is, these models were trained on tasks that need detailed explanations, so you’re basically fighting the core design when you want short answers.
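The cut-off step can be as simple as something like this (the regex and the 120-character cap are arbitrary choices, pick whatever fits your UI):

// Keep only the first complete sentence, then hard-cap the length as a fallback
function trimReply(text, maxChars = 120) {
    const match = text.match(/^.*?[.!?](?=\s|$)/s);
    const short = (match ? match[0] : text).trim();
    return short.length > maxChars ? short.slice(0, maxChars).trim() + "…" : short;
}

// e.g. trimReply(output[0].generated_text) right before you display it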

Had this exact problem with T5 models last month. Your parameter settings aren’t the issue - FLAN-T5-large just wasn’t built for chat. It’s trained for instruction following and detailed explanations, so it naturally wants to be verbose even when you set max_length constraints. I switched to models actually designed for conversation like microsoft/DialoGPT-medium or facebook/blenderbot-400M-distill. They give you shorter, more natural chat responses right out of the box. If you’re stuck with FLAN-T5, make your prompts more direct like ‘Give a brief answer: How are you feeling today?’ It responds way better when you explicitly tell it how to answer.
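If it helps, here’s roughly how I’d restructure the call so you can swap models and prompts without touching the rest (same AUTH_TOKEN as in your snippet; note that the conversational models can return a different JSON shape than flan-t5, so check the response before indexing into it):

// Candidate endpoints: flan-t5 with a direct prompt vs. a chat-oriented model
const MODELS = {
    flan: "https://api-inference.huggingface.co/models/google/flan-t5-large",
    blenderbot: "https://api-inference.huggingface.co/models/facebook/blenderbot-400M-distill",
};

async function callModel(modelUrl, prompt) {
    const result = await fetch(modelUrl, {
        method: "POST",
        headers: {
            Authorization: `Bearer ${AUTH_TOKEN}`,
            "Content-Type": "application/json",
        },
        body: JSON.stringify({ inputs: prompt }),
    });
    return result.json();
}

// Explicitly telling flan-t5 how to answer works far better than parameters alone
callModel(MODELS.flan, "Give a brief answer: How are you feeling today?")
    .then((output) => console.log(output[0].generated_text.trim()));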