Nothing kills UX faster than a loading spinner that spins for 10 seconds. LLM responses take time, but users don't need to wait for the full response before seeing something.
Stream it.
The Difference
Users see text appearing in ~100ms instead of staring at nothing for 8 seconds. Same total time, way better experience.
Basic Implementation
// Server: stream tokens from the LLM as they arrive
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function* generateStream(prompt: string) {
  const response = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [{ role: "user", content: prompt }],
    stream: true
  });

  for await (const chunk of response) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) yield content;
  }
}
// API route (Next.js App Router)
export async function POST(req: Request) {
  const { prompt } = await req.json();
  const encoder = new TextEncoder();

  const stream = new ReadableStream({
    async start(controller) {
      for await (const chunk of generateStream(prompt)) {
        controller.enqueue(encoder.encode(chunk));
      }
      controller.close();
    }
  });

  return new Response(stream, {
    headers: { "Content-Type": "text/plain; charset=utf-8" }
  });
}
Client Side
async function streamResponse(prompt: string) {
  const response = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt })
  });

  const reader = response.body?.getReader();
  if (!reader) return;

  const decoder = new TextDecoder();
  let fullText = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters split across chunks intact
    fullText += decoder.decode(value, { stream: true });
    // Update UI immediately (setMessage is a React state setter in scope)
    setMessage(fullText);
  }
}
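For context, setMessage is ordinary React state. A minimal sketch of the wiring, assuming React with useState; the ChatBox component and its input handling are made up for illustration, not part of the snippet above:

import { useState } from "react";

function ChatBox() {
  const [prompt, setPrompt] = useState("");
  const [message, setMessage] = useState("");

  async function handleSend() {
    setMessage("");
    // streamResponse from above; define it inside the component
    // (or pass setMessage in) so the setter is in scope
    await streamResponse(prompt);
  }

  return (
    <div>
      <input value={prompt} onChange={e => setPrompt(e.target.value)} />
      <button onClick={handleSend}>Send</button>
      <p>{message}</p>
    </div>
  );
}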
Even Simpler: Vercel AI SDK
If you use Next.js, the AI SDK handles all of this:
// app/api/chat/route.ts
import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: openai("gpt-4"),
    messages
  });

  return result.toDataStreamResponse();
}
// Client
import { useChat } from "ai/react";

function Chat() {
  const { messages, input, handleSubmit, handleInputChange } = useChat();

  return (
    <div>
      {messages.map(m => <div key={m.id}>{m.content}</div>)}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} />
      </form>
    </div>
  );
}
That's it. Streaming just works.
Gotchas
1. Can't stream JSON easily
Streaming returns chunks of text. If you need structured data, either:
- Stream the response, then parse at the end (see the sketch below)
- Use partial JSON parsing (experimental)
- Don't stream for that endpoint
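A minimal sketch of the first option, assuming the endpoint streams plain text that forms a complete JSON document; the streamJson name and the endpoint are illustrative:

// Accumulate the streamed text, then parse once the stream ends.
async function streamJson(prompt: string): Promise<unknown> {
  const response = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt })
  });

  const reader = response.body?.getReader();
  if (!reader) throw new Error("No response body");

  const decoder = new TextDecoder();
  let fullText = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    fullText += decoder.decode(value, { stream: true });
    // You can still show the raw text as a progress indicator here.
  }

  // Only valid JSON once the full payload has arrived.
  return JSON.parse(fullText);
}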
2. Error handling is different
Errors can happen mid-stream. Handle them:
try {
  for await (const chunk of response) {
    // ...
  }
} catch (err) {
  // Connection dropped, token limit, etc.
  setError("Response interrupted");
}
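The same applies on the client side, where you may also want to let the user cancel mid-stream. A short sketch assuming an AbortController; readStream stands in for the reader loop from the client example, and setError is a placeholder state setter:

async function send(prompt: string) {
  const controller = new AbortController();
  // Keep a reference to controller so a Cancel button can call controller.abort().
  try {
    const response = await fetch("/api/chat", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ prompt }),
      signal: controller.signal
    });
    await readStream(response); // the reader loop from the client example above
  } catch (err) {
    // Aborts and dropped connections both land here; partial text stays on screen.
    setError("Response interrupted");
  }
}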
3. Tool calls complicate things
If your agent uses tools, streaming gets trickier. The AI SDK handles this for you, but rolling your own means buffering tool-call chunks until the arguments are complete (see the sketch below).
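Roughly, that buffering looks like this. A sketch against the OpenAI streaming format, where tool-call fragments arrive keyed by index and arguments arrive as string pieces; the collectToolCalls name is made up, and actual tool dispatch is left out:

import OpenAI from "openai";

type PendingToolCall = { id?: string; name?: string; args: string };

async function collectToolCalls(
  response: AsyncIterable<OpenAI.Chat.Completions.ChatCompletionChunk>
) {
  const calls: PendingToolCall[] = [];

  for await (const chunk of response) {
    const deltas = chunk.choices[0]?.delta?.tool_calls ?? [];
    for (const delta of deltas) {
      // Fragments for the same call share an index; append as they arrive.
      const call = (calls[delta.index] ??= { args: "" });
      if (delta.id) call.id = delta.id;
      if (delta.function?.name) call.name = delta.function.name;
      if (delta.function?.arguments) call.args += delta.function.arguments;
    }
  }

  // Arguments only form valid JSON once the stream has finished.
  return calls.map(c => ({ id: c.id, name: c.name, args: JSON.parse(c.args) }));
}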
When to Stream
- Chat interfaces: Always stream
- Long-form content: Always stream
- API returns JSON: Don't stream
- Very short responses: Doesn't matter much
Streaming makes slow things feel fast. If your LLM calls take more than a second or two, stream them. Your users will thank you.
