Nothing kills UX faster than a loading spinner that spins for 10 seconds. LLM responses take time, but users don't need to wait for the full response before seeing something.
Stream it.
The Difference
Users see text appearing in ~100ms instead of staring at nothing for 8 seconds. Same total time, way better experience.
Basic Implementation
// Server: stream tokens from the LLM as they arrive
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function* generateStream(prompt: string) {
  const response = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [{ role: "user", content: prompt }],
    stream: true
  });

  for await (const chunk of response) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) yield content;
  }
}
// API route (Next.js App Router)
export async function POST(req: Request) {
  const { prompt } = await req.json();
  const encoder = new TextEncoder();

  const stream = new ReadableStream({
    async start(controller) {
      for await (const chunk of generateStream(prompt)) {
        controller.enqueue(encoder.encode(chunk));
      }
      controller.close();
    }
  });

  return new Response(stream, {
    headers: { "Content-Type": "text/plain; charset=utf-8" }
  });
}
Client Side
async function streamResponse(prompt: string) {
  const response = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt })
  });

  const reader = response.body?.getReader();
  if (!reader) return;

  const decoder = new TextDecoder();
  let fullText = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters split across chunks intact
    fullText += decoder.decode(value, { stream: true });
    // Update UI immediately (setMessage is a React state setter in scope)
    setMessage(fullText);
  }
}
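For context, setMessage is ordinary React state. A minimal sketch of the wiring, assuming React with useState; the ChatBox component and its input handling are made up for illustration, not part of the snippet above:

import { useState } from "react";

function ChatBox() {
  const [prompt, setPrompt] = useState("");
  const [message, setMessage] = useState("");

  async function handleSend() {
    setMessage("");
    // streamResponse from above; define it inside the component
    // (or pass setMessage in) so the setter is in scope
    await streamResponse(prompt);
  }

  return (
    <div>
      <input value={prompt} onChange={e => setPrompt(e.target.value)} />
      <button onClick={handleSend}>Send</button>
      <p>{message}</p>
    </div>
  );
}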
Even Simpler: Vercel AI SDK
If you use Next.js, the AI SDK handles all of this:
// app/api/chat/route.ts
import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: openai("gpt-4"),
    messages
  });

  return result.toDataStreamResponse();
}
// Client
import { useChat } from "ai/react";

function Chat() {
  const { messages, input, handleSubmit, handleInputChange } = useChat();

  return (
    <div>
      {messages.map(m => <div key={m.id}>{m.content}</div>)}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} />
      </form>
    </div>
  );
}
That's it. Streaming just works.
Gotchas
1. Can't stream JSON easily
Streaming returns chunks of text. If you need structured data, either:
- Stream the response, then parse at the end (see the sketch below)
- Use partial JSON parsing (experimental)
- Don't stream for that endpoint
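A minimal sketch of the first option, assuming the endpoint streams plain text that forms a complete JSON document; the streamJson name and the endpoint are illustrative:

// Accumulate the streamed text, then parse once the stream ends.
async function streamJson(prompt: string): Promise<unknown> {
  const response = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt })
  });

  const reader = response.body?.getReader();
  if (!reader) throw new Error("No response body");

  const decoder = new TextDecoder();
  let fullText = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    fullText += decoder.decode(value, { stream: true });
    // You can still show the raw text as a progress indicator here.
  }

  // Only valid JSON once the full payload has arrived.
  return JSON.parse(fullText);
}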
2. Error handling is different
Errors can happen mid-stream. Handle them:
try {
  for await (const chunk of response) {
    // ...
  }
} catch (err) {
  // Connection dropped, token limit, etc.
  setError("Response interrupted");
}
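The same applies on the client side, where you may also want to let the user cancel mid-stream. A short sketch assuming an AbortController; readStream stands in for the reader loop from the client example, and setError is a placeholder state setter:

async function send(prompt: string) {
  const controller = new AbortController();
  // Keep a reference to controller so a Cancel button can call controller.abort().
  try {
    const response = await fetch("/api/chat", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ prompt }),
      signal: controller.signal
    });
    await readStream(response); // the reader loop from the client example above
  } catch (err) {
    // Aborts and dropped connections both land here; partial text stays on screen.
    setError("Response interrupted");
  }
}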
3. Tool calls complicate things
If your agent uses tools, streaming gets trickier. The AI SDK handles this for you, but rolling your own means buffering tool-call chunks until the arguments are complete (see the sketch below).
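Roughly, that buffering looks like this. A sketch against the OpenAI streaming format, where tool-call fragments arrive keyed by index and arguments arrive as string pieces; the collectToolCalls name is made up, and actual tool dispatch is left out:

import OpenAI from "openai";

type PendingToolCall = { id?: string; name?: string; args: string };

async function collectToolCalls(
  response: AsyncIterable<OpenAI.Chat.Completions.ChatCompletionChunk>
) {
  const calls: PendingToolCall[] = [];

  for await (const chunk of response) {
    const deltas = chunk.choices[0]?.delta?.tool_calls ?? [];
    for (const delta of deltas) {
      // Fragments for the same call share an index; append as they arrive.
      const call = (calls[delta.index] ??= { args: "" });
      if (delta.id) call.id = delta.id;
      if (delta.function?.name) call.name = delta.function.name;
      if (delta.function?.arguments) call.args += delta.function.arguments;
    }
  }

  // Arguments only form valid JSON once the stream has finished.
  return calls.map(c => ({ id: c.id, name: c.name, args: JSON.parse(c.args) }));
}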
When to Stream
- Chat interfaces: Always stream
- Long-form content: Always stream
- API returns JSON: Don't stream
- Very short responses: Doesn't matter much
Streaming makes slow things feel fast. If your LLM calls take more than a second or two, stream them. Your users will thank you.
