Context windows are the biggest constraint in LLM apps. GPT-4 gives you 128K tokens, Claude gives you 200K. It sounds like a lot until you're building something real and hitting limits everywhere.
Let me show you how to manage context like it actually matters (because it does).
Understanding Token Economics
First, know what you're working with. A typical budget breakdown (this one totals 8K tokens):
const TOKEN_BUDGET = {
  system: 500,
  context: 2000,
  history: 3500,
  responseBuffer: 2000 // Always reserve for output
};
// If conversation exceeds 3500, something has to go
Rule of thumb: 1 token ≈ 4 characters in English. But always count precisely for important decisions.
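For example, with the tiktoken package (the same encoding_for_model helper that shows up later in this post), the exact count is one call away. A quick sketch; the sample string is arbitrary:

import { encoding_for_model } from 'tiktoken';

const text = 'How many tokens is this sentence, exactly?';

// Rough estimate: ~1 token per 4 characters
const estimate = Math.ceil(text.length / 4);

// Exact count using the model's actual encoding
const enc = encoding_for_model('gpt-4');
const exact = enc.encode(text).length;
enc.free(); // the WASM encoder holds memory; free it when done

console.log({ estimate, exact });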
Strategy 1: Sliding Window
The simplest approach is to keep only the most recent messages:
interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

function slidingWindow(messages: Message[], maxTokens: number): Message[] {
  const system = messages.find(m => m.role === 'system');
  const conversation = messages.filter(m => m.role !== 'system');

  let tokens = countTokens(system?.content || '');
  const keep: Message[] = [];

  // Work backwards from newest until the budget is spent
  for (let i = conversation.length - 1; i >= 0; i--) {
    const msg = conversation[i];
    const msgTokens = countTokens(msg.content);
    if (tokens + msgTokens > maxTokens) break;
    tokens += msgTokens;
    keep.unshift(msg);
  }

  return system ? [system, ...keep] : keep;
}
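In practice you'd run this right before every model call. A minimal sketch; prepareMessages and the 3,500-token cap are just illustrative choices:

// Illustrative wiring: trim history so the newest turns always fit.
const MAX_HISTORY_TOKENS = 3500; // matches the history slot in TOKEN_BUDGET above

function prepareMessages(history: Message[], userInput: string): Message[] {
  const next: Message = { role: 'user', content: userInput };
  return slidingWindow([...history, next], MAX_HISTORY_TOKENS);
}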
Pros: Simple, predictable.
Cons: Loses important early context.
Strategy 2: Smart Summarization
Summarize old messages instead of dropping them:
async function summarizeOldMessages(messages: Message[], maxTokens: number) {
  const totalTokens = messages.reduce((sum, m) => sum + countTokens(m.content), 0);
  if (totalTokens <= maxTokens) return messages;

  // Keep ~30% most recent (at least one message), summarize the older rest
  const recentCount = Math.max(1, Math.floor(messages.length * 0.3));
  const oldMessages = messages.slice(0, -recentCount);
  const recentMessages = messages.slice(-recentCount);

  const summary = await llm.complete(`
    Summarize this conversation concisely.
    Keep: key decisions, important facts, user preferences.
    Discard: pleasantries, redundant exchanges.

    ${oldMessages.map(m => `${m.role}: ${m.content}`).join('\n')}`);

  return [
    { role: 'system', content: `Previous conversation summary:\n${summary}` },
    ...recentMessages
  ];
}
Pros: Preserves key info.
Cons: Loses details, extra LLM call.
Strategy 3: Hierarchical Context
Different info has different importance. Treat it that way:
type Priority = 'critical' | 'high' | 'medium' | 'low';

interface ContextLayer {
  content: string;
  tokens: number;
  priority: Priority;
}

const priorityOrder: Record<Priority, number> = { critical: 0, high: 1, medium: 2, low: 3 };

function buildContext(layers: ContextLayer[], maxTokens: number): string {
  // Sort a copy so the caller's array isn't mutated
  const sorted = [...layers].sort(
    (a, b) => priorityOrder[a.priority] - priorityOrder[b.priority]
  );

  const included: string[] = [];
  let used = 0;

  for (const layer of sorted) {
    if (used + layer.tokens <= maxTokens) {
      included.push(layer.content);
      used += layer.tokens;
    } else if (layer.priority === 'critical') {
      // Critical must be included - truncate to whatever space is left
      const available = maxTokens - used;
      included.push(truncateToTokens(layer.content, available));
      break;
    }
  }

  return included.join('\n\n');
}
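Here's roughly how the layers might be assembled; the layer names and the 6,000-token cap are illustrative, not prescriptive:

// Illustrative: the system prompt must survive, retrieved docs are nice to
// have, and older history is the first thing to drop.
function assembleContext(
  systemPrompt: string,
  retrievedDocs: string,
  recentHistory: string,
  olderHistory: string
): string {
  const layers: ContextLayer[] = [
    { priority: 'critical', content: systemPrompt,  tokens: countTokens(systemPrompt) },
    { priority: 'high',     content: retrievedDocs, tokens: countTokens(retrievedDocs) },
    { priority: 'medium',   content: recentHistory, tokens: countTokens(recentHistory) },
    { priority: 'low',      content: olderHistory,  tokens: countTokens(olderHistory) },
  ];
  return buildContext(layers, 6000); // leave the rest of the window for the response
}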
Handling Long User Inputs
Sometimes users paste in massive documents. Handle it gracefully:
async function handleLongInput(input: string, maxTokens: number) {
  const tokens = countTokens(input);
  if (tokens <= maxTokens) return input;

  // Moderately long - truncate with warning
  if (tokens < maxTokens * 2) {
    return truncateToTokens(input, maxTokens) +
      '\n\n[Truncated due to length]';
  }

  // Very long - summarize
  const summary = await llm.complete(`
    Summarize preserving key facts and requests:
    ${truncateToTokens(input, maxTokens * 3)}`);

  return `[Summarized from ${tokens} tokens]\n\n${summary}`;
}
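One helper I've been using without defining it: truncateToTokens. A minimal sketch, assuming the tiktoken encoder (use whichever tokenizer matches your model):

import { encoding_for_model } from 'tiktoken';

function truncateToTokens(text: string, maxTokens: number): string {
  const enc = encoding_for_model('gpt-4');
  try {
    const tokens = enc.encode(text);
    if (tokens.length <= maxTokens) return text;
    // decode() returns bytes in the WASM build of tiktoken
    return new TextDecoder().decode(enc.decode(tokens.slice(0, maxTokens)));
  } finally {
    enc.free();
  }
}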
Token Budget Tracker
Build a budget system:
class TokenBudget {
  private limits: Record<string, number>;
  private usage: Record<string, number> = {};

  constructor(total: number) {
    this.limits = {
      system: total * 0.15,
      context: total * 0.35,
      history: total * 0.30,
      response: total * 0.20,
    };
  }

  allocate(category: string, content: string): string {
    const tokens = countTokens(content);
    const limit = this.limits[category];

    if (tokens <= limit) {
      this.usage[category] = tokens;
      return content;
    }

    const truncated = truncateToTokens(content, limit);
    this.usage[category] = limit;
    return truncated;
  }

  getRemaining(category: string): number {
    return this.limits[category] - (this.usage[category] || 0);
  }
}
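Usage might look something like this; the 8,000-token total and buildPrompt are illustrative:

// Illustrative: split an 8,000-token window across the four categories.
function buildPrompt(systemPrompt: string, retrievedDocs: string, historyText: string): string {
  const budget = new TokenBudget(8000);
  const parts = [
    budget.allocate('system', systemPrompt),   // capped at 1,200 tokens (15%)
    budget.allocate('context', retrievedDocs), // capped at 2,800 tokens (35%)
    budget.allocate('history', historyText),   // capped at 2,400 tokens (30%)
  ];
  // The remaining 20% (1,600 tokens) stays reserved for the response.
  return parts.join('\n\n');
}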
Common Pitfalls
1. Not reserving space for output
// Wrong - no room for the response!
const prompt = fillToMaxTokens(context, 8000);

// Right
const prompt = fillToMaxTokens(context, 6000); // Leave 2000 for output
2. Forgetting tool call overhead
Tool definitions consume tokens too:
function calculateAvailable(model: string, tools: Tool[]) {
  const limit = MODEL_LIMITS[model];
  const toolTokens = tools.reduce(
    (sum, t) => sum + countTokens(JSON.stringify(t)),
    0
  );
  return limit - toolTokens - 2000; // response buffer
}
3. Wrong tokenizer
Different models = different tokenizers:
const TOKENIZERS = {
  'gpt-4': encoding_for_model('gpt-4'),
  'claude-3': new ClaudeTokenizer(), // stand-in for whatever Claude token counter you use
};

function countTokens(text: string, model = 'gpt-4') {
  // Default model so the earlier one-argument calls still work
  return TOKENIZERS[model].encode(text).length;
}
Monitoring in Production
Track usage to optimize:
await analytics.track({
  inputTokens,
  outputTokens: countTokens(response),
  utilization: inputTokens / MODEL_LIMIT,
  truncated: inputTokens > MODEL_LIMIT * 0.9
});
Wrapping Up
Context management is about tradeoffs:
- Recency vs. completeness: keep only recent messages, or the full history?
- Precision vs. cost: exact token counts, or approximations?
- Latency vs. quality: compress on the fly, or pre-process?
Start with sliding windows. Add summarization when users complain that the assistant forgets things. Implement hierarchical context when you have competing info sources.
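If it helps, the escalation path can be as simple as a dispatcher like this; the 2x threshold is an arbitrary choice:

// Sketch: escalate strategies as the conversation outgrows the window.
async function fitToWindow(messages: Message[], maxTokens: number) {
  const total = messages.reduce((sum, m) => sum + countTokens(m.content), 0);

  if (total <= maxTokens) return messages;          // fits as-is
  if (total <= maxTokens * 2) {
    return slidingWindow(messages, maxTokens);      // mild overflow: just trim
  }
  return summarizeOldMessages(messages, maxTokens); // heavy overflow: summarize
}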
And always monitor your token usage. You'll be surprised where your tokens actually go.
