Context windows are the biggest constraint in LLM apps. GPT-4 gives you 128K tokens, Claude gives you 200K. It sounds like a lot until you're building something real and hitting limits everywhere.
Let me show you how to manage context like it actually matters (because it does).
Understanding Token Economics
First, know what you're working with. A typical budget breakdown (this one totals 8K tokens):
const TOKEN_BUDGET = {
  system: 500,
  context: 2000,
  history: 3500,
  responseBuffer: 2000 // Always reserve for output
};
// If conversation exceeds 3500, something has to go
Rule of thumb: 1 token ≈ 4 characters in English. But always count precisely for important decisions.
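For example, with the tiktoken package (the same encoding_for_model helper that shows up later in this post), the exact count is one call away. A quick sketch; the sample string is arbitrary:

import { encoding_for_model } from 'tiktoken';

const text = 'How many tokens is this sentence, exactly?';

// Rough estimate: ~1 token per 4 characters
const estimate = Math.ceil(text.length / 4);

// Exact count using the model's actual encoding
const enc = encoding_for_model('gpt-4');
const exact = enc.encode(text).length;
enc.free(); // the WASM encoder holds memory; free it when done

console.log({ estimate, exact });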
Strategy 1: Sliding Window
The simplest approach is to keep only the most recent messages:
interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

function slidingWindow(messages: Message[], maxTokens: number): Message[] {
  const system = messages.find(m => m.role === 'system');
  const conversation = messages.filter(m => m.role !== 'system');

  let tokens = countTokens(system?.content || '');
  const keep: Message[] = [];

  // Work backwards from newest until the budget is spent
  for (let i = conversation.length - 1; i >= 0; i--) {
    const msg = conversation[i];
    const msgTokens = countTokens(msg.content);
    if (tokens + msgTokens > maxTokens) break;
    tokens += msgTokens;
    keep.unshift(msg);
  }

  return system ? [system, ...keep] : keep;
}
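In practice you'd run this right before every model call. A minimal sketch; prepareMessages and the 3,500-token cap are just illustrative choices:

// Illustrative wiring: trim history so the newest turns always fit.
const MAX_HISTORY_TOKENS = 3500; // matches the history slot in TOKEN_BUDGET above

function prepareMessages(history: Message[], userInput: string): Message[] {
  const next: Message = { role: 'user', content: userInput };
  return slidingWindow([...history, next], MAX_HISTORY_TOKENS);
}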
Pros: Simple, predictable.
Cons: Loses important early context.
Strategy 2: Smart Summarization
Summarize old messages instead of dropping them:
async function summarizeOldMessages(messages: Message[], maxTokens: number) {
  const totalTokens = messages.reduce((sum, m) => sum + countTokens(m.content), 0);
  if (totalTokens <= maxTokens) return messages;

  // Keep ~30% most recent (at least one message), summarize the older rest
  const recentCount = Math.max(1, Math.floor(messages.length * 0.3));
  const oldMessages = messages.slice(0, -recentCount);
  const recentMessages = messages.slice(-recentCount);

  const summary = await llm.complete(`
    Summarize this conversation concisely.
    Keep: key decisions, important facts, user preferences.
    Discard: pleasantries, redundant exchanges.

    ${oldMessages.map(m => `${m.role}: ${m.content}`).join('\n')}`);

  return [
    { role: 'system', content: `Previous conversation summary:\n${summary}` },
    ...recentMessages
  ];
}
Pros: Preserves key info.
Cons: Loses details, extra LLM call.
Strategy 3: Hierarchical Context
Different info has different importance. Treat it that way:
type Priority = 'critical' | 'high' | 'medium' | 'low';

interface ContextLayer {
  content: string;
  tokens: number;
  priority: Priority;
}

const priorityOrder: Record<Priority, number> = { critical: 0, high: 1, medium: 2, low: 3 };

function buildContext(layers: ContextLayer[], maxTokens: number): string {
  // Sort a copy so the caller's array isn't mutated
  const sorted = [...layers].sort(
    (a, b) => priorityOrder[a.priority] - priorityOrder[b.priority]
  );

  const included: string[] = [];
  let used = 0;

  for (const layer of sorted) {
    if (used + layer.tokens <= maxTokens) {
      included.push(layer.content);
      used += layer.tokens;
    } else if (layer.priority === 'critical') {
      // Critical must be included - truncate to whatever space is left
      const available = maxTokens - used;
      included.push(truncateToTokens(layer.content, available));
      break;
    }
  }

  return included.join('\n\n');
}
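Here's roughly how the layers might be assembled; the layer names and the 6,000-token cap are illustrative, not prescriptive:

// Illustrative: the system prompt must survive, retrieved docs are nice to
// have, and older history is the first thing to drop.
function assembleContext(
  systemPrompt: string,
  retrievedDocs: string,
  recentHistory: string,
  olderHistory: string
): string {
  const layers: ContextLayer[] = [
    { priority: 'critical', content: systemPrompt,  tokens: countTokens(systemPrompt) },
    { priority: 'high',     content: retrievedDocs, tokens: countTokens(retrievedDocs) },
    { priority: 'medium',   content: recentHistory, tokens: countTokens(recentHistory) },
    { priority: 'low',      content: olderHistory,  tokens: countTokens(olderHistory) },
  ];
  return buildContext(layers, 6000); // leave the rest of the window for the response
}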
Handling Long User Inputs
Sometimes users paste in massive documents. Handle it gracefully:
async function handleLongInput(input: string, maxTokens: number) {
  const tokens = countTokens(input);
  if (tokens <= maxTokens) return input;

  // Moderately long - truncate with warning
  if (tokens < maxTokens * 2) {
    return truncateToTokens(input, maxTokens) +
      '\n\n[Truncated due to length]';
  }

  // Very long - summarize
  const summary = await llm.complete(`
    Summarize preserving key facts and requests:
    ${truncateToTokens(input, maxTokens * 3)}`);

  return `[Summarized from ${tokens} tokens]\n\n${summary}`;
}
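One helper I've been using without defining it: truncateToTokens. A minimal sketch, assuming the tiktoken encoder (use whichever tokenizer matches your model):

import { encoding_for_model } from 'tiktoken';

function truncateToTokens(text: string, maxTokens: number): string {
  const enc = encoding_for_model('gpt-4');
  try {
    const tokens = enc.encode(text);
    if (tokens.length <= maxTokens) return text;
    // decode() returns bytes in the WASM build of tiktoken
    return new TextDecoder().decode(enc.decode(tokens.slice(0, maxTokens)));
  } finally {
    enc.free();
  }
}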
Token Budget Tracker
Build a budget system:
class TokenBudget {
  private limits: Record<string, number>;
  private usage: Record<string, number> = {};

  constructor(total: number) {
    this.limits = {
      system: total * 0.15,
      context: total * 0.35,
      history: total * 0.30,
      response: total * 0.20,
    };
  }

  allocate(category: string, content: string): string {
    const tokens = countTokens(content);
    const limit = this.limits[category];

    if (tokens <= limit) {
      this.usage[category] = tokens;
      return content;
    }

    const truncated = truncateToTokens(content, limit);
    this.usage[category] = limit;
    return truncated;
  }

  getRemaining(category: string): number {
    return this.limits[category] - (this.usage[category] || 0);
  }
}
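Usage might look something like this; the 8,000-token total and buildPrompt are illustrative:

// Illustrative: split an 8,000-token window across the four categories.
function buildPrompt(systemPrompt: string, retrievedDocs: string, historyText: string): string {
  const budget = new TokenBudget(8000);
  const parts = [
    budget.allocate('system', systemPrompt),   // capped at 1,200 tokens (15%)
    budget.allocate('context', retrievedDocs), // capped at 2,800 tokens (35%)
    budget.allocate('history', historyText),   // capped at 2,400 tokens (30%)
  ];
  // The remaining 20% (1,600 tokens) stays reserved for the response.
  return parts.join('\n\n');
}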
Common Pitfalls
1. Not reserving space for output
// Wrong - no room for the response!
const prompt = fillToMaxTokens(context, 8000);

// Right
const prompt = fillToMaxTokens(context, 6000); // Leave 2000 for output
2. Forgetting tool call overhead
Tool definitions consume tokens too:
function calculateAvailable(model: string, tools: Tool[]) {
  const limit = MODEL_LIMITS[model];
  const toolTokens = tools.reduce(
    (sum, t) => sum + countTokens(JSON.stringify(t)),
    0
  );
  return limit - toolTokens - 2000; // response buffer
}
3. Wrong tokenizer
Different models = different tokenizers:
const TOKENIZERS = {
  'gpt-4': encoding_for_model('gpt-4'),
  'claude-3': new ClaudeTokenizer(), // stand-in for whatever Claude token counter you use
};

function countTokens(text: string, model = 'gpt-4') {
  // Default model so the earlier one-argument calls still work
  return TOKENIZERS[model].encode(text).length;
}
Monitoring in Production
Track usage to optimize:
await analytics.track({
  inputTokens,
  outputTokens: countTokens(response),
  utilization: inputTokens / MODEL_LIMIT,
  truncated: inputTokens > MODEL_LIMIT * 0.9
});
Wrapping Up
Context management is about tradeoffs:
- Recency vs. completeness: keep only recent messages, or the full history?
- Precision vs. cost: exact token counts, or approximations?
- Latency vs. quality: compress on the fly, or pre-process?
Start with sliding windows. Add summarization when users complain that the assistant forgets things. Implement hierarchical context when you have competing info sources.
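If it helps, the escalation path can be as simple as a dispatcher like this; the 2x threshold is an arbitrary choice:

// Sketch: escalate strategies as the conversation outgrows the window.
async function fitToWindow(messages: Message[], maxTokens: number) {
  const total = messages.reduce((sum, m) => sum + countTokens(m.content), 0);

  if (total <= maxTokens) return messages;          // fits as-is
  if (total <= maxTokens * 2) {
    return slidingWindow(messages, maxTokens);      // mild overflow: just trim
  }
  return summarizeOldMessages(messages, maxTokens); // heavy overflow: summarize
}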
And always monitor your token usage. You'll be surprised where your tokens actually go.
