You can't write expect(aiResponse).toBe("exact string") when the AI gives a different answer each time. But you still need to test.
Here's what works.
The Problem
"What's 2+2?" might return "4", "The answer is 4", "Four", or "2+2 equals 4". All correct, but traditional tests would fail 3 of them.
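To make this concrete, here's a sketch using a hypothetical askMath wrapper (the name and the canned phrasings are mine, standing in for a real model call):

```typescript
// Hypothetical model wrapper: returns one of several equally correct
// phrasings, and which one varies from call to call.
const phrasings = ["4", "The answer is 4", "Four", "2+2 equals 4"];

async function askMath(_question: string): Promise<string> {
  return phrasings[Math.floor(Math.random() * phrasings.length)];
}

// Brittle: expect(await askMath("What's 2+2?")).toBe("4") passes only
// when the model happens to pick the first phrasing.

// Robust: accept any phrasing that contains the right value.
function isCorrect(answer: string): boolean {
  return /\b(4|four)\b/i.test(answer);
}
```

The strategies below are all variations on that second move: asserting on what must be true of the answer rather than on its exact text.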
Strategy 1: Test Properties, Not Values
Instead of exact matching, test that the response has certain properties:
```typescript
describe("summarize", () => {
  it("returns shorter text than input", async () => {
    const input = longArticle; // 5000 words
    const summary = await summarize(input);
    expect(summary.length).toBeLessThan(input.length / 5);
  });

  it("preserves key information", async () => {
    const input = "Company XYZ reported $50M revenue in Q3.";
    const summary = await summarize(input);
    expect(summary.toLowerCase()).toContain("xyz");
    expect(summary).toMatch(/\$?50.*m/i);
  });
});
```
Strategy 2: LLM-as-Judge
Use an LLM to evaluate another LLM's output:
```typescript
async function evaluateResponse(
  input: string,
  response: string,
  criteria: string
): Promise<{ pass: boolean; reason: string }> {
  const evaluation = await llm.complete({
    prompt: `Evaluate this AI response.
Input: "${input}"
Response: "${response}"
Criteria: ${criteria}
Does the response meet the criteria? Reply with JSON:
{"pass": true/false, "reason": "explanation"}`,
    temperature: 0
  });
  return JSON.parse(evaluation);
}

// Usage in tests
it("answers questions correctly", async () => {
  const response = await askQuestion("What is the capital of France?");
  const result = await evaluateResponse(
    "What is the capital of France?",
    response,
    "Must correctly identify Paris as the capital"
  );
  expect(result.pass).toBe(true);
});
```
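One caveat: even at temperature 0, judge models sometimes wrap their JSON in a markdown code fence or add surrounding prose, which makes a bare JSON.parse throw. A small extraction helper tolerates that (parseJudgeReply is my name, not a library function):

```typescript
interface JudgeResult {
  pass: boolean;
  reason: string;
}

// Pull the first {...} object out of the raw reply before parsing,
// so fenced or prose-wrapped JSON still parses.
function parseJudgeReply(raw: string): JudgeResult {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end === -1 || end < start) {
    throw new Error(`No JSON object found in judge reply: ${raw}`);
  }
  return JSON.parse(raw.slice(start, end + 1)) as JudgeResult;
}
```

Swapping this in for the bare JSON.parse call keeps a cosmetic formatting change in the judge's reply from failing your whole suite.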
Strategy 3: Golden Dataset Testing
Build a dataset of inputs with acceptable outputs:
```typescript
const goldenDataset = [
  {
    input: "Translate to French: Hello",
    // At least one of these must appear in the output
    mustContain: ["bonjour", "salut"],
    mustNotContain: ["hello"]
  },
  // ... more cases
];

describe("translation", () => {
  goldenDataset.forEach(({ input, mustContain, mustNotContain }) => {
    it(`handles: ${input.slice(0, 30)}...`, async () => {
      const result = await translate(input);
      const lower = result.toLowerCase();

      const hasRequired = mustContain.some(word =>
        lower.includes(word)
      );
      const hasProhibited = mustNotContain.some(word =>
        lower.includes(word)
      );

      expect(hasRequired).toBe(true);
      expect(hasProhibited).toBe(false);
    });
  });
});
```
Strategy 4: Snapshot Testing with Tolerance
Store an embedding of a known-good output as a baseline, then flag runs that drift too far from it semantically:
```typescript
async function snapshotTest(
  testFn: () => Promise<string>,
  snapshotId: string
) {
  const result = await testFn();
  const embedding = await getEmbedding(result);
  const snapshot = await loadSnapshot(snapshotId);

  if (!snapshot) {
    await saveSnapshot(snapshotId, embedding);
    return; // First run, just save
  }

  const similarity = cosineSimilarity(embedding, snapshot);
  // Allow some variation but flag big changes
  expect(similarity).toBeGreaterThan(0.85);
}
```
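Here, getEmbedding, loadSnapshot, and saveSnapshot are assumed helpers you'd wire to your embedding API and snapshot store. cosineSimilarity, at least, is self-contained:

```typescript
// Cosine similarity between two embedding vectors: dot product divided
// by the product of magnitudes. 1.0 means identical direction, 0 means
// orthogonal (semantically unrelated, for typical embedding models).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embedding dimensions must match");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

The 0.85 threshold above is a starting point, not a constant; calibrate it against a handful of known-good and known-bad outputs for your model.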
Strategy 5: Mock for Unit Tests
For unit tests, mock the LLM entirely:
```typescript
// __mocks__/llm.ts
export const llm = {
  complete: jest.fn()
};
```
```typescript
// test file
import { llm } from "./llm";
import { processWithAI } from "./processor";

jest.mock("./llm");

describe("processWithAI", () => {
  it("parses LLM response correctly", async () => {
    (llm.complete as jest.Mock).mockResolvedValue(
      '{"name": "John", "age": 30}'
    );

    const result = await processWithAI("some input");

    expect(result.name).toBe("John");
    expect(llm.complete).toHaveBeenCalledWith(
      expect.objectContaining({
        prompt: expect.stringContaining("some input")
      })
    );
  });
});
```
Mock for speed, real LLM for integration tests.
Testing Pyramid for AI
- Unit tests: Mock everything, test your logic
- Integration tests: Real LLM, test one feature
- E2E tests: Full pipeline, expensive but necessary
Quick Tips
- Set temperature to 0 to reduce (but not fully eliminate) output variation
- Use seed parameters if available
- Cache LLM responses in test fixtures
- Run flaky tests multiple times and check pass rate
- Budget for AI test costs - they add up
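The flaky-test tip can be sketched as a small runner that asserts on the pass rate instead of any single run (passRate is my name, not a jest API; tune the threshold to your tolerance):

```typescript
// Run a nondeterministic check several times and report the fraction
// of runs that passed, so one unlucky sample doesn't fail the suite.
async function passRate(
  check: () => Promise<boolean>,
  runs: number
): Promise<number> {
  let passes = 0;
  for (let i = 0; i < runs; i++) {
    if (await check()) passes++;
  }
  return passes / runs;
}

// Usage: require 8 of 10 runs to pass rather than all 10.
// expect(await passRate(() => judgeAnswer(question), 10)).toBeGreaterThanOrEqual(0.8);
```

Note that this multiplies your per-test LLM cost by the run count, which is why the caching and budgeting tips above matter.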
Testing AI isn't like testing normal code. Accept some uncertainty, test properties over values, and use the LLM itself to help evaluate. It's not perfect, but it catches real bugs.
