You can't write expect(aiResponse).toBe("exact string") when the AI gives a different answer each time. But you still need to test.
Here's what works.
The Problem
"What's 2+2?" might return "4", "The answer is 4", "Four", or "2+2 equals 4". All correct, but traditional tests would fail 3 of them.
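To make this concrete, here's a sketch using a hypothetical askMath wrapper (the name and the canned phrasings are mine, standing in for a real model call):

```typescript
// Hypothetical model wrapper: returns one of several equally correct
// phrasings, and which one varies from call to call.
const phrasings = ["4", "The answer is 4", "Four", "2+2 equals 4"];

async function askMath(_question: string): Promise<string> {
  return phrasings[Math.floor(Math.random() * phrasings.length)];
}

// Brittle: expect(await askMath("What's 2+2?")).toBe("4") passes only
// when the model happens to pick the first phrasing.

// Robust: accept any phrasing that contains the right value.
function isCorrect(answer: string): boolean {
  return /\b(4|four)\b/i.test(answer);
}
```

The strategies below are all variations on that second move: asserting on what must be true of the answer rather than on its exact text.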
Strategy 1: Test Properties, Not Values
Instead of exact matching, test that the response has certain properties:
```typescript
describe("summarize", () => {
  it("returns shorter text than input", async () => {
    const input = longArticle; // 5000 words
    const summary = await summarize(input);
    expect(summary.length).toBeLessThan(input.length / 5);
  });

  it("preserves key information", async () => {
    const input = "Company XYZ reported $50M revenue in Q3.";
    const summary = await summarize(input);
    expect(summary.toLowerCase()).toContain("xyz");
    expect(summary).toMatch(/\$?50.*m/i);
  });
});
```
Strategy 2: LLM-as-Judge
Use an LLM to evaluate another LLM's output:
```typescript
async function evaluateResponse(
  input: string,
  response: string,
  criteria: string
): Promise<{ pass: boolean; reason: string }> {
  const evaluation = await llm.complete({
    prompt: `Evaluate this AI response.
Input: "${input}"
Response: "${response}"
Criteria: ${criteria}
Does the response meet the criteria? Reply with JSON:
{"pass": true/false, "reason": "explanation"}`,
    temperature: 0
  });
  return JSON.parse(evaluation);
}

// Usage in tests
it("answers questions correctly", async () => {
  const response = await askQuestion("What is the capital of France?");
  const result = await evaluateResponse(
    "What is the capital of France?",
    response,
    "Must correctly identify Paris as the capital"
  );
  expect(result.pass).toBe(true);
});
```
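One caveat: even at temperature 0, judge models sometimes wrap their JSON in a markdown code fence or add surrounding prose, which makes a bare JSON.parse throw. A small extraction helper tolerates that (parseJudgeReply is my name, not a library function):

```typescript
interface JudgeResult {
  pass: boolean;
  reason: string;
}

// Pull the first {...} object out of the raw reply before parsing,
// so fenced or prose-wrapped JSON still parses.
function parseJudgeReply(raw: string): JudgeResult {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end === -1 || end < start) {
    throw new Error(`No JSON object found in judge reply: ${raw}`);
  }
  return JSON.parse(raw.slice(start, end + 1)) as JudgeResult;
}
```

Swapping this in for the bare JSON.parse call keeps a cosmetic formatting change in the judge's reply from failing your whole suite.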
Strategy 3: Golden Dataset Testing
Build a dataset of inputs with acceptable outputs:
```typescript
const goldenDataset = [
  {
    input: "Translate to French: Hello",
    // At least one of these must appear in the output
    mustContain: ["bonjour", "salut"],
    mustNotContain: ["hello"]
  },
  // ... more cases
];

describe("translation", () => {
  goldenDataset.forEach(({ input, mustContain, mustNotContain }) => {
    it(`handles: ${input.slice(0, 30)}...`, async () => {
      const result = await translate(input);
      const lower = result.toLowerCase();

      const hasRequired = mustContain.some(word =>
        lower.includes(word)
      );
      const hasProhibited = mustNotContain.some(word =>
        lower.includes(word)
      );

      expect(hasRequired).toBe(true);
      expect(hasProhibited).toBe(false);
    });
  });
});
```
Strategy 4: Snapshot Testing with Tolerance
Store an embedding of a known-good output as a baseline, then flag runs that drift too far from it semantically:
```typescript
async function snapshotTest(
  testFn: () => Promise<string>,
  snapshotId: string
) {
  const result = await testFn();
  const embedding = await getEmbedding(result);
  const snapshot = await loadSnapshot(snapshotId);

  if (!snapshot) {
    await saveSnapshot(snapshotId, embedding);
    return; // First run, just save
  }

  const similarity = cosineSimilarity(embedding, snapshot);
  // Allow some variation but flag big changes
  expect(similarity).toBeGreaterThan(0.85);
}
```
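Here, getEmbedding, loadSnapshot, and saveSnapshot are assumed helpers you'd wire to your embedding API and snapshot store. cosineSimilarity, at least, is self-contained:

```typescript
// Cosine similarity between two embedding vectors: dot product divided
// by the product of magnitudes. 1.0 means identical direction, 0 means
// orthogonal (semantically unrelated, for typical embedding models).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embedding dimensions must match");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

The 0.85 threshold above is a starting point, not a constant; calibrate it against a handful of known-good and known-bad outputs for your model.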
Strategy 5: Mock for Unit Tests
For unit tests, mock the LLM entirely:
```typescript
// __mocks__/llm.ts
export const llm = {
  complete: jest.fn()
};
```
```typescript
// test file
import { llm } from "./llm";
import { processWithAI } from "./processor";

jest.mock("./llm");

describe("processWithAI", () => {
  it("parses LLM response correctly", async () => {
    (llm.complete as jest.Mock).mockResolvedValue(
      '{"name": "John", "age": 30}'
    );

    const result = await processWithAI("some input");

    expect(result.name).toBe("John");
    expect(llm.complete).toHaveBeenCalledWith(
      expect.objectContaining({
        prompt: expect.stringContaining("some input")
      })
    );
  });
});
```
Mock for speed, real LLM for integration tests.
Testing Pyramid for AI
- Unit tests: Mock everything, test your logic
- Integration tests: Real LLM, test one feature
- E2E tests: Full pipeline, expensive but necessary
Quick Tips
- Set temperature to 0 to reduce (but not fully eliminate) output variation
- Use seed parameters if available
- Cache LLM responses in test fixtures
- Run flaky tests multiple times and check pass rate
- Budget for AI test costs - they add up
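The flaky-test tip can be sketched as a small runner that asserts on the pass rate instead of any single run (passRate is my name, not a jest API; tune the threshold to your tolerance):

```typescript
// Run a nondeterministic check several times and report the fraction
// of runs that passed, so one unlucky sample doesn't fail the suite.
async function passRate(
  check: () => Promise<boolean>,
  runs: number
): Promise<number> {
  let passes = 0;
  for (let i = 0; i < runs; i++) {
    if (await check()) passes++;
  }
  return passes / runs;
}

// Usage: require 8 of 10 runs to pass rather than all 10.
// expect(await passRate(() => judgeAnswer(question), 10)).toBeGreaterThanOrEqual(0.8);
```

Note that this multiplies your per-test LLM cost by the run count, which is why the caching and budgeting tips above matter.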
Testing AI isn't like testing normal code. Accept some uncertainty, test properties over values, and use the LLM itself to help evaluate. It's not perfect, but it catches real bugs.
