AI code review is either incredibly useful or incredibly annoying. There's no middle ground.
The annoying version nitpicks variable names, suggests unnecessary refactors, and adds noise to every PR. The useful version catches bugs before they hit production, spots security issues humans miss, and saves senior developers hours of review time.
I've built both. Here's how to build the useful one.
The Problem with Naive AI Review
The obvious approach doesn't work:
// Don't do this
const review = await llm.complete(`
Review this code and provide feedback:
${diff}
`);
This produces:
- Style suggestions nobody asked for
- False positives everywhere
- Generic advice that applies to any code
- No understanding of project context
The model sees code in isolation. It doesn't know your conventions, your architecture, or what the PR is actually trying to accomplish.
Architecture of Useful AI Review
A good system has multiple components working together:
- Context gathering - the diff, full files, linked issues, and project conventions
- Multi-pass analysis - separate focused passes for security, bugs, and performance
- Filtering and ranking - deduplicate, drop low-confidence findings, cap the volume
- Comment posting - actionable feedback with a way to flag false positives
Each step builds on the previous. You can't just throw a diff at an LLM and expect magic.
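At a high level, the wiring is easy to sketch. This isn't the full implementation, just the entry point that ties the steps together (the same runFullReview the CI check calls later):
async function runFullReview(pr: PullRequest): Promise<Issue[]> {
  // Step 1: gather the diff, full files, intent, and project conventions
  const context = await gatherContext(pr);

  // Step 2: run the focused passes in parallel; each returns { issues: [...] }
  const [security, bugs, perf] = await Promise.all([
    securityReview(context),
    bugDetection(context),
    performanceReview(context)
  ]);

  // Step 3: dedupe, drop low-confidence findings, cap the volume
  return filterAndRank([...security.issues, ...bugs.issues, ...perf.issues]);
}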
Step 1: Gather Rich Context
The diff alone is not enough. You need the full picture:
async function gatherContext(pr: PullRequest) {
const [
diff,
prDescription,
linkedIssues,
changedFiles,
projectConfig
] = await Promise.all([
github.getPRDiff(pr.id),
github.getPRDescription(pr.id),
github.getLinkedIssues(pr.id),
github.getChangedFilesWithContent(pr.id),
loadProjectConfig(pr.repo)
]);
// Get full file content, not just changed lines
const filesWithContext = await Promise.all(
changedFiles.map(async (file) => ({
path: file.path,
diff: file.diff,
fullContent: await github.getFileContent(pr.repo, file.path, pr.head),
// Also get related files (imports, tests)
relatedFiles: await findRelatedFiles(file.path, pr.repo)
}))
);
return {
diff,
description: prDescription,
intent: await extractIntent(prDescription, linkedIssues),
files: filesWithContext,
conventions: projectConfig.codeConventions,
ignorePatterns: projectConfig.reviewIgnore
};
}
The key insight here? Get the full file content, not just the diff. The model needs surrounding context to understand what the code is actually doing.
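One helper above, findRelatedFiles, is assumed rather than shown. A minimal sketch, assuming co-located *.test.ts files and the same github wrapper used in gatherContext:
import { posix } from 'node:path';

async function findRelatedFiles(filePath: string, repo: string): Promise<string[]> {
  const related: string[] = [];

  // A sibling test file often documents the expected behavior
  related.push(filePath.replace(/\.tsx?$/, '.test.ts'));

  // Relative imports give the model the callee side of the change
  // ('HEAD' is a stand-in here; in practice pass the PR head ref)
  const content = await github.getFileContent(repo, filePath, 'HEAD');
  for (const match of content.matchAll(/from ['"](\.[^'"]+)['"]/g)) {
    related.push(posix.join(posix.dirname(filePath), match[1]) + '.ts');
  }

  return related;
}
The other half of the context is intent, extracted from the PR description and linked issues: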
async function extractIntent(description: string, issues: Issue[]) {
// Use LLM to understand what this PR is trying to do
const intent = await llm.complete(`
Based on this PR description and linked issues, summarize:
1. What is this change trying to accomplish?
2. What are the acceptance criteria?
3. What areas are most critical to review?
PR Description:
${description}
Linked Issues:
${issues.map(i => `- ${i.title}: ${i.body}`).join('\n')}
Summary:`);
return intent;
}
Understanding intent is huge. A PR that adds user deletion needs very different scrutiny than one that updates button colors.
Step 2: Multi-Pass Analysis
Different types of issues need different prompts. Don't try to catch everything in one shot.
Security Review
async function securityReview(context: ReviewContext) {
const securityPrompt = `You are a security engineer reviewing code for vulnerabilities.
Focus ONLY on security issues:
- SQL injection
- XSS vulnerabilities
- Authentication/authorization flaws
- Secrets in code
- Insecure dependencies
- Input validation issues
- Path traversal
Project security requirements:
${context.conventions.security}
Changed files:
${context.files.map(f => `
=== ${f.path} ===
${f.fullContent}
`).join('\n')}
For each issue found, return JSON:
{
"issues": [
{
"file": "path/to/file.ts",
"line": 42,
"severity": "critical" | "high" | "medium",
"type": "security",
"issue": "brief description",
"suggestion": "how to fix",
"confidence": 0.0-1.0
}
]
}
If no security issues found, return {"issues": []}`;
const result = await llm.complete(securityPrompt);
return JSON.parse(result);
}
Bug Detection
async function bugDetection(context: ReviewContext) {
const bugPrompt = `You are a senior developer reviewing code for bugs.
This PR is trying to: ${context.intent}
Focus on:
- Logic errors
- Off-by-one errors
- Null/undefined handling
- Race conditions
- Resource leaks
- Error handling gaps
- Edge cases not covered
DO NOT comment on:
- Code style
- Naming conventions
- Refactoring opportunities
Changed code with surrounding context:
${context.files.map(f => `
=== ${f.path} ===
FULL FILE:
${f.fullContent}
CHANGES (diff):
${f.diff}
`).join('\n')}
Return JSON with issues found. Include line numbers from the FULL FILE, not the diff.`;
const result = await llm.complete(bugPrompt);
return JSON.parse(result);
}
The "DO NOT comment on" section is critical. Without it the model will nitpick everything and developers will start ignoring all comments.
Performance Review
async function performanceReview(context: ReviewContext) {
// Only for files that might have perf impact
const perfRelevantFiles = context.files.filter(f =>
f.path.includes('api/') ||
f.path.includes('db/') ||
f.path.includes('query') ||
f.fullContent.includes('SELECT') ||
f.fullContent.includes('fetch(')
);
if (perfRelevantFiles.length === 0) {
return { issues: [] };
}
// ... run perf analysis
}
Skip performance review for files that obviously don't need it. No point checking CSS files for N+1 queries.
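The analysis itself follows the same pattern as the other passes. A sketch of what the elided body might look like (the checklist is my assumption; tune it to your stack):
// Inside performanceReview, after the early return:
const perfPrompt = `You are reviewing code for performance problems.
Focus ONLY on:
- N+1 queries and missing pagination
- Sequential awaits that could run in parallel
- Unbounded loops over large collections
- Large responses fetched but mostly unused
Changed files:
${perfRelevantFiles.map(f => `
=== ${f.path} ===
${f.fullContent}
`).join('\n')}
Return the same JSON format as the other passes. If nothing stands out, return {"issues": []}.`;
const result = await llm.complete(perfPrompt);
return JSON.parse(result);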
Step 3: Filter and Rank
This is where most AI review systems fail. They report everything and overwhelm developers.
function filterAndRank(allIssues: Issue[]): Issue[] {
// Remove duplicates
const deduplicated = deduplicateIssues(allIssues);
// Filter by confidence
const confident = deduplicated.filter(issue =>
issue.confidence >= 0.7 ||
(issue.severity === 'critical' && issue.confidence >= 0.5)
);
// Sort by severity and confidence
const sorted = confident.sort((a, b) => {
const severityOrder = { critical: 0, high: 1, medium: 2, low: 3 };
const severityDiff = severityOrder[a.severity] - severityOrder[b.severity];
if (severityDiff !== 0) return severityDiff;
return b.confidence - a.confidence;
});
// Limit total comments to avoid overwhelming
return sorted.slice(0, 10);
}
The confidence threshold is important. I'd rather miss a few real issues than flood every PR with false positives. Trust gets destroyed fast.
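The deduplicateIssues helper isn't shown above. A simple version keys on file, line, and issue type, keeping the highest-confidence report when multiple passes flag the same spot:
function deduplicateIssues(issues: Issue[]): Issue[] {
  const byLocation = new Map<string, Issue>();

  for (const issue of issues) {
    const key = `${issue.file}:${issue.line}:${issue.type}`;
    const existing = byLocation.get(key);

    // Keep whichever report is more confident
    if (!existing || issue.confidence > existing.confidence) {
      byLocation.set(key, issue);
    }
  }

  return [...byLocation.values()];
}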
Step 4: Post Thoughtful Comments
Make comments actionable. Nobody likes "this might be a problem" without a solution.
function formatComment(issue: Issue): string {
const severityEmoji = {
critical: '🚨',
high: '⚠️',
medium: '🟡',
low: 'ℹ️'
};
return `${severityEmoji[issue.severity]} **${issue.severity.toUpperCase()}**: ${issue.issue}
${issue.suggestion}
<details>
<summary>Why this matters</summary>
${issue.explanation || 'This could lead to issues in production.'}
</details>
---
<sub>🤖 AI Review | confidence: ${Math.round(issue.confidence * 100)}% | [false positive?](link-to-feedback)</sub>`;
}
Always include:
- What the problem is
- How to fix it
- Why it matters
- A way to report false positives
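Posting the comment itself is the straightforward part. A sketch using Octokit's review-comment endpoint (the PullRequest field names here are assumptions; note GitHub only accepts inline comments on lines that appear in the diff):
import { Octokit } from '@octokit/rest';

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

async function postComment(pr: PullRequest, issue: Issue) {
  await octokit.rest.pulls.createReviewComment({
    owner: pr.owner,          // assumed fields on PullRequest
    repo: pr.repo,
    pull_number: pr.number,
    commit_id: pr.head,       // the head commit the review ran against
    path: issue.file,
    line: issue.line,         // must be a line present in the diff
    side: 'RIGHT',
    body: formatComment(issue)
  });
}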
Handling False Positives
// Track when developers dismiss AI comments
async function handleCommentReaction(event: CommentEvent) {
if (event.reaction === '👎' || event.resolved_without_change) {
await feedbackStore.record({
issueType: event.comment.issueType,
file: event.comment.file,
wasHelpful: false,
context: event.comment.context
});
// If pattern of false positives emerges, adjust
const falsePositiveRate = await feedbackStore.getFalsePositiveRate(
event.comment.issueType
);
if (falsePositiveRate > 0.3) {
await alerting.notify(
`High false positive rate for ${event.comment.issueType}`
);
}
}
}
Use the feedback to improve prompts over time:
async function loadCalibration(issueType: string) {
const examples = await feedbackStore.getFalsePositives(issueType, 10);
if (examples.length > 0) {
return `
IMPORTANT: Avoid false positives like these previous mistakes:
${examples.map(e => `- ${e.file}: ${e.issue} (was NOT actually a problem)`).join('\n')}
`;
}
return '';
}
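The calibration string just gets prepended to the relevant pass, for example:
// Inside securityReview, before calling the model:
const calibration = await loadCalibration('security');
const result = await llm.complete(calibration + securityPrompt);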
CI/CD Integration
Block merges for critical issues, but be smart about it:
async function checkPRStatus(pr: PullRequest): Promise<CheckResult> {
const issues = await runFullReview(pr);
const criticalIssues = issues.filter(i => i.severity === 'critical');
const highConfidenceCritical = criticalIssues.filter(i => i.confidence >= 0.85);
if (highConfidenceCritical.length > 0) {
return {
status: 'failure',
message: `${highConfidenceCritical.length} critical issues found`,
};
}
if (criticalIssues.length > 0) {
return {
status: 'neutral', // Warning but don't block
message: 'Potential critical issues (review recommended)',
};
}
return { status: 'success' };
}
Only block when confidence is really high. Nothing kills trust faster than blocking a PR for a false positive.
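Publishing the verdict is a matter of creating a check run. A sketch with Octokit's Checks API, reusing the octokit client from Step 4 (the conclusion values line up with the CheckResult statuses above):
async function publishCheck(pr: PullRequest, result: CheckResult) {
  await octokit.rest.checks.create({
    owner: pr.owner,
    repo: pr.repo,
    name: 'ai-review',
    head_sha: pr.head,
    status: 'completed',
    conclusion: result.status,   // 'success' | 'failure' | 'neutral'
    output: {
      title: 'AI Code Review',
      summary: result.message ?? 'No blocking issues found.'
    }
  });
}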
Project Configuration
Make your review configurable per project:
# .ai-review.yml
version: 1
review:
security: true
bugs: true
performance: true
tests: false # Don't review test files
thresholds:
block_merge: critical
require_response: high
comment_only: medium
context:
conventions: |
- Use async/await, not callbacks
- All API endpoints need authentication
- Database queries must use parameterized statements
ignore_patterns:
- "*.test.ts"
- "migrations/*"
- "generated/*"
Every codebase is different. The AI needs to know your rules.
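The loadProjectConfig call from Step 1 just reads this file and maps it onto the fields gatherContext expects. A minimal sketch with js-yaml (the defaults are assumptions):
import { load } from 'js-yaml';

async function loadProjectConfig(repo: string) {
  const raw = await github.getFileContent(repo, '.ai-review.yml', 'HEAD');
  const parsed = load(raw) as any;

  return {
    review: parsed?.review ?? { security: true, bugs: true, performance: true },
    thresholds: parsed?.thresholds ?? { block_merge: 'critical' },
    codeConventions: parsed?.context?.conventions ?? '',
    reviewIgnore: parsed?.context?.ignore_patterns ?? []
  };
}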
Wrapping Up
Effective AI code review requires:
- Rich context - Full files, not just diffs
- Focused prompts - Separate passes for different concerns
- High precision - Filter aggressively; better to miss some than spam
- Actionable feedback - Tell devs how to fix it, not just what's wrong
- Learning from mistakes - Track false positives and adapt
Start with security review only. Get that working well. Then add bug detection. Then performance. Each pass should earn its place by providing real value.
The goal isn't to replace human reviewers. It's to let them focus on architecture and design while AI handles the tedious pattern-matching stuff. When done right, everyone wins.
