Making Claude verify its own work automatically

I’m tired of Claude saying it fixed something when it didn’t. You probably know the frustration: Claude reports the task is done, but when you check the app, nothing changed. Then you have to send screenshots and explain what went wrong.

My Solution: Auto-Testing with Playwright

I built a system that makes Claude check its own work automatically. Here’s how it works:

  1. Claude finishes a task and triggers a validation script
  2. Playwright opens the browser and visits your pages
  3. It takes screenshots and checks for errors
  4. Claude looks at the results and fixes issues if needed

Setup Steps:

Install Playwright first:

npm install @playwright/test
npx playwright install

Create hook config (.claude/settings.json):

{
  "hooks": {
    "Stop": [
      {
        "matcher": "",
        "hooks": [
          {
            "type": "command",
            "command": "node validation/check-work.js"
          }
        ]
      }
    ]
  }
}
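One detail worth calling out, based on Claude Code’s documented hook conventions (verify against the docs for your version): for a Stop hook, the script’s exit code decides what happens next. Exit 0 means success; exit 2 is a blocking error whose stderr gets fed back to Claude so it keeps working. That’s why a validation script should exit non-zero when checks fail, not just log the problems. A sketch of the mapping:

```javascript
// Hook exit-code convention (an assumption based on Claude Code's hook docs;
// double-check the current documentation for your version):
//   0 -> success, Claude stops normally
//   2 -> blocking error, stderr is fed back to Claude so it keeps working
function hookExitCode(allPagesPassed) {
  return allPagesPassed ? 0 : 2;
}
```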

The validation script:

const { chromium } = require('@playwright/test');
const fs = require('fs');
const path = require('path');

const SETTINGS = {
  serverUrl: 'http://localhost:3000',
  outputDir: './test-results',
  
  testPages: [
    {
      url: '/',
      title: 'home',
      checkElements: ['h1', 'nav', 'main']
    },
    {
      url: '/dashboard',
      title: 'dashboard', 
      checkElements: ['header', '.sidebar', '.content']
    }
  ]
};

async function testPage(browser, pageInfo) {
  const testResult = {
    pageName: pageInfo.title,
    passed: true,
    issues: [],
    responseTime: 0
  };

  console.log(`Testing ${pageInfo.title} page...`);

  const browserPage = await browser.newPage();
  const errorLogs = [];
  
  browserPage.on('console', message => {
    if (message.type() === 'error') {
      errorLogs.push(message.text());
    }
  });

  try {
    const startTime = Date.now();
    await browserPage.goto(`${SETTINGS.serverUrl}${pageInfo.url}`, {
      waitUntil: 'domcontentloaded',
      timeout: 8000
    });
    testResult.responseTime = Date.now() - startTime;

    if (!fs.existsSync(SETTINGS.outputDir)) {
      fs.mkdirSync(SETTINGS.outputDir, { recursive: true });
    }
    
    await browserPage.screenshot({
      path: path.join(SETTINGS.outputDir, `${pageInfo.title}-screenshot.png`),
      fullPage: true
    });

    for (const element of pageInfo.checkElements) {
      try {
        await browserPage.waitForSelector(element, { timeout: 2000 });
        console.log(`Found element: ${element}`);
      } catch (err) {
        testResult.issues.push(`Element not found: ${element}`);
        testResult.passed = false;
        console.log(`Missing element: ${element}`);
      }
    }

    if (errorLogs.length > 0) {
      testResult.issues.push(...errorLogs.map(log => `Browser error: ${log}`));
      testResult.passed = false;
    }

  } catch (err) {
    testResult.issues.push(`Page load failed: ${err.message}`);
    testResult.passed = false;
  }

  await browserPage.close();
  return testResult;
}

async function runTests() {
  console.log('Starting automated validation...');

  const browser = await chromium.launch({ headless: true });
  const testResults = [];
  
  for (const pageConfig of SETTINGS.testPages) {
    const result = await testPage(browser, pageConfig);
    testResults.push(result);
  }

  await browser.close();

  const successCount = testResults.filter(r => r.passed).length;
  const totalCount = testResults.length;
  
  console.log(`Test Results: ${successCount}/${totalCount} pages passed`);
  
  if (successCount !== totalCount) {
    // Log failures to stderr so a Stop hook can surface them to Claude
    console.error('Found issues:');
    testResults.forEach(result => {
      if (!result.passed) {
        console.error(`${result.pageName}: ${result.issues.join(', ')}`);
      }
    });
    // Exit code 2 signals a blocking error, so Claude keeps working on fixes
    process.exit(2);
  }

  process.exit(0);
}

runTests().catch(err => {
  // An unexpected crash should also fail the hook, not silently pass
  console.error(err);
  process.exit(1);
});

Add to your CLAUDE.md file:

## Task Validation Requirements

For any code changes, always:
1. Run: `node validation/check-work.js`
2. Review screenshots in test-results folder
3. Fix any issues found before marking task complete

This has saved me tons of time. Claude now catches its own mistakes before I have to point them out. What kind of validation checks would help with your projects?

This is brilliant. I’ve been dealing with the same issue where Claude claims everything works but then I find broken layouts or missing functionality. Your Playwright approach is way more thorough than my current setup.

I’m just using basic HTTP status checks right now, but your DOM element validation is way better. The screenshot capture is especially useful - sometimes things load but look completely wrong.

One thing I’d add: check for specific text content, not just element presence. I’ve had cases where the right HTML tags existed but contained placeholder text or were empty. You could extend the checkElements array to include expected text matches.
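A minimal sketch of that text check, with the matching logic pulled into a pure helper so it’s easy to test (the selector/text pairs here are made-up examples, not part of the original config):

```javascript
// Hypothetical extension of the page config: pair selectors with expected text
const textChecks = [
  { selector: 'h1', expectedText: 'Dashboard' } // example values, adjust per page
];

// Pure helper: rejects empty/whitespace-only content and missing text
function validateText(actual, expected) {
  const text = (actual || '').trim();
  return text.length > 0 && text.includes(expected);
}

// Inside testPage you could then do something like:
//   const actual = await browserPage.textContent(check.selector);
//   if (!validateText(actual, check.expectedText)) { /* record an issue */ }
```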

This works great with CI/CD pipelines too. I run similar validation in GitHub Actions before deployments and it catches issues that would otherwise reach production. The automated screenshots have helped me spot visual regressions that functional tests miss.

Do you run this on every code change or just when Claude says it’s finished? I’m wondering about the performance impact during development.

Perfect! This is exactly what I needed. Claude keeps saying it “fixed” things when nothing actually changed - driving me crazy.

Quick question: does this work with React dev servers that auto-reload? My localhost:3000 restarts constantly during development and I’m wondering if the timing screws up validation.
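One workaround I’m considering for the restart timing: poll the server until it responds before kicking off the checks. A sketch, assuming Node 18+ for the built-in fetch:

```javascript
// Wait for the dev server to come back up before running validation.
// Tune retries/delay to match how long your dev server takes to restart.
async function waitForServer(url, retries = 20, delayMs = 500) {
  for (let i = 0; i < retries; i++) {
    try {
      const res = await fetch(url);
      if (res.ok) return true; // server is responding
    } catch (_) {
      // connection refused: server still restarting
    }
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
  return false; // gave up after retries * delayMs ms
}
```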

Also thinking I should add console.warn checks, not just errors. Claude sometimes introduces warnings that become bigger problems down the line.
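A sketch of that, pulled into a helper (assumes Playwright’s console event API; 'warning' is the type string Playwright reports for console.warn):

```javascript
// Capture both errors and warnings from the page console.
// Works with anything exposing Playwright-style page.on('console', handler).
function attachConsoleCapture(page) {
  const logs = { errors: [], warnings: [] };
  page.on('console', message => {
    if (message.type() === 'error') logs.errors.push(message.text());
    if (message.type() === 'warning') logs.warnings.push(message.text());
  });
  return logs;
}
```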

Smart approach. I’ve been manually checking Claude’s work for months - automating this makes total sense. Love the hook integration catching issues before you waste time debugging.

Add network request monitoring to your validation script. Claude sometimes breaks API calls without obvious visual signs. Use `page.on('requestfailed')` to capture failed requests and include them in your results.
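Something like this sketch (note that `request.failure()` can return null in Playwright, so guard for it):

```javascript
// Record failed network requests so API breakage shows up in the results.
// Works with anything exposing Playwright-style page.on('requestfailed', ...).
function attachNetworkCapture(page) {
  const failed = [];
  page.on('requestfailed', request => {
    const failure = request.failure(); // may be null per Playwright docs
    failed.push(`${request.url()} - ${failure ? failure.errorText : 'unknown error'}`);
  });
  return failed;
}
```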

For database apps, I validate data persistence after Claude’s changes. Run quick SQL queries or hit API endpoints to verify CRUD operations still work. Claude fixes frontend stuff but often breaks backend functionality.
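A minimal sketch of the endpoint check (the paths are placeholders for whatever your API actually exposes; assumes Node 18+ for global fetch):

```javascript
// Hit read endpoints after changes to confirm the backend still responds.
// Endpoint list is a made-up example - substitute your real API routes.
const endpointsToCheck = ['/api/items', '/api/users'];

async function checkEndpoint(baseUrl, path, expectedStatus = 200) {
  try {
    const res = await fetch(`${baseUrl}${path}`);
    return { path, ok: res.status === expectedStatus, status: res.status };
  } catch (err) {
    // Connection errors (server down, wrong port) also count as failures
    return { path, ok: false, status: null, error: err.message };
  }
}
```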

Upgrade screenshot comparison with visual diff tools like Pixelmatch. Catches subtle UI changes that slip past element tests but still mess up user experience. I’ve found Claude accidentally changing colors, spacing, or alignment this way.

Localhost works great for development, but I also test on staging for realistic conditions. Production-like environments reveal issues local testing misses - different server configs, external dependencies, etc.
