Automated Broken Image Scanner for Production Sites

Build an automated broken image checker that crawls production pages, reports 404 image URLs, and integrates with CI pipelines to catch regressions before users see them.

Broken images reach production in several ways: a CDN migration leaves stale URLs, a CMS editor deletes a media attachment, or a deployment removes a static asset. An automated broken image checker that crawls your site and reports 404 image URLs catches these regressions before users encounter broken-image icons.

This guide covers building a crawler with Playwright, extracting and validating image URLs, integrating the scanner into a CI pipeline, and using fallback.pics to understand which placeholder dimensions you need.

Approach

How automated broken image scanning works

A broken image scanner crawls pages, collects all img src and srcset URLs, makes HEAD requests to each URL, and reports any that return non-200 status codes. For large sites, the scanner deduplicates URLs across pages so each unique image URL is checked once regardless of how many pages reference it.

Running this in CI against a staging environment catches broken images before deployment. Running it nightly against production catches regressions from CMS changes, CDN purges, or third-party image host failures.
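
The deduplication step described above can be sketched as a small pure function. This is a minimal sketch, not part of the original scanner: the names `PageImages` and `indexImagesByUrl` are hypothetical. It keeps one entry per unique image URL and remembers which pages referenced it, which becomes useful later when a report needs the referring page.

```typescript
// Hypothetical shape: the image URLs collected from one crawled page
type PageImages = { page: string; imageUrls: string[] };

// Build an index of unique image URL -> pages that reference it.
// Each URL appears once as a key, so it only needs to be checked once.
function indexImagesByUrl(pages: PageImages[]): Map<string, string[]> {
  const index = new Map<string, string[]>();
  for (const { page, imageUrls } of pages) {
    for (const url of imageUrls) {
      const referrers = index.get(url) ?? [];
      if (!referrers.includes(page)) referrers.push(page);
      index.set(url, referrers);
    }
  }
  return index;
}

const index = indexImagesByUrl([
  { page: '/', imageUrls: ['https://cdn.example.com/logo.png', 'https://cdn.example.com/hero.jpg'] },
  { page: '/shop', imageUrls: ['https://cdn.example.com/logo.png'] },
]);
console.log(index.size); // number of unique URLs to check
```

The keys of the map are the URLs to send HEAD requests to; the values let a report say where each broken image was found.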

Playwright crawler

Building a broken image checker with Playwright

Playwright gives you full browser rendering, which means img src values set via JavaScript or frameworks (React, Vue) are captured after rendering — not just from the HTML source. This catches dynamically inserted images that a simple HTML parser would miss.

Use page.$$eval() to collect all img src values after the page reaches the networkidle state. Filter out data URIs and blob URLs, which are not network resources.

Implementation tsx
// scripts/scan-images.ts
import { chromium, Browser, Page } from 'playwright';
import fetch from 'node-fetch';

const BASE_URL = process.env.BASE_URL ?? 'https://yourapp.com';
const ROUTES = ['/', '/shop', '/blog', '/about'];

async function getImageUrls(page: Page): Promise<string[]> {
  return page.$$eval('img[src]', (imgs) =>
    imgs
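      // img.src is the resolved absolute URL, so relative and protocol-relative paths work
      .map((img) => (img as HTMLImageElement).src)
      // Keep only network URLs; this drops data: and blob: URLs
      .filter((src) => src.startsWith('http'))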
  );
}

async function checkUrl(url: string): Promise<{ url: string; status: number }> {
  try {
    const res = await fetch(url, { method: 'HEAD', redirect: 'follow' });
    return { url, status: res.status };
  } catch {
    return { url, status: 0 };
  }
}

async function scanRoute(browser: Browser, route: string) {
  const page = await browser.newPage();
  await page.goto(`${BASE_URL}${route}`, { waitUntil: 'networkidle' });
  const urls = await getImageUrls(page);
  await page.close();
  return urls;
}

async function main() {
  const browser = await chromium.launch();
  const urlSet = new Set<string>();

  for (const route of ROUTES) {
    const urls = await scanRoute(browser, route);
    urls.forEach((u) => urlSet.add(u.startsWith('/') ? `${BASE_URL}${u}` : u));
  }

  await browser.close();

  const results = await Promise.all([...urlSet].map(checkUrl));
  const broken = results.filter((r) => r.status !== 200 && r.status !== 301 && r.status !== 302);

  if (broken.length > 0) {
    console.error('Broken images found:');
    broken.forEach((r) => console.error(`  [${r.status}] ${r.url}`));
    process.exit(1);
  }

  console.log(`Checked ${urlSet.size} unique image URLs. All OK.`);
}

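// Report crawl errors (for example a navigation timeout) and fail the run
main().catch((err) => {
  console.error(err);
  process.exit(1);
});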
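
One caveat with HEAD-only checks: some servers and image hosts reject the HEAD method (for example with 405 Method Not Allowed), which would show up as a false positive. A hedged variant of checkUrl that retries with GET in that case might look like the sketch below. It assumes Node 18+ with the global fetch, and the name `checkUrlWithFallback` and the injectable `fetchFn` parameter are illustrative additions, not part of the original script.

```typescript
type CheckResult = { url: string; status: number };

// Status codes suggesting the server rejected the HEAD method itself
const HEAD_UNSUPPORTED = new Set([405, 501]);

async function checkUrlWithFallback(
  url: string,
  fetchFn: typeof fetch = fetch, // injectable so the function can be exercised offline
): Promise<CheckResult> {
  try {
    let res = await fetchFn(url, { method: 'HEAD', redirect: 'follow' });
    if (HEAD_UNSUPPORTED.has(res.status)) {
      res = await fetchFn(url, { method: 'GET', redirect: 'follow' });
      // Only the status matters, so discard the body instead of downloading it
      await res.body?.cancel();
    }
    return { url, status: res.status };
  } catch {
    // Network-level failure (DNS, TLS, connection reset): same convention as checkUrl
    return { url, status: 0 };
  }
}
```

The GET retry costs extra bandwidth only for hosts that refuse HEAD, so the common case stays cheap.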

srcset

Scanning srcset and picture element sources

Product grids and responsive images use srcset, and picture elements may have multiple source URLs. A scanner that checks only the img src attribute can miss many of the image URLs on a responsive site.

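Parse srcset strings by splitting on commas and taking the URL from each candidate, which is a URL followed by an optional density or width descriptor. The format is 'url 2x, url2 3x' or 'url 400w, url2 800w'. This simple split breaks on URLs that themselves contain commas, so treat it as a first approximation.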

Implementation tsx
// Enhanced image URL extraction
async function getImageUrls(page: Page): Promise<string[]> {
  return page.evaluate(() => {
    const urls: string[] = [];

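    document.querySelectorAll('img').forEach((img) => {
      if (img.src) urls.push(img.src);
      if (img.srcset) {
        img.srcset.split(',').forEach((part) => {
          // Each candidate is "<url> <descriptor>"; keep only the URL
          const url = part.trim().split(/\s+/)[0];
          if (url) urls.push(url);
        });
      }
    });

    document.querySelectorAll('source[srcset]').forEach((source) => {
      (source as HTMLSourceElement).srcset.split(',').forEach((part) => {
        const url = part.trim().split(/\s+/)[0];
        if (url) urls.push(url);
      });
    });

    // srcset values are not resolved by the DOM, so make them absolute before filtering
    const absolute = urls.flatMap((u) => {
      try {
        return [new URL(u, document.baseURI).href];
      } catch {
        return []; // skip values that do not parse as URLs
      }
    });

    return [...new Set(absolute)].filter((u) => u.startsWith('http'));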
  });
}
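
The srcset parsing rule above can also be pulled out of the browser context into a pure function, which makes it easy to unit test. The sketch below is one possible shape, with `parseSrcset` and its `baseUrl` parameter as hypothetical names. It uses the same comma split as the extraction code, so it shares the limitation that URLs containing commas are not handled.

```typescript
// Parse a srcset string into absolute URLs, resolving each candidate against baseUrl.
// Uses a plain comma split, so URLs that contain commas are not supported.
function parseSrcset(srcset: string, baseUrl: string): string[] {
  const urls: string[] = [];
  for (const candidate of srcset.split(',')) {
    // A candidate is "<url> <descriptor>", e.g. "/img/a-400.jpg 400w"
    const rawUrl = candidate.trim().split(/\s+/)[0];
    if (!rawUrl) continue;
    try {
      urls.push(new URL(rawUrl, baseUrl).href);
    } catch {
      // Ignore candidates that do not parse as URLs
    }
  }
  return urls;
}

console.log(parseSrcset('/img/a-400.jpg 400w, /img/a-800.jpg 800w', 'https://example.com/shop'));
// prints [ 'https://example.com/img/a-400.jpg', 'https://example.com/img/a-800.jpg' ]
```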

CI integration

Running the broken image scanner in GitHub Actions

Schedule the scanner as a daily cron job against production and as a required check on PRs against your staging environment. Use different exit code handling: fail the PR check on any broken image, but only send an alert (not a build failure) for production scans.

Cache the list of known-broken URLs in CI so the scanner can distinguish new regressions from pre-existing issues. Report only newly broken URLs in PR comments.

Implementation yaml
# .github/workflows/image-scan.yml
name: Broken Image Scanner
on:
  schedule:
    - cron: '0 6 * * *'  # Daily at 6am UTC
  pull_request:
    branches: [main]

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci
      - run: npx playwright install --with-deps chromium
      - name: Scan images
        run: npx tsx scripts/scan-images.ts
        env:
          BASE_URL: ${{ github.event_name == 'schedule' && 'https://yourapp.com' || 'https://staging.yourapp.com' }}
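
The workflow above runs the same script for both triggers, so the different exit code handling and the known-broken baseline have to live in the script. A minimal sketch of both pieces follows. The names `findNewlyBroken`, `exitCodeFor`, and the `SCAN_MODE` environment variable are assumptions for illustration; how the baseline list is stored and restored between runs is left open.

```typescript
type BrokenImage = { url: string; status: number };
type ScanMode = 'pr' | 'production';

// Keep only the entries whose URL is absent from the known-broken baseline
function findNewlyBroken(current: BrokenImage[], knownBrokenUrls: string[]): BrokenImage[] {
  const baseline = new Set(knownBrokenUrls);
  return current.filter((item) => !baseline.has(item.url));
}

// PR scans fail the check on any broken image; production scans never fail
// the build and are expected to alert through another channel instead
function exitCodeFor(mode: ScanMode, brokenCount: number): number {
  return mode === 'pr' && brokenCount > 0 ? 1 : 0;
}

// Possible wiring at the end of main(), assuming the workflow sets SCAN_MODE:
//   const mode: ScanMode = process.env.SCAN_MODE === 'production' ? 'production' : 'pr';
//   const fresh = findNewlyBroken(broken, knownBrokenUrls); // for the PR comment
//   process.exit(exitCodeFor(mode, broken.length));
```

Keeping the two decisions separate matches the guidance above: the exit code depends on every broken image, while the PR comment lists only the new ones.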

Reporting

Generating reports and alerting on broken images

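Output the scan report as a JSON file with the URL, status code, referring page, and image alt text for each broken image. The basic scanner above tracks only the URL and status, so the extraction step has to carry the page and alt text alongside each URL. Upload the file as a CI artifact for debugging. For production scans, post the report to Slack or create a GitHub issue with the list of broken URLs.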

Group broken images by domain to identify systematic failures. Fifteen broken images all from the same CDN domain indicate a CDN migration issue, not individual missing files.
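
The report shape and the per-domain grouping described above could be sketched as follows. The field names in `BrokenImageEntry` and the function names are hypothetical; the sketch assumes every entry holds an absolute URL. Domains are sorted by count so that a systematic failure, such as a whole CDN host going missing, appears first.

```typescript
// Hypothetical report entry with the four fields named above
interface BrokenImageEntry {
  url: string;
  status: number;
  page: string; // referring page
  alt: string;
}

interface DomainGroup {
  domain: string;
  count: number;
  images: BrokenImageEntry[];
}

// Group entries by hostname, largest group first
function groupByDomain(entries: BrokenImageEntry[]): DomainGroup[] {
  const groups = new Map<string, BrokenImageEntry[]>();
  for (const entry of entries) {
    const domain = new URL(entry.url).hostname;
    const images = groups.get(domain) ?? [];
    images.push(entry);
    groups.set(domain, images);
  }
  return Array.from(groups, ([domain, images]) => ({ domain, count: images.length, images }))
    .sort((a, b) => b.count - a.count);
}

// Serialize the grouped report as pretty-printed JSON
function toJsonReport(entries: BrokenImageEntry[]): string {
  return JSON.stringify({ total: entries.length, domains: groupByDomain(entries) }, null, 2);
}
```

The string returned by toJsonReport can be written to a file and uploaded as the CI artifact, or summarized per domain in a Slack message.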

Key takeaways

What to standardize before shipping

  • A Playwright-based crawler captures dynamically inserted image URLs that a static HTML parser misses — essential for React, Vue, and Next.js sites.
  • Parse srcset attributes and picture source elements in addition to img.src so that responsive image variants are checked too.
  • Run the scanner daily against production on a cron schedule and as a required PR check against staging to catch both deployment regressions and CMS-driven changes.
  • Group broken images by domain in the report to distinguish systematic CDN failures from individual missing files.
  • Use fallback.pics dimension-matched URLs as emergency replacements while permanent fixes are being deployed.
