Site Crawler

How GrowDeck discovers and maps your website.

Last updated: May 2025

Overview

The GrowDeck crawler uses Playwright, a full browser engine, to visit your website the same way a real user would. This means JavaScript-rendered content, SPAs, and dynamically loaded pages are all crawled correctly.

What gets extracted

For each page, the crawler extracts:

  • URL, title, meta description
  • H1–H6 heading structure
  • Body text (cleaned, no nav/footer)
  • Internal links (used for link graph)
  • Canonical URL
  • Page type classification (homepage, blog, product, landing, etc.)
  • Word count and content depth
  • Schema markup presence
  • Open Graph metadata
  • Page load performance signals
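
As a rough illustration of several of the fields above, here is a minimal sketch that pulls the title, meta description, canonical URL, headings, internal links, and Open Graph tags out of already-rendered HTML. It uses only the Python standard library. The PageRecord class, its field names, and extract_page are hypothetical and are not GrowDeck's actual schema or code. Body text cleaning, page type classification, schema detection, and performance signals are left out.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


@dataclass
class PageRecord:
    """Hypothetical per-page record; field names are illustrative only."""
    url: str
    title: str = ""
    meta_description: str = ""
    canonical_url: str = ""
    headings: list = field(default_factory=list)        # (level, text) pairs
    internal_links: list = field(default_factory=list)  # absolute, same-host URLs
    open_graph: dict = field(default_factory=dict)      # e.g. {"og:title": "..."}


class _Extractor(HTMLParser):
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self, record):
        super().__init__()
        self.record = record
        self._capture = None  # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title" or tag in self.HEADINGS:
            self._capture, self._buffer = tag, []
        elif tag == "meta" and (a.get("name") or "").lower() == "description":
            self.record.meta_description = a.get("content") or ""
        elif tag == "meta" and (a.get("property") or "").startswith("og:"):
            self.record.open_graph[a["property"]] = a.get("content") or ""
        elif tag == "link" and (a.get("rel") or "").lower() == "canonical":
            self.record.canonical_url = a.get("href") or ""
        elif tag == "a" and a.get("href"):
            # Resolve relative links, then keep only links on the same host.
            target = urljoin(self.record.url, a["href"])
            same_host = urlparse(target).netloc == urlparse(self.record.url).netloc
            if same_host and target not in self.record.internal_links:
                self.record.internal_links.append(target)

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._capture:
            return
        text = " ".join("".join(self._buffer).split())  # collapse whitespace
        if tag == "title":
            self.record.title = text
        else:
            self.record.headings.append((int(tag[1]), text))
        self._capture = None


def extract_page(url, html):
    """Parse rendered HTML and return a populated PageRecord."""
    record = PageRecord(url=url)
    parser = _Extractor(record)
    parser.feed(html)
    parser.close()
    return record
```

The internal_links list is the kind of input a link graph could be built from: one edge per (page URL, linked URL) pair.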

Crawler settings

  • Depth limit: Maximum pages to crawl per run (default: 500)
  • Respect robots.txt: Always on; the crawler checks robots.txt before crawling
  • Concurrency: 3 parallel browser contexts (configurable)
  • Delay: 500ms between requests to avoid rate limiting

Rate limiting
The crawler adds a 500ms delay between requests. On shared hosting or sites with aggressive rate limiting, consider reducing concurrency in site settings.
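
To make the interaction between these settings concrete, here is a minimal sketch of a robots.txt filter plus a throttled crawl loop. It assumes the delay is applied per worker after each visit, which the text above does not specify. CrawlSettings, allowed_urls, crawl, and the ExampleCrawler user agent are all hypothetical names chosen for illustration. The visit argument stands in for whatever actually loads a page.

```python
import asyncio
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser


@dataclass
class CrawlSettings:
    """Hypothetical settings object mirroring the documented defaults."""
    max_pages: int = 500
    concurrency: int = 3
    delay_seconds: float = 0.5
    user_agent: str = "ExampleCrawler"  # assumed; the real user agent is not documented here


def allowed_urls(robots_txt, urls, settings):
    """Keep only URLs the given robots.txt text permits, capped at max_pages."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    permitted = [u for u in urls if rules.can_fetch(settings.user_agent, u)]
    return permitted[: settings.max_pages]


async def crawl(urls, visit, settings):
    """Visit URLs with at most `concurrency` in flight, pausing after each visit."""
    slots = asyncio.Semaphore(settings.concurrency)

    async def worker(url):
        async with slots:
            result = await visit(url)
            # Hold the slot during the pause so each worker spaces out its requests.
            await asyncio.sleep(settings.delay_seconds)
            return result

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(worker(u) for u in urls))
```

Under this model, lowering concurrency from 3 to 1 cuts the peak request rate to roughly one request per visit time plus 500ms, which is the lever the note above suggests for rate-limited hosts.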

How it handles SPAs

Playwright waits for networkidle before extracting content. Dynamic routes in Next.js, React, and Vue are handled correctly.

JavaScript-heavy sites
If your site is a SPA or uses heavy client-side rendering, the crawler handles it: Playwright waits for networkidle before extracting content.
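
For readers unfamiliar with the networkidle wait, here is a minimal sketch using Playwright's Python async API. It shows the general pattern only and is not GrowDeck's crawler code. The render_page function, the returned dictionary shape, and the 30-second timeout are assumptions made for the example. Running it requires the playwright package and a Chromium install.

```python
WAIT_UNTIL = "networkidle"  # Playwright load state named in the text above


async def render_page(url, timeout_ms=30_000):
    """Load a URL in headless Chromium and return its title and rendered HTML."""
    # Imported lazily so this module still loads where Playwright is not installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        try:
            context = await browser.new_context()  # fresh context: no cookies or cache
            page = await context.new_page()
            # Navigation is considered finished only once network activity settles,
            # so client-side rendered content has a chance to appear in the DOM.
            await page.goto(url, wait_until=WAIT_UNTIL, timeout=timeout_ms)
            return {
                "url": page.url,
                "title": await page.title(),
                "html": await page.content(),
            }
        finally:
            await browser.close()
```

A caller would run it with asyncio.run(render_page("https://example.com")) and pass the returned HTML to whatever extraction step follows.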

Crawler technology: Sandflare microVMs

GrowDeck runs every crawl in an isolated Sandflare microVM for security and performance. Each page is visited in a fresh browser-agent sandbox with Chromium and Playwright pre-installed.

This isolation ensures:

  • Security: Malicious JavaScript or tracking scripts can't escape the sandbox
  • Clean state: Every crawl starts fresh with no cookies or cached data
  • Scalability: Sandboxes launch in under 1 second with automatic cleanup

Learn more in the Sandflare docs, or see the crawler implementation in our open-source repository.

Triggering a crawl

Crawls can be triggered:

  1. Manually from the site dashboard → Crawl tab
  2. On a schedule (daily/weekly, PRO): configure in site settings
  3. Via API: POST /api/v1/sites/:siteId/jobs with { "type": "CRAWL" }
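
The API option can be sketched as follows. Only the path and the JSON body come from the text above. The host in BASE_URL is a placeholder, and the Bearer token header is an assumed authentication scheme, so check your account's API settings for the real values. The function builds the request without sending it.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://growdeck.example"  # placeholder host, not the real API origin


def build_crawl_request(site_id, api_token):
    """Build (but do not send) the POST request that queues a crawl job."""
    body = json.dumps({"type": "CRAWL"}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/sites/{quote(site_id, safe='')}/jobs",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",  # assumed auth scheme
        },
    )
```

To actually queue the job, pass the result to urllib.request.urlopen and read the response; the response format is not described here, so treat it as unknown until you inspect it.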

Crawl status

  • PENDING: queued, waiting for a worker
  • PROCESSING: actively crawling
  • COMPLETED: finished, pages extracted
  • FAILED: error occurred (check job logs)
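
A client waiting on a crawl might treat COMPLETED and FAILED as terminal states and poll until one is reached. The sketch below assumes that model. How the status is fetched is not documented here, so fetch_status is a caller-supplied function, and the poll interval and timeout are arbitrary example values.

```python
import time
from enum import Enum


class JobStatus(str, Enum):
    PENDING = "PENDING"
    PROCESSING = "PROCESSING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


# States after which the job will not change again (an assumption of this sketch).
TERMINAL = {JobStatus.COMPLETED, JobStatus.FAILED}


def wait_for_crawl(fetch_status, poll_seconds=5.0, timeout_seconds=1800.0, sleep=time.sleep):
    """Poll until the job reaches COMPLETED or FAILED, or raise TimeoutError.

    `fetch_status` is a caller-supplied function returning the current status string.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = JobStatus(fetch_status())  # raises ValueError on an unknown status
        if status in TERMINAL:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"crawl still {status.value} after {timeout_seconds}s")
        sleep(poll_seconds)
```

On a FAILED result, the next step is the one given above: check the job logs.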

After a crawl

Crawled pages populate the Pages tab. The keyword engine automatically runs after each completed crawl to refresh opportunity scoring.