Site Crawler

How GrowDeck discovers and maps your website.

Last updated: May 2025

Overview

The GrowDeck crawler uses Playwright — a full browser engine — to visit your website the same way a real user would. This means JavaScript-rendered content, SPAs, and dynamically loaded pages are all crawled correctly.

What gets extracted

For each page, the crawler extracts:

URL, title, meta description
H1–H6 heading structure
Body text (cleaned, no nav/footer)
Internal links (used for link graph)
Canonical URL
Page type classification (homepage, blog, product, landing, etc.)
Word count and content depth
Schema markup presence
Open Graph metadata
Page load performance signals
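
To make the field list above more concrete, here is a minimal sketch of how a few of these fields could be pulled from rendered HTML using only the Python standard library. It is an illustration, not GrowDeck's extractor: PageRecord, extract_page, and the field names are hypothetical, and it covers only the title, meta description, canonical URL, headings, and internal links.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

HEADING_TAGS = ("h1", "h2", "h3", "h4", "h5", "h6")


@dataclass
class PageRecord:
    """Hypothetical container for a subset of the extracted fields."""
    url: str
    title: str = ""
    meta_description: str = ""
    canonical: str = ""
    headings: list = field(default_factory=list)        # (level, text) pairs
    internal_links: list = field(default_factory=list)  # absolute URLs, same host


class _Extractor(HTMLParser):
    def __init__(self, record):
        super().__init__()
        self.record = record
        self._capture = None  # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title" or tag in HEADING_TAGS:
            self._capture, self._buffer = tag, []
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.record.meta_description = attrs.get("content") or ""
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.record.canonical = urljoin(self.record.url, attrs.get("href") or "")
        elif tag == "a" and attrs.get("href"):
            target = urljoin(self.record.url, attrs["href"])
            # Keep only links that stay on the same host as the page itself.
            if urlparse(target).netloc == urlparse(self.record.url).netloc:
                self.record.internal_links.append(target)

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._capture:
            return
        text = " ".join("".join(self._buffer).split())  # collapse whitespace
        if tag == "title":
            self.record.title = text
        else:
            self.record.headings.append((int(tag[1]), text))
        self._capture = None


def extract_page(url, html):
    """Parse already-rendered HTML into a PageRecord."""
    record = PageRecord(url=url)
    parser = _Extractor(record)
    parser.feed(html)
    parser.close()
    return record
```

The (level, text) heading pairs preserve document order, which is enough to rebuild the H1–H6 outline later.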

Crawler settings

Depth limit: Maximum pages to crawl per run (default: 500)
Respect robots.txt: Always on — the crawler checks robots.txt before crawling
Concurrency: 3 parallel browser contexts (configurable)
Delay: 500ms between requests to avoid rate limiting
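
As a rough illustration of how the page cap and the robots.txt check interact, the sketch below filters a list of candidate URLs with Python's standard urllib.robotparser and stops at the page limit. The function name plan_crawl and the user agent string "ExampleCrawler" are assumptions made for this example; GrowDeck's actual user agent and queueing logic are not described here.

```python
from urllib.robotparser import RobotFileParser


def plan_crawl(candidate_urls, robots_txt, user_agent="ExampleCrawler", max_pages=500):
    """Return the URLs that robots.txt permits, capped at max_pages.

    robots_txt is the text of the site's robots.txt file; user_agent is a
    placeholder name, not GrowDeck's real crawler identifier.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    planned = []
    for url in candidate_urls:
        if len(planned) >= max_pages:
            break  # page cap reached (default mirrors the documented 500)
        if parser.can_fetch(user_agent, url):
            planned.append(url)
    return planned
```

Disallowed URLs are skipped without counting toward the cap in this sketch; whether GrowDeck counts them is not stated.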

Rate limiting

The crawler adds a 500ms delay between requests. On shared hosting or sites with aggressive rate limiting, consider reducing concurrency in site settings.
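
The relationship between concurrency and delay can be sketched with asyncio. This is one possible interpretation, assuming the delay applies per browser context rather than globally (the text above does not say which); crawl_politely and the visit callback are hypothetical names.

```python
import asyncio


async def crawl_politely(urls, visit, concurrency=3, delay_s=0.5):
    """Visit URLs with at most `concurrency` requests in flight.

    `visit` is an async callable supplied by the caller. Each slot pauses
    for `delay_s` after a visit before it is released to the next URL.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(url):
        async with semaphore:
            result = await visit(url)
            await asyncio.sleep(delay_s)  # hold the slot during the pause
            return result

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(worker(url) for url in urls))
```

Under this model, lowering concurrency from 3 to 1 cuts the peak request rate to roughly a third, which is why reducing concurrency helps on hosts with strict rate limits.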

How it handles SPAs

Playwright waits for networkidle before extracting content. Dynamic routes in Next.js, React, and Vue are handled correctly.
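
For readers unfamiliar with the networkidle wait, the following sketch shows the general pattern using Playwright's Python sync API. It is not GrowDeck's crawler code: render_and_extract, crawl_one, and the 30-second timeout are assumptions for illustration. A timeout is included because pages that keep connections open (for example, long polling) may never become idle.

```python
def render_and_extract(page, url, timeout_ms=30_000):
    """Navigate and return rendered content once the network is idle.

    `page` is a Playwright sync-API Page (or any object exposing the same
    goto/title/content methods and url attribute).
    """
    page.goto(url, wait_until="networkidle", timeout=timeout_ms)
    return {"url": page.url, "title": page.title(), "html": page.content()}


def crawl_one(url):
    """Launch Chromium, render a single URL, and clean up afterwards."""
    # Imported lazily so render_and_extract can be used without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context()  # fresh context: no cookies or cache
        try:
            return render_and_extract(context.new_page(), url)
        finally:
            context.close()
            browser.close()
```

Because the HTML is read after client-side rendering finishes, routes that only exist once the JavaScript router has run are captured the same way as static pages.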

JavaScript-heavy sites

Sites with heavy client-side rendering need no extra configuration. The same networkidle wait applies, so content is extracted from the rendered page rather than from the initial HTML response.

Crawler technology: Sandflare microVMs

GrowDeck runs every crawl in an isolated Sandflare microVM for security and performance. Each page is visited in a fresh browser-agent sandbox with Chromium and Playwright pre-installed.

This isolation ensures:

Security — Malicious JavaScript or tracking scripts can't escape the sandbox
Clean state — Every crawl starts fresh with no cookies or cached data
Scalability — Sandboxes launch in under 1 second with automatic cleanup

Learn more in the Sandflare docs, or see the crawler implementation in our open-source repository.

Triggering a crawl

Crawls can be triggered:

Manually from the site dashboard → Crawl tab
On a schedule (daily/weekly) [PRO]: configure in site settings
Via API: POST /api/v1/sites/:siteId/jobs with { "type": "CRAWL" }
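
The API trigger above could be called along these lines. Only the path and the JSON body come from this page; the base URL, the bearer-token header, and the function name build_crawl_request are assumptions, so check the API reference for the real authentication scheme.

```python
import json
import urllib.request


def build_crawl_request(base_url, site_id, api_token):
    """Build (but do not send) a POST request that queues a crawl job.

    base_url and the Authorization header format are placeholders; only
    the path and the {"type": "CRAWL"} body are taken from the docs.
    """
    url = f"{base_url.rstrip('/')}/api/v1/sites/{site_id}/jobs"
    body = json.dumps({"type": "CRAWL"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",  # assumed auth scheme
        },
    )
```

Passing the returned object to urllib.request.urlopen sends the request; it is kept separate here so the request can be inspected or logged first.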

Crawl status

PENDING — queued, waiting for a worker
PROCESSING — actively crawling
COMPLETED — finished, pages extracted
FAILED — error occurred (check job logs)
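
A client that waits on a crawl might treat COMPLETED and FAILED as terminal states and keep polling otherwise. The sketch below assumes the caller supplies a fetch_status function that returns one of the four status strings; how the status is actually retrieved (endpoint, response shape) is not covered on this page, and wait_for_crawl and its defaults are hypothetical.

```python
import time
from enum import Enum


class JobStatus(str, Enum):
    PENDING = "PENDING"
    PROCESSING = "PROCESSING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


TERMINAL = {JobStatus.COMPLETED, JobStatus.FAILED}


def wait_for_crawl(fetch_status, poll_interval_s=5.0, max_polls=120, sleep=time.sleep):
    """Poll until the job reaches a terminal status and return that status.

    fetch_status is a caller-supplied callable returning a status string.
    Raises TimeoutError if the job is still running after max_polls checks.
    """
    for _ in range(max_polls):
        status = JobStatus(fetch_status())
        if status in TERMINAL:
            return status
        sleep(poll_interval_s)
    raise TimeoutError("crawl did not finish within the polling budget")
```

On FAILED, the job logs mentioned above are the place to look for the underlying error.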

After a crawl

Crawled pages populate the Pages tab. The keyword engine automatically runs after each completed crawl to refresh opportunity scoring.

Keyword Engine: See how fresh crawl data becomes scored keyword gaps.

Quick Start: Follow the fastest path from crawl to generated page.

Site Settings: Connect analytics and configure supporting integrations.