Allscreenshots
Guides

Extract everything from a URL in one call

Capture a screenshot, the page markdown, and structured JSON from a single request — then hand the bundle to an LLM

Extract everything from a URL in one call

A common workflow looks like this: load the page in a browser to grab a screenshot, fetch the HTML separately to parse meta tags, run the body through a Markdown converter for an LLM, and then call an LLM to fill in the parts selectors can't reach. Four round trips, two browsers, inconsistent page state.

With the outputs array, one request to POST /v1/screenshots returns all of those formats from a single page load — and counts as one screenshot against your quota regardless of how many output types you ask for.

What you'll build

By the end of this guide, one request returns a payload like:

{
  "url": "https://linear.app",
  "outputs": {
    "screenshot": { "storageUrl": "https://storage.allscreenshots.com/.../screenshot.png", "size": 19520 },
    "markdown":   { "storageUrl": "https://storage.allscreenshots.com/.../markdown.md",   "size": 15276 },
    "json": {
      "data": {
        "title": "Linear – The system for product development",
        "description": "Purpose-built for planning and building products with AI agents.",
        "ogImage": "https://linear.app/static/og/homepage.jpg",
        "themeColor": "#08090a",
        "favicon": "/favicon.ico",
        "appleIcon": "/static/apple-touch-icon.png",
        "twitterSite": "@linear"
      }
    }
  },
  "renderTimeMs": 2267
}

That's enough to power a brand-profile auto-fill, a link-preview card, a CRM enrichment row, or a competitor snapshot — all from one call.

Why one call

  • One billing unit. A request with five outputs still counts as one screenshot. See multi-output billing.
  • One browser session. Screenshots, HTML, markdown, and JSON come from the same DOM. No drift between what you captured and what you scraped.
  • Captured in a safe order. Outputs are processed screenshot → PDF → HTML → markdown → JSON, so a late-running script can't mutate the page before the screenshot lands.

If you're new to the outputs array, the API reference is the canonical source for option lists and response shapes. This guide focuses on stitching the outputs into a working pipeline.

Implementation

Make a single request with the outputs you need

Ask for the three signals that cover most onboarding, enrichment, and preview workflows: a screenshot, the main content as markdown, and structured JSON for meta tags.

curl -X POST 'https://api.allscreenshots.com/v1/screenshots' \
  -H 'X-API-Key: YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://linear.app",
    "blockCookieBanners": true,
    "responseType": "url",
    "outputs": [
      { "type": "screenshot", "format": "png" },
      { "type": "markdown", "mainContentOnly": true },
      { "type": "json", "schema": {
        "title":         { "selector": "title", "type": "text" },
        "description":   { "selector": "meta[name=description]",            "type": "attribute", "attribute": "content" },
        "ogTitle":       { "selector": "meta[property=\"og:title\"]",       "type": "attribute", "attribute": "content" },
        "ogDescription": { "selector": "meta[property=\"og:description\"]", "type": "attribute", "attribute": "content" },
        "ogImage":       { "selector": "meta[property=\"og:image\"]",       "type": "attribute", "attribute": "content" },
        "ogSiteName":    { "selector": "meta[property=\"og:site_name\"]",   "type": "attribute", "attribute": "content" },
        "twitterSite":   { "selector": "meta[name=\"twitter:site\"]",       "type": "attribute", "attribute": "content" },
        "themeColor":    { "selector": "meta[name=theme-color]",            "type": "attribute", "attribute": "content" },
        "favicon":       { "selector": "link[rel~=icon]",                   "type": "attribute", "attribute": "href" },
        "appleIcon":     { "selector": "link[rel=\"apple-touch-icon\"]",    "type": "attribute", "attribute": "href" },
        "manifest":      { "selector": "link[rel=manifest]",                "type": "attribute", "attribute": "href" }
      }}
    ]
  }'

responseType: "url" stores each output on the CDN and returns short-lived download URLs in the response. Use "base64" if you'd rather inline binary outputs into the JSON response.

Build a reusable page schema

The schema in Step 1 is the canonical "page + brand signals" snippet. Drop it into a helper so every workflow shares the same source of truth:

// extract.ts
export const pageSchema = {
  title:         { selector: 'title', type: 'text' as const },
  description:   { selector: 'meta[name=description]',            type: 'attribute' as const, attribute: 'content' },
  ogTitle:       { selector: 'meta[property="og:title"]',         type: 'attribute' as const, attribute: 'content' },
  ogDescription: { selector: 'meta[property="og:description"]',   type: 'attribute' as const, attribute: 'content' },
  ogImage:       { selector: 'meta[property="og:image"]',         type: 'attribute' as const, attribute: 'content' },
  ogSiteName:    { selector: 'meta[property="og:site_name"]',     type: 'attribute' as const, attribute: 'content' },
  twitterSite:   { selector: 'meta[name="twitter:site"]',         type: 'attribute' as const, attribute: 'content' },
  themeColor:    { selector: 'meta[name=theme-color]',            type: 'attribute' as const, attribute: 'content' },
  favicon:       { selector: 'link[rel~=icon]',                   type: 'attribute' as const, attribute: 'href' },
  appleIcon:     { selector: 'link[rel="apple-touch-icon"]',      type: 'attribute' as const, attribute: 'href' },
  manifest:      { selector: 'link[rel=manifest]',                type: 'attribute' as const, attribute: 'href' },
};

export async function extractPage(url: string) {
  const res = await fetch('https://api.allscreenshots.com/v1/screenshots', {
    method: 'POST',
    headers: {
      'X-API-Key': process.env.ALLSCREENSHOTS_API_KEY!,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      url,
      blockCookieBanners: true,
      responseType: 'url',
      outputs: [
        { type: 'screenshot', format: 'png' },
        { type: 'markdown', mainContentOnly: true },
        { type: 'json', schema: pageSchema },
      ],
    }),
  });

  if (!res.ok) throw new Error(`Extract failed: ${res.status}`);
  return res.json();
}

Resolve relative URLs (favicon, manifest, appleIcon) against the requested URL on your side — the API returns the attribute as authored on the page.

function absolute(url: string, href: string | null) {
  return href ? new URL(href, url).toString() : null;
}

Feed the screenshot and markdown to an LLM

Deterministic selectors give you names, descriptions, and visual assets. For the parts that don't sit in a tag — industry, tone, what the company actually does — pass the screenshot and the trimmed markdown to an LLM.

Claude (Anthropic SDK):

import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

async function classify(extract: Awaited<ReturnType<typeof extractPage>>) {
  const screenshotUrl = extract.outputs.screenshot.storageUrl;
  const markdownUrl = extract.outputs.markdown.storageUrl;
  const markdown = await fetch(markdownUrl).then((r) => r.text());

  const message = await anthropic.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 512,
    messages: [
      {
        role: 'user',
        content: [
          { type: 'image', source: { type: 'url', url: screenshotUrl } },
          {
            type: 'text',
            text: `Classify this company from its homepage screenshot and main content.\n\nReturn JSON: { industry, oneLineDescription, audience }.\n\nMain content:\n${markdown.slice(0, 8000)}`,
          },
        ],
      },
    ],
  });

  return JSON.parse(message.content[0].type === 'text' ? message.content[0].text : '{}');
}

OpenAI SDK:

import OpenAI from 'openai';

const openai = new OpenAI();

async function classify(extract: Awaited<ReturnType<typeof extractPage>>) {
  const screenshotUrl = extract.outputs.screenshot.storageUrl;
  const markdownUrl = extract.outputs.markdown.storageUrl;
  const markdown = await fetch(markdownUrl).then((r) => r.text());

  const response = await openai.chat.completions.create({
    model: 'gpt-4.1',
    response_format: { type: 'json_object' },
    messages: [
      {
        role: 'user',
        content: [
          { type: 'image_url', image_url: { url: screenshotUrl } },
          {
            type: 'text',
            text: `Classify this company from its homepage screenshot and main content.\n\nReturn JSON: { industry, oneLineDescription, audience }.\n\nMain content:\n${markdown.slice(0, 8000)}`,
          },
        ],
      },
    ],
  });

  return JSON.parse(response.choices[0].message.content ?? '{}');
}

Pass the screenshot and the markdown. The screenshot captures the visual brand and any text rendered after hydration; the markdown gives the LLM enough textual context to classify without you paying for image-only inference on long pages.

Batch with async + webhook (optional)

For ingestion at scale — backfilling a list of prospect URLs, regenerating brand profiles nightly, processing every signup in a queue — switch the same payload to the async endpoint and receive a webhook when the job completes.

curl -X POST 'https://api.allscreenshots.com/v1/screenshots/async' \
  -H 'X-API-Key: YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://acme.com",
    "webhookUrl": "https://your-server.com/webhooks/extract",
    "outputs": [
      { "type": "screenshot", "format": "png" },
      { "type": "markdown", "mainContentOnly": true },
      { "type": "json", "schema": { /* pageSchema */ } }
    ]
  }'

For thousands of URLs at once, see bulk processing.

Schema cookbook

The page schema above is the most general. Tune it to the workflow:

Minimal — title, description, the OG image, and a favicon.

{
  "type": "json",
  "schema": {
    "title":       { "selector": "meta[property=\"og:title\"]",       "type": "attribute", "attribute": "content" },
    "description": { "selector": "meta[property=\"og:description\"]", "type": "attribute", "attribute": "content" },
    "image":       { "selector": "meta[property=\"og:image\"]",       "type": "attribute", "attribute": "content" },
    "favicon":     { "selector": "link[rel~=icon]",                   "type": "attribute", "attribute": "href" },
    "siteName":    { "selector": "meta[property=\"og:site_name\"]",   "type": "attribute", "attribute": "content" }
  }
}

Product page

Listing data for ecommerce scrapes. The outputs.mdx reference has the canonical product extraction example:

{
  "title":       { "selector": "h1", "type": "text" },
  "price":       { "selector": ".price", "type": "text" },
  "images":      { "selector": "img.product", "type": "attribute", "attribute": "src", "multiple": true },
  "description": { "selector": ".description", "type": "html" }
}

Competitor / SEO audit

Pull titles, meta descriptions, headings, and the canonical link in one pass.

{
  "title":         { "selector": "title", "type": "text" },
  "metaDesc":      { "selector": "meta[name=description]", "type": "attribute", "attribute": "content" },
  "canonical":     { "selector": "link[rel=canonical]",    "type": "attribute", "attribute": "href" },
  "h1":            { "selector": "h1", "type": "text", "multiple": true },
  "h2":            { "selector": "h2", "type": "text", "multiple": true },
  "robotsMeta":    { "selector": "meta[name=robots]", "type": "attribute", "attribute": "content" }
}

Gotchas

selector: "h1" with the default multiple: false returns the matched element's textContent — which concatenates the text of every child. Sites that render multiple responsive h1s (display-toggled spans for desktop / tablet / mobile inside one element) will hand back joined text. Use multiple: true and dedupe on your side, or target a more specific selector.

  • Relative URLs come back relative. favicon, manifest, and appleIcon are returned exactly as authored on the page. Resolve them against the request URL with new URL(href, requestedUrl) before storing or displaying.
  • Missing fields are null, not errors. A site without OG tags returns ogImage: null rather than failing the request. Plan your fallbacks (e.g. use the screenshot's storageUrl when ogImage is null).
  • Twitter site handles include the @. Twitter cards historically write twitter:site as @handle. Strip the prefix before linking out to https://twitter.com/....

Next steps

On this page