AI Crawler Fetch Ingest in HitKeep

HitKeep cannot see AI crawler fetches from the browser tracker because most AI crawlers do not run JavaScript. To populate AI Visibility, forward matching edge, proxy, CDN, or origin log records to:

POST /api/sites/{site_id}/ingest/ai-fetch
Authorization: Bearer <hitkeep-api-client-token>
Content-Type: application/json

Use this guide for any platform: nginx, Caddy, Apache, Cloudflare Workers, Fastly Compute, Vercel edge logs, Netlify functions, app-server middleware, CDN log drains, or a small batch job that reads access logs.

What This Endpoint Records

AI fetch ingest stores server-side crawler fetch metadata for one HitKeep site. The dashboard uses those rows to show:

which AI crawlers fetched your pages
which paths and resource types they requested
4xx and 5xx patterns
response-time and byte-size context
correlation with later AI-referred human visits

The endpoint records the time when HitKeep accepts the row. It does not accept a caller-provided historical timestamp. For delayed CDN logs, forward new batches as close to log creation time as practical and keep the source logs as your exact audit trail.

Requirements

You need:

a HitKeep site ID for the site being tracked
an API client token with a site grant for that site
a site role grant that includes site.manage_data, such as site admin or owner
access to logs or middleware that includes request path, HTTP status, and user agent

Site grants are required. An instance/admin API-client role alone does not allow AI fetch ingest.

Guide: API Clients

Payload

Send one JSON object per AI crawler request:

{
  "path": "/guides/analytics/ai-visibility/?utm_source=docs",
  "hostname": "www.example.com",
  "status_code": 200,
  "content_type": "text/html; charset=utf-8",
  "response_ms": 143,
  "bytes_served": 48231,
  "user_agent": "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
}

Field	Required	Notes
`path`	Yes	URL path with optional query string. If your source log has a full URL, strip it to path and query.
`status_code`	Yes	HTTP status from the original crawler request. Must be between `100` and `599`.
`user_agent`	Yes	Original crawler user agent. HitKeep accepts known AI crawler tokens and rejects unknown user agents.
`hostname`	No	Host that served the request. Useful when one forwarder sees several hostnames.
`content_type`	No	Response content type. HitKeep derives `resource_type` from this value.
`response_ms`	No	Positive response time in milliseconds.
`bytes_served`	No	Positive response byte count.

HitKeep derives assistant_name, assistant_family, and resource_type server-side. Do not include those fields in your payload or rely on them as part of the durable request contract.
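
Checking rows before sending them avoids spending requests on payloads the endpoint would reject. The sketch below applies the constraints from the field table above. `validateAiFetchPayload` is a hypothetical helper name, not part of HitKeep, and the requirement that `path` starts with `/` is an assumption drawn from the "strip it to path and query" note. It treats zero as invalid for the optional numeric fields because the table says "positive"; HitKeep's own validation remains authoritative.

```javascript
// Hypothetical client-side pre-check based on the field table above.
// HitKeep's server-side validation is authoritative; this only catches obvious mistakes early.
function validateAiFetchPayload(record) {
  const errors = [];

  // Assumption: a valid path begins with "/" (full URLs must be stripped first).
  if (typeof record.path !== "string" || !record.path.startsWith("/")) {
    errors.push("path must be a URL path starting with /");
  }
  if (!Number.isInteger(record.status_code) || record.status_code < 100 || record.status_code > 599) {
    errors.push("status_code must be an integer between 100 and 599");
  }
  if (typeof record.user_agent !== "string" || record.user_agent.trim() === "") {
    errors.push("user_agent is required");
  }
  // Optional numeric fields: the table describes them as positive when present.
  for (const field of ["response_ms", "bytes_served"]) {
    const value = record[field];
    if (value !== undefined && !(typeof value === "number" && Number.isFinite(value) && value > 0)) {
      errors.push(`${field} must be a positive number when present`);
    }
  }
  return errors;
}
```

An empty array means the row passed the local checks. A forwarder could log and skip any row that returns errors instead of posting it.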

Minimal Curl Test

Use a real site ID and an API client token with a site grant:

curl -i -X POST "https://analytics.example.com/api/sites/YOUR_SITE_ID/ingest/ai-fetch" \
  -H "Authorization: Bearer YOUR_API_CLIENT_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{
    "path": "/docs/",
    "hostname": "www.example.com",
    "status_code": 200,
    "content_type": "text/html",
    "response_ms": 120,
    "bytes_served": 18422,
    "user_agent": "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
  }'

A successful ingest returns 202 Accepted.

If you receive 400 user_agent must match a known AI bot, the row is not an AI crawler row HitKeep currently recognizes. Filter it out or update HitKeep if a new AI crawler needs first-class classification.

Known AI Crawler Families

HitKeep classifies common AI crawler user-agent tokens, including:

Family	Example tokens
OpenAI	`GPTBot`, `ChatGPT-User`
Anthropic	`ClaudeBot`, `Claude-Web`
Perplexity	`PerplexityBot`
Google	`Google-Extended`, `GoogleOther`, `Google-Safety`
Apple	`Applebot-Extended`
Meta	`Meta-ExternalAgent`, `Meta-ExternalFetcher`
Amazon	`Amazonbot`
Common Crawl	`CCBot`
Other supported crawlers	`Bytespider`, `Cohere`, `YouBot`, `AI2Bot`, `Diffbot`, `Timpibot`, `ImagesiftBot`, `DeepSeekBot`, `PetalBot`

The exact classifier lives in the HitKeep runtime. Treat this table as the current public contract for integrations, not as a replacement for checking the ingest response.

Integration Pattern

Every forwarder follows the same shape:

Read one request from an access log, edge event, or middleware hook.
Keep only known AI crawler user agents.
Normalize the request target to path plus query string.
Map status, content type, latency, and bytes into the HitKeep payload.
POST the payload to HitKeep with a site-granted API client token.
Retry transient 5xx or network errors with a bounded retry policy.
Do not retry permanent 4xx validation errors without changing the payload.
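
The last two steps can be sketched as a bounded retry loop with exponential backoff. The helper names (`isRetryableStatus`, `backoffDelayMs`, `postWithRetry`), the attempt count, and the delay values are illustrative assumptions rather than HitKeep recommendations; tune them to your log volume.

```javascript
// Hypothetical retry helpers; the attempt count and delays are illustrative defaults.
function isRetryableStatus(status) {
  // Retry server-side failures only; 4xx responses need a payload or credential fix.
  return status >= 500 && status <= 599;
}

function backoffDelayMs(attempt, baseMs = 500, maxMs = 8000) {
  // attempt is 0-based: 500, 1000, 2000, ... capped at maxMs.
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

async function postWithRetry(url, token, record, maxAttempts = 4) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    let response;
    try {
      response = await fetch(url, {
        method: "POST",
        headers: { Authorization: `Bearer ${token}`, "Content-Type": "application/json" },
        body: JSON.stringify(record),
      });
    } catch (error) {
      lastError = error; // network error: eligible for another attempt
    }

    if (response) {
      if (response.ok) return response.status;
      lastError = new Error(`ingest failed: ${response.status} ${await response.text()}`);
      // Permanent 4xx errors are surfaced immediately instead of being retried.
      if (!isRetryableStatus(response.status)) throw lastError;
    }

    // Wait before the next attempt, but not after the final one.
    if (attempt < maxAttempts - 1) {
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
  throw lastError;
}
```

Bounding the attempts keeps a HitKeep outage from stalling the log reader indefinitely; rows that still fail can be written to a local retry file or dropped, depending on how complete the data needs to be.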

Forwarders should not send raw visitor IP addresses. The AI fetch endpoint does not accept or store them.
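
One way to make sure IP addresses and other unrelated log fields never leave the forwarder is to build the payload from an explicit allowlist. This is a minimal sketch; `allowedPayloadFields` and `pickPayloadFields` are hypothetical names, and the field list mirrors the payload table in this guide.

```javascript
// Fields from the payload table in this guide; anything else (for example a client IP) is dropped.
const allowedPayloadFields = [
  "path",
  "hostname",
  "status_code",
  "content_type",
  "response_ms",
  "bytes_served",
  "user_agent",
];

function pickPayloadFields(row) {
  const payload = {};
  for (const field of allowedPayloadFields) {
    // Skip missing values so optional fields are simply omitted.
    if (row[field] !== undefined && row[field] !== null) {
      payload[field] = row[field];
    }
  }
  return payload;
}
```

An allowlist fails safe: a new column added to the log format later is ignored until someone adds it deliberately.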

Node.js Forwarder Skeleton

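This example shows the platform-neutral part. The skeleton does not define readLogRows(); it stands for your own source: an nginx log parser, CDN log drain, edge function event, or app-server middleware. Pass its rows to forwardRows(), which expects each row to carry url, hostname, statusCode, contentType, responseMs, bytesServed, and userAgent.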

const aiBotTokens = [
  "chatgpt-user",
  "gptbot",
  "claudebot",
  "claude-web",
  "perplexitybot",
  "google-extended",
  "googleother",
  "google-safety",
  "applebot-extended",
  "bytespider",
  "ccbot",
  "meta-externalagent",
  "meta-externalfetcher",
  "amazonbot",
  "cohere-ai",
  "youbot",
  "ai2bot",
  "diffbot",
  "timpibot",
  "imagesiftbot",
  "deepseekbot",
  "petalbot",
];

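// Read configuration defensively: calling .replace() on an unset variable would throw a TypeError.
const hitkeepBaseUrl = (process.env.HITKEEP_BASE_URL || "").replace(/\/+$/, "");
const hitkeepSiteId = process.env.HITKEEP_SITE_ID;
const hitkeepToken = process.env.HITKEEP_API_TOKEN;

if (!hitkeepBaseUrl || !hitkeepSiteId || !hitkeepToken) {
  throw new Error("Set HITKEEP_BASE_URL, HITKEEP_SITE_ID, and HITKEEP_API_TOKEN before starting the forwarder.");
}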

function isAIBot(userAgent) {
  const normalized = (userAgent || "").toLowerCase();
  return aiBotTokens.some((token) => normalized.includes(token));
}

function toPathWithQuery(rawUrl) {
  if (!rawUrl) return "/";
  try {
    const parsed = new URL(rawUrl, "https://placeholder.invalid");
    return `${parsed.pathname}${parsed.search}`;
  } catch {
    return rawUrl.startsWith("/") ? rawUrl : "/";
  }
}

async function postToHitKeep(record) {
  const response = await fetch(`${hitkeepBaseUrl}/api/sites/${hitkeepSiteId}/ingest/ai-fetch`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${hitkeepToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(record),
  });

  if (!response.ok) {
    throw new Error(`HitKeep AI fetch ingest failed: ${response.status} ${await response.text()}`);
  }
}

async function forwardRows(rows) {
  for (const row of rows) {
    if (!isAIBot(row.userAgent)) continue;

    await postToHitKeep({
      path: toPathWithQuery(row.url),
      hostname: row.hostname,
      status_code: row.statusCode,
      content_type: row.contentType,
      response_ms: row.responseMs,
      bytes_served: row.bytesServed,
      user_agent: row.userAgent,
    });
  }
}

Keep the token in an environment variable or secrets manager. Do not embed it in browser JavaScript.

Source-Specific Mapping

Source	Good mapping
nginx access log	`$request_uri`, `$host`, `$status`, `$sent_http_content_type`, `$request_time` (seconds; multiply by 1000 for `response_ms`), `$bytes_sent`, `$http_user_agent`
Caddy access log	`request.uri`, `request.host`, `status`, response content type header, `duration` (seconds; multiply by 1000 for `response_ms`), `size`, `request.headers.User-Agent`
Apache access log	request path and query (`%U%q`), `%>s`, `%b`, `%{User-agent}i`, plus content type (`%{Content-Type}o`) and duration (`%D`, in microseconds; divide by 1000 for `response_ms`) if included in your log format
CDN logs	URI stem/query, host header, status, content type, edge/origin duration, response bytes, user agent
App middleware	request URL, host, response status, response content type, measured duration, response bytes if available, user agent

You do not need a perfect first version. path, status_code, and user_agent are enough to start seeing fetch volume and error patterns. Add content type, latency, and byte counts when your log source provides them reliably.
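
As an illustration of the nginx row, the sketch below maps one JSON-formatted access log line to a HitKeep payload. It assumes a custom nginx `log_format` written with `escape=json`, and the key names (`request_uri`, `host`, `status`, `content_type`, `request_time`, `bytes_sent`, `user_agent`) are hypothetical choices for that format; adjust them to whatever your configuration emits. `nginxJsonLineToPayload` is likewise an illustrative name.

```javascript
// Assumes each log line is a JSON object written by a custom nginx log_format
// (escape=json) with these hypothetical keys: request_uri, host, status,
// content_type, request_time, bytes_sent, user_agent.
function nginxJsonLineToPayload(line) {
  const entry = JSON.parse(line);
  const payload = {
    path: entry.request_uri,
    status_code: Number(entry.status),
    user_agent: entry.user_agent,
  };
  if (entry.host) payload.hostname = entry.host;
  if (entry.content_type) payload.content_type = entry.content_type;

  // nginx reports $request_time in seconds with millisecond resolution.
  const responseMs = Math.round(Number(entry.request_time) * 1000);
  if (responseMs > 0) payload.response_ms = responseMs;

  // Omit the byte count when it is missing or zero, since the field must be positive.
  const bytes = Number(entry.bytes_sent);
  if (bytes > 0) payload.bytes_served = bytes;

  return payload;
}
```

Optional fields are left out when the log has no usable value, so the row still carries the three required fields.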

Verify Data In HitKeep

Open AI Visibility for the site.
Select a date range that covers the period when the forwarder was running.
Check total fetches, top assistants, top paths, and error paths.
Use the assistant and resource-type filters to confirm classification.
Open the correlation section after normal AI-referred visits arrive through hk.js.

For direct API checks:

curl "https://analytics.example.com/api/sites/YOUR_SITE_ID/ai-fetch/overview" \
  -H "Authorization: Bearer YOUR_API_CLIENT_TOKEN"

Troubleshooting

Symptom	Likely cause
`401 Unauthorized`	Missing or invalid bearer token.
`403 Forbidden`	Token has no site grant, or the grant does not include `site.manage_data`.
`404 Site not found`	The site ID is wrong or belongs to a site the token cannot access.
`400 user_agent must match a known AI bot`	The forwarder sent a non-AI crawler or an unsupported AI crawler token.
Rows arrive, but correlation is empty	AI crawler fetches exist, but matching AI-referred human visits have not arrived through the browser tracker for the same paths and window.
Timestamps look delayed	The endpoint records HitKeep ingest time. Forward CDN or batch logs promptly.
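
A forwarder can turn the status codes in the table above into explicit actions so that failures are logged consistently. The function name and the action labels below are hypothetical; only the status-to-cause pairing comes from this guide.

```javascript
// Hypothetical mapping from HTTP status to a forwarder action, following the troubleshooting table.
function actionForIngestStatus(status) {
  if (status === 202) return "accepted";
  if (status === 400) return "drop-row"; // validation failed; resending unchanged will not help
  if (status === 401 || status === 403) return "fix-credentials"; // token or site grant problem
  if (status === 404) return "fix-site-id";
  if (status >= 500 && status <= 599) return "retry"; // transient server-side failure
  return "investigate";
}
```

Credential and site ID problems affect every row, so a forwarder that sees them can stop early instead of posting the rest of the batch.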