AI Fetch on AWS
If your site is served through AWS CloudFront, S3, or a similar AWS edge path, HitKeep cannot observe AI crawler fetches by itself. You need to forward those server-side fetches to this endpoint:
POST /api/sites/{id}/ingest/ai-fetch
This guide shows the simplest production setup:
- CloudFront standard logs land in S3
- An AWS Lambda function parses each log object
- AI crawler requests are forwarded to HitKeep with a scoped API token
This is the setup we use for static-site deployments such as hitkeep.com.
What You Need
- A site already created in HitKeep for the domain you want to track
- A team API client or personal API client with access to that site
- A CloudFront distribution serving the site
- An S3 bucket for CloudFront standard logs
- A Lambda function in the same AWS account
Create The HitKeep Token
Create a site-scoped API token in HitKeep first:
- Open Settings → API Clients for a personal token, or Administration → Team → Settings for a team token.
- Create a token that can access the target site.
- Copy the token immediately.
Guide: API Clients
You will need:
- `HITKEEP_BASE_URL` (example: `https://cloud.hitkeep.eu`)
- `HITKEEP_SITE_ID` (example: `6d5f9e7b-...`)
- `HITKEEP_API_TOKEN`
- `HITKEEP_HOSTNAME` (example: `hitkeep.com`)
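Before wiring up AWS, it can be worth smoke-testing the token and endpoint directly. The sketch below (Node.js 18+; the site ID and token values are placeholders) builds the same ingest request the forwarder will send, so you can fire one synthetic record with `fetch(req.url, req.options)`:

```javascript
// Illustrative sketch: build the HitKeep AI fetch ingest request.
// The siteId and token arguments are placeholders for your own values.
function buildIngestRequest(baseUrl, siteId, token, record) {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/api/sites/${siteId}/ingest/ai-fetch`,
    options: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(record),
    },
  };
}

const req = buildIngestRequest(
  "https://cloud.hitkeep.eu",
  "YOUR_SITE_UUID",
  "YOUR_BEARER_TOKEN",
  {
    path: "/",
    hostname: "hitkeep.com",
    status_code: 200,
    user_agent: "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
  },
);
console.log(req.url);
```

A `201` or `2xx` response to the posted record confirms the token and site ID before you touch CloudFront or Lambda.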
Enable CloudFront Standard Logs
In your CloudFront distribution:
- Open Monitoring and logging
- Enable Standard logging
- Choose an S3 bucket and optional prefix
- Keep the default tab-delimited log format
HitKeep only needs these fields, which are already present in standard logs:
- `cs(User-Agent)`
- `cs-uri-stem`
- `cs-uri-query`
- `x-host-header`
- `sc-status`
- `sc-bytes`
- `time-taken`
- `sc-content-type`
CloudFront logs are batched, so AI fetches usually appear in HitKeep with a short delay instead of instantly.
Deploy The Lambda Forwarder
An example Lambda forwarder might look like this:
```js
import { gunzipSync } from "node:zlib";
import { buffer } from "node:stream/consumers";
import { GetObjectCommand, S3Client } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Lowercased substrings identifying known AI crawler user agents
// (for example; extend as needed).
const botMatchers = [
  "chatgpt-user",
  "gptbot",
  "claudebot",
  "claude-web",
  "perplexitybot",
  "google-extended",
  "googleother",
  "google-safety",
  "applebot-extended",
  "bytespider",
  "ccbot",
  "meta-externalagent",
  "meta-externalfetcher",
  "amazonbot",
  "cohere-ai",
  "youbot",
  "ai2bot",
  "diffbot",
  "timpibot",
  "imagesiftbot",
  "deepseekbot",
  "petalbot",
];

function getRequiredEnv(name) {
  const value = process.env[name]?.trim();
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

const hitkeepBaseUrl = getRequiredEnv("HITKEEP_BASE_URL").replace(/\/+$/, "");
const hitkeepSiteId = getRequiredEnv("HITKEEP_SITE_ID");
const hitkeepApiToken = getRequiredEnv("HITKEEP_API_TOKEN");
const defaultHostname = process.env.HITKEEP_HOSTNAME?.trim().toLowerCase() || "";
const requestTimeoutMs = Number.parseInt(process.env.HITKEEP_TIMEOUT_MS ?? "10000", 10);

function safeDecode(value) {
  if (!value || value === "-") {
    return "";
  }
  try {
    return decodeURIComponent(value);
  } catch {
    return value;
  }
}

function parseInteger(value) {
  if (!value || value === "-") {
    return undefined;
  }
  const parsed = Number.parseInt(value, 10);
  return Number.isFinite(parsed) ? parsed : undefined;
}

function parseDurationMs(value) {
  if (!value || value === "-") {
    return undefined;
  }
  const parsed = Number.parseFloat(value);
  if (!Number.isFinite(parsed) || parsed < 0) {
    return undefined;
  }
  // CloudFront reports time-taken in seconds; HitKeep expects milliseconds.
  return Math.round(parsed * 1000);
}

function looksLikeAIBot(userAgent) {
  const normalized = userAgent.trim().toLowerCase();
  return botMatchers.some((token) => normalized.includes(token));
}

function normalizePath(stem, query) {
  const base = stem && stem !== "-" ? stem : "/";
  if (!query || query === "-") {
    return base;
  }
  return `${base}?${query}`;
}

function parseCloudFrontLog(content) {
  const lines = content.split("\n").map((line) => line.trim()).filter(Boolean);
  const fieldsLine = lines.find((line) => line.startsWith("#Fields:"));
  if (!fieldsLine) {
    throw new Error("CloudFront log is missing #Fields header");
  }

  const fieldNames = fieldsLine.replace(/^#Fields:\s*/, "").split(/\s+/);
  const records = [];

  for (const line of lines) {
    if (line.startsWith("#")) {
      continue;
    }

    const values = line.split("\t");
    if (values.length !== fieldNames.length) {
      continue;
    }

    const row = Object.fromEntries(fieldNames.map((field, index) => [field, values[index]]));
    const userAgent = safeDecode(row["cs(User-Agent)"]);
    if (!looksLikeAIBot(userAgent)) {
      continue;
    }

    const path = normalizePath(
      safeDecode(row["cs-uri-stem"]),
      safeDecode(row["cs-uri-query"]),
    );
    const statusCode = parseInteger(row["sc-status"]);
    if (!statusCode) {
      continue;
    }

    const hostname =
      safeDecode(row["x-host-header"]) || safeDecode(row["cs(Host)"]) || defaultHostname;
    const normalizedHostname = hostname ? hostname.toLowerCase() : "";

    if (defaultHostname && normalizedHostname && normalizedHostname !== defaultHostname) {
      continue;
    }

    records.push({
      path,
      hostname: normalizedHostname || undefined,
      status_code: statusCode,
      content_type: safeDecode(row["sc-content-type"]) || undefined,
      response_ms: parseDurationMs(row["time-taken"]),
      bytes_served: parseInteger(row["sc-bytes"]),
      user_agent: userAgent,
    });
  }

  return records;
}

async function loadLogObject(bucket, key) {
  const response = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
  const body = await buffer(response.Body);
  const raw = key.endsWith(".gz") ? gunzipSync(body) : body;
  return raw.toString("utf8");
}

async function postFetchRecord(record) {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), requestTimeoutMs);

  try {
    const response = await fetch(
      `${hitkeepBaseUrl}/api/sites/${hitkeepSiteId}/ingest/ai-fetch`,
      {
        method: "POST",
        headers: {
          Authorization: `Bearer ${hitkeepApiToken}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify(record),
        signal: controller.signal,
      },
    );

    if (!response.ok) {
      const message = await response.text();
      throw new Error(`HitKeep ingest failed with ${response.status}: ${message}`);
    }
  } finally {
    clearTimeout(timeout);
  }
}

export const handler = async (event) => {
  let processedFiles = 0;
  let forwardedRecords = 0;

  for (const s3Record of event.Records ?? []) {
    const bucket = s3Record.s3?.bucket?.name;
    const key = s3Record.s3?.object?.key
      ? decodeURIComponent(s3Record.s3.object.key.replace(/\+/g, " "))
      : "";

    if (!bucket || !key) {
      continue;
    }

    const content = await loadLogObject(bucket, key);
    const records = parseCloudFrontLog(content);

    for (const record of records) {
      await postFetchRecord(record);
      forwardedRecords += 1;
    }

    processedFiles += 1;
  }

  return {
    processedFiles,
    forwardedRecords,
  };
};
```

It:
- downloads each CloudFront log object from S3
- parses the `#Fields` header dynamically
- filters for known AI crawler user agents such as `GPTBot`, `ClaudeBot`, `PerplexityBot`, `Google-Extended`, and `Amazonbot`
- forwards matching requests to the HitKeep AI fetch ingest endpoint
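To see the filtering in action locally, you can exercise the same `#Fields`-driven parsing idea against a synthetic log excerpt. The log lines below are fabricated for illustration; real CloudFront logs carry many more fields:

```javascript
// Self-contained sketch of the #Fields-driven parsing the forwarder uses.
// The sample content is fabricated and deliberately tiny.
const sample = [
  "#Version: 1.0",
  "#Fields: cs-uri-stem sc-status cs(User-Agent)",
  "/pricing/\t200\tMozilla/5.0%20(compatible;%20GPTBot/1.0)",
  "/about/\t200\tMozilla/5.0%20(regular%20browser)",
].join("\n");

const lines = sample.split("\n").filter(Boolean);

// Field names come from the #Fields header, not a hard-coded column order.
const fieldNames = lines
  .find((l) => l.startsWith("#Fields:"))
  .replace(/^#Fields:\s*/, "")
  .split(/\s+/);

// Keep only data rows whose decoded user agent matches an AI crawler token.
const hits = lines
  .filter((l) => !l.startsWith("#"))
  .map((l) => Object.fromEntries(fieldNames.map((f, i) => [f, l.split("\t")[i]])))
  .filter((row) =>
    decodeURIComponent(row["cs(User-Agent)"]).toLowerCase().includes("gptbot"),
  );

console.log(hits.map((row) => row["cs-uri-stem"]));
// → [ '/pricing/' ]
```

The regular-browser row is dropped and only the `GPTBot` request survives, which is exactly what the Lambda forwards.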
Lambda Runtime
Use:
- Node.js 22.x or later
Lambda Environment Variables
Set these on the function:
```
HITKEEP_BASE_URL=https://cloud.hitkeep.eu
HITKEEP_SITE_ID=YOUR_SITE_UUID
HITKEEP_API_TOKEN=YOUR_BEARER_TOKEN
HITKEEP_HOSTNAME=hitkeep.com
HITKEEP_TIMEOUT_MS=10000
```

IAM Permissions
The function needs:
- `s3:GetObject` on the CloudFront log bucket or prefix
- standard CloudWatch Logs write permissions
Minimal inline policy example:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::YOUR_LOG_BUCKET/YOUR_PREFIX/*"]
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
```

Package And Upload
If you only use the single-file example, you can zip it and create the Lambda function directly:
```sh
aws lambda create-function \
  --function-name hitkeep-ai-fetch-forwarder \
  --runtime nodejs22.x \
  --handler index.handler \
  --zip-file fileb://ai-fetch-cloudfront-forwarder.zip \
  --role arn:aws:iam::123456789012:role/hitkeep-ai-fetch-forwarder \
  --timeout 60 \
  --memory-size 256
```

Or update an existing function:

```sh
aws lambda update-function-code \
  --function-name hitkeep-ai-fetch-forwarder \
  --zip-file fileb://ai-fetch-cloudfront-forwarder.zip
```

Set the environment:

```sh
aws lambda update-function-configuration \
  --function-name hitkeep-ai-fetch-forwarder \
  --environment "Variables={HITKEEP_BASE_URL=https://cloud.hitkeep.eu,HITKEEP_SITE_ID=YOUR_SITE_UUID,HITKEEP_API_TOKEN=YOUR_BEARER_TOKEN,HITKEEP_HOSTNAME=hitkeep.com,HITKEEP_TIMEOUT_MS=10000}"
```

Connect S3 Event Notifications
Configure the CloudFront log bucket to invoke the Lambda on object creation:
- Event type: `s3:ObjectCreated:*`
- Prefix: your log prefix, if any
- Suffix: `.gz`
That keeps the function focused on completed log files instead of unrelated bucket objects.
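If you prefer to script this instead of clicking through the console, the notification configuration can be applied with the AWS SDK's `PutBucketNotificationConfigurationCommand` (or `aws s3api put-bucket-notification-configuration`). The sketch below shows the configuration shape; the Lambda ARN and prefix are placeholders, and S3 additionally needs `lambda add-permission` to be allowed to invoke the function:

```javascript
// Illustrative sketch: the S3 notification configuration that invokes
// the forwarder for newly created .gz objects under the log prefix.
// The ARN and prefix values passed in below are placeholders.
function buildNotificationConfiguration(lambdaArn, prefix) {
  return {
    LambdaFunctionConfigurations: [
      {
        LambdaFunctionArn: lambdaArn,
        Events: ["s3:ObjectCreated:*"],
        Filter: {
          Key: {
            FilterRules: [
              { Name: "prefix", Value: prefix },
              { Name: "suffix", Value: ".gz" },
            ],
          },
        },
      },
    ],
  };
}

const config = buildNotificationConfiguration(
  "arn:aws:lambda:eu-central-1:123456789012:function:hitkeep-ai-fetch-forwarder",
  "cloudfront-logs/",
);
```

Passing `config` as the `NotificationConfiguration` of the put call reproduces the console settings above.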
What Gets Sent To HitKeep
Each matching AI crawler request is transformed into this payload shape:
```json
{
  "path": "/guides/analytics/ai-visibility/",
  "hostname": "hitkeep.com",
  "status_code": 200,
  "content_type": "text/html",
  "response_ms": 143,
  "bytes_served": 48231,
  "user_agent": "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
}
```

HitKeep automatically derives `assistant_name`, `assistant_family`, and `resource_type` from the user agent and content type.
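That derivation happens server-side inside HitKeep, so the forwarder never needs to classify anything. Purely as an illustration of the idea, a user-agent-to-family mapping might look like this; the table entries are assumptions for the example, not HitKeep's actual logic:

```javascript
// Illustrative only — NOT HitKeep's actual classification logic.
// Maps a lowercased user-agent substring to an assistant family.
const assistantFamilies = [
  { token: "gptbot", family: "OpenAI" },
  { token: "chatgpt-user", family: "OpenAI" },
  { token: "claudebot", family: "Anthropic" },
  { token: "perplexitybot", family: "Perplexity" },
];

function guessAssistantFamily(userAgent) {
  const normalized = userAgent.toLowerCase();
  const match = assistantFamilies.find((entry) => normalized.includes(entry.token));
  return match ? match.family : "unknown";
}

console.log(guessAssistantFamily("Mozilla/5.0 (compatible; GPTBot/1.0)"));
// → OpenAI
```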
Verify The Setup
Use this checklist:
- Confirm CloudFront is writing `.gz` log files into S3.
- Confirm the Lambda is triggered for new objects.
- Check CloudWatch Logs for `forwardedRecords`.
- Open AI Visibility in HitKeep and wait for the first crawler rows to arrive.
If the function runs but no rows appear:
- verify the token can access the site
- verify `HITKEEP_SITE_ID` is the correct site UUID
- verify the host is the tracked host
- confirm the requests are real AI bot user agents and not generic crawlers
Notes And Limits
- CloudFront standard logs are delayed and batch-oriented. This is normal.
- The setup is best for static sites and edge-served origins where you cannot easily instrument an app server directly.
- If you run behind ALB, nginx, or Caddy on the origin, origin-side logging is usually even better because it can forward records immediately.
- Lambda and S3 event delivery are at-least-once systems. In rare retry scenarios, duplicate AI fetch rows are possible.
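If duplicates matter for your reporting, one mitigation (a sketch, not a built-in HitKeep feature) is to make the forwarder idempotent per log object: derive a deterministic key for each S3 object and skip ones you have already seen. The `shouldProcess` helper below keeps the seen-set in memory for brevity; a production variant would record the key durably, for example with a DynamoDB conditional put:

```javascript
// Illustrative sketch: deterministic idempotency key per S3 log object.
function processedKey(bucket, key) {
  return `${bucket}/${key}`;
}

// In-memory only — survives for a single warm Lambda container.
// Durable dedupe needs external storage (e.g. DynamoDB conditional put).
const seen = new Set();

function shouldProcess(bucket, key) {
  const id = processedKey(bucket, key);
  if (seen.has(id)) {
    return false; // duplicate delivery, skip the object
  }
  seen.add(id);
  return true;
}

console.log(shouldProcess("logs-bucket", "prefix/file.gz")); // → true
console.log(shouldProcess("logs-bucket", "prefix/file.gz")); // → false
```

Calling `shouldProcess` at the top of the handler's per-object loop turns at-least-once delivery into effectively-once forwarding within the dedupe store's lifetime.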