
AI Fetch on AWS

If your site is served through AWS CloudFront, S3, or a similar AWS edge path, HitKeep cannot observe AI crawler fetches by itself. You need to forward those server-side fetches to this endpoint:

POST /api/sites/{id}/ingest/ai-fetch
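
Before deploying anything, you can sanity-check your token, site ID, and base URL with a one-off request. The sketch below uses placeholder values and a minimal record matching the payload shape shown later in this guide; the actual send is left commented out:

```javascript
// Placeholder configuration: substitute your real values.
const baseUrl = "https://cloud.hitkeep.eu";
const siteId = "YOUR_SITE_UUID";
const token = "YOUR_BEARER_TOKEN";

const url = `${baseUrl}/api/sites/${siteId}/ingest/ai-fetch`;
const init = {
  method: "POST",
  headers: {
    Authorization: `Bearer ${token}`,
    "Content-Type": "application/json",
  },
  // A minimal test record; the full payload shape is documented later in this guide.
  body: JSON.stringify({
    path: "/",
    hostname: "hitkeep.com",
    status_code: 200,
    user_agent: "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
  }),
};

// Uncomment to send (Node 18+ ships a global fetch):
// const response = await fetch(url, init);
// console.log(response.status);
```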

This guide shows the simplest production setup:

  1. CloudFront standard logs land in S3
  2. An AWS Lambda function parses each log object
  3. AI crawler requests are forwarded to HitKeep with a scoped API token

This is the setup we use for static-site deployments such as hitkeep.com.

Prerequisites:

  • A site already created in HitKeep for the domain you want to track
  • A team API client or personal API client with access to that site
  • A CloudFront distribution serving the site
  • An S3 bucket for CloudFront standard logs
  • A Lambda function in the same AWS account

Create a site-scoped API token in HitKeep first:

  1. Open Settings → API Clients for a personal token, or Administration → Team → Settings for a team token.
  2. Create a token that can access the target site.
  3. Copy the token immediately.

Guide: API Clients

You will need:

  • HITKEEP_BASE_URL (example: https://cloud.hitkeep.eu)
  • HITKEEP_SITE_ID (example: 6d5f9e7b-...)
  • HITKEEP_API_TOKEN
  • HITKEEP_HOSTNAME (example: hitkeep.com)

In your CloudFront distribution:

  1. Open Monitoring and logging
  2. Enable Standard logging
  3. Choose an S3 bucket and optional prefix
  4. Keep the default tab-delimited log format

HitKeep only needs these fields, which are already present in standard logs:

  • cs(User-Agent)
  • cs-uri-stem
  • cs-uri-query
  • x-host-header
  • sc-status
  • sc-bytes
  • time-taken
  • sc-content-type

CloudFront logs are batched, so AI fetches usually appear in HitKeep with a short delay instead of instantly.
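
To make the mapping concrete, here is a sketch of how a tab-delimited record lines up with a #Fields header. The header below is abbreviated and the record is fabricated for illustration; real CloudFront standard logs contain many more columns in a fixed order:

```javascript
// Illustrative only: an abbreviated #Fields header and one fabricated record.
const header =
  "#Fields: cs-uri-stem cs-uri-query sc-status sc-bytes time-taken x-host-header cs(User-Agent) sc-content-type";
const line = [
  "/guides/analytics/ai-visibility/",
  "-",
  "200",
  "48231",
  "0.143",
  "hitkeep.com",
  "Mozilla/5.0%20(compatible;%20GPTBot/1.0;%20+https://openai.com/gptbot)",
  "text/html",
].join("\t");

// Field names come from the header; values are tab-separated.
const fields = header.replace(/^#Fields:\s*/, "").split(/\s+/);
const values = line.split("\t");
const row = Object.fromEntries(fields.map((field, i) => [field, values[i]]));

// CloudFront URL-encodes fields such as the user agent; decode before matching.
const userAgent = decodeURIComponent(row["cs(User-Agent)"]);
```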

An example Lambda forwarder might look like this:

import { gunzipSync } from "node:zlib";
import { buffer } from "node:stream/consumers";
import { GetObjectCommand, S3Client } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Lowercase substrings that identify known AI crawler user agents.
// Extend this list as new crawlers appear.
const botMatchers = [
  "chatgpt-user",
  "gptbot",
  "claudebot",
  "claude-web",
  "perplexitybot",
  "google-extended",
  "googleother",
  "google-safety",
  "applebot-extended",
  "bytespider",
  "ccbot",
  "meta-externalagent",
  "meta-externalfetcher",
  "amazonbot",
  "cohere-ai",
  "youbot",
  "ai2bot",
  "diffbot",
  "timpibot",
  "imagesiftbot",
  "deepseekbot",
  "petalbot",
];

function getRequiredEnv(name) {
  const value = process.env[name]?.trim();
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

const hitkeepBaseUrl = getRequiredEnv("HITKEEP_BASE_URL").replace(/\/+$/, "");
const hitkeepSiteId = getRequiredEnv("HITKEEP_SITE_ID");
const hitkeepApiToken = getRequiredEnv("HITKEEP_API_TOKEN");
const defaultHostname = process.env.HITKEEP_HOSTNAME?.trim().toLowerCase() || "";
const requestTimeoutMs = Number.parseInt(process.env.HITKEEP_TIMEOUT_MS ?? "10000", 10);

// CloudFront URL-encodes log fields; decode defensively and treat "-" as empty.
function safeDecode(value) {
  if (!value || value === "-") {
    return "";
  }
  try {
    return decodeURIComponent(value);
  } catch {
    return value;
  }
}

function parseInteger(value) {
  if (!value || value === "-") {
    return undefined;
  }
  const parsed = Number.parseInt(value, 10);
  return Number.isFinite(parsed) ? parsed : undefined;
}

// time-taken is reported in seconds; convert it to whole milliseconds.
function parseDurationMs(value) {
  if (!value || value === "-") {
    return undefined;
  }
  const parsed = Number.parseFloat(value);
  if (!Number.isFinite(parsed) || parsed < 0) {
    return undefined;
  }
  return Math.round(parsed * 1000);
}

function looksLikeAIBot(userAgent) {
  const normalized = userAgent.trim().toLowerCase();
  return botMatchers.some((token) => normalized.includes(token));
}

function normalizePath(stem, query) {
  const base = stem && stem !== "-" ? stem : "/";
  if (!query || query === "-") {
    return base;
  }
  return `${base}?${query}`;
}

// Parse a CloudFront standard log file using its #Fields header and keep
// only requests from known AI crawlers on the tracked hostname.
function parseCloudFrontLog(content) {
  const lines = content.split("\n").map((line) => line.trim()).filter(Boolean);
  const fieldsLine = lines.find((line) => line.startsWith("#Fields:"));
  if (!fieldsLine) {
    throw new Error("CloudFront log is missing #Fields header");
  }
  const fieldNames = fieldsLine.replace(/^#Fields:\s*/, "").split(/\s+/);
  const records = [];
  for (const line of lines) {
    if (line.startsWith("#")) {
      continue;
    }
    const values = line.split("\t");
    if (values.length !== fieldNames.length) {
      continue;
    }
    const row = Object.fromEntries(fieldNames.map((field, index) => [field, values[index]]));
    const userAgent = safeDecode(row["cs(User-Agent)"]);
    if (!looksLikeAIBot(userAgent)) {
      continue;
    }
    const path = normalizePath(
      safeDecode(row["cs-uri-stem"]),
      safeDecode(row["cs-uri-query"]),
    );
    const statusCode = parseInteger(row["sc-status"]);
    if (!statusCode) {
      continue;
    }
    const hostname =
      safeDecode(row["x-host-header"]) ||
      safeDecode(row["cs(Host)"]) ||
      defaultHostname;
    const normalizedHostname = hostname ? hostname.toLowerCase() : "";
    // Skip requests for hostnames other than the tracked one.
    if (defaultHostname && normalizedHostname && normalizedHostname !== defaultHostname) {
      continue;
    }
    records.push({
      path,
      hostname: normalizedHostname || undefined,
      status_code: statusCode,
      content_type: safeDecode(row["sc-content-type"]) || undefined,
      response_ms: parseDurationMs(row["time-taken"]),
      bytes_served: parseInteger(row["sc-bytes"]),
      user_agent: userAgent,
    });
  }
  return records;
}

// Download a log object from S3, gunzipping .gz files.
async function loadLogObject(bucket, key) {
  const response = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
  const body = await buffer(response.Body);
  const raw = key.endsWith(".gz") ? gunzipSync(body) : body;
  return raw.toString("utf8");
}

// POST one record to the HitKeep AI fetch ingest endpoint, with a timeout.
async function postFetchRecord(record) {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), requestTimeoutMs);
  try {
    const response = await fetch(
      `${hitkeepBaseUrl}/api/sites/${hitkeepSiteId}/ingest/ai-fetch`,
      {
        method: "POST",
        headers: {
          Authorization: `Bearer ${hitkeepApiToken}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify(record),
        signal: controller.signal,
      },
    );
    if (!response.ok) {
      const message = await response.text();
      throw new Error(`HitKeep ingest failed with ${response.status}: ${message}`);
    }
  } finally {
    clearTimeout(timeout);
  }
}

export const handler = async (event) => {
  let processedFiles = 0;
  let forwardedRecords = 0;
  for (const s3Record of event.Records ?? []) {
    const bucket = s3Record.s3?.bucket?.name;
    // S3 event keys are URL-encoded, with "+" standing in for spaces.
    const key = s3Record.s3?.object?.key
      ? decodeURIComponent(s3Record.s3.object.key.replace(/\+/g, " "))
      : "";
    if (!bucket || !key) {
      continue;
    }
    const content = await loadLogObject(bucket, key);
    const records = parseCloudFrontLog(content);
    for (const record of records) {
      await postFetchRecord(record);
      forwardedRecords += 1;
    }
    processedFiles += 1;
  }
  return {
    processedFiles,
    forwardedRecords,
  };
};

The function:

  • downloads each CloudFront log object from S3
  • parses the #Fields header dynamically
  • filters for known AI crawler user agents such as GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and Amazonbot
  • forwards matching requests to HitKeep AI fetch ingest
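
The user-agent filter is plain case-insensitive substring matching against that list, so a standalone check behaves like this (a sketch mirroring looksLikeAIBot, with a reduced matcher list for illustration):

```javascript
// Mirrors the looksLikeAIBot check from the Lambda above (subset of matchers).
const matchers = ["gptbot", "claudebot", "perplexitybot"];
const isAIBot = (userAgent) => {
  const normalized = userAgent.trim().toLowerCase();
  return matchers.some((token) => normalized.includes(token));
};

isAIBot("Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"); // true
isAIBot("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0"); // false
```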

Runtime:

  • Node.js 22.x or later (the example uses ES modules and the global fetch API)

Set these on the function:

HITKEEP_BASE_URL=https://cloud.hitkeep.eu
HITKEEP_SITE_ID=YOUR_SITE_UUID
HITKEEP_API_TOKEN=YOUR_BEARER_TOKEN
HITKEEP_HOSTNAME=hitkeep.com
HITKEEP_TIMEOUT_MS=10000

The function needs:

  • s3:GetObject on the CloudFront log bucket or prefix
  • standard CloudWatch Logs write permissions

Minimal inline policy example:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::YOUR_LOG_BUCKET/YOUR_PREFIX/*"]
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}

If you only use the single-file example, you can zip it and create the Lambda function directly:

aws lambda create-function \
--function-name hitkeep-ai-fetch-forwarder \
--runtime nodejs22.x \
--handler index.handler \
--zip-file fileb://ai-fetch-cloudfront-forwarder.zip \
--role arn:aws:iam::123456789012:role/hitkeep-ai-fetch-forwarder \
--timeout 60 \
--memory-size 256

Or update an existing function:

aws lambda update-function-code \
--function-name hitkeep-ai-fetch-forwarder \
--zip-file fileb://ai-fetch-cloudfront-forwarder.zip

Set the environment:

aws lambda update-function-configuration \
--function-name hitkeep-ai-fetch-forwarder \
--environment "Variables={HITKEEP_BASE_URL=https://cloud.hitkeep.eu,HITKEEP_SITE_ID=YOUR_SITE_UUID,HITKEEP_API_TOKEN=YOUR_BEARER_TOKEN,HITKEEP_HOSTNAME=hitkeep.com,HITKEEP_TIMEOUT_MS=10000}"

Configure the CloudFront log bucket to invoke the Lambda on object creation:

  • Event type: s3:ObjectCreated:*
  • Prefix: your log prefix, if any
  • Suffix: .gz

That keeps the function focused on completed log files instead of unrelated bucket objects.
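
If you prefer to configure the trigger programmatically, the same settings can be expressed as an S3 notification configuration. A sketch under assumptions: the bucket name, prefix, region, and function ARN below are placeholders, you would apply the object with PutBucketNotificationConfigurationCommand from @aws-sdk/client-s3, and S3 must also be allowed to invoke the function (for example via aws lambda add-permission):

```javascript
// Placeholder bucket, prefix, and function ARN: substitute your real values.
// Apply with PutBucketNotificationConfigurationCommand from @aws-sdk/client-s3:
//   const s3 = new S3Client({});
//   await s3.send(new PutBucketNotificationConfigurationCommand(notificationConfig));
const notificationConfig = {
  Bucket: "YOUR_LOG_BUCKET",
  NotificationConfiguration: {
    LambdaFunctionConfigurations: [
      {
        LambdaFunctionArn:
          "arn:aws:lambda:eu-west-1:123456789012:function:hitkeep-ai-fetch-forwarder",
        Events: ["s3:ObjectCreated:*"],
        Filter: {
          Key: {
            FilterRules: [
              { Name: "prefix", Value: "YOUR_PREFIX/" },
              { Name: "suffix", Value: ".gz" },
            ],
          },
        },
      },
    ],
  },
};
```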

Each matching AI crawler request is transformed into this payload shape:

{
  "path": "/guides/analytics/ai-visibility/",
  "hostname": "hitkeep.com",
  "status_code": 200,
  "content_type": "text/html",
  "response_ms": 143,
  "bytes_served": 48231,
  "user_agent": "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
}

From the user agent and content type, HitKeep automatically derives:

  • assistant_name
  • assistant_family
  • resource_type

To verify the pipeline end to end, use this checklist:

  1. Confirm CloudFront is writing .gz log files into S3.
  2. Confirm the Lambda is triggered for new objects.
  3. Check CloudWatch Logs for forwardedRecords.
  4. Open AI Visibility in HitKeep and wait for the first crawler rows to arrive.

If the function runs but no rows appear:

  • verify the token can access the site
  • verify HITKEEP_SITE_ID is the correct site UUID
  • verify the host is the tracked host
  • confirm the requests are real AI bot user agents and not generic crawlers

A few things to keep in mind:

  • CloudFront standard logs are delayed and batch-oriented. This is normal.
  • This setup works best for static sites and edge-served origins where you cannot easily instrument an app server directly.
  • If you run behind ALB, nginx, or Caddy on the origin, origin-side logging is usually better because it can forward records immediately.
  • Lambda and S3 event delivery are at-least-once systems, so duplicate AI fetch rows are possible in rare retry scenarios.