
AI Fetch on AWS

If your site is served through AWS CloudFront, S3, or a similar AWS edge path, HitKeep cannot observe AI crawler fetches by itself. You need to forward those server-side fetches to this endpoint:

POST /api/sites/{id}/ingest/ai-fetch
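
Before deploying anything, you can sanity-check your token, site ID, and base URL with a one-off request. The sketch below uses placeholder values and a minimal record matching the payload shape shown later in this guide; the actual send is left commented out:

```javascript
// Placeholder configuration: substitute your real values.
const baseUrl = "https://cloud.hitkeep.eu";
const siteId = "YOUR_SITE_UUID";
const token = "YOUR_BEARER_TOKEN";

const url = `${baseUrl}/api/sites/${siteId}/ingest/ai-fetch`;
const init = {
  method: "POST",
  headers: {
    Authorization: `Bearer ${token}`,
    "Content-Type": "application/json",
  },
  // A minimal test record; the full payload shape is documented later in this guide.
  body: JSON.stringify({
    path: "/",
    hostname: "hitkeep.com",
    status_code: 200,
    user_agent: "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
  }),
};

// Uncomment to send (Node 18+ ships a global fetch):
// const response = await fetch(url, init);
// console.log(response.status);
```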

This guide shows the simplest production setup:

  1. CloudFront standard logs land in S3
  2. An AWS Lambda function parses each log object
  3. AI crawler requests are forwarded to HitKeep with a scoped API token

This is the setup we use for static-site deployments such as hitkeep.com.

Prerequisites:

  • A site already created in HitKeep for the domain you want to track
  • A team API client or personal API client with access to that site
  • A CloudFront distribution serving the site
  • An S3 bucket for CloudFront standard logs
  • A Lambda function in the same AWS account

Create a site-scoped API token in HitKeep first:

  1. Open Settings → API Clients for a personal token, or Administration → Team → Settings for a team token.
  2. Create a token that can access the target site.
  3. Copy the token immediately.

Guide: API Clients

You will need:

  • HITKEEP_BASE_URL (example: https://cloud.hitkeep.eu)
  • HITKEEP_SITE_ID (example: 6d5f9e7b-...)
  • HITKEEP_API_TOKEN
  • HITKEEP_HOSTNAME (example: hitkeep.com)

In your CloudFront distribution:

  1. Open Monitoring and logging
  2. Enable Standard logging
  3. Choose an S3 bucket and optional prefix
  4. Keep the default tab-delimited log format

HitKeep only needs these fields, which are already present in standard logs:

  • cs(User-Agent)
  • cs-uri-stem
  • cs-uri-query
  • x-host-header
  • sc-status
  • sc-bytes
  • time-taken
  • sc-content-type

CloudFront logs are batched, so AI fetches usually appear in HitKeep with a short delay instead of instantly.
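
To make the mapping concrete, here is a sketch of how a tab-delimited record lines up with a #Fields header. The header below is abbreviated and the record is fabricated for illustration; real CloudFront standard logs contain many more columns in a fixed order:

```javascript
// Illustrative only: an abbreviated #Fields header and one fabricated record.
const header =
  "#Fields: cs-uri-stem cs-uri-query sc-status sc-bytes time-taken x-host-header cs(User-Agent) sc-content-type";
const line = [
  "/guides/analytics/ai-visibility/",
  "-",
  "200",
  "48231",
  "0.143",
  "hitkeep.com",
  "Mozilla/5.0%20(compatible;%20GPTBot/1.0;%20+https://openai.com/gptbot)",
  "text/html",
].join("\t");

// Field names come from the header; values are tab-separated.
const fields = header.replace(/^#Fields:\s*/, "").split(/\s+/);
const values = line.split("\t");
const row = Object.fromEntries(fields.map((field, i) => [field, values[i]]));

// CloudFront URL-encodes fields such as the user agent; decode before matching.
const userAgent = decodeURIComponent(row["cs(User-Agent)"]);
```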

An example Lambda forwarder might look like this:

import { gunzipSync } from "node:zlib";
import { buffer } from "node:stream/consumers";
import { GetObjectCommand, S3Client } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Lowercase substrings that identify known AI crawler user agents.
// Extend this list as new crawlers appear.
const botMatchers = [
  "chatgpt-user",
  "gptbot",
  "claudebot",
  "claude-web",
  "perplexitybot",
  "google-extended",
  "googleother",
  "google-safety",
  "applebot-extended",
  "bytespider",
  "ccbot",
  "meta-externalagent",
  "meta-externalfetcher",
  "amazonbot",
  "cohere-ai",
  "youbot",
  "ai2bot",
  "diffbot",
  "timpibot",
  "imagesiftbot",
  "deepseekbot",
  "petalbot",
];

function getRequiredEnv(name) {
  const value = process.env[name]?.trim();
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

const hitkeepBaseUrl = getRequiredEnv("HITKEEP_BASE_URL").replace(/\/+$/, "");
const hitkeepSiteId = getRequiredEnv("HITKEEP_SITE_ID");
const hitkeepApiToken = getRequiredEnv("HITKEEP_API_TOKEN");
const defaultHostname = process.env.HITKEEP_HOSTNAME?.trim().toLowerCase() || "";
const requestTimeoutMs = Number.parseInt(process.env.HITKEEP_TIMEOUT_MS ?? "10000", 10);

// CloudFront URL-encodes log fields; decode defensively and treat "-" as empty.
function safeDecode(value) {
  if (!value || value === "-") {
    return "";
  }
  try {
    return decodeURIComponent(value);
  } catch {
    return value;
  }
}

function parseInteger(value) {
  if (!value || value === "-") {
    return undefined;
  }
  const parsed = Number.parseInt(value, 10);
  return Number.isFinite(parsed) ? parsed : undefined;
}

// time-taken is reported in seconds; convert it to whole milliseconds.
function parseDurationMs(value) {
  if (!value || value === "-") {
    return undefined;
  }
  const parsed = Number.parseFloat(value);
  if (!Number.isFinite(parsed) || parsed < 0) {
    return undefined;
  }
  return Math.round(parsed * 1000);
}

function looksLikeAIBot(userAgent) {
  const normalized = userAgent.trim().toLowerCase();
  return botMatchers.some((token) => normalized.includes(token));
}

function normalizePath(stem, query) {
  const base = stem && stem !== "-" ? stem : "/";
  if (!query || query === "-") {
    return base;
  }
  return `${base}?${query}`;
}

// Parse a CloudFront standard log file using its #Fields header and keep
// only requests from known AI crawlers on the tracked hostname.
function parseCloudFrontLog(content) {
  const lines = content.split("\n").map((line) => line.trim()).filter(Boolean);
  const fieldsLine = lines.find((line) => line.startsWith("#Fields:"));
  if (!fieldsLine) {
    throw new Error("CloudFront log is missing #Fields header");
  }
  const fieldNames = fieldsLine.replace(/^#Fields:\s*/, "").split(/\s+/);
  const records = [];
  for (const line of lines) {
    if (line.startsWith("#")) {
      continue;
    }
    const values = line.split("\t");
    if (values.length !== fieldNames.length) {
      continue;
    }
    const row = Object.fromEntries(fieldNames.map((field, index) => [field, values[index]]));
    const userAgent = safeDecode(row["cs(User-Agent)"]);
    if (!looksLikeAIBot(userAgent)) {
      continue;
    }
    const path = normalizePath(
      safeDecode(row["cs-uri-stem"]),
      safeDecode(row["cs-uri-query"]),
    );
    const statusCode = parseInteger(row["sc-status"]);
    if (!statusCode) {
      continue;
    }
    const hostname =
      safeDecode(row["x-host-header"]) ||
      safeDecode(row["cs(Host)"]) ||
      defaultHostname;
    const normalizedHostname = hostname ? hostname.toLowerCase() : "";
    // Skip requests for hostnames other than the tracked one.
    if (defaultHostname && normalizedHostname && normalizedHostname !== defaultHostname) {
      continue;
    }
    records.push({
      path,
      hostname: normalizedHostname || undefined,
      status_code: statusCode,
      content_type: safeDecode(row["sc-content-type"]) || undefined,
      response_ms: parseDurationMs(row["time-taken"]),
      bytes_served: parseInteger(row["sc-bytes"]),
      user_agent: userAgent,
    });
  }
  return records;
}

// Download a log object from S3, gunzipping .gz files.
async function loadLogObject(bucket, key) {
  const response = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
  const body = await buffer(response.Body);
  const raw = key.endsWith(".gz") ? gunzipSync(body) : body;
  return raw.toString("utf8");
}

// POST one record to the HitKeep AI fetch ingest endpoint, with a timeout.
async function postFetchRecord(record) {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), requestTimeoutMs);
  try {
    const response = await fetch(
      `${hitkeepBaseUrl}/api/sites/${hitkeepSiteId}/ingest/ai-fetch`,
      {
        method: "POST",
        headers: {
          Authorization: `Bearer ${hitkeepApiToken}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify(record),
        signal: controller.signal,
      },
    );
    if (!response.ok) {
      const message = await response.text();
      throw new Error(`HitKeep ingest failed with ${response.status}: ${message}`);
    }
  } finally {
    clearTimeout(timeout);
  }
}

export const handler = async (event) => {
  let processedFiles = 0;
  let forwardedRecords = 0;
  for (const s3Record of event.Records ?? []) {
    const bucket = s3Record.s3?.bucket?.name;
    // S3 event keys are URL-encoded, with "+" standing in for spaces.
    const key = s3Record.s3?.object?.key
      ? decodeURIComponent(s3Record.s3.object.key.replace(/\+/g, " "))
      : "";
    if (!bucket || !key) {
      continue;
    }
    const content = await loadLogObject(bucket, key);
    const records = parseCloudFrontLog(content);
    for (const record of records) {
      await postFetchRecord(record);
      forwardedRecords += 1;
    }
    processedFiles += 1;
  }
  return {
    processedFiles,
    forwardedRecords,
  };
};

The function:

  • downloads each CloudFront log object from S3
  • parses the #Fields header dynamically
  • filters for known AI crawler user agents such as GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and Amazonbot
  • forwards matching requests to HitKeep AI fetch ingest
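
The user-agent filter is plain case-insensitive substring matching against that list, so a standalone check behaves like this (a sketch mirroring looksLikeAIBot, with a reduced matcher list for illustration):

```javascript
// Mirrors the looksLikeAIBot check from the Lambda above (subset of matchers).
const matchers = ["gptbot", "claudebot", "perplexitybot"];
const isAIBot = (userAgent) => {
  const normalized = userAgent.trim().toLowerCase();
  return matchers.some((token) => normalized.includes(token));
};

isAIBot("Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"); // true
isAIBot("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0"); // false
```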

Runtime:

  • Node.js 22.x or later (the example uses ES modules and the global fetch API)

Set these on the function:

HITKEEP_BASE_URL=https://cloud.hitkeep.eu
HITKEEP_SITE_ID=YOUR_SITE_UUID
HITKEEP_API_TOKEN=YOUR_BEARER_TOKEN
HITKEEP_HOSTNAME=hitkeep.com
HITKEEP_TIMEOUT_MS=10000

The function needs:

  • s3:GetObject on the CloudFront log bucket or prefix
  • standard CloudWatch Logs write permissions

Minimal inline policy example:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::YOUR_LOG_BUCKET/YOUR_PREFIX/*"]
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}

If you only use the single-file example, you can zip it and create the Lambda function directly:

aws lambda create-function \
--function-name hitkeep-ai-fetch-forwarder \
--runtime nodejs22.x \
--handler index.handler \
--zip-file fileb://ai-fetch-cloudfront-forwarder.zip \
--role arn:aws:iam::123456789012:role/hitkeep-ai-fetch-forwarder \
--timeout 60 \
--memory-size 256

Or update an existing function:

aws lambda update-function-code \
--function-name hitkeep-ai-fetch-forwarder \
--zip-file fileb://ai-fetch-cloudfront-forwarder.zip

Set the environment:

aws lambda update-function-configuration \
--function-name hitkeep-ai-fetch-forwarder \
--environment "Variables={HITKEEP_BASE_URL=https://cloud.hitkeep.eu,HITKEEP_SITE_ID=YOUR_SITE_UUID,HITKEEP_API_TOKEN=YOUR_BEARER_TOKEN,HITKEEP_HOSTNAME=hitkeep.com,HITKEEP_TIMEOUT_MS=10000}"

Configure the CloudFront log bucket to invoke the Lambda on object creation:

  • Event type: s3:ObjectCreated:*
  • Prefix: your log prefix, if any
  • Suffix: .gz

That keeps the function focused on completed log files instead of unrelated bucket objects.
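
If you prefer to configure the trigger programmatically, the same settings can be expressed as an S3 notification configuration. A sketch under assumptions: the bucket name, prefix, region, and function ARN below are placeholders, you would apply the object with PutBucketNotificationConfigurationCommand from @aws-sdk/client-s3, and S3 must also be allowed to invoke the function (for example via aws lambda add-permission):

```javascript
// Placeholder bucket, prefix, and function ARN: substitute your real values.
// Apply with PutBucketNotificationConfigurationCommand from @aws-sdk/client-s3:
//   const s3 = new S3Client({});
//   await s3.send(new PutBucketNotificationConfigurationCommand(notificationConfig));
const notificationConfig = {
  Bucket: "YOUR_LOG_BUCKET",
  NotificationConfiguration: {
    LambdaFunctionConfigurations: [
      {
        LambdaFunctionArn:
          "arn:aws:lambda:eu-west-1:123456789012:function:hitkeep-ai-fetch-forwarder",
        Events: ["s3:ObjectCreated:*"],
        Filter: {
          Key: {
            FilterRules: [
              { Name: "prefix", Value: "YOUR_PREFIX/" },
              { Name: "suffix", Value: ".gz" },
            ],
          },
        },
      },
    ],
  },
};
```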

Each matching AI crawler request is transformed into this payload shape:

{
  "path": "/guides/analytics/ai-visibility/",
  "hostname": "hitkeep.com",
  "status_code": 200,
  "content_type": "text/html",
  "response_ms": 143,
  "bytes_served": 48231,
  "user_agent": "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
}

From the user agent and content type, HitKeep automatically derives:

  • assistant_name
  • assistant_family
  • resource_type

To verify the pipeline end to end, use this checklist:

  1. Confirm CloudFront is writing .gz log files into S3.
  2. Confirm the Lambda is triggered for new objects.
  3. Check CloudWatch Logs for forwardedRecords.
  4. Open AI Visibility in HitKeep and wait for the first crawler rows to arrive.

If the function runs but no rows appear:

  • verify the token can access the site
  • verify HITKEEP_SITE_ID is the correct site UUID
  • verify the host is the tracked host
  • confirm the requests are real AI bot user agents and not generic crawlers

A few things to keep in mind:

  • CloudFront standard logs are delayed and batch-oriented. This is normal.
  • This setup works best for static sites and edge-served origins where you cannot easily instrument an app server directly.
  • If you run behind ALB, nginx, or Caddy on the origin, origin-side logging is usually better because it can forward records immediately.
  • Lambda and S3 event delivery are at-least-once systems, so duplicate AI fetch rows are possible in rare retry scenarios.