
Bot and Spam Filtering

Bot traffic, referrer spam, uptime probes, and low-quality crawler noise can ruin otherwise useful analytics. A dashboard full of fake referrals and junk pageviews is not just annoying. It distorts conversion rates, pollutes attribution, and makes SEO decisions worse.

HitKeep now includes an OSS baseline spam-filter pipeline designed for self-hosted analytics operators who want cleaner data without deploying a separate abuse stack.

The current OSS baseline focuses on two high-confidence checks that are safe to apply before traffic ever reaches the analytics database: a referrer-host denylist compiled from the Matomo referrer spam list, and a network denylist compiled from the Spamhaus DROP lists.

These checks happen at ingest time for both pageview hits and custom events. If a request matches a blocked network or blocked referrer host, HitKeep accepts the HTTP request but does not persist the hit or event.

The biggest benefit is not vanity. It is query quality:

  • top referrers are less likely to be polluted by fake SEO domains
  • organic and referral attribution are easier to trust
  • landing-page and funnel conversion reports stay closer to real user behavior
  • long-tail SEO pages do not get inflated by junk crawler bursts

If you run a content-heavy site, a docs portal, or a marketing site with many low-volume landing pages, this matters a lot. A few hundred fake hits can completely distort the apparent performance of long-tail pages.

Referrer spam works by sending fake or low-value traffic with a misleading Referer header so the target domain appears in analytics dashboards. Classic examples include fake SEO services, gambling domains, and bot-traffic sellers.

HitKeep compiles the Matomo referrer spam list into a local cache and checks the normalized referrer host at pageview ingest time. Custom events do not carry a referrer, so they are only checked against Spamhaus network rules and IP exclusions.

Current behavior:

  • https://spam.example/path is normalized to spam.example
  • www.spam.example is normalized to spam.example
  • same-site referrers are not blocked just because the host appears in a denylist
  • direct traffic is preserved as direct traffic

This makes the filter strict enough to block known spam while avoiding the most obvious false positives for internal navigation.
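A minimal sketch of the normalization and same-site rules above, assuming hypothetical helper names rather than HitKeep's internals:

```python
from urllib.parse import urlsplit

def normalize_referrer_host(referrer: str) -> str:
    """Reduce a raw Referer value to a comparable host: drop scheme,
    path, and port, lowercase, and strip a leading "www."."""
    host = urlsplit(referrer).hostname
    if host is None:
        # No scheme present: treat everything before the first "/" as the host.
        host = referrer.split("/", 1)[0].lower()
    return host.removeprefix("www.")

def is_blocked_referrer(referrer: str, site_host: str, denylist: set[str]) -> bool:
    """Apply the denylist, but never treat same-site navigation as spam."""
    host = normalize_referrer_host(referrer)
    if host == normalize_referrer_host(site_host):
        return False
    return host in denylist
```

Note the same-site check runs before the denylist lookup, which is what keeps internal navigation safe even if a site's own host somehow lands in a list.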

Some traffic is not merely low-quality referral spam. It originates from networks that Spamhaus designates in its DROP lists as high-confidence abuse space.

HitKeep consumes the Spamhaus DROP lists. These CIDR network lists are enforced directly at ingest time for both pageviews and custom events. That keeps the Spamhaus side of the OSS baseline simple, deterministic, and well aligned with HitKeep’s offline-first deployment model.
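Enforcement reduces to CIDR membership tests over parsed list entries. A hedged sketch, assuming a DROP-style text format of one CIDR per line with ";"-prefixed comments (the exact feed format may differ):

```python
import ipaddress

def parse_drop_lines(lines):
    """Parse DROP-style text lines (e.g. "192.0.2.0/24 ; SBL123") into
    network objects, skipping blank and comment-only lines."""
    nets = []
    for line in lines:
        entry = line.split(";", 1)[0].strip()
        if entry:
            nets.append(ipaddress.ip_network(entry))
    return nets
```

Membership checks (`ip in network`) are then exact and deterministic, with no scoring or heuristics involved.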

HitKeep now stores the request hostname for each persisted hit. That enables analytics filtering by:

  • hostname
  • referrer_host

Why this matters:

  • sites served behind multiple hostnames can isolate traffic faster
  • proxy or edge misconfiguration becomes easier to spot
  • teams can inspect whether referral noise is concentrated on one hostname

Hostname filtering is query-time filtering. It does not block traffic by itself. It gives you a cleaner way to slice reports once the data is stored.
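Conceptually, query-time hostname filtering is just slicing already-stored rows. A toy sketch (the field names mirror the two filters above; the function itself is hypothetical):

```python
def filter_hits(rows, hostname=None, referrer_host=None):
    """Query-time slicing of persisted hits; nothing is blocked here.
    Each row is a dict with "hostname" and "referrer_host" keys."""
    out = rows
    if hostname is not None:
        out = [r for r in out if r["hostname"] == hostname]
    if referrer_host is not None:
        out = [r for r in out if r["referrer_host"] == referrer_host]
    return out
```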

IP exclusions and the built-in OSS spam filter solve different problems:

Use IP exclusions when you know exactly which traffic you want to suppress:

  • your own office IPs
  • VPN egress addresses
  • staging monitors
  • synthetic uptime checks

Use the built-in OSS spam filter when you want HitKeep to suppress widely known abuse patterns automatically:

  • referrer spam domains
  • DROP-listed malicious networks

In practice, you usually want both.

The OSS baseline is not a one-off static list baked into the product forever.

HitKeep supports:

  • a repo-shipped embedded snapshot so the feature works immediately on fresh installs
  • a compiled local cache file at data/spam-filter.json by default
  • the offline refresh command: hitkeep update-spam-lists
  • optional leader-side automatic refresh in the running server

The important part for self-hosted and air-gapped environments: HitKeep does not require live network fetches at runtime.

For maintainers, the shipped embedded snapshot can be refreshed in-repo with:

make update-default-spam-filter

Default behavior:

  • use the embedded repo snapshot if no cache file exists
  • use the local cache file if you generated one
  • make no outbound feed requests unless you explicitly run the updater command or enable auto-refresh

Relevant flags:

-spam-filter-path=/var/lib/hitkeep/data/spam-filter.json
-spam-filter-auto-update=false
-spam-filter-update-interval=1440

This keeps the runtime model simple:

  • upstream OSS feeds are fetched by the updater
  • feed data is normalized into one local artifact
  • the artifact is reused by the running server
  • your analytics pipeline does not depend on third-party APIs at query time

For accepted hits and events, HitKeep stores the analytics fields needed for reporting, including the request hostname for pageviews.

For blocked spam hits and events, HitKeep currently:

  • returns 202 Accepted at ingest (both /ingest and /ingest/event)
  • drops the row before persistence
  • logs the reason server-side

HitKeep does not currently store a quarantine table of rejected spam requests. That keeps the live analytics database cleaner and simpler, but it also means the spam filter is intentionally optimized for high-confidence rules.

The OSS baseline is intentionally conservative. It is useful, but it is not a full enterprise anti-bot platform.

Current limits:

  • no challenge/JS/browser fingerprinting system
  • no reverse-DNS verification pipeline for search bots yet
  • no ML scoring or per-request reputation scoring
  • no separate rejected-hit audit table yet

That is a deliberate product choice: self-hosted analytics should not require a security team just to keep dashboards clean.

For most self-hosted deployments, the best baseline is:

  1. Configure trusted proxies correctly.
  2. Add your internal traffic to IP exclusions.
  3. Run hitkeep update-spam-lists during provisioning or release rollout.
  4. Enable auto-refresh only if your deployment is intentionally online and you want unattended feed updates.
  5. Use hostname and referrer_host filters when investigating attribution anomalies.

That gives you a very strong starting point for privacy-friendly analytics bot filtering without adding Redis, Kafka, ClickHouse, or an external threat-intelligence service.