
Bot and Spam Filtering

Bot traffic, referrer spam, uptime probes, and low-quality crawler noise can ruin otherwise useful analytics. A dashboard full of fake referrals and junk pageviews is not just annoying. It distorts conversion rates, pollutes attribution, and makes SEO decisions worse.

HitKeep now includes an OSS baseline spam-filter pipeline designed for self-hosted analytics operators who want cleaner data without deploying a separate abuse stack.

The current OSS baseline focuses on two high-confidence checks that are safe to apply before traffic ever reaches the analytics database: a referrer-host denylist compiled from the Matomo referrer spam list, and a network denylist compiled from the Spamhaus DROP lists.

These checks happen at ingest time for both pageview hits and custom events. If a request matches a blocked network or blocked referrer host, HitKeep accepts the HTTP request but does not persist the hit or event.

The biggest benefit is not vanity. It is query quality:

  • top referrers are less likely to be polluted by fake SEO domains
  • organic and referral attribution are easier to trust
  • landing-page and funnel conversion reports stay closer to real user behavior
  • long-tail SEO pages do not get inflated by junk crawler bursts

If you run a content-heavy site, a docs portal, or a marketing site with many low-volume landing pages, this matters a lot. A few hundred fake hits can completely distort the apparent performance of long-tail pages.

Referrer spam works by sending fake or low-value traffic with a misleading Referer header so the target domain appears in analytics dashboards. Classic examples include fake SEO services, gambling domains, and bot-traffic sellers.

HitKeep compiles the Matomo referrer spam list into a local cache and checks the normalized referrer host at pageview ingest time. Custom events do not carry a referrer, so they are only checked against Spamhaus network rules and IP exclusions.

Current behavior:

  • https://spam.example/path is normalized to spam.example
  • www.spam.example is normalized to spam.example
  • same-site referrers are not blocked just because the host appears in a denylist
  • direct traffic is preserved as direct traffic

This makes the filter strict enough to block known spam while avoiding the most obvious false positives for internal navigation.
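A minimal sketch of the normalization and same-site rules above, assuming hypothetical helper names rather than HitKeep's internals:

```python
from urllib.parse import urlsplit

def normalize_referrer_host(referrer: str) -> str:
    """Reduce a raw Referer value to a comparable host: drop scheme,
    path, and port, lowercase, and strip a leading "www."."""
    host = urlsplit(referrer).hostname
    if host is None:
        # No scheme present: treat everything before the first "/" as the host.
        host = referrer.split("/", 1)[0].lower()
    return host.removeprefix("www.")

def is_blocked_referrer(referrer: str, site_host: str, denylist: set[str]) -> bool:
    """Apply the denylist, but never treat same-site navigation as spam."""
    host = normalize_referrer_host(referrer)
    if host == normalize_referrer_host(site_host):
        return False
    return host in denylist
```

Note the same-site check runs before the denylist lookup, which is what keeps internal navigation safe even if a site's own host somehow lands in a list.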

Some traffic is not merely low-quality referral spam. It originates from networks that Spamhaus designates in its DROP lists as high-confidence abuse space.

HitKeep consumes the Spamhaus DROP lists. These CIDR network lists are enforced directly at ingest time for both pageviews and custom events. That keeps the Spamhaus side of the OSS baseline simple, deterministic, and well aligned with HitKeep’s offline-first deployment model.
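Enforcement reduces to CIDR membership tests over parsed list entries. A hedged sketch, assuming a DROP-style text format of one CIDR per line with ";"-prefixed comments (the exact feed format may differ):

```python
import ipaddress

def parse_drop_lines(lines):
    """Parse DROP-style text lines (e.g. "192.0.2.0/24 ; SBL123") into
    network objects, skipping blank and comment-only lines."""
    nets = []
    for line in lines:
        entry = line.split(";", 1)[0].strip()
        if entry:
            nets.append(ipaddress.ip_network(entry))
    return nets
```

Membership checks (`ip in network`) are then exact and deterministic, with no scoring or heuristics involved.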

HitKeep now stores the request hostname for each persisted hit. That enables analytics filtering by:

  • hostname
  • referrer_host

Why this matters:

  • sites served behind multiple hostnames can isolate traffic faster
  • proxy or edge misconfiguration becomes easier to spot
  • teams can inspect whether referral noise is concentrated on one hostname

Hostname filtering is query-time filtering. It does not block traffic by itself. It gives you a cleaner way to slice reports once the data is stored.
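Conceptually, query-time hostname filtering is just slicing already-stored rows. A toy sketch (the field names mirror the two filters above; the function itself is hypothetical):

```python
def filter_hits(rows, hostname=None, referrer_host=None):
    """Query-time slicing of persisted hits; nothing is blocked here.
    Each row is a dict with "hostname" and "referrer_host" keys."""
    out = rows
    if hostname is not None:
        out = [r for r in out if r["hostname"] == hostname]
    if referrer_host is not None:
        out = [r for r in out if r["referrer_host"] == referrer_host]
    return out
```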

IP exclusions and the built-in OSS spam filter solve different problems:

Use IP exclusions when you know exactly which traffic you want to suppress:

  • your own office IPs
  • VPN egress addresses
  • staging monitors
  • synthetic uptime checks

Use the built-in OSS spam filter when you want HitKeep to suppress widely known abuse patterns automatically:

  • referrer spam domains
  • DROP-listed malicious networks

In practice, you usually want both.

The OSS baseline is not a one-off static list baked into the product forever.

HitKeep supports:

  • a repo-shipped embedded snapshot so the feature works immediately on fresh installs
  • a compiled local cache file at data/spam-filter.json by default
  • the offline refresh command: hitkeep update-spam-lists
  • optional leader-side automatic refresh in the running server

The important part for self-hosted and air-gapped environments: HitKeep does not require live network fetches at runtime.

For maintainers, the shipped embedded snapshot can be refreshed in-repo with:

make update-default-spam-filter

Default behavior:

  • use the embedded repo snapshot if no cache file exists
  • use the local cache file if you generated one
  • make no outbound feed requests unless you explicitly run the updater command or enable auto-refresh

Relevant flags:

-spam-filter-path=/var/lib/hitkeep/data/spam-filter.json
-spam-filter-auto-update=false
-spam-filter-update-interval=1440

This keeps the runtime model simple:

  • upstream OSS feeds are fetched by the updater
  • feed data is normalized into one local artifact
  • the artifact is reused by the running server
  • your analytics pipeline does not depend on third-party APIs at query time

For accepted hits and events, HitKeep stores the analytics fields needed for reporting, including the request hostname for pageviews.

For blocked spam hits and events, HitKeep currently:

  • returns 202 Accepted at ingest (both /ingest and /ingest/event)
  • drops the row before persistence
  • logs the reason server-side

HitKeep does not currently store a quarantine table of rejected spam requests. That keeps the live analytics database cleaner and simpler, but it also means the spam filter is intentionally optimized for high-confidence rules.

The OSS baseline is intentionally conservative. It is useful, but it is not a full enterprise anti-bot platform.

Current limits:

  • no challenge/JS/browser fingerprinting system
  • no reverse-DNS verification pipeline for search bots yet
  • no ML scoring or per-request reputation scoring
  • no separate rejected-hit audit table yet

That is a deliberate product choice: self-hosted analytics should not require a security team just to keep dashboards clean.

For most self-hosted deployments, the best baseline is:

  1. Configure trusted proxies correctly.
  2. Add your internal traffic to IP exclusions.
  3. Run hitkeep update-spam-lists during provisioning or release rollout.
  4. Enable auto-refresh only if your deployment is intentionally online and you want unattended feed updates.
  5. Use hostname and referrer_host filters when investigating attribution anomalies.

That gives you a very strong starting point for privacy-friendly analytics bot filtering without adding Redis, Kafka, ClickHouse, or an external threat-intelligence service.