Bot and Spam Filtering
Bot traffic, referrer spam, uptime probes, and low-quality crawler noise can ruin otherwise useful analytics. A dashboard full of fake referrals and junk pageviews is not just annoying. It distorts conversion rates, pollutes attribution, and makes SEO decisions worse.
HitKeep now includes an OSS baseline spam-filter pipeline designed for self-hosted analytics operators who want cleaner data without deploying a separate abuse stack.
What HitKeep Filters
The current OSS baseline focuses on high-confidence traffic that is safe to drop before it ever reaches the analytics database:
- Referrer spam hostnames from the Matomo referrer spam list
- Known abusive IP networks from Spamhaus DROP and Spamhaus DROPv6
- Your own traffic and internal infrastructure through site and global IP exclusions
These checks happen at ingest time for both pageview hits and custom events. If a request matches a blocked network or blocked referrer host, HitKeep accepts the HTTP request but does not persist the hit or event.
What This Means for Analytics Accuracy
The biggest benefit is not vanity. It is query quality:
- top referrers are less likely to be polluted by fake SEO domains
- organic and referral attribution are easier to trust
- landing-page and funnel conversion reports stay closer to real user behavior
- long-tail SEO pages do not get inflated by junk crawler bursts
If you run a content-heavy site, a docs portal, or a marketing site with many low-volume landing pages, this matters a lot. A few hundred fake hits can completely distort the apparent performance of long-tail pages.
Referrer Spam Filtering
Referrer spam works by sending fake or low-value traffic with a misleading Referer header so the target domain appears in analytics dashboards. Classic examples include fake SEO services, gambling domains, and bot-traffic sellers.
HitKeep compiles the Matomo referrer spam list into a local cache and checks the normalized referrer host at pageview ingest time. Custom events do not carry a referrer, so they are only checked against Spamhaus network rules and IP exclusions.
Current behavior:
- `https://spam.example/path` is normalized to `spam.example`
- `www.spam.example` is normalized to `spam.example`
- same-site referrers are not blocked just because the host appears in a denylist
- direct traffic is preserved as direct traffic
This makes the filter strict enough to block known spam while avoiding the most obvious false positives for internal navigation.
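The normalization and lookup described above can be sketched roughly as follows. This is an illustrative Python sketch, not HitKeep's actual implementation; the function names and the tiny denylist are hypothetical:

```python
from urllib.parse import urlsplit

def normalize_referrer_host(referrer: str) -> str:
    """Reduce a raw Referer value to a bare hostname for denylist lookup."""
    host = urlsplit(referrer).netloc or referrer  # tolerate bare hostnames
    host = host.split(":")[0].lower()             # drop any port
    if host.startswith("www."):
        host = host[len("www."):]                 # www.spam.example -> spam.example
    return host

def is_spam_referrer(referrer: str, site_host: str, denylist: set) -> bool:
    host = normalize_referrer_host(referrer)
    if not host or host == site_host:
        return False  # direct traffic and same-site navigation are never blocked
    return host in denylist

denylist = {"spam.example"}
print(is_spam_referrer("https://spam.example/path", "mysite.example", denylist))   # True
print(is_spam_referrer("https://www.spam.example", "mysite.example", denylist))    # True
print(is_spam_referrer("https://mysite.example/page", "mysite.example", denylist)) # False
```

The same-site short-circuit is what keeps internal navigation from being misclassified even if a hostname ever appears in a denylist.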
Spamhaus Network Filtering
Some traffic is not merely low-quality referral spam. It originates from networks that Spamhaus designates in its DROP lists as high-confidence abuse space.
HitKeep consumes:
- Spamhaus DROP (IPv4 networks)
- Spamhaus DROPv6 (IPv6 networks)
These CIDR network lists are enforced directly at ingest time for both pageviews and custom events. That keeps the Spamhaus side of the OSS baseline simple, deterministic, and well aligned with HitKeep’s offline-first deployment model.
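Enforcing a CIDR list at ingest time amounts to a membership check of the client IP against each blocked network. A minimal sketch using Python's standard `ipaddress` module, with hypothetical sample ranges standing in for the real DROP data (a production implementation would typically use a radix trie rather than a linear scan):

```python
import ipaddress

# Hypothetical sample entries; the real DROP/DROPv6 lists are much larger.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),    # IPv4 documentation range
    ipaddress.ip_network("2001:db8:bad::/48"), # IPv6 documentation range
]

def ip_is_blocked(client_ip: str) -> bool:
    """Return True if the client IP falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    # Only compare addresses against networks of the same IP version.
    return any(addr in net for net in BLOCKED_NETWORKS if addr.version == net.version)

print(ip_is_blocked("203.0.113.7"))     # True
print(ip_is_blocked("198.51.100.1"))    # False
print(ip_is_blocked("2001:db8:bad::1")) # True
```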
Hostname Filtering
HitKeep now stores the request hostname for each persisted hit. That enables analytics filtering by:
- `hostname`
- `referrer_host`
Why this matters:
- sites served behind multiple hostnames can isolate traffic faster
- proxy or edge misconfiguration becomes easier to spot
- teams can inspect whether referral noise is concentrated on one hostname
Hostname filtering is query-time filtering. It does not block traffic by itself. It gives you a cleaner way to slice reports once the data is stored.
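To make the query-time distinction concrete, here is a sketch against a simplified in-memory table. The `hits` schema here is hypothetical (only the `hostname` and `referrer_host` fields come from the docs above); it just shows the kind of slicing the stored hostname enables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (path TEXT, hostname TEXT, referrer_host TEXT)")
conn.executemany(
    "INSERT INTO hits VALUES (?, ?, ?)",
    [
        ("/pricing", "www.example.com",        "news.ycombinator.com"),
        ("/pricing", "edge-cache.example.net", "seo-junk.example"),
        ("/docs",    "www.example.com",        ""),
    ],
)

# Group hits by request hostname to see whether referral noise is
# concentrated on one hostname (e.g. a misconfigured proxy or edge host).
rows = conn.execute(
    "SELECT hostname, COUNT(*) FROM hits GROUP BY hostname ORDER BY hostname"
).fetchall()
print(rows)  # [('edge-cache.example.net', 1), ('www.example.com', 2)]
```

Nothing here blocks traffic; the rows are already persisted, and the hostname column only changes how you can slice them.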
IP Exclusions vs Spam Filtering
These systems solve different problems:
IP exclusions
Use IP exclusions when you know exactly which traffic you want to suppress:
- your own office IPs
- VPN egress addresses
- staging monitors
- synthetic uptime checks
Automatic spam filtering
Use the built-in OSS spam filter when you want HitKeep to suppress widely known abuse patterns automatically:
- referrer spam domains
- DROP-listed malicious networks
In practice, you usually want both.
How the Update Pipeline Works
The OSS baseline is not a one-off static list baked into the product forever.
HitKeep supports:
- a repo-shipped embedded snapshot so the feature works immediately on fresh installs
- a compiled local cache file, at `data/spam-filter.json` by default
- the offline refresh command: `hitkeep update-spam-lists`
- optional leader-side automatic refresh in the running server
The important part for self-hosting and airgapped environments: HitKeep does not require live network fetches at runtime.
For maintainers, the shipped embedded snapshot can be refreshed in-repo with `make update-default-spam-filter`.

Default behavior:
- use the embedded repo snapshot if no cache file exists
- use the local cache file if you generated one
- make no outbound feed requests unless you explicitly run the updater command or enable auto-refresh
Relevant flags:

- `-spam-filter-path=/var/lib/hitkeep/data/spam-filter.json`
- `-spam-filter-auto-update=false`
- `-spam-filter-update-interval=1440`

This keeps the runtime model simple:
- upstream OSS feeds are fetched
- normalized into one local artifact
- the artifact is reused by the running server
- your analytics pipeline does not depend on third-party APIs at query time
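The source-selection order described above (prefer a locally compiled cache, fall back to the embedded snapshot, never fetch at runtime) can be sketched like this. The data shape and function name are hypothetical; only the precedence logic mirrors the docs:

```python
import json
from pathlib import Path

# Hypothetical embedded snapshot, standing in for the repo-shipped default.
EMBEDDED_SNAPSHOT = {
    "referrer_hosts": ["spam.example"],
    "networks": ["203.0.113.0/24"],
}

def load_spam_filter(cache_path: str = "data/spam-filter.json") -> dict:
    """Prefer the locally compiled cache file if it exists; otherwise fall
    back to the embedded snapshot. No outbound request happens either way."""
    path = Path(cache_path)
    if path.is_file():
        return json.loads(path.read_text())
    return EMBEDDED_SNAPSHOT

# With no cache file present, the embedded snapshot is used.
rules = load_spam_filter("/nonexistent/spam-filter.json")
print(sorted(rules))  # ['networks', 'referrer_hosts']
```

Outbound fetches happen only in the updater command (or the opt-in auto-refresh), which writes the cache file this loader then picks up.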
What HitKeep Stores and What It Does Not
For accepted hits and events, HitKeep stores the analytics fields needed for reporting, including the request hostname for pageviews.
For blocked spam hits and events, HitKeep currently:
- returns `202 Accepted` at ingest (both `/ingest` and `/ingest/event`)
- drops the row before persistence
- logs the reason server-side
HitKeep does not currently store a quarantine table of rejected spam requests. That keeps the live analytics database cleaner and simpler, but it also means the spam filter is intentionally optimized for high-confidence rules.
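The blocked-hit behavior reduces to a small decision at the end of the ingest path: return the same `202` either way, but skip persistence and log the reason when a rule matched. A sketch under those assumptions (handler and `persist` are illustrative names, not HitKeep's API):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingest")

def persist(hit: dict) -> None:
    """Hypothetical stand-in for writing a row to the analytics database."""
    pass

def handle_ingest(hit: dict, blocked_reason=None) -> int:
    if blocked_reason is not None:
        # Blocked: no row is written; only a server-side log line records why.
        log.info("dropped hit for %s: %s", hit.get("path"), blocked_reason)
        return 202
    persist(hit)
    return 202  # accepted hits also return 202 Accepted at ingest

print(handle_ingest({"path": "/pricing"}, "referrer host on denylist"))  # 202
print(handle_ingest({"path": "/pricing"}))                               # 202
```

Returning `202` in both cases means spam senders get no feedback signal about which of their requests were discarded.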
Known Limits
The OSS baseline is intentionally conservative. It is useful, but it is not a full enterprise anti-bot platform.
Current limits:
- no challenge/JS/browser fingerprinting system
- no reverse-DNS verification pipeline for search bots yet
- no ML scoring or per-request reputation scoring
- no separate rejected-hit audit table yet
That is a deliberate product choice: self-hosted analytics should not require a security team just to keep dashboards clean.
Recommended Setup
For most self-hosted deployments, the best baseline is:
- Configure trusted proxies correctly.
- Add your internal traffic to IP exclusions.
- Run `hitkeep update-spam-lists` during provisioning or release rollout.
- Enable auto-refresh only if your deployment is intentionally online and you want unattended feed updates.
- Use `hostname` and `referrer_host` filters when investigating attribution anomalies.
That gives you a very strong starting point for privacy-friendly analytics bot filtering without adding Redis, Kafka, ClickHouse, or an external threat-intelligence service.