
Data Retention and Archiving

You decide how long your analytics data lives — not a cloud vendor’s pricing tier. HitKeep’s retention system follows one rule: data you choose to prune is archived to Parquet first, in an open format you own, before it’s removed from the live database.

Terminal window
# Keep raw hits and events for 365 days; archive older data to /var/lib/hitkeep/archive
export HITKEEP_DATA_RETENTION_DAYS=365
export HITKEEP_ARCHIVE_PATH=/var/lib/hitkeep/archive
./hitkeep

Or as startup flags:

Terminal window
./hitkeep -retention-days=365 -archive-path=/var/lib/hitkeep/archive

See the Configuration Reference for all options.

The retention worker runs once daily. For each site with a configured retention policy it will:

  1. Count hits and events older than the retention threshold.
  2. Export those rows to a compressed Parquet file in the archive directory — before touching the live database.
  3. Prune the archived records from hitkeep.db to reclaim disk space.
  4. Leave rollups intact. Aggregated hourly, daily, and monthly rollups are never pruned — they power the trend charts in the dashboard indefinitely.
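The first three steps amount to one export and one delete per site. A simplified DuckDB SQL sketch of a single run (the `hits` table, its columns, and the 365-day cutoff are illustrative, not HitKeep's actual schema, and this is not meant to be run against a live instance):

```
duckdb hitkeep.db -c "
-- Steps 1-2: export rows past the cutoff to a compressed Parquet file
COPY (SELECT * FROM hits WHERE timestamp < now() - INTERVAL 365 DAYS)
  TO '/var/lib/hitkeep/archive/site_example_1717200000.parquet' (FORMAT PARQUET);
-- Step 3: prune the same rows from the live database
DELETE FROM hits WHERE timestamp < now() - INTERVAL 365 DAYS;
"
```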

The result is two data tiers:

Tier | Location          | What's there                                         | Query speed
-----|-------------------|------------------------------------------------------|-----------------
Hot  | hitkeep.db        | Recent raw hits & events within the retention window | Instant
Cold | Archive directory | Older raw hits & events exported to Parquet          | Fast (file scan)

Dashboard trend views always show complete historical data because rollups remain in the hot database regardless of how the raw-data retention is configured.

Different sites have different requirements. Override the default retention window per site via the API:

Terminal window
curl -X PUT https://your-hitkeep.example/api/sites/{site_id}/retention \
  -H "Content-Type: application/json" \
  -b "hk_token=YOUR_SESSION_COOKIE" \
  -d '{"days": 90}'

A high-traffic site may need only 90 days of raw data. A site subject to statutory record-keeping requirements may need seven years. You set the policy; HitKeep enforces it.

Archived Parquet files are standard open-format files queryable with any compatible tool — no HitKeep license required.

Terminal window
# DuckDB CLI — count page views per month from the archive
duckdb -c "
SELECT date_trunc('month', timestamp) AS month, count(*) AS hits
FROM read_parquet('/var/lib/hitkeep/archive/site_*.parquet')
GROUP BY 1 ORDER BY 1;
"
Terminal window
# Merge hot and cold data in a single query
duckdb -c "
ATTACH 'hitkeep.db' AS hot;
SELECT timestamp::date AS day, count(*) AS hits
FROM (
SELECT timestamp FROM hot.hits WHERE site_id = 'your-site-id'
UNION ALL
SELECT timestamp FROM read_parquet('/var/lib/hitkeep/archive/site_your-site-id_*.parquet')
)
GROUP BY 1 ORDER BY 1;
"

The archive naming convention is site_{site_id}_{unix_timestamp}.parquet. Each archival run writes one file per site that had data past the cutoff.
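Because the file name embeds a Unix timestamp, plain shell can recover the site id and export time from it. A sketch (the file name here is a made-up example; it assumes the site id itself contains no underscore, and `date -d @…` is GNU date):

```shell
# Parse an archive file name of the form site_{site_id}_{unix_timestamp}.parquet
f="site_abc123_1717200000.parquet"    # hypothetical example file name
ts="${f##*_}"; ts="${ts%.parquet}"    # text after the last "_", minus the extension
site="${f#site_}"; site="${site%_*}"  # strip the "site_" prefix and "_<timestamp>" suffix
echo "$site exported at $(date -u -d "@$ts" '+%Y-%m-%d %H:%M:%SZ')"
# → abc123 exported at 2024-06-01 00:00:00Z
```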

The complete HitKeep data footprint is two paths:

  • Live database: hitkeep.db (the DuckDB file)
  • Archive: the configured archive directory (Parquet files)

A reliable backup is a periodic file copy of both:

Terminal window
# Example: nightly sync to S3-compatible storage with rclone
rclone copy /var/lib/hitkeep/hitkeep.db remote:my-bucket/hitkeep/live/
rclone sync /var/lib/hitkeep/archive/ remote:my-bucket/hitkeep/archive/
Terminal window
# Or with rsync to a remote host
rsync -az /var/lib/hitkeep/ backup-host:/backups/hitkeep/

Because hitkeep.db is a single file, you can also use filesystem-level snapshots (LVM, ZFS, APFS) for point-in-time consistency.
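With ZFS, for example, a snapshot-based backup can look like the following sketch (the `tank/hitkeep` and `backup/hitkeep` dataset names are assumptions for illustration):

```
# Take a consistent point-in-time snapshot of the dataset holding /var/lib/hitkeep
zfs snapshot tank/hitkeep@nightly-$(date +%F)
# Optionally replicate the snapshot to another host
zfs send tank/hitkeep@nightly-$(date +%F) | ssh backup-host zfs receive -F backup/hitkeep
```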

HitKeep Cloud manages retention policies, automated Parquet archiving, and encrypted off-site backups automatically — in your sovereign region (EU Frankfurt or US Virginia). Join the waitlist →