
Data Retention and Archiving

You decide how long your analytics data lives — not a cloud vendor’s pricing tier. HitKeep’s retention system follows one rule: data you choose to prune is archived to Parquet first, in an open format you own, before it’s removed from the live database.

Terminal window
# Keep raw hits and events for 365 days; archive older data to /var/lib/hitkeep/archive
export HITKEEP_DATA_RETENTION_DAYS=365
export HITKEEP_ARCHIVE_PATH=/var/lib/hitkeep/archive
./hitkeep

Or as startup flags:

Terminal window
./hitkeep -retention-days=365 -archive-path=/var/lib/hitkeep/archive

See the Configuration Reference for all options.

The retention worker runs once daily. For each site with a configured retention policy, it will:

  1. Count hits and events older than the retention threshold.
  2. Export those rows to a compressed Parquet file in the archive directory — before touching the live database.
  3. Prune the archived records from hitkeep.db to reclaim disk space.
  4. Leave rollups intact. Aggregated hourly, daily, and monthly rollups are never pruned — they power the trend charts in the dashboard indefinitely.
flowchart TD
    subgraph RetentionWorker["Retention Worker (daily)"]
        R1[Load site retention policies] --> R2{Hits/events older\nthan cutoff?}
        R2 -->|No| R3[Skip site]
        R2 -->|Yes| R4["COPY ... TO (FORMAT PARQUET)"]
        R4 --> R5[DELETE archived rows]
        R5 --> R6[Rollups untouched]
    end

    subgraph Storage
        HOT[(DuckDB — hitkeep.db\nHot tier)]
        COLD[("Archive directory\nCold tier — Parquet")]
    end

    R1 -.->|query| HOT
    R4 -->|export| COLD
    R5 -->|prune| HOT
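
The worker's decision logic boils down to a cutoff comparison: rows older than the retention window are archived first, then pruned. A minimal sketch in Python (the row shape and function name are illustrative, not HitKeep internals):

```python
from datetime import datetime, timedelta, timezone

def partition_for_retention(rows, retention_days, now=None):
    """Split rows into (keep, archive) around a retention cutoff.

    Each row is a dict with an aware 'timestamp'. Rows older than the
    cutoff are the ones the worker exports to Parquet *before* deleting
    them from the live database -- the archive-then-prune ordering.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    keep = [r for r in rows if r["timestamp"] >= cutoff]
    archive = [r for r in rows if r["timestamp"] < cutoff]
    return keep, archive
```

If `archive` comes back empty, the worker skips the site entirely, matching the "Skip site" branch in the diagram above.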

The result is two data tiers:

| Tier | Location          | What's there                                         | Query speed      |
|------|-------------------|------------------------------------------------------|------------------|
| Hot  | hitkeep.db        | Recent raw hits & events within the retention window | Instant          |
| Cold | Archive directory | Older raw hits & events exported to Parquet          | Fast (file scan) |

Dashboard trend views always show complete historical data because rollups remain in the hot database regardless of how the raw-data retention is configured.

Different sites have different requirements. Override the default retention window per site via the API:

Terminal window
curl -X PUT https://your-hitkeep.example/api/sites/{site_id}/retention \
  -H "Content-Type: application/json" \
  -b "hk_token=YOUR_SESSION_COOKIE" \
  -d '{"days": 90}'

A high-traffic site may need only 90 days of raw data. A site subject to statutory record-keeping requirements may need seven years. You set the policy; HitKeep enforces it.
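
The same call can be built with Python's standard library, shown here without sending the request (the endpoint path and cookie name come from the curl example above; the helper itself is illustrative):

```python
import json
import urllib.request

def build_retention_request(base_url, site_id, days, session_cookie):
    """Build the PUT request that sets a site's retention window.

    Mirrors the curl example: JSON body {"days": N}, session-cookie auth.
    """
    url = f"{base_url}/api/sites/{site_id}/retention"
    body = json.dumps({"days": days}).encode()
    req = urllib.request.Request(url, data=body, method="PUT")
    req.add_header("Content-Type", "application/json")
    req.add_header("Cookie", f"hk_token={session_cookie}")
    return req

# Not sent here -- urllib.request.urlopen(req) would perform the call.
req = build_retention_request(
    "https://your-hitkeep.example", "site-123", 90, "YOUR_SESSION_COOKIE"
)
```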

The retention worker and backup worker can write to local disk or any S3-compatible object store. Both share the same S3 credential configuration.

flowchart LR
    subgraph HitKeep["HitKeep (Leader Node)"]
        RW[Retention Worker]
        BW[Backup Worker]
    end

    subgraph Local["Local Filesystem"]
        LA["/var/lib/hitkeep/archive/"]
        LB["/var/lib/hitkeep/backups/"]
    end

    subgraph S3["S3-Compatible Storage"]
        SA["s3://bucket/archive/"]
        SB["s3://bucket/backups/"]
    end

    RW -->|"HITKEEP_ARCHIVE_PATH\n(local)"| LA
    RW -->|"HITKEEP_ARCHIVE_PATH\n(s3://)"| SA
    BW -->|"HITKEEP_BACKUP_PATH\n(local)"| LB
    BW -->|"HITKEEP_BACKUP_PATH\n(s3://)"| SB

Instead of writing Parquet files to a local directory, HitKeep can archive directly to any S3-compatible object store. Set HITKEEP_ARCHIVE_PATH to an s3:// URL and configure credentials.

Terminal window
export HITKEEP_ARCHIVE_PATH=s3://my-analytics-bucket/hitkeep/archive
export HITKEEP_S3_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export HITKEEP_S3_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
export HITKEEP_S3_REGION=eu-west-1
./hitkeep

On EC2, ECS, or Lambda with an attached IAM role, no explicit keys are needed. HitKeep falls back to the AWS SDK default credential chain automatically.

Terminal window
export HITKEEP_ARCHIVE_PATH=s3://my-analytics-bucket/hitkeep/archive
export HITKEEP_S3_REGION=eu-west-1
./hitkeep
# Logs: "S3 archive enabled" mode="credential chain" region="eu-west-1"
Terminal window
# Self-hosted MinIO
export HITKEEP_ARCHIVE_PATH=s3://hitkeep-archive/data
export HITKEEP_S3_ACCESS_KEY_ID=minioadmin
export HITKEEP_S3_SECRET_ACCESS_KEY=minioadmin
export HITKEEP_S3_ENDPOINT=localhost:9000
export HITKEEP_S3_URL_STYLE=path
export HITKEEP_S3_USE_SSL=false
export HITKEEP_S3_REGION=us-east-1
./hitkeep
Terminal window
# Cloudflare R2
export HITKEEP_ARCHIVE_PATH=s3://my-r2-bucket/hitkeep/archive
export HITKEEP_S3_ACCESS_KEY_ID=your-r2-access-key
export HITKEEP_S3_SECRET_ACCESS_KEY=your-r2-secret-key
export HITKEEP_S3_ENDPOINT=your-account-id.r2.cloudflarestorage.com
export HITKEEP_S3_REGION=auto
./hitkeep

See the Configuration Reference for the full list of S3 settings.

Archived Parquet files are standard open-format files queryable with any compatible tool — no HitKeep license required.

flowchart LR
    subgraph Query["DuckDB CLI or any Parquet tool"]
        Q["SELECT ... FROM"]
    end

    subgraph Hot["Hot Tier"]
        DB[("hitkeep.db\nhits / events")]
    end

    subgraph Cold["Cold Tier"]
        PQ[("archive/\nsite_*.parquet")]
    end

    Q -->|"ATTACH 'hitkeep.db'"| DB
    Q -->|"read_parquet('...')"| PQ
    DB -->|"UNION ALL"| RESULT["Combined\nresult set"]
    PQ -->|"UNION ALL"| RESULT
Terminal window
# DuckDB CLI — count page views per month from the archive
duckdb -c "
SELECT date_trunc('month', timestamp) AS month, count(*) AS hits
FROM read_parquet('/var/lib/hitkeep/archive/site_*.parquet')
GROUP BY 1 ORDER BY 1;
"
Terminal window
# Merge hot and cold data in a single query
duckdb -c "
  ATTACH 'hitkeep.db' AS hot;
  SELECT timestamp::date AS day, count(*) AS hits
  FROM (
    SELECT timestamp FROM hot.hits WHERE site_id = 'your-site-id'
    UNION ALL
    SELECT timestamp FROM read_parquet('/var/lib/hitkeep/archive/site_your-site-id_*.parquet')
  )
  GROUP BY 1 ORDER BY 1;
"

The archive naming convention is site_{site_id}_{unix_timestamp}.parquet. Each archival run writes one file per site that had data past the cutoff.
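
A small sketch of parsing that convention (the helper is hypothetical; only the naming pattern comes from the text above):

```python
import re

def parse_archive_name(filename):
    """Parse site_{site_id}_{unix_timestamp}.parquet into its parts.

    The timestamp is the trailing digit run, so site IDs that themselves
    contain underscores still parse correctly via regex backtracking.
    """
    m = re.fullmatch(r"site_(.+)_(\d+)\.parquet", filename)
    if not m:
        raise ValueError(f"not an archive file: {filename}")
    return m.group(1), int(m.group(2))
```

This is handy for grouping archive files per site before querying them, or for finding the most recent export for a given site.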

HitKeep includes a built-in backup worker that periodically exports all live databases to Parquet snapshots using DuckDB’s EXPORT DATABASE. This covers both the shared hitkeep.db and any per-tenant databases.

For a dedicated operational guide, see Backups and Restore and S3 Backups.

flowchart TD
    START([HitKeep starts]) --> CHECK{HITKEEP_BACKUP_PATH\nset?}
    CHECK -->|No| DISABLED[Backups disabled — no-op]
    CHECK -->|Yes| INIT["Wait 30 seconds\n(let DB settle)"]
    INIT --> RUN

    subgraph RUN["Run Backup Cycle"]
        direction TB
        S3{S3 path?}
        S3 -->|Yes| HTTPFS["Load httpfs +\nconfigure S3 secret"]
        S3 -->|No| CHECKPOINT
        HTTPFS --> CHECKPOINT["CHECKPOINT shared DB"]
        CHECKPOINT --> EXPORT_SHARED["EXPORT DATABASE\nshared → {path}/shared/{timestamp}/"]
        EXPORT_SHARED --> TENANTS["List non-default tenant IDs"]
        TENANTS --> LOOP["For each tenant:\nCHECKPOINT + EXPORT DATABASE\n→ {path}/tenants/{id}/{timestamp}/"]
        LOOP --> PRUNE{"Local path?"}
        PRUNE -->|Yes| PRUNE_LOCAL["Delete oldest snapshots\nbeyond retention count"]
        PRUNE -->|No| PRUNE_S3["Log: use S3 lifecycle policies"]
    end

    RUN --> WAIT["Sleep backup-interval minutes"]
    WAIT --> RUN
Terminal window
# Local backups — defaults: every 60 minutes, keep 24 snapshots
export HITKEEP_BACKUP_PATH=/var/lib/hitkeep/backups
./hitkeep
# Logs: "Local backup enabled" path="/var/lib/hitkeep/backups" interval_min=60 retention=24
Terminal window
# S3 backups — every 30 minutes, keep 48 snapshots
export HITKEEP_BACKUP_PATH=s3://my-bucket/hitkeep/backups
export HITKEEP_BACKUP_INTERVAL=30
export HITKEEP_BACKUP_RETENTION=48
./hitkeep

The worker runs on the leader node only. The first backup is taken 30 seconds after startup, then at the configured interval.

Each backup is a timestamped directory containing the output of DuckDB’s EXPORT DATABASE (a schema.sql file plus Parquet data files for each table).

graph TD
    subgraph BackupPath["{backup-path}/"]
        subgraph Shared["shared/"]
            S1["2026-03-02T120000Z/\nschema.sql + *.parquet"]
            S2["2026-03-02T130000Z/\nschema.sql + *.parquet"]
        end
        subgraph Tenants["tenants/"]
            subgraph T1["{tenant-id-1}/"]
                T1S1["2026-03-02T120000Z/\nschema.sql + *.parquet"]
                T1S2["2026-03-02T130000Z/\nschema.sql + *.parquet"]
            end
            subgraph T2["{tenant-id-2}/"]
                T2S1["2026-03-02T120000Z/\nschema.sql + *.parquet"]
                T2S2["2026-03-02T130000Z/\nschema.sql + *.parquet"]
            end
        end
    end

    subgraph Sources["Live Databases"]
        MAIN[("hitkeep.db\n(shared — identity,\nconfig, default tenant)")]
        TD1[("tenants/{id-1}/hitkeep.db")]
        TD2[("tenants/{id-2}/hitkeep.db")]
    end

    MAIN -.->|EXPORT DATABASE| Shared
    TD1 -.->|EXPORT DATABASE| T1
    TD2 -.->|EXPORT DATABASE| T2

Or as a directory tree:

{backup-path}/
├── shared/
│   ├── 2026-03-02T120000Z/   ← schema.sql + *.parquet
│   └── 2026-03-02T130000Z/
└── tenants/
    └── {tenant-id}/
        ├── 2026-03-02T120000Z/
        └── 2026-03-02T130000Z/

For local backups, snapshots beyond the retention count are automatically deleted (oldest first). For S3 backups, configure S3 lifecycle policies on your bucket to manage snapshot retention.
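
Oldest-first pruning works because the timestamped directory names sort chronologically when sorted lexicographically. A sketch of the selection logic (the function name is illustrative):

```python
def snapshots_to_prune(snapshot_names, retention):
    """Given timestamped snapshot directory names, return those to delete.

    Names like 2026-03-02T120000Z sort chronologically under plain string
    ordering, so the oldest snapshots are simply the leading entries
    beyond the retention count.
    """
    ordered = sorted(snapshot_names)  # oldest first
    excess = len(ordered) - retention
    return ordered[:excess] if excess > 0 else []
```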

Use the hitkeep recover restore-backup command to import a snapshot into fresh databases. HitKeep must be stopped before running a restore (DuckDB allows only one writer at a time).

flowchart TD
    CMD["hitkeep recover restore-backup\n-from {path} [-snapshot {ts}]"] --> SRC{Source type?}

    SRC -->|Local| FIND["Find latest snapshot\n(lexicographic sort)"]
    SRC -->|S3| REQ["-snapshot required"]

    FIND --> SUMMARY
    REQ --> SUMMARY["Print summary:\nsource, target, tenants"]
    SUMMARY --> CONFIRM{"-yes flag or\nuser confirms?"}
    CONFIRM -->|No| ABORT([Aborted])
    CONFIRM -->|Yes| SHARED

    subgraph SHARED["Restore Shared DB"]
        SH1{"Existing\nhitkeep.db?"}
        SH1 -->|Yes| SH2["Rename → .pre-restore.{ts}"]
        SH1 -->|No| SH3
        SH2 --> SH3["Open fresh empty DuckDB"]
        SH3 --> SH4["IMPORT DATABASE\nfrom shared/{snapshot}/"]
    end

    SHARED --> DISC["Discover tenant backups\nunder tenants/*/"]

    subgraph TENANT["For Each Tenant"]
        T1{"Existing tenant\nhitkeep.db?"}
        T1 -->|Yes| T2["Rename → .pre-restore.{ts}"]
        T1 -->|No| T3
        T2 --> T3["Open fresh empty DuckDB"]
        T3 --> T4["IMPORT DATABASE\nfrom tenants/{id}/{snapshot}/"]
    end

    DISC --> TENANT
    TENANT --> DONE(["Restore complete\n→ start hitkeep normally"])
Terminal window
# Restore the latest local snapshot
./hitkeep recover restore-backup \
  -from /var/lib/hitkeep/backups \
  -yes
Terminal window
# Restore a specific snapshot
./hitkeep recover restore-backup \
  -from /var/lib/hitkeep/backups \
  -snapshot 2026-03-02T120000Z \
  -db /var/lib/hitkeep/data/hitkeep.db \
  -data-path /var/lib/hitkeep/data \
  -yes
Terminal window
# Restore from S3 (snapshot timestamp required)
./hitkeep recover restore-backup \
  -from s3://my-bucket/hitkeep/backups \
  -snapshot 2026-03-02T120000Z \
  -yes

The restore process:

  1. Finds the requested snapshot (or the latest one for local sources).
  2. Renames existing database files as a safety net (.pre-restore.{timestamp}).
  3. Imports the snapshot into temporary DuckDB files, checkpoints them, and only then promotes them into place.
  4. Discovers and restores any tenant databases from the backup.
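
Steps 1 and 2 rely on lexicographic ordering and a rename-based safety net. A sketch under those assumptions (the exact timestamp format of the rename suffix is not specified here and is illustrative):

```python
def latest_snapshot(names):
    """Pick the newest snapshot: timestamped names sort lexicographically."""
    if not names:
        raise ValueError("no snapshots found")
    return max(names)

def pre_restore_name(db_path, ts):
    """Safety-net rename target for an existing database file."""
    return f"{db_path}.pre-restore.{ts}"
```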

On the next normal hitkeep startup, the migration system will apply any schema changes if HitKeep has been upgraded since the backup was taken.

See the Configuration Reference for all backup settings.

For users who prefer external tooling, the complete HitKeep data footprint is:

  • Live data tree: the full data-path directory, including the shared control plane and any tenant-local databases
  • Archive: the configured archive directory (Parquet files)

A reliable backup is a periodic file copy of both:

Terminal window
# Example: nightly sync to S3-compatible storage with rclone
rclone sync /var/lib/hitkeep/data/ remote:my-bucket/hitkeep/data/
rclone sync /var/lib/hitkeep/archive/ remote:my-bucket/hitkeep/archive/
Terminal window
# Or with rsync to a remote host
rsync -az /var/lib/hitkeep/ backup-host:/backups/hitkeep/

Because HitKeep stores its live state on the local filesystem, you can also use filesystem-level snapshots (LVM, ZFS, APFS) for point-in-time consistency — but snapshot the full data-path, not just the root database file.

The following diagram shows how data moves through HitKeep — from ingestion to hot storage, through retention archiving to cold storage, and how backups and restores fit into the picture.

flowchart TB
    subgraph Ingestion
        BROWSER["Browser / Server"] -->|"HTTP POST /ingest"| HTTP["Go HTTP Server"]
        HTTP -->|publish| NSQ{{"Embedded NSQ"}}
        NSQ -->|"consume batches"| WORKER["Ingest Worker"]
    end

    WORKER -->|INSERT| HOT

    subgraph HotStorage["Hot Storage (DuckDB)"]
        HOT[("hitkeep.db\nShared DB")]
        TENANT_DB[("tenants/{id}/hitkeep.db\nPer-tenant DBs")]
    end

    subgraph Workers["Background Workers (Leader only)"]
        direction LR
        RETENTION["Retention Worker\n(daily)"]
        BACKUP["Backup Worker\n(configurable interval)"]
        ROLLUP["Rollup Worker"]
    end

    HOT --- Workers

    subgraph ColdStorage["Cold Storage"]
        ARCHIVE[("Archive\nParquet per-site\nretention exports")]
        SNAPSHOTS[("Backups\nParquet snapshots\nfull DB exports")]
    end

    RETENTION -->|"COPY ... TO\n(FORMAT PARQUET)"| ARCHIVE
    RETENTION -->|"DELETE\npruned rows"| HOT

    BACKUP -->|"EXPORT DATABASE\nshared + all tenants"| SNAPSHOTS

    subgraph Destinations["Storage Destinations"]
        LOCAL["Local Filesystem"]
        S3["S3 / MinIO / R2"]
    end

    ARCHIVE --> LOCAL
    ARCHIVE --> S3
    SNAPSHOTS --> LOCAL
    SNAPSHOTS --> S3

    subgraph Recovery["Disaster Recovery"]
        RESTORE["hitkeep recover\nrestore-backup"]
    end

    SNAPSHOTS -.->|"IMPORT DATABASE"| RESTORE
    RESTORE -.->|"recreate"| HOT
    RESTORE -.->|"recreate"| TENANT_DB

The key insight: retention archives and database backups serve different purposes. Retention exports are per-site, incremental Parquet files for long-term analytical querying. Database backups are full point-in-time snapshots for disaster recovery. Both can target local disk or S3.

HitKeep Cloud manages retention policies, automated Parquet archiving, and encrypted off-site backups automatically — in your sovereign region (EU Frankfurt or US Virginia). Start with HitKeep Cloud →