
Data Retention and Archiving

You decide how long your analytics data lives — not a cloud vendor’s pricing tier. HitKeep’s retention system follows one rule: data you choose to prune is archived to Parquet first, in an open format you own, before it’s removed from the live database.

Terminal window
# Keep raw hits and events for 365 days; archive older data to /var/lib/hitkeep/archive
export HITKEEP_DATA_RETENTION_DAYS=365
export HITKEEP_ARCHIVE_PATH=/var/lib/hitkeep/archive
./hitkeep

Or as startup flags:

Terminal window
./hitkeep -retention-days=365 -archive-path=/var/lib/hitkeep/archive

See the Configuration Reference for all options.

The retention worker runs once daily. For each site with a configured retention policy, it will:

  1. Count hits and events older than the retention threshold.
  2. Export those rows to a compressed Parquet file in the archive directory — before touching the live database.
  3. Prune the archived records from hitkeep.db to reclaim disk space.
  4. Leave rollups intact. Aggregated hourly, daily, and monthly rollups are never pruned — they power the trend charts in the dashboard indefinitely.
flowchart TD
    subgraph RetentionWorker["Retention Worker (daily)"]
        R1[Load site retention policies] --> R2{Hits/events older\nthan cutoff?}
        R2 -->|No| R3[Skip site]
        R2 -->|Yes| R4["COPY ... TO (FORMAT PARQUET)"]
        R4 --> R5[DELETE archived rows]
        R5 --> R6[Rollups untouched]
    end

    subgraph Storage
        HOT[(DuckDB — hitkeep.db\nHot tier)]
        COLD[("Archive directory\nCold tier — Parquet")]
    end

    R1 -.->|query| HOT
    R4 -->|export| COLD
    R5 -->|prune| HOT
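
The worker's decision logic boils down to a cutoff comparison: rows older than the retention window are archived first, then pruned. A minimal sketch in Python (the row shape and function name are illustrative, not HitKeep internals):

```python
from datetime import datetime, timedelta, timezone

def partition_for_retention(rows, retention_days, now=None):
    """Split rows into (keep, archive) around a retention cutoff.

    Each row is a dict with an aware 'timestamp'. Rows older than the
    cutoff are the ones the worker exports to Parquet *before* deleting
    them from the live database -- the archive-then-prune ordering.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    keep = [r for r in rows if r["timestamp"] >= cutoff]
    archive = [r for r in rows if r["timestamp"] < cutoff]
    return keep, archive
```

If `archive` comes back empty, the worker skips the site entirely, matching the "Skip site" branch in the diagram above.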

The result is two data tiers:

| Tier | Location          | What's there                                         | Query speed      |
|------|-------------------|------------------------------------------------------|------------------|
| Hot  | hitkeep.db        | Recent raw hits & events within the retention window | Instant          |
| Cold | Archive directory | Older raw hits & events exported to Parquet          | Fast (file scan) |

Dashboard trend views always show complete historical data because rollups remain in the hot database regardless of how the raw-data retention is configured.

Different sites have different requirements. Override the default retention window per site via the API:

Terminal window
curl -X PUT https://your-hitkeep.example/api/sites/{site_id}/retention \
  -H "Content-Type: application/json" \
  -b "hk_token=YOUR_SESSION_COOKIE" \
  -d '{"days": 90}'

A high-traffic site may need only 90 days of raw data. A site subject to statutory record-keeping requirements may need seven years. You set the policy; HitKeep enforces it.
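
The same call can be built with Python's standard library, shown here without sending the request (the endpoint path and cookie name come from the curl example above; the helper itself is illustrative):

```python
import json
import urllib.request

def build_retention_request(base_url, site_id, days, session_cookie):
    """Build the PUT request that sets a site's retention window.

    Mirrors the curl example: JSON body {"days": N}, session-cookie auth.
    """
    url = f"{base_url}/api/sites/{site_id}/retention"
    body = json.dumps({"days": days}).encode()
    req = urllib.request.Request(url, data=body, method="PUT")
    req.add_header("Content-Type", "application/json")
    req.add_header("Cookie", f"hk_token={session_cookie}")
    return req

# Not sent here -- urllib.request.urlopen(req) would perform the call.
req = build_retention_request(
    "https://your-hitkeep.example", "site-123", 90, "YOUR_SESSION_COOKIE"
)
```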

The retention worker and backup worker can write to local disk or any S3-compatible object store. Both share the same S3 credential configuration.

flowchart LR
    subgraph HitKeep["HitKeep (Leader Node)"]
        RW[Retention Worker]
        BW[Backup Worker]
    end

    subgraph Local["Local Filesystem"]
        LA["/var/lib/hitkeep/archive/"]
        LB["/var/lib/hitkeep/backups/"]
    end

    subgraph S3["S3-Compatible Storage"]
        SA["s3://bucket/archive/"]
        SB["s3://bucket/backups/"]
    end

    RW -->|"HITKEEP_ARCHIVE_PATH\n(local)"| LA
    RW -->|"HITKEEP_ARCHIVE_PATH\n(s3://)"| SA
    BW -->|"HITKEEP_BACKUP_PATH\n(local)"| LB
    BW -->|"HITKEEP_BACKUP_PATH\n(s3://)"| SB

Instead of writing Parquet files to a local directory, HitKeep can archive directly to any S3-compatible object store. Set HITKEEP_ARCHIVE_PATH to an s3:// URL and configure credentials.

Terminal window
export HITKEEP_ARCHIVE_PATH=s3://my-analytics-bucket/hitkeep/archive
export HITKEEP_S3_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export HITKEEP_S3_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
export HITKEEP_S3_REGION=eu-west-1
./hitkeep

On EC2, ECS, or Lambda with an attached IAM role, no explicit keys are needed. HitKeep falls back to the AWS SDK default credential chain automatically.

Terminal window
export HITKEEP_ARCHIVE_PATH=s3://my-analytics-bucket/hitkeep/archive
export HITKEEP_S3_REGION=eu-west-1
./hitkeep
# Logs: "S3 archive enabled" mode="credential chain" region="eu-west-1"
Terminal window
# Self-hosted MinIO
export HITKEEP_ARCHIVE_PATH=s3://hitkeep-archive/data
export HITKEEP_S3_ACCESS_KEY_ID=minioadmin
export HITKEEP_S3_SECRET_ACCESS_KEY=minioadmin
export HITKEEP_S3_ENDPOINT=localhost:9000
export HITKEEP_S3_URL_STYLE=path
export HITKEEP_S3_USE_SSL=false
export HITKEEP_S3_REGION=us-east-1
./hitkeep
Terminal window
# Cloudflare R2
export HITKEEP_ARCHIVE_PATH=s3://my-r2-bucket/hitkeep/archive
export HITKEEP_S3_ACCESS_KEY_ID=your-r2-access-key
export HITKEEP_S3_SECRET_ACCESS_KEY=your-r2-secret-key
export HITKEEP_S3_ENDPOINT=your-account-id.r2.cloudflarestorage.com
export HITKEEP_S3_REGION=auto
./hitkeep

See the Configuration Reference for the full list of S3 settings.

Archived Parquet files are standard open-format files queryable with any compatible tool — no HitKeep license required.

flowchart LR
    subgraph Query["DuckDB CLI or any Parquet tool"]
        Q["SELECT ... FROM"]
    end

    subgraph Hot["Hot Tier"]
        DB[("hitkeep.db\nhits / events")]
    end

    subgraph Cold["Cold Tier"]
        PQ[("archive/\nsite_*.parquet")]
    end

    Q -->|"ATTACH 'hitkeep.db'"| DB
    Q -->|"read_parquet('...')"| PQ
    DB -->|"UNION ALL"| RESULT["Combined\nresult set"]
    PQ -->|"UNION ALL"| RESULT
Terminal window
# DuckDB CLI — count page views per month from the archive
duckdb -c "
SELECT date_trunc('month', timestamp) AS month, count(*) AS hits
FROM read_parquet('/var/lib/hitkeep/archive/site_*.parquet')
GROUP BY 1 ORDER BY 1;
"
Terminal window
# Merge hot and cold data in a single query
duckdb -c "
  ATTACH 'hitkeep.db' AS hot;
  SELECT timestamp::date AS day, count(*) AS hits
  FROM (
    SELECT timestamp FROM hot.hits WHERE site_id = 'your-site-id'
    UNION ALL
    SELECT timestamp FROM read_parquet('/var/lib/hitkeep/archive/site_your-site-id_*.parquet')
  )
  GROUP BY 1 ORDER BY 1;
"

The archive naming convention is site_{site_id}_{unix_timestamp}.parquet. Each archival run writes one file per site that had data past the cutoff.
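
A small sketch of parsing that convention (the helper is hypothetical; only the naming pattern comes from the text above):

```python
import re

def parse_archive_name(filename):
    """Parse site_{site_id}_{unix_timestamp}.parquet into its parts.

    The timestamp is the trailing digit run, so site IDs that themselves
    contain underscores still parse correctly via regex backtracking.
    """
    m = re.fullmatch(r"site_(.+)_(\d+)\.parquet", filename)
    if not m:
        raise ValueError(f"not an archive file: {filename}")
    return m.group(1), int(m.group(2))
```

This is handy for grouping archive files per site before querying them, or for finding the most recent export for a given site.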

HitKeep includes a built-in backup worker that periodically exports all live databases to Parquet snapshots using DuckDB’s EXPORT DATABASE. This covers both the shared hitkeep.db and any per-tenant databases.

For a dedicated operational guide, see Backups and Restore and S3 Backups.

flowchart TD
    START([HitKeep starts]) --> CHECK{HITKEEP_BACKUP_PATH\nset?}
    CHECK -->|No| DISABLED[Backups disabled — no-op]
    CHECK -->|Yes| INIT["Wait 30 seconds\n(let DB settle)"]
    INIT --> RUN

    subgraph RUN["Run Backup Cycle"]
        direction TB
        S3{S3 path?}
        S3 -->|Yes| HTTPFS["Load httpfs +\nconfigure S3 secret"]
        S3 -->|No| CHECKPOINT
        HTTPFS --> CHECKPOINT["CHECKPOINT shared DB"]
        CHECKPOINT --> EXPORT_SHARED["EXPORT DATABASE\nshared → {path}/shared/{timestamp}/"]
        EXPORT_SHARED --> TENANTS["List non-default tenant IDs"]
        TENANTS --> LOOP["For each tenant:\nCHECKPOINT + EXPORT DATABASE\n→ {path}/tenants/{id}/{timestamp}/"]
        LOOP --> PRUNE{"Local path?"}
        PRUNE -->|Yes| PRUNE_LOCAL["Delete oldest snapshots\nbeyond retention count"]
        PRUNE -->|No| PRUNE_S3["Log: use S3 lifecycle policies"]
    end

    RUN --> WAIT["Sleep backup-interval minutes"]
    WAIT --> RUN
Terminal window
# Local backups — defaults: every 60 minutes, keep 24 snapshots
export HITKEEP_BACKUP_PATH=/var/lib/hitkeep/backups
./hitkeep
# Logs: "Local backup enabled" path="/var/lib/hitkeep/backups" interval_min=60 retention=24
Terminal window
# S3 backups — every 30 minutes, keep 48 snapshots
export HITKEEP_BACKUP_PATH=s3://my-bucket/hitkeep/backups
export HITKEEP_BACKUP_INTERVAL=30
export HITKEEP_BACKUP_RETENTION=48
./hitkeep

The worker runs on the leader node only. The first backup is taken 30 seconds after startup, then at the configured interval.

Each backup is a timestamped directory containing the output of DuckDB’s EXPORT DATABASE (a schema.sql file plus Parquet data files for each table).

graph TD
    subgraph BackupPath["{backup-path}/"]
        subgraph Shared["shared/"]
            S1["2026-03-02T120000Z/\nschema.sql + *.parquet"]
            S2["2026-03-02T130000Z/\nschema.sql + *.parquet"]
        end
        subgraph Tenants["tenants/"]
            subgraph T1["{tenant-id-1}/"]
                T1S1["2026-03-02T120000Z/\nschema.sql + *.parquet"]
                T1S2["2026-03-02T130000Z/\nschema.sql + *.parquet"]
            end
            subgraph T2["{tenant-id-2}/"]
                T2S1["2026-03-02T120000Z/\nschema.sql + *.parquet"]
                T2S2["2026-03-02T130000Z/\nschema.sql + *.parquet"]
            end
        end
    end

    subgraph Sources["Live Databases"]
        MAIN[("hitkeep.db\n(shared — identity,\nconfig, default tenant)")]
        TD1[("tenants/{id-1}/hitkeep.db")]
        TD2[("tenants/{id-2}/hitkeep.db")]
    end

    MAIN -.->|EXPORT DATABASE| Shared
    TD1 -.->|EXPORT DATABASE| T1
    TD2 -.->|EXPORT DATABASE| T2

Or as a directory tree:

{backup-path}/
├── shared/
│   ├── 2026-03-02T120000Z/   ← schema.sql + *.parquet
│   └── 2026-03-02T130000Z/
└── tenants/
    └── {tenant-id}/
        ├── 2026-03-02T120000Z/
        └── 2026-03-02T130000Z/

For local backups, snapshots beyond the retention count are automatically deleted (oldest first). For S3 backups, configure S3 lifecycle policies on your bucket to manage snapshot retention.
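
Oldest-first pruning works because the timestamped directory names sort chronologically when sorted lexicographically. A sketch of the selection logic (the function name is illustrative):

```python
def snapshots_to_prune(snapshot_names, retention):
    """Given timestamped snapshot directory names, return those to delete.

    Names like 2026-03-02T120000Z sort chronologically under plain string
    ordering, so the oldest snapshots are simply the leading entries
    beyond the retention count.
    """
    ordered = sorted(snapshot_names)  # oldest first
    excess = len(ordered) - retention
    return ordered[:excess] if excess > 0 else []
```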

Use the hitkeep recover restore-backup command to import a snapshot into fresh databases. HitKeep must be stopped before running a restore (DuckDB allows only one writer at a time).

flowchart TD
    CMD["hitkeep recover restore-backup\n-from {path} [-snapshot {ts}]"] --> SRC{Source type?}

    SRC -->|Local| FIND["Find latest snapshot\n(lexicographic sort)"]
    SRC -->|S3| REQ["-snapshot required"]

    FIND --> SUMMARY
    REQ --> SUMMARY["Print summary:\nsource, target, tenants"]
    SUMMARY --> CONFIRM{"-yes flag or\nuser confirms?"}
    CONFIRM -->|No| ABORT([Aborted])
    CONFIRM -->|Yes| SHARED

    subgraph SHARED["Restore Shared DB"]
        SH1{"Existing\nhitkeep.db?"}
        SH1 -->|Yes| SH2["Rename → .pre-restore.{ts}"]
        SH1 -->|No| SH3
        SH2 --> SH3["Open fresh empty DuckDB"]
        SH3 --> SH4["IMPORT DATABASE\nfrom shared/{snapshot}/"]
    end

    SHARED --> DISC["Discover tenant backups\nunder tenants/*/"]

    subgraph TENANT["For Each Tenant"]
        T1{"Existing tenant\nhitkeep.db?"}
        T1 -->|Yes| T2["Rename → .pre-restore.{ts}"]
        T1 -->|No| T3
        T2 --> T3["Open fresh empty DuckDB"]
        T3 --> T4["IMPORT DATABASE\nfrom tenants/{id}/{snapshot}/"]
    end

    DISC --> TENANT
    TENANT --> DONE(["Restore complete\n→ start hitkeep normally"])
Terminal window
# Restore the latest local snapshot
./hitkeep recover restore-backup \
  -from /var/lib/hitkeep/backups \
  -yes
Terminal window
# Restore a specific snapshot
./hitkeep recover restore-backup \
  -from /var/lib/hitkeep/backups \
  -snapshot 2026-03-02T120000Z \
  -db /var/lib/hitkeep/data/hitkeep.db \
  -data-path /var/lib/hitkeep/data \
  -yes
Terminal window
# Restore from S3 (snapshot timestamp required)
./hitkeep recover restore-backup \
  -from s3://my-bucket/hitkeep/backups \
  -snapshot 2026-03-02T120000Z \
  -yes

The restore process:

  1. Finds the requested snapshot (or the latest one for local sources).
  2. Renames existing database files as a safety net (.pre-restore.{timestamp}).
  3. Imports the snapshot into temporary DuckDB files, checkpoints them, and only then promotes them into place.
  4. Discovers and restores any tenant databases from the backup.
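
Steps 1 and 2 rely on lexicographic ordering and a rename-based safety net. A sketch under those assumptions (the exact timestamp format of the rename suffix is not specified here and is illustrative):

```python
def latest_snapshot(names):
    """Pick the newest snapshot: timestamped names sort lexicographically."""
    if not names:
        raise ValueError("no snapshots found")
    return max(names)

def pre_restore_name(db_path, ts):
    """Safety-net rename target for an existing database file."""
    return f"{db_path}.pre-restore.{ts}"
```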

On the next normal hitkeep startup, the migration system will apply any schema changes if HitKeep has been upgraded since the backup was taken.

See the Configuration Reference for all backup settings.

For users who prefer external tooling, the complete HitKeep data footprint is:

  • Live data tree: the full data-path directory, including the shared control plane and any tenant-local databases
  • Archive: the configured archive directory (Parquet files)

A reliable backup is a periodic file copy of both:

Terminal window
# Example: nightly sync to S3-compatible storage with rclone
rclone sync /var/lib/hitkeep/data/ remote:my-bucket/hitkeep/data/
rclone sync /var/lib/hitkeep/archive/ remote:my-bucket/hitkeep/archive/
Terminal window
# Or with rsync to a remote host
rsync -az /var/lib/hitkeep/ backup-host:/backups/hitkeep/

Because HitKeep stores its live state on the local filesystem, you can also use filesystem-level snapshots (LVM, ZFS, APFS) for point-in-time consistency — but snapshot the full data-path, not just the root database file.

The following diagram shows how data moves through HitKeep — from ingestion to hot storage, through retention archiving to cold storage, and how backups and restores fit into the picture.

flowchart TB
    subgraph Ingestion
        BROWSER["Browser / Server"] -->|"HTTP POST /ingest"| HTTP["Go HTTP Server"]
        HTTP -->|publish| NSQ{{"Embedded NSQ"}}
        NSQ -->|"consume batches"| WORKER["Ingest Worker"]
    end

    WORKER -->|INSERT| HOT

    subgraph HotStorage["Hot Storage (DuckDB)"]
        HOT[("hitkeep.db\nShared DB")]
        TENANT_DB[("tenants/{id}/hitkeep.db\nPer-tenant DBs")]
    end

    subgraph Workers["Background Workers (Leader only)"]
        direction LR
        RETENTION["Retention Worker\n(daily)"]
        BACKUP["Backup Worker\n(configurable interval)"]
        ROLLUP["Rollup Worker"]
    end

    HOT --- Workers

    subgraph ColdStorage["Cold Storage"]
        ARCHIVE[("Archive\nParquet per-site\nretention exports")]
        SNAPSHOTS[("Backups\nParquet snapshots\nfull DB exports")]
    end

    RETENTION -->|"COPY ... TO\n(FORMAT PARQUET)"| ARCHIVE
    RETENTION -->|"DELETE\npruned rows"| HOT

    BACKUP -->|"EXPORT DATABASE\nshared + all tenants"| SNAPSHOTS

    subgraph Destinations["Storage Destinations"]
        LOCAL["Local Filesystem"]
        S3["S3 / MinIO / R2"]
    end

    ARCHIVE --> LOCAL
    ARCHIVE --> S3
    SNAPSHOTS --> LOCAL
    SNAPSHOTS --> S3

    subgraph Recovery["Disaster Recovery"]
        RESTORE["hitkeep recover\nrestore-backup"]
    end

    SNAPSHOTS -.->|"IMPORT DATABASE"| RESTORE
    RESTORE -.->|"recreate"| HOT
    RESTORE -.->|"recreate"| TENANT_DB

The key insight: retention archives and database backups serve different purposes. Retention exports are per-site, incremental Parquet files for long-term analytical querying. Database backups are full point-in-time snapshots for disaster recovery. Both can target local disk or S3.

HitKeep Cloud manages retention policies, automated Parquet archiving, and encrypted off-site backups automatically — in your sovereign region (EU Frankfurt or US Virginia). Start with HitKeep Cloud →