From ca5329a369caca74541967cd194b4e865329fb10 Mon Sep 17 00:00:00 2001 From: Nick Woolmer <29717167+nwoolmer@users.noreply.github.com> Date: Mon, 23 Feb 2026 17:35:24 +0000 Subject: [PATCH 1/3] docs: add NFS replication transport guidance and tuning for GCP Expand GCP deployment docs with Filestore, NetApp Volumes, and GKE PersistentVolume guidance for NFS-based replication. Update replication tuning docs with sub-200ms config profiles and replica poll interval setting. Co-Authored-By: Claude Opus 4.6 --- documentation/deployment/gcp.md | 59 ++++++++++++++++++++--- documentation/high-availability/tuning.md | 44 +++++++++++++++-- 2 files changed, 94 insertions(+), 9 deletions(-) diff --git a/documentation/deployment/gcp.md b/documentation/deployment/gcp.md index 7e3deb987..be59034a8 100644 --- a/documentation/deployment/gcp.md +++ b/documentation/deployment/gcp.md @@ -27,9 +27,14 @@ We recommend starting with `C-Series` instances, and reviewing other instance ty You should deploy using an `x86_64` Linux distribution, such as Ubuntu. -For storage, we recommend using [Hyperdisk Balanced](https://cloud.google.com/compute/docs/disks/hyperdisks) disks, +For storage, we recommend using [Hyperdisk Balanced](https://cloud.google.com/compute/docs/disks/hyperdisks) disks, and provisioning them at `5000 IOPS/300 MBps` until you have tested your workload. +:::warning +Hyperdisk Balanced is not supported on all machine types. N2 instances do not +support Hyperdisk. Use N4, C3, or C4 series instances with Hyperdisk Balanced. +::: + `Hyperdisk Extreme` generally requires much higher `vCPU` counts - for example, it cannot be used on `C3` machines smaller than `88 vCPUs`. @@ -38,19 +43,61 @@ smaller than `88 vCPUs`. ### Google Filestore -Google Filestore is a `NAS` solution offering an `NFS` API to talk to arbitrary volumes. +Google Filestore is a managed NFS service that can be used as a replication +transport layer in QuestDB Enterprise. + +Filestore should **not** be used as primary storage for QuestDB. However, it +is well-suited for replication when low latency is required. The `fs::` +transport over NFS provides sub-200ms replication lag with +[aggressive tuning](/docs/high-availability/tuning/), compared to ~1s+ with +object store transport (GCS). + +To use Filestore for replication: + +1. Create a Filestore instance in the same region as your QuestDB VMs +2. Mount the NFS share on both primary and replica nodes +3. Configure the `fs::` transport in `server.conf`: + +```ini +replication.object.store=fs::root=/mnt/questdb-repl/final;atomic_write_dir=/mnt/questdb-repl/scratch; +``` + +Use the [backup](/docs/operations/backup/) feature to manage WAL file retention +on the NFS mount. -This should **not** be used as primary storage for QuestDB. It could be used for replication in QuestDB Enterprise, -but `Google Cloud Storage` is likely simpler and cheaper to use. +On GKE, expose the Filestore share as a `PersistentVolume` with +`ReadWriteMany` access mode using the +[Filestore CSI driver](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/filestore-csi-driver), +so both primary and replica pods can mount it simultaneously. + +:::note +Filestore Zonal and Basic SSD tiers may require a +[quota increase](https://cloud.google.com/docs/quotas/view-manage) before use. +Basic HDD is typically available by default. +::: ### Google Cloud Storage -QuestDB supports `Google Cloud Storage` as its replication object-store in the Enterprise edition. 
+QuestDB supports Google Cloud Storage as its replication object store in the +Enterprise edition. GCS is the simplest and cheapest replication transport, but +has higher latency (~1s+) due to object store API overhead. -To get started, create a bucket for the database to use. Then follow the +To get started, create a bucket for the database to use. Then follow the [Enterprise Quick Start](/docs/getting-started/enterprise-quick-start/) steps to create a connection string and configure QuestDB. +### NetApp Volumes + +[NetApp Volumes](https://cloud.google.com/netapp/volumes/docs/discover/overview) +is a managed NFS service on GCP backed by NetApp ONTAP. Like Filestore, it can +be used as a low-latency replication transport via the `fs::` prefix. The +QuestDB configuration is identical to Filestore. + +:::note +NetApp Volumes requires enabling the `netapp.googleapis.com` API and may +require separate quota allocation. +::: + ### Minimum specification - **Instance**: `c3-standard-4` or `c3d-standard-4` `(4 vCPUs, 16 GB RAM)` diff --git a/documentation/high-availability/tuning.md b/documentation/high-availability/tuning.md index 3403c064c..a4d7badc3 100644 --- a/documentation/high-availability/tuning.md +++ b/documentation/high-availability/tuning.md @@ -25,15 +25,31 @@ reduced network traffic depending on your needs. ## Quick reference -**For low latency**: +**For low latency (sub-200ms)**: ```ini +# Primary cairo.wal.segment.rollover.size=262144 -replication.primary.throttle.window.duration=1000 +replication.primary.throttle.window.duration=50 replication.primary.sequencer.part.txn.count=5000 + +# Replica +replication.replica.poll.interval=50 +``` + +**For low latency (sub-500ms)**: +```ini +# Primary +cairo.wal.segment.rollover.size=524288 +replication.primary.throttle.window.duration=100 +replication.primary.sequencer.part.txn.count=5000 + +# Replica +replication.replica.poll.interval=100 ``` **For network efficiency**: ```ini +# Primary cairo.wal.segment.rollover.size=2097152 replication.primary.throttle.window.duration=60000 replication.primary.sequencer.part.txn.count=1000 @@ -99,7 +115,9 @@ fill up before upload, reducing redundant uploads (write amplification). | Value | Behavior | |-------|----------| -| `1000` (1s) | Lowest latency, most uploads. | +| `50` (50ms) | Ultra-low latency. Best with NFS transport. | +| `100` (100ms) | Low latency. Good balance for NFS transport. | +| `1000` (1s) | Low latency for object store transport. | | `10000` (10s) | Default. Balanced. | | `60000` (60s) | 1 minute delay OK. Fewer uploads. | | `300000` (5 min) | Cost-sensitive. Batches more data. | @@ -107,6 +125,26 @@ fill up before upload, reducing redundant uploads (write amplification). This is your **maximum replication latency tolerance**. QuestDB still actively manages replication to prevent backlogs during bursts. +### Replica poll interval + +```ini +replication.replica.poll.interval=1000 # 1 second (default) +``` + +How often the replica checks the transport layer for new data. This setting +is configured on the **replica** node. + +| Value | Behavior | +|-------|----------| +| `50` (50ms) | Ultra-low latency. Pair with aggressive primary settings. | +| `100` (100ms) | Low latency. Good for NFS transport. | +| `1000` (1s) | Default. Balanced. | + +:::note +Reducing the poll interval below the throttle window duration has diminishing +returns, since the replica cannot consume data faster than the primary produces it. 
+::: + ### Sequencer part size ```ini From 623a2f1ba643a6c3577236a928bd9e1985fd78af Mon Sep 17 00:00:00 2001 From: Nick Woolmer <29717167+nwoolmer@users.noreply.github.com> Date: Tue, 24 Feb 2026 19:01:42 +0000 Subject: [PATCH 2/3] docs: revamp replication tuning page structure and pricing guidance Restructure the page so the most actionable content is at the top: - Add "three settings that matter" summary table - Add copy-paste configuration profiles (sub-200ms, sub-500ms, default, network efficiency) - Add transport cost vs latency section with cloud-agnostic breakeven formula - Move detailed settings docs to a reference section lower on the page - Add advanced settings table for power-user knobs - Add GCS-specific latency floor note - Fix screenshot titles to match "balanced" profile naming - Fix compression section to cross-link configurable settings Co-Authored-By: Claude Opus 4.6 --- documentation/high-availability/tuning.md | 196 ++++++++++++++++------ 1 file changed, 146 insertions(+), 50 deletions(-) diff --git a/documentation/high-availability/tuning.md b/documentation/high-availability/tuning.md index a4d7badc3..23c90fcff 100644 --- a/documentation/high-availability/tuning.md +++ b/documentation/high-availability/tuning.md @@ -12,56 +12,159 @@ import { EnterpriseNote } from "@site/src/components/EnterpriseNote" Tune replication for lower latency or reduced network costs. -Replication tuning lets you balance **latency** against **network costs**. By -default, QuestDB uses balanced settings. You can tune for lower latency or -reduced network traffic depending on your needs. +Three settings control replication latency. The main decision is your transport +layer — **object store** (S3, GCS, Azure Blob) is simplest and cheapest at rest, +while **NFS** (EFS, Filestore, Azure Files, NetApp) removes per-operation costs +and unlocks sub-second latency. Pick a transport, choose a profile below, and +restart. -## When to tune +## The three settings that matter -| Goal | Approach | -|------|----------| -| **Low latency** | Smaller WAL segments, shorter throttle windows. | -| **Lower network costs** | Larger WAL segments, longer throttle windows. | +| Setting | Node | Default | What it does | +|---------|------|---------|-------------| +| `replication.primary.throttle.window.duration` | Primary | `10000` (10s) | Maximum time before an incomplete WAL segment is flushed | +| `replication.replica.poll.interval` | Replica | `1000` (1s) | How often the replica checks for new data | +| `cairo.wal.segment.rollover.size` | Primary | `2097152` (2 MiB) | Max WAL segment size before rollover | -## Quick reference +A segment is uploaded when **either** the size limit or the throttle window is +reached, whichever comes first. Under heavy write load, segments fill and flush +well before the throttle window expires. Under light load, the throttle window +controls when the partially-filled segment is flushed. 
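+
+To see which trigger dominates for your workload, here is a back-of-the-envelope
+sketch (the `flush_cadence_ms` helper and the example ingest rates are
+illustrative assumptions, not part of QuestDB):
+
+```python
+# Approximate interval between segment flushes for one table, assuming a
+# steady, uncompressed WAL byte rate. Real ingestion is bursty, so treat
+# this as an order-of-magnitude guide only.
+def flush_cadence_ms(bytes_per_sec: float,
+                     rollover_bytes: int = 2 * 1024 * 1024,  # default 2 MiB
+                     throttle_window_ms: int = 10_000) -> float:
+    ms_to_fill = rollover_bytes / bytes_per_sec * 1000
+    # Whichever limit is hit first triggers the flush.
+    return min(ms_to_fill, throttle_window_ms)
+
+print(flush_cadence_ms(5_000_000))  # ~419 ms: a 5 MB/s feed is size-triggered
+print(flush_cadence_ms(10_000))     # 10000 ms: a 10 KB/s trickle waits for the window
+```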
+ +## Configuration profiles + +### Sub-200ms latency (NFS transport) -**For low latency (sub-200ms)**: ```ini # Primary cairo.wal.segment.rollover.size=262144 replication.primary.throttle.window.duration=50 -replication.primary.sequencer.part.txn.count=5000 # Replica replication.replica.poll.interval=50 ``` -**For low latency (sub-500ms)**: +### Sub-500ms latency (NFS or object store) + ```ini # Primary cairo.wal.segment.rollover.size=524288 replication.primary.throttle.window.duration=100 -replication.primary.sequencer.part.txn.count=5000 # Replica replication.replica.poll.interval=100 ``` -**For network efficiency**: +### Default / balanced + +No configuration needed. The defaults are: + +- `replication.primary.throttle.window.duration=10000` (10s) +- `replication.replica.poll.interval=1000` (1s) +- `cairo.wal.segment.rollover.size=2097152` (2 MiB) + +### Network efficiency + ```ini # Primary cairo.wal.segment.rollover.size=2097152 replication.primary.throttle.window.duration=60000 -replication.primary.sequencer.part.txn.count=1000 ``` +## Choosing a transport: cost vs latency + +{/* Pricing sources — verify periodically against your cloud provider: + GCS: https://cloud.google.com/storage/pricing + Filestore: https://cloud.google.com/filestore/pricing + NetApp (GCP): https://cloud.google.com/netapp/volumes/pricing + AWS S3: https://aws.amazon.com/s3/pricing/ + AWS EFS: https://aws.amazon.com/efs/pricing/ + Azure Blob: https://azure.microsoft.com/en-us/pricing/details/storage/blobs/ + Azure Files: https://azure.microsoft.com/en-us/pricing/details/storage/files/ + Azure NetApp: https://azure.microsoft.com/en-us/pricing/details/netapp/ +*/} + +### Object store (S3, GCS, Azure Blob) + +- **Per-request pricing**: every WAL upload is a write op, every replica poll is + a read op +- Lower latency settings = more ops = higher cost +- Best for: simplest setup, low storage cost, moderate latency tolerance +- Storage cost: ~$20/TB/month across major clouds + +:::note[GCP users] +Replication over GCS has a latency floor of roughly 1 second. If you need +sub-second replication on GCP, use an NFS transport such as Filestore or +NetApp Volumes instead. +::: + +### NFS / managed file storage (EFS, Filestore, Azure Files, NetApp) + +- **Fixed monthly cost** regardless of how aggressively you tune +- No per-operation charges — poll every 50ms at no extra cost +- Best for: low-latency requirements, high-throughput ingestion +- Storage cost: ~$60–300/TB/month depending on service tier and provider +- NFS is usually priced by provisioned capacity, not usage — you pay for the + full volume whether it's 10% or 100% full + +### The cost tradeoff + +The storage cost gap (object store at ~$20/TB vs NFS at $60–300/TB) looks large, +but the replication working set — WAL files in transit — is typically well under +1 TB. At that scale the per-TB premium is modest in absolute terms. + +The real cost difference is **operations**. With object store, every flush and +every poll is a billable request. Each actively-written table generates one write +op per throttle window and one read op per poll interval. Across major clouds, +write ops typically cost ~$5/million and read ops ~$0.40/million. + +**Object store ops cost per active table:** + +| Throttle / poll interval | Ops cost per table per month | +|---|---| +| 50ms / 50ms | ~$280 | +| 100ms / 100ms | ~$140 | +| 1s / 1s | ~$14 | +| 10s / 1s (default) | ~$2 | + +Multiply by the number of tables being actively written to. 
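+
+The arithmetic behind these figures is easy to check. A minimal sketch (the
+`ops_cost_per_table` helper and its default per-million prices are assumptions
+taken from the rough cross-cloud figures above, not any provider's actual
+price sheet):
+
+```python
+# Rough object-store API cost per actively-written table per month:
+# one segment upload per throttle window, one poll per poll interval.
+MS_PER_MONTH = 30 * 24 * 3600 * 1000  # ~2.59e9 ms
+
+def ops_cost_per_table(throttle_ms: int, poll_ms: int,
+                       write_usd_per_million: float = 5.00,
+                       read_usd_per_million: float = 0.40) -> float:
+    writes = MS_PER_MONTH / throttle_ms  # uploads by the primary
+    reads = MS_PER_MONTH / poll_ms       # polls by the replica
+    return (writes * write_usd_per_million + reads * read_usd_per_million) / 1e6
+
+print(round(ops_cost_per_table(50, 50)))         # ~280 USD
+print(round(ops_cost_per_table(100, 100)))       # ~140 USD
+print(round(ops_cost_per_table(10_000, 1_000)))  # ~2 USD (the defaults)
+```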
+With 10 tables at 100ms intervals, that's ~$1,400/month in API charges alone.
+With NFS, that same configuration costs nothing extra.
+
+The rough breakeven:
+
+> **ops cost per month** ≈ active tables × $14,000 / interval_ms
+>
+> (assuming the throttle window and poll interval are both set to `interval_ms`)
+>
+> If that exceeds the NFS premium over object storage (typically $40–180/TB/mo
+> × your working set in TB), **NFS is cheaper**.
+
+At default settings with a handful of tables, object store wins easily. Once you
+push below ~200ms intervals or have many actively-written tables, NFS pays for
+itself on API savings alone — and you get lower latency as a bonus.
+
+:::note
+For long-term data retention (cold/archive tier), object storage is always
+significantly cheaper and should be used regardless of your replication
+transport choice.
+:::
+
+### Summary
+
+| | Object store | NFS / file storage |
+|---|---|---|
+| Pricing model | Per-request + per-GB stored | Fixed monthly (provisioned) |
+| Storage cost | ~$20/TB/mo | ~$60–300/TB/mo |
+| Cost of aggressive tuning | Higher (more ops) | No change |
+| Setup complexity | Low | Medium (mount on all nodes) |
+| Best for | Default settings, few tables | Sub-second latency, many tables |
+
 ## How replication works
 
 Understanding the data flow helps you tune effectively:
 
-1. **Ingestion** - Data is written to Write-Ahead Log (WAL) segments
-2. **Upload** - WAL segments are uploaded to object storage
-3. **Download** - Replicas download and apply WAL segments
+1. **Ingestion** — Data is written to Write-Ahead Log (WAL) segments
+2. **Upload** — WAL segments are flushed to the transport (object store or NFS)
+3. **Download** — Replicas poll the transport and apply new WAL segments
 
 The key insight: **smaller, more frequent uploads = lower latency but more
 network traffic**. Larger, less frequent uploads = higher latency but lower
@@ -69,7 +172,7 @@ costs.
 
-## Settings explained
+## Settings reference
 
 ### WAL segment size
 
@@ -91,8 +194,10 @@ costs.
 cairo.wal.segment.rollover.size=2097152
 ```
 
-Controls when WAL segments are closed and uploaded. Smaller segments upload
-sooner (lower latency) but create more files.
+Controls the size threshold at which WAL segments are closed and uploaded.
+Smaller segments upload sooner (lower latency) but create more files. Works in
+tandem with the throttle window — whichever limit is hit first triggers the
+upload.
 
 | Value | Behavior |
 |-------|----------|
@@ -110,8 +215,9 @@ Tiering requires files over 128 KiB.
 replication.primary.throttle.window.duration=10000 # 10 seconds (default)
 ```
 
-Maximum time before uploading incomplete segments. Longer windows let segments
-fill up before upload, reducing redundant uploads (write amplification).
+Maximum time before uploading an incomplete segment. If a segment hasn't reached
+the rollover size within this window, it is flushed anyway. Longer windows let
+segments fill up before upload, reducing redundant uploads (write amplification).
 
 | Value | Behavior |
 |-------|----------|
@@ -145,41 +251,31 @@ Reducing the poll interval below the throttle window duration has diminishing
 returns, since the replica cannot consume data faster than the primary produces it.
 :::
 
-### Sequencer part size
-
-```ini
-replication.primary.sequencer.part.txn.count=5000
-```
-
-Controls how many transactions are grouped into each sequencer part file.
-
-Instead of uploading the entire transaction log on every replication cycle
-(which grows indefinitely), the sequencer is split into fixed-size part files.
-Only new or changed parts are uploaded, significantly reducing network overhead. +## Advanced settings -| Value | Effect | -|-------|--------| -| Lower (e.g. `1000`) | Smaller part files, more frequent new parts, more object storage requests, faster incremental uploads. | -| Higher (e.g. `5000`) | Larger part files, fewer parts, fewer object storage requests, larger per-upload size. | +These settings are available for power users but rarely need adjustment: -Default is `5000` (each part ~34-68 KiB compressed). - -:::warning -This setting is **fixed at table creation**. You cannot change it for existing -tables. -::: +| Setting | Default | Description | +|---------|---------|-------------| +| `replication.primary.sequencer.part.txn.count` | `5000` | Transactions per sequencer part file. Lower values mean smaller parts and faster incremental uploads but more storage requests. **Fixed at table creation** — cannot be changed for existing tables. | +| `replication.primary.compression.level` | `1` | Zstd compression level for WAL uploads. Higher values reduce transfer size at the cost of CPU. | +| `replication.primary.compression.threads` | `2` | Number of threads used for compressing WAL data before upload. | +| `replication.requests.max.concurrent` | `32` | Maximum concurrent replication requests (uploads and downloads). | +| `replication.requests.retry.attempts` | `3` | Number of retry attempts for failed replication requests. | +| `replication.requests.retry.interval` | `500` | Milliseconds between retry attempts. | ## Compression (reference) -WAL data is compressed before upload. This isn't tunable, but useful for -estimating storage and network requirements: +WAL data is compressed before upload (the level and thread count are configurable +in [Advanced settings](#advanced-settings) above). The typical ratios are useful +for estimating storage and network requirements: -| Data type | Compression ratio | -|-----------|-------------------| +| Data type | Typical compression ratio | +|-----------|---------------------------| | WAL segments | ~8x | | Sequencer parts | ~6x | -For example, a 2 MiB WAL segment becomes ~256 KiB in object storage. +For example, a 2 MiB WAL segment becomes ~256 KiB in the transport layer. ## Next steps From 77bc8912380f80cef8d256075ae1346ad6fd5c81 Mon Sep 17 00:00:00 2001 From: Nick Woolmer <29717167+nwoolmer@users.noreply.github.com> Date: Tue, 24 Feb 2026 19:13:54 +0000 Subject: [PATCH 3/3] docs: escape dollar signs to prevent LaTeX rendering in tuning page Co-Authored-By: Claude Opus 4.6 --- documentation/high-availability/tuning.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/documentation/high-availability/tuning.md b/documentation/high-availability/tuning.md index 23c90fcff..5a25a20a9 100644 --- a/documentation/high-availability/tuning.md +++ b/documentation/high-availability/tuning.md @@ -109,14 +109,14 @@ NetApp Volumes instead. ### The cost tradeoff -The storage cost gap (object store at ~$20/TB vs NFS at $60–300/TB) looks large, +The storage cost gap (object store at ~\$20/TB vs NFS at \$60–300/TB) looks large, but the replication working set — WAL files in transit — is typically well under 1 TB. At that scale the per-TB premium is modest in absolute terms. The real cost difference is **operations**. With object store, every flush and every poll is a billable request. Each actively-written table generates one write op per throttle window and one read op per poll interval. 
Across major clouds, -write ops typically cost ~$5/million and read ops ~$0.40/million. +write ops typically cost ~\$5/million and read ops ~\$0.40/million. **Object store ops cost per active table:** @@ -153,7 +153,7 @@ transport choice. | | Object store | NFS / file storage | |---|---|---| | Pricing model | Per-request + per-GB stored | Fixed monthly (provisioned) | -| Storage cost | ~$20/TB/mo | ~$60–300/TB/mo | +| Storage cost | ~\$20/TB/mo | ~\$60–300/TB/mo | | Cost of aggressive tuning | Higher (more ops) | No change | | Setup complexity | Low | Medium (mount on all nodes) | | Best for | Default settings, few tables | Sub-second latency, many tables |