59 changes: 53 additions & 6 deletions documentation/deployment/gcp.md

You should deploy using an `x86_64` Linux distribution, such as Ubuntu.

For storage, we recommend using [Hyperdisk Balanced](https://cloud.google.com/compute/docs/disks/hyperdisks) disks,
and provisioning them at `5000 IOPS/300 MBps` until you have tested your workload.

:::warning
Hyperdisk Balanced is not supported on all machine types. N2 instances do not
support Hyperdisk. Use N4, C3, or C4 series instances with Hyperdisk Balanced.
:::

`Hyperdisk Extreme` generally requires much higher `vCPU` counts - for example, it cannot be used on `C3` machines
smaller than `88 vCPUs`.


### Google Filestore

Google Filestore is a managed NFS service that can be used as a replication
transport layer in QuestDB Enterprise.

Filestore should **not** be used as primary storage for QuestDB. However, it
is well-suited for replication when low latency is required. The `fs::`
transport over NFS provides sub-200ms replication lag with
[aggressive tuning](/docs/high-availability/tuning/), compared to ~1s+ with
object store transport (GCS).

To use Filestore for replication:

1. Create a Filestore instance in the same region as your QuestDB VMs
2. Mount the NFS share on both primary and replica nodes
3. Configure the `fs::` transport in `server.conf`:

```ini
replication.object.store=fs::root=/mnt/questdb-repl/final;atomic_write_dir=/mnt/questdb-repl/scratch;
```

Use the [backup](/docs/operations/backup/) feature to manage WAL file retention
on the NFS mount.

On GKE, expose the Filestore share as a `PersistentVolume` with
`ReadWriteMany` access mode using the
[Filestore CSI driver](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/filestore-csi-driver),
so both primary and replica pods can mount it simultaneously.

:::note
Filestore Zonal and Basic SSD tiers may require a
[quota increase](https://cloud.google.com/docs/quotas/view-manage) before use.
Basic HDD is typically available by default.
:::

### Google Cloud Storage

QuestDB supports Google Cloud Storage as its replication object store in the
Enterprise edition. GCS is the simplest and cheapest replication transport, but
has higher latency (~1s+) due to object store API overhead.

To get started, create a bucket for the database to use. Then follow the
[Enterprise Quick Start](/docs/getting-started/enterprise-quick-start/) steps to create a connection string and
configure QuestDB.

### NetApp Volumes

[NetApp Volumes](https://cloud.google.com/netapp/volumes/docs/discover/overview)
is a managed NFS service on GCP backed by NetApp ONTAP. Like Filestore, it can
be used as a low-latency replication transport via the `fs::` prefix. The
QuestDB configuration is identical to Filestore.

:::note
NetApp Volumes requires enabling the `netapp.googleapis.com` API and may
require separate quota allocation.
:::

### Minimum specification

- **Instance**: `c3-standard-4` or `c3d-standard-4` `(4 vCPUs, 16 GB RAM)`
222 changes: 178 additions & 44 deletions documentation/high-availability/tuning.md
Tune replication for lower latency or reduced network costs.
</EnterpriseNote>

Three settings control replication latency. The main decision is your transport
layer — **object store** (S3, GCS, Azure Blob) is simplest and cheapest at rest,
while **NFS** (EFS, Filestore, Azure Files, NetApp) removes per-operation costs
and unlocks sub-second latency. Pick a transport, choose a profile below, and
restart.

## The three settings that matter

| Setting | Node | Default | What it does |
|---------|------|---------|--------------|
| `replication.primary.throttle.window.duration` | Primary | `10000` (10s) | Maximum time before an incomplete WAL segment is flushed |
| `replication.replica.poll.interval` | Replica | `1000` (1s) | How often the replica checks for new data |
| `cairo.wal.segment.rollover.size` | Primary | `2097152` (2 MiB) | Max WAL segment size before rollover |

A segment is uploaded when **either** the size limit or the throttle window is
reached, whichever comes first. Under heavy write load, segments fill and flush
well before the throttle window expires. Under light load, the throttle window
controls when the partially-filled segment is flushed.
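This flush rule can be sketched as a small model. The defaults below mirror the table above; the ingest rate is a hypothetical workload parameter, not a QuestDB setting:

```python
# Sketch: a WAL segment flushes when either limit is hit first —
# the rollover size or the throttle window. Defaults mirror the
# documented settings; ingest_rate_bytes_per_s is hypothetical.

def flush_interval_ms(ingest_rate_bytes_per_s: float,
                      rollover_size_bytes: int = 2_097_152,
                      throttle_window_ms: int = 10_000) -> float:
    """Approximate time between segment flushes for one table."""
    if ingest_rate_bytes_per_s <= 0:
        return throttle_window_ms  # idle table: the window decides
    time_to_fill_ms = rollover_size_bytes / ingest_rate_bytes_per_s * 1000
    return min(time_to_fill_ms, throttle_window_ms)

# Heavy load: 2 MiB/s fills a 2 MiB segment in ~1 s, well before 10 s.
print(flush_interval_ms(2_097_152))   # size limit wins
# Light load: 1 KiB/s never fills the segment; the 10 s window flushes it.
print(flush_interval_ms(1024))        # throttle window wins
```

This is why lowering only the throttle window helps light workloads, while heavy workloads respond to a smaller rollover size.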

## Configuration profiles

### Sub-200ms latency (NFS transport)

```ini
# Primary
cairo.wal.segment.rollover.size=262144
replication.primary.throttle.window.duration=50

# Replica
replication.replica.poll.interval=50
```

### Sub-500ms latency (NFS or object store)

```ini
# Primary
cairo.wal.segment.rollover.size=524288
replication.primary.throttle.window.duration=100

# Replica
replication.replica.poll.interval=100
```

### Default / balanced

No configuration needed. The defaults are:

- `replication.primary.throttle.window.duration=10000` (10s)
- `replication.replica.poll.interval=1000` (1s)
- `cairo.wal.segment.rollover.size=2097152` (2 MiB)

### Network efficiency

```ini
# Primary
cairo.wal.segment.rollover.size=2097152
replication.primary.throttle.window.duration=60000
replication.primary.sequencer.part.txn.count=1000
```

## Choosing a transport: cost vs latency

{/* Pricing sources — verify periodically against your cloud provider:
GCS: https://cloud.google.com/storage/pricing
Filestore: https://cloud.google.com/filestore/pricing
NetApp (GCP): https://cloud.google.com/netapp/volumes/pricing
AWS S3: https://aws.amazon.com/s3/pricing/
AWS EFS: https://aws.amazon.com/efs/pricing/
Azure Blob: https://azure.microsoft.com/en-us/pricing/details/storage/blobs/
Azure Files: https://azure.microsoft.com/en-us/pricing/details/storage/files/
Azure NetApp: https://azure.microsoft.com/en-us/pricing/details/netapp/
*/}

### Object store (S3, GCS, Azure Blob)

- **Per-request pricing**: every WAL upload is a write op, every replica poll is
a read op
- Lower latency settings = more ops = higher cost
- Best for: simplest setup, low storage cost, moderate latency tolerance
- Storage cost: ~$20/TB/month across major clouds

:::note[GCP users]
Replication over GCS has a latency floor of roughly 1 second. If you need
sub-second replication on GCP, use an NFS transport such as Filestore or
NetApp Volumes instead.
:::

### NFS / managed file storage (EFS, Filestore, Azure Files, NetApp)

- **Fixed monthly cost** regardless of how aggressively you tune
- No per-operation charges — poll every 50ms at no extra cost
- Best for: low-latency requirements, high-throughput ingestion
- Storage cost: ~$60–300/TB/month depending on service tier and provider
- NFS is usually priced by provisioned capacity, not usage — you pay for the
full volume whether it's 10% or 100% full

### The cost tradeoff

The storage cost gap (object store at ~\$20/TB vs NFS at \$60–300/TB) looks large,
but the replication working set — WAL files in transit — is typically well under
1 TB. At that scale the per-TB premium is modest in absolute terms.

The real cost difference is **operations**. With object store, every flush and
every poll is a billable request. Each actively-written table generates one write
op per throttle window and one read op per poll interval. Across major clouds,
write ops typically cost ~\$5/million and read ops ~\$0.40/million.

**Object store ops cost per active table:**

| Throttle / poll interval | Ops cost per table per month |
|---|---|
| 50ms / 50ms | ~$280 |
| 100ms / 100ms | ~$140 |
| 1s / 1s | ~$14 |
| 10s / 1s (default) | ~$2 |

Multiply by the number of tables being actively written to. With 10 tables at
100ms intervals, that's ~$1,400/month in API charges alone. With NFS, that same
configuration costs nothing extra.
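The per-table figures above can be reproduced with a quick estimate. The price assumptions (~\$5 per million writes, ~\$0.40 per million reads) are rough cross-cloud averages, not any provider's exact rate card:

```python
# Rough per-table monthly ops cost for object store transport.
# Assumes one upload per throttle window and one poll per interval,
# at ~$5/M writes and ~$0.40/M reads (illustrative averages).

SECONDS_PER_MONTH = 30 * 24 * 3600  # ~2.59M

def ops_cost_per_table(throttle_ms: int, poll_ms: int,
                       write_price: float = 5e-6,
                       read_price: float = 0.4e-6) -> float:
    writes = SECONDS_PER_MONTH * 1000 / throttle_ms
    reads = SECONDS_PER_MONTH * 1000 / poll_ms
    return writes * write_price + reads * read_price

for throttle, poll in [(50, 50), (100, 100), (1000, 1000), (10_000, 1000)]:
    print(f"{throttle}ms / {poll}ms -> ${ops_cost_per_table(throttle, poll):.0f}/table/month")
```

Multiply the result by your count of actively-written tables to estimate the total API bill.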

The rough breakeven:

> **ops cost per month** ≈ active tables × $14,000 / interval_ms
>
> If that exceeds the NFS premium over object storage (typically $40–180/TB/mo
> × your working set in TB), **NFS is cheaper**.

At default settings with a handful of tables, object store wins easily. Once you
push below ~200ms intervals or have many actively-written tables, NFS pays for
itself on API savings alone — and you get lower latency as a bonus.

:::note
For long-term data retention (cold/archive tier), object storage is always
significantly cheaper and should be used regardless of your replication
transport choice.
:::

### Summary

| | Object store | NFS / file storage |
|---|---|---|
| Pricing model | Per-request + per-GB stored | Fixed monthly (provisioned) |
| Storage cost | ~\$20/TB/mo | ~\$60–300/TB/mo |
| Cost of aggressive tuning | Higher (more ops) | No change |
| Setup complexity | Low | Medium (mount on all nodes) |
| Best for | Default settings, few tables | Sub-second latency, many tables |

## How replication works

Understanding the data flow helps you tune effectively:

1. **Ingestion** - Data is written to Write-Ahead Log (WAL) segments
2. **Upload** - WAL segments are flushed to the transport (object store or NFS)
3. **Download** - Replicas poll the transport and apply new WAL segments

The key insight: **smaller, more frequent uploads = lower latency but more
network traffic**. Larger, less frequent uploads = higher latency but lower
costs.

<Screenshot
  alt="Network traffic with default settings"
  title="Default settings: balanced latency and throughput"
  height={360}
  src="images/guides/replication-tuning/one_row_sec_defaults.webp"
  width={1072}
/>

## Settings reference

### WAL segment size

```ini
cairo.wal.segment.rollover.size=2097152
```

Controls the size threshold at which WAL segments are closed and uploaded.
Smaller segments upload sooner (lower latency) but create more files. Works in
tandem with the throttle window — whichever limit is hit first triggers the
upload.

| Value | Behavior |
|-------|----------|
replication.primary.throttle.window.duration=10000 # 10 seconds (default)
```

Maximum time before uploading an incomplete segment. If a segment hasn't reached
the rollover size within this window, it is flushed anyway. Longer windows let
segments fill up before upload, reducing redundant uploads (write amplification).

| Value | Behavior |
|-------|----------|
| `50` (50ms) | Ultra-low latency. Best with NFS transport. |
| `100` (100ms) | Low latency. Good balance for NFS transport. |
| `1000` (1s) | Low latency for object store transport. |
| `10000` (10s) | Default. Balanced. |
| `60000` (60s) | 1 minute delay OK. Fewer uploads. |
| `300000` (5 min) | Cost-sensitive. Batches more data. |

This is your **maximum replication latency tolerance**. QuestDB still actively
manages replication to prevent backlogs during bursts.

### Replica poll interval

```ini
replication.replica.poll.interval=1000 # 1 second (default)
```

How often the replica checks the transport layer for new data. This setting
is configured on the **replica** node.

| Value | Behavior |
|-------|----------|
| `50` (50ms) | Ultra-low latency. Pair with aggressive primary settings. |
| `100` (100ms) | Low latency. Good for NFS transport. |
| `1000` (1s) | Default. Balanced. |

:::note
Reducing the poll interval below the throttle window duration has diminishing
returns, since the replica cannot consume data faster than the primary produces it.
:::

## Advanced settings

These settings are available for power users but rarely need adjustment:

| Setting | Default | Description |
|---------|---------|-------------|
| `replication.primary.sequencer.part.txn.count` | `5000` | Transactions per sequencer part file. Lower values mean smaller parts and faster incremental uploads but more storage requests. **Fixed at table creation** — cannot be changed for existing tables. |
| `replication.primary.compression.level` | `1` | Zstd compression level for WAL uploads. Higher values reduce transfer size at the cost of CPU. |
| `replication.primary.compression.threads` | `2` | Number of threads used for compressing WAL data before upload. |
| `replication.requests.max.concurrent` | `32` | Maximum concurrent replication requests (uploads and downloads). |
| `replication.requests.retry.attempts` | `3` | Number of retry attempts for failed replication requests. |
| `replication.requests.retry.interval` | `500` | Milliseconds between retry attempts. |

## Compression (reference)

WAL data is compressed before upload (the level and thread count are configurable
in [Advanced settings](#advanced-settings) above). The typical ratios are useful
for estimating storage and network requirements:

| Data type | Typical compression ratio |
|-----------|---------------------------|
| WAL segments | ~8x |
| Sequencer parts | ~6x |

For example, a 2 MiB WAL segment becomes ~256 KiB in the transport layer.

## Next steps

Expand Down