Sequin replication stalls and never recovers #2141

The Sequin database source occasionally gets stuck in an errored state with the message:
```
[postgres_replication_slot]: Sequin is connected, but has not received a heartbeat from the database's replication slot. Either Sequin is crashing or the replication process has stalled for some reason.
```

However, there is another replication slot `abc_sub` which is working totally fine.

```sql
> SELECT slot_name, confirmed_flush_lsn, active_pid, pg_size_pretty( pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) ) AS replication_lag FROM pg_replication_slots;
+-------------+---------------------+------------+-----------------+
| slot_name   | confirmed_flush_lsn | active_pid | replication_lag |
|-------------+---------------------+------------+-----------------|
| abc_sub     | 1525/140003B8       | 5850       | 592 bytes       |
| sequin_slot | 1514/6002A288       | 22112      | 67 GB           |
+-------------+---------------------+------------+-----------------+
```

The `sequin` container never exhausts its allocated CPU or RAM, so this does not look like a resource issue either. The `sequin` logs don't show any errors apart from the following:
```
[error] [08:05:05.719] [SlotProducer] replication connect failed: ERROR 55006 (object_in_use) replication slot "sequin_slot" is active for PID 19053 line=264 pid=<0.5680.0> file=lib/sequin/runtime/slot_producer/slot_producer.ex domain=elixir application=sequin account_id=eb2a3aa6-9b46-41d3-aafd-b89e4365db75 database_id=fdb099f9-14f6-4890-beb6-3c48ae87bdb1 replication_id=d644c92b-d398-4a30-a57c-254687e51e38 
[error] [08:15:05.682] [SlotProcessorServer] Heartbeat verification failed (no messages or heartbeat received in last 10 min) line=382 pid=<0.5691.0> file=lib/sequin/runtime/slot_processor_server.ex domain=elixir application=sequin account_id=eb2a3aa6-9b46-41d3-aafd-b89e4365db75 database_id=fdb099f9-14f6-4890-beb6-3c48ae87bdb1 replication_id=d644c92b-d398-4a30-a57c-254687e51e38 heartbeat_id=ef48d2d0-c8a3-43c7-8bee-89b61807fad6 
[error] [08:15:05.685] GenServer {:replication, {Sequin.Runtime.SlotProcessorServer, "d644c92b-d398-4a30-a57c-254687e51e38"}} terminating
[error] [08:25:05.716] [SlotProcessorServer] Heartbeat verification failed (no messages or heartbeat received in last 10 min) line=382 pid=<0.6350.0> file=lib/sequin/runtime/slot_processor_server.ex domain=elixir application=sequin account_id=eb2a3aa6-9b46-41d3-aafd-b89e4365db75 database_id=fdb099f9-14f6-4890-beb6-3c48ae87bdb1 replication_id=d644c92b-d398-4a30-a57c-254687e51e38 heartbeat_id=c65d7624-c33f-4faa-81c5-f05ef8675db8 
[error] [08:25:05.718] GenServer {:replication, {Sequin.Runtime.SlotProcessorServer, "d644c92b-d398-4a30-a57c-254687e51e38"}} terminating
[error] [08:33:07.580] Sink consumer IDs do not match monitored sink consumer IDs.
```
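The `object_in_use` error suggests that an older walsender backend is still attached to `sequin_slot` when Sequin tries to reconnect. A query along these lines could confirm which backend holds the slot and what it is waiting on. This is only a diagnostic sketch: it assumes the role running it can see other sessions in `pg_stat_activity` (superuser or `pg_monitor`), and we have not confirmed that terminating the backend resolves the stall.

```sql
-- Show the backend currently attached to the slot and what it is doing.
SELECT s.slot_name,
       s.active,
       s.active_pid,
       a.state,
       a.wait_event_type,
       a.wait_event,
       a.backend_start,
       a.state_change
FROM pg_replication_slots AS s
LEFT JOIN pg_stat_activity AS a ON a.pid = s.active_pid
WHERE s.slot_name = 'sequin_slot';

-- Untested assumption: if that backend is a stale walsender, terminating it
-- should release the slot so a new replication connection can attach.
-- SELECT pg_terminate_backend(active_pid)
-- FROM pg_replication_slots
-- WHERE slot_name = 'sequin_slot' AND active;
```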

The Postgres logs just show `sequin` trying to reconnect:
```
LOG:  unexpected EOF on standby connection
STATEMENT:  START_REPLICATION SLOT sequin_slot LOGICAL 0/0 (proto_version '1', publication_names 'abc_pub', messages 'true')
LOG:  starting logical decoding for slot "sequin_slot"
DETAIL:  Streaming transactions committing after 1525/2C07F490, reading WAL from 1525/2C069230.
STATEMENT:  START_REPLICATION SLOT sequin_slot LOGICAL 0/0 (proto_version '1', publication_names 'abc_pub', messages 'true')
LOG:  logical decoding found consistent point at 1525/2C069230
DETAIL:  Logical decoding will begin using saved snapshot.
STATEMENT:  START_REPLICATION SLOT sequin_slot LOGICAL 0/0 (proto_version '1', publication_names 'abc_pub', messages 'true')
```

Restarting the Sequin instance doesn't work either; we have to drop and re-create the slot each time. This has become a recurring issue for us.
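For reference, the drop and re-create workaround looks roughly like the following. This is a sketch, not the exact commands we ran each time: the `pgoutput` plugin name is an assumption based on the `proto_version` and `publication_names` options in the `START_REPLICATION` statement, and the explicit re-create may be unnecessary because the config sets `create_if_not_exists = true`. Dropping the slot discards its retained WAL position, so changes made while the slot was stalled are not replayed.

```sql
-- Drop the stalled slot. This fails with object_in_use while a backend is
-- still attached, so the holding connection has to be gone first.
SELECT pg_drop_replication_slot('sequin_slot');

-- Re-create the logical slot (plugin name 'pgoutput' is an assumption).
SELECT pg_create_logical_replication_slot('sequin_slot', 'pgoutput');
```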

---

**Version:** v0.14.6

**Environment:** Running in Docker; the config is passed as an environment variable:
```hcl
  sequin_config_yaml = yamlencode({
    account = {
      name = "abc"
    }
    users = [
      {
        email    = var.SEQUIN_ADMIN_USER
        password = var.SEQUIN_ADMIN_PASSWORD
      }
    ]
    databases = [
      {
        name     = "abc-source"
        username = local.sequin_source_db_credentials[0]
        password = local.sequin_source_db_credentials[1]
        hostname = local.sequin_source_db_host_port[0]
        port     = tonumber(local.sequin_source_db_host_port[1])
        database = local.sequin_source_db_host_port_db[1]
        ssl      = true
        slot = {
          name                 = "sequin_slot"
          create_if_not_exists = true
        }
        publication = {
          name                 = "abc_pub"
          create_if_not_exists = false
        }
      }
    ]
    sinks = [
      {
        name     = "abc-to-nats"
        database = "abc-source"
        status   = "active"
        source = {
          include_schemas = ["public"]
          exclude_tables  = ["public.json_web_tokens"]
        }
        destination = {
          type     = "nats"
          host     = aws_route53_record.nats_dns_record.name
          port     = 4222
          username = "nats"
          password = random_password.nats_password.result
          tls      = true
        }
      }
    ]
  })

```
