The Sequin database source occasionally gets stuck in an errored state with the message:
[postgres_replication_slot]: Sequin is connected, but has not received a heartbeat from the database's replication slot. Either Sequin is crashing or the replication process has stalled for some reason.
However, there is another replication slot abc_sub which is working totally fine.
> SELECT slot_name, confirmed_flush_lsn, active_pid, pg_size_pretty( pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) ) AS replication_lag FROM pg_replication_slots;
+-------------+---------------------+------------+-----------------+
| slot_name   | confirmed_flush_lsn | active_pid | replication_lag |
|-------------+---------------------+------------+-----------------|
| abc_sub     | 1525/140003B8       | 5850       | 592 bytes       |
| sequin_slot | 1514/6002A288       | 22112      | 67 GB           |
+-------------+---------------------+------------+-----------------+
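For anyone cross-checking the numbers: the lag column is the difference between two LSNs, and an LSN of the form `X/Y` is a 64-bit byte position written as two hexadecimal halves. Here is a minimal sketch of that arithmetic in Python. The helper names `parse_lsn` and `lag_bytes` are made up for illustration, and the current WAL position is an assumption inferred from the 592 bytes of lag reported for abc_sub, since the query output does not show it directly.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '1525/140003B8' to a byte position."""
    high, low = lsn.split("/")
    # The part before the slash is the upper 32 bits, the part after is the lower 32 bits.
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(current_lsn_pos: int, confirmed_flush_lsn: str) -> int:
    """Bytes of WAL between the current position and a slot's confirmed_flush_lsn."""
    return current_lsn_pos - parse_lsn(confirmed_flush_lsn)


# confirmed_flush_lsn values copied from the query output above.
ABC_SUB_LSN = "1525/140003B8"
SEQUIN_SLOT_LSN = "1514/6002A288"

# Assumption: current WAL position = abc_sub's confirmed_flush_lsn + its reported lag.
current_pos = parse_lsn(ABC_SUB_LSN) + 592

sequin_lag = lag_bytes(current_pos, SEQUIN_SLOT_LSN)
print(sequin_lag)                       # lag in bytes
print(round(sequin_lag / 1024**3, 1))   # lag in GiB
```

This comes out to roughly 66.8 GiB, in line with the 67 GB shown for sequin_slot. So the slot's confirmed position is about 17 WAL "high word" increments (4 GiB each) behind the healthy slot.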
The sequin container never exhausts the allocated CPU and RAM so it's not a resource issue either. The logs from sequin don't show any errors apart from the following -
[error] [08:05:05.719] [SlotProducer] replication connect failed: ERROR 55006 (object_in_use) replication slot "sequin_slot" is active for PID 19053 line=264 pid=<0.5680.0> file=lib/sequin/runtime/slot_producer/slot_producer.ex domain=elixir application=sequin account_id=eb2a3aa6-9b46-41d3-aafd-b89e4365db75 database_id=fdb099f9-14f6-4890-beb6-3c48ae87bdb1 replication_id=d644c92b-d398-4a30-a57c-254687e51e38
[error] [08:15:05.682] [SlotProcessorServer] Heartbeat verification failed (no messages or heartbeat received in last 10 min) line=382 pid=<0.5691.0> file=lib/sequin/runtime/slot_processor_server.ex domain=elixir application=sequin account_id=eb2a3aa6-9b46-41d3-aafd-b89e4365db75 database_id=fdb099f9-14f6-4890-beb6-3c48ae87bdb1 replication_id=d644c92b-d398-4a30-a57c-254687e51e38 heartbeat_id=ef48d2d0-c8a3-43c7-8bee-89b61807fad6
[error] [08:15:05.685] GenServer {:replication, {Sequin.Runtime.SlotProcessorServer, "d644c92b-d398-4a30-a57c-254687e51e38"}} terminating
[error] [08:25:05.716] [SlotProcessorServer] Heartbeat verification failed (no messages or heartbeat received in last 10 min) line=382 pid=<0.6350.0> file=lib/sequin/runtime/slot_processor_server.ex domain=elixir application=sequin account_id=eb2a3aa6-9b46-41d3-aafd-b89e4365db75 database_id=fdb099f9-14f6-4890-beb6-3c48ae87bdb1 replication_id=d644c92b-d398-4a30-a57c-254687e51e38 heartbeat_id=c65d7624-c33f-4faa-81c5-f05ef8675db8
[error] [08:25:05.718] GenServer {:replication, {Sequin.Runtime.SlotProcessorServer, "d644c92b-d398-4a30-a57c-254687e51e38"}} terminating
[error] [08:33:07.580] Sink consumer IDs do not match monitored sink consumer IDs.
The postgres logs just show sequin trying to re-connect -
LOG: unexpected EOF on standby connection
STATEMENT: START_REPLICATION SLOT sequin_slot LOGICAL 0/0 (proto_version '1', publication_names 'abc_pub', messages 'true')
LOG: starting logical decoding for slot "sequin_slot"
DETAIL: Streaming transactions committing after 1525/2C07F490, reading WAL from 1525/2C069230.
STATEMENT: START_REPLICATION SLOT sequin_slot LOGICAL 0/0 (proto_version '1', publication_names 'abc_pub', messages 'true')
LOG: logical decoding found consistent point at 1525/2C069230
DETAIL: Logical decoding will begin using saved snapshot.
STATEMENT: START_REPLICATION SLOT sequin_slot LOGICAL 0/0 (proto_version '1', publication_names 'abc_pub', messages 'true')
Restarting the sequin instance doesn't work either; we have to drop and re-create the slot each time. This has become a recurring issue for us.
Version: v0.14.6
Environment: Running through docker, config is passed as an env -
sequin_config_yaml = yamlencode({
  account = {
    name = "abc"
  }
  users = [
    {
      email = var.SEQUIN_ADMIN_USER
      password = var.SEQUIN_ADMIN_PASSWORD
    }
  ]
  databases = [
    {
      name = "abc-source"
      username = local.sequin_source_db_credentials[0]
      password = local.sequin_source_db_credentials[1]
      hostname = local.sequin_source_db_host_port[0]
      port = tonumber(local.sequin_source_db_host_port[1])
      database = local.sequin_source_db_host_port_db[1]
      ssl = true
      slot = {
        name = "sequin_slot"
        create_if_not_exists = true
      }
      publication = {
        name = "abc_pub"
        create_if_not_exists = false
      }
    }
  ]
  sinks = [
    {
      name = "abc-to-nats"
      database = "abc-source"
      status = "active"
      source = {
        include_schemas = ["public"]
        exclude_tables = ["public.json_web_tokens"]
      }
      destination = {
        type = "nats"
        host = aws_route53_record.nats_dns_record.name
        port = 4222
        username = "nats"
        password = random_password.nats_password.result
        tls = true
      }
    }
  ]
})