Skip to content

Conversation

@tiationg-kho
Copy link
Contributor

Issue #, if available:
#1205

Description of changes:
Add new env var for sqsMsgVisibilityTimeoutSec to SQS mode.

Quick revisit how NTH listen to ASG launch event:

  • ASG launch event come → NTH start process → change to internal interruption event
    • if node ready → sent continue to life cycle hook → delete SQS msg
    • if node not ready → cancel internal event → wait SQS vis timeout → same ASG launch event come again → (after SQS reach max receive count, msg goto DLQ)

So we better let SQS visibility timeout become configurable. Which can mitigate the 1205 issue.

Solution:

  1. Recommend to set a higher sqsMsgVisibilityTimeoutSec value like 40 seconds to override the default 20 seconds in our sqs-monitor.
  2. Also, if applicable, recommend to increase SQS maxReceiveCount for DLQ like 5.

How you tested your changes:
Environment (Linux / Windows):
Kubernetes Version:

Test1:

  • default sqsMsgVisibilityTimeoutSec (20sec) and do not set DLQ for SQS -> succeed after 3 tries

Test2:

  • set sqsMsgVisibilityTimeoutSec as 10sec and SQS maxReceiveCount for DLQ as 3 -> fail after 3 tries

Test3:

  • set sqsMsgVisibilityTimeoutSec as 10sec and SQS maxReceiveCount for DLQ as 10 -> succeed after 5 tries

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@tiationg-kho tiationg-kho requested a review from a team as a code owner December 30, 2025 01:24
@tiationg-kho
Copy link
Contributor Author

Hi @LikithaVemulapalli @stevegocoding, could you help me take a look?

func (m SQSMonitor) receiveQueueMessages(qURL string) ([]*sqs.Message, error) {
visibilityTimeout := m.SqsMsgVisibilityTimeoutSec
if visibilityTimeout <= 0 {
visibilityTimeout = 20
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we import the default constant from config instead of directly adding 20 here for single source of truth

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

flag.BoolVar(&config.UseAPIServerCacheToListPods, "use-apiserver-cache", getBoolEnv(useAPIServerCache, false), "If true, leverage the k8s apiserver's index on pod's spec.nodeName to list pods on a node, instead of doing an etcd quorum read.")
flag.IntVar(&config.HeartbeatInterval, "heartbeat-interval", getIntEnv(heartbeatIntervalKey, -1), "The time period in seconds between consecutive heartbeat signals. Valid range: 30-3600 seconds (30 seconds to 1 hour).")
flag.IntVar(&config.HeartbeatUntil, "heartbeat-until", getIntEnv(heartbeatUntilKey, -1), "The duration in seconds over which heartbeat signals are sent. Valid range: 60-172800 seconds (1 minute to 48 hours).")
flag.IntVar(&config.SqsMsgVisibilityTimeoutSec, "sqs-msg-visibility-timeout-sec", getIntEnv(sqsMsgVisibilityTimeoutSecConfigKey, sqsMsgVisibilityTimeoutSecDefault), "Duration in seconds that a message is hidden from other consumers after being retrieved from the SQS queue by sqs-monitor.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we should add validation for this SQS visibility timeout field, the range can be from 0 to 43200 sec which is 12 hrs, SQS throws an error after 12 hrs, so could you include the range in all the comments for this field and also validation to throw error if config.SqsMsgVisibilityTimeoutSec < 0 || config.SqsMsgVisibilityTimeoutSec > 43200 and also the default per this doc is 30 seconds.
For reference: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

| `topologySpreadConstraints` | [Topology Spread Constraints](https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/) for pod scheduling. Useful with a highly available deployment to reduce the risk of running multiple replicas on the same Node | `[]` |
| `heartbeatInterval` | The time period in seconds between consecutive heartbeat signals. Valid range: 30-3600 seconds (30 seconds to 1 hour). | `-1` |
| `heartbeatUntil` | The duration in seconds over which heartbeat signals are sent. Valid range: 60-172800 seconds (1 minute to 48 hours). | `-1` |
| `sqsMsgVisibilityTimeoutSec` | Duration in seconds that a message is hidden from other consumers after being retrieved from the SQS queue by sqs-monitor. | `20` |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update comment here as well after including the range

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants