Skip to content

Conversation

@jasonlmfong
Copy link
Member

@jasonlmfong jasonlmfong commented Dec 5, 2025

ts: limit the recording of child metrics to a small set of metrics

This change adds an explicit list of metrics which we perform child metrics collection for.
This helps bound the performance impact even when the clsuter setting is turned on.

Epic: CRDB-55079
Release: None

--

ts: record histogram child metrics

When child metrics collection was performed, it only picked up gauges and counters.
This commit adds the ability to record histograms as well.
This was done through exposing child histograms snapshots which were previously not accessible.

Epic: CRDB-55079
Release: None

--

ts: memoize child metric name transformation

Curently the child metric name transformations are being performed every polling cycle.
The memoization will ensure every subsequent poll does not need to allocate memory and cpu to transform the names.

Epic: CRDB-55079
Release: None

@jasonlmfong jasonlmfong changed the title Jf/ts record limited set child metrics ts record limited set child metrics Dec 5, 2025
@jasonlmfong jasonlmfong changed the title ts record limited set child metrics ts: record limited set child metrics Dec 5, 2025
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dhartunian reviewed 2 of 2 files at r1, 4 of 4 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jasonlmfong)


-- commits line 4 at r1:
nit: commit message wrap.


-- commits line 12 at r2:
can you add more detail about why a bunch of histogram wrangling code is necessary? Seems like it's related to child histograms not exposing their internals (because they weren't needed before) and the manual work involved in calculating the percentiles, right?

nit: commit message wrap.


-- commits line 18 at r3:
nit: grammar. also consider rewording. "memoize child metric name transformation" might be clearer.


-- commits line 20 at r3:
Be clear about what motivated the introduction of the cache, and what you expect it to improve. This way someone can understand why this code is here in the future. Otherwise, it seems unmotivated on the surface especially since it's not linked to a ticket.


pkg/server/status/recorder_test.go line 1170 at r1 (raw file):

		{"5metrics_100children", 5, 100},
		{"10metrics_100children", 10, 100},
		{"10metrics_1024children", 10, 1024},

what is the purpose of doing different combos for benchmarks? shouldn't 1 benchmark scale to different number of child metrics?


pkg/server/status/recorder.go line 197 at r3 (raw file):

	// childMetricNameCache caches the encoded names for child metrics to avoid
	// rebuilding them on every recording. Uses sync.Map for lock-free reads.
	childMetricNameCache sync.Map

linter will tell you this, but you need to use syncutil.Map which supports generics.


pkg/server/status/recorder.go line 476 at r3 (raw file):

			timestampNanos: now.UnixNano(),
		}
		recorder.recordChangefeedChildMetrics(&data, &mr.childMetricNameCache)

did you consider putting the childMetricNameCache in the registryRecorder struct so you don't have to pass it around?


pkg/server/status/recorder.go line 960 at r3 (raw file):

func hashLabels(labels []*prometheusgo.LabelPair) uint64 {
	var h uint64 = 14695981039346656037 // FNV-1a offset basis
	const prime = 1099511628211

surprising to me that you have to do this manually. is there no util or stdlib thing that can accumulate a hash from a bunch of strings? you should use hash/maphash or hash/fnv.


pkg/server/status/recorder.go line 1122 at r3 (raw file):

			} else {
				metricName = prom.GetName(false /* useStaticLabels */) + metric.EncodeLabeledName(childMetric)
			}

can you extract the shared logic for the hash lookup? seems identical as the histogram one above.


pkg/server/status/recorder_test.go line 366 at r2 (raw file):

		// Check for changefeed.stage.downstream_client_send.latency with scope labels
		if strings.Contains(ts.Name, "cr.node.changefeed.stage.downstream_client_send.latency") {

this is a weird construction for the test to conditionally evaluate so many varieties of cases. can this be encoded as a set of test case structs?


pkg/server/status/recorder.go line 1037 at r2 (raw file):

				}
				childMetricsCount++
			})

this code feels like it should be implemented by the child metric structs as a method, no? or is it specialized to this output. maybe this is just how we do non-child histogram recording?

Copy link
Member Author

@jasonlmfong jasonlmfong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)


pkg/server/status/recorder.go line 476 at r3 (raw file):

Previously, dhartunian (David Hartunian) wrote…

did you consider putting the childMetricNameCache in the registryRecorder struct so you don't have to pass it around?

the registryRecorder is being recreated when we call GetTimeSeriesData() so the cache doesn't live until the next iteration

Copy link
Member Author

@jasonlmfong jasonlmfong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)


pkg/server/status/recorder.go line 1037 at r2 (raw file):

Previously, dhartunian (David Hartunian) wrote…

this code feels like it should be implemented by the child metric structs as a method, no? or is it specialized to this output. maybe this is just how we do non-child histogram recording?

we also do non-child histogram recordings in a similar fashion, it's under the extractValue() function

@jasonlmfong jasonlmfong force-pushed the jf/ts-record-limited-set-child-metrics branch from 8116a2e to cb3c2c5 Compare December 5, 2025 19:18
@github-actions
Copy link

github-actions bot commented Dec 5, 2025

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@github-actions github-actions bot added the o-AI-Review-Potential-Issue-Detected AI reviewer found potential issue. Never assign manually—auto-applied by GH action only. label Dec 5, 2025
This change adds an explicit list of metrics which we perform child metrics collection for.
This helps bound the performance impact even when the clsuter setting is turned on.

Epic: CRDB-55079
Release: None
When child metrics collection was performed, it only picked up gauges and counters.
This commit adds the ability to record histograms as well.
This was done through exposing child histograms snapshots which were previously not accessible.

Epic: CRDB-55079
Release: None
Curently the child metric name transformations are being performed every polling cycle.
The memoization will ensure every subsequent poll does not need to allocate memory and cpu to transform the names.

Epic: CRDB-55079
Release: None
@jasonlmfong jasonlmfong force-pushed the jf/ts-record-limited-set-child-metrics branch from cb3c2c5 to 6bf6467 Compare December 5, 2025 20:30
@jasonlmfong jasonlmfong marked this pull request as ready for review December 5, 2025 20:31
@jasonlmfong jasonlmfong requested review from a team as code owners December 5, 2025 20:31
@jasonlmfong jasonlmfong requested review from Abhinav1299, aa-joshi, alyshanjahani-crl, arjunmahishi and dhartunian and removed request for a team December 5, 2025 20:31
@jasonlmfong jasonlmfong changed the title ts: record limited set child metrics ts: update the recording of changefeed child metrics Dec 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

o-AI-Review-Potential-Issue-Detected AI reviewer found potential issue. Never assign manually—auto-applied by GH action only.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants