Skip to content

Conversation

@rdettai-sk
Copy link
Collaborator

Description

Currently, the search permit provider handles only one query at a time. This is good for the overall query latency when queries are relatively short. It is catastrophic in any concurrent setup where some queries are slow, because all the other queries will be starved from resources until the large query completes.

This PR creates a secondary search permit provider that is independently configured. "Fast" leaf queries are routed to one provider and "slow" ones to the other. Query duration is naively inferred from the number of targetted splits, and the threshold for being routed to one provider or the other can be configured using SearcherConfig.secondary_targeted_split_count_threshold.

One limitation of the current implementation of this PR is that for a given root search, various leaf searches might fall on different sides of the threshold. We could fix this by applying the threashold on the root and add the targeted provider in the leaf request.

How was this PR tested?

We ran this setup on our highly concurrent production environment. It helped stabilize query performance drastically.

.get_permits(permit_sizes)
.await;
let permit_provider = if is_broad_search {
&searcher_context.secondary_search_permit_provider
Copy link
Collaborator

@fulmicoton fulmicoton Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with this logic, short request cannot use the broad search request permits. I don't think this is good.

(do you want a supermarket where people with less than 10 articles cannot go to the regular cashier)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. We just don't want to queue them there if there is some broad search going on.

Before making this change, what do you think about the fact that the permit provider is chosen on the leafs? I would prefer that for a large search, all leaf searches were tagged as "broad" in the root. Currently, the timeout cannot be enforced on the root because we don't know yet what timeout will be applied to its leaf searches. This requires a change in the RPC so I wanted to be sure the the change will be merged upstream before doing it.

@fulmicoton-dd
Copy link
Collaborator

@rdettai-sk I think we should not merge this PR. At least at the moment.
I understand that it is useful to you.

The trouble is that it is a hacky solution (the problem I pointed out, but also others: Do we really want to not use the entire permit pool for a long request if there are no request running? Would a priority system be more suited?), this adds complexity (it has many new property in the configuration for instance) to the project. After the PR is merged, moving away for a better solution would require coordination.

I understand this is a rather "unfair": we would happily merge a hack like this if it was scratching our customer's itch.

@rdettai-sk
Copy link
Collaborator Author

I agree this solution isn't perfect. I also understand that adding new configurations that are likely to become irrelevant when moving to a better solution creates an undesirable maintenance burden.

Do you have plans to work on the search scheduler soon? Do you never run into situations where the searchers become irresponsive because of this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants