fix: source jvm.config from peon.sh for K8s peon containers #19364
Open
Yomanz wants to merge 5 commits into apache:master from widgetbot-io:fix/peon-sh-jvm-config
Commits
3fb8c65 fix: source jvm.config from peon.sh for K8s peon containers (Yomanz)
206ab1d test: add K8s integration test for peon jvm.config sourcing (Yomanz)
2f7ee8a fix: spelling mistake (Yomanz)
272f1b9 fix: change from deprecated URL.URL (Yomanz)
f62f9e7 test: add active script-level test for peon.sh jvm.config sourcing (Yomanz)
202 changes: 202 additions & 0 deletions
...rc/test/java/org/apache/druid/testing/embedded/k8s/KubernetesPeonJvmConfigDockerTest.java
@@ -0,0 +1,202 @@

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

package org.apache.druid.testing.embedded.k8s;

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import io.fabric8.kubernetes.api.model.OwnerReference;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.LocalPortForward;
import org.apache.druid.common.utils.IdUtils;
import org.apache.druid.indexing.common.task.NoopTask;
import org.apache.druid.query.DruidMetrics;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test;

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;
import java.util.List;
import java.util.Map;

/**
 * Regression test for https://github.com/apache/druid/issues/18791.
 *
 * Verifies that {@code distribution/docker/peon.sh} sources options from
 * {@code jvm.config} and that those options reach the peon JVM as system
 * properties. Before the fix, {@code peon.sh} silently ignored
 * {@code jvm.config}, so any JVM flags set there — including memory limits
 * users had configured to prevent OOMs — never applied.
 *
 * <p>The test uses an operator manifest that injects
 * {@code -Ddruid.test.peon.jvmconfig.marker=true} into the cluster-level
 * {@code jvm.options}. The Druid operator writes these to {@code jvm.config}
 * on each node, including the overlord. When a peon is launched via the
 * {@code K8sTaskAdapter}, it inherits the overlord's pod spec (including the
 * mounted {@code jvm.config}); {@code peon.sh} then sources that file and
 * prepends its contents to {@code JAVA_OPTS}. The marker therefore appears in
 * the peon JVM's system properties, which this test asserts by querying
 * {@code /status/properties} on the peon pod.
 */
@Disabled("requires charts.datainfra.io chart, see https://github.com/apache/druid/pull/19047")
public class KubernetesPeonJvmConfigDockerTest extends BaseKubernetesTaskRunnerDockerTest
{
  private static final String MARKER_KEY = "druid.test.peon.jvmconfig.marker";
  private static final String MARKER_VALUE = "true";
  private static final String MARKER_MANIFEST =
      "manifests/druid-service-with-operator-peonjvmconfig.yaml";

  /**
   * Matches {@code DruidK8sConstants.PORT} but duplicated here to avoid
   * pulling the whole {@code druid-kubernetes-overlord-extensions} module in
   * as a test-scope dep just for one integer.
   */
  private static final int PEON_HTTP_PORT = 8100;

  private static final long PEON_POD_READY_TIMEOUT_MILLIS = 180_000L;
  private static final long PROPERTIES_POLL_TIMEOUT_MILLIS = 60_000L;

  private static final ObjectMapper MAPPER = new ObjectMapper();
  private static final TypeReference<Map<String, String>> MAP_TYPE = new TypeReference<>() {};

  @Override
  protected boolean useSharedInformers()
  {
    return false;
  }

  @Override
  protected String getManifestTemplate()
  {
    return MARKER_MANIFEST;
  }

  @Test
  public void test_peonSourcesJvmConfigMarker() throws Exception
  {
    final String taskId = IdUtils.getRandomId();
    // Keep the peon alive long enough to discover its pod, port-forward, and hit the status endpoint.
    final long runDurationMillis = 240_000L;

    cluster.callApi().onLeaderOverlord(
        o -> o.runTask(
            taskId,
            new NoopTask(taskId, null, dataSource, runDurationMillis, 0L, null)
        )
    );

    try {
      eventCollector.latchableEmitter().waitForEvent(
          event -> event.hasMetricName(NoopTask.EVENT_STARTED)
                        .hasDimension(DruidMetrics.TASK_ID, taskId)
      );

      final KubernetesClient client = k3sCluster.getKubernetesClient();
      final Pod peonPod = waitForReadyPeonPod(client);

      try (LocalPortForward portForward = client.pods()
                                                .inNamespace(K3sClusterResource.DRUID_NAMESPACE)
                                                .withName(peonPod.getMetadata().getName())
                                                .portForward(PEON_HTTP_PORT)) {
        final Map<String, String> peonProperties = pollForStatusProperties(portForward.getLocalPort());
        Assertions.assertEquals(
            MARKER_VALUE,
            peonProperties.get(MARKER_KEY),
            "Expected jvm.config marker to reach peon JVM as a system property. "
            + "This is a regression: peon.sh must source $SERVICE_CONF_DIR/jvm.config."
        );
      }
    }
    finally {
      try {
        cluster.callApi().onLeaderOverlord(o -> o.cancelTask(taskId));
      }
      catch (Exception ignore) {
        // Best-effort cleanup.
      }
    }
  }

  private Pod waitForReadyPeonPod(KubernetesClient client) throws InterruptedException
  {
    final long deadline = System.currentTimeMillis() + PEON_POD_READY_TIMEOUT_MILLIS;
    while (System.currentTimeMillis() < deadline) {
      for (Pod pod : client.pods().inNamespace(K3sClusterResource.DRUID_NAMESPACE).list().getItems()) {
        if (!ownedByJob(pod)) {
          continue;
        }
        if (isReady(pod)) {
          return pod;
        }
      }
      Thread.sleep(2_000L);
    }
    throw new AssertionError(
        "No Job-owned pod became Ready within "
        + (PEON_POD_READY_TIMEOUT_MILLIS / 1000L) + "s — expected a peon pod to appear"
    );
  }

  private static boolean ownedByJob(Pod pod)
  {
    final List<OwnerReference> owners = pod.getMetadata().getOwnerReferences();
    return owners != null && owners.stream().anyMatch(o -> "Job".equals(o.getKind()));
  }

  private static boolean isReady(Pod pod)
  {
    return pod.getStatus() != null
           && pod.getStatus().getConditions() != null
           && pod.getStatus().getConditions().stream().anyMatch(
               c -> "Ready".equals(c.getType()) && "True".equals(c.getStatus())
           );
  }

  private Map<String, String> pollForStatusProperties(int localPort) throws InterruptedException
  {
    final long deadline = System.currentTimeMillis() + PROPERTIES_POLL_TIMEOUT_MILLIS;
    Exception lastException = null;
    while (System.currentTimeMillis() < deadline) {
      try {
        final URL url = URI.create("http://localhost:" + localPort + "/status/properties").toURL();
        final HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(2_000);
        conn.setReadTimeout(2_000);
        if (conn.getResponseCode() == 200) {
          try (InputStream is = conn.getInputStream()) {
            return MAPPER.readValue(is, MAP_TYPE);
          }
        }
      }
      catch (Exception e) {
        lastException = e;
      }
      Thread.sleep(1_000L);
    }
    final String suffix = lastException == null ? "" : " Last error: " + lastException;
    throw new AssertionError(
        "Peon /status/properties did not return 200 within "
        + (PROPERTIES_POLL_TIMEOUT_MILLIS / 1000L) + "s." + suffix
    );
  }
}
```
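
Both `waitForReadyPeonPod` and `pollForStatusProperties` in the test follow the same deadline-and-sleep loop. As a rough illustration only, that pattern could be factored into one helper. The names `PollSketch` and `pollUntilPresent` are hypothetical and not part of this PR, and the sketch wraps interruption in an `AssertionError` for brevity rather than declaring `InterruptedException` as the test methods do.

```java
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical helper illustrating the deadline-and-sleep polling pattern; not part of the PR. */
final class PollSketch
{
  private PollSketch()
  {
  }

  /**
   * Calls {@code attempt} until it yields a value or {@code timeoutMillis} elapses.
   * Runtime exceptions thrown by {@code attempt} are remembered and the attempt is retried.
   */
  static <T> T pollUntilPresent(
      Supplier<Optional<T>> attempt,
      long timeoutMillis,
      long sleepMillis,
      String description
  )
  {
    // nanoTime is monotonic, so the deadline is unaffected by wall-clock adjustments.
    final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    RuntimeException lastException = null;
    while (System.nanoTime() - deadline < 0) {
      try {
        final Optional<T> result = attempt.get();
        if (result.isPresent()) {
          return result.get();
        }
      }
      catch (RuntimeException e) {
        lastException = e;
      }
      try {
        Thread.sleep(sleepMillis);
      }
      catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new AssertionError(description + " was interrupted", e);
      }
    }
    // The last failure, if any, is attached as the cause to aid debugging.
    throw new AssertionError(
        description + " did not succeed within " + timeoutMillis + " ms",
        lastException
    );
  }
}
```

With a helper of this shape, each caller would supply only its own probe (list pods and pick a Ready one, or fetch `/status/properties`) and its own timeout.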
115 changes: 115 additions & 0 deletions
embedded-tests/src/test/resources/manifests/druid-service-with-operator-peonjvmconfig.yaml
@@ -0,0 +1,115 @@

```yaml
apiVersion: "druid.apache.org/v1alpha1"
kind: "Druid"
metadata:
  name: test-cluster-${service}
spec:
  image: ${image}
  startScript: /druid.sh
  scalePvcSts: true
  rollingDeploy: true
  defaultProbes: false
  podLabels:
    environment: stage
    release: alpha
  podAnnotations:
    dummy: k8s_extn_needs_atleast_one_annotation
  volumes:
    - name: mysqlconnector
      emptyDir: { }
  securityContext:
    fsGroup: 0
    runAsUser: 0
    runAsGroup: 0
  containerSecurityContext:
    privileged: true
  commonConfigMountPath: "/opt/druid/conf/druid/cluster/_common"
  common.runtime.properties: |
    ${commonRuntimeProperties}
  jvm.options: |-
    -server
    -Djava.net.preferIPv4Stack=true
    -XX:MaxDirectMemorySize=10240g
    -Duser.timezone=UTC
    -Dfile.encoding=UTF-8
    -Dlog4j.debug
    -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
    -Ddruid.test.peon.jvmconfig.marker=true
  log4j.config: |-
    <?xml version="1.0" encoding="UTF-8" ?>
    <Configuration status="WARN">
      <Appenders>
        <Console name="Console" target="SYSTEM_OUT">
          <PatternLayout pattern="%d{HH:mm:ss.SSS} [%t] %-5level %logger{36} - %msg%n"/>
        </Console>
        <File name="FileAppender" fileName="log/${sys:druid.node.type}.log">
          <PatternLayout pattern="%d{yyyy-MM-dd HH:mm:ss.SSS} [%t] %-5level %logger{36} - %msg%n"/>
        </File>
      </Appenders>
      <Loggers>
        <Root level="info">
          <AppenderRef ref="Console"/>
          <AppenderRef ref="FileAppender"/>
        </Root>
      </Loggers>
    </Configuration>
  env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace
  nodes:
    ${service}s:
      nodeType: ${service}
      priorityClassName: system-cluster-critical
      druid.port: ${port}
      services:
        - spec:
            type: NodePort
            ports:
              - name: http
                port: ${port}
                targetPort: ${port}
                nodePort: ${port}
      replicas: 1
      nodeConfigMountPath: "/opt/druid/conf/druid/cluster/${serviceFolder}"
      runtime.properties: |
        ${nodeRuntimeProperties}
      livenessProbe:
        failureThreshold: 10
        httpGet:
          path: /status/health
          port: ${port}
        initialDelaySeconds: 5
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 5
      readinessProbe:
        failureThreshold: 20
        httpGet:
          path: /status/health
          port: ${port}
        initialDelaySeconds: 5
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 5
      startUpProbe:
        failureThreshold: 20
        httpGet:
          path: /status/health
          port: ${port}
        initialDelaySeconds: 60
        periodSeconds: 30
        successThreshold: 1
        timeoutSeconds: 10
      volumeMounts:
        - mountPath: /druid/data
          name: druid-shared-storage
      volumes:
        - name: druid-shared-storage
          hostPath:
            path: /druid/shared-storage
            type: DirectoryOrCreate
```
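
To make the path from the manifest's `jvm.options` lines to the asserted system property concrete, here is a small Java model of the behavior the test's Javadoc describes: the `jvm.config` contents are prepended to `JAVA_OPTS`, and each `-D` flag becomes a system property. This is an illustration under assumptions, not the code in `peon.sh`; `JvmOptionsSketch` and both method names are hypothetical, and the sketch splits on whitespace only, so it does not handle quoted values.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical model of "jvm.config lines -> JAVA_OPTS -> system properties"; not Druid code. */
final class JvmOptionsSketch
{
  private JvmOptionsSketch()
  {
  }

  /** Joins the non-blank jvm.config lines and places them before any existing JAVA_OPTS. */
  static String prependToJavaOpts(List<String> jvmConfigLines, String existingJavaOpts)
  {
    final String fromFile = jvmConfigLines.stream()
                                          .map(String::trim)
                                          .filter(line -> !line.isEmpty())
                                          .collect(Collectors.joining(" "));
    final String existing = existingJavaOpts == null ? "" : existingJavaOpts.trim();
    if (existing.isEmpty()) {
      return fromFile;
    }
    return fromFile.isEmpty() ? existing : fromFile + " " + existing;
  }

  /** Extracts -Dkey=value tokens; a bare -Dkey maps to the empty string. */
  static Map<String, String> systemProperties(String javaOpts)
  {
    final Map<String, String> properties = new LinkedHashMap<>();
    for (String token : javaOpts.trim().split("\\s+")) {
      if (!token.startsWith("-D") || token.length() == 2) {
        continue; // Not a system property flag, e.g. -server or -XX:... options.
      }
      final String body = token.substring(2);
      final int equalsIndex = body.indexOf('=');
      if (equalsIndex < 0) {
        properties.put(body, "");
      } else {
        // In this sketch, a later token overwrites an earlier one with the same key.
        properties.put(body.substring(0, equalsIndex), body.substring(equalsIndex + 1));
      }
    }
    return properties;
  }
}
```

Fed the `jvm.options` lines from the manifest, `systemProperties` would contain `druid.test.peon.jvmconfig.marker` mapped to `true`, which is the value the integration test looks up in the peon's `/status/properties` response.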
P2 Regression coverage is permanently disabled
The new regression test that verifies peon.sh sources jvm.config is annotated with `@Disabled`, so the fix has no active automated coverage. Since this bug is specifically in container startup behavior, leaving the only regression test disabled means the same breakage can return without CI noticing. Please either make this test runnable in the existing embedded test setup or add a narrower active test around the script/config generation path.
Just pushed something to cover this
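
For readers without the follow-up commit at hand, one shape a narrower script-level check could take is sketched here. It is not the test that was pushed. It assumes a POSIX `sh` on the PATH, and `STAND_IN_SCRIPT` is a hypothetical stand-in for the relevant part of `peon.sh`, whose real contents may differ. A real test would run the actual script with the `java` launch stubbed out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical script-level check; the shell snippet is a stand-in, not the real peon.sh. */
final class ScriptLevelTestSketch
{
  // Assumed behavior: if $SERVICE_CONF_DIR/jvm.config exists, prepend its lines to JAVA_OPTS.
  static final String STAND_IN_SCRIPT =
      "if [ -f \"$SERVICE_CONF_DIR/jvm.config\" ]; then\n"
      + "  JAVA_OPTS=\"$(tr '\\n' ' ' < \"$SERVICE_CONF_DIR/jvm.config\") $JAVA_OPTS\"\n"
      + "fi\n"
      + "printf '%s' \"$JAVA_OPTS\"\n";

  private ScriptLevelTestSketch()
  {
  }

  /** Writes jvm.config into a temp dir, runs the stand-in script, and returns the JAVA_OPTS tokens it prints. */
  static List<String> javaOptsSeenByScript(List<String> jvmConfigLines, String existingJavaOpts)
  {
    try {
      final Path confDir = Files.createTempDirectory("peon-conf");
      confDir.toFile().deleteOnExit();
      final Path jvmConfig = confDir.resolve("jvm.config");
      Files.write(jvmConfig, jvmConfigLines, StandardCharsets.UTF_8);
      jvmConfig.toFile().deleteOnExit();

      final ProcessBuilder builder = new ProcessBuilder("sh", "-c", STAND_IN_SCRIPT);
      builder.environment().put("SERVICE_CONF_DIR", confDir.toString());
      builder.environment().put("JAVA_OPTS", existingJavaOpts);
      builder.redirectErrorStream(true);

      final Process process = builder.start();
      final String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
      if (!process.waitFor(10, TimeUnit.SECONDS) || process.exitValue() != 0) {
        throw new IllegalStateException("Script failed: " + output);
      }
      return Arrays.asList(output.trim().split("\\s+"));
    }
    catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }
}
```

A check along these lines needs no Kubernetes cluster, so it can stay enabled in CI, at the cost of exercising only the script's option handling rather than the full pod startup path.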