[VL][Delta] Delta CI pipeline #12278
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a GitHub Actions pipeline to build a Gluten Velox bundle and run Delta Lake’s spark module unit tests against it, including automation to patch a Delta checkout and adjust its build config for the CI run.
Changes:
- Introduces a new `delta_spark_ut.yml` workflow with native build, bundle assembly, and sharded Delta test execution.
- Adds a `setup-delta.sh` helper to clone Delta, inject the bundle jar into the correct sbt project `lib/`, patch `DeltaSQLCommandTest`, and disable a scalastyle header check.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| .github/workflows/util/delta-spark-ut/setup-delta.sh | Automates cloning/patching Delta and adjusting scalastyle to make Gluten-enabled tests run. |
| .github/workflows/delta_spark_ut.yml | Defines the CI workflow to build Gluten artifacts and execute Delta Spark unit tests in shards. |
Can we remove the duplicated tests in Gluten's codebase if they are covered by the new approach?
I guess so. I don't know exactly what these duplicates are, whether they are exact copies, or which versions they come from.
philo-he left a comment
@felipepessoto, thanks for the PR.
  cancel-in-progress: true

jobs:
  build-native-lib-centos-7:
Is this job a duplicate of the one in velox_backend_x86.yml? If so, we might consider moving the Delta tests into that file to reuse the native artifact it builds, since that artifact can't be shared across different workflows.
This is also a consideration for reducing our GHA usage, see #12288
@philo-he if reducing CI usage is a priority, introducing Delta CI could be a big problem: the Delta tests run for many hours and require 4 to 8 shards to finish in a reasonable time (2-3h).
sed -i \
  's|<check level="error" class="org.scalastyle.file.HeaderMatchesChecker" enabled="true">|<check level="error" class="org.scalastyle.file.HeaderMatchesChecker" enabled="false">|' \
  "$SCALASTYLE_CONFIG"
run: |
  set -euo pipefail
  mkdir -p bundle-out
  # Match the renamed fat jar produced by package/pom.xml's copy-fat-jar
  # exec. The version part may bump (e.g. 1.7.0-SNAPSHOT -> 1.8.0-SNAPSHOT),
  # so glob the version suffix.
  jar=$(ls package/target/gluten-velox-bundle-spark${{ env.GLUTEN_BUNDLE_SPARK_VERSION }}_${{ env.GLUTEN_BUNDLE_SCALA_VERSION }}-linux_amd64-*.jar | head -n 1)
  if [ -z "$jar" ] || [ ! -f "$jar" ]; then
    echo "ERROR: Could not find Gluten bundle jar under package/target/" >&2
    ls -la package/target/ || true
    exit 1
  fi
def build_header_index(all_suites):
    """Map every plausible console header string to its FQN(s)."""
    index = {}
    for fqn in all_suites:
        simple = fqn_to_simple(fqn)
        for key in (fqn, fqn + ":", simple, simple + ":"):
            index.setdefault(key, set()).add(fqn)
    return index
xml_files = []
# ScalaTest's -u reporter and Maven surefire both write `TEST-<suite>.xml`
# under a `target/.../*-reports/` dir. Restrict the secondary glob to
# `target/` so we never parse Delta's own XML *test resources* (which live
# under src/test/resources and are not reports). The <testsuite>-root guard
# below is a final safety net.
for pattern in ("**/TEST-*.xml", "**/target/**/*.xml"):
    xml_files.extend(glob.glob(os.path.join(reports_dir, pattern), recursive=True))
xml_files = sorted(set(xml_files))
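The collection step in the snippet above can be sketched end to end as follows. This is a minimal illustration, not the script from the PR: `collect_reports` and its return shape are hypothetical, and only the two glob patterns and the idea of a `<testsuite>`-root guard come from the snippet.

```python
import glob
import os
import xml.etree.ElementTree as ET


def collect_reports(reports_dir):
    """Return {(suite, test): status} for every JUnit report under reports_dir.

    Hypothetical helper; status is 'passed', 'failed' or 'skipped'.
    """
    xml_files = []
    for pattern in ("**/TEST-*.xml", "**/target/**/*.xml"):
        xml_files.extend(glob.glob(os.path.join(reports_dir, pattern), recursive=True))

    results = {}
    for path in sorted(set(xml_files)):
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            continue  # not well-formed XML, so it cannot be a report
        if root.tag != "testsuite":
            continue  # safety net: skip XML files that are not JUnit reports
        suite = root.get("name", "")
        for case in root.iter("testcase"):
            if case.find("failure") is not None or case.find("error") is not None:
                status = "failed"
            elif case.find("skipped") is not None:
                status = "skipped"
            else:
                status = "passed"
            results[(suite, case.get("name", ""))] = status
    return results
```

Keying results by `(suite, test)` is an assumption made here so that the same test name in two suites does not collide.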
env:
  ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION: true
  MVN_CMD: 'build/mvn -ntp'
set -euo pipefail
GLUTEN_JAR=$(ls "$GITHUB_WORKSPACE"/gluten-bundle/gluten-velox-bundle-spark*_*-linux_amd64-*.jar | head -n 1)
echo "Using Gluten bundle: $GLUTEN_JAR"
bash "$GITHUB_WORKSPACE/.github/workflows/util/delta-spark-ut/setup-delta.sh" \
if not parsed_any:
    eprint(
        "WARNING: no JUnit <testsuite> elements found under {}".format(reports_dir)
    )

passed, failed, skipped = parse_reports(args.reports_dir)

# Always emit this shard's artifacts for the aggregation job.
if args.failures_out:
    write_entries(args.failures_out, failed)
if args.ran_out:
    write_entries(args.ran_out, passed | failed)
VELOX_USER_CHECK_NOT_NULL(
    inputColumnType,
    "Nested field reference into a non-struct type (e.g. an array or map element) is not supported.");

  statsResultAttrs,
  StatisticsInputNode(dataCols))
val projOp = ProjectExec(statsResultAttrs, aggOp)
val offloads = Seq(OffloadOthers()).map(_.toStrcitRule())
val aggregates = statsColExpr.collect {
  case ae: AggregateExpression if ae.aggregateFunction.isInstanceOf[DeclarativeAggregate] =>
    ae
}
val statsAttrs = aggregates.flatMap(_.aggregateFunction.aggBufferAttributes)
val statsResultAttrs = aggregates.flatMap(_.aggregateFunction.inputAggBufferAttributes)
# PARTIAL BASELINE -- seeded from 15 of 16 shards of run 27490052632
# (894 known failures across 18003 tests run). Shard 2 hung and is NOT yet
# included, so its tests will surface as regressions until shard 2 is collected
# and merged (see MANIFEST/README). Re-run shard 2, then aggregate all 16 and
# replace this file.

      - 'backends-velox/src-delta40/**/DeltaSQLCommandTest.scala'

env:
  ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION: true
val projOp = ProjectExec(statsResultAttrs, aggOp)
val offloads = Seq(OffloadOthers()).map(_.toStrcitRule())
val config = GlutenConfig.get
val transformRule = HeuristicTransform.WithRewrites(
  Validators.newValidator(config, offloads),
  Seq(PullOutPreProject),
  offloads)
if ! git clone --depth 1 --branch "$DELTA_REF" https://github.com/delta-io/delta.git "$DELTA_DIR"; then
  echo "Shallow clone of ref '${DELTA_REF}' failed, falling back to full clone."
  rm -rf "$DELTA_DIR"
  git clone https://github.com/delta-io/delta.git "$DELTA_DIR"
  git -C "$DELTA_DIR" checkout "$DELTA_REF"
fi
private def canOffloadStats(dataCols: Seq[Attribute], statsColExpr: Expression): Boolean = {
  try {
    val aggregates = statsColExpr.collect {
      case ae: AggregateExpression if ae.aggregateFunction.isInstanceOf[DeclarativeAggregate] =>
        ae
    }
    val statsAttrs = aggregates.flatMap(_.aggregateFunction.aggBufferAttributes)
    val statsResultAttrs = aggregates.flatMap(_.aggregateFunction.inputAggBufferAttributes)
    val aggOp = SortAggregateExec(
      None,
      isStreaming = false,
      None,
      Seq.empty,
      aggregates,
      statsAttrs,
      0,
      statsResultAttrs,
      StatisticsInputNode(dataCols))
    val projOp = ProjectExec(statsResultAttrs, aggOp)
Run Gluten Clickhouse CI on x86
Adds a GitHub Actions workflow `delta_spark_ut.yml` (plus a helper
`util/delta-spark-ut/setup-delta.sh`) that runs delta-io/delta's `spark`
sbt module unit tests with a Gluten Velox bundle built from this
repository's source on the classpath. The pipeline lets us catch
regressions in Gluten against the latest Delta release before they
reach users.
Triggers:
* `workflow_dispatch` with overridable `delta_ref` (default v4.2.0),
`spark_version` (default 4.1), and `test_parallelism` (default 1).
* `pull_request` when the workflow files, `gluten-delta/**`, or the
reused `backends-velox/.../DeltaSQLCommandTest.scala` change.
Three jobs:
1. `build-native-lib-centos-7` -- builds the Velox/Gluten native
libraries in the `apache/gluten:vcpkg-centos-7-gcc13` container
(x86_64), reusing `dev/ci-velox-buildstatic-centos-7.sh` and the
ccache layout other pipelines already use. Uploads the cpp/build
tree and the Arrow jars from the local m2 repo as artifacts.
2. `build-gluten-bundle` -- in `apache/gluten:centos-9-jdk17`,
downloads the native artifacts and runs `mvn clean install` with
`-Pspark-4.1 -Pscala-2.13 -Pjava-17 -Pbackends-velox -Pdelta`
(install, not package, so the gluten-delta jar lands in m2 before
the shaded fat jar is built; `-Dmaven.compiler.release=17`
defeats any user settings.xml pinning release=1.8). Uploads the
resulting `gluten-velox-bundle-spark4.1_2.13-linux_amd64-*.jar`.
3. `delta-spark-test` -- in `apache/gluten:centos-9-jdk17`, sharded
8 ways (matching Delta's upstream `spark_test.yaml`), runs
`setup-delta.sh` and then `sbt spark/test`. `timeout-minutes: 300`
makes the job fail faster than the 6h GH job timeout. Test reports are
uploaded unconditionally; on failure we also upload `hs_err_pid*`
and `/tmp/*.hprof(.gz)` heap dumps for post-mortem analysis.
Non-obvious design decisions captured in the code (and worth flagging
for reviewers):
* **Bundle goes only in `delta/spark-unified/lib/`, NOT `delta/spark/lib/`.**
sbt auto-scans `<baseDirectory>/lib` via `unmanagedBase`, and
`unmanagedJars` are project-scoped (not inherited by dependents). In
Delta v4.2.0 the layout is
- `sparkV1 = project in file("spark")` -> `spark/lib`
- `spark = project in file("spark-unified")` -> `spark-unified/lib`
Placing the bundle only under `spark-unified/lib/` exposes it to the
unified `spark` project's Compile and Test classpaths (where the
forked test JVM loads `org.apache.gluten.GlutenPlugin` by name) while
keeping it off `sparkV1`'s Compile classpath. The latter matters
because the bundle pulls extra symbols under `org.apache.spark.sql`
into scope, which would collide with Delta's own
`MergeOutputGeneration.scala` (it imports both `org.apache.spark.sql._`
and `org.apache.spark.sql.delta.ClassicColumnConversions._`).
* **`JAVA_TOOL_OPTIONS` carries the Netty + JDK-17 `--add-opens` set,
but NOT `-Xmx`.** The flag set mirrors `extraJavaTestArgs` from
Gluten's own root `pom.xml`. The crucial non-Delta-default flag is
`-Dio.netty.tryReflectionSetAccessible=true`: without it Gluten's
bundled Arrow allocator throws
`UnsupportedOperationException: sun.misc.Unsafe or DirectByteBuffer
... not available` on first direct-buffer allocation. Delta's own
`Test/javaOptions` (see `project/CrossSparkVersions.scala`
`java17TestSettings`) already sets the base `--add-opens` but not
the Netty property -- their own suites just don't load Arrow this way.
`-Xmx` is intentionally NOT in `JAVA_TOOL_OPTIONS`: it is processed
BEFORE the command line, so Delta's explicit `-Xmx1024m` would still
win (last `-Xmx` wins).
* **Forked test JVM heap is bumped to `-Xmx6G` via
`set spark / Test / javaOptions ++= Seq("-Xmx6G", ...)`.** `++=`
appends to Delta's own seq, so our `-Xmx6G` lands AFTER `-Xmx1024m`
and wins. Delta v4.2.0's 1G cap is far too tight once Gluten +
Velox + Arrow + JNI are loaded -- a DV+CDC merge suite OOMs inside
`RoaringBitmapArray.extendBitmaps` at that heap size. We also turn
on `HeapDumpOnOutOfMemoryError` with `HeapDumpPath=/tmp/` so future
OOMs come with a dump. `_JAVA_OPTIONS` is avoided because it would
also override the sbt launcher heap and defeat budget accounting.
* **sbt launcher heap is `-J-Xmx4G`** -- sbt only compiles tests and
orchestrates the test fork, so 4G is comfortable and leaves more of
the 16 GB GH runner for the forked test JVM.
* **Single shared sbt/Ivy/Coursier cache across all shards** -- all
shards have the same dependency tree, so a single cache key (with
parallel saves resolved first-write-wins by GH) gives a better
storage / hit-rate tradeoff than 8 isolated caches.
* **`yum install` does NOT include `curl`** -- the
`apache/gluten:centos-9-jdk17` image ships `curl-minimal`, which
provides the `curl` command (used by `sbt-launch-lib.bash` to
download `sbt-launch.jar`) but conflicts with the full `curl`
package. Installing the full package would fail with
"package curl-minimal... conflicts with curl".
* **`DeltaSQLCommandTest` is reused, not duplicated.** `setup-delta.sh`
overwrites Delta's own `DeltaSQLCommandTest.scala` with the existing
copy under `backends-velox/src-delta40/...`. That file registers the
Gluten plugin via typed `GlutenConfig` / `VeloxDeltaConfig`
references, which resolve at test-compile time because the bundle is
already on the unified `spark` project's Test classpath.
* **Delta's scalastyle `HeaderMatchesChecker` is disabled in the cloned
Delta's `scalastyle-config.xml`.** Our reused
`DeltaSQLCommandTest.scala` carries Gluten's ASF-only license header,
which does not match Delta's expected regex (ASF + Spark-mod block +
Delta copyright). `HeaderMatchesChecker` is a file-level checker that
does not honor `// scalastyle:off`, so it has to be disabled at
config level. The config is wired in via
`ThisBuild / scalastyleConfig` in `project/Checkstyle.scala`, so a
single edit covers every sbt sub-project.
Scope is intentionally limited to Velox + x86_64 to keep the matrix
small. ClickHouse and aarch64 can be added later if needed.
Co-authored-by: Copilot
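The heap-flag reasoning in the bullets above (the last `-Xmx` wins, and `JAVA_TOOL_OPTIONS` is processed before the command line) can be modeled in a few lines. This is a toy model rather than anything taken from the workflow; `effective_xmx` is a hypothetical name, and treating `JAVA_TOOL_OPTIONS` as a plain prefix of the argument list is a simplification.

```python
def effective_xmx(java_tool_options, command_line):
    """Return the -Xmx value that takes effect under 'last -Xmx wins'.

    Toy model: JAVA_TOOL_OPTIONS entries are treated as if they were
    prepended to the command-line arguments.
    """
    winner = None
    for arg in list(java_tool_options) + list(command_line):
        if arg.startswith("-Xmx"):
            winner = arg[len("-Xmx"):]
    return winner


# An -Xmx placed in JAVA_TOOL_OPTIONS loses to Delta's explicit -Xmx1024m.
print(effective_xmx(["-Xmx6G"], ["-Xmx1024m"]))      # prints 1024m
# Appending with ++= puts -Xmx6G after -Xmx1024m, so it wins.
print(effective_xmx([], ["-Xmx1024m", "-Xmx6G"]))    # prints 6G
```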
Running delta-io/delta's spark suite against the Gluten Velox bundle produces many expected failures. This adds a committed known-failures baseline and a per-shard gate so CI is green when only baseline failures occur and red on a genuine regression, enabling incremental fixes.

- Inject ScalaTest's -u JUnit XML reporter (Delta only configures the console reporter, so no machine-readable per-test results existed).
- Capture sbt's exit so expected test failures don't fail the step; fail loudly only when zero reports are produced (compile/launch failure).
- compare-test-results.py classifies each test vs known-failures.txt (regression / expected / now-passing) in enforce/seed/aggregate modes.
- Add update_baseline + fail_on_fixed inputs and an aggregate job that emits a ready-to-commit baseline artifact.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
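The three-way classification named in this commit message can be expressed with plain set operations. The sketch below is an assumption about the shape of the logic, not the contents of `compare-test-results.py`; `classify` and `gate_exit_code` are hypothetical names, and the exact effect of `fail_on_fixed` (turning the gate red when a baseline entry now passes) is inferred from the input's name.

```python
def classify(passed, failed, known_failures):
    """Split one shard's results against the baseline (all arguments are sets)."""
    regressions = failed - known_failures   # failing, but not in the baseline
    expected = failed & known_failures      # failing, and already in the baseline
    now_passing = passed & known_failures   # in the baseline, but passing now
    return regressions, expected, now_passing


def gate_exit_code(passed, failed, known_failures, fail_on_fixed=False):
    """Return 0 (green) or 1 (red) for the enforce mode of the gate."""
    regressions, _expected, now_passing = classify(passed, failed, known_failures)
    if regressions:
        return 1
    if fail_on_fixed and now_passing:
        return 1  # assumed meaning: force the baseline to be trimmed
    return 0
```

With `passed={"a", "b"}`, `failed={"c", "d"}` and a baseline of `{"b", "c"}`, test `d` is a regression, `c` is expected, and `b` is now passing.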
At 8-way sharding, ~5 of the 8 delta-spark-test shards consistently exceed the 300-minute job timeout: Delta's generated merge/CDC/DV suites are very slow under Gluten. Observed in run 16 (shards 0,1,2,3,6 all hit the 300-min cap and were cancelled, while 4,5,7 finished in <130 min), and reproduced on the current run. Cancelled shards never upload their results, so the known-failures baseline could only ever capture a subset of suites.

Delta's GreedyHashStrategy balances high-duration suites across shards by estimated duration, so doubling NUM_SHARDS 8 -> 16 roughly halves per-shard wall time and lets every shard finish (and report) within the timeout. Bump timeout-minutes 300 -> 350 for extra margin. TEST_PARALLELISM_COUNT stays 1 (each forked test JVM uses ~8G, so >1 would OOM the runner).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
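Why doubling the shard count roughly halves per-shard wall time can be seen with a generic greedy balancer. This is not Delta's `GreedyHashStrategy`; it is a standard longest-first assignment written for illustration, `assign_shards` is a hypothetical name, and the suite durations are made up.

```python
import heapq


def assign_shards(durations, num_shards):
    """Greedy longest-first assignment: each suite goes to the lightest shard so far."""
    heap = [(0.0, shard) for shard in range(num_shards)]
    heapq.heapify(heap)
    loads = [0.0] * num_shards
    assignment = {}
    # Sort by descending duration; the name breaks ties so the result is deterministic.
    for suite, minutes in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        load, shard = heapq.heappop(heap)
        assignment[suite] = shard
        loads[shard] = load + minutes
        heapq.heappush(heap, (loads[shard], shard))
    return assignment, loads


# Hypothetical workload: 32 slow suites of 60 minutes each.
durations = {"Suite{:02d}".format(i): 60.0 for i in range(32)}
_, loads8 = assign_shards(durations, 8)
_, loads16 = assign_shards(durations, 16)
print(max(loads8), max(loads16))  # prints 240.0 120.0
```

Real suites are not equally sized, so the halving is only approximate; one suite longer than the timeout would still block a shard regardless of the shard count.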
…array/map

SubstraitVeloxExprConverter::toVeloxExpr(FieldReference) resolves a nested struct_field chain by calling asRowType() on each child to descend a level. When the reference traverses a non-struct child -- e.g. a field nested under an array, as exercised by Delta's "nested data support - analysis error - updating array type" UpdateSuite test -- asRowType() returns null and the next loop iteration dereferenced that null RowType, crashing the whole forked JVM with a SIGSEGV in libvelox.so (gluten::SubstraitVeloxExprConverter::toVeloxExpr). A SIGSEGV is not catchable, so plan validation could not fall back and the test process died outright.

Replace the unchecked navigation with VELOX_USER_CHECK guards (field-index bounds + non-null struct child) that throw a VeloxUserError. The plan validator already wraps expression conversion in try/catch(VeloxException), so the query now falls back to vanilla execution cleanly instead of crashing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
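The guard pattern described in this commit (check the index bounds and the struct-ness of each level, and raise a catchable error instead of dereferencing) looks like this in a language-neutral sketch. The type encoding, `UserError`, `resolve_field_type` and `validate` are all hypothetical stand-ins; the actual fix is in C++ and uses `VELOX_USER_CHECK` macros.

```python
class UserError(Exception):
    """Stand-in for a catchable validation error (the PR throws VeloxUserError)."""


def resolve_field_type(row_type, path):
    """Follow a chain of field indexes through a nested type.

    Types are modeled as ("row", [children]), ("array", element) or a scalar name.
    """
    current = row_type
    for index in path:
        # Guard 1: only struct (row) types can be descended into.
        if not (isinstance(current, tuple) and current[0] == "row"):
            raise UserError("Nested field reference into a non-struct type is not supported.")
        children = current[1]
        # Guard 2: the field index must be in bounds.
        if not 0 <= index < len(children):
            raise UserError("Field index {} is out of range.".format(index))
        current = children[index]
    return current


def validate(row_type, path):
    """Mimic the validator: a UserError means 'fall back', not 'crash'."""
    try:
        resolve_field_type(row_type, path)
        return True
    except UserError:
        return False
```

The point of the design is that the failure surfaces as an exception the caller already handles, so an unsupported plan degrades to fallback rather than taking the process down.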
Seeds .github/workflows/util/delta-spark-ut/known-failures.txt with the 894 known failures aggregated from 15 of the 16 shards of run 27490052632 (the run with the toVeloxExpr crash fix), so the gate starts enforcing instead of seeding.

Shard 2 hung (timeout) and is not yet included -- it is clearly marked PARTIAL in the file header; shard 2's tests will surface as regressions on this run and will be merged in once shard 2 is collected.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ations

GlutenDeltaJobStatsTracker builds a SortAggregateExec -> ProjectExec plan for the per-file statistics aggregation, runs Gluten's HeuristicTransform, then unconditionally casts the result to a WholeStageTransformer. When the stats aggregation cannot be offloaded to Velox -- e.g. min/max over TIMESTAMP_NTZ, as exercised by Delta's DataSkippingDeltaV1Suite "data skipping on TIMESTAMP_NTZ near Long.MaxValue" -- the projection stays a vanilla ProjectExec and the cast throws java.lang.ClassCastException: ProjectExec cannot be cast to WholeStageTransformer in the per-task tracker constructor, failing the write.

Decide on the driver whether the aggregation actually offloads: add canOffloadStats(), which dry-runs the same transform pipeline once and checks whether it collapses into a WholeStageTransformer. If it does not, route the DeltaJobStatisticsTracker to the existing GlutenDeltaJobStatsFallbackTracker (columnar-to-row + the original Delta tracker, which produces correct stats for any type) instead of the native tracker. Evaluating this on the driver also avoids the per-task constructor allocating a single-thread executor and a NativePlanEvaluator before the cast.

Applied to both the Delta 3.x (src-delta33) and Delta 4.x (src-delta40) copies.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…e OOM

Some shards were failing with `java.lang.OutOfMemoryError: Java heap space` in DV+CDC merge suites (a large RoaringBitmapArray.extendBitmaps allocation). At -Xmx6G the forked test JVM OOMed; the OOM then corrupted Velox memory-manager state and aborted the JVM via `terminate` (SIGABRT) mid-shard, so every remaining suite in that shard was reported aborted -- a false "regression" cascade in the known-failures gate.

Raise the forked test JVM heap from 6G to 8G (the safe ceiling on the 16 GB ubuntu-22.04 runner: 8G heap + ~2G Velox off-heap + ~1G JVM overhead leaves headroom under 16G). Heap-dump-on-OOM stays enabled so a recurrence -- which would indicate an allocation >8G, i.e. a pathological test to identify and exclude -- can be analyzed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
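The memory budget quoted in this commit can be checked with simple arithmetic. The sketch assumes that off-heap and JVM overhead scale per forked JVM, which the commit message does not state; `budget_gb` is a hypothetical helper and the figures are the approximate ones given above.

```python
RUNNER_GB = 16  # memory of the GitHub-hosted runner, per the commit message


def budget_gb(parallelism, heap_gb=8, off_heap_gb=2, jvm_overhead_gb=1):
    """Approximate memory needed by `parallelism` forked test JVMs."""
    return parallelism * (heap_gb + off_heap_gb + jvm_overhead_gb)


for parallelism in (1, 2):
    need = budget_gb(parallelism)
    verdict = "fits" if need < RUNNER_GB else "exceeds runner"
    print(parallelism, need, verdict)
# prints:
# 1 11 fits
# 2 22 exceeds runner
```

Under these assumptions a single fork leaves about 5 GB for everything else on the runner, while two forks would not fit, which is consistent with keeping the test parallelism at 1.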
Shard 2 (and intermittently other shards) hangs indefinitely after a suite's last test -- silent until the 350-min job timeout with zero diagnostics. ScalaTest's failAfter only wraps individual test BODIES, so a wedge in suite teardown/afterAll, or in a non-interruptible native Velox/JNI call that ignores Thread.interrupt(), has no timeout.

Add a background watchdog to the Delta test step: if the test output stays silent for 15 min, it locates the sbt.ForkMain test JVM (scanning /proc, and reading sbt's @argfile since recent sbt keeps the main class out of the cmdline) and captures up to three jstack thread dumps -- to the job log (so they survive a timeout cancellation) and to a per-shard artifact. This makes the chronic hang diagnosable instead of an opaque timeout.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
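The silence-triggered watchdog idea can be sketched independently of the workflow's shell implementation. Everything below is hypothetical: `start_watchdog`, its parameters, and the choice to detect silence through the log file's modification time are assumptions for illustration; the real step would invoke `jstack` where this sketch calls `on_silence`.

```python
import os
import threading
import time


def start_watchdog(log_path, silence_seconds, on_silence, poll_seconds=1.0, max_dumps=3):
    """Call on_silence(quiet_for) when log_path has been unmodified for silence_seconds.

    Runs in a daemon thread and fires at most max_dumps times.
    Returns a threading.Event; set it to stop the watchdog.
    """
    stop = threading.Event()

    def loop():
        dumps = 0
        while not stop.is_set() and dumps < max_dumps:
            try:
                quiet_for = time.time() - os.path.getmtime(log_path)
                if quiet_for >= silence_seconds:
                    on_silence(quiet_for)
                    dumps += 1
            except Exception as exc:
                # A failed probe is logged and retried; it never kills the watchdog.
                print("watchdog probe failed:", exc)
            stop.wait(poll_seconds)

    threading.Thread(target=loop, daemon=True).start()
    return stop
```

For the 15-minute threshold described above, a caller would pass `silence_seconds=900` and a callback that captures the thread dumps; the cap of three mirrors the "up to three" dumps in the commit message.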
Run #12 shard 2 hung ~63 min in DeletionVectorsSuite with ZERO thread dumps captured. Root cause: find_fork() probed only /proc cmdline/@argfile for sbt.ForkMain; recent sbt keeps the main class out of both, so it matched nothing and the watchdog silently did nothing.

- Locate the fork via `jps -l` (reads the main class from hsperfdata, robust to the @argfile launch) with the /proc scan as fallback.
- Safety net: if the fork still can't be pinpointed, jstack EVERY JVM so a hang is never left undiagnosed.
- Wrap jstack in `timeout` and log failures, and dump per-pid files for the artifact.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
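The "pinpoint the fork, otherwise dump everything" selection can be sketched as a pure function over `jps -l` output, which prints one `<pid> <main class or jar>` pair per line. `pick_dump_targets` is a hypothetical name and the workflow itself does this in shell; the sample output in the usage note is invented.

```python
def pick_dump_targets(jps_output, main_class="sbt.ForkMain"):
    """Choose which JVM pids to thread-dump from `jps -l` style output.

    Returns the pids whose main class matches; if none match, returns every
    JVM pid found, so a hang is never left without a dump.
    """
    all_pids, matches = [], []
    for line in jps_output.splitlines():
        parts = line.split(None, 1)
        if not parts or not parts[0].isdigit():
            continue  # ignore blank or malformed lines
        pid = int(parts[0])
        name = parts[1].strip() if len(parts) > 1 else ""
        if name.endswith("Jps"):
            continue  # skip the jps process itself
        all_pids.append(pid)
        if name == main_class:
            matches.append(pid)
    return matches or all_pids
```

Given the lines `101 sbt-launch.jar`, `202 sbt.ForkMain` and `303 jdk.jcmd/sun.tools.jps.Jps`, the function returns `[202]`; if the `sbt.ForkMain` line were missing it would return `[101]`.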
…e aggregations"

TEMPORARY -- validation only. Per the pipeline philosophy (errors are expected, only hangs/crashes that block CI progress matter), check whether the ClassCast fix is still needed after the rebase: apache/main now has apache#12229 (basic TIMESTAMP_NTZ support), which may make the min/max-over-TIMESTAMP_NTZ stats aggregation offloadable -- so the ProjectExec->WholeStageTransformer cast no longer throws, and the leaked single-thread executor that could wedge teardown no longer allocates.

Push without the fix and watch CI: if it progresses (no hang) the fix is redundant; if shard(s) hang, restore it.

This reverts commit 1dde2b3.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Validation result (run #13, ClassCast fix removed): shard 2 HUNG at the ClassCastException (ProjectExec cannot be cast to WholeStageTransformer) in DataSkipping TIMESTAMP_NTZ -- it never reached later suites. Per the pipeline criterion (errors ok, hangs are not), removing the fix blocks CI, so the fix is required after all (apache#12229 TIMESTAMP_NTZ support did NOT make the stats aggregation offloadable -- the ClassCastException still occurs without the fix).

This restores commit 1dde2b3 (reverts the temporary validation revert 55de9e0).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eartbeat

The watchdog captured ZERO dumps in runs #12 and #13 despite shard 2 hanging for 30-60 min. Root cause: the step shell runs with `bash -eo pipefail`, inherited by the watchdog subshell. When fork detection returned non-zero (jps not listing sbt.ForkMain, /proc miss), errexit SILENTLY killed the watchdog before it dumped.

- `set +e +o pipefail` inside the subshell -- a diagnostic must never abort on a failed probe. (Verified locally: the subshell now survives a no-match probe.)
- Dump via `kill -QUIT` so the JVM prints its thread dump to its own stderr, which sbt relays into the job log through the SAME stream as test output (a separately spawned jstack child's stdout can be buffered/lost). Verified locally that SIGQUIT yields a full thread dump. jstack still written to a file for the artifact.
- Dump EVERY JVM when the fork can't be pinpointed (safety net).
- Startup "armed" line + ~5-min heartbeat so the watchdog is self-verifying -- you can see it is alive and how long output has been silent, before any hang.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
What changes are proposed in this pull request?
Fix #9296
I created this PR to start the discussion, so we can get an idea of how it would work, whether it is worth doing, etc.
Tests are currently failing, which can help uncover bugs.
How was this patch tested?
CI
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Copilot