Fix ICE with associative arrays#24
Open
vindexbit wants to merge 3 commits into
Open
Conversation
1af60d5 to
a942e3d
Compare
72a117b to
6b3c935
Compare
fa96339 to
ccfbcf3
Compare
49c1e33 to
6627615
Compare
90ec293 to
354db3b
Compare
525618a to
9723a85
Compare
jpf91
pushed a commit
that referenced
this pull request
May 20, 2026
An anti-pattern found in compiled code when predicated tails were enabled for basic block SLP vectorization was triggered by byte-reversing patterns in source code, such as: uint8_t *dst; int size; dst[0] = size >> 24; dst[1] = size >> 16; dst[2] = size >> 8; dst[3] = size >> 0; which would previously have compiled to: rev w1, w1 str w1, [x0] but (with tail-predication) was vectorized as: mov z31.b, w1 ptrue p7.s, vl4 fmov s30, w1 sshr v29.2s, v30.2s, #8 insr z31.s, s29 sshr v30.2s, v30.2s, #16 insr z31.s, s30 fmov s30, w1 sshr v30.2s, v30.2s, #24 insr z31.s, s30 st1b {z31.s}, p7, [x0] One reason is that the SLP pass runs before the store-merging pass gets a chance to coalesce 4 stores into 1 and substitute a 32 bit bswap implementation. Even ignoring that, costing of the vectorized version (cost: 4) compared to the scalar version (also 4) was not realistic: _2 1 times vector_store costs 1 in body node 0x32ee6d0 1 times vec_construct costs 3 in prologue There were a couple of contributing issues: 1. the cost of mask construction for the vector_store (ptrue) was omitted for BB SLP, whereas the loop vectorizer explicitly charges for it. 2. the cost of vec_construct (elements / 2 + 1) did not incorporate any GPR-to-SIMD register transfer costs (mov, fmov). Since the supposed cost of the vectorised code only just reached parity with the scalar code, addressing either of the above issues would be sufficient to prevent vectorisation (in this specific case). It is also less risky than changing the order of passes, and less hacky than teaching the SLP pass about store-merging. This commit addresses only the second issue, by adding code in vector_costs::add_stmt_cost to charge scalar_to_vec_cost for each element of an external def of kind vec_construct (with specific exceptions noted below). This cost is added to the base cost already charged by aarch64_builtin_vectorization_cost for a vec_construct (which is assumed to cover the cost of the INSR or equivalent instructions). This is justifiable because SIMD-to-SIMD insertions into a vector register generally have lower latency and higher throughput than GPR-to-SIMD insertions. The basic structure of the code was copied from commit 90d693b which modified the x86 backend in a similar way, but adapted to use a hash_set<tree> instead of TREE_VISITED to guard against charging twice or more for the same scalar op feeding an external def. This commit assumes that constructing a vector from memory is no more costly than the equivalent set of scalar loads (or at least that any difference is incorporated in the cost returned by aarch64_builtin_vectorization_cost for vec_construct). It also assumes that constructing a vector from scalar values of floating point type, from a BIT_FIELD_REF/lastb that extracts from a vector register, or from the result of a call to an inbuilt reduction function, does not incur GPR-to-SIMD register transfer costs because such scalars are typically already in FP/SIMD registers on AArch64. gcc/ChangeLog: * config/aarch64/aarch64.cc (aarch64_call_scalar_result_in_simd_reg_p): New function to determine probabilistically whether a gcall produces a scalar result in a SIMD/FP register. (aarch64_scalar_op_to_vec_p): New function to determine whether or not to add scalar_to_vec_cost per scalar operand from which a vector is to be constructed. (aarch64_external_adjust_stmt_cost): New function to adjust the cost of an SLP tree node for a vec_construct that is fed by values defined outside the vectorized region. (aarch64_vector_costs::add_stmt_cost): Call the new aarch64_external_adjust_stmt_cost function if we have an SLP node and a vector type. gcc/testsuite/ChangeLog: * gcc.target/aarch64/sve/vec_construct_1.c: New test. * gcc.target/aarch64/sve/vec_construct_2.c: New test. * gcc.target/aarch64/sve/vec_construct_3.c: New test. * gcc.target/aarch64/sve/vec_construct_4.c: New test. * gcc.target/aarch64/sve/vec_construct_5.c: New test. * gcc.target/aarch64/vec-construct-1.c: New test. * gcc.target/aarch64/vec-construct-10.c: New test. * gcc.target/aarch64/vec-construct-11.c: New test. * gcc.target/aarch64/vec-construct-12.c: New test. * gcc.target/aarch64/vec-construct-2.c: New test. * gcc.target/aarch64/vec-construct-3.c: New test. * gcc.target/aarch64/vec-construct-4.c: New test. * gcc.target/aarch64/vec-construct-5.c: New test. * gcc.target/aarch64/vec-construct-6.c: New test. * gcc.target/aarch64/vec-construct-7.c: New test. * gcc.target/aarch64/vec-construct-8.c: New test. * gcc.target/aarch64/vec-construct-9.c: New test.
b6596a4 to
854d79d
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This simple example could not compile with the
-freleaseflag:Error:
The problem was in checking the borders.