[WIP] HIVE-29668: Add -rebuildIndexes utility to reconstruct backend Metastore indexes by soumyakanti3578 · Pull Request #6545 · apache/hive

soumyakanti3578 · 2026-06-15T22:22:03Z

What changes were proposed in this pull request?

Support -rebuildIndexes through schematool for Postgres, Oracle, MySql/MariaDB, and MSSQL

Why are the changes needed?

Gives users the ability to drop and recreate indexes easily through schematool.

Does this PR introduce any user-facing change?

Yes, adds a new option to schematool

How was this patch tested?

mvn test -pl standalone-metastore/metastore-server -Dtest.groups="" -Dtest="org.apache.hadoop.hive.metastore.tools.schematool.TestSchemaToolTaskRebuildIndexes,org.apache.hadoop.hive.metastore.tools.schematool.TestPostgresIndexRebuilder,org.apache.hadoop.hive.metastore.tools.schematool.TestMySQLIndexRebuilder,org.apache.hadoop.hive.metastore.tools.schematool.TestOracleIndexRebuilder,org.apache.hadoop.hive.metastore.tools.schematool.TestMSSQLIndexRebuilder"

…ore indexes

maCloudera · 2026-06-17T11:35:21Z

+  @Override
+  public void rebuildIndex(IndexInfo index) throws HiveMetaException {
+    PgDdl ddl = ddlMap.get(index.indexName());
+    executeRebuild(index, ddl.dropDdl(), ddl.createDdl());


I think, drop+create ddl should be atomic.
For constraint-backed indexes (PKs and UNIQUE constraints), the Postgres path runs two separate DDL statements: DROP CONSTRAINT followed by ADD CONSTRAINT. If the connection has autocommit enabled and the second statement fails, the constraint is permanently dropped with no rollback. It would be safer to ensure these two steps run inside an explicit BEGIN/COMMIT block, or autocommit=false.

Thanks, I have fixed this in the latest commit.

maCloudera · 2026-06-17T11:35:31Z

      .withDescription("Create table for Hive warehouse/compute logs")
      .create("createLogsTable");
+    Option rebuildIndexesOpt = new Option("rebuildIndexes",
+        "Detect and rebuild corrupt indexes in the metastore backend DB (Postgres only).");


I think we are supporting all DB's, we can remove this "Postgres only" otherwise it can confuse.

Yes, somehow missed this after my POC with postgres only. Thanks for noticing!

maCloudera · 2026-06-17T11:38:41Z

+  }
+
+  @Override
+  public @NotNull String toString() {


Do we need @NotNull annotation here ? I think, record will never return null.

Removed it.

sonarqubecloud · 2026-06-17T21:53:02Z

Quality Gate passed

Issues
12 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
1.6% Duplication on New Code

See analysis details on SonarQube Cloud

InvisibleProgrammer · 2026-06-19T11:14:53Z

  public static final String DB_MSSQL = "mssql";
  public static final String DB_MYSQL = "mysql";
-  public static final String DB_POSTGRACE = "postgres";
+  public static final String DB_POSTGRES = "postgres";


Kudo for this rename.

InvisibleProgrammer · 2026-06-19T11:41:54Z

Hi, firstly, I have to say I love seeing maybe the first idea that accepts the fact that different databases have their different needs and considers individual implementations for each database engine that Hive supports.

I have two high-level questions in my mind about the solution:

Firstly, the HMS connection is just a connection string. There is no guarantee that the user uses that database only for Hive. Fix me if I'm wrong but I think that change can potentially have huge impact: especially for large databases, rebuilding all the indexes takes time. Honestly, it can take a lot of time. And if there is only one customer that shares the database with any kind of other software, Hive will impact their behaviour as well.
I know it is unlikely but as it is possible I bet there will be at least one user whose production will be impacted.
But on the other side, as I haven't seen such a setup so far, I would just suggesting to add a warning about the behaviour and also the PR title as well: it doesn't rebuild the Metastore indexes. It rebuilds all the indexes in the database that hosts Metastore objects.

Secondly, I have little knowledge about other databases. But for example, for MSSQL, a nonclustered index points to a clustered index (exception is when the table itself a heap). For performance reasons, I would consider having an order in executing the rebuilds: clustered indexes first, nonclustered indexes second.

And my +1 question is about execution time: I assume rebuilding indexes can be a long-running process. I haven't checked that part of the code so far so please excuse me if I ask trivial questions: what kind of user interaction we have? Is there a progress bar to show the progress? What if the customer stops the process in the middle? I don't know if all the supported databases are supporting rebuilding indexes. Do we have any of them that actually requires to drop the existing index? If yes, what happens if the process stops after dropping the old one? Will the customer have any kind of feedback about a missing index?

And lastly, an other performance related question that came into my mind. I'm not a DBA so I don't know the answer, just curious: assuming the user has multiple instances, like a primary that accepts writes and multiple replicas. According to my knowledge, in that kind of setup the changes are synchronized with transaction log.
I have a fear if we rebuild all the indexes in one step, it can potentially create a huge amount of transaction log. And if it happens, it can cause significant delay and/or network issues between the primary and the replicas.
Honestly, I have no answer for this question right now: as a usual practise, I would recommend to do such a thing with waits so that the replicas can catch up with the changes. But to be able to fine tune that process, it is required to know the given setup.

HIVE-29668: Add -rebuildIndexes utility to reconstruct backend Metast…

1c51206

…ore indexes

asf-ci-hive added tests pending tests unstable and removed tests pending labels Jun 15, 2026

maCloudera reviewed Jun 17, 2026

View reviewed changes

soumyakanti3578 added 2 commits June 17, 2026 13:47

Address initial review comments

65f3a2e

Address some sonar issues

1e1068d

asf-ci-hive added tests pending and removed tests unstable labels Jun 17, 2026

asf-ci-hive added tests unstable and removed tests pending labels Jun 17, 2026

InvisibleProgrammer reviewed Jun 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] HIVE-29668: Add -rebuildIndexes utility to reconstruct backend Metastore indexes#6545

[WIP] HIVE-29668: Add -rebuildIndexes utility to reconstruct backend Metastore indexes#6545
soumyakanti3578 wants to merge 3 commits into
apache:masterfrom
soumyakanti3578:HIVE-29668

soumyakanti3578 commented Jun 15, 2026

Uh oh!

maCloudera Jun 17, 2026

Uh oh!

soumyakanti3578 Jun 17, 2026

Uh oh!

maCloudera Jun 17, 2026

Uh oh!

soumyakanti3578 Jun 17, 2026

Uh oh!

maCloudera Jun 17, 2026

Uh oh!

soumyakanti3578 Jun 17, 2026

Uh oh!

sonarqubecloud Bot commented Jun 17, 2026

Uh oh!

InvisibleProgrammer Jun 19, 2026

Uh oh!

InvisibleProgrammer commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

soumyakanti3578 commented Jun 15, 2026

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

maCloudera Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

soumyakanti3578 Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

maCloudera Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

soumyakanti3578 Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

maCloudera Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

soumyakanti3578 Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

sonarqubecloud Bot commented Jun 17, 2026

Quality Gate passed

Uh oh!

InvisibleProgrammer Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

InvisibleProgrammer commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants