Skip to content

[WIP] HIVE-29668: Add -rebuildIndexes utility to reconstruct backend Metastore indexes#6545

Draft
soumyakanti3578 wants to merge 3 commits into
apache:masterfrom
soumyakanti3578:HIVE-29668
Draft

[WIP] HIVE-29668: Add -rebuildIndexes utility to reconstruct backend Metastore indexes#6545
soumyakanti3578 wants to merge 3 commits into
apache:masterfrom
soumyakanti3578:HIVE-29668

Conversation

@soumyakanti3578

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Support -rebuildIndexes through schematool for Postgres, Oracle, MySql/MariaDB, and MSSQL

Why are the changes needed?

Gives users the ability to drop and recreate indexes easily through schematool.

Does this PR introduce any user-facing change?

Yes, adds a new option to schematool

How was this patch tested?

mvn test -pl standalone-metastore/metastore-server -Dtest.groups="" -Dtest="org.apache.hadoop.hive.metastore.tools.schematool.TestSchemaToolTaskRebuildIndexes,org.apache.hadoop.hive.metastore.tools.schematool.TestPostgresIndexRebuilder,org.apache.hadoop.hive.metastore.tools.schematool.TestMySQLIndexRebuilder,org.apache.hadoop.hive.metastore.tools.schematool.TestOracleIndexRebuilder,org.apache.hadoop.hive.metastore.tools.schematool.TestMSSQLIndexRebuilder"

@Override
public void rebuildIndex(IndexInfo index) throws HiveMetaException {
PgDdl ddl = ddlMap.get(index.indexName());
executeRebuild(index, ddl.dropDdl(), ddl.createDdl());

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, drop+create ddl should be atomic.
For constraint-backed indexes (PKs and UNIQUE constraints), the Postgres path runs two separate DDL statements: DROP CONSTRAINT followed by ADD CONSTRAINT. If the connection has autocommit enabled and the second statement fails, the constraint is permanently dropped with no rollback. It would be safer to ensure these two steps run inside an explicit BEGIN/COMMIT block, or autocommit=false.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I have fixed this in the latest commit.

.withDescription("Create table for Hive warehouse/compute logs")
.create("createLogsTable");
Option rebuildIndexesOpt = new Option("rebuildIndexes",
"Detect and rebuild corrupt indexes in the metastore backend DB (Postgres only).");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are supporting all DB's, we can remove this "Postgres only" otherwise it can confuse.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, somehow missed this after my POC with postgres only. Thanks for noticing!

}

@Override
public @NotNull String toString() {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need @NotNull annotation here ? I think, record will never return null.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed it.

@sonarqubecloud

Copy link
Copy Markdown

public static final String DB_MSSQL = "mssql";
public static final String DB_MYSQL = "mysql";
public static final String DB_POSTGRACE = "postgres";
public static final String DB_POSTGRES = "postgres";

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kudo for this rename.

@InvisibleProgrammer

Copy link
Copy Markdown
Contributor

Hi, firstly, I have to say I love seeing maybe the first idea that accepts the fact that different databases have their different needs and considers individual implementations for each database engine that Hive supports.

I have two high-level questions in my mind about the solution:

Firstly, the HMS connection is just a connection string. There is no guarantee that the user uses that database only for Hive. Fix me if I'm wrong but I think that change can potentially have huge impact: especially for large databases, rebuilding all the indexes takes time. Honestly, it can take a lot of time. And if there is only one customer that shares the database with any kind of other software, Hive will impact their behaviour as well.
I know it is unlikely but as it is possible I bet there will be at least one user whose production will be impacted.
But on the other side, as I haven't seen such a setup so far, I would just suggesting to add a warning about the behaviour and also the PR title as well: it doesn't rebuild the Metastore indexes. It rebuilds all the indexes in the database that hosts Metastore objects.

Secondly, I have little knowledge about other databases. But for example, for MSSQL, a nonclustered index points to a clustered index (exception is when the table itself a heap). For performance reasons, I would consider having an order in executing the rebuilds: clustered indexes first, nonclustered indexes second.

And my +1 question is about execution time: I assume rebuilding indexes can be a long-running process. I haven't checked that part of the code so far so please excuse me if I ask trivial questions: what kind of user interaction we have? Is there a progress bar to show the progress? What if the customer stops the process in the middle? I don't know if all the supported databases are supporting rebuilding indexes. Do we have any of them that actually requires to drop the existing index? If yes, what happens if the process stops after dropping the old one? Will the customer have any kind of feedback about a missing index?

And lastly, an other performance related question that came into my mind. I'm not a DBA so I don't know the answer, just curious: assuming the user has multiple instances, like a primary that accepts writes and multiple replicas. According to my knowledge, in that kind of setup the changes are synchronized with transaction log.
I have a fear if we rebuild all the indexes in one step, it can potentially create a huge amount of transaction log. And if it happens, it can cause significant delay and/or network issues between the primary and the replicas.
Honestly, I have no answer for this question right now: as a usual practise, I would recommend to do such a thing with waits so that the replicas can catch up with the changes. But to be able to fine tune that process, it is required to know the given setup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants