[WIP] HIVE-29668: Add -rebuildIndexes utility to reconstruct backend Metastore indexes#6545
soumyakanti3578 wants to merge 3 commits into
| @Override | ||
| public void rebuildIndex(IndexInfo index) throws HiveMetaException { | ||
| PgDdl ddl = ddlMap.get(index.indexName()); | ||
| executeRebuild(index, ddl.dropDdl(), ddl.createDdl()); |
I think the drop + create DDL should be atomic.
For constraint-backed indexes (PKs and UNIQUE constraints), the Postgres path runs two separate DDL statements: DROP CONSTRAINT followed by ADD CONSTRAINT. If the connection has autocommit enabled and the second statement fails, the constraint stays dropped and there is nothing to roll back. It would be safer to run these two steps inside an explicit BEGIN/COMMIT block, or with autocommit=false.
Thanks, I have fixed this in the latest commit.
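For readers following this thread, here is a minimal sketch of the suggestion above, not the code from the PR. The class and method names (`AtomicRebuildSketch`, `rebuildAtomically`) are hypothetical, and the sketch assumes a backend with transactional DDL such as Postgres. To my knowledge MySQL and Oracle commit implicitly on DDL, so this pattern alone would not make the pair atomic there.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper, not part of the PR: runs a drop + create pair in one transaction.
public final class AtomicRebuildSketch {
  private AtomicRebuildSketch() {}

  /**
   * Executes dropDdl and then createDdl inside a single transaction.
   * If either statement fails, the transaction is rolled back, so on a
   * backend with transactional DDL the original constraint survives.
   */
  public static void rebuildAtomically(Connection conn, String dropDdl, String createDdl)
      throws SQLException {
    boolean previousAutoCommit = conn.getAutoCommit();
    conn.setAutoCommit(false); // open an explicit transaction
    try (Statement stmt = conn.createStatement()) {
      stmt.execute(dropDdl);
      stmt.execute(createDdl);
      conn.commit();
    } catch (SQLException | RuntimeException e) {
      try {
        conn.rollback(); // undo the drop if the create (or the commit) failed
      } catch (SQLException rollbackFailure) {
        e.addSuppressed(rollbackFailure);
      }
      throw e;
    } finally {
      conn.setAutoCommit(previousAutoCommit); // restore the caller's setting
    }
  }
}
```

The caller's autocommit setting is restored in `finally` so the helper does not change the behaviour of later statements on the same connection.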
| .withDescription("Create table for Hive warehouse/compute logs") | ||
| .create("createLogsTable"); | ||
| Option rebuildIndexesOpt = new Option("rebuildIndexes", | ||
| "Detect and rebuild corrupt indexes in the metastore backend DB (Postgres only)."); |
I think we support all DBs now, so we can remove "Postgres only"; otherwise it can be confusing.
Yes, I somehow missed this after my Postgres-only POC. Thanks for noticing!
| } | ||
| @Override | ||
| public @NotNull String toString() { |
Do we need the @NotNull annotation here? I think the record will never return null.
Removed it.
| public static final String DB_MSSQL = "mssql"; | ||
| public static final String DB_MYSQL = "mysql"; | ||
| - public static final String DB_POSTGRACE = "postgres"; | ||
| + public static final String DB_POSTGRES = "postgres"; |
Kudos for this rename.
Hi, firstly, I have to say I love seeing what may be the first proposal that accepts that different databases have different needs and considers an individual implementation for each database engine that Hive supports. I have two high-level questions about the solution.

Firstly, the HMS connection is just a connection string. There is no guarantee that the user uses that database only for Hive. Correct me if I'm wrong, but I think this change can potentially have a huge impact: especially for large databases, rebuilding all the indexes takes time. Honestly, it can take a lot of time. And if even one customer shares the database with any other software, Hive will impact that software's behaviour as well.

Secondly, I have little knowledge of the other databases, but in MSSQL, for example, a nonclustered index points to the clustered index (the exception is when the table itself is a heap). For performance reasons, I would consider ordering the rebuilds: clustered indexes first, nonclustered indexes second.

My +1 question is about execution time. I assume rebuilding indexes can be a long-running process. I haven't checked that part of the code yet, so please excuse me if these questions are trivial: what kind of user interaction do we have? Is there a progress bar to show the progress? What if the customer stops the process in the middle? I don't know whether all the supported databases support rebuilding indexes. Do any of them actually require dropping the existing index? If yes, what happens if the process stops after dropping the old one? Will the customer get any feedback about a missing index?

Lastly, another performance-related question that came to mind. I'm not a DBA so I don't know the answer, just curious: assume the user has multiple instances, such as a primary that accepts writes and multiple replicas.
As far as I know, in that kind of setup the changes are synchronized through the transaction log.
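To make the ordering suggestion above concrete, here is a small sketch under stated assumptions. `IndexRef` is a hypothetical descriptor invented for this example (the PR's own `IndexInfo` may carry different fields), and the `clustered` flag is assumed to be known from the backend's catalog.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the "clustered first, nonclustered second" ordering.
public final class RebuildOrderSketch {
  private RebuildOrderSketch() {}

  /** Made-up descriptor for illustration only; not the PR's IndexInfo. */
  public record IndexRef(String table, String name, boolean clustered) {}

  /** Returns a new list with clustered indexes first, otherwise keeping discovery order. */
  public static List<IndexRef> clusteredFirst(List<IndexRef> indexes) {
    List<IndexRef> ordered = new ArrayList<>(indexes);
    // The key is false for clustered indexes, and false sorts before true.
    // List.sort is stable, so indexes of the same kind keep their relative order.
    ordered.sort(Comparator.comparing((IndexRef i) -> !i.clustered()));
    return ordered;
  }
}
```

Because the sort is stable, a rebuild loop that walks the returned list still processes each kind of index in the order it was detected, which keeps log output predictable.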



What changes were proposed in this pull request?
Support -rebuildIndexes through schematool for Postgres, Oracle, MySQL/MariaDB, and MSSQL.
Why are the changes needed?
Gives users the ability to drop and recreate indexes easily through schematool.
Does this PR introduce any user-facing change?
Yes, it adds a new option, -rebuildIndexes, to schematool.
How was this patch tested?